Mode
Dark | Light
Email Me , μBlog me
or follow the feed  

New Geometry Validation Method

Introduction

While I was still playing with the osmdata R package to decide how to make a situation map of Karabel (for background information, see my last post about mapping a random run), I met some old friends that I have not seen for a long time: geometry validity errors. This note explore the new method implemented in R to validate (simple feature) polygons.

R, rgeos and geos

My aim was to download areas covered by forest using the following lines:

# This is not run. In the meantime, the geometry is correct
library(sf)
library(osmdata)

karabel_bb  <- st_bbox(c(xmin = 26.17, xmax = 28.0,
                         ymin = 37.9,  ymax = 38.8),
                         crs = st_crs(4326))

query_box  <- opq(bbox = karabel_bb)

q <- query_box %>%
     add_osm_feature (key="landuse", value="forest") %>%
     osmdata_sf ()

Then I wanted to plot the polygons with tmap, but I got an error message Error: Shape contains invalid polygons. Please fix it. Three years ago, I used the library rgeos to find the problem. However, the loading of the package rgeos adverts that it will be deprecated by printing this warning message .onLoad:

[…] Please note that rgeos will be retired by the end of 2023,
plan transition to sf functions using GEOS at your earliest
convenience […] 

The retirement message was added Sep. 7 2021 (SVN revision 673). At the first popping of the message, I was not so pleased because some of my previous scripts rely on rgeos, and they will brake sooner than I would like. However, I completely understand the change, and instead of complaining, it is a good occasion to thank those behind the package for their tremendous and reliable work. Thank you Roger Bivand. I also do appreciate to have more than two years of time to change my scripts, this is inspiring community stewardship and the R community would benefit to take backward compatibility so seriously.

It is clear that the package sf will eventually replace the older sp and rgeos packages. Therefore, I dug into the documentation of the sf package, to look what is the new function name and what are the parameters. Using the function st_is_valid(reason=TRUE) pointed to the error

# Edge 183 has duplicate vertex with edge 191

However, by querying the word ?valid, I spotted the function st_make_valid() as it is documented together with the function st_is_valid(). It impressed me, because when I used it for the “malformed” OSM geometries, it solved most of my problems. However, some polygons could not be fixed with this routine. The st_make_valid() documentation tells us:

st_make_valid uses the lwgeom_makevalid method also used by the PostGIS command ST_makevalid if the GEOS version linked to is smaller than 3.8.0, and otherwise the version shipped in GEOS

This reflects a change that happened at the beginning of 2019 when the GEOS library adopted a PostGIS command ST_makevalid. By looking at GEOS, I noticed that a recent release of GEOS (Version 3.10.0 from Oct. 21, 2021) stated that this version adds a new parameter called “method” which enable to choose between two methods:

  • method=linework: Original method, combines all rings into a set of noded lines and then extracts valid polygons from that linework.
  • method=structure: first makes all rings valid then merges shells and subtracts holes from shells to generate valid result. Assumes that holes and shells are correctly categorized.

The function has also an option “KeepCollapsed” to “drop any component that has collapsed into a lower dimensionality. For example, a ring collapsing to a line, or a line collapsing to a point.”. Paul Ramsey, one of the authors of the new “structure” method with Martin Davis presents it in this post (July 19, 2021) on Crunchy Data. Originally, this algorithm was discussed in the “RFC: Fixing invalid geometry #652 before implementation in the JTS Topology Suite. When reading Paul Ramsey’s blog post, I particularly like the conclusion, stating that it is worth to implement a “trigger” function to check the validity and avoid algorithm failures later. It is similar to my conclusion when this function was not available.1

This triggered2 piqued my curiosity to check how it works in R.

Set-up

While we could use rgeos, I was curious to test the new package geos, even if I prefer the particularly good documentation of rgeos and its syntax.3 Moreover rgeos provides an incredibly series of examples, much better that what will follow. So please have a look at example(rgeos::gMakeValid), it is worth to read the doc. The package geos (and its companion libgeos) are new and still at the experimental stage, so it may brakes in the futur…

Let start with creating a polygon, which has a geometry that is not valid. I am reusing here the example Polygon - Inverted shell, line touch, exterior, plotted bellow

# Load the packages ("`stopifnot` installed"). 
library(sf)
library(geos)
test_polygon <- paste0(
              "POLYGON((",
                 "10 -90, 70 -90, 30 -90, 30 -60, ",
                 "70 -60, 70 -90, 90 -90, 90 -10, ",
                 "10 -10, 10 -90", "))"
)

plot of chunk plot_non_valid

The numbering shows the order of the nodes of the polygon. It clearly indicates an invalid geometry. We would expect 9 edges (to define a closed shape, such as a polygon, the last point must close onto the first point). However, we have here 10 points. The point 6 overlaps the point 2. This breaks the simple feature validity rule. If we check the validity with geos_is_valid(), we have a FALSE answer and we can look at the reason with geos_is_valid_detail()

geos_is_valid_detail(test_polygon)
##   is_valid            reason         location
## 1    FALSE Self-intersection <POINT (70 -90)>

This geometry does not define a valid polygon per definition. The function geos_is_valid_detail() indicates the broken validity rule (i.e., self-intersection) and gives the coordinates <POINT (70 -90)>. However, with the same set of points, it is possible to define a valid linestring:

control_line <- geos_node(test_polygon)
geos_is_valid(control_line)
## [1] TRUE

It looks almost identical, but it better illustrates the problem in the topology of the object.

plot of chunk plot_valid_line

Even if these two figures look similar, obviously the (invalid) polygon and the linestring behave differently. In both plots, the parameter colour is set. In the first case, the inside of the polygon is colourised, in the second, it is the line and you cannot colourise the “inside” (that does not exist).

A bit of make-up

Let us now play with this invalid polygon. First, to test and compare the new geos package, I use the same “linework” method with the package geos and the package sf.

Method Linework

Here the (verbose) code, first for geos and then for sf:

# geos
validated_polygon_1 <- geos_make_valid(test_polygon)

# sf
sf_validated_polygon_1 <- st_make_valid(sf_test_polygon)

And we plot the results side by side

plot of chunk validate_1_plot

As expected, both functions return the same result, but what did happen? There is still a line to be seen, but this time, the colour changed. Indeed, if we look at the geometry of the validated polygon with the function call geos_unnest(), we see that the object is a geometry collection containing a polygon and a line: <POLYGON [10 -90…90 -10]>, <LINESTRING (30 -90, 30 -60, 70 -60, 70 -90)>. After separating the geometrycollection, it is possible to plot these two elements side by side:

plot of chunk cast_validate_1_plot

The outer part is assembled in a rectangular well-defined polygon and the nodes are reorganised around it in clockwise direction.4 The “inner part” is a new geometry: a linestring. In total we have 7+4 points: 11! But that is fair, it follows the definition we saw at the beginning for the method=linework: combines all rings and then extracts valid polygons from that linework.

What happens if we use the second method?

Method Structure

The package geos follows closely the classes and methods from the C library. It first checks the validity of the declared parameters with geos_make_valid_params(), to be used in the function call geos_make_valid().

val_method <- geos_make_valid_params(
                              keep_collapsed = TRUE,
                              "make_valid_structure")
val_polygon_2 <- geos_make_valid(
                              test_polygon,
                              make_valid_params = val_method)
# Ok this is a bit long to write, and no shortcut via the
# packages sf or lwgeom. It is yet not implemented
# as far as I am aware.  I hope that we soon have the
# possibility to use instead of the previous call:

# Ceci n'est pas une pipe. Do not run/smoke
st_make_valid(polygon, method = "structure",
              keep_collapsed = FALSE)

But how does the results look like?

plot of chunk val_method_2_plot

Waow… not bad! 9 points! If I was the person who did draw the invalid polygon, that was probably what I had I mind. And it is neat, we do not have a complex geometry collection but only one simple polygon, defined as:
<POLYGON [10 -90…90 -10]>

This sparks my curiosity and the need to check, what would have happened if I had used this function instead of correcting manually three years ago.

Looking back

First, we must read the polygon that I corrected:5

plot of chunk impoart_building_79

We check the validity reason with geos_is_valid_detail(geb79)

geos_is_valid_detail(geb79)
##   is_valid            reason                       location
## 1    FALSE Self-intersection <POINT (3051.9693 3987.16362)>

This is the same situation as three years ago. The wall of the Middle Bronze Age building 79 is invalid. Let us test the two methods to make it valid and compare the results:

# Linework
val_method_1 <- geos_make_valid_params(
                          keep_collapsed = TRUE,
                          "make_valid_linework")
geb79_val1 <- geos_make_valid(geb79,
                          make_valid_params = val_method_1)

# Structure
val_method_2 <- geos_make_valid_params(
                          keep_collapsed = TRUE,
                          "make_valid_structure")
geb79_val2 <- geos_make_valid(geb79,
                          make_valid_params = val_method_2)

The top left is the invalid polygon, right is the polygon after application of the “linework” method and at the bottom the “structure” method. After the correction (right and bottom) it looks slightly different from the original (left).

plot of chunk compre

With a closer look, it is easier to spot the differences. Originally, there were no point at the intersection, as it was a “self-intersection” (left). The whole geometry was one polygon. The two other methods present two polygons, the intersection being now a common point. Therefore, it is possible to have distinct colours. The two methods give the same result and define two points at the intersection to separate two valid geometries.

plot of chunk unnamed-chunk-2

I was less impressed by the results, because for me, I would have preferred if it would have been possible to swap the points 1/15 and 14 in the original polygon (top left). But I prefer these results against aggressive changes. It gives me more confidence to use this function and it does what it was designed for: “It is desirable to have a way to convert an invalid geometry into a valid one that preserves the character of the input (to some extent)”.

However, this is not a reason to give up. While I was reading about OpenStreetMap data to understand how to manipulate them to make a map of Karabel (what brought me back to this problem), I came across a paper discussing how the problem exist in OpenStreetMap and how it was dealt there.

Dealing with polygons at OpenStreetMap

I am starting to read the Overpass API User’s Manual to see what is going on under the hood when I am using the package osmdata. What is the model behind OSM data? Why do I have “invalid Polygon” and how the OSM community deals with it? I was surprised to read that OSM does not have a polygon or multipolygon data type. Only nodes and ways (i.e., points and linestrings) have geometric definitions. Polygons are defined in relation to these two entities. Polygons are a sequence of reference to nodes and ways. OSMlab, an organization for OpenStreetMap related projects had a project called “fixing polygons in osm” to corrects the topology of polygons created out of “invalid relations” (and wrong metadata). The rationale was that the tool Osm2pgsql is part of many rendering toolchains and hence decides what appears on most OSM-based maps. It creates a PostgreSQL/PostGIS database. To deal with invalid polygons, “By default Osm2pgsql will not discard invalid polygons, it will try to fix them (using a buffer(0) GEOS operation), which mostly works.” This was the first method adopted.6 However, to solve the problem, eventually the polygons were manually edited and corrected (second method). In 2017 an impressive community effort was made to fix more than 100,000 bad (multi)polygons. The one I found (and corrected) were more recent. To me, this shows that you can make a geometry valid with an algorithm, but eventually, you will end up changing things manually if you want high quality data.

Takeaway

Validating geometry at the time of creation should be always enforced with a kind of linter (“a tool used to flag stylistic errors and suspicious constructs”, Wikipedia s.v. ‘linter’). This will avoid to meet these old friends geometry validity errors repeatedly. They are mind-numbing, such as this blog post!

Footnotes

  1. The first geos::MakeValid implementation I found was realeased 2019. 

  2. I do not want firing triggers: “The application of military metaphors is unnecessary because more positive alternatives are available” (Jing-Bao Nie, Adam Gilbertson, Malcolm de Roubaix, Ciara Staunton, Anton van Niekerk, Joseph D. Tucker & Stuart Rennie (2016) Healing Without Waging War: Beyond Military Metaphors in Medicine and HIV Cure Research, The American Journal of Bioethics, 16:10, 3-11, DOI: 10.1080/15265161.2016.1214305

  3. I was pleasantly surprised to see that rgeos is up-to-date. It exposes the parameters original (for the method) and keepCollapsed since April 2021 (SVN revision 655). Moreover, the implementation in rgeos is well documented. It also points to the apparition of this novel function: “not available before GEOS 3.8.0; from 3.10.0 two correction strategies” 

  4. I guess that direction is clockwise because we have negative Y coordinates. See this comment on #winding-order with further ref. 

  5. To be honest, it is my only post that is not reproducible (it is plain markdown) and I did not dig into my git history, but extract the corrected feature and made it invalid 

  6. I did not understand how it works because even if the documentation states that Osm2pgsql “never loads invalid geometries into your database”, they are plenty of examples as you can check with the tool OpenStreetMap inspector