wk version 0.6.0!

wk version 0.6.0!

It’s time for another release of the wk package! The wk package is a small R package that powers spatial data interchange in packages like geos, s2, and some exciting other packages still in development. It provides the things that other packages need to be really good at what they do and work seamlessly with other spatial packages: vector geometry classes, generics, and a stable C ABI that packages can use to interchange data without the overhead of R function calls. The 0.6.0 release adds a few new features but is focused on refactoring some of the internals to make them faster and use less memory.

The envelope handler

The concept of a 2-dimensional Cartesian bounding box is fundamental to many spatial formats. For example, shapefiles store a bounding box at the file level and at the feature level for some geometry types, and the TWKB, gpkg and spatialite variants of WKB can have a bounding box stored alongside the geometry. R-tree indexes can use this information to build a spatial index without looping over all the coordinates, and it’s completely natural that one might want this information for a variety of spatial formats. The wk_bbox() function has been able to do this at the vector level for several versions, but doing this for each feature required something like geos::geos_envelope_rct() which is requires the heavy libgeos dependency. The 0.6.0 release adds the wk_envelope_handler() and the wk_envelope() generic with optimized methods for circles, points, and rectangles.

nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))

wk_plot(wk_envelope(nc), add = T)

Mark geodesic geometry vectors

In the last two releases, wk vector objects carry and propagate a crs attribute. In general, wk tries to stay out of CRS/projections but has a few generics that users can use to make their code “just work” with out worrying about CRS details. For example, the crs attribute is inspected whenever two wk objects are combined to make sure they’re the same:

  wkt("POINT (1 2)", crs = "OGC:CRS84"),
  wkt("POINT (1 2)", crs = "EPSG:4326")
#> Error: CRS objects 'OGC:CRS84' and 'EPSG:4326' are not equal.

The CRS handling doesn’t do anything that requires interpreting the values because to do that it would need to know about CRS objects in other packages, so it passes the responsibility for that to the CRS objects themselves via the wk_crs_equal_generic() generic (implemented for sf CRS objects).

  wkt("POINT (1 2)", crs = 4326),
  wkt("POINT (1 2)", crs = sf::st_crs("EPSG:4326"))
#> <wk_wkt[2] with CRS=EPSG:4326>
#> [1] POINT (1 2) POINT (1 2)

The 0.6.0 release adds a geodesic attribute to support WKT and WKB as interchange formats for s2 geometry, or other geometry whose vertices are connected by a geodesic edge instead of a planar one. Engines such as BigQuery also output WKT and WKB in this form and it shouldn’t be sent to engines like GEOS that assume planar edges! There’s no way to flag this in the CRS (and there shouldn’t be, since the CRS is about coordinates and not edges), so it’s a separate attribute with a getter and a setter:

(geodesic_line <- wkt("LINESTRING (0 0, 1 1)", geodesic = TRUE))
#> <geodesic wk_wkt[1]>
#> [1] LINESTRING (0 0, 1 1)
#> [1] TRUE

The wk_proj_definition() generic

The 0.6.0 release also adds a wk_proj_definition(x, verbose = TRUE|FALSE) generic so that CRS objects can declare how they should be sent to a PROJ engine. It also formalizes, like sf does, the use of an integer as an EPSG code. This helps wk objects when they get sent elsewhere and makes it possible to transform them consistently without wk having to know anything about CRS objects themselves. The new generic is implemented for sf CRS objects, assumes integers are EPSG codes, and assumes strings are valid PROJ definitions:

#> [1] "OGC:CRS84"
#> [1] "EPSG:4326"
#> [1] "EPSG:32620"

Sometimes it’s necessary to get the most verbose representation of a CRS possible (e.g., when writing to disk). For this, use verbose = TRUE:

wk_crs_proj_definition(sf::st_crs("OGC:CRS84"), verbose = TRUE)
#> [1] "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],AXIS[\"Latitude\",NORTH]]"

This can’t be implemented for a CRS specified in the form “EPSG:CODE” without PROJ, but most libraries that can write spatial files have PROJ on hand to expand it when needed. From a programming perspective, it’s helpful to keep a CRS object as-is until it needs to be interpreted to avoid problems roundtripping between formats (as has been seen between proj4 strings and WKT2).

The wk_crs_longlat() CRS helper

Axis order is a point of friction in CRS transforms in modern PROJ, which changed a few releases ago from assuming that longitude was always first in a longitude, latitude CRS, to having the axis order as part of the CRS. This is a good thing, because before it just wasn’t possible to represent a coordinate system where latitude was actually the first coordinate! Unfortunately, literally all of modern GIS is built under the longitude first assumption.

The 0.6.0 release has a modest attempt to help change this. The ubiquitous use of “EPSG:4326” (which is latitude first, longitude second if authority compliant) makes it hard for change to happen, but there is a CRS with the proper longitude first, latitude second ordering: “OGC:CRS84”. If you don’t want to remember that, just do this:

wkt("POINT (0 1)", crs = wk_crs_longlat())
#> <wk_wkt[1] with CRS=OGC:CRS84>
#> [1] POINT (0 1)

This works on sf objects, too:

sf::st_sfc(sf::st_point(c(0, 1))) %>% 
#> Geometry set for 1 feature 
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 1 xmax: 0 ymax: 1
#> Geodetic CRS:  WGS 84
#> POINT (0 1)

…or this:

#> Coordinate Reference System:
#>   User input: OGC:CRS84 
#>   wkt:
#> GEOGCRS["WGS 84",
#>     DATUM["World Geodetic System 1984",
#>         ELLIPSOID["WGS 84",6378137,298.257223563,
#>             LENGTHUNIT["metre",1]],
#>         ID["EPSG",6326]],
#>     PRIMEM["Greenwich",0,
#>         ANGLEUNIT["degree",0.0174532925199433],
#>         ID["EPSG",8901]],
#>     CS[ellipsoidal,2],
#>         AXIS["longitude",east,
#>             ORDER[1],
#>             ANGLEUNIT["degree",0.0174532925199433,
#>                 ID["EPSG",9122]]],
#>         AXIS["latitude",north,
#>             ORDER[2],
#>             ANGLEUNIT["degree",0.0174532925199433,
#>                 ID["EPSG",9122]]]]

If you’re using NAD27 or NAD83, you can specify that too (the default is WGS84):

#> [1] "OGC:CRS27"

ALTREP vector support

Most of wk is written in C, so it can’t take advantage of cpp11’s iterators that make it easy to not materialize ALTREP vectors. ALTREP was introduced in R 3.5 to make code like this:

x <- 1:1e7
#> [1] 4 5 6 7 8

…allocate almost nothing. It does this by representing 1:1e7 as a sequence with two endpoints taking up almost no memory:

#> @118c213f8 13 INTSXP g0c0 [MARK,REF(65535)]  1 : 10000000 (compact)
#> 680 B

Whenever you do something to it with code that doesn’t know about ALTREP or can’t use ALTREP for some reason, it materializes into the full vector of numbers:

x[1] <- 7L
#> @10c900000 13 INTSXP g0c7 [REF(1)] (len=10000000, tl=0) 7,2,3,4,5,...
#> 40,000,048 B

Packages like arrow and readr use this mechanism to make it possible to work with bigger data because fewer copies are created in memory. Prior to the 0.6.0 release, wk materialized any ALTREP object that touched its C code. After 0.6.0, these objects are not materialized in any C or C++ code in the wk package.

points <- xy(1:100, 1:100)
#> @118534100 14 REALSXP g0c0 [REF(65535)]  1 : 100 (compact)

wk_handle(points, wk_void_handler())
#> @118534100 14 REALSXP g0c0 [REF(65535)]  1 : 100 (compact)

A real-live use-case for this is when using readr::read_csv(..., lazy = TRUE), which memory-maps a file rather than reading it into memory. This can be useful from a user perspective because it can read files really fast, and it can be useful from a developer perspective because you can take advantage of wk’s parsers without allocating huge vectors (if you’re willing to define your own ALTREP class, which, I get it, is a niche crowd). The basic idea is that you can deal with bigger data because there are fewer copies of it at any given time. If you only need part of your data, it can be much faster!

WKT reader and writer updates

The well-known text reader was something I wrote early on in my C++ career and it was showing its age. When prototyping some Arrow geometry representations I wanted a WKT reader for that, too, but the current implementation was too messy to port anywhere else. The new version, in addition to being much easier to use in other contexts, is way faster! The main optimization was the use of fast_float, which parses numbers about 4 times faster than what I was using before. As a consequence, the reader is about 4 times faster than it was, and compared to the geos::geos_geometry_writer(), can read WKT into a GEOS pointer in R about 2 times faster than geos::geos_read_wkt().

The WKT writer is much simpler and needed fewer changes, but was rewritten to drop a dependency on cpp11. The cpp11 package is awesome but was only getting used for a few things. As a result, after the rewrite, wk is now a zero-dependency package.

Testing on big-endian

The wk package was written theoretically work on big endian systems; however, it was never tested! In particular this matters for wk because it writes and reads well-known binary and needs to convert between system endian and whatever endian the WKB is written in. It turns out it’s not that hard to do, but there’s little guidance on the internet other than “you can just use Ubuntu on s390x with qemu”. I’m grateful for the setup guide for Ubuntu Server for s390x and for homebrew for a qemu install as easy as brew install qemu. The steps I followed (on MacOS BigSur + M1) were:

  • Install qemu (brew install qemu)
  • Download the latest Ubuntu Server ISO image for s390x.
  • Mount the ISO image using the Disk Utility (for me it put the image at “/Volumes/Ubuntu-Server 20”)
  • Create a disk for qemu with qemu-img create -f qcow2 disk-image.qcow2 10G
  • Install Ubuntu! The setup guide I linked earlier has a cool thing you can do that blows through all the steps automatically but I didn’t feel like setting that up so I just did the steps manually. This only took a few minutes (although the whole install took the better part of an hour).
qemu-system-s390x -no-reboot -name test-s390x -nographic -m 2048 \
    -drive file=disk-image.qcow2,format=qcow2,cache=none,if=virtio \
    -cdrom ubuntu-20.04.3-live-server-s390x.iso \
    -kernel /Volumes/Ubuntu-Server\ 20/boot/kernel.ubuntu \
    -initrd /Volumes/Ubuntu-Server\ 20/boot/initrd.ubuntu

After the install is finished, mount the ISO image and start the emulator!

qemu-system-s390x -no-reboot -name auto-inst-test -nographic -m 2048 \
    -drive file=disk-image.qcow2,format=qcow2,cache=none,if=virtio

Once I was in the virtual big-endian computer for the first time, I had to install R and devtools so that I could run the tests. I also installed the tidyverse because it takes care of a installing bunch of package dependencies that are annoying to install later. This takes a while but because the VM is using a virtual disk you don’t have to do it every time you start the VM.

sudo apt-get update &&  \
    sudo apt-get install -y r-base \
    libxml2-dev libcurl4-openssl-dev libssl-dev libgit2-dev \
    libfontconfig1-dev libfreetype6-dev libharfbuzz-dev libfribidi-dev \
    libpng-dev libtiff5-dev libjpeg-dev
sudo R -e 'install.packages(c("devtools", "tidyverse"), dependencies = TRUE)'

I know there are some options for using SSH and VSCode, but I opted to use git to shuffle code I was trying to fix back and forth between my computer and the VM. On my computer I created a new branch (I used usethis::pr_init("big-endian") and then usethis::pr_push()). The on the emulator:

git clone https://github.com/paleolimbot/wk.git
git checkout big-endian

Then, run the tests!

cd wk
R -e 'devtools::test()'

For me, quite a few were failing because they implicitly assumed that system endian was little endian. As I updated the code in wk on my regular computer in RStudio, I’d git push and then pull the changes on the VM. Because the VM terminal doesn’t scroll in the same way, I saved the test results to a file and used nano to scroll through the results.

git pull
R -e 'devtools::test()' > ../test-results.Rout
nano ../test-results.Rout

And, if you’ve never shutdown Ubuntu from the terminal (as I hadn’t!), shutdown the VM to exit:

sudo shutdown now

Thanks to mwhudson for alerting me that it wasn’t working on s390x and for pointing me to the check page for the package binary for s390x on Ubuntu!

Dewey Dunnington
Geoscientist, Programmer, Educator