Prototyping an Apache Arrow representation of geometry

After spending a few kiddo nap times experimenting with low-level Apache Arrow representations of vectors in R, I remembered a lively GitHub discussion of Arrow representations of geometry in R. The arrow R package can already round-trip sf objects to and from feather and parquet formats, and the sfarrow package does so as well with a slightly different encoding.

Why bother?

Why do we need another binary representation of geometry? Aren’t WKB, EKWB, TWKB, FlatGeoBuf, Shapefile, GPKG WKB, SpatialLite WKB, (and many many more), enough?

I think it’s a matter of use-case. Given the number of developers coming up with new and awesome ways to represent geometries as big data systems get bigger and better, I think I’m not alone. I also love Apache Arrow and I’m excited about how the widespread multi-language support in the ecosystem can make open-source GIS faster, better, and easier to maintain.

In particular, I think that an Arrow representation of Geometry can reduce (or eliminate!) the need to copy geometries between packages or between languages. Right now in R, for example, the primary method by which one might compute an intersection involves two serializations to WKB and two conversions from WKB to a native geometry representation (in this case, GEOS and sf). You can represent geometry as pointers to a GEOSGeometry (e.g., via the geos package or geopandas), but that requires a system GEOS install which might be difficult, impossible, or unnecessary for an application (e.g., if coordinates are longitude/latitude and you want to use s2geometry instead of GEOS). You can skip the intermediary WKB serialization using the wk package or Rust’s geozero, but those have their own performance limitations and doesn’t eliminate all copies.

Another cool thing about using Arrow for geometry is that we get dictionary representations for free. This means that for things like meshes or polygon coverages we don’t have to store repeated coordinates (just arrange their IDs). The vectors are arranged the same way so we don’t have to do as much special-casing for mesh types. So cool!

Arrow might not be the answer to all binary geometry woes, but it’s definitely worth a shot. Let’s try!

An Arrow geometry representation

There’s a few ways that have already been proposed to do this and this shouldn’t be read as another proposition…I just need something to work with. For argument’s sake, let’s define a few line features and represent them as a List<Struct<x: Float64, y: Float64>>. This stores x and y coordinates as double vectors and lets us separate line features by defining a series of offsets.

example_lines <- c("LINESTRING (0 0, 5 10)", "LINESTRING (0 5, 10 0)")
plot(wk::wkt(example_lines), col = c("black", "brown"))

Using arrowvctrs we can piece together the memory needed to represent these in Arrow format. Anything other than this toy example needs a helper function but the key part is that if you already have double() vectors in R of x and y coordinates then the data is never copied!

library(arrowvctrs)

geo_line_schema <- arrow_schema(
  "+l", 
  children = list(
    arrow_schema(
      "+s",
      children = list(
        arrow_schema("g", name = "x"),
        arrow_schema("g", name = "y")
      )
    )
  )
)

geo_line <- arrow_vctr(
  geo_line_schema,
  arrow_array(
    buffers = c(0L, 2L, 4L),
    length = 2L,
    children = list(
      arrow_array(
        length = 4L,
        children = list(
          arrow_array(c(0, 5, 0, 10), length = 4L),
          arrow_array(c(0, 10, 5, 0), length = 4L)
        )
      )
    )
  )
)

This is where things get heavy. We’re going to pop into C++ via cpp11 and unpack this puppy at the C level. arrowvctrs arranges and protects R memory using the Arrow C data interface which is ABI stable and can be passed to other packages, programs, or anything else with the abi.h header. To illustrate how this can be done, I’m just going to print out the coordinates of each line:

#include "cpp11.hpp"
#include "abi.h"

using namespace cpp11;

[[cpp11::register]]
void print_arrow_line_array(sexp geo_line_array) {
  struct ArrowArray* geo_line = (struct ArrowArray*) R_ExternalPtrAddr(geo_line_array);
  int* offsets = (int*) geo_line->buffers[0];
  double* x = (double*) geo_line->children[0]->children[0]->buffers[0];
  double* y = (double*) geo_line->children[0]->children[1]->buffers[0];
  
  for (int i = 0; i < geo_line->length; i++) {
    int line_size = offsets[i + 1] - offsets[i];
    Rprintf("Line %d: \n", i);
    for (int j = 0; j < line_size; j++) {
      Rprintf("- x: %g\ty: %g\n", x[offsets[i] + j], y[offsets[i] + j]);
    }
  }
}
print_arrow_line_array(geo_line$array)
#> Line 0: 
#> - x: 0  y: 0
#> - x: 5  y: 10
#> Line 1: 
#> - x: 0  y: 5
#> - x: 10  y: 0

It should be said that the situation gets a little more complicated when you have NULL geometries or coordinates, but overall it’s not bad!

Zero-copy S2/Arrow

I know that s2geometry isn’t the most famous geometry engine (GEOS being the most famous one I know of), but it is a geometry engine that was designed from the ground up for speed and flexibility. Its Shape interface provides a useful wrapper around coordinates that exist elsewhere and might need to be calculated on demand; however, its methods are limited to coordinates in latitude/longitude assuming a spherical (rather than ellipsoidal) earth. Building S2 is a bit of a nightmare but once you have it installed it’s a joy to work with!

#include "cpp11.hpp"
#include "abi.h"

#include "s2/s2lax_polyline_shape.h"
#include "s2/s2boolean_operation.h"

using namespace cpp11;

class ArrowS2LaxPolylineShape : public S2Shape {
 public:
  ArrowS2LaxPolylineShape() {
    this->size = 0;
    this->base_offset = 0;
    this->x = nullptr;
    this->y = nullptr;
  }

  bool Init(struct ArrowArray* geo_line, int size, int base_offset) {
    this->size = size;
    this->base_offset = base_offset;
    this->x = (double*) geo_line->children[0]->children[0]->buffers[0];
    this->y = (double*) geo_line->children[0]->children[1]->buffers[0];
    return true;
  }

  int num_vertices() const { return this->size; }
  S2Point vertex(int i) const { 
    double x = this->x[this->base_offset + i];
    double y = this->y[this->base_offset + i];
    return S2LatLng::FromDegrees(y, x).ToPoint();
  }

  // S2Shape interface:
  int num_edges() const final { return std::max(0, this->num_vertices() - 1); }
  Edge edge(int e) const final { return Edge(this->vertex(e), this->vertex(e + 1)); }
  int dimension() const final { return 1; }
  ReferencePoint GetReferencePoint() const final {
    return ReferencePoint::Contained(false);
  }
  int num_chains() const final { return std::min(1, this->num_edges()); }
  Chain chain(int i) const final { return Chain(0, this->num_edges()); }
  Edge chain_edge(int i, int j) const final { return Edge(this->vertex(j), this->vertex(j + 1)); }
  ChainPosition chain_position(int e) const final { return S2Shape::ChainPosition(0, e); }

 private:
  int size;
  int base_offset;
  double* x;
  double* y;
};

[[cpp11::register]]
logicals arrow_line_array_s2_intersects(sexp geo_line_array1, sexp geo_line_array2) {
  struct ArrowArray* geo_line1 = (struct ArrowArray*) R_ExternalPtrAddr(geo_line_array1);
  int* offsets1 = (int*) geo_line1->buffers[0];
  
  struct ArrowArray* geo_line2 = (struct ArrowArray*) R_ExternalPtrAddr(geo_line_array2);
  int* offsets2 = (int*) geo_line2->buffers[0];
  
  if (geo_line1->length != geo_line2->length) {
    stop("Both arrays must be of equal length");
  }
  
  MutableS2ShapeIndex index1;
  MutableS2ShapeIndex index2;
  
  writable::logicals result(geo_line1->length);
  for (int i = 0; i < geo_line1->length; i++) {
    auto shape1 = absl::make_unique();
    auto shape2 = absl::make_unique();
  
    shape1->Init(geo_line1, offsets1[i + 1] - offsets1[i], offsets1[i]);
    shape2->Init(geo_line2, offsets2[i + 1] - offsets2[i], offsets2[i]);
    
    index1.Add(std::move(shape1));
    index2.Add(std::move(shape2));
    
    result[i] = S2BooleanOperation::Intersects(index1, index2);
    
    index1.Clear();
    index2.Clear();
  }
  
  return result;
}

Let’s see if it can detect a self-intersection!

arrow_line_array_s2_intersects(
  geo_line$array,
  geo_line$array
)
#> [1] TRUE TRUE

This works because Shape instances in S2 don’t have to own the underlying data, which is what we want. The ShapeIndex class also doesn’t have to own anything, which might help here for points since allocating the index structure is probably as much memory or more than the point itself and what we’re trying to avoid is a lot of small allocs. Keen observers will note that this probably creates a copy via the MutableS2ShapeIndex but this isn’t necessarily true: the implementation is clever and is often smaller than the underlying Shape. (In case it’s not obvious, these are all un-tested musings).

Zero-copy GEOS/Arrow?

I tried to get this to work by implementing a custom CoordinateSequence class, but couldn’t get it not make a copy of every single coordinate. Every implementation I know about uses the C API and does a lot of copying so maybe it’s never come up (or maybe I just don’t know what I’m doing with the GEOS C++ API!). If anybody reading knows about this, the problem I’m up against is this method of CoordinateSequence:

virtual const Coordinate& getAt(std::size_t i) const = 0;

…which requires that a const Coordinate persists in memory for the lifecycle of the CoordinateSequence. Many algorithms loop through every coordinate, which pretty much guarantees a complete copy.

Is this useful?

It was for me learning about this! Realistically any Arrow geometry representation has to be flexible…we need the ability to represent coordinates as vectors of x and y, as vectors keeping coordinate values together, and as dictionaries to support topological encoding of meshes and coverages. Without flexibility, there is unnecessary copying! It’s early days but I’m looking forward to what open-source spatial developers come up with in the future.

Avatar
Dewey Dunnington
Geoscientist, Programmer, Educator