Posted on Thu 07 October 2021 under Rust

ROAPI: An API Server for Static Datasets

ROAPI is an API server that exposes CSV, JSON and Parquet files without the need to write any code. The project was started by Qingping Hou around this time last year. Qingping spent the better part of four years working at LinkedIn before joining Scribd as a Senior Engineer. He is also a committer to both the Apache Airflow and Arrow projects.

ROAPI is made up of 4K lines of Rust. The line count is low thanks to heavy use of third-party libraries. These include Apache Arrow, which provides Parquet support among other things; Arrow's DataFusion project, which provides SQL and query execution support; Actix, which provides the HTTP interface; and Rusoto, the AWS SDK for Rust.

Files can be sourced from the file system, HTTP/HTTPS, AWS S3, Google Sheets or Delta Lake. HDFS is supported either through Delta Lake or simply by using HDFS FUSE.

DataFusion is an in-memory database and doesn't yet support spill-to-disk. For this reason, any dataset exposed via ROAPI needs to fit into memory. ROAPI also can't yet take advantage of Ballista, DataFusion's distributed query engine, so data can only be served from a single machine for the time being.

Each table exposed via ROAPI can only source data from a single file. So if you were to expose multiple Parquet files sitting on HDFS, only one file's contents could be used for any single table's endpoint. SQL JOINs are supported between tables but UNION statements aren't.

ROAPI, Up & Running

I'll install a few utilities that will be used throughout this post via Homebrew.

$ brew install \
    curl \
    git \
    htop \
    jq \
    virtualenv \
    wrk

ROAPI can be built with Rust's Cargo package manager. Alternatively, it can be installed from a pre-compiled binary that is packaged with Maturin and released via PyPI. This release is built using Rust's nightly channel and has support for SIMD.

$ virtualenv ~/.roapi
$ source ~/.roapi/bin/activate
$ python3 -m pip install roapi-http

The above installation via Python will work on Linux, macOS and Windows. On macOS for example, a single 70 MB binary will be placed in the virtual environment's binary folder.

$ otool -L ~/.roapi/bin/roapi-http
/Users/mark/.roapi/bin/roapi-http:
    /System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 1.0.0, current version 59306.140.5)
    /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1677.104.0)
    /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.100.1)
    /usr/lib/libresolv.9.dylib (compatibility version 1.0.0, current version 1.0.0)

Building APIs with ROAPI

I'll use two example datasets that are provided in the Git repository for ROAPI.

$ git clone https://github.com/roapi/roapi
$ roapi-http \
    --table "uk_cities=roapi/test_data/uk_cities_with_headers.csv" \
    --table "roapi/test_data/spacex_launches.json"

The two datasets above are in different formats. Each additional dataset is exposed by adding another --table flag to the roapi-http call. The command above launches an Actix HTTP server with 8 workers listening on http://127.0.0.1:8080/.
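If ROAPI is launched from a script rather than typed into a shell, the repeated --table flags can be generated from a list of datasets. Below is a minimal sketch of that idea. The helper name build_roapi_args is hypothetical and not part of ROAPI; it relies only on the --table flag shown above.

```python
import shlex


def build_roapi_args(tables):
    """Return an argv list for roapi-http with one --table flag per dataset.

    `tables` is a list of (name, path) pairs. A name of None passes the bare
    path, as was done for the SpaceX dataset above.
    """
    args = ["roapi-http"]
    for name, path in tables:
        args.append("--table")
        args.append(f"{name}={path}" if name else path)
    return args


args = build_roapi_args([
    ("uk_cities", "roapi/test_data/uk_cities_with_headers.csv"),
    (None, "roapi/test_data/spacex_launches.json"),
])

# Print a copy-pasteable shell command for inspection.
print(shlex.join(args))
```

Passing the same list to subprocess.Popen(args) would start the server without going through a shell.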

There is no HTML-based discovery from the root URL. If you open the above URL in a web browser you'll be greeted with an HTTP 404 response. In order to discover which datasets are available, run the following:

$ curl --silent "127.0.0.1:8080/api/schema" | jq "keys"
[
  "spacex_launches",
  "uk_cities"
]
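The same discovery step can be done from Python's standard library. This is a minimal sketch: it assumes only that /api/schema returns a JSON object keyed by table name, which is what the jq "keys" call above shows. The function names are my own, and the sample dictionary is a stand-in so the snippet runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # address used throughout this post


def table_names(schema):
    """Return the sorted table names from a parsed /api/schema response."""
    return sorted(schema.keys())


def fetch_schema(base_url=BASE_URL):
    """Fetch and parse /api/schema. Needs a running roapi-http instance."""
    with urllib.request.urlopen(f"{base_url}/api/schema") as resp:
        return json.load(resp)


# Offline stand-in: only the top-level keys matter for this helper.
sample = {"uk_cities": {"fields": []}, "spacex_launches": {"fields": []}}
print(table_names(sample))
```

With the server running, table_names(fetch_schema()) would replace the offline sample.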

To see the fields of a given dataset, run the following:

$ curl --silent "127.0.0.1:8080/api/schema/uk_cities" | jq
{
  "fields": [
    {
      "name": "city",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "lat",
      "data_type": "Float64",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "lng",
      "data_type": "Float64",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}
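A schema response like the one above is handy for checking column names before building a query. The sketch below uses the uk_cities schema exactly as shown, trimmed to the keys it needs. It assumes every data_type is a plain string, which holds for this table but may not for more complex Arrow types. Both helper names are hypothetical.

```python
# The /api/schema/uk_cities response shown above, trimmed to the keys used here.
uk_cities_schema = {
    "fields": [
        {"name": "city", "data_type": "Utf8", "nullable": False},
        {"name": "lat", "data_type": "Float64", "nullable": False},
        {"name": "lng", "data_type": "Float64", "nullable": False},
    ]
}


def column_types(table_schema):
    """Map each column name to its Arrow data type."""
    return {field["name"]: field["data_type"] for field in table_schema["fields"]}


def missing_columns(table_schema, wanted):
    """Return the requested columns that the table does not have."""
    known = column_types(table_schema)
    return [name for name in wanted if name not in known]


print(column_types(uk_cities_schema))
print(missing_columns(uk_cities_schema, ["city", "population"]))
```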

Query Files using SQL, GraphQL & REST

There are three different interfaces exposed for each table by ROAPI. The first is a REST interface.

$ curl --silent "127.0.0.1:8080/api/tables/uk_cities?columns=city,lat,lng&limit=2" | jq
[
  {
    "city": "Elgin, Scotland, the UK",
    "lat": 57.653484,
    "lng": -3.335724
  },
  {
    "city": "Stoke-on-Trent, Staffordshire, the UK",
    "lat": 53.002666,
    "lng": -2.179404
  }
]
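The REST URL above is just a table name plus a query string, so it is easy to assemble programmatically. Here is a minimal sketch using only the columns and limit parameters that appear in the curl call; the table_url helper is my own naming, not something ROAPI ships.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:8080"  # address used throughout this post


def table_url(table, columns=None, limit=None, base_url=BASE_URL):
    """Build a REST URL for a table with optional columns and limit."""
    params = {}
    if columns:
        params["columns"] = ",".join(columns)
    if limit is not None:
        params["limit"] = limit
    # safe="," keeps the commas in the column list unescaped, as in the curl call.
    query = urlencode(params, safe=",")
    url = f"{base_url}/api/tables/{table}"
    return f"{url}?{query}" if query else url


url = table_url("uk_cities", columns=["city", "lat", "lng"], limit=2)
print(url)
```

The resulting string can be passed to urllib.request.urlopen or any other HTTP client.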

The second is a GraphQL interface via HTTP POST.

$ curl --silent \
     -X POST \
     -d "query { uk_cities(limit: 2) {city, lat, lng} }" \
     127.0.0.1:8080/api/graphql | jq
[
  {
    "city": "Elgin, Scotland, the UK",
    "lat": 57.653484,
    "lng": -3.335724
  },
  {
    "city": "Stoke-on-Trent, Staffordshire, the UK",
    "lat": 53.002666,
    "lng": -2.179404
  }
]
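The GraphQL call can also be made from Python. The sketch below rebuilds the query string used in the curl command and wraps the POST in a small function. Both function names are hypothetical, and the snippet only sends a request when post_text is called, so it runs without a server. Like curl -d, urllib sends the body with a form-encoded Content-Type by default.

```python
import json
import urllib.request

GRAPHQL_URL = "http://127.0.0.1:8080/api/graphql"  # endpoint used above


def graphql_query(table, fields, limit=None):
    """Build a query string in the same form as the curl example above."""
    args = f"(limit: {int(limit)})" if limit is not None else ""
    field_list = ", ".join(fields)
    return "query { %s%s {%s} }" % (table, args, field_list)


def post_text(url, body):
    """POST a text body and parse the JSON reply. Needs a running server."""
    req = urllib.request.Request(url, data=body.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


query = graphql_query("uk_cities", ["city", "lat", "lng"], limit=2)
print(query)
```

With roapi-http running, post_text(GRAPHQL_URL, query) should return the same two rows shown above.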

And the third is a SQL interface exposed via HTTP POST.

$ curl --silent \
     -X POST \
     -d "SELECT city, lat, lng
         FROM uk_cities
         LIMIT 2" \
     127.0.0.1:8080/api/sql | jq
[
  {
    "city": "Elgin, Scotland, the UK",
    "lat": 57.653484,
    "lng": -3.335724
  },
  {
    "city": "Stoke-on-Trent, Staffordshire, the UK",
    "lat": 53.002666,
    "lng": -2.179404
  }
]

The SQL interface is the only one of the three that supports JOINs between tables. With this interface, DataFusion parses the SQL, builds an Abstract Syntax Tree from it, builds and optimises a query plan and executes it in parallel with any other queries that happen to be running.
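To illustrate what a JOIN request might look like, here is a hedged sketch. Only the uk_cities columns are documented in this post, so the query self-joins that table rather than guessing at the SpaceX schema. I have not verified the statement against the DataFusion version bundled with ROAPI, and the sql_request helper is hypothetical. The request is built but not sent, so the snippet runs without a server.

```python
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # address used throughout this post

# Self-join on the only table whose columns are shown in this post.
JOIN_SQL = """
SELECT a.city, b.lat, b.lng
FROM uk_cities AS a
JOIN uk_cities AS b
  ON a.city = b.city
LIMIT 2
""".strip()


def sql_request(sql, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /api/sql endpoint."""
    return urllib.request.Request(
        f"{base_url}/api/sql",
        data=sql.encode("utf-8"),
        method="POST",
    )


req = sql_request(JOIN_SQL)
print(req.get_method(), req.full_url)
```

With roapi-http running, passing req to urllib.request.urlopen and parsing the body with json.load should return the joined rows as a list of objects.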

Benchmarking ROAPI

The system I'm using is a 13" 2020 MacBook Pro running macOS 11.6. It has a quad-core Intel Core i5 clocked at 1.4 GHz with a turbo boost speed of up to 3.9 GHz, 8 GB of 2133 MHz LPDDR3 RAM and 250 GB of SSD capacity.

I'll launch ROAPI with the logging level set to "error". I found this to offer a 50% speed boost over INFO-level logging.

$ RUST_LOG=error \
    roapi-http \
    --table "uk_cities=roapi/test_data/uk_cities_with_headers.csv"

I'll then run Will Glozer's wrk HTTP benchmarking tool using 8 threads and 50 open connections for a duration of 30 seconds.

$ wrk -t8 \
      -c50 \
      -d30s \
      "http://127.0.0.1:8080/api/tables/uk_cities?columns=city,lat,lng&limit=2"

htop showed ROAPI using ~700% CPU on this 4-core, 8-thread system while wrk was using ~40%.

  8 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.27ms    2.81ms 138.22ms   96.00%
    Req/Sec     1.94k   198.81     2.58k    70.58%
  464715 requests in 30.02s, 114.34MB read
Requests/sec:  15479.80
Transfer/sec:      3.81MB

ROAPI was able to respond to 15,479.8 requests per second with the above benchmark settings.
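The figures in the wrk output can be cross-checked against each other. The sketch below recomputes the request rate from the totals, applies Little's law (concurrency = throughput x latency) to the 50 open connections, and estimates the average response size. The byte estimate assumes wrk's "MB" means 1024 x 1024 bytes, so treat it as approximate.

```python
# Figures copied from the wrk output above.
requests_total = 464_715
duration_s = 30.02
connections = 50
reported_rps = 15_479.80
transferred_mb = 114.34

# Request rate from the totals; wrk uses a more precise duration internally,
# so a small difference from the reported figure is expected.
rps = requests_total / duration_s

# Little's law: with 50 connections kept busy, latency = connections / throughput.
implied_latency_ms = connections / reported_rps * 1000

# Average bytes per response, assuming 1 MB = 1024 * 1024 bytes.
bytes_per_response = transferred_mb * 1024 * 1024 / requests_total

print(f"{rps:.0f} req/s")
print(f"{implied_latency_ms:.2f} ms implied latency")
print(f"{bytes_per_response:.0f} bytes per response")
```

The implied latency comes out within a few hundredths of a millisecond of the 3.27 ms average that wrk reported, which suggests the connections were kept busy for the whole run.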

Copyright © 2014 - 2024 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.