Using R2 to host open data

Posted on 2023-10-11

Cloudflare R2 is an alternative to traditional blob storage platforms. It offers an S3-compatible API, which eases migration, and is considerably cheaper than S3.
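As a rough illustration of what S3 compatibility means in practice, an existing S3 client is pointed at R2 mainly by changing its endpoint. The sketch below only builds the connection settings. The helper name `r2_s3_client_kwargs` and the credential values are hypothetical, and the endpoint pattern and `auto` region reflect my understanding of R2's S3 API, so check them against Cloudflare's documentation.

```python
def r2_s3_client_kwargs(account_id: str, access_key_id: str, secret_access_key: str) -> dict:
    """Return keyword arguments for an S3 client that targets R2 instead of AWS."""
    return {
        "service_name": "s3",
        # Assumed R2 endpoint pattern: one S3 endpoint per Cloudflare account.
        "endpoint_url": f"https://{account_id}.r2.cloudflarestorage.com",
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        # R2 has no AWS-style regions; "auto" is the value assumed here.
        "region_name": "auto",
    }


# Placeholder values, not real credentials.
kwargs = r2_s3_client_kwargs("my-account-id", "my-access-key", "my-secret-key")
print(kwargs["endpoint_url"])
```

With boto3 installed, these settings could be passed as `boto3.client(**kwargs)`, after which the usual S3 calls should apply to R2 buckets.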

The Case for Open Data

R2's main attractions are its generous free tier and its overall pricing compared with AWS S3 and other providers.

Open data exhibits a few patterns worth noting. It is typically slow-moving, meaning uploads are infrequent. Datasets usually range from 100 MB to 10 GB, though this varies. Access is infrequent apart from a few spikes during workshops or tutorials, and with large datasets a common practice is to download the data once and work from the local copy.

Zero Cost Egress

S3 charges for egress, meaning AWS bills you for the volume of data downloaded or transferred whenever someone accesses your data. There are a few ways to avoid these charges, such as publishing the data through AWS Open Data or enabling Requester Pays, but these options are not easily accessible to everyone.

R2, on the other hand, does not charge for egress. This allows you to host large files with minimal costs. You can host gigabytes or even terabytes of data without worrying too much about the amount of data being accessed.

R2 does still charge for GET requests, as S3 does. With egress out of the picture, the only costs left to track are the amount of data stored and the number of requests.
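To see why egress dominates for this kind of workload, here is a minimal cost sketch. The rates are placeholders chosen for illustration, not figures from any provider's price list, and the workload (a 5 GB dataset downloaded 400 times in a month) is hypothetical. Free tiers are ignored to keep the arithmetic simple.

```python
def monthly_cost(
    stored_gb: float,
    read_requests: int,
    egress_gb: float,
    *,
    storage_rate_per_gb: float,
    read_rate_per_million: float,
    egress_rate_per_gb: float,
) -> float:
    """Estimate one month's bill from storage, read requests, and egress."""
    storage_cost = stored_gb * storage_rate_per_gb
    request_cost = read_requests / 1_000_000 * read_rate_per_million
    egress_cost = egress_gb * egress_rate_per_gb
    return storage_cost + request_cost + egress_cost


# Hypothetical workload: one GET request per full download of the dataset.
stored_gb = 5.0
downloads = 400
egress_gb = stored_gb * downloads  # total data leaving the bucket

# Placeholder rates in USD; substitute the current published prices.
with_egress_fee = monthly_cost(
    stored_gb, downloads, egress_gb,
    storage_rate_per_gb=0.02, read_rate_per_million=0.40, egress_rate_per_gb=0.09,
)
without_egress_fee = monthly_cost(
    stored_gb, downloads, egress_gb,
    storage_rate_per_gb=0.02, read_rate_per_million=0.40, egress_rate_per_gb=0.0,
)

print(f"with egress fee:    ${with_egress_fee:.2f}")
print(f"without egress fee: ${without_egress_fee:.2f}")
```

Under these assumed rates, nearly the entire bill in the first case comes from the egress term, while storage and requests stay at a few cents in both.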

Issues

Cloudflare Default Settings Blocking Users

My first use case involved hosting some data I commonly used in a public location so that I could easily access it with Python/Geopandas.

import geopandas as gpd

# Read the dataset directly from its public URL
df = gpd.read_file("https://free.domain/blah.geoparquet")

An issue arose as soon as I requested the data through GeoPandas: the requests were blocked by Cloudflare's Browser Integrity Check.

I was able to address this by disabling the integrity check for my domain, but the cause was difficult to track down. By default, Cloudflare blocks requests that carry urllib's User-Agent, and GeoPandas uses urllib when fetching data over HTTPS. Other default settings could cause similar failures that are just as hard to diagnose.
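One way to check whether the User-Agent is what triggers a block is to compare the status code returned for urllib's default header with the one returned for a custom header. This is a minimal sketch using only the standard library. The `CUSTOM_USER_AGENT` string and both helper names are hypothetical, and whether Cloudflare accepts any particular User-Agent depends on the zone's settings.

```python
import urllib.error
import urllib.request

# Hypothetical client identifier for this sketch.
CUSTOM_USER_AGENT = "open-data-client/0.1"


def build_request(url: str, user_agent: str = CUSTOM_USER_AGENT) -> urllib.request.Request:
    """Build a GET request whose User-Agent replaces urllib's default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(request: urllib.request.Request) -> int:
    """Return the HTTP status code, including error codes such as 403."""
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx responses; the code is what we want to see.
        return error.code
```

Calling `fetch_status(urllib.request.Request(url))` sends urllib's default `Python-urllib/x.y` header, while `fetch_status(build_request(url))` sends the custom one. If the two return different status codes, the User-Agent is the likely cause.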