Using R2 to host open data

Posted on 2023-10-11

Cloudflare R2 is an alternative blob storage platform to Amazon S3. It offers an S3-compatible API to allow seamless migration, and it is a lot cheaper than S3.
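
Because the API is S3-compatible, existing S3 tooling can usually be pointed at R2 by changing the endpoint. Here is a minimal sketch with boto3, assuming the endpoint pattern https://<account_id>.r2.cloudflarestorage.com and a region name of "auto". The helper names, account ID, keys, and bucket name are placeholders made up for illustration.

```python
def r2_endpoint(account_id: str) -> str:
    """Build the S3-compatible endpoint URL for an R2 account (assumed pattern)."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


def make_r2_client(account_id: str, access_key_id: str, secret_access_key: str):
    """Create a boto3 S3 client that talks to R2 instead of AWS."""
    import boto3  # imported lazily so r2_endpoint works without boto3 installed

    return boto3.client(
        "s3",
        endpoint_url=r2_endpoint(account_id),
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name="auto",  # assumption: R2 accepts "auto" as the region
    )


# Usage (placeholders, not run here):
# client = make_r2_client("<account_id>", "<access_key_id>", "<secret_access_key>")
# client.upload_file("blah.geoparquet", "my-open-data", "blah.geoparquet")
```

The rest of the code stays the same as it would for S3, which is what makes moving an existing upload script over fairly painless.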

The case for open data

The main attractions of R2 are its generous free tier and its overall pricing compared with S3.

Open data has a few patterns we can note. It is usually slow moving, which means uploads are infrequent. Datasets are usually in the 100 MB to 10 GB range, though this really depends on the data. People access the data infrequently, since downloading it once and working with the local copy is common practice when dealing with data.

Zero cost egress

S3 charges for egress. This means that whenever someone accesses the data, AWS charges you according to how much data they download or transfer. There are a few ways to circumvent this, like publishing the data through AWS open data or using Requester Pays, but these aren't easy for just anybody to access.

R2 does not charge for egress. This means it is possible to host large files at minimal cost. You can host gigabytes and even terabytes of data without worrying too much about how much data people download.

This doesn't mean that people accessing the data is free. R2 still charges for GET requests, similar to S3, but you only need to worry about the amount of data stored and the number of requests.
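
To make the difference concrete, here is a rough cost sketch. The function name and parameters are hypothetical, and no real prices are baked in: you pass in the current rates from each provider's pricing page. It also ignores free tiers and any request types other than GET.

```python
def estimate_monthly_cost(
    stored_gb: float,
    get_requests: int,
    price_per_gb_month: float,
    price_per_million_gets: float,
    egress_gb: float = 0.0,
    price_per_egress_gb: float = 0.0,
) -> float:
    """Rough monthly bill: storage + GET requests + egress (free tiers ignored)."""
    storage = stored_gb * price_per_gb_month
    requests = (get_requests / 1_000_000) * price_per_million_gets
    egress = egress_gb * price_per_egress_gb
    return storage + requests + egress
```

For a provider without egress fees, leave the egress price at 0 and the bill no longer depends on how much data people download. For a provider that charges for egress, the egress term grows with every download of a large file.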

Issues

Cloudflare default settings blocking users

My first use case was hosting some data I commonly use in a public place so that I can easily access it from Python/geopandas.

import geopandas as gpd
df = gpd.read_file("https://free.domain/blah.geoparquet")

An issue arose when I requested data via geopandas: the requests were blocked by Cloudflare's Browser Integrity Check.

I was able to address this by disabling the integrity check for my domain, but it was difficult to debug. I had to work out that the cause was Cloudflare, which by default blocks requests carrying the User-Agent of urllib, and urllib is what geopandas uses when requesting data over HTTPS. Similar issues might pop up and be just as difficult to debug.
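
For readers who cannot change the settings on the hosting domain, one client-side option is to download the file with an explicit User-Agent and open the local copy. This is only a sketch: the helper names and the User-Agent string are made up, and I have not verified that a custom User-Agent is enough to pass Cloudflare's check.

```python
import shutil
import urllib.request

# Hypothetical identifier; pick something that describes your own script.
USER_AGENT = "open-data-client/0.1"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build a GET request with an explicit User-Agent instead of urllib's default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url: str, dest_path: str, user_agent: str = USER_AGENT) -> str:
    """Stream the remote file to dest_path and return the local path."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Usage (not run here):
# import geopandas as gpd
# local_path = download("https://free.domain/blah.geoparquet", "blah.geoparquet")
# df = gpd.read_file(local_path)
```

Downloading once and reading the local copy also keeps the number of GET requests against the bucket low.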