Progressive S3 to Cloudflare R2 migration using Workers

Hi R2 team! 👀

R2, Cloudflare's competitor to S3 and other object stores, is now in open beta! In my opinion, there are three things that make R2 stand out from the rest:

1. There are no egress fees. Yep, absolutely none - you are only charged for storage and operations.
2. Global distribution as standard with region control on the roadmap.
3. There's a generous free tier which makes it great for small projects.

|  | Free | Paid - Rates |
| --- | --- | --- |
| Storage | 10 GB / month | $0.015 / GB-month |
| Class A Operations | 1,000,000 requests / month | $4.50 / million requests |
| Class B Operations | 10,000,000 requests / month | $0.36 / million requests |

But just because R2 has free egress doesn't mean that your current object store does - and that throws a spanner into any migration plans. Ideally, you'd transfer assets as and when they're requested, so you can handle the migration over time and not fuss over files that aren't accessed frequently.

Progressive migration

The R2 team is already planning automatic migration from S3, as discussed in the announcement blog post:

To make this easy for you, without requiring you to change any of your tooling, Cloudflare R2 will include automatic migration from other S3-compatible cloud storage services. Migrations are designed to be dead simple. After specifying an existing storage bucket, R2 will serve requests for objects from the existing bucket, egressing the object only once before copying and serving from R2. Our easy-to-use migrator will reduce egress costs from the second you turn it on in the Cloudflare dashboard.

The only issue? That isn't available yet. R2 has only just reached the open beta phase - so that's to be expected - but that doesn't stop us from implementing it on our own.

Using Cloudflare Workers

Workers has first-class integration with R2 through a bucket binding, much like the existing KV and Durable Object bindings, so it's the platform of choice for this.

[Diagram: s3-to-r2 migration flow through a Worker]

We're going to assume that you already have a Cloudflare account and have purchased the R2 plan. There's a free tier; it just wants a payment method on file to make sure you're not a bot.

If not, follow the Get started guide in Cloudflare's documentation up until Step 5, where you'd add code to your Worker. Pause there, because we're going to add our own.
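
If you're starting from scratch, the Wrangler 2 commands from that guide boil down to something like this - a sketch, so defer to the guide if the exact commands have changed:

wrangler init s3-to-r2
# create the R2 bucket that the Worker will bind to
wrangler r2 bucket create s3-migration
scaffold the project and create the bucket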

Setting up the wrangler.toml

My wrangler.toml (the configuration file for your Worker) looks like this:

name = "s3-to-r2"
compatibility_date = "2022-05-12"
main = "./src/index.ts"

[vars]
AWS_DEFAULT_REGION = "eu"
AWS_SERVICE = "s3"
AWS_S3_BUCKET_SCHEME = 'https:'
AWS_S3_BUCKET = "example-s3-bucket.storage.googleapis.com"

[[r2_buckets]]
binding = 'R2'
bucket_name = 's3-migration'
wrangler.toml

AWS_S3_BUCKET_SCHEME is configurable because some providers (like Google Cloud Storage) allow dots in bucket names, which causes certificate issues since wildcard certificates only cover a single subdomain level. Using http: instead of https: sidesteps this by bypassing TLS entirely - use https: if at all possible, though.

In addition to the variables under the [vars] block, we're going to need our AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Because we don't want those in plain text, we're going to store them as secrets.

To do this, use wrangler secret put to add both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

➜  s3-to-r2 git:(master) ✗ wrangler secret put AWS_SECRET_ACCESS_KEY
 ⛅️ wrangler 2.0.3 
-------------------
Enter a secret value: **************************************** 
adding secrets into your worker

The Worker itself

Let's have a look through the Worker's code and break it down into segments to explain what each part does.

export default {
    async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
        ...
    }
}
export our fetch handler

We're going to use TypeScript and the Module Worker format - we export a fetch handler that's called for each request.

import { AwsClient } from "aws4fetch";

interface Env {
    R2: R2Bucket,
    AWS_ACCESS_KEY_ID: string,
    AWS_SECRET_ACCESS_KEY: string,
    AWS_SERVICE: string,
    AWS_DEFAULT_REGION: string,
    AWS_S3_BUCKET: string
    AWS_S3_BUCKET_SCHEME: string
}
imports & environment bindings

The env parameter contains our bindings - the environment variables, secrets and the R2 bucket - so it needs a type definition. We also import the aws4fetch library, which we'll use to create the signed requests we fetch() to grab objects from S3.
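
The snippets that follow reference a url and an objectName that aren't shown in these excerpts. They're presumably derived from the incoming request at the top of the fetch handler - a minimal sketch, assuming the object key is simply the URL path:

const url = new URL(request.url);
// e.g. https://cdn.example.com/asset.mp4 -> objectName of "asset.mp4"
const objectName = url.pathname.slice(1);
derive url and objectName from the request (illustrative sketch)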

if (objectName === '') {
    return new Response(`Bad Request`, {
        status: 400
    })
}

if (request.method !== 'GET') {
    return new Response(`Method Not Allowed`, {
        status: 405
    })
}
some quick logic to check if we want to continue with the request

As it stands, we're not offering directory listings and we're only interested in GET requests - this isn't an endpoint for uploading or deleting files - so we reject those requests before continuing.

const obj = await env.R2.get(objectName);
if (obj === null) {
	...
}
try and fetch the R2 object, check if it exists

.get(objectName) on an R2 bucket returns null if the object doesn't exist, so that's how we check whether the object is already in our R2 bucket. If it isn't, we want to fetch it from S3.

const aws = new AwsClient({
    "accessKeyId": env.AWS_ACCESS_KEY_ID,
    "secretAccessKey": env.AWS_SECRET_ACCESS_KEY,
    "service": env.AWS_SERVICE,
    "region": env.AWS_DEFAULT_REGION
});

url.protocol = env.AWS_S3_BUCKET_SCHEME;
url.hostname = env.AWS_S3_BUCKET;

const signedRequest = await aws.sign(url);
const s3Object = await fetch(signedRequest);

if (s3Object.status === 404) {
    return objectNotFound(objectName)
}
try and fetch the S3 object, check if it exists

We'll first create a new AwsClient from the aws4fetch package. Then we need to rewrite the request URL to point to our S3 bucket.

If the original request was for https://cdn.example.com/asset.mp4 and we're using Google Cloud Storage, it'll look a little something like https://bucket.storage.googleapis.com/asset.mp4.

We'll turn that into an AWS4-signed request and fetch it.

function objectNotFound(objectName: string): Response {
    return new Response(`Object ${objectName} not found`, {
        status: 404,
    })
}
404 response

If we get a 404 back then we'll display a simple 404 page with a short message rather than returning the XML response you'd usually get.

const s3Body = s3Object.body.tee();
ctx.waitUntil(env.R2.put(objectName, s3Body[0], {
    httpMetadata: s3Object.headers
}))

return new Response(s3Body[1], s3Object);
stream the response and push into R2 in the background 

The body returned by fetch() is a ReadableStream and can only be read once - which is problematic, since we need to push it to R2 and also return the asset to the user. tee() gives us an array containing two ReadableStream objects, so we pass one to R2 and stream the other back to the user in a response.

We pass the original headers from the S3 response into the httpMetadata property of the R2PutOptions so that the original Content-Type, Content-Language, and other headers are preserved.

const headers = new Headers()
obj.writeHttpMetadata(headers)
headers.set('etag', obj.httpEtag)

return new Response(obj.body, {
    headers
});
return the object from R2 and add the headers from the httpMetadata

Before returning the R2 object, we write the headers from its httpMetadata using the writeHttpMetadata method and set the etag header from httpEtag.

Give it a try!

The code for this Worker will be available at https://github.com/KianNH/cloudflare-worker-s3-to-r2-migration

I've tested it with files upwards of 350 MB, but if you spot any issues or bugs, please open an issue or pull request!

You can also leverage the Cache API of Workers to cache the assets from R2, saving on R2 operations - that'll be added into this Worker in the future.
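
In the meantime, here's a hedged sketch of what that could look like - wrapping the R2 read with the default cache via caches.default. The cache key, Cache-Control TTL and structure are illustrative, not the final implementation:

const cache = caches.default;
const cacheKey = new Request(request.url, request);

// Serve from the Workers cache if we already have a copy.
let response = await cache.match(cacheKey);
if (!response) {
    const obj = await env.R2.get(objectName);
    if (obj === null) return objectNotFound(objectName);

    const headers = new Headers();
    obj.writeHttpMetadata(headers);
    headers.set('etag', obj.httpEtag);
    headers.set('Cache-Control', 'public, max-age=3600'); // illustrative TTL

    response = new Response(obj.body, { headers });
    // Store a copy in the background; clone() so the body can be read twice.
    ctx.waitUntil(cache.put(cacheKey, response.clone()));
}

return response;
cache R2 reads with the Workers Cache API (illustrative sketch)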