Engineering deep-dive: how we shave 80 ms off every resumed large-file upload

Large file uploads get interrupted. Network hiccups, mobile clients switching between WiFi and cellular, laptop lids closing at inconvenient moments — resumable uploads exist because the alternative, restarting a multi-gigabyte transfer from scratch, is not something users will tolerate. We’ve had resumable large-file upload support for years. What we didn’t have, until this quarter, was a resume path that was as fast as it could be.

This post documents where we found 80 milliseconds of unnecessary latency in the resume flow, what we changed, and what the numbers look like now. It’s a specific engineering story, not a product announcement, but if you’re running large-file workflows through SEND-SECURELY.COM — radiology archives, financial transaction datasets, engineering CAD packages — the change is real and you’ll notice it.

The original resume flow

Our large-file upload path uses a multipart upload model. When a client initiates an upload of a file above our multipart threshold (currently 100 MB), the upload is broken into chunks of up to 64 MB each. The client uploads chunks in parallel to a set of pre-authorized upload URLs, then signals completion. The server assembles the parts and commits the object.
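The chunk layout described above can be sketched in a few lines. This is a minimal illustration rather than the production client: the names `plan_chunks`, `ChunkRange`, and `uses_multipart` are hypothetical, and it assumes "MB" means binary megabytes (MiB), which is what makes a 2 GB file come out to exactly 32 chunks.

```python
from __future__ import annotations

from dataclasses import dataclass

MIB = 1024 * 1024
MULTIPART_THRESHOLD = 100 * MIB  # from the post: multipart applies above 100 MB
CHUNK_SIZE = 64 * MIB            # from the post: chunks of up to 64 MB


@dataclass(frozen=True)
class ChunkRange:
    index: int   # zero-based chunk number
    offset: int  # byte offset of the chunk within the file
    length: int  # chunk length in bytes (the last chunk may be shorter)


def uses_multipart(file_size: int) -> bool:
    """True when the file is above the multipart threshold."""
    return file_size > MULTIPART_THRESHOLD


def plan_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> list[ChunkRange]:
    """Split file_size bytes into contiguous ranges of at most chunk_size bytes."""
    ranges: list[ChunkRange] = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        ranges.append(ChunkRange(len(ranges), offset, length))
        offset += length
    return ranges
```

Under these assumptions, `plan_chunks(2 * 1024 * MIB)` yields 32 ranges and a 500 MB file yields 8, with a shorter final chunk.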

When an upload is interrupted mid-way and the client reconnects, it needs to determine which chunks were successfully received so it can resume from the right point. The original resume flow worked like this:

1. Client sends a resume request with the upload session ID.
2. Server queries the database for the upload session record, retrieving the list of all chunks and their completion status.
3. Server returns the list of completed chunks to the client.
4. Client computes which chunks remain, validates that the chunks it already sent match what it intends to send (by re-reading chunk content and computing hashes), then begins uploading remaining chunks.

Step 4 contained the problem. “Validates that the chunks it already sent match what it intends to send” sounds straightforward. In practice it meant: for every chunk the server reported as received, the client re-read that chunk from disk and computed a SHA-256 hash, then compared it against what the server stored. This was correct behavior — you don’t want to resume uploading a file that changed on disk between the initial upload and the resume — but it was doing that work sequentially, blocking the resume upload from starting until every received chunk had been locally re-verified.

For a 2 GB file with 32 chunks of 64 MB each, if the upload was interrupted after 20 chunks, the client re-read and hashed all 20 previously-uploaded chunks before starting to upload the remaining 12. On typical NVMe storage that’s fast, but it’s not free. We were measuring 80–120 ms of latency between “client receives resume response” and “first byte of resuming upload hits our ingress” that was attributable entirely to sequential chunk re-hashing.
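To make the stall concrete, here is a rough sketch of what sequential re-verification looks like. It is a simplified stand-in, not the original client code: `hash_chunk` and `verify_sequentially` are hypothetical names, and the `received` mapping (chunk index to the hex SHA-256 the server stored) is an assumed shape for the resume response.

```python
from __future__ import annotations

import hashlib


def hash_chunk(path: str, offset: int, length: int, buf_size: int = 1 << 20) -> str:
    """SHA-256 of up to `length` bytes of `path` starting at `offset`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            data = f.read(min(buf_size, remaining))
            if not data:  # end of file: the last chunk is shorter than `length`
                break
            digest.update(data)
            remaining -= len(data)
    return digest.hexdigest()


def verify_sequentially(path: str, received: dict[int, str], chunk_size: int) -> list[int]:
    """Re-hash every received chunk one at a time; return indices that no longer match.

    `received` maps chunk index -> hex SHA-256 stored by the server.
    No upload can start until this loop finishes, which is the blocking
    behavior described above.
    """
    mismatched = []
    for index, server_hash in received.items():
        if hash_chunk(path, index * chunk_size, chunk_size) != server_hash:
            mismatched.append(index)
    return mismatched
```

The loop visits chunks in whatever order the mapping provides and does one read-and-hash at a time, so its cost grows linearly with the number of chunks already uploaded.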

Where the 80 ms hid

The root cause was an architectural decision that made sense when we designed it but hadn’t been re-examined as our understanding of the use cases matured.

The original thinking was: verify everything before resuming, to prevent a corrupted or partially-modified file from completing an upload without the modification being caught. That’s correct thinking. The implementation mistake was making the verification synchronous and blocking on the critical path.

There’s a second issue we found while investigating the first. The resume response from the server returned the chunk list in database insertion order, which is not necessarily sequential order. Chunks uploaded in parallel don’t arrive in sequence, so the insertion order is effectively random. The client therefore received the list of chunks it needed to verify in an order that corresponded to neither sequential chunk order nor optimal I/O order. For files on spinning disk, this caused particularly bad seek patterns; even on NVMe, the scattered, non-sequential reads added latency.

A third finding: the server-side resume handler was fetching the full chunk records from the database — all metadata including content hash, ETag, storage location, completion timestamp — for every chunk in the upload, including chunks that were complete. For the resume workflow, what the client actually needs is just the list of incomplete chunk indices. Fetching full records for all chunks was unnecessary I/O on the database path.

Three separate issues, each contributing to the 80 ms, compounding in the common case.

The fix: chunk hash precomputation and parallel range stitching

We addressed all three issues.

Chunk hash precomputation. Rather than having the client re-hash chunk data at resume time, we now precompute chunk hashes on the client during the initial upload and store them in the upload session state, alongside the chunk data. When the client initializes a multipart upload, it computes the SHA-256 of each chunk and includes those hashes in the initial upload request. The server stores them in the session record.

At resume time, the client receives back both the list of received chunks and their pre-stored hashes. The client can verify a received chunk’s integrity by comparing the stored hash against the hash it computed during the original upload — no re-reading of chunk data required for chunks that were previously sent. Re-reading is only necessary if the client cannot produce a locally-stored hash for a chunk (upload session state was lost, or this is a different client device resuming an in-progress upload initiated on another device). In that case, we fall back to the original behavior.

This eliminates the sequential re-read bottleneck for the common case: the client that interrupted is the same client resuming, and it has its upload session state in memory or in local storage.
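The decision the client makes at resume time can be sketched as a pure comparison of two hash maps. This is an illustration under assumptions: `ResumePlan` and `plan_resume` are hypothetical names, both maps are assumed to be keyed by chunk index with hex SHA-256 values, and what the client does with a mismatched chunk is not covered in this post, so the sketch only reports it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ResumePlan:
    verified: list[int] = field(default_factory=list)      # hashes match; no disk read needed
    mismatched: list[int] = field(default_factory=list)    # hashes differ; handling is out of scope here
    needs_rehash: list[int] = field(default_factory=list)  # no local hash; fall back to re-reading the chunk


def plan_resume(server_hashes: dict[int, str], local_hashes: dict[int, str]) -> ResumePlan:
    """Classify each chunk the server reports as received.

    server_hashes: chunk index -> hash stored in the server's session record.
    local_hashes:  chunk index -> hash the client kept from the original upload
                   (empty if session state was lost or another device is resuming).
    """
    plan = ResumePlan()
    for index in sorted(server_hashes):
        local = local_hashes.get(index)
        if local is None:
            plan.needs_rehash.append(index)
        elif local == server_hashes[index]:
            plan.verified.append(index)
        else:
            plan.mismatched.append(index)
    return plan
```

In the common case every received chunk lands in `verified` and the client can start uploading immediately; only chunks in `needs_rehash` take the original read-and-hash path.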

Parallel range stitching. When the client does have to re-read chunks to verify them before resuming (the fallback case), it now issues those local hash reads in parallel rather than sequentially, with a concurrency limit that respects disk I/O patterns. For NVMe storage, we set the limit at 4 parallel reads; for configurations where we detect rotational disk, we use 2. This doesn’t eliminate the verification cost, but it reduces the wall-clock time by parallelizing it.
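A bounded-concurrency version of the fallback verification might look like the sketch below. The limits of 4 and 2 come from the text; everything else is an assumption for illustration: the names `read_concurrency` and `hash_chunks_bounded` are hypothetical, rotational-disk detection is reduced to a boolean the caller passes in, the per-chunk hashing is supplied as a callable, and sorting by chunk index is our guess at a sensible read order rather than a documented behavior.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

NVME_READ_CONCURRENCY = 4        # from the post
ROTATIONAL_READ_CONCURRENCY = 2  # from the post


def read_concurrency(rotational: bool) -> int:
    """Pick the parallel-read limit; how the disk type is detected is not shown here."""
    return ROTATIONAL_READ_CONCURRENCY if rotational else NVME_READ_CONCURRENCY


def hash_chunks_bounded(
    indices: Iterable[int],
    hash_chunk: Callable[[int], str],
    rotational: bool = False,
) -> dict[int, str]:
    """Hash the given chunks with at most `read_concurrency(rotational)` in flight.

    `hash_chunk(index)` is expected to read one chunk from disk and return its
    hex digest. Indices are sorted so reads are issued in ascending file order
    (an assumption, not something the post specifies).
    """
    ordered = sorted(indices)
    with ThreadPoolExecutor(max_workers=read_concurrency(rotational)) as pool:
        # pool.map preserves input order, so digests line up with `ordered`.
        return dict(zip(ordered, pool.map(hash_chunk, ordered)))
```

Threads are a reasonable fit for this in Python because file reads and `hashlib` updates on large buffers release the interpreter lock, so several chunks can make progress at once.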

Sparse chunk query. The server-side resume handler now fetches only incomplete chunk records rather than full records for all chunks. The query is: “which chunk indices for this upload session are not yet marked complete?” rather than “give me all chunk records for this upload session.” For an upload that is 90% complete at resume time, this reduces the database fetch from all chunks to the 10% of chunks that remain. The query is simpler, the result set is smaller, and the serialization overhead is reduced.
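The difference between the two query shapes can be shown against a toy table. The schema below is hypothetical: the post does not describe the real session store, so the table name `upload_chunks`, its columns, and the use of SQLite are all stand-ins chosen to keep the example runnable.

```python
import sqlite3

# Hypothetical schema, loosely based on the metadata fields named in the post.
SCHEMA = """
CREATE TABLE upload_chunks (
    session_id       TEXT    NOT NULL,
    chunk_index      INTEGER NOT NULL,
    complete         INTEGER NOT NULL DEFAULT 0,
    content_hash     TEXT,
    etag             TEXT,
    storage_location TEXT,
    completed_at     TEXT,
    PRIMARY KEY (session_id, chunk_index)
)
"""


def all_chunk_records(conn: sqlite3.Connection, session_id: str) -> list:
    """Old shape: every column of every chunk record in the session."""
    cur = conn.execute(
        "SELECT * FROM upload_chunks WHERE session_id = ?", (session_id,)
    )
    return cur.fetchall()


def incomplete_chunk_indices(conn: sqlite3.Connection, session_id: str) -> list:
    """Sparse shape: only the indices of chunks not yet marked complete."""
    cur = conn.execute(
        "SELECT chunk_index FROM upload_chunks "
        "WHERE session_id = ? AND complete = 0 "
        "ORDER BY chunk_index",
        (session_id,),
    )
    return [row[0] for row in cur]
```

For a session with 10 chunks of which 9 are complete, the first function returns 10 full rows and the second returns a single integer, which is the reduction in fetch and serialization work described above.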

Numbers

We benchmarked the change against a set of representative transfer scenarios using our internal transfer testing harness. Each measurement is the median of 50 runs on a connection with a 100 Mbps uplink, resuming at various completion percentages.

2 GB file, resumed at 60% complete (19 of 32 chunks received):

Before: 94 ms from resume response to first upload byte
After: 11 ms from resume response to first upload byte
Improvement: 83 ms (88% reduction)

500 MB file, resumed at 50% complete (4 of 8 chunks received):

Before: 31 ms from resume response to first upload byte
After: 6 ms from resume response to first upload byte
Improvement: 25 ms (81% reduction)

10 GB file, resumed at 40% complete (64 of 160 chunks received):

Before: 412 ms from resume response to first upload byte
After: 47 ms from resume response to first upload byte
Improvement: 365 ms (89% reduction)

The 80 ms figure in our headline is the median improvement at our most common real-world file size (the 2 GB case, where we measured 83 ms). The relative improvement is consistent across file sizes, at 81–89% in the scenarios above, while the absolute saving scales roughly with the number of previously-uploaded chunks. Large files with many completed chunks benefit most.
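For readers who want to check the arithmetic, the short script below recomputes the savings from the before and after figures listed above. The per-chunk column is a derived number we add here for illustration, not a separately measured quantity, and `summarize` is a hypothetical helper name.

```python
# (before_ms, after_ms, chunks already received), copied from the results above
SCENARIOS = {
    "2 GB @ 60%": (94, 11, 19),
    "500 MB @ 50%": (31, 6, 4),
    "10 GB @ 40%": (412, 47, 64),
}


def summarize(before_ms: int, after_ms: int, received_chunks: int) -> dict:
    """Absolute saving, percentage reduction, and saving per received chunk."""
    saved = before_ms - after_ms
    return {
        "saved_ms": saved,
        "reduction_pct": round(100 * saved / before_ms),
        "saved_ms_per_chunk": saved / received_chunks,
    }


if __name__ == "__main__":
    for name, row in SCENARIOS.items():
        print(name, summarize(*row))
```

The saving works out to roughly 4 to 6 ms per previously received chunk in all three scenarios, which is why the absolute improvement grows with the number of completed chunks.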

The change is live in production as of July 15th. If you’re running automated large-file transfers and have observability on your upload flow, you should see the improvement in your resume metrics. If you’re not — and most customers aren’t watching resume latency specifically — you may notice faster-feeling resumes without knowing why. Now you know why.

Takeaway

Sequential chunk re-hashing at resume time was delaying the start of resumed uploads by 80+ ms. Precomputing and caching chunk hashes during initial upload, combined with sparse chunk queries and parallel hash verification, reduced the resume-to-first-byte latency by 80–90% across tested file sizes. The change is in production.