
Generating Large ZIP Exports From Files in S3

28 May 2023 · ~1 minute read

If you've built a product big enough or for long enough, chances are you've had to offer export functionality that delivers a ZIP file to the user containing their data. In our case this was a media export containing videos, photos and audio clips that were stored in an S3 bucket.

Our first implementation used an asynchronous background job that downloaded each media file to disk and then used PHP's native ZIP API to build the final archive to send to the user. This solution wasn't a bad one; it lasted for years! But as we scaled, we started to see problems. As we onboarded customers with heavier media demands (most recently a customer who manages ~600 MB video files), exports took a long time to finish, and the largest exports would occasionally run out of memory even on large EC2 instances with 16 GB of RAM. We needed a more efficient solution.
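For contrast, here's roughly what that first approach looked like. This is a hypothetical reconstruction rather than our actual production code; the bucket, the `$files` shape, and the paths mirror the full script later in this post.

<?php

require './vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'region' => 'ca-central-1',
    'version' => '2006-03-01',
]);
$s3->registerStreamWrapper();

// Same shape as the $files list in the full script below.
$files = [/* ... objects with ->name and ->path properties ... */];

$zip = new ZipArchive();
$zip->open('/tmp/export.zip', ZipArchive::CREATE | ZipArchive::OVERWRITE);

foreach ($files as $file) {
    // Fully download each object to a local temporary file first.
    $tmpPath = tempnam(sys_get_temp_dir(), 'export_');
    file_put_contents($tmpPath, fopen($file->path, 'r'));

    $zip->addFile($tmpPath, $file->name);
}

// ZipArchive assembles the entire archive when close() is called; with
// multi-hundred-MB videos, this is where run time and resource usage balloon.
$zip->close();

Every export pays for a full download to disk before a single byte of the archive exists, which is exactly the cost the streaming approach below avoids.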

Through some research we stumbled on the ZipStream-PHP library. What immediately stood out was that it supports adding files from a PHP stream and writing the resulting ZIP to a stream as well. In theory, that would let us avoid saving the media files to disk and avoid keeping them in memory: we'd stream each file off S3 in turn and send it straight back to S3 inside the ZIP file, never holding an entire file in memory. To our delight, it worked perfectly. Exports are finishing faster than ever, and we're building these ZIP files without ever saving the input files or the resulting ZIP file to disk.

I wrote a small proof of concept script that takes a list of files from an S3 bucket and sends them back to the same bucket inside a ZIP file. It's all commented to explain each part.

<?php

require './vendor/autoload.php';

use Aws\S3\S3Client;
use ZipStream\ZipStream;
use ZipStream\CompressionMethod;

$s3 = new S3Client([
    'region' => 'ca-central-1',
    'version' => '2006-03-01',
]);

// We need to call this method to allow PHP streams to work with the `s3://` protocol.
$s3->registerStreamWrapper();

// We define our list of files that we want to export into a ZIP file. Real production
// code would define this dynamically (see the sketch at the end of this post); we're
// just hard-coding it for an example.
$files = [
    new class {
        public string $name = 'test_photo.jpeg';
        public string $path = 's3://zipstream-testing/test_photo.jpeg';
    },
    new class {
        public string $name = 'test_photo2.png';
        public string $path = 's3://zipstream-testing/test_photo2.png';
    },
    new class {
        public string $name = 'test_video.mov';
        public string $path = 's3://zipstream-testing/test_video.mov';
    },
    new class {
        public string $name = 'test_video2.mov';
        public string $path = 's3://zipstream-testing/test_video2.mov';
    },
];

// We open a stream to the destination location of the ZIP export. In this case we're
// just placing it into the same bucket alongside the files we're exporting, but this
// could go to another S3 bucket, or anywhere else that you can fopen().
$archiveDestination = fopen('s3://zipstream-testing/archive.zip', 'w');

$zip = new ZipStream(
    // Configures the library to stream its output to our destination archive.
    outputStream: $archiveDestination,

    // We turn off ZIP compression to make the export faster. Omit this parameter
    // entirely if you do want to compress the resulting ZIP file.
    defaultCompressionMethod: CompressionMethod::STORE,

    // This parameter is required to work with remote streams.
    defaultEnableZeroHeader: true,

    // We're not sending the ZIP output to the user's browser, but back to S3.
    sendHttpHeaders: false,
);

// Iterate over each of our files, fopen() them, add them to the ZIP, then close the open stream.
foreach ($files as $file) {
    $fileStream = fopen($file->path, 'r');
    $zip->addFileFromStream($file->name, $fileStream);
    fclose($fileStream);
}

// Finally, finish the ZIP file which will write the required footer data and close the stream
// to our destination archive.
$zip->finish();
fclose($archiveDestination);

echo sprintf("Peak memory usage: %d MiB", (memory_get_peak_usage(true) / 1024 / 1024));

I ran this script to export 4 media items from S3, two photos and two videos. The photos were around 10 MiB in size and the videos were about 800 MiB each. Running the script and examining the output, we can see it never used more than 34 MiB of memory, even when working with files that are almost a gigabyte in size:

AWS_ACCESS_KEY_ID=redacted AWS_SECRET_ACCESS_KEY=redacted php archive.php
Peak memory usage: 34 MiB

Additionally, the export runs fast because we're not wasting time fully downloading each media file to a local disk before we start to upload it to the ZIP archive.
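As noted in the script's comments, real production code would build the `$files` list dynamically. One way to sketch that is with the SDK's `ListObjectsV2` paginator; the prefix here is a placeholder, and in practice you'd more likely pull the list from your own database of media records.

<?php

require './vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'region' => 'ca-central-1',
    'version' => '2006-03-01',
]);

$files = [];

// Page through the bucket listing so we never hold more than one page of
// keys in memory at a time. The prefix is hypothetical.
foreach ($s3->getPaginator('ListObjectsV2', [
    'Bucket' => 'zipstream-testing',
    'Prefix' => 'media/',
]) as $page) {
    foreach ($page['Contents'] ?? [] as $object) {
        $files[] = (object) [
            'name' => basename($object['Key']),
            'path' => "s3://zipstream-testing/{$object['Key']}",
        ];
    }
}

The resulting objects expose the same `->name` and `->path` properties the export loop expects, so they can be dropped straight into the script above.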
