July 16, 2019

Building a Cloud Function that generates PDFs from a provided URL

At my current company, we generate PDF reports for our customers that contain detailed statistics for checks done on a car. We did so by generating an HTML page with Go’s templating engine, then using headless Chrome inside a Docker container to generate a PDF from the HTML. Since we’re running on App Engine, we had to use the Flex environment to support a custom Docker image, which resulted in higher costs and significantly slower deployment times. To move the service to the Standard environment, I rewrote the PDF generation as a Cloud Function.

The repository with a deployable Cloud Function is available on GitHub.

PDF usage is more widespread than you think. It’s one of the most common document formats, and any modern browser can display it.

Previously when generating PDFs in our Go web service, we were using Headless Chrome running inside a Docker container. The code looked something like:

cmd := exec.CommandContext(
    cmdc,
    "chromium-browser",
    "--no-sandbox",
    "--disable-gpu",
    "--virtual-time-budget=2000",
    "--timeout=6000",
    "--headless",
    fmt.Sprintf("--print-to-pdf=%s", fname),
    fmt.Sprintf(
        "http://localhost:8080/report_data/url?query1=%s&query2=%s", data.Query1, data.Query2,
    ),
)

if err = cmd.Run(); err != nil {
    log.Printf("error generating pdf: %v", err)
    // cmd.Run typically reports the killed process rather than the context
    // error, so check the context itself to detect a timeout.
    if cmdc.Err() == context.DeadlineExceeded {
        respond.WithJSON(w, r, "could not generate pdf, a timeout occurred. Please try again...")
        return
    }
    respond.WithJSON(w, r, "could not generate pdf, please try again...")
    return
}

file, err := ioutil.ReadFile(fname)
if err != nil {
    respond.WithJSON(w, r, "could not read check pdf, please try again...")
    return
}

w.Header().Add("Content-Type", "application/pdf")
_, err = w.Write(file)
if err != nil {
    respond.WithJSON(w, r, err)
}

While this worked exceptionally well for us, it had one major drawback. Since we’re using Google’s App Engine, we had to use the Flex environment instead of the Standard one, which meant higher costs and slower deployments. Our deployment times went from 14 minutes to 2 minutes after switching to the Standard environment.

A few alternatives came up, such as using an external service to handle this. Most of them were quite costly, and since we operate within the EU, the service would have to be GDPR compliant and we would need to inform our customers about a new service with which we share some of their data. Not good.

Thus we chose to use Google Cloud Functions (an AWS Lambda alternative) to generate PDFs for us. Unfortunately, the Go runtime doesn’t support headless Chrome, but luckily for us, Google updated the Node.js runtimes back in October last year, adding Node.js 8 and Node.js 10, which support headless Chrome via Puppeteer.

I found an example repo on GitHub containing something similar to what we needed. It had two problems: it ran slowly due to performance issues with Puppeteer on Cloud Functions, and the URL it rendered to PDF was wrong for our use case. That repo reads the target URL from the url query parameter, so any further query params in that URL are dropped, and we used those for authentication and for providing additional data.
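To make the query-parameter problem concrete, here is a small sketch using Node’s built-in URL parser. The request path and the function.invalid base host are made up for illustration, and the example repo may parse its query differently, but most query parsers split on every "&" in the same way.

```javascript
// Hypothetical incoming request path: the target URL carries its own query string.
const originalUrl = "/?url=http://localhost:8080/report?query1=a&query2=b";

// Standard query parsing splits on every "&", so query2 becomes a parameter
// of the function's own URL instead of staying part of the target URL.
const parsed = new URL(originalUrl, "http://function.invalid").searchParams;
const viaQueryParam = parsed.get("url");

// Taking everything after the "/?url=" prefix keeps the target URL intact.
const viaSubstring = originalUrl.substring("/?url=".length);

console.log(viaQueryParam); // target URL without query2
console.log(viaSubstring);  // target URL with both query params
```

The second approach is the one the function further down relies on.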

Thus I opted for chrome-aws-lambda, which runs in headless mode by default and utilizes the /tmp folder, which provides huge performance gains (still not as fast as on Lambda, but comparable). People on the GitHub issue claim performance gains of up to 500%.

The complete code is available on my GitHub. In short, this is what the function looks like:

const chromium = require("chrome-aws-lambda");
const puppeteer = require("puppeteer-core");
const functions = require("firebase-functions");

const options = {
  timeoutSeconds: 30
};

let page;

async function getBrowserPage() {
  const browser = await puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless
  });
  return browser.newPage();
}

exports.html2pdf = functions
  .runWith(options)
  .https.onRequest(async (req, res) => {
    if (!req.originalUrl.startsWith("/?url=")) {
      return res.status(400).send(`url query param must be provided`);
    }

    let url = req.originalUrl.substring(6);

    try {
      if (!page) {
        page = await getBrowserPage();
      }

      await page.goto(url);
      await page.emulateMedia("screen");

      const pdfBuffer = await page.pdf({ printBackground: true });
      res.set("Content-Type", "application/pdf");
      res.status(200).send(pdfBuffer);

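    } catch (error) {
      // Respond explicitly; rethrowing would leave the request without a response.
      console.error(error);
      res.status(500).send("could not generate pdf");
    }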
});

The function is quite simple. It exposes an HTTP endpoint and, on the first request, launches a headless Chrome instance whose page is reused by later requests. The endpoint expects the request URL to start with /?url= and returns an error otherwise.

It trims the /?url= prefix from the received URL and opens everything after it in the page. Once the page has loaded, it creates a PDF version of it using Puppeteer’s page.pdf functionality.
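If you want to be stricter about what the function is willing to open, the prefix check and the trimming can be pulled into a helper. This is only a sketch: extractTargetUrl is a hypothetical name, and restricting targets to http(s) is my assumption, not something the deployed function does.

```javascript
const PREFIX = "/?url=";

// Hypothetical helper: returns the target URL, or null if the request is unusable.
function extractTargetUrl(originalUrl) {
  if (typeof originalUrl !== "string" || !originalUrl.startsWith(PREFIX)) {
    return null;
  }
  // Keep everything after the prefix, including the target's own query params.
  const target = originalUrl.substring(PREFIX.length);
  try {
    // new URL() throws on malformed input; only http(s) targets are accepted.
    const { protocol } = new URL(target);
    return protocol === "http:" || protocol === "https:" ? target : null;
  } catch (err) {
    return null;
  }
}

console.log(extractTargetUrl("/?url=https://example.com/report?a=1&b=2"));
console.log(extractTargetUrl("/?url=file:///etc/passwd"));
```

In the handler, a null result would map to the same 400 response as the existing startsWith check.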

We’re quite satisfied with the response times. During the warmup period, a request takes a few seconds, but afterwards all requests take less than a second. I might update the post later on with failure rates, since failures are a common thing with serverless technologies.
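Only the first request pays the warmup cost because the page lives in module scope and survives between invocations on a warm instance. The sketch below shows a variation of that idea which caches the promise instead of the page, so two calls made during warmup share one launch. Here launchPage is a hypothetical stand-in for puppeteer.launch() plus browser.newPage(), and whether overlapping requests ever reach one instance depends on how the runtime is configured.

```javascript
// Hypothetical stand-in for puppeteer.launch() + browser.newPage(); it counts launches.
let launches = 0;
async function launchPage() {
  launches += 1;
  return { id: launches };
}

// Module scope: the value survives between invocations on a warm instance.
let pagePromise;

function getPage() {
  if (!pagePromise) {
    pagePromise = launchPage().catch((err) => {
      // Forget the failed attempt so the next request can retry the launch.
      pagePromise = undefined;
      throw err;
    });
  }
  return pagePromise;
}

// Both calls resolve to the same page object and trigger a single launch.
Promise.all([getPage(), getPage()]).then(([a, b]) => {
  console.log(a === b, launches);
});
```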

If you need only screenshots instead of PDFs, it’s quite trivial to change the function. Replace the last 3 lines before the catch block with:

const imageBuffer = await page.screenshot();
res.set('Content-Type', 'image/png');
res.status(200).send(imageBuffer);

Working with Cloud Functions turned out to be simpler than I expected, and in the future we might move more endpoints there from App Engine.

2024 © Emir Ribic - Some rights reserved; please attribute properly and link back. Code snippets are MIT Licensed
