Building an AWS Serverless ML Pipeline with Step Functions

Rafael Felix Correa
OLX Engineering
Apr 15, 2019


At OLX Group, we strive to serve customers in many markets (like India, Pakistan, Indonesia, Brazil, Poland, and Ukraine) with the best online classifieds experience possible. Naturally, this brings a series of challenges, one of them being: how do we consistently deliver meaningful recommendations (e.g. categories to visit) when markets differ so much from one another?

This is where Machine Learning comes to the rescue.

Warning: this article does not intend to compare AWS Step Functions with other workflow processing engines (e.g. Airflow). Please refer to this article for a detailed comparison among several of these tools.

Experimenting with Sagemaker

It’s no secret that OLX runs heavily on top of AWS, so Sagemaker is currently the first stop when an ML problem arises. The process often consists of spinning up a notebook instance on a development account and testing a combination of models and hyper-parameters against a subset of the data.

Illustration of the experimentation process in Sagemaker.

A typical starting point is the Sagemaker examples GitHub repository, which is pretty comprehensive and helps Data Scientists spin up an initial version quickly.
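To make that concrete, here is a minimal sketch of what such an experiment could look like with the Sagemaker Python SDK. The container image, bucket, role, and hyper-parameters below are placeholders, and parameter names differ slightly between SDK versions (this uses v2 naming):

    import sagemaker
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/sagemaker-execution-role"  # placeholder role

    # Train a candidate model on a subset of the data.
    estimator = Estimator(
        image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-model:latest",  # placeholder image
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-experiments-bucket/output/",  # placeholder bucket
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(epochs=10, learning_rate=0.1)  # example hyper-parameters
    estimator.fit({"train": "s3://my-experiments-bucket/input/sample/"})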

Things get more complicated once you have selected a model for a first implementation with real-world data, fetched continuously. Sagemaker doesn’t cover the ETL part of the process (it’s out of its scope), nor does it control which steps should execute and when.

For that, you’ll need to piece together other AWS services to get the job done in a scalable and maintainable way. And by scalable, I mean scaling not just with the amount of data but also with the number of ML models in play (after all, you often want to benchmark several and pick the best).

Proposed solution

After experimenting with several combinations of AWS services, we came up with the following solution proposal:

  • Storage for input, output, and temporary steps data: Amazon S3.
  • ETL (to fetch and prepare the input data as well as output data in the correct location and format): AWS Glue (Athena can’t export to Parquet natively as of the day this article was written).
  • ML Model training and Batch Transformation: Amazon Sagemaker.
  • Custom triggering logic with proper input parameters: a combination of Cloudwatch Events, Glue ETL triggers, and AWS Lambda:
Initial solution diagram

Although this setup was aligned with the current set of best practices for serverless applications on AWS, once we deployed the pipeline we quickly realised:

  • Tracking which job belonged to each execution was painful (tagging could help, but you’d have to implement it yourself).
  • It was hard to assess the current state (again, you’d need to implement something, perhaps in DynamoDB, to save each execution’s current state).
  • It needed extra tooling/scripts to assess the total execution time (as there are multiple sources to check, like Sagemaker and Glue) as well as to run backfill/reprocessing tasks for past dates.
  • Coordination between training and transforming was tricky, as the triggering logic was spread across Glue ETL triggers, Cloudwatch Events, and two lambdas.

These problems ended up harming the experience of anyone who needed to interact with the pipeline, as troubleshooting failed runs was painful compared to other platforms like Airflow.

To tackle the issues described above, we decided to try out Step Functions.

Intro to AWS Step Functions

Simply put, AWS Step Functions is a general-purpose workflow management tool. A more comprehensive definition can be found on its AWS service page.

Although it isn’t focused on data workloads, being able to coordinate long-running tasks, control and watch execution details and transitions from a single interface, and pay only per state transition was exactly what we needed.

State machine view
Failed execution view
Part of an execution log, with time spent in each step.

The final solution diagram looks much simpler, as AWS Step Functions coordinates the sequence of execution and the triggering logic now lives in a single place:

Final solution diagram

Lessons learned/Step Functions gotchas

1. Create a triggering lambda function.

Reason: you can programmatically ensure input validation and safe defaults before the state machine is called. Otherwise, you’ll potentially spend unnecessary dollars running jobs with wrong parameters.
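Example (a minimal sketch of such a triggering lambda; the state machine ARN, allowed markets, and parameter names are hypothetical):

    import json
    import os
    from datetime import date, timedelta

    import boto3

    sfn = boto3.client("stepfunctions")

    # Hypothetical allowed values: validate and fill in safe defaults before any paid job runs.
    ALLOWED_MARKETS = {"in", "pk", "id", "br", "pl", "ua"}


    def handler(event, context):
        event = event or {}
        market = event.get("market")
        # Default to yesterday's partition when no date is given.
        run_date = event.get("date") or (date.today() - timedelta(days=1)).isoformat()

        if market not in ALLOWED_MARKETS:
            raise ValueError(f"Unknown or missing market: {market!r}")

        response = sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            name=f"{market}-{run_date}",  # makes the execution easy to find in the console
            input=json.dumps({"market": market, "date": run_date}),
        )
        return {"executionArn": response["executionArn"]}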

2. Stick with AWS Step Functions native integrations.

Reason: Service integrations will save you from writing lambdas to fire up a Glue ETL Job or a Sagemaker Training Job, and provide you full control from the Step Functions console or API (e.g. StopExecution will stop any synchronous step like arn:aws:states:::glue.startJobRun.sync).
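Example (a sketch of the relevant ASL fragment, written here as a Python dict to be serialized into the state machine definition; the job name, arguments, and state names are placeholders):

    import json

    # A Task state using the native Glue integration: the ".sync" suffix makes
    # Step Functions wait until the job run finishes (or fails) before moving on.
    prepare_data_state = {
        "PrepareData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue.startJobRun.sync",
            "Parameters": {
                "JobName": "prepare-input-data",   # placeholder Glue job
                "Arguments": {
                    "--market.$": "$.market",      # taken from the execution input
                    "--date.$": "$.date",
                },
            },
            "Next": "TrainModel",
        }
    }

    print(json.dumps(prepare_data_state, indent=2))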

3. Use Lambda for performing string interpolation and formatting.

Reason: You might run into a scenario (as we did) in which you need to combine your input parameters into a single string (e.g. s3://my-bucket/model/market). Unfortunately, as of today, you cannot do this purely with the Amazon States Language (ASL), as it doesn’t support mixing JsonPath expressions with strings (this isn’t explicitly documented anywhere, but I got it confirmed after talking to AWS Support). To overcome this, you can write a very dumb lambda function that does the formatting for you and returns the result to the next steps.

Example:
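A sketch of a Task state that delegates the formatting to a helper lambda (the function name, bucket, and keys are illustrative); the formatted string ends up under $.s3_prefix for the following steps:

    # Illustrative Task state: pass the raw parameters to a small formatting lambda
    # and store its return value under a dedicated key.
    build_s3_prefix_state = {
        "BuildS3Prefix": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:build-s3-prefix",
            "Parameters": {
                "bucket": "my-bucket",
                "model.$": "$.model",
                "market.$": "$.market",
            },
            "ResultPath": "$.s3_prefix",
            "Next": "TrainModel",
        }
    }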

The lambda function code:
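A minimal sketch of the formatting lambda itself (the keys match the illustrative state above):

    def handler(event, context):
        # Combine the raw input parameters into the single string the next steps expect,
        # e.g. "s3://my-bucket/my-model/pl".
        return f"s3://{event['bucket']}/{event['model']}/{event['market']}"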

4. ResultPath and JsonPath are your best friends.

Reason: by default, Step Functions sends the output of a state as the input of the following state. Although that’s reasonable behaviour most of the time, you often want to access the original input arguments from a middle-stage step, which won’t be possible once they’ve been overwritten. To overcome this, you can nest the results produced by one step under a dedicated key using the ResultPath option, keeping the rest of the input intact.

Example:
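A sketch of two states where ResultPath keeps the original input around (state names, resource parameters, and keys are illustrative): the training result is nested under $.training, so the batch-transform step still sees $.market, $.date, and friends.

    # Illustrative states: ResultPath nests each step's result under its own key
    # instead of overwriting the execution input.
    training_and_transform_states = {
        "TrainModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                "TrainingJobName.$": "$.training_job_name",
                # ... remaining required training job parameters ...
            },
            "ResultPath": "$.training",  # original input stays intact alongside the result
            "Next": "BatchTransform",
        },
        "BatchTransform": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
            "Parameters": {
                "TransformJobName.$": "$.transform_job_name",
                # ... remaining required transform job parameters ...
            },
            "ResultPath": "$.transform",
            "End": True,
        },
    }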

For more information: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-input-output-filtering.html

Conclusion

AWS Step Functions turned out to be a good fit for the data pipeline use case, as the pipeline comprises long-running steps. It’s a very handy way of overcoming the Lambda execution time limit (and it’s also cheaper to pay per state transition than to keep a “long-running” Lambda function alive). Although not a requirement, it helped us overcome a reasonable number of shortcomings of a serverless data pipeline on AWS. There’s no magic though: you still have to implement any flexibility you require (like reprocessing from a past date or skipping part of the flow), as it’s meant to cover general workflow use cases.

Further reading

Amazon States Language spec: https://states-language.net/spec.html

Inspiration: https://github.com/aws-samples/aws-etl-orchestrator and https://epsagon.com/blog/hitchhikers-guide-to-aws-step-functions/
