Posted On: May 1, 2024

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it simple for data engineers and data scientists to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers. An EMR Serverless application uses workers to execute workloads, and you can configure ephemeral storage per worker based on the workload's needs. Today, we are excited to introduce Shuffle-optimized disks on Amazon EMR Serverless, offering increased storage capacity (up to 2 TB) and higher IOPS, which deliver better performance for I/O-intensive Spark and Hive workloads.

Shuffle is a fundamental step in an Apache Spark or Apache Hive job. It involves I/O-intensive operations that redistribute or reorganize data for parallel computation during operations like joins, aggregations, or transformations. Complex workloads that shuffle large datasets require sufficient disk capacity and I/O performance for the shuffle to run efficiently. Shuffle-optimized disks offer up to 2 TB of storage capacity and higher baseline IOPS, enabling you to efficiently run shuffle-heavy and I/O-intensive Spark and Hive workloads.
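To make the shuffle step concrete, here is a toy, single-process sketch in Python of what happens during an aggregation. It is only an illustration of the idea and not how Spark or Hive implement shuffle: the function names, the sample data, and the byte-sum partitioner are all made up for this example. In a real engine, the partitioned records are written to each worker's local disk and read over the network, which is why disk capacity and IOPS matter.

```python
from collections import defaultdict


def shuffle_by_key(map_outputs, num_partitions):
    """Redistribute (key, value) records so equal keys land in the same partition.

    map_outputs is a list of record lists, one per upstream task.
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for records in map_outputs:
        for key, value in records:
            # Hypothetical deterministic partitioner chosen for this sketch;
            # real engines use their own hash partitioners.
            idx = sum(key.encode("utf-8")) % num_partitions
            partitions[idx][key].append(value)
    return partitions


def sum_by_key(map_outputs, num_partitions=2):
    """Aggregate per key after the shuffle; each partition is independent."""
    totals = {}
    for partition in shuffle_by_key(map_outputs, num_partitions):
        for key, values in partition.items():
            totals[key] = sum(values)
    return totals


# Two upstream tasks each hold a slice of the data (sample values only).
map_outputs = [
    [("apples", 3), ("pears", 1)],
    [("apples", 2), ("plums", 5)],
]
totals = sum_by_key(map_outputs)
print(totals)
```

Because every record may move to a different partition, the volume of intermediate data written and read during shuffle can approach the size of the input itself.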

Shuffle-optimized disks are generally available on EMR release versions 7.1.0 and later in all AWS Regions where EMR Serverless is available, excluding the AWS GovCloud (US) and China Regions. For more information on Shuffle-optimized disks, visit the EMR Serverless User Guide. For pricing information on Shuffle-optimized disks, visit the EMR Serverless pricing page.
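As a rough sketch of how a Spark job might opt in, the Python helper below builds a Spark submit parameter string that requests Shuffle-optimized disks. This announcement does not name any configuration keys, so the property names (`spark.emr-serverless.driver.disk.type`, `spark.emr-serverless.executor.disk.type`, `spark.emr-serverless.executor.disk`), the `shuffle_optimized` value, and the `500G` size are assumptions here; confirm them against the EMR Serverless User Guide before use. The helper name `disk_submit_parameters` is hypothetical.

```python
def disk_submit_parameters(disk_size="500G", disk_type="shuffle_optimized"):
    """Build a --conf string requesting a disk type and size for Spark workers.

    All property names and values below are assumptions to verify against
    the EMR Serverless User Guide; they do not come from this announcement.
    """
    properties = {
        "spark.emr-serverless.driver.disk.type": disk_type,
        "spark.emr-serverless.executor.disk.type": disk_type,
        "spark.emr-serverless.executor.disk": disk_size,
    }
    return " ".join(f"--conf {name}={value}" for name, value in properties.items())


submit_parameters = disk_submit_parameters()
print(submit_parameters)
```

The resulting string would typically be passed as the Spark submit parameters of a job run on an application that uses release 7.1.0.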