Posted On: Jan 21, 2022

We are excited to announce that Amazon EMR 6.5.0 now includes Apache Iceberg version 0.12. Apache Iceberg is an open table format for large data sets in Amazon S3 and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. With the current release, you can use Apache Spark 3.1.2 on EMR clusters with the Iceberg table format.

Apache Iceberg offers an open source table format for data stored in data lakes. It helps data engineers handle complex challenges such as managing continuously evolving data sets while maintaining query performance. Iceberg allows you to:

  • Maintain transactional consistency on tables shared between multiple applications, where files can be added, removed, or modified atomically with full read isolation and multiple concurrent writes
  • Implement full schema evolution to track changes to a table over time
  • Issue time travel queries against historical data and verify changes between updates
  • Organize tables into flexible partition layouts, with partition evolution that lets you update partition schemes as queries and data volumes change, without relying on physical directories
  • Roll back tables to prior versions to quickly correct issues and return tables to a known good state
  • Perform advanced planning and filtering in high performance queries on large data sets
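
As a rough illustration of how several of these capabilities are typically expressed from Spark, the following Python sketch assembles the Spark settings and SQL statements involved. It is a minimal sketch under stated assumptions, not EMR's documented setup: the catalog name `demo`, the warehouse path `s3://example-bucket/warehouse`, the table `db.events`, the columns `note` and `event_ts`, and the snapshot ID are all hypothetical placeholders, and a Hadoop-type catalog is assumed for simplicity. The configuration keys, the `rollback_to_snapshot` procedure, and the `snapshot-id` read option come from the Apache Iceberg Spark integration.

```python
from typing import Dict, List

# Iceberg's Spark SQL extensions class (enables ALTER TABLE ... ADD PARTITION FIELD and CALL).
ICEBERG_EXTENSIONS = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"


def iceberg_spark_conf(catalog: str, warehouse: str) -> Dict[str, str]:
    """Build Spark settings that register an Iceberg catalog.

    Assumption: a Hadoop-type catalog rooted at `warehouse`; other catalog
    types need different properties.
    """
    return {
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "hadoop",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }


def evolution_statements(catalog: str, table: str, snapshot_id: int) -> List[str]:
    """Return SQL for schema evolution, partition evolution, and rollback.

    Column names (`note`, `event_ts`) are hypothetical.
    """
    full_name = f"{catalog}.{table}"
    return [
        # Schema evolution: add a column without rewriting data files.
        f"ALTER TABLE {full_name} ADD COLUMNS (note string)",
        # Partition evolution: new data is partitioned by day of event_ts.
        f"ALTER TABLE {full_name} ADD PARTITION FIELD days(event_ts)",
        # Rollback: return the table to an earlier snapshot.
        f"CALL {catalog}.system.rollback_to_snapshot('{table}', {snapshot_id})",
    ]


def time_travel_options(snapshot_id: int) -> Dict[str, str]:
    """DataFrame reader options for reading the table as of one snapshot."""
    return {"snapshot-id": str(snapshot_id)}


if __name__ == "__main__":
    conf = iceberg_spark_conf("demo", "s3://example-bucket/warehouse")
    # Print the settings in the form accepted by spark-submit.
    for key, value in conf.items():
        print(f"--conf {key}={value}")
    for statement in evolution_statements("demo", "db.events", 1234567890):
        print(statement)
    print(time_travel_options(1234567890))
```

In a Spark session created with these settings, each statement could be passed to `spark.sql(...)`, and the time travel options could be applied with `spark.read.format("iceberg").options(**opts).load("demo.db.events")`.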

Amazon EMR release 6.5.0 with Apache Iceberg is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), South America (São Paulo), Europe (Ireland), Europe (Stockholm), AWS GovCloud (US), the Amazon Web Services China (Beijing) Region operated by Sinnet, and the Amazon Web Services China (Ningxia) Region operated by NWCD. More Regions will be added in the coming weeks.

To learn more about using Apache Iceberg on Amazon EMR, see the Amazon EMR documentation.