Skip to main content
Engineering blog

Apache Spark ❤️ Apache DataSketches: New Sketch-Based Approximate Distinct Counting

Daniel Tenedorio
Menelaos Karavelas
Ryan Berti
Share this post

Introduction

In this blog post, we'll explore a set of advanced SQL functions available within Apache Spark that leverage the HyperLogLog algorithm, enabling you to count unique values, merge sketches, and estimate distinct counts with precision and efficiency. These implementations use the Apache Datasketches library for consistency with the open source community and easy integration with other tools. Say goodbye to traditional counting methods and embrace these cutting-edge functions to revolutionize your data analysis workflows! This functionality is available starting in Apache Spark 3.5 and Databricks Runtime 13.0.

Why Sketches?

Using a sketch-based library for computing approximate distinct counts offers several benefits over the direct result integer counts returned from the approx_count_distinct function previously available in Apache Spark and Databricks Runtime. One significant advantage is the ability to persist the sketches into storage and scan them back later as needed. In this way, users can retrieve and perform further analysis or computations without needing to recalculate distinct counts from scratch. This saves both time and computational resources, as the approximate distinct count can be readily available for subsequent queries or analyses.

Another benefit of using sketch buffers is their flexibility in handling different scenarios. The sketches can be easily combined or merged using union operations, allowing users to aggregate multiple sketch buffers into a single sketch. This flexibility enables scalable processing of large datasets and distributed systems, as the sketches can be generated independently and then efficiently merged together. These sketch buffers empower users with the ability to perform advanced set operations on the sketches, such as unions, intersections, and differences, opening up new possibilities for complex data analysis tasks.

To show why this functionality is useful, let's consider an example where you have a medical dataset and want to incrementally update a dashboard that contains all the patients with various conditions on a daily basis. In general, approximate counting is particularly useful in incremental update use cases like this, where for low latency applications like dashboards, a small margin of error is acceptable for much faster query times.

-- Create a table to store a medical dataset with patient IDs and conditions.
CREATE TABLE medical_dataset (date DATE, patient_id INT, condition STRING)
USING DELTA;

-- Insert several rows into the table.
INSERT INTO medical_dataset VALUES ...

-- Create a table to store approximate count sketches for each day.
CREATE TABLE patient_condition_sketches_daily (sketch BINARY) USING DELTA;

-- Periodically insert sketches into the table.
INSERT INTO patient_condition_sketches_daily
SELECT hll_sketch_agg(condition, 12) AS sketch
FROM medical_dataset
WHERE `date` = CURRENT_DATE();

-- When desired, merge the sketches together for an approximate count.
-- This operation is fast!
SELECT hll_sketch_estimate(hll_union(sketch)) AS num_distinct_conditions
FROM patient_condition_sketches_daily

Make Probabilistic Counting Easy with hll_sketch_agg and hll_sketch_estimate

The hll_sketch_agg function is a game-changer when it comes to counting the number of unique values in a column. By utilizing the HyperLogLog algorithm, this function provides a probabilistic approximation of uniqueness, outputting a binary representation known as a sketch buffer. This sketch buffer is highly efficient for long-term storage and persistence. You can easily integrate hll_sketch_agg into your queries, and with the resulting buffers, compute approximate unique counts.

The hll_sketch_estimate function is a powerful companion to hll_sketch_agg. With the input of a sketch buffer generated by hll_sketch_agg, hll_sketch_estimate provides an estimation of the distinct count. By leveraging the HyperLogLog algorithm, this function delivers fast and accurate results, enabling you to gain valuable insights into the uniqueness of your data. With hll_sketch_estimate, you can confidently make informed decisions based on reliable approximations of distinct counts.

For example:

-- In the following list of six integers, there are four unique values.
-- The 'hll_sketch_agg' aggregate function consumes all six integers
-- and produces a sketch, then the enclosing 'hll_sketch_estimate'
-- scalar function consumes that buffer and returns the resulting 
-- approximate count.
SELECT hll_sketch_estimate(
  hll_sketch_agg(col, 12))
FROM VALUES (50), (60), (60), (60), (75), (100) AS tab(col);

4

-- In the following list of five strings, there are three unique values.
-- Like above, the 'hll_sketch_agg' aggregate function consumes the values
-- and produces a sketch, then the enclosing 'hll_sketch_estimate'
-- returns the approximate count.
SELECT hll_sketch_estimate(
  hll_sketch_agg(col))
FROM VALUES ('abc'), ('def'), ('abc'), ('ghi'), ('abc') AS tab(col);

3

Merge Sketches for Comprehensive Analysis with hll_union

When you need to combine two sketches into a single sketch, the hll_union function comes to the rescue. By leveraging the power of the HyperLogLog algorithm, hll_union enables you to merge sketch buffers efficiently. This functionality is especially useful when you want to aggregate data across different columns or datasets. By incorporating hll_union into your queries, you can obtain comprehensive insights and compute approximate unique counts using hll_sketch_estimate. For example:

SELECT hll_sketch_estimate(
  hll_union(
    hll_sketch_agg(col1),
    hll_sketch_agg(col2)))
  FROM VALUES
    (1, 4),
    (1, 4),
    (2, 5),
    (2, 5),
    (3, 6) AS tab(col1, col2);

6

Streamline Sketch Aggregation with hll_union_agg

For scenarios where you need to combine multiple sketches within a group, the hll_union_agg function is your go-to tool. With hll_union_agg, you can aggregate multiple sketch buffers into a single buffer, simplifying the process of analyzing large datasets. This function allows you to efficiently compute approximate unique counts by incorporating hll_sketch_estimate. By utilizing the power of hll_union_agg, you can streamline sketch aggregation and achieve accurate insights into the distinct counts within your data. For example:

SELECT hll_sketch_estimate(hll_union_agg(sketch, true))
    FROM (SELECT hll_sketch_agg(col) as sketch
            FROM VALUES (1) AS tab(col)
          UNION ALL
          SELECT hll_sketch_agg(col, 20) as sketch
            FROM VALUES (1) AS tab(col));

1

Export Sketches to Storage and Load them Back Later

You can generate sketch buffers and export them into managed tables to avoid recomputing intermediate work later. Using the new hll_sketch_agg function, you can follow these steps:

  1. Create a managed table: Begin by creating a managed table using the CREATE TABLE statement. Define the schema of the table to include a column to store the sketch buffers. For example:
CREATE TABLE sketch_buffers (buffer BINARY) USING DELTA;
  1. Generate and insert sketch buffers: Use the INSERT INTO statement along with the hll_sketch_agg function to generate the sketch buffers and insert them into the managed table. Provide the column or expression against which you want to count unique values as an argument to the function. For instance:
INSERT INTO sketch_buffers
SELECT hll_sketch_agg(col, 12)
FROM your_table;
  1. After repeating the previous step several times, the sketch_buffers table will contain many rows. You can periodically combine them by creating a new table to store the merged sketch buffers:
CREATE OR REPLACE TABLE sketch_buffers USING DELTA
AS SELECT hll_union_agg(buffer) AS buffer
FROM sketch_buffers;
  1. Finally, when you're ready to compute the final result, you can call hll_estimate over the merged buffer:
SELECT hll_estimate(buffer) AS result
FROM sketch_buffers;

42

Make Different Tools Work Together with the Apache Datasketches Library

These new SQL functions in Apache Spark and Databricks Runtime are powered by the Apache Datasketches library. This library offers a useful solution to the challenges of analyzing big data quickly, introducing a class of specialized algorithms which provide approximate results with proven error bounds, significantly speeding up analysis. The functions in this library generate buffers known as sketches which are suitable for saving to storage and then consuming later as needed.

The community chose the Dataksetches implementation because of the availability of libraries in various programming languages. These sketch buffers provide a consistent binary representation that users can seamlessly utilize across different languages and platforms, enabling effortless interoperability. This feature, along with the inherent accuracy and reliable outcomes of sketches, unlocks a multitude of opportunities for swift queries and groundbreaking analysis capabilities. With this powerful toolkit at their disposal, users can extract valuable insights from large-scale data. By harnessing the power of sketches, organizations can expedite their data analysis processes, minimize processing times, and make well-informed decisions with utmost confidence.

Unleash the Power of Sketch Based Approximate Distinct Counting for Effective Data Analysis

Embracing advanced SQL functions like hll_sketch_agg, hll_sketch_estimate, hll_union, and hll_union_agg can revolutionize your data analysis capabilities. By leveraging the HyperLogLog algorithm and the efficiency of sketch buffers, you can count unique values, estimate distinct counts, and merge sketches with ease. Say goodbye to traditional counting methods and welcome these powerful SQL functions into your toolkit! Unlock the full potential of your data analysis workflows and make informed decisions based on accurate approximations of uniqueness.

Try Databricks for free

Related posts

Engineering blog

Advanced Analytics with HyperLogLog Functions in Apache Spark

May 8, 2019 by Sim Simeonov in Engineering Blog
Read Rise of the Data Lakehouse to explore why lakehouses are the data architecture of the future with the father of the data...
Engineering blog

Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles

Introduction Apache Spark is fast, but applications such as preliminary data exploration need to be even faster and are willing to sacrifice some...
See all Engineering Blog posts