Extending Delta Sharing for Azure

Delta Sharing 0.3.0 includes Azure support, token expiration time, query limit parameters, and improved APIs
Will Girten
Shixiong Zhu
Denny Lee

We are excited to announce the release of Delta Sharing 0.3.0, which introduces several key improvements and bug fixes, including the following features:

  • Support for Azure Blob Storage and Azure Data Lake Storage Gen2: You can now share Delta tables stored on Azure Blob Storage and Azure Data Lake Storage Gen2 (#56, #59).
  • Token expiration time: An optional expirationTime field has been added to the Delta Sharing profile to specify a token expiration time (#77).
  • Query limit parameters: The Python Connector now accepts an optional limit parameter to allow fetching a subset of rows when using the load_as_pandas function (#76). Similarly, users can also send a limitHint parameter when submitting a sharing query using the Apache Spark™ Connector (#55).
  • Improved API to list all tables in a share: A new API has been added for listing all tables in a share that supports pagination (#63, #66, #67, #88).
  • Automatic refresh of pre-signed URLs: A new cache has been added to the Apache Spark driver that automatically refreshes pre-signed file URLs for long-running queries (#69).
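
To make the optional expirationTime field concrete, here is a minimal Python sketch of a client-side check. It assumes the profile is a JSON document with endpoint and bearerToken fields and that expirationTime is a UTC timestamp in the format shown; the endpoint, token, timestamp, and the token_expired helper are all made up for illustration and are not part of the connector.

```python
import json
from datetime import datetime, timezone

# Hypothetical profile contents: endpoint, token, and timestamp are placeholders.
PROFILE_JSON = """
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>",
  "expirationTime": "2031-01-01T00:00:00.0Z"
}
"""


def token_expired(profile: dict, now: datetime) -> bool:
    """Return True if the profile has an expirationTime that has passed."""
    raw = profile.get("expirationTime")
    if raw is None:
        # The field is optional, so a missing value means "no known expiry".
        return False
    # Assumption: the timestamp is UTC and looks like 2031-01-01T00:00:00.0Z.
    expires = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    return now >= expires.replace(tzinfo=timezone.utc)


profile = json.loads(PROFILE_JSON)
print(token_expired(profile, datetime(2030, 6, 1, tzinfo=timezone.utc)))  # False
print(token_expired(profile, datetime(2031, 6, 1, tzinfo=timezone.utc)))  # True
```

A check like this lets a recipient ask for a fresh profile before a long job starts rather than failing midway with an authorization error.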

In this blog post, we will go through some of the great improvements in this release.

Delta Sharing on Azure Blob Storage and Azure Data Lake Gen2

Azure Blob Storage has proven to be a cost-effective solution for storing Delta tables in the Azure cloud. New in this release, the reference implementation of the Delta Sharing server can share Delta tables stored on Azure Blob Storage and Azure Data Lake Storage Gen2.


Delta Sharing on Azure Blob Storage example

Sharing Delta tables on Azure Blob Storage is straightforward. For example, to share a Delta table called classics in an Azure Blob container called movie_recommendations, add the table's location on Azure Blob Storage to the share definitions in the YAML file below:

delta-sharing-profile.yaml

# Config shares/schemas/tables to share
shares:
- name: "my_share"
  schemas:
  - name: "movies"
    tables:
    - name: "classics"
      location: "wasbs://movie_recommendations@delta_sharing.blob.core.windows.net/delta/classics"

Delta Sharing on Azure Data Lake Storage Gen2 example

If you prefer the built-in hierarchical directory structure and fine-grained access controls of Azure Data Lake Storage Gen2, you can share Delta tables stored there as well. Add the location of your Delta table on Azure Data Lake Storage Gen2 to the share definitions, and the Delta Sharing server will serve the table's data in response to Delta Sharing queries:

delta-sharing-profile.yaml

# Config shares/schemas/tables to share
shares:
- name: "my_share"
  schemas:
  - name: "movies"
    tables:
    - name: "comedy_heaven"
      location: "abfss://movie_recommendations@delta_sharing.dfs.core.windows.net/delta/comedy_heaven"
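
The two location values above follow the same pattern: scheme://container@account.endpoint/path, with wasbs for Azure Blob Storage and abfss for Azure Data Lake Storage Gen2. The following Python sketch pulls those parts out of a location string. The parse_azure_location helper is hypothetical, written only to illustrate the URI layout; it is not part of Delta Sharing.

```python
from urllib.parse import urlparse

SERVICES = {
    "wasbs": "Azure Blob Storage",
    "abfss": "Azure Data Lake Storage Gen2",
}


def parse_azure_location(location: str) -> dict:
    """Split a wasbs:// or abfss:// table location into its parts."""
    parsed = urlparse(location)
    if parsed.scheme not in SERVICES:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.username or not parsed.hostname:
        raise ValueError("expected <container>@<account>.<endpoint>")
    return {
        "service": SERVICES[parsed.scheme],
        "container": parsed.username,                # the part before "@"
        "account": parsed.hostname.split(".")[0],    # first label of the host
        "path": parsed.path.lstrip("/"),             # path inside the container
    }


blob = parse_azure_location(
    "wasbs://movie_recommendations@delta_sharing.blob.core.windows.net/delta/classics"
)
print(blob["container"], blob["account"], blob["path"])
# movie_recommendations delta_sharing delta/classics
```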

Query limit parameters

Sometimes it might be helpful to explore just a few records in a shared dataset. Rather than loading the entire dataset into memory from blob storage, you can now add a limit hint in your Delta Sharing queries. The query limit will be pushed down and sent to the Delta Sharing server as a limit hint.

For example, to load a shared Delta Table as a Pandas DataFrame and limit the number of rows to 100, you can now add the limit as a parameter to the load_as_pandas() function call:

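import delta_sharing
from IPython.display import display

# Path to the Delta Sharing profile file
profile_file = "~/wgirten/delta-sharing-profile.yaml"

# The client can be used to browse what is available through this profile
client = delta_sharing.SharingClient(profile_file)

# A table URL has the form <profile-file>#<share>.<schema>.<table>
table_url = profile_file + "#my_share.movies.comedy_heaven"

# Ask for at most 100 rows instead of loading the whole table
sample_pdf = delta_sharing.load_as_pandas(table_url, limit=100)

display(sample_pdf)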

Similarly, if the Apache Spark Connector finds a LIMIT clause in your Spark SQL query, it will try to push down the limit to the server to request less data:

-- Create a new table, specifying the location to the share as a table path
CREATE TABLE my_comedy_movies 
USING deltaSharing 
LOCATION '~/wgirten/delta-sharing-profile.yaml#my_share.movies.comedy_heaven';

-- Display the first 100 rows by passing a limit hint in the query
SELECT * FROM my_comedy_movies LIMIT 100;
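
Why does a limit hint reduce the amount of data transferred? One plausible way for a server to use it, sketched below in Python, is to return only as many data files as are needed to cover the requested row count, based on per-file record counts. This is an illustration under that assumption, not the reference server's code; select_files_for_limit and the file entries are hypothetical.

```python
from typing import Dict, List


def select_files_for_limit(files: List[Dict], limit_hint: int) -> List[Dict]:
    """Return the shortest prefix of `files` whose row counts cover the hint."""
    selected: List[Dict] = []
    rows = 0
    for f in files:
        if rows >= limit_hint:
            break  # enough rows already covered, skip the remaining files
        selected.append(f)
        rows += f["num_records"]
    return selected


# Hypothetical table made of three data files with 60 rows each.
files = [
    {"url": "part-0000.parquet", "num_records": 60},
    {"url": "part-0001.parquet", "num_records": 60},
    {"url": "part-0002.parquet", "num_records": 60},
]

picked = select_files_for_limit(files, limit_hint=100)
print([f["url"] for f in picked])
# ['part-0000.parquet', 'part-0001.parquet']
```

In this sketch the two selected files hold 120 rows, more than the 100 requested. Because the value is only a hint, the client should still apply the limit itself after reading the files.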

Improved API for listing all tables

This release includes a new API for listing all the tables under all schemas in a share. Like the other listing APIs, it supports pagination.

For example, to list all the tables in the Delta share my_share, you can simply send a GET request to the /shares/{share_name}/all-tables endpoint on the sharing server.

curl http://localhost/shares/my_share/all-tables -H "Authorization: Bearer <token>"
{"items":[
   {
      "share":"my_share",
      "schema":"movies",
      "name":"classics"
   },
   {
      "share":"my_share",
      "schema":"movies",
      "name":"comedy_heaven"
   }
],
"nextPageToken": "..."
}
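
When a response carries a nextPageToken, the client is expected to request the next page until no token is returned. The Python sketch below shows that loop in isolation. The fetch_page callable is a stand-in for the HTTP request (here backed by an in-memory dictionary so the example runs without a server), and the token value "page-2" is made up. In a real client, fetch_page would issue the GET request shown above and pass the token back to the server, which, as far as we understand the protocol, is done with a pageToken query parameter.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_all_tables(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every table, following nextPageToken until it is absent or empty."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        yield from page.get("items", [])
        token = page.get("nextPageToken")
        if not token:
            break


# In-memory stand-in for two pages of server responses.
PAGES = {
    None: {
        "items": [{"share": "my_share", "schema": "movies", "name": "classics"}],
        "nextPageToken": "page-2",
    },
    "page-2": {
        "items": [{"share": "my_share", "schema": "movies", "name": "comedy_heaven"}],
    },
}


def fake_fetch(token: Optional[str]) -> Dict:
    return PAGES[token]


print([t["name"] for t in iter_all_tables(fake_fetch)])
# ['classics', 'comedy_heaven']
```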

Automatic refresh of pre-signed URLs

When a client reads a Delta Sharing table, the Delta Sharing server generates pre-signed URLs for the table's files. For long-running queries, however, these URLs may expire before the sharing client has had a chance to read the files. This release adds a pre-signed URL cache to the Spark driver, which automatically refreshes the pre-signed file URLs in a background thread. Tasks running in Spark executors communicate with the Spark driver to fetch the latest pre-signed file URLs.
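
The idea of a cache that is refreshed by a background thread can be sketched in a few lines of Python. This is a conceptual illustration only, not the Spark connector's implementation: PresignedUrlCache, fake_fetch, the refresh interval, and the example URL are all hypothetical, and the fetch function simply returns a new signature on each call in place of a real request to the sharing server.

```python
import itertools
import threading
import time
from typing import Callable, Dict


class PresignedUrlCache:
    """Keeps a file-id -> URL map fresh by re-fetching it on a background thread."""

    def __init__(self, fetch_urls: Callable[[], Dict[str, str]], refresh_seconds: float):
        self._fetch_urls = fetch_urls
        self._refresh_seconds = refresh_seconds
        self._lock = threading.Lock()
        self._urls: Dict[str, str] = {}
        self.refresh()  # initial load
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def refresh(self) -> None:
        fresh = self._fetch_urls()  # fetch outside the lock, swap inside it
        with self._lock:
            self._urls = fresh

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once close() is called.
        while not self._stop.wait(self._refresh_seconds):
            self.refresh()

    def get(self, file_id: str) -> str:
        with self._lock:
            return self._urls[file_id]

    def close(self) -> None:
        self._stop.set()
        self._thread.join()


_version = itertools.count(1)


def fake_fetch() -> Dict[str, str]:
    """Stand-in for asking the sharing server for newly signed URLs."""
    v = next(_version)
    return {"part-0000.parquet": f"https://storage.example.com/part-0000.parquet?sig=v{v}"}


cache = PresignedUrlCache(fake_fetch, refresh_seconds=0.05)
print(cache.get("part-0000.parquet"))  # URL signed at start-up
time.sleep(0.3)                        # stands in for a long-running query
print(cache.get("part-0000.parquet"))  # URL as of the most recent refresh
cache.close()
```

Readers always go through get(), so they pick up whichever URL was fetched most recently, which mirrors executors asking the driver for the latest pre-signed URLs.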

What’s next

We are already gearing up for our next release of Delta Sharing. One of the major features we are currently working on is Google Cloud Storage support. You can track all the upcoming releases and planned features in GitHub milestones.



Credits
We’d like to extend special thanks to Denny Lee, Felix Cheung, Lin Zhou, Matei Zaharia, Shixiong Zhu, Will Girten, Xiaotong Sun, Yuhong Chen, kohei-tosshy, and William Chau for their contributions to this release.