Apache Spark’s Developer-Friendly Structured APIs: DataFrames and Datasets


This is the second part of the blog series on Spark’s structured APIs: DataFrames & Datasets. In the first part we covered DataFrames, and I recommend you read that blog first if you are new to Spark. In this blog we’ll cover the Spark Dataset API, so let’s get started.

The Dataset API

Like DataFrames, Datasets combine two characteristics: a typed API and an untyped API.

Let’s get a better understanding of Datasets by comparing them with DataFrames. A DataFrame in Scala is an alias for a collection of generic objects, Dataset[Row], where Row is a generic type that can hold fields of different types. The Dataset documentation, on the other hand, puts it like this:

A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset [in Scala] also has an untyped view called a DataFrame, which is a Dataset of Row.
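To make this typed/untyped distinction concrete, here is a minimal sketch (assuming an existing SparkSession named spark; the Person case class and the sample data are made up for illustration): a DataFrame row is looked up by field name at runtime, while a Dataset exposes the fields of your case class directly, checked by the compiler.

case class Person(name: String, age: Int)

import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 41)).toDF("name", "age")  // DataFrame, i.e. Dataset[Row]
val ds = df.as[Person]                                        // typed view of the same data

df.first().getAs[Int]("age")  // untyped: field resolved by name at runtime
ds.first().age                // typed: field checked by the Scala compiler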

How to create Datasets?

Since Datasets are strongly typed, the first thing you need to know is the schema, in other words the data types. The easiest way to create a Dataset in Scala is to infer the schema by using a case class. Let’s take an example to get a better understanding.

case class IoTDeviceData(device_id: Long, device_name: String, ip: String,
  cca2: String, cca3: String, cn: String, latitude: Double, longitude: Double,
  scale: String, temp: Long, humidity: Long, c02_level: Long, lcd: String,
  timestamp: Long)

import spark.implicits._  // provides the encoder needed by .as[IoTDeviceData]

val ds = spark.read
  .json("dbfs:/FileStore/iot_device.json")
  .as[IoTDeviceData]

ds.show(5, false)

Dataset Operations

We can perform the same transformations and actions on a Dataset as we performed on a DataFrame. Let’s take a few examples:

Starting with a filter transformation:

val filteredDS = ds.filter(d => d.temp > 30 && d.humidity > 70)
filteredDS.show()

The above implementation shows another difference between the DataFrame and Dataset APIs: in the Dataset API, filter is an overloaded method, filter(func: (T) => Boolean): Dataset[T], which takes a lambda function, func: (T) => Boolean, as its argument.

Whereas in the DataFrame API, you express your filter() condition as SQL-like DSL operations.
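For comparison, here is a rough sketch of the same filter written against the untyped DataFrame API, reusing the ds Dataset from above by converting it back with toDF(); the column names are simply the IoTDeviceData fields.

import org.apache.spark.sql.functions.col

val filteredDF = ds.toDF()
  .filter(col("temp") > 30 && col("humidity") > 70)
filteredDF.show()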

Differences between DataFrames & Datasets

If you have read my previous blog on DataFrames, and now this one, you must be wondering when to use which, so let’s see:

  • If you want strict compile-time type safety, then use Datasets.
  • If your project requires high-level expressions like filters, maps, aggregations, or SQL queries, then you can use either DataFrames or Datasets.
  • If you want to benefit from Tungsten’s efficient serialization with Encoders, use Datasets.
  • If you want space and speed efficiency, use DataFrames.
  • If you want errors to be caught at compile time rather than at runtime, that alone points you towards Datasets; see the sketch after this list.
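As a rough illustration of the compile-time versus runtime difference (the misspelled column name below is hypothetical, not from the original post): the typed Dataset version is rejected by the Scala compiler, while the DataFrame version only fails when Spark analyses the query at runtime.

// Dataset: a field that does not exist on IoTDeviceData is a compile-time error.
// ds.filter(d => d.temprature > 30)         // error: value temprature is not a member of IoTDeviceData

// DataFrame: the same typo only surfaces at runtime as an AnalysisException.
// ds.toDF().filter(col("temprature") > 30)  // runtime: cannot resolve 'temprature'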

Conclusion

That was all for the second part of the Spark structured APIs series, in which we covered the basics of Datasets. If you are reading this conclusion, thanks for making it to the end of the blog, and if you like my writing, please do check out my other blogs.

Reference

https://docs.databricks.com/getting-started/spark/datasets.html

Written by 

Hi community, I am Raviyanshu from Dehradun, a tech enthusiast trying to make something useful with the help of 0s & 1s.
