Datasets in Spark

The new Dataset API lands today in Spark 1.6. Though it’s marked as experimental, it is set to become an important part of the Spark ecosystem as the platform pushes the underlying RDD operations away from the gaze of the everyday developer and into the hands of the Catalyst optimization engine.

Datasets simply present a typed interface over DataFrames, providing an RDD-like way of interacting with a DataFrame with negligible overhead. Creating a Dataset is simple:


import sqlContext.implicits._  // brings the implicit encoders into scope

val boardgameDataframe = sqlContext.read.json("boardgames.json")
// `type` is a Scala keyword, so the field needs backticks; the JSON reader infers whole numbers as Long
case class Boardgame(name: String, rating: Long, `type`: String)

val boardgameDataset = boardgameDataframe.as[Boardgame]

Standard RDD-like operations and aggregations such as map, filter, and groupBy are supported. If we want to keep only the games rated 8 or above and then group them by type, we’d use the fairly familiar chain below:


val highRanking = boardgameDataset.filter(_.rating >= 8).groupBy(_.`type`)
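
Note that groupBy here gives you back a grouped Dataset rather than a result, so you still need an aggregation to pull values out of it. As a rough sketch (reusing the Boardgame case class from above), counting the highly rated games of each type, or picking the best one per type, might look like this:


// count() turns the grouped Dataset into a Dataset of (type, count) pairs
val countsByType = highRanking.count()

// mapGroups runs arbitrary logic per group, e.g. finding the top-rated game of each type
val bestByType = highRanking.mapGroups { (gameType, games) =>
  (gameType, games.maxBy(_.rating).name)
}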

A good question at this point is ‘why is this useful, and what advantage does it give me over plain RDD or DataFrame operations?’

Firstly, having a typed interface over DataFrames gives you compile-time type checking, instead of discovering at runtime that the column you’re referencing either doesn’t exist or is a Long instead of a String (of course, I’m sure you’re testing to make sure this doesn’t happen in production, aren’t you?).
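
To make that concrete, here’s a rough illustration using a hypothetical typo in the column name:


// DataFrame: the misspelt column only surfaces as an AnalysisException at runtime
boardgameDataframe.filter($"ratng" >= 8)

// Dataset: the same typo is rejected by the compiler ("value ratng is not a member of Boardgame")
// boardgameDataset.filter(_.ratng >= 8)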

But more importantly, the type information you provide to a Dataset is used to automatically generate an encoder for the Tungsten memory management system (filling a similar role to Java or Kryo serialization, but more efficiently), immediately offering memory benefits over RDDs. And because Datasets are built on DataFrames, the execution of your operations is handled by Catalyst, so you also benefit from its query optimization.
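
The encoder is normally derived implicitly (that’s what the earlier implicits import is for), but you can also construct one explicitly. As a small sketch, assuming the Boardgame case class from above and the Encoders factory that ships with 1.6:


import org.apache.spark.sql.{Encoder, Encoders}

// the product encoder maps the case class fields into Tungsten's compact binary format
val boardgameEncoder: Encoder[Boardgame] = Encoders.product[Boardgame]

// for classes Catalyst can't inspect field-by-field, Kryo remains available as a fallback,
// at the cost of treating the whole object as an opaque blob
// val fallbackEncoder = Encoders.kryo[Boardgame]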

Also, since Catalyst is slated to gain adaptive query optimization (follow SPARK-9850 for updates on this feature, which will allow Spark to update query plans dynamically based on runtime information), getting used to working with Datasets now is going to pay off in your Spark applications in the very near future.