ETL Phone Home (And Go Away)

(data science, data science!, DATA SCIENCE!, and avoiding some of the boring bits, scala, spark, data sources)

Data Science! Doesn’t it sound awesome? The facts, the figures, all at your fingertips! You effortlessly write a few lines of Scala implementing a fancy new algorithm that’s going to save your company millions (Millions!) and then a 500-node Spark cluster churns away on your data…oh hang on, the data. The cluster chokes on the data and falls apart like a mis-timed Heath Robinson[1] machine.

“Oh yeah, the ETL.”

Extract! Transform! Load! The endless excitement of writing Pig scripts that might somehow do what you want and dump a file out into the Lovecraftian horrors of HDFS. So much time. Effort. Time and effort that might be better spent working on the problem rather than trying to write a script that dumps a few MySQL tables into a text file. A text file! Is this really the data future we have created? Is it a ‘data lake’, or just one huge Lovecraftian horror of a filesystem where everything gets thrown in, just in case it might be useful someday?

Spark! Save us!

How about the Data Sources API?

“The Data Sources API?”

Yes. Give it a whirl.

“Thanks, Spark! Now, I was wondering about memory management and why—”

Don’t push it, peon developer.

“I am a Data Scientist!”

You failed Stats 1 back in 1995[2].

“I–WAIT? How could you possibly know that?”

You’ll see. YOU’LL ALL SEE. YOU WILL ALL KNOW MY POWER.

Ahem. Okay, back to the point: ETL is a thankless task, but in a lot of cases, there’s no choice. However, Spark’s semi-new Data Sources API allows you to talk directly to heterogeneous resources and hide many of the messy details.

In the standard World of Hadoop™, you might set up a Sqoop[3] job to import data from a MySQL database into a Hive table. Data Sources, on the other hand, says “why don’t I go and pull that data in from the database and give it to you as a DataFrame? Also, you should think about closing that window. The 15:34 from Basingstoke will be coming in shortly, and you know it makes a racket.”

Let’s look at some code!


// JDBC driver class and connection string; the password comes in via an
// environment variable (see footnote 4)
val driver = "com.mysql.jdbc.Driver"
val conn = "jdbc:mysql://1.1.1.1:3306/sample?user=test&password=" + sys.env("DB_PASS")
val table = "users"

// Options handed to the JDBC data source
val options = Map[String, String](
		"driver" -> driver,
		"url" -> conn,
		"dbtable" -> table)

// Ask the Data Sources API to hand the table back as a DataFrame
val df = sqlContext.load("jdbc", options)

df.show()

And the result of the df.show() is as you’d expect:


+---+--------+
| id|username|
+---+--------+
|  1|  Gerald|
|  2|   Peter|
|  3|    Erin|
|  4|     Pip|
...

You do have to pass in the MySQL JDBC JAR on your spark-shell command line (e.g. --jars mysql-connector-java-5.1.36-bin.jar), or bundle it in your application, but that’s all. You now have a DataFrame that you can operate on in exactly the same way as any other DataFrame. Now, you may have to do some processing on that to get it where you need it to be, but already you’ve skipped so much misery[4].
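For instance, the usual DataFrame operations all work on it directly. A quick sketch, using nothing but the id and username columns from the sample table above:


// Ordinary DataFrame operations work on the JDBC-backed DataFrame as-is
val notGerald = df.filter(df("id") > 1)
notGerald.select("username").show()

// Or register it as a temporary table and use plain SQL
df.registerTempTable("users")
sqlContext.sql("SELECT count(*) AS user_count FROM users").show()
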

The Data Sources API doesn’t completely eliminate the need for ETL operations, but it brings so many heterogeneous data sources much, much closer to Spark, which can only be a good thing. Now, if you’ll excuse me, I’m going to mash up data from Cassandra, MySQL, Postgres, Couchbase, and a random CSV file I have lying around… (you’ve gone too far now. Have a lie down and a cup of tea. —Ed.)
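For what it’s worth, that mash-up really is only a few lines more. A rough sketch, assuming a hypothetical Postgres orders table (with a user_id column) and the Postgres JDBC JAR also passed via --jars:


// A second source, loaded through exactly the same API
val pgOptions = Map[String, String](
		"driver" -> "org.postgresql.Driver",
		"url" -> "jdbc:postgresql://2.2.2.2:5432/sample?user=test&password=" + sys.env("PG_PASS"),
		"dbtable" -> "orders")

val orders = sqlContext.load("jdbc", pgOptions)

// Join MySQL users against Postgres orders without writing any ETL at all
df.join(orders, df("id") === orders("user_id")).show()
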

  1. Sod off, Americans

  2. For the record, I passed Statistics 1 with flying colours. I can’t remember if it was Statistics 1 or Mechanics 1 where I built a random-walk nuclear reactor in BASIC. It sounds more impressive than it probably was (the code is likely still up in my family’s loft).

  3. No, seriously. Somebody thought ‘Sqoop’ was a good name.

  4. There are also some extra options you can pass along in the set-up to the JDBC driver to control Spark’s parallelism when reading from the DB. Oh, and that dbtable option? It can be a query or a view as well as just a simple table name. So you could build up a SQL query that provides only the data you need from the database and have MySQL do all the work for you even before you get your hands on it. Madness! (Note, though, that you can’t specify a password separately from the connection URL yet, hence the slightly-awkward DB_PASS environment variable addition above…)
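
  To make footnote 4 concrete, here is a rough sketch of a partitioned, query-backed read; the bounds, partition count, and column names are illustrative rather than recommendations:


// The partition settings tell Spark how to split the read across tasks, and
// dbtable here is a subquery (which MySQL requires to be aliased) so the
// database does the filtering before Spark ever sees a row
val parallelOptions = Map[String, String](
		"driver" -> "com.mysql.jdbc.Driver",
		"url" -> "jdbc:mysql://1.1.1.1:3306/sample?user=test&password=" + sys.env("DB_PASS"),
		"dbtable" -> "(SELECT id, username FROM users WHERE id > 100) AS recent_users",
		"partitionColumn" -> "id",
		"lowerBound" -> "1",
		"upperBound" -> "100000",
		"numPartitions" -> "8")

val parallelDf = sqlContext.load("jdbc", parallelOptions)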
