ETL Phone Home (And Go Away)

(data science, data science!, DATA SCIENCE!, and avoiding some of the boring bits, scala, spark, data sources)

Data Science! Doesn’t it sound awesome? The facts, the figures, all at your fingertips! You effortlessly write a few lines of Scala implementing a fancy new algorithm that’s going to save your company millions (Millions!) and then a 500 node Spark cluster churns away on your data…oh hang on, the data. The cluster chokes on the data and falls apart like a mis-timed Heath Robinson1 machine.

“Oh yeah, the ETL.”

Extract! Transform! Load! The endless excitement of writing Pig scripts that might someway do what you want and dump a file out into the Lovecraftian horrors of HDFS. So much time. Effort. Time and effort that might be better spent on working on the problem rather than trying to write a script that dumps a few MySQL tables into a text file. A text file! Is this really the data future we have created? Is it a ‘data lake’, or just a huge Lovecraftian Horror of a HDFS filesystem where everything gets thrown in, just in case it might be useful someday?

Spark! Save us!

How about the Data Sources API?

“The Data Sources API?”

Yes. Give it a whirl.

“Thanks, Spark! Now, I was wondering about memory management and why—“

Don’t push it, peon developer.

“I am a Data Scientist!”

You failed Stats 1 back in 19952.

“I–WAIT? How could you possibly know that?”


Ahem. Okay, back to the point - ETL is a thankless task, but in a lot of cases, there’s no choice. However, Spark’s semi-new Data Sources API allows you to talk directly to heterogeneous resources and hide many of the messy details.

In the standard World of Hadoop™, you might set up a Sqoop3 job to import data from a MySQL database into a Hive table. Data Sources, on the other hand, says “why don’t I go and pull that data in from the database and give it to you as a DataFrame? Also, you should think about closing that window. The 15:34 from Basingstoke will be coming in shortly, and you know it makes a racket.”

Let’s look at some code!

val driver = "com.mysql.jdbc.Driver"
val conn = "jdbc:mysql://"+sys.env("DB_PASS")
val table = “users”

val options = Map[String, String](
		"driver" -> driver, 
		"url" -> conn, 
		"dbtable" -> table)
val df = sqlContext.load("jdbc", options);

And the result of the is as you’d expect:

|id |username| 
|  1|  Gerald|
|  2|   Peter|
|  3|    Erin|
|  4|     Pip|

You do have to pass in the MySQL JDBC JAR on your spark-shell command line (i.e. --jars mysql-connector-java-5.1.36-bin.jar), or bundle it in your application, but that’s all. You now have a DataFrame that you can operate on exactly the same way as any another DataFrame. Now, you may have to do some processing on that to get you where you need to be, but already you’ve skipped so much misery4.

The Data Sources API doesn’t completely eliminate the need for ETL operations, but it brings so many heterogenous data sources much much closer to Spark, which can only be a good thing. Now, if you excuse me, I’m going to mash up data from Cassandra, MySQL, Postgres, Couchbase, and a random CSV file I have lying around… (you’ve gone too far now. Have a lie down and a cup of tea. —Ed.)

  1. Sod off, Americans

  2. For the record, I passed Statistics 1 with flying colours. I can’t remember if it was Statistics 1 or Mechanics 1 where I built a random-walk nuclear reactor in BASIC. It sounds more impressive than it probably was (the code is likely still up in my family’s loft).

  3. No, seriously. Somebody thought ‘Sqoop’ was a good name.

  4. There’s also some extra options you can pass along in the set-up to the JDBC driver to control Spark’s parallelism when reading from the DB. Oh, and that table option? It can be a query or a view as well as just a simple table name. So you could build up the SQL query that provides only the data you need from the database and have MySQL do all the work for you even before you get your hands on it. Madness! (note, though, you can’t specify a password separately from the connection URL yet, hence the slightly-awkward DB_PASS environment variable addition above…

And That’s Why I Bought Resistance Avalon

(not the saturday i had planned)

My Saturday was fairly well planned-out: I was going to go food shopping in the morning, do a bit of writing in the afternoon, and then make Alex Stupak’s cheeseburger tacos in the evening. Maybe add in a test run of a raspberry parfait for a dinner I’m doing next weekend.

That lasted until around just before one. And I’m in the bathroom. Not something I’d talk about on a normal day, obviously.


Now, I live in a city in America. I have heard gunshots before. Sometimes they’re actually fireworks, sometimes a car back-firing, and sometimes they really are gunshots. A distance away. But this was different.

Outside the window

Outside the window

Somebody is emptying a pistol outside my window and all I can think about is that dying on the toilet would be incredibly embarrassing.

Look, I’m a simple boy from the South of Britain. I’ve never had to really deal with this sort of thing. Back home, guns exist, obviously (it’s ten years this week since the Met shot Jean Charles de Menezes, after all), but they’re mostly other. Something that you are very unlikely to come across in everyday life, and specifically, you don’t often encounter the situation where you are wondering about the bullet-stopping abilities of a wooden house’s walls1.

The police arrived around 30 minutes later and possibly walked around the area trying to find bullets. I wouldn’t really know, as at this point I had retreated to my dining room, thinking that it provided several walls of protection from the side of the road where the shots where fired.2

I might3 have been suffering from a slight touch of shock at this point.

The idea of spending all afternoon in the house suddenly seemed a lot less appealing, so I drove to Target to get tortillas for the cheeseburger tacos (I’d forgotten them in Kroger this morning). Except, I drove past Target, went into Atomic Empire, and then spent forty minutes wandering around the shop before leaving with Mercury Heat #1 and…Resistance: Avalon. Retail as therapy. Always works.

I forgot to actually stop in Target on the way back, so I still don’t have tortillas. Which is something of a problem for making tacos…

  1. Also, we build our houses in brick because we damn well understood the moral of Three Little Pigs, thank you very much, but that’s another story.

  2. Seriously guys, bricks. They’re so good!

  3. Have you ever tried to puree and sieve raspberries when your hands are shaking uncontrollably? It’s like The Krypton Factor, but with added red splotches.

Inspiration Failure

(you can type this, george)

This week, I made three different types of sweet (maple candy, white chocolate with golden syrup crunch and orange milk chocolate bars), played the same board game four times (two of which were just playing by myself), drove around five hundred miles on a round-trip to SC, discovering that apparently you can get red velvet waffles with your chicken in Columbia, finished finalizing things for the first Triangle Spark meetup on Tuesday (you should come!), began work on a semi-secret work project, trained all sorts of machine learning models…and still don’t really think I got anything done this week. I may need to recalibrate my expectations a little.

In other, odd news: since having to give up walking to work, I have lost more weight than I have in the past two years or so. Which is…strange. I have tried to be somewhat more consistent in riding the exercise bike in an attempt to make up for the lack of exercise, and I guess I’ve been much better at that than I realized.

(it’ll all go to hell after I have my operation, mind you…)


(what now)

Just tired. Tired of the news back home, tired of the news back here, tired of the hipsters’ reflexive bashing of anything that happens in Durham, tired of the heat, tired of my foot giving out at random times. Tired of being tired and spending the evening at the Internet instead of doing something useful instead.

Not a great week. But the week before ended well, so maybe the coming week will be better.

North Carolina — The Fun Police


America (or at least parts of it): where you can go out and buy a gun for concealed carry, but heaven forbid that you buy a firework that shoots into the sky. At least in North Carolina, anyhow; across the border in South Carolina, they’re happy to essentially sell you small ordinance for you to fire off in your back garden.

And then on July 4th, Durham erupts in illegal fireworks. It’s an odd system.

A week of two halves, then: the first four days full of 12-hours days and misery, and the back three days finally relaxing, not staring at a computer screen, and not being woken up at 4am by a spider cricket crawling up my pyjama leg. You can probably guess which part of the week I enjoyed more.

Blackberry Sake sorbet

And I made things! Sorbets, vegan meringues, caramelized soy milk pudding (yes, I know, but believe me, it tastes about 500x better than it sounds, and I’m thinking about using the caramelized soy milk to make a vegan ganache in the near future), deep-fried cheese, and soy nuggets slathered in ssamjang.

Everything should be covered in ssamjang.

(This stems from finally getting to go to Kokyu’s sandwich shop this Friday. The ssamwich is essential and you should beat a path there for weekday lunch sometime)

During a whirlwind visit to Durham, Tammy followed through on her determination to dazzle paint yet another piece of my furniture, so I now have a wonderful dazzle table sitting on my porch. I will not stop until the entire house clashes with itself. Wait until you see the rugs I’m looking at muahahahahahaha.

Still no news on the foot. Maybe this week…

buy my books
Instant Zepto.js