ETL Phone Home (And Go Away)

Data Science! Doesn’t it sound awesome? The facts, the figures, all at your fingertips! You effortlessly write a few lines of Scala implementing a fancy new algorithm that’s going to save your company millions (Millions!) and then a 500-node Spark cluster churns away on your data…oh hang on, the data. The cluster chokes on the data and falls apart like a mis-timed Heath Robinson[1] machine.

“Oh yeah, the ETL.”

Extract! Transform! Load! The endless excitement of writing Pig scripts that might somehow do what you want and dump a file out into HDFS. So much time. Effort. Time and effort that might be better spent working on the problem rather than trying to write a script that dumps a few MySQL tables into a text file. A text file! Is this really the data future we have created? Is it a ‘data lake’, or just a huge Lovecraftian horror of an HDFS filesystem where everything gets thrown in, just in case it might be useful someday?

Spark! Save us!

How about the Data Sources API?

“The Data Sources API?”

Yes. Give it a whirl.

“Thanks, Spark! Now, I was wondering about memory management and why—”

Don’t push it, peon developer.

“I am a Data Scientist!”

You failed Stats 1 back in 1995[2].

“I–WAIT? How could you possibly know that?”

You’ll see. YOU’LL ALL SEE. YOU WILL ALL KNOW MY POWER.

Ahem. Okay, back to the point - ETL is a thankless task, but in a lot of cases, there’s no choice. However, Spark’s semi-new Data Sources API allows you to talk directly to heterogeneous resources and hide many of the messy details.

In the standard World of Hadoop™, you might set up a Sqoop[3] job to import data from a MySQL database into a Hive table. Data Sources, on the other hand, says “why don’t I go and pull that data in from the database and give it to you as a DataFrame? Also, you should think about closing that window. The 15:34 from Basingstoke will be coming in shortly, and you know it makes a racket.”

Let’s look at some code!


// JDBC connection details - note that the password has to go in the
// connection URL itself (pulled from the environment here rather than hard-coded)
val driver = "com.mysql.jdbc.Driver"
val conn = "jdbc:mysql://1.1.1.1:3306/sample?user=test&password=" + sys.env("DB_PASS")
val table = "users"

val options = Map[String, String](
        "driver" -> driver,
        "url" -> conn,
        "dbtable" -> table)

// Spark connects to MySQL and hands back a DataFrame
val df = sqlContext.load("jdbc", options)

df.show()

And the result of the df.show() is as you’d expect:


+---+--------+
|id |username| 
+---+--------+
|  1|  Gerald|
|  2|   Peter|
|  3|    Erin|
|  4|     Pip|
...

You do have to pass in the MySQL JDBC JAR on your spark-shell command line (e.g. --jars mysql-connector-java-5.1.36-bin.jar), or bundle it in your application, but that’s all. You now have a DataFrame that you can operate on in exactly the same way as any other DataFrame. Now, you may have to do some processing on that to get you where you need to be, but already you’ve skipped so much misery[4].
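Since it is just a DataFrame, all the usual operations apply: filter it, or register it as a temporary table and hit it with plain SQL. A quick sketch using the users table from above:

// ordinary DataFrame operations work as usual
df.filter(df("id") > 2).show()

// or register it as a temp table and query it with SQL
df.registerTempTable("users")
sqlContext.sql("SELECT username FROM users WHERE id > 2").show()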

The Data Sources API doesn’t completely eliminate the need for ETL operations, but it brings so many heterogeneous data sources much, much closer to Spark, which can only be a good thing. Now, if you’ll excuse me, I’m going to mash up data from Cassandra, MySQL, Postgres, Couchbase, and a random CSV file I have lying around… (you’ve gone too far now. Have a lie down and a cup of tea. —Ed.)
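And despite the Ed.’s protests, the mashing-up really is that direct. A sketch, assuming the spark-csv package is on the classpath and that a purchases.csv file with a username column exists (both of my own invention for illustration):

// load a random CSV via the spark-csv package (passed with --packages)
val purchases = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "purchases.csv", "header" -> "true"))

// join it against the MySQL-backed DataFrame from earlier;
// Spark doesn't care where either side came from
df.join(purchases, df("username") === purchases("username")).show()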


  1. Sod off, Americans [return]
  2. For the record, I passed Statistics 1 with flying colours. I can’t remember if it was Statistics 1 or Mechanics 1 where I built a random-walk nuclear reactor in BASIC. It sounds more impressive than it probably was (the code is likely still up in my family’s loft). [return]
  3. No, seriously. Somebody thought ‘Sqoop’ was a good name. [return]
  4. There are also some extra options you can pass along in the set-up to the JDBC driver to control Spark’s parallelism when reading from the DB. Oh, and that table option? It can be a query or a view as well as just a simple table name. So you could build up the SQL query that provides only the data you need from the database and have MySQL do all the work for you even before you get your hands on it. Madness! (Note, though, that you can’t specify a password separately from the connection URL yet, hence the slightly-awkward DB_PASS environment variable addition above…) [return]

And That’s Why I Bought Resistance Avalon

My Saturday was fairly well planned-out: I was going to go food shopping in the morning, do a bit of writing in the afternoon, and then make Alex Stupak’s cheeseburger tacos in the evening. Maybe add in a test run of a raspberry parfait for a dinner I’m doing next weekend.

That lasted until just before one. And I’m in the bathroom. Not something I’d talk about on a normal day, obviously.

Gunshots

Now, I live in a city in America. I have heard gunshots before. Sometimes they’re actually fireworks, sometimes a car back-firing, and sometimes they really are gunshots. A distance away. But this was different.

Outside the window

Outside the window

Somebody is emptying a pistol outside my window and all I can think about is that dying on the toilet would be incredibly embarrassing.

Look, I’m a simple boy from the South of Britain. I’ve never had to really deal with this sort of thing. Back home, guns exist, obviously (it’s ten years this week since the Met shot Jean Charles de Menezes, after all), but they’re mostly other. Something that you are very unlikely to come across in everyday life, and specifically, you don’t often encounter the situation where you are wondering about the bullet-stopping abilities of a wooden house’s walls[1].

The police arrived around 30 minutes later and possibly walked around the area trying to find bullets. I wouldn’t really know, as at this point I had retreated to my dining room, thinking that it provided several walls of protection from the side of the road where the shots were fired.[2]

I might[3] have been suffering from a slight touch of shock at this point.

The idea of spending all afternoon in the house suddenly seemed a lot less appealing, so I drove to Target to get tortillas for the cheeseburger tacos (I’d forgotten them in Kroger this morning). Except, I drove past Target, went into Atomic Empire, and then spent forty minutes wandering around the shop before leaving with Mercury Heat #1 and…Resistance: Avalon. Retail as therapy. Always works.

I forgot to actually stop in Target on the way back, so I still don’t have tortillas. Which is something of a problem for making tacos…


  1. Also, we build our houses in brick because we damn well understood the moral of Three Little Pigs, thank you very much, but that’s another story. [return]
  2. Seriously guys, bricks. They’re so good! [return]
  3. Have you ever tried to puree and sieve raspberries when your hands are shaking uncontrollably? It’s like The Krypton Factor, but with added red splotches. [return]

Inspiration Failure

This week, I made three different types of sweet (maple candy, white chocolate with golden syrup crunch, and orange milk chocolate bars), played the same board game four times (two of which I played by myself), drove around five hundred miles on a round trip to SC (discovering that apparently you can get red velvet waffles with your chicken in Columbia), finalized things for the first Triangle Spark meetup on Tuesday (you should come!), began work on a semi-secret work project, trained all sorts of machine learning models…and still don’t really think I got anything done this week. I may need to recalibrate my expectations a little.

In other, odd news: since having to give up walking to work, I have lost more weight than I have in the past two years or so. Which is…strange. I have tried to be somewhat more consistent in riding the exercise bike in an attempt to make up for the lack of exercise, and I guess I’ve been much better at that than I realized.

(it’ll all go to hell after I have my operation, mind you…)

Tired

Just tired. Tired of the news back home, tired of the news over here, tired of the hipsters’ reflexive bashing of anything that happens in Durham, tired of the heat, tired of my foot giving out at random times. Tired of being tired and spending the evening on the Internet instead of doing something useful.

Not a great week. But the week before ended well, so maybe the coming week will be better.

North Carolina — The Fun Police

America (or at least parts of it): where you can go out and buy a gun for concealed carry, but heaven forbid that you buy a firework that shoots into the sky. At least in North Carolina, anyhow; across the border in South Carolina, they’re happy to essentially sell you small ordnance to fire off in your back garden.

And then on July 4th, Durham erupts in illegal fireworks. It’s an odd system.

A week of two halves, then: the first four days full of 12-hour days and misery, and the back three days finally relaxing, not staring at a computer screen, and not being woken up at 4am by a spider cricket crawling up my pyjama leg. You can probably guess which part of the week I enjoyed more.

Blackberry Sake sorbet

And I made things! Sorbets, vegan meringues, caramelized soy milk pudding (yes, I know, but believe me, it tastes about 500x better than it sounds, and I’m thinking about using the caramelized soy milk to make a vegan ganache in the near future), deep-fried cheese, and soy nuggets slathered in ssamjang.

Everything should be covered in ssamjang.

(This stems from finally getting to go to Kokyu’s sandwich shop this Friday. The ssamwich is essential and you should beat a path there for weekday lunch sometime.)

During a whirlwind visit to Durham, Tammy followed through on her determination to dazzle paint yet another piece of my furniture, so I now have a wonderful dazzle table sitting on my porch. I will not stop until the entire house clashes with itself. Wait until you see the rugs I’m looking at muahahahahahaha.

Still no news on the foot. Maybe this week…

The Delicious Salty Taste of Scalia’s Tears

A new definition of ‘own worst enemy’: knowing that you are going to have surgery soon to alleviate your bad foot…and then walking 13 miles in two days pretty much by accident.

So, yes, I’m currently lying down and in quite a bit of pain.

The talk at the Red Hat Summit seemed to go down well, though. Even if the wireless connection decided to go down right in the middle of the demo. That was a tense minute or two, but thanks to the rather aggressive polling in the web client, it eventually worked (hurrah!).

(also, apologies to everybody I know in Boston - I didn’t know how much time I would have to myself, and how mobile I’d be, otherwise, I would have sought you all out!)

Remember, everyone: have a good long drink of Scalia’s Tears this weekend, and tune out the obnoxious leftist-radical whining about how Friday’s SCOTUS decision means ‘nothing’.

Spark and Kafka - Getting Cozier

I’m a huge fan of the reappearance of Enterprise Service Buses. They are especially great for Big Data systems and the Lambda Architecture: messages get sent to various streams on the bus, and consumers can read them in a streaming or a batch fashion as desired.

(a good introduction to the idea of an Enterprise Service/Message Bus from last year)

Obviously, you wait ages for a decent Enterprise Service Bus/Data Stream Bus/PubSub/Messaging Log[1] and then many come at once. One of the most popular in recent times is Apache Kafka - developed at LinkedIn to be capable of handling their huge throughput requirements. It’s quickly become a de facto component of many a Spark Streaming or Storm solution.

In the world of Spark, though, Kafka integration has always been a bit of a pain. If you look at this guide to integrating Kafka and Spark, it’s clear that wrangling more than a simple connection to Kafka involves quite a bit of faff, having to union multiple DStreams as they’re coming in from Kafka to increase parallelism. Spark is supposed to be easier to work with than that!
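If you haven’t had the pleasure, the old receiver-based approach meant something like the following sketch, where the ZooKeeper quorum, consumer group, per-topic receiver thread counts, and number of streams all have to be wired up by hand (the values here are illustrative, and ssc is a StreamingContext as in the example further down):

import org.apache.spark.streaming.kafka.KafkaUtils

// receiver-based approach: one receiver per DStream, unioned by hand
val zkQuorum = "zk-1:2181,zk-2:2181"
val groupId = "example-group"
val topicMap = Map("example-topic" -> 1) // topic -> receiver threads

val numStreams = 4
val kafkaStreams = (1 to numStreams).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
}
val unified = ssc.union(kafkaStreams)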

Well, in Spark 1.3, a new interface to Kafka was added. It’s still (in 1.4) marked as ‘experimental’, but I know of several companies that have been using it in production for months, handling billions of messages per day (if you’re still cautious, I imagine it will be marked as stable in 1.5). And it makes things so much simpler!


import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// 5-second micro-batches
val ssc = new StreamingContext(new SparkConf, Seconds(5))

// the direct stream talks straight to the brokers - no ZooKeeper quorum needed
val kafkaParams = Map("metadata.broker.list" -> "kafka-1:9092,kafka-2:9092,kafka-3:9092")

val topics = Set("example-topic", "another-example-topic")

// type parameters are [key, value, key decoder, value decoder]
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

stream.map(_._2) // pull out the message values, and then do Spark stuff!
// ...
ssc.start()
ssc.awaitTermination()

This automatically creates a DStream of KafkaRDDs whose partitions map one-to-one onto the Kafka partitions and read from them in parallel. No union required! As a bonus, because Spark itself tracks the offsets that have been read, bypassing ZooKeeper, the new approach gains exactly-once semantics (with the downside that ZooKeeper no longer knows what the Spark Streaming application is doing, which may cause some monitoring issues unless you manually update ZooKeeper from within Spark).
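If you do want those offsets back, whether for monitoring or for pushing to ZooKeeper yourself, the direct stream will hand them to you. A minimal sketch:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // each batch's underlying KafkaRDD knows the offset ranges it read
  val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsets.foreach { o =>
    println(s"${o.topic}/${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
    // ...update ZooKeeper or your monitoring system here
  }
}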

Also in 1.3 and above - batch access to Kafka! You can create KafkaRDDs and operate on them in the usual way (a boon if you’re working on a Lambda Architecture application).


import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// args are: topic, partitionId, fromOffset (inclusive), untilOffset (exclusive)
val offsetRanges = Array(
      OffsetRange("example-topic", 0, 110, 220)
)

// kafkaParams is the same broker-list map as in the streaming example,
// and sc is a plain SparkContext
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, offsetRanges)

(In batch mode, you currently have to manage reading from all the partitions and the offsets in the Kafka log yourself.)

Okay, so we can now get parallelism with Spark and Kafka in a much simpler manner…but an important feature of these architectures is writing results back to the bus (e.g., flagging possible fraudulent bids in a real-time auctioning system for further investigation). Unfortunately, baked-in support for this is not scheduled until 1.5 (see SPARK-4122 for more details), so for now you have to handle the details yourself - consider a connection pool if you find yourself doing many writes back to Kafka in a short time.
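In the meantime, the usual workaround is a plain Kafka producer created inside foreachPartition, so each executor writes its own partition’s results straight to the brokers. A sketch, with the output topic and producer settings invented for illustration:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

stream.map(_._2).foreachRDD { rdd =>
  rdd.foreachPartition { messages =>
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // one producer per partition per batch - pool these in production
    val producer = new KafkaProducer[String, String](props)
    messages.foreach { msg =>
      producer.send(new ProducerRecord[String, String]("flagged-bids", msg))
    }
    producer.close()
  }
}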


  1. The cited difference between a data stream bus/log and an enterprise bus seems to be that traditional enterprise buses tended to do transformations on the bus itself, whereas systems like Kafka are much simpler and leave it up to the consumers to transform data (and possibly write it back to the bus under a different topic). [return]

The Adventure Continues

Good news first! I’m going to Boston again next Wednesday, for the Red Hat Summit. We’ll be doing a presentation on financial modelling.

And then the bad news…results from the MRI are in, and an appointment with a surgeon is incoming; my first operation will involve doing things to my left foot. Things that will leave me unable to walk for a while, and with rather impaired mobility for some time beyond that. Hurrah for being in a job where remote work is possible!

(though the upcoming surgery did mean I had to pass on a rather fancy posting today; a shame, but I’m sure there’ll be others!)

Other than that, though, a quiet week. I did adult things like buying new filters for the house’s air conditioning system and sorting out the various bills that my MRI adventures have cost me so far. Fun times!

Maybe some chocolate-making this weekend…

Insurance Adventures

“Your insurance company has not approved this MRI yet. We can still go ahead with it, but we’ll need you to sign this waiver that holds you liable for the cost if they don’t approve.”

“Er, how much will it be?”

“…”

“Nobody asks that?”

“I’ve seen some for as low as $2,000, and some as high as $11,000.”

deep breath from the British person on the other side

“How about we reschedule until next week and see if they approve it?”

Meanwhile, back in Britain:

“Here for your MRI? This way!”[1]

You can infer from this that I did not have my MRI this week, and thus I still don’t know what’s wrong with my foot; I’m also laid up in bed after hurting it again last night and then having to drive four hours back from South Carolina on top of that. Still, a fun trip down to SC where I discovered the useful effects of trampoline parks on children (they’re so tired afterwards!), was given a geometric painting as a birthday present (yay triangles!), and practiced my Sichuan wonton construction skills. Oh, and I saw all of Flash Gordon for the first time. Richard O’Brien was better in Jubilee, I think. As for the rest of it, my goodness, there were some awful films produced in the wake of Star Wars.

And finally, I got promoted! I’m now a Lead Consultant at Mammoth Data. I now consult in a leading way on all the Big Data things! Perhaps.


  1. Yes, yes, there might be some waiting due to it being a non-urgent scan, but I wouldn’t have had that conversation, and I would have gone to the doctor earlier anyhow. So there. [return]

Gary Oldman - Spirit Animal

Happily, work ended on a much better note this week than it began.

Not much going on here except work this week…but next week: Ian Gets His First MRI!