The Delicious Salty Taste of Scalia’s Tears

(ow, no, seriously, why did you walk that far, yes, very good that you won’t be happy until the state is destroyed. good luck with that., sometimes i do think i’m getting more right-wing with age, but no, it’s just that they’re that obnoxious)

A new definition of ‘own worst enemy’: knowing that you are going to have surgery soon to alleviate your bad foot…and then walking 13 miles in two days pretty much by accident.

So, yes, I’m currently lying down and in quite a bit of pain.

The talk at the Red Hat Summit seemed to go down well, though. Even if the wireless connection decided to go down right in the middle of the demo. That was a tense minute or two, but thanks to the rather aggressive polling in the web client, it eventually worked (hurrah!).

(also, apologies to everybody I know in Boston - I didn’t know how much time I would have to myself, and how mobile I’d be, otherwise, I would have sought you all out!)

Remember, everyone: have a good long drink of Scalia’s Tears this weekend, and tune out the obnoxious leftist-radical whining about how Friday’s SCOTUS decision means ’nothing’.

Spark and Kafka - Getting Cozier

(1.3, 1.4, and above, apache spark, apache kafka, you are trapped in a big data room. to the north, there are five trillion exits)

I’m a huge fan of the reappearance of Enterprise Service Buses. They are especially great for Big Data systems and the Lambda Architecture: messages get sent to various different streams on the bus and consumers can read them in a streaming or a batch operation as desired.

(a good introduction to the idea of an Enterprise Service/Message Bus from last year)

Obviously, you wait for a decent Enterprise Service Bus/Data Stream Bus/PubSub/Messaging Log[1] and then many come at once. One of the most popular in recent times is Apache Kafka - developed at LinkedIn to be capable of handling their huge throughput requirements. It’s quickly become a de facto component of many a Spark Streaming or Storm solution.

In the world of Spark, though, Kafka integration has always been a bit of a pain. If you look at this guide to integrating Kafka and Spark, it’s clear that wrangling more than a simple connection to Kafka involves quite a bit of faff, having to union multiple DStreams as they’re coming in from Kafka to increase parallelism. Spark is supposed to be easier to work with than that!

Well, in Spark 1.3, a new interface to Kafka was added. It’s still (in 1.4) marked as ‘experimental’, but I know of several companies who have been using it in production for months, handling billions of messages per day (I imagine that it will be marked as safe in 1.5 if you’re still cautious). And it makes things so much simpler!

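import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "kafka-1:9092,kafka-2:9092,kafka-3:9092")
val topics = Set("example-topic", "another-example-topic")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) // and then do Spark stuff!
// ...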

This automatically creates a DStream comprised of KafkaRDDs which read in parallel from the number of Kafka partitions. No union required! As a bonus, because Spark handles the offsets that have been read, bypassing ZooKeeper, the new approach gains exactly-once semantics (with the downside that ZooKeeper no longer knows exactly what the Spark Streaming application is doing, which may cause some monitoring issues unless you manually update ZooKeeper from within Spark).

Also in 1.3 and above - batch access to Kafka! You can create KafkaRDDs and operate on them in the usual way (a boon if you’re working on a Lambda Architecture application).

import org.apache.spark.streaming.kafka.OffsetRange

val offsetRanges = Array(
  // args are: topic, partitionId, fromOffset (inclusive), untilOffset (exclusive)
  OffsetRange("example-topic", 0, 110, 220)
)
// sc is an existing SparkContext
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, offsetRanges)

(In batch mode, you currently have to manage reading from all the partitions and the offsets in the Kafka log yourself.)
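That bookkeeping is mostly arithmetic on offsets, so here is a rough sketch of it in plain Scala. Everything in it is an assumption for illustration: `Slice` is a stand-in that mirrors the four arguments of `OffsetRange` (so the snippet runs without Spark on the classpath), and the `committed` and `latest` maps are presumed to come from wherever you store your progress and from Kafka respectively. It is not part of the Spark API.

```scala
// Hypothetical stand-in mirroring OffsetRange's arguments:
// topic, partition, fromOffset (inclusive), untilOffset (exclusive).
final case class Slice(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

object OffsetPlanning {
  /** Build one slice per partition that has unread messages.
    *
    * committed: partition -> next offset to read (assumed to be stored by you)
    * latest:    partition -> log-end offset (assumed to be fetched from Kafka)
    *
    * Partitions missing from `committed` start at offset 0 here, which is a
    * simplification: a real log may already have dropped its earliest messages.
    */
  def plan(topic: String, committed: Map[Int, Long], latest: Map[Int, Long]): Seq[Slice] =
    latest.toSeq.sortBy(_._1).collect {
      case (partition, until) if until > committed.getOrElse(partition, 0L) =>
        Slice(topic, partition, committed.getOrElse(partition, 0L), until)
    }
}
```

Each resulting slice would then become an `OffsetRange` handed to `createRDD`, and after the batch succeeds you would record each `untilOffset` as the new starting point for that partition.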

Okay, so we can now do parallelism with Spark and Kafka in a much simpler manner…but an important feature of these architectures is writing results back to the bus (e.g., flagging possible fraudulent bids in a real-time auctioning system for further investigation). Unfortunately, baked-in support for this is not scheduled until 1.5 (see SPARK-4122 for more details), so for now, you have to handle the details here yourself - consider a connection pool if you find yourself doing many writes back to Kafka in a short time.
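For what it’s worth, the connection pool idea can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: `SimplePool` and `FakeProducer` are hypothetical names, and `FakeProducer` merely records what it is asked to send so the snippet runs without Kafka. In a real job you would swap in an actual Kafka producer and think about closing it on shutdown.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable.ArrayBuffer

/** A tiny object pool: `create` is only called when no idle resource exists. */
final class SimplePool[A](create: () => A) {
  private val idle = new ConcurrentLinkedQueue[A]()

  /** Borrow a resource, run `f` with it, and always hand it back afterwards. */
  def withResource[B](f: A => B): B = {
    // poll() returns null on an empty queue, which Option(...) turns into None
    val resource = Option(idle.poll()).getOrElse(create())
    try f(resource) finally idle.offer(resource)
  }
}

/** Hypothetical stand-in for a Kafka producer; it just remembers the messages. */
final class FakeProducer {
  val sent: ArrayBuffer[(String, String)] = ArrayBuffer.empty
  def send(topic: String, message: String): Unit = {
    sent += ((topic, message))
    ()
  }
}

object ProducerPool {
  // Lazily created once per JVM, so each executor builds its own pool on first use.
  lazy val pool = new SimplePool[FakeProducer](() => new FakeProducer)
}
```

The intent is that, inside `foreachRDD`, each partition borrows a producer via `ProducerPool.pool.withResource` and sends all of its records through it, rather than opening a fresh connection per record. Keeping the pool in an object with a `lazy val` means it is created on the executors instead of being serialised from the driver.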

  1. the cited difference between a data stream bus/log and an enterprise bus seems to be that traditional enterprise buses tended to do transformations on the bus itself, whereas systems like Kafka are much simpler and leave it up to the consumers to transform data (and possibly write it back to the bus under a different topic).

The Adventure Continues

(my first operation by fisher-price, but anything else is communism, obviously, no, really)

Good news first! I’m going to Boston again next Wednesday, for the Red Hat Summit. We’ll be doing a presentation on financial modelling.

And then the bad news…results from the MRI are in, and an appointment with a surgeon is incoming; my first operation will involve doing things to my left foot. Things that will leave me unable to walk for a while, and with rather impaired mobility for some time beyond that as well. Hurrah for being in a job where remote work is possible!

(though the upcoming surgery did mean I had to pass on a rather fancy posting today; a shame, but I’m sure there’ll be others!)

Other than that, though, quiet week. I did adult things like buy new filters for the house’s air conditioning system and sorted out the various bills that my MRI adventures have cost me so far. Fun times!

Maybe some chocolate-making this weekend…

Further Adventures With MRIs

(but anything else is communism, obviously, no, really)


Quick tip: if you ever want to shock a bunch of British people into silence followed by twenty to thirty minutes of raging, telling them how much you had to pay for your MRI scan is a great way of doing it.

So, my first MRI! Lots of whomp, whomp. No results yet, so no idea what’s wrong with the foot, only that it’s very sore and I would like some idea what I need to do to fix that. If possible.

Also, I hardly slept last night at all, so today’s blog is going to suffer from that. Though it probably won’t end up being much different than usual.

Anyway, it’s been an odd week; I’ve been at home most days as I was having serious pain walking, meaning that I missed the arrival of two new people over at Mammoth Data, though they’ll still be there tomorrow, I guess (well, one will, the other will be back on his boat). I continue working on preparation for the demo we’re going to be giving in a couple of weeks’ time.

Meanwhile, back in the kitchen, I spent the days making vegan marshmallows, mixing a steel-aged cocktail (I’ll let you know how it is come the end of July!), and also, after coming across the EZTemper earlier in the week, I attempted to replicate its functionality using a water bath instead of another ~$1,000 piece of equipment. And it does work - seeding the chocolate with melted cocoa butter tempers it almost instantly (at the slight expense of changing the chocolate characteristics a little, but not so much that you’d notice). I rolled out a sheet of orange flavoured 55% chocolate mixed with feuillantine which tempered quickly and tasted pretty good. It’s not going to replace my industrial temperer for large runs, but I can see using the cocoa butter method for smaller batches (one drawback is that you still need to plan in advance, as the butter needs to sit in the water bath for a fair few hours before you can use it).

I have a rant building about TFI Friday and how I’m not ready for I Love The 1990s. But maybe that’ll keep for a few days. From what I heard about the show, though, it summed up TFI Friday almost perfectly: fun to begin with, then it petered out and got lost in Evans’ obsessions. And then you question whether the first bit was any good to begin with either…

Insurance Adventures

(but anything else is communism, obviously)

“Your insurance company has not approved this MRI yet. We can still go ahead with it, but we’ll need you to sign this waiver that holds you liable for the cost if they don’t approve.”

“Er, how much will it be?”


“Nobody asks that?”

“I’ve seen some for as low as $2,000, and some as high as $11,000.”

deep breath from the British person on the other side

“How about we reschedule until next week and see if they approve it?”

Meanwhile, back in Britain:

“Here for your MRI? This way!”[1]

You can infer from this that I did not have my MRI this week, and thus I still don’t know what’s wrong with my foot, and I’m also laid up in bed after hurting it again last night and then having to drive four hours back from South Carolina on top of that. Still, a fun trip down to SC where I discovered the useful effects of trampoline parks on children (they’re so tired afterwards!), was given a geometric painting as a birthday present (yay triangles!), and practiced my Sichuan Wonton construction skills. Oh, and I saw all of Flash Gordon for the first time. Richard O’Brien was better in Jubilee, I think. As for the rest of it, my goodness, there were some awful films produced in the wake of Star Wars.

And finally, I got promoted! I’m now a Lead Consultant at Mammoth Data. I now consult in a leading way on all the Big Data things! Perhaps.

  1. Yes, yes, there might be some waiting due to it being a non-urgent scan, but I wouldn’t have had that conversation, and I would have gone to the doctor earlier anyhow. So there.
