My Week In Lists

My Week In Lists (extended disco remix)

  1. The Blue Note Grill is quite nice!
  2. Three hours around Home Depot is about two beyond my limit.
  3. All Things Open is basically a giant Triangle Tech get together, isn’t it?
  4. Cambodia apparently has British plugs (among others).
  5. I don’t think I’ve slept properly for four weeks now.
  6. You’re not going to win a one-upping contest by referencing your Salford grandmother in front of two people from Liverpool. (the server in Beasley’s was not on a winning streak that night, sadly)
  7. Giving directions and trying to help everybody navigate an unfamiliar menu goes so much better when you haven’t had your pupils dilated.
  8. (But Taco Nacos is pretty awesome and you should go there)
  9. On a similar front, True Flavors was pretty good, too!
  10. Friday night in Durham was strangely subdued - a count of five in James Joyce at times.
  11. I now have three working showers! Heavens above!
  12. You can have a lot of fun with a clown mask in the middle of the day.
  13. It took me four years to find the local dump.
  14. I forgot to take people to Monuts!
  15. It’s a bit quieter in the house tonight…

All Things Open & Spark

I was going to open this week’s blog by pointing out that my first-ever conference talk will be on Tuesday at All Things Open. But then I realized that it’s actually my second conference talk: ten years ago, I spoke at LinuxTag 2005 about DVD Authoring in Linux. 2005 was an interesting year, looking back. But maybe more on that another time.

Tuesday, then. I’ll be talking about debugging and tuning Spark Streaming pipelines, and when to throw them all away and just do it in Storm instead. There will also be pictures, if that helps.

Meanwhile, the house is full! Dad, Andy, and Ray are here and tomorrow will be starting the epic week-long attempt to add a shower and see if I can rack up frequent flyer bonuses at Home Depot (they’re doing well on that front so far!). Things currently look optimistic, but we’ll see how that mood survives while I’m out tomorrow and it’s tested by experience. Hundred-year-old houses are hard! Let’s go shopping!

36 Hour Ribs!

In contrast to last weekend, I have spent much less time this weekend trying to find an escape route through flooded roads. Instead, I went to the mall and made ribs! I’m sure both of those subjects will come up on the citizenship exam if I ever choose to take it.

(The 36-hour ribs came via Serious Eats and were as good as you’d expect. I cheated by using a Cackalacky Cheerwine BBQ sauce instead of my own, but hey, I had been cooking ribs for almost two days, so I feel like I should get something of a pass there.)

I feel like I have something to write about this mess, but now that Brooks has decided to change the name (a good thing!), maybe it’s time to let those embers cool for a bit. Having said that, obviously my upcoming restaurant will be called “Paul Morley Once Commented On My Blog”.

Meanwhile, I have a bathroom to pack up! My dad, uncle, and my uncle’s brother will be joining me here at the House of Pi for a week for a spot of bathroom remodelling and seeing the sights of Durham. Hurrah!

Escape From Columbia

The title of today’s post is misleading, as I’m currently writing it in Columbia, instead of in Durham (and I do hope the house is still standing, given that I haven’t seen it for a few days).

As it turned out, Hurricane Joaquin did not hit directly, but it, combined with an assortment of other weather factors, formed a storm more potent than any recorded in SC history (at least 125 years, in fact). Which led to things like this greeting us on Sunday morning:

Flooding by Hard Scrabble Road 2015-10-04

A small problem, as that is the road that I normally take to get back from SC to Durham. But! Not quite as bad as it could be, as I don’t have to take that part of the road. So I turned right and headed off to my usual route. Until I came across a bunch of traffic bollards. Ha! I had the power of the iPhone! I would not be stopped! Until the alternate route was blocked off in the same manner. Then I got confused, drove aimlessly across a rather scary bridge (well, the bridge itself was not quite as scary as the fact that the water level either side of the bridge was getting rather close to the road), encountered a third blockade, and finally got a text saying that all the interstates and highways around Columbia were being closed.

At that point, I admitted defeat in my attempt to get to I-77 and headed back to Tammy and Robert’s.

(the annoying thing is that if I had just made it to I-77, I would have been fine. North is drier ground!)

time passes

Well, somebody didn’t hit ‘publish’ last night. I’m now back in Durham, under the dazzle blanket and reading about ALL THE BRUTALIST THINGS. It was touch and go for a bit, but we scouted the area this morning and found two entrances to I-77 open. As predicted, once I managed to hit the Interstate, things got a lot easier, but I was still quite tired after getting back into Durham tonight. Thankfully, the training session I was supposed to lead today got rescheduled for Wednesday, so I can ease back into things. Meanwhile, as ever, a big thank-you to Tammy and Robert for putting me up for another night, and I hope everybody in Columbia stays safe!

Troubleshooting Apache Spark — GZ Splitting Woes!

You’ve spent hours, no, days getting your Spark cluster up and running (maybe it’s on YARN, maybe Mesos, or even just standalone). It’s a shiny thing with over 50 instances running in AWS, burning money at a furious pace. You run your first ETL job and…

wait, it’s only using ONE EXECUTOR? WHY IN HEAVEN’S NAME $#$%$%@$%^?

You start messing around with spark.default.parallelism, going so far as to add explicit parallelism parameters to all your RDD calls…and yet…nothing.

So here’s what could be going wrong. When you make an RDD, Spark splits the RDD into partitions based on either the spark.default.parallelism value or a user-supplied value:

(spark.default.parallelism = 2)

scala> sc.parallelize(Seq(1,2,3)).toDebugString
res12: String = (2) ParallelCollectionRDD[4] at parallelize at <console>:22 []

scala> sc.parallelize(Seq(1,2,3), 200).toDebugString
res13: String = (200) ParallelCollectionRDD[5] at parallelize at <console>:22 []

Let’s load in a text file!

scala> sc.textFile("CHANGES.txt",200).toDebugString
res15: String = 
(200) MapPartitionsRDD[9] at textFile at <console>:22 []
  |   CHANGES.txt HadoopRDD[8] at textFile at <console>:22 []

Well, that looks fine. But maybe you’re not loading in a simple text file. Maybe, just maybe, you’re storing your input files in a compressed manner (perhaps to save space on S3).

scala> sc.textFile("C.gz",200).toDebugString
(1) MapPartitionsRDD[13] at textFile at <console>:22 []
 |  C.gz HadoopRDD[12] at textFile at <console>:22 []

Eh? We’ve asked for 200 partitions, and we’ve only got one. What’s wrong? Well, you’ve probably already guessed: Spark can split a text file into bunches of lines with no trouble, and the cores can operate on those bunches separately. But it’s not magic; it can’t split a GZ file and somehow magically decompress chunks of binary. Hence one partition, one core.
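If you want to see why for yourself, here’s a quick plain-Scala sketch using only the JDK’s java.util.zip (no Spark needed; the file names and contents are made up for illustration). A gzip stream has a single header at byte 0, so decompression can only start at the beginning of the file — there’s no valid entry point in the middle of the file for a second executor to jump in at:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream, ZipException}

// Build a gzipped blob in memory (this stands in for our C.gz).
val out = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(out)
gz.write(("some log line\n" * 1000).getBytes("UTF-8"))
gz.close()
val compressed = out.toByteArray

// Reading from byte 0 works: the single gzip header sits at the start.
val whole = new GZIPInputStream(new ByteArrayInputStream(compressed))
val decompressed = scala.io.Source.fromInputStream(whole, "UTF-8").mkString
assert(decompressed.length == 14 * 1000) // all 1000 lines come back

// Reading from the middle of the file -- which is what a second
// partition would have to do -- fails immediately, because there is
// no gzip header (or any resync point) at an arbitrary byte offset.
val secondHalf = compressed.drop(compressed.length / 2)
val midReadFailed =
  try { new GZIPInputStream(new ByteArrayInputStream(secondHalf)); false }
  catch { case _: ZipException => true }
assert(midReadFailed)
```

Contrast that with an uncompressed text file, where a reader dropped at any byte offset can just scan forward to the next newline and start working — which is exactly how Hadoop input splits carve a plain text file into many partitions.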

Is there anything that can be done? Well, you could look into storing the files uncompressed. It’s the easiest solution! But not always the most practical. Or you could take a look at using LZO compression, which can be split across a cluster once the files are indexed (at the cost of a lower compression ratio). But maybe you can’t control the choice of compression, either. Enter repartition:

scala> sc.textFile("C.gz").repartition(100).toDebugString
res38: String = 
(100) MapPartitionsRDD[43] at repartition at <console>:22 []
  |   CoalescedRDD[42] at repartition at <console>:22 []
  |   ShuffledRDD[41] at repartition at <console>:22 []
  +-(1) MapPartitionsRDD[40] at repartition at <console>:22 []
     |  MapPartitionsRDD[39] at textFile at <console>:22 []
     |  C.gz HadoopRDD[38] at textFile at <console>:22 []

Of course, you’ve spotted the ShuffledRDD in the lineage above. After all, if you want the information to be spread around the cluster…you are actually going to have to spread it around the cluster. This is likely to cause a performance hit at the start of a Spark job, but this overhead may be worth incurring if the rest of the processing is distributed more evenly. Test, evaluate, and take your fancy!

Press Play

Damn. This is good.

Have You Ever Been Afraid of an Album?

A new album by a group I really like always comes with a moment of trepidation before I press play. What if it’s rubbish? Of course, this can be mitigated over the years if you follow a band that manages to plumb deeper depths with every album following their first two (look, even I stopped buying Oasis albums in the end), but most of the time, a new album is a slightly unnerving time.

And I don’t know why! It’s not really my fault if I don’t like a new release by a band I like! Why do I feel as if I’ve failed? And then there’s the kidding yourself part where you try to like a new album and only admit the truth a year or so later.

(Yes, Be Here Now, take a bow)

Music Complete, then. I still haven’t listened to it, mainly because Warner sent me a bunch of WAV files and I can’t be arsed to do all the metadata, so I’m waiting for the CD to turn up. But on the face of it, this is a worrying prospect. Get Ready and Waiting For The Sirens’ Call had a few good songs spread between them, but it was clear that the Imperial Phase was well and truly over. And then Hook left in acrimony.


You see, while I’m saddened that the band isn’t together any more, I’ve always felt a bit…annoyed that Gillian’s absence from the band never generated as much comment as Hook’s has. This is partly because she didn’t leave in a big bust-up, but I think a lot of it has to do with “well, she’s only a woman and wasn’t even in Joy Division. What does she even bring to the band?”1 Now she’s back, and from reports, she and Stephen had a much larger hand in writing this album than the previous two.


But I still haven’t pressed play.

  1. I have a variation on this argument where I prove that Hole are and always were a better band than Nirvana ever were. With graphs and everything. [return]

All Apache Spark All The Time

If you’re looking for slides of my Introduction To Spark talk, you can find them here.

This week has felt like being in a holding pattern. Lots of things may be happening in the coming weeks and months, but they may not. Meanwhile, there’s so much archive BBC/ITV on YouTube to get through…

Oh, and I guess it’s probably worth pointing out that I killed this week. Well, I didn’t hold the gun, but I did point out to Tom that the site appeared to have been hacked. It’s odd - I went there to reflect on how wonderful it is that large sections of the web from decades ago still exist…only to be instrumental in why it now just returns 403 Forbidden. Oops.

(but to be fair, somebody else would have noticed eventually, and in the case of a data breach, better to know sooner rather than later)

Right, back to watching an ancient Ruth Rendell episode where Peter Capaldi seems to be playing Bob Dylan by way of Glasgow.

Brown Butter Makes Everything Better

Brown butter cake is a thing, and a glorious thing.

Two Rooms & A Boom is very short with six people, and I removed myself for fear of going Full Quinns.

In Resistance: Avalon, I did indeed go Full Quinns and forgot to deal in the Merlin card. Needless to say, as a spy, I was an impressively good Minion of Mordred that round.