All Things Open & Spark

I was going to open this week’s blog by pointing out that my first-ever conference talk will be on Tuesday at All Things Open. But then I realized that it’s actually my second conference talk, as ten years ago, I spoke at LinuxTag 2005 about DVD Authoring in Linux. 2005 was an interesting year, looking back. But maybe more on that another time.

Tuesday, then. I’ll be talking about debugging and tuning Spark Streaming pipelines, and when to throw them all away and just do it in Storm instead. There will also be pictures, if that helps.

Meanwhile, the house is full! Dad, Andy, and Ray are here and tomorrow will be starting the epic week-long attempt to add a shower and see if I can rack up frequent flyer bonuses at Home Depot (they’re doing well on that front so far!). Things currently look optimistic, but we’ll see how that mood survives while I’m out tomorrow and it’s tested by experience. Hundred-year-old houses are hard! Let’s go shopping!

36 Hour Ribs!

In contrast to last weekend, I have spent much less time this weekend trying to find an escape route through flooded roads. Instead, I went to the mall and made ribs! I’m sure both of those subjects will come up on the citizenship exam if I ever choose to take it.

(36 hour ribs came via Serious Eats and were as good as you’d expect. I cheated by using a Cackalacky Cheerwine BBQ sauce instead of my own, but hey, I had been cooking ribs for almost two days, so I feel like I should get something of a pass there)

I feel like I have something to write about this mess, but now that Brooks has decided to change the name (a good thing!), maybe it’s time to let those embers cool for a bit. Having said that, obviously my upcoming restaurant will be called “Paul Morley Once Commented On My Blog”.

Meanwhile, I have a bathroom to pack up! My dad, uncle, and my uncle’s brother will be joining me here at the House of Pi for a week for a spot of bathroom remodelling and seeing the sights of Durham. Hurrah!

Escape From Columbia

The title of today’s post is misleading, as I’m currently writing it in Columbia, instead of in Durham (and I do hope the house is still standing, given that I haven’t seen it for a few days).

As it turned out, Hurricane Joaquin did not hit directly, but it, combined with an assortment of other weather factors, formed a storm more potent than any recorded in SC history (at least 125 years, in fact). Which led to things like this greeting us on Sunday morning:

Flooding by Hard Scrabble Road 2015-10-04

A small problem, as that is the road that I normally take to get back from SC to Durham. But! Not quite as bad as it could be, as I don’t have to take that part of the road. So I turned right and headed off to my usual route. Until I came across a bunch of traffic bollards. Ha! I had the power of the iPhone! I would not be stopped! Until the alternate route was blocked off in the same manner. Then I got confused, drove aimlessly across a rather scary bridge (well, the bridge itself was not quite as scary as the fact that the water level either side of the bridge was getting rather close to the road), encountered a third blockade, and finally got a text saying that all the interstates and highways around Columbia were being closed.

At that point, I admitted defeat in my attempt to get to I-77 and headed back to Tammy and Robert’s.

(the annoying thing is that if I had just made it to I-77, I would have been fine. North is drier ground!)

time passes

Well, somebody didn’t hit ‘publish’ last night. I’m now back in Durham, under the dazzle blanket and reading about ALL THE BRUTALIST THINGS. It was touch and go for a bit, but we scouted the area this morning and found two entrances to I-77 open. As predicted, once I managed to hit the Interstate, things got a lot easier, but I’m still quite tired after getting back into Durham tonight. Thankfully, the training session I was supposed to lead today got rescheduled for Wednesday, so I can ease back into things. Meanwhile, as ever, a big thank-you to Tammy and Robert for putting me up for another night, and I hope everybody in Columbia stays safe!

Troubleshooting Apache Spark — GZ Splitting Woes!

You’ve spent hours, no, days getting your Spark cluster up and running (maybe it’s on YARN, maybe Mesos, or even just standalone). It’s a shiny thing with over 50 instances running in AWS, burning money at a furious pace. You run your first ETL job and…

wait, it’s only using ONE EXECUTOR? WHY IN HEAVEN’S NAME $#$%$%@$%^?

You start messing around with spark.default.parallelism, going so far as adding explicit parallelism parameters in all your RDD calls…and yet…nothing.

So here’s what could be going wrong. When you make an RDD, Spark splits the RDD into partitions based on either the spark.default.parallelism value, or the user-supplied value:

(spark.default.parallelism = 2)


scala> sc.parallelize(Seq(1,2,3)).toDebugString
res12: String = (2) ParallelCollectionRDD[4] at parallelize at <console>:22 []

scala> sc.parallelize(Seq(1,2,3), 200).toDebugString
res13: String = (200) ParallelCollectionRDD[5] at parallelize at <console>:22 []

Let’s load in a text file!


scala> sc.textFile("CHANGES.txt",200).toDebugString
res15: String = 
(200) MapPartitionsRDD[9] at textFile at <console>:22 []
  |   CHANGES.txt HadoopRDD[8] at textFile at <console>:22 []

Well, that looks fine. But maybe you’re not loading in a simple text file. Maybe, just maybe, you’re storing your input files in a compressed manner (perhaps to save space on S3).


scala> sc.textFile("CHANGES.txt",200).toDebugString
res15: String = 
(200) MapPartitionsRDD[9] at textFile at <console>:22 []
  |   CHANGES.txt HadoopRDD[8] at textFile at <console>:22 []

scala> sc.textFile("C.gz",200).toDebugString
(1) MapPartitionsRDD[13] at textFile at <console>:22 []
  |   C.gz HadoopRDD[12] at textFile at <console>:22 []

Eh? We’ve asked for 200 partitions, and we’ve only got one. What’s wrong? Well, you’ve probably already guessed - Spark can split a text file into a bunch of lines with no trouble, and the cores can operate on those bunches separately. But it’s not magic; it can’t split a GZ file and somehow magically decompress chunks of binary. Hence one partition, one core.
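If you want to see the underlying problem without a cluster in the way, here is a minimal sketch using only the JDK’s gzip classes (no Spark involved). The object and method names are made up for this illustration. It compresses some lines in memory, then tries to start decompressing from an offset other than the beginning, which is roughly what a second partition would have to do:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source
import scala.util.Try

// Hypothetical demo object; nothing here is part of Spark's API.
object GzipSplitDemo {

  // Compress a string into an in-memory gzip stream.
  def gzip(text: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    out.write(text.getBytes("UTF-8"))
    out.close() // writes the gzip trailer
    buffer.toByteArray
  }

  // Attempt to decompress starting at `offset` bytes into the stream.
  // Offset 0 is a normal read; anything else skips the gzip header,
  // so the reader has no valid starting point.
  def gunzipFrom(bytes: Array[Byte], offset: Int): Try[String] = Try {
    val slice = new ByteArrayInputStream(bytes, offset, bytes.length - offset)
    val in = new GZIPInputStream(slice)
    try Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  }
}
```

Calling `GzipSplitDemo.gunzipFrom(bytes, 0)` hands back the original text, while starting even one byte in comes back as a `Failure`. A plain text file has no such problem, because a reader dropped into the middle can simply skip ahead to the next newline.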

Is there anything that can be done? Well, you could look into storing the files uncompressed. It’s the easiest solution! But not always the most practical. Or you could take a look at using LZO compression, which is block-based and so, once the files are indexed, can be split across a cluster (at the cost of less compression overall). But maybe you can’t control the choice of compression, either. Enter repartition:


scala>  sc.textFile("C.gz").repartition(100).toDebugString
res38: String = 
(100) MapPartitionsRDD[43] at repartition at <console>:22 []
  |   CoalescedRDD[42] at repartition at <console>:22 []
  |   ShuffledRDD[41] at repartition at <console>:22 []
  +-(1) MapPartitionsRDD[40] at repartition at <console>:22 []
     |  MapPartitionsRDD[39] at textFile at <console>:22 []
     |  C.gz HadoopRDD[38] at textFile at <console>:22 []

Of course, you’ve spotted the ShuffledRDD in the lineage above. After all, if you want the information to be spread around the cluster…you are actually going to have to spread it around the cluster. This is likely to cause a performance hit at the start of a Spark job, but this overhead may be worth incurring if the rest of the processing is distributed more evenly. Test, evaluate, and take your fancy!

Press Play

Damn. This is good.

Have You Ever Been Afraid of an Album?

A new album by a group I really like always has a moment of trepidation before I press play. What if it’s rubbish? Of course, this can be mitigated over the years if you follow a band that manages to plumb deeper depths with every album following their first two (look, even I stopped buying Oasis albums in the end), but most of the time, a new album is a slightly unnerving time.

And I don’t know why! It’s not really my fault if I don’t like a new release by a band I like! Why do I feel as if I’ve failed? And then there’s the kidding yourself part where you try to like a new album and only admit the truth a year or so later.

(Yes, Be Here Now, take a bow)

Music Complete, then. I still haven’t listened to it, mainly because Warner sent me a bunch of WAV files and I can’t be arsed to do all the metadata, so I’m waiting for the CD to turn up. But on the face of it, this is a worrying prospect. Get Ready and Waiting For The Sirens’ Call had a few good songs spread between them, but it was clear that the Imperial Phase was well and truly over. And then Hook left in acrimony.

But.

You see, while I’m saddened that the band isn’t together any more, I’ve always felt a bit…annoyed that Gillian’s absence from the band never generated as much comment as Hook’s has. This is partly because she didn’t leave in a big bust-up, but I think a lot of it has to do with “well, she’s only a woman and wasn’t even in Joy Division. What does she even bring to the band?”1 Now she’s back, and from reports, she and Stephen had a much larger hand in writing this album than the previous two.

I WANT TO BELIEVE.

But I still haven’t pressed play.


  1. I have a variation on this argument where I prove that Hole are and always were a better band than Nirvana ever were. With graphs and everything. ↩︎

All Apache Spark All The Time

If you’re looking for slides of my Introduction To Spark talk, you can find them here.

This week has felt like being in a holding pattern. Lots of things may be happening in the coming weeks and months, but they may not. Meanwhile, there’s so much archive BBC/ITV on YouTube to get through…

Oh, and I guess, it’s probably worth pointing out that I killed barbelith.com this week. Well, I didn’t hold the gun, but did point out to Tom that the site appeared to have been hacked. It’s odd - I went there to reflect on how wonderful it is that large sections of the web from decades ago still exist…only to be instrumental in why it now just returns 403 Forbidden. Oops.

(but to be fair, somebody else would have noticed eventually, and in the case of a data breach, better to know sooner rather than later)

Right, back to watching an ancient Ruth Rendell episode where Peter Capaldi seems to be playing Bob Dylan by way of Glasgow.

Brown Butter Makes Everything Better

Brown butter cake is a thing, and a glorious thing.

Two Rooms & A Boom is very short with six people, and I removed myself for fear of going Full Quinns.

In Resistance: Avalon, I did indeed go Full Quinns and forgot to deal in the Merlin card. Needless to say, as a spy, I was an impressively good Minion of Mordred that round.

The Hot Take You've Been Waiting For

So, Jeremy Corbyn.

I…don’t know. Don’t know. I’m still reeling from the audacity of his victory; it wasn’t a sneaking past the post on second or third preferences, it wasn’t a tidal wave of entryism. It was the Labour Party en masse reaching up to the PLP and saying: “You lot. Sod off.”

And the scale of that victory will buy him some time. Even the PLP isn’t insane enough to start a coup before the May elections (note: you’ll never go broke betting on the insanity of the Blairite factions, but I think even they have to realize they can only go to decapitation if May is a disaster).

I think the Tories may soon discover that their traditional ‘slander early and often’ approach may not work with a leader that doesn’t believe in focus groups or PR in general. And that there may be a Johnson/Farage effect as Corbyn gets TV time talking like a human being instead of the usual political contortions.

Having said that, there’s a huge amount on Corbyn’s past that we’re going to be hearing, and not all of it is going to be good. And yes, some of what the Tories are going to drag out is going to be distortion. Sound and fury designed to win the spin cycle by selectively quoting old bits of Hansard. But let’s not kid ourselves either; some of those quotes are not going to be edited. In the tradition of the UK Left, there will be reflexive anti-Americanism, including support of regimes that are neither democratic nor humane. And yes, the current Government will be hypocrites when they tweet these things, but that will not automatically absolve Corbyn, and it will not be mentioned on the BBC news reports.

And then there’s the electoral calculus. Two of the reasons Miliband put in such a dismal showing were the collapse of Labour to the SNP in Scotland, and the slaughtering of the Lib Dems in the South West by the Tories (in addition, obviously, to not exactly lighting fires elsewhere). If Scottish Labour plays ball with the new management in Westminster, I can see some SNP seats returning to Labour. But not as many as they once had. I don’t see much conversion in the South West, though. Maybe breakthroughs elsewhere will be enough?

Whatever happens, there’s no longer any hiding for the Left. For years, we’ve hidden behind what-ifs, and if-onlys. We have everything we’ve said we’ve always wanted: a proper left-wing leader of the Labour Party, somebody who believes in the ideals of Hardie, Attlee, and Bevan. No more excuses, no more half-hearted attempts. The Great Experiment begins…

Bank Holiday Monday Till I Die

You can call it Labor Day if you must, but this is a Bank Holiday Monday Weekend, and therefore there are rules:

  1. It must rain at some point during the weekend.
  2. You must watch some very old television (I made it through five hours of Election ‘92 before getting too tired and depressed to continue)
  3. You should have a plan on how you’re going to fill this magical additional time, and then spend at least one of the days doing nothing.
  4. TV Specials! Thankfully, due to an active VPN connection and iPlayer, I was able to fulfill this criterion by watching the Harry and Paul special and also finding the first episode of Cradle To Grave (who knew that there were so many people interested in writing about life on British council estates in the early 70s? (the count is now two)).
  5. Hook must be watched, and yet again, discussions need to be had over the ickiness of the Peter/Wendy/Moira plotline, and imagining the fun therapy sessions that likely resulted afterwards.
  6. You must plan to do some gardening (mainly cutting back those bushes that just grow and grow and do not do anything useful), but end up spending all day in your pyjamas instead.
  7. Why not do some DIY? I fulfilled this one by ordering a new shower and tile. Actually doing the DIY is beyond me, so I’m importing family to do it.
  8. Plan out all the additional learning you’re going to do…did I mention the part where I stayed in pyjamas all day?
  9. Be thankful that living in America means that there’s one more holiday before Christmas. Go Thanksgiving! (also, it means re-watching Addams Family Values!)
  10. It’s officially Christmas season. Play some Slade.