Troubleshooting Apache Spark — GZ Splitting Woes!

You’ve spent hours, no, days getting your Spark cluster up and running (maybe it’s on YARN, maybe Mesos, or even just standalone). It’s a shiny thing with over 50 instances running in AWS, burning money at a furious pace. You run your first ETL job and…

wait, it’s only using ONE EXECUTOR? WHY IN HEAVEN’S NAME $#$%$%@$%^?

You start messing around with spark.default.parallelism, going so far as adding explicit parallelism parameters in all your RDD calls…and yet…nothing.

So here’s what could be going wrong. When you make an RDD, Spark splits the RDD into partitions based on either the spark.default.parallelism value, or the user-supplied value:

(spark.default.parallelism = 2)

scala> sc.parallelize(Seq(1,2,3)).toDebugString
res12: String = (2) ParallelCollectionRDD[4] at parallelize at :22 []

scala> sc.parallelize(Seq(1,2,3), 200).toDebugString
res13: String = (200) ParallelCollectionRDD[5] at parallelize at :22 []

Let’s load in a text file!

scala> sc.textFile("CHANGES.txt",200).toDebugString
res15: String = 
(200) MapPartitionsRDD[9] at textFile at :22 []
  |   CHANGES.txt HadoopRDD[8] at textFile at :22 []

Well, that looks fine. But maybe you’re not loading in a simple text file. Maybe, just maybe, you’re storing your input files in a compressed manner (perhaps to save space on S3).

scala> sc.textFile("C.gz",200).toDebugString
(1) MapPartitionsRDD[13] at textFile at :22 []
 |  C.gz HadoopRDD[12] at textFile at :22 []

Eh? We’ve asked for 200 partitions, and we’ve only got one. What’s wrong? Well, you’ve probably already guessed - Spark can split a text file into a bunch of lines with no trouble, and the cores can operate on those bunches separately. But it’s not magic; it can’t split a GZ file and somehow magically decompress chunks of binary. Hence one partition, one core.
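If you'd like to see the problem without a cluster, here's a minimal sketch using only the JDK's java.util.zip classes (no Spark involved). The sample text and the names `text`, `compressed`, `fromStart` and `fromMiddle` are made up for illustration. It gzips some lines in memory, then tries to start decompressing once from byte 0 and once from halfway through. The second attempt should almost certainly fail, because there is no gzip header in the middle of the stream for the decoder to pick up.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.util.Try

// Build some line-oriented text and gzip it in memory (illustrative data only).
val text = (1 to 10000).map(i => s"line $i").mkString("\n")
val buffer = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(buffer)
gz.write(text.getBytes("UTF-8"))
gz.close()
val compressed = buffer.toByteArray

// Starting at byte 0 works: the gzip header is where the decoder expects it.
val fromStart = Try(new GZIPInputStream(new ByteArrayInputStream(compressed)).read())

// Starting halfway through: the decoder looks for a header and finds arbitrary bytes.
val half = compressed.length / 2
val fromMiddle = Try(
  new GZIPInputStream(new ByteArrayInputStream(compressed, half, compressed.length - half)).read()
)

println(s"decode from start succeeded:  ${fromStart.isSuccess}")
println(s"decode from middle succeeded: ${fromMiddle.isSuccess}")
```

A plain text file doesn't have this problem: a reader dropped into the middle can skip to the next newline and carry on from there, which is why the uncompressed CHANGES.txt splits happily.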

Is there anything that can be done? Well, you could look into storing the files uncompressed. It’s the easiest solution! But not always the most practical. Or you could take a look at using LZO compression, which can be indexed so that it can be split across a cluster (at the cost of less compression overall). But maybe you can’t control the choice of compression, either. Enter repartition:

scala>  sc.textFile("C.gz").repartition(100).toDebugString
res38: String = 
(100) MapPartitionsRDD[43] at repartition at :22 []
  |   CoalescedRDD[42] at repartition at :22 []
  |   ShuffledRDD[41] at repartition at :22 []
  +-(1) MapPartitionsRDD[40] at repartition at :22 []
     |  MapPartitionsRDD[39] at textFile at :22 []
     |  C.gz HadoopRDD[38] at textFile at :22 []

Of course, you’ve spotted the ShuffledRDD in the lineage above. After all, if you want the information to be spread around the cluster…you are actually going to have to spread it around the cluster. This is likely to cause a performance hit at the start of a Spark job, but this overhead may be worth incurring if the rest of the processing is distributed more evenly. Test, evaluate, and take your fancy!
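For the "test, evaluate" part, here's a rough sketch of how you might check what repartition actually did. It assumes a spark-shell session (so `sc` already exists) and the same C.gz file as above. The names `raw`, `spread` and `sizes` are mine, and 100 is just the number from the earlier example, not a recommendation.

```scala
// Assumes spark-shell, where `sc` is already defined, and a local file called C.gz.
val raw = sc.textFile("C.gz")
println(s"partitions before: ${raw.partitions.length}")

// repartition shuffles the lines out across the requested number of partitions.
val spread = raw.repartition(100)
println(s"partitions after:  ${spread.partitions.length}")

// Count the lines that landed in each partition to see how even the spread is.
val sizes = spread.mapPartitions(lines => Iterator(lines.size)).collect()
println(s"smallest partition: ${sizes.min} lines, largest: ${sizes.max} lines")
```

Bear in mind that the single-partition read of the GZ file still happens on one core. The shuffle only helps the stages that come after it.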

Press Play

Damn. This is good.

Have You Ever Been Afraid of an Album?

A new album by a group I really like always brings a moment of trepidation before I press play. What if it’s rubbish? Of course, this can be mitigated over the years if you follow a band that manages to plumb deeper depths with every album following their first two (look, even I stopped buying Oasis albums in the end), but most of the time, a new album is a slightly unnerving prospect.

And I don’t know why! It’s not really my fault if I don’t like a new release by a band I like! Why do I feel as if I’ve failed? And then there’s the kidding yourself part where you try to like a new album and only admit the truth a year or so later.

(Yes, Be Here Now, take a bow)

Music Complete, then. I still haven’t listened to it, mainly because Warner sent me a bunch of WAV files and I can’t be arsed to do all the metadata, so I’m waiting for the CD to turn up. But on the face of it, this is a worrying prospect. Get Ready and Waiting For The Sirens’ Call had a few good songs spread between them, but it was clear that the Imperial Phase was well and truly over. And then Hook left in acrimony.


You see, while I’m saddened that the band isn’t together any more, I’ve always felt a bit…annoyed that Gillian’s absence from the band never generated as much comment as Hook’s has. This is partly because she didn’t leave in a big bust-up, but I think a lot of it has to do with “well, she’s only a woman and wasn’t even in Joy Division. What does she even bring to the band?”1 Now she’s back, and from reports, she and Stephen had a much larger hand in writing this album than the previous two.


But I still haven’t pressed play.

  1. I have a variation on this argument where I prove that Hole are and always were a better band than Nirvana ever were. With graphs and everything.

All Apache Spark All The Time

If you’re looking for slides of my Introduction To Spark talk, you can find them here.

This week has felt like being in a holding pattern. Lots of things may be happening in the coming weeks and months, but they may not. Meanwhile, there’s so much archive BBC/ITV on YouTube to get through…

Oh, and I guess, it’s probably worth pointing out that I killed this week. Well, I didn’t hold the gun, but did point out to Tom that the site appeared to have been hacked. It’s odd - I went there to reflect on how wonderful it is that large sections of the web from decades ago still exist…only to be instrumental in why it now just returns 403 Forbidden. Oops.

(but to be fair, somebody else would have noticed eventually, and in the case of a data breach, better to know sooner rather than later)

Right, back to watching an ancient Ruth Rendell episode where Peter Capaldi seems to be playing Bob Dylan by way of Glasgow.

Brown Butter Makes Everything Better

Brown butter cake is a thing, and a glorious thing.

Two Rooms & A Boom is very short with six people, and I removed myself for fear of going Full Quinns.

In Resistance: Avalon, I did indeed go Full Quinns and forgot to deal in the Merlin card. Needless to say, as a spy, I was an impressively good Minion of Mordred that round.

The Hot Take You've Been Waiting For

So, Jeremy Corbyn.

I…don’t know. Don’t know. I’m still reeling from the audacity of his victory; it wasn’t a sneaking past the post on second or third preferences, it wasn’t a tidal wave of entryism. It was the Labour Party en masse reaching up to the PLP and saying: “You lot. Sod off.”

And the scale of that victory will buy him some time. Even the PLP isn’t insane enough to start a coup before the May elections (note: you’ll never go broke betting on the insanity of the Blairite factions, but I think even they have to realize they can only go to decapitation if May is a disaster).

I think the Tories may soon discover that their traditional ‘slander early and often’ approach may not work with a leader that doesn’t believe in focus groups or PR in general. And that there may be a Johnson/Farage effect as Corbyn gets TV time talking like a human being instead of the usual political contortions.

Having said that, there’s a huge amount on Corbyn’s past that we’re going to be hearing, and not all of it is going to be good. And yes, some of what the Tories are going to drag out is going to be distortion. Sound and fury designed to win the spin cycle by selectively quoting old bits of Hansard. But let’s not kid ourselves either; some of those quotes are not going to be edited. In the tradition of the UK Left, there will be reflexive anti-Americanism, including support of regimes that are neither democratic nor humane. And yes, the current Government will be hypocrites when they tweet these things, but that will not automatically absolve Corbyn, and it will not be mentioned on the BBC news reports.

And then there’s the electoral calculus. Two of the reasons Miliband put in such a dismal showing were the collapse of Labour to the SNP in Scotland, and the slaughtering of the Lib Dems in the South West by the Tories (in addition, obviously, to not exactly lighting fires elsewhere). If Scottish Labour plays ball with the new management in Westminster, I can see some SNP seats returning to Labour. But not as many as they once had. I don’t see much conversion in the South West, though. Maybe breakthroughs elsewhere will be enough?

Whatever happens, there’s no longer any hiding for the Left. For years, we’ve hidden behind what-ifs, and if-onlys. We have everything we’ve said we’ve always wanted: a proper left-wing leader of the Labour Party, somebody who believes in the ideals of Hardie, Attlee, and Bevan. No more excuses, no more half-hearted attempts. The Great Experiment begins…

Bank Holiday Monday Till I Die

You can call it Labor Day if you must, but this is a Bank Holiday Monday Weekend, and therefore there are rules:

  1. It must rain at some point during the weekend.
  2. You must watch some very old television (I made it through five hours of Election ‘92 before getting too tired and depressed to continue)
  3. You should have a plan on how you’re going to fill this magical additional time, and then spend at least one of the days doing nothing.
  4. TV Specials! Thankfully, due to an active VPN connection and iPlayer, I was able to fulfill this criterion by watching the Harry and Paul special and also finding the first episode of Cradle To Grave (who knew that there were so many people interested in writing about life in British council estates in the early 70s? (the count is now two)).
  5. Hook must be watched, and yet again, discussions need to be had over the ickiness of the Peter/Wendy/Moira plotline, and imagining the fun therapy sessions that likely resulted afterwards.
  6. You must plan to do some gardening (mainly cutting back those bushes that just grow and grow and do not do anything useful), but end up spending all day in your pyjamas instead.
  7. Why not do some DIY? I fulfilled this one by ordering a new shower and tile. Actually doing the DIY is beyond me, so I’m importing family to do it.
  8. Plan out all the additional learning you’re going to do…did I mention the part where I stayed in pyjamas all day?
  9. Be thankful that living in America means that there’s one more holiday before Christmas. Go Thanksgiving! (also, it means re-watching Addams Family Values again!)
  10. It’s officially Christmas season. Play some Slade.

Foot Update!

I had the CT scan! I managed to get lost twice in the hospital car park (once in the car, once on foot, oh, and I lost the car when I tried to find it on my way out, so maybe that’s three times), but I finally got it done. And the results?

Well, I do have a problem with my foot. But an operation is not on the agenda yet. Firstly, cortisone injections, but even those aren’t happening until I have another couple of days of problems. I’m told that there’s little danger of permanent damage in the meantime, so at least there’s that!

In other news, 12-hour sous vide caramelized white chocolate ganache is pretty awesome.

In other, other news, I had my first game night at my house! Explaining Suburbia is fun.

Still Ill

After almost passing out on Monday, I had been getting better…until Friday, when I regressed to how I was on Monday. I can definitely say I’m sick of it now. I suppose I haven’t been helping my mood by spending the last week and a half watching all of The World At War. Nothing quite like twenty-six hours of war to keep your spirits up.

I’m now finishing up a re-watch of Gurren Lagann, which is a much better idea. Still, I would like to emerge from the blankets and the couch at some point…

The World At War

I am under a blanket with a Lemsip and watching the Russians defend Stalingrad. Feeling a little sorry for myself, it has to be said. The rubbish bins in the house are overflowing with tissues, and the chalkboard to-do list for this weekend looks down and mocks me. Maybe some of those tasks will get done next weekend. Maybe.

This week: Phonogram. In the previous series, I felt something of a connection, due to coming of age in the Britpop era, the shared love of Kenickie, the Dexys bits, the Johnny Boy part, and quoting all The Long Blondes lyrics. Obviously. The first issue of the new series was something else entirely. The Poptimism wars. ILX/CTCL. I actually had a conversation with Kieron back in 2006 which closely resembled one in this issue. Reading it back was an odd experience (I’m sure he had that same exchange with many people, mind you).

I fought on the periphery of the Poptimism wars. Most of my music writing was for Static (and is no longer present in their archives). My crowning achievements were probably talking to Paul Morley, having a record company complain to the editor about one of my reviews of a turgid American rock album, and of course, having my review quoted on the advertisement for Johnny Boy: “Karl Marx with a beat, Girls Aloud with C4 strapped to their chests”.

Right, time for another Lemsip and to put on Annie (from Norway).

(No ILX in-jokes here, or you’ll be Suggest Autobahned — Ed.)

Oh, before I go - I recommend that you go watch the Shut Up & Sit Down Gen Con special. Oh, Billy Cool, what you’re gonna do?