Troubleshooting Apache Spark — GZ Splitting Woes!

(spark, gz, gzip, partitions)

You’ve spent hours, no, days getting your Spark cluster up and running (maybe it’s on YARN, maybe Mesos, or even just standalone). It’s a shiny thing with over 50 instances running in AWS, burning money at a furious pace. You run your first ETL job and…

wait, it’s only using ONE EXECUTOR? WHY IN HEAVEN’S NAME $#$%$%@$%^?

You start messing around with spark.default.parallelism, going so far as adding explicit parallelism parameters in all your RDD calls…and yet…nothing.

So here’s what could be going wrong. When you make an RDD, Spark splits the RDD into partitions based on either the spark.default.parallelism value, or the user-supplied value:

(spark.default.parallelism = 2)

scala> sc.parallelize(Seq(1,2,3)).toDebugString
res12: String = (2) ParallelCollectionRDD[4] at parallelize at <console>:22 []

scala> sc.parallelize(Seq(1,2,3), 200).toDebugString
res13: String = (200) ParallelCollectionRDD[5] at parallelize at <console>:22 []

Let’s load in a text file!

scala> sc.textFile("CHANGES.txt",200).toDebugString
res15: String = 
(200) MapPartitionsRDD[9] at textFile at <console>:22 []
  |   CHANGES.txt HadoopRDD[8] at textFile at <console>:22 []

Well, that looks fine. But maybe you’re not loading in a simple text file. Maybe, just maybe, you’re storing your input files in a compressed manner (perhaps to save space on S3).

scala> sc.textFile("C.gz",200).toDebugString
(1) MapPartitionsRDD[13] at textFile at <console>:22 []
  |   C.gz HadoopRDD[12] at textFile at <console>:22 []

Eh? We’ve asked for 200 partitions, and we’ve only got one. What’s wrong? Well, you’ve probably already guessed - Spark can split a plain text file at line boundaries with no trouble, and the cores can work on those chunks separately. But it’s not magic; a GZ file is a single compressed stream that has to be read from the start, so Spark can’t jump into the middle and decompress a chunk of binary. Hence one partition, one core.
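You don’t need a cluster to see why. Here’s a minimal sketch using only the JDK’s java.util.zip classes (the object and helper names are made up for illustration, and this is not what Spark or Hadoop run internally): compress some lines, then try to decompress starting from the middle of the bytes, the way a second partition would have to.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source
import scala.util.Try

// Hypothetical helper object, for illustration only.
object GzipSplitDemo {
  // Compress a string into a single gzip stream held in memory.
  def gzip(text: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    out.write(text.getBytes("UTF-8"))
    out.close() // close() writes the gzip trailer
    buffer.toByteArray
  }

  // Try to decompress; any header, inflate, or checksum error becomes a Failure.
  def gunzip(bytes: Array[Byte]): Try[String] = Try {
    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  }

  val text: String = (1 to 1000).map(i => s"line $i").mkString("\n")
  val compressed: Array[Byte] = gzip(text)

  // What a "second partition" would see: bytes starting mid-stream.
  val secondHalf: Array[Byte] = compressed.drop(compressed.length / 2)
}
```

Calling GzipSplitDemo.gunzip on the whole of compressed gives the original text back, while calling it on secondHalf fails, because the gzip header and everything the decompressor needs to know about the earlier data are missing.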

Is there anything that can be done? Well, you could look into storing the files uncompressed. It’s the easiest solution! But not always the most practical. Or you could take a look at using LZO compression, which is block-based, so the files can be split across a cluster once they have been indexed (at the cost of less compression overall). But maybe you can’t control the choice of compression, either. Enter repartition:

scala>  sc.textFile("C.gz").repartition(100).toDebugString
res38: String = 
(100) MapPartitionsRDD[43] at repartition at <console>:22 []
  |   CoalescedRDD[42] at repartition at <console>:22 []
  |   ShuffledRDD[41] at repartition at <console>:22 []
  +-(1) MapPartitionsRDD[40] at repartition at <console>:22 []
     |  MapPartitionsRDD[39] at textFile at <console>:22 []
     |  C.gz HadoopRDD[38] at textFile at <console>:22 []

Of course, you’ve spotted the ShuffledRDD in the lineage above. After all, if you want the information to be spread around the cluster…you are actually going to have to spread it around the cluster. This is likely to cause a performance hit at the start of a Spark job, because the single partition still has to be read and decompressed by one core before the shuffle, but the overhead may be worth incurring if the rest of the processing is distributed more evenly. Test, evaluate, and take your pick!
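If it helps to picture what that shuffle is doing, here’s a toy model in plain Scala, with no Spark involved. It assumes records are simply dealt out to the target partitions in turn; the real repartition also picks a random starting partition and moves the data over the network, and the object and method names here are invented for the sketch.

```scala
// Toy model only: not Spark's implementation.
object RoundRobinRepartition {
  // Deal the records of one big partition out to numPartitions buckets in turn.
  def repartition[A](records: Seq[A], numPartitions: Int): Vector[Vector[A]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    // One builder per target partition.
    val buckets = Vector.fill(numPartitions)(Vector.newBuilder[A])
    records.zipWithIndex.foreach { case (record, i) =>
      buckets(i % numPartitions) += record
    }
    buckets.map(_.result())
  }
}
```

Dealing ten records into four partitions this way gives buckets of sizes 3, 3, 2 and 2: every record has moved exactly once, which is the cost, and every partition now has roughly the same amount of work, which is the payoff.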

Press Play

(new order, empty nest disco rave, music complete)

Damn. This is good.

Have You Ever Been Afraid of an Album?

(new order, hook-less, or is it, ahahahaha, (dies))

A new album by a group I really like always brings a moment of trepidation before I press play. What if it’s rubbish? Of course, this can be mitigated over the years if you follow a band that manages to plumb deeper depths with every album following their first two (look, even I stopped buying Oasis albums in the end), but most of the time, a new album is a slightly unnerving thing.

And I don’t know why! It’s not really my fault if I don’t like a new release by a band I like! Why do I feel as if I’ve failed? And then there’s the kidding yourself part where you try to like a new album and only admit the truth a year or so later.

(Yes, Be Here Now, take a bow)

Music Complete, then. I still haven’t listened to it, mainly because Warner sent me a bunch of WAV files and I can’t be arsed to do all the metadata, so I’m waiting for the CD to turn up. But on the face of it, this is a worrying prospect. Get Ready and Waiting For The Sirens’ Call had a few good songs spread between them, but it was clear that the Imperial Phase was well and truly over. And then Hook left in acrimony.


You see, while I’m saddened that the band isn’t together any more, I’ve always felt a bit…annoyed that Gillian’s absence from the band never generated as much comment as Hook’s has. This is partly because she didn’t leave in a big bust-up, but I think a lot of it has to do with “well, she’s only a woman and wasn’t even in Joy Division. What does she even bring to the band?”1 Now she’s back, and from reports, she and Stephen had a much larger hand in writing this album than the previous two.


But I still haven’t pressed play.

  1. I have a variation on this argument where I prove that Hole are and always were a better band than Nirvana ever were. With graphs and everything.

All Apache Spark All The Time

(spark, talk)

If you’re looking for slides of my Introduction To Spark talk, you can find them here.

This week has felt like being in a holding pattern. Lots of things may be happening in the coming weeks and months, but they may not. Meanwhile, there’s so much archive BBC/ITV on YouTube to get through…

Oh, and I guess, it’s probably worth pointing out that I killed this week. Well, I didn’t hold the gun, but did point out to Tom that the site appeared to have been hacked. It’s odd - I went there to reflect on how wonderful it is that large sections of the web from decades ago still exist…only to be instrumental in why it now just returns 403 Forbidden. Oops.

(but to be fair, somebody else would have noticed eventually, and in the case of a data breach, better to know sooner rather than later)

Right, back to watching an ancient Ruth Rendell episode where Peter Capaldi seems to be playing Bob Dylan by way of Glasgow.

Brown Butter Makes Everything Better

(butter, caaaaaaaaaaaaaaake)

Brown butter cake is a thing, and a glorious thing.

Two Rooms & A Boom is very short with six people, and I removed myself for fear of going Full Quinns.

In Resistance: Avalon, I did indeed go Full Quinns and forgot to deal in the Merlin card. Needless to say, as a spy, I was an impressively good Minion of Mordred that round.
