This Week, I Have Mostly…

I’m not even supposed to be here today. I was supposed to be on the other side of the world, taking in the vast urban streetscapes of communism (with a hefty slice of command capitalism on the side, of course). But it was not to be, at least not this week, anyhow.

However, I did have to have some surprise vaccinations on the off-chance I would be on a plane for most of this weekend. These came on top of a raging sore throat, so my immune system has spent the week lying down and crying for surrender, and I have been living essentially on Strepsils and Lemsips. So much blackcurrant!

I am tired, still sick, and looking forward to Christmas for a proper break. Less than a month to go until I head back to the UK for a bit…

Congress of Berlin

The Congress of Berlin (in 1878) marked a watershed in the history of Europe. It had been preceded by thirty years of conflict and upheaval; it was followed by thirty-four years of peace. No European frontier was changed until 1913; not a shot was fired in Europe until 1912, except in two trivial wars that miscarried. It would not do to attribute this great achievement solely, or even principally, to the skill of European statesmen. The decisive cause was no doubt economic. The secret that had made Great Britain great was a secret no longer. Coal and steel offered prosperity to all Europe and remade European civilization. The dream of Cobden seemed to have come true. Men were too busy growing rich to have time for war. Though protective tariffs remained everywhere except in Great Britain, international trade was otherwise free. There was no governmental interference, no danger of debts being repudiated. The gold standard was universal. Passports disappeared, except in Russia and Turkey. If a man in London decided at nine o’clock in the morning to go to Rome or Vienna, he could leave at ten AM without passport or travellers’ cheques - merely with a purse of sovereigns in his pocket. Europe had never known such peace and unity since the age of the Antonines. The times of Metternich were nothing in comparison. Then men lived in well-founded apprehension of war and revolution; now they came to believe that peace and security were “normal”, and anything else an accident and an aberration. For centuries to come men will look back at that age of bliss and will puzzle over the effortless ease with which it was accomplished. They are not likely to discover the secret; they will certainly not be able to imitate it. — A.J.P. Taylor, The Struggle for Mastery in Europe 1848-1918

A Small Respite

This is likely my last quiet weekend of the year, which is a bit crazy. But, in a nice twist of fate, it has also been the first in a long time where I haven’t been working. So of course, I spent Friday night lying awake until 5am on Saturday morning.

Today has mainly been wandering around the house like a zombie, giving myself caramel burns, cleaning chocolate moulds, and watching BBC dramas from the 60s, 70s, 80s, 90s, and, thanks to my trusty VPN, the latest Who.

(five rounds rapid!)

Oh, and the ironing. Always the ironing.

More updates on the cryptic-ness to come.

Spoooooooky

Do not under-estimate the restorative powers of giving out chocolate to small children for three hours. After a rather difficult month (and more), it was wonderful to be able to spend the evening acting silly, cackling like mad (I was a mad scientist, after all), and hearing loads of children whispering “this house is the best house - they have a lab and they’re giving out full-size candy bars!” as they wandered back to their parents.

Thanks to Tammy and Robert for inviting me down to celebrate Hallowe’en with them, and remember, kids: only experts should attempt to extract the brains of Jack O’Lanterns!

(all the work above was done by Tammy and Robert - Tammy especially spent months pulling all this together. All I had to do was stand, cackle, and give out sweets!)

My Week In Lists

My Week In Lists (extended disco remix)

  1. The Blue Note Grill is quite nice!
  2. Three hours around Home Depot is about two beyond my limit.
  3. All Things Open is basically a giant Triangle Tech get together, isn’t it?
  4. Cambodia apparently has British plugs (among others).
  5. I don’t think I’ve slept properly for four weeks now.
  6. You’re not going to win a one-upping contest by referencing your Salford grandmother in front of two people from Liverpool. (the server in Beasley’s was not on a winning streak that night, sadly)
  7. Giving directions and trying to help everybody navigate an unfamiliar menu goes so much better when you haven’t had your pupils dilated.
  8. (But Taco Nacos is pretty awesome and you should go there)
  9. On a similar front, True Flavors was pretty good, too!
  10. Friday night in Durham was strangely subdued - a count of five in James Joyce at times.
  11. I now have three working showers! Heavens above!
  12. You can have a lot of fun with a clown mask in the middle of the day.
  13. It took me four years to find the local dump.
  14. I forgot to take people to Monuts!
  15. It’s a bit quieter in the house tonight…

All Things Open & Spark

I was going to open this week’s blog by pointing out that my first-ever conference talk will be on Tuesday at All Things Open. But then I realized that it’s actually my second conference talk, as ten years ago I spoke at LinuxTag 2005 about DVD Authoring in Linux. 2005 was an interesting year, looking back. But maybe more on that another time.

Tuesday, then. I’ll be talking about debugging and tuning Spark Streaming pipelines, and when to throw them all away and just do it in Storm instead. There will also be pictures, if that helps.

Meanwhile, the house is full! Dad, Andy, and Ray are here, and tomorrow they will be starting the epic week-long attempt to add a shower and to see if I can rack up frequent flyer bonuses at Home Depot (they’re doing well on that front so far!). Things currently look optimistic, but we’ll see how long that mood survives once it’s tested by experience while I’m out tomorrow. Hundred-year-old houses are hard! Let’s go shopping!

36 Hour Ribs!

In contrast to last weekend, I have spent much less time this weekend trying to find an escape route through flooded roads. Instead, I went to the mall and made ribs! I’m sure both of those subjects will come up on the citizenship exam if I ever choose to take it.

(36 hour ribs came via Serious Eats and were as good as you’d expect. I cheated by using a Cackalacky Cheerwine BBQ sauce instead of my own, but hey, I had been cooking ribs for almost two days, so I feel like I should get something of a pass there)

I feel like I have something to write about this mess, but now that Brooks has decided to change the name (a good thing!), maybe it’s time to let those embers cool for a bit. Having said that, obviously my upcoming restaurant will be called “Paul Morley Once Commented On My Blog”.

Meanwhile, I have a bathroom to pack up! My dad, uncle, and my uncle’s brother will be joining me here at the House of Pi for a week for a spot of bathroom remodelling and seeing the sights of Durham. Hurrah!

Escape From Columbia

The title of today’s post is misleading, as I’m currently writing it in Columbia, instead of in Durham (and I do hope the house is still standing, given that I haven’t seen it for a few days).

As it turned out, Hurricane Joaquin did not hit directly, but it, combined with an assortment of other weather factors, formed a storm more potent than any recorded in SC history (at least 125 years, in fact). Which led to things like this greeting us on Sunday morning:

Flooding by Hard Scrabble Road 2015-10-04

A small problem, as that is the road that I normally take to get back from SC to Durham. But! Not quite as bad as it could be, as I don’t have to take that part of the road. So I turned right and headed off to my usual route. Until I came across a bunch of traffic bollards. Ha! I had the power of the iPhone! I would not be stopped! Until the alternate route was blocked off in the same manner. Then I got confused, drove aimlessly across a rather scary bridge (well, the bridge itself was not quite as scary as the fact that the water level either side of the bridge was getting rather close to the road), encountered a third blockade, and finally got a text saying that all the interstates and highways around Columbia were being closed.

At that point, I admitted defeat in my attempt to get to I-77 and headed back to Tammy and Robert’s.

(the annoying thing is that if I had just made it to I-77, I would have been fine. North is drier ground!)

time passes

Well, somebody didn’t hit ‘publish’ last night. I’m now back in Durham, under the dazzle blanket and reading about ALL THE BRUTALIST THINGS. It was touch and go for a bit, but we scouted the area this morning and found two entrances to I-77 open. As predicted, once I managed to hit the Interstate, things got a lot easier, but I was still quite tired after getting back into Durham tonight. Thankfully, the training session I was supposed to lead today got rescheduled for Wednesday, so I can ease back into things. Meanwhile, as ever, a big thank-you to Tammy and Robert for putting me up for another night, and I hope everybody in Columbia stays safe!

Troubleshooting Apache Spark — GZ Splitting Woes!

You’ve spent hours, no, days getting your Spark cluster up and running (maybe it’s on YARN, maybe Mesos, or even just standalone). It’s a shiny thing with over 50 instances running in AWS, burning money at a furious pace. You run your first ETL job and…

wait, it’s only using ONE EXECUTOR? WHY IN HEAVEN’S NAME $#$%$%@$%^?

You start messing around with spark.default.parallelism, going so far as adding explicit parallelism parameters in all your RDD calls…and yet…nothing.
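
For reference, those knobs look something like this. This is only a sketch: the property name and the extra partition-count arguments are real Spark APIs, but the app name, file name, and the value of 200 are made-up placeholders rather than recommendations.

// Assumed example: setting the default parallelism when building the context.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-etl")                          // placeholder app name
  .set("spark.default.parallelism", "200")       // cluster-wide default

val sc = new SparkContext(conf)

// ...or override it per call, as a second argument:
val nums = sc.parallelize(1 to 1000, 200)        // 200 partitions for this RDD
val logs = sc.textFile("input.txt", 200)         // a *minimum* of 200 partitions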

So here’s what could be going wrong. When you make an RDD, Spark splits it into partitions based on either the spark.default.parallelism value or a user-supplied value:

(spark.default.parallelism = 2)


scala> sc.parallelize(Seq(1,2,3)).toDebugString
res12: String = (2) ParallelCollectionRDD[4] at parallelize at <console>:22 []

scala> sc.parallelize(Seq(1,2,3), 200).toDebugString
res13: String = (200) ParallelCollectionRDD[5] at parallelize at <console>:22 []

Let’s load in a text file!


scala> sc.textFile("CHANGES.txt",200).toDebugString
res15: String = 
(200) MapPartitionsRDD[9] at textFile at :22 []
  |   CHANGES.txt HadoopRDD[8] at textFile at :22 []

Well, that looks fine. But maybe you’re not loading in a simple text file. Maybe, just maybe, you’re storing your input files in a compressed manner (perhaps to save space on S3).


scala> sc.textFile("CHANGES.txt",200).toDebugString
res15: String = 
(200) MapPartitionsRDD[9] at textFile at :22 []
  |   CHANGES.txt HadoopRDD[8] at textFile at :22 []

scala> sc.textFile(“C.gz”,200).toDebugString
(1) MapPartitionsRDD[13] at textFile at :22 []
 |  C.gz HadoopRDD[12] at textFile at :22 []

Eh? We’ve asked for 200 partitions, and we’ve only got one. What’s wrong? Well, you’ve probably already guessed - Spark can split a plain text file into chunks of lines with no trouble, and separate cores can work on those chunks independently. But it’s not magic: a gzip stream has to be decompressed from the start, so Spark can’t hand different chunks of the compressed file to different tasks. Hence one partition, one core.
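
(If you want to confirm this without squinting at toDebugString, the partition count is available directly on the RDD. A quick sketch from the spark-shell; the file names match the examples above, and the res numbers and output are illustrative:)

scala> sc.textFile("CHANGES.txt", 200).partitions.size
res20: Int = 200

scala> sc.textFile("C.gz", 200).partitions.size
res21: Int = 1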

Is there anything that can be done? Well, you could look into storing the files uncompressed. It’s the easiest solution! But not always the most practical. Or you could take a look at LZO compression, which (once indexed) can be split across the cluster, at the cost of a lower compression ratio. But maybe you can’t control the choice of compression, either. Enter repartition:


scala> sc.textFile("C.gz").repartition(100).toDebugString
res38: String =
(100) MapPartitionsRDD[43] at repartition at <console>:22 []
  |   CoalescedRDD[42] at repartition at <console>:22 []
  |   ShuffledRDD[41] at repartition at <console>:22 []
  +-(1) MapPartitionsRDD[40] at repartition at <console>:22 []
     |  MapPartitionsRDD[39] at textFile at <console>:22 []
     |  C.gz HadoopRDD[38] at textFile at <console>:22 []

Of course, you’ve spotted the ShuffledRDD in the lineage above. After all, if you want the information to be spread around the cluster…you are actually going to have to spread it around the cluster. This is likely to cause a performance hit at the start of a Spark job, but the overhead may be worth incurring if the rest of the processing distributes more evenly as a result. Test, evaluate, and pick whichever takes your fancy!
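
To round things off, here’s a minimal sketch of the whole pattern in one place. Everything in it is an assumption to adapt to your own job: the S3 paths, the ERROR filter, and the repartition factor of 100 are placeholders, not recommendations.

// The gzipped input arrives as a single partition; pay the shuffle cost once
// up front so the rest of the job can run across the whole cluster.
val lines = sc.textFile("s3n://my-bucket/logs/events.gz")  // (1) partition
  .repartition(100)                                        // spread the data out

// From here on, the work runs across 100 partitions.
val errorCounts = lines
  .filter(_.contains("ERROR"))
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)

errorCounts.saveAsTextFile("s3n://my-bucket/output/error-counts")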