Self-Promotional Post

As of today, I’m looking for a new position. If you’re looking for somebody to do Big Data / Spark / Storm / Scala / Hadoop / Ruby / Elixir / Docker / Mesos / AWS work, then I’m available! Ideally, I’m looking for a remote position, or otherwise in Durham (downtown very much preferred).

Some people have been asking, so just to point out that I’m also available for consultancy on any of the above.

Here’s my extended resumé, and you can reach me at

Cake…or Brexit?

This is what happens when I need to work through my feelings using the medium of pastry.

A photo posted by Ian Pointer (@carsondial) on

(thanks to Tammy who ended up doing all 20 layers of the Schichttorte!)

England Made Me. And England Will Break Me

(note: this is basically a jumble. I need to write something, so this is something. If you’re looking for a great piece of writing that sums up how I feel in a better way, I direct you to Tom Ewing’s fabulous A to Z on the matter)

“Here, the intersection of the timeless moment is England and nowhere. Never and always.”

Watching your country crumble to dust on live television is, if nothing else, something. Exciting in the way that a 9.5 earthquake is exciting in the brief seconds before it turns to terror. You could see it on the faces of Dimbleby and others on the BBC broadcast as Newcastle reported its results, and in the poor teller at Sunderland, her voice barely holding together as she signalled the defeat of Remain with only the second mainland result of the night. The hour before, when we laughed about the 823 leavers in Gibraltar, seemed another age ago.

Then the last three days, staggering tales of abuse, starting with children and teenagers turning on their parents, yelling ‘what have you done to us?’, swiftly escalating to stories of non-Britons being told to fuck off back home, pork thrown into the gardens of British Muslims, and worse. Many of these people have lived in the country for years, decades, born here even. As much right to live in the country as any of us. Suddenly other and subject to abuse that would not normally be acceptable. What has happened to us?

Things getting more and more insane, the pound experiencing the biggest drop since records began, market chaos, our Prime Minister toddling off into the sunset yelling ‘fuckitybye!’ in the general direction of Boris Johnson. Scotland running for the exits, Ian Paisley Jr. suggesting that people get Irish passports. I struggle to tell my American friends just how crazy that last sentence is to somebody who grew up during the Troubles.

I hate their hot takes, their casting of Brexit in the light of their own problems, whether it be Trump or Black Lives Matter, not even managing to determine the difference between the Republic of Ireland and Northern Ireland and getting petulant when somebody points out their mistake. It is apparently a victory for the left against the neoliberal European Union. It doesn’t feel like that here, or on the ground back home, especially as the Labour Party continues its stellar approach of approaching a crisis by jumping off the nearest cliff.

I feel alien. The United States is my home now, but it is not my real home. But that will soon be gone forever. Scotland will leave now, that much is almost certain, and who can blame them? The UK I grew up in will be cast into a faded memory, three hundred years of Union blown up in order to bolster David Cameron’s re-election chances. Prime Minister Johnson of the United Kingdom of England, Wales, and perhaps N. Ireland. But isn’t he a legend, they will say.

In the end it’s not the future, But the past that’ll get us.

There is more, but I’m tired.

High Point - Ghost Town

I will never tire of this building. “So, we’d like a furniture showroom that is half an ocean cruiseliner and blacker than Darth Vader’s mask. That’s fine, right?”

A photo posted by Ian Pointer (@carsondial) on

The view from the bustling High Point theatre. Not pictured: the throng and bustle of the crowd:

A photo posted by Ian Pointer (@carsondial) on

But! It does at least have a charming, ‘old-time Americana’ train station:

A photo posted by Ian Pointer (@carsondial) on

At The Top Of The Summit

As I mentioned last time, I spent most of this week at Spark Summit in San Francisco (along with sampling custom Four Roses bottles and spending a fun evening feeling fragile and desperately hoping I wasn’t going to be sick). It was my first big conference! People from places like MapR, Cloudera, Ticketfly, LinkedIn, Netflix, Shopify, Google, and of course Databricks!

And what did I do?

Why, of course, I basically sat at the back for almost two days and barely talked to anybody. I did have a short, almost star-struck chat with Matei Zaharia (creator of Spark), but that was about it. In theory, I like the idea of conferences…but being so unable to talk to people, I might have been better off doing my usual thing and watching the talks when they go up online in a couple of weeks.

Still, despite all that, I did have a good time in San Francisco and the surrounding area (see the pictures in the last post). But I wish I was better at these things.

Back home, trying to catch up on sleep, failing, and wasting the weekend away. Another Sunday evening, it seems…

California Weekend

A photo posted by Ian Pointer (@carsondial) on

A photo posted by Ian Pointer (@carsondial) on

A photo posted by Ian Pointer (@carsondial) on

A weekend seeing much, much more of California than I imagined!

Tomorrow, I’m off to Spark Summit 2016! Come say hi! I’ll be the odd British person looking lost in downtown San Francisco!

Three Years In Driver

It’s coming up to the third anniversary of the time when I bought a house under the influence of Vicodin. Halcyon days and all that. And finally, after all that time, the reasons why I don’t think I can ever love this house are crystallising in my mind.

Mostly, it’s likely due to a quirk of my life back in the UK. Nobody else has lived in that house since it was built in 1978. Nobody. My parents moved in after it was built, and we have been there ever since. Which I know is atypical, but I think it’s formative.

Every part of 39 Avon Crescent is a memory we created. Every cupboard, every painted wall, even the newspaper lining under the carpet belongs to us. I can look at the ceiling in the kitchen and remember that the bump is where the kitchen used to end before the extension was put in. I know why there isn’t a door leading into the living room, and even now I know exactly what awaits me if I go under the stairs. Every part of it is ours. And even though I don’t live there any more, it takes me about 30 minutes to fall back into the way of things.

This is not something that the knotty pine kitchen does for me here in Durham. Nor the grass outside that just grows and laughs at me after it gets chopped back every two weeks.

The house is slowly changing. The furniture, the new rugs, the fancy new shower in the master bathroom. The way that we stripped the back room and turned it into a chocolate-making room. Posters on the walls for Helvetica and the Trellick Tower.

But it will never be my house.

Of course, working this out doesn’t really help me much. Answers on a postcard to the usual address: W12 7RJ.


I have appendicitis, I’m going to have surgery today sometime.

I was expecting a quiet week after winebratsf’s visit last weekend (Eurovision party! Whiskey was had! Also, TRON, in some bizarre drunken haze, but anyway), but when one of your best friends has nobody with her the day she’s going into surgery and you can possibly get there in time? It’s time for ACHIEVEMENT UNLOCKED: buying a plane ticket and flying the same day.

(thanks go to my employers, of course, for giving me the flexibility of being able to do that; I’m in a privileged position that not many people share)

Thankfully, the surgery went well, which meant I spent the week doing some driving in Kentucky and trying to beat back the Calvinist American work ethic. Keyhole surgery or not, I feel a week off is understandable (and recommended by the NHS!). Others disagreed, but we came to a compromise of at least easing back into work.

Oh, and I ate this in Kentucky and survived:

A photo posted by Ian Pointer (@carsondial) on

(admittedly, I only ate half)

Now, back home, cleaning the house up some, keeping an eye on work data processes that are running all weekend, and looking forward to next week’s long weekend. Whereupon I shall endeavour to try and keep myself busy. There may be chocolate involved.

UDAFs In Spark

Today, we’re going to talk about User-Defined Aggregation Functions (UDAFs) and Dataset Aggregators in Spark, similar but different ways of adding custom aggregation abilities to Spark’s DataFrames. This may be of interest to some. For the rest of you, umm…I suggest a cup of tea and some digestives instead.

User-Defined Aggregation Functions

UDAFs as a concept come from the world of databases, and are related to User-Defined Functions (UDFs). While UDFs operate on a column in a row in an independent fashion (e.g. transforming an entry of ‘5/2/2016’ into ‘Monday’), UDAFs operate across rows to produce a result. The simplest example would be COUNT, an aggregator that just increments a value for every row in the database table that it sees and then returns that number as a result. Or SUM, which might add up a column in every row in the table.

Or, to put it another way: UDFs are map(), UDAFs are reduce().
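If the map()/reduce() comparison feels a bit abstract, here is a minimal sketch using plain Scala collections rather than Spark. The MapVsReduce object, the toMinutes function and the sample numbers are all made up for illustration; the only point is the shape of the two operations.

```scala
// Plain Scala collections standing in for rows in a table (no Spark required).
// All names and values here are hypothetical, purely for illustration.
object MapVsReduce {
  // One value per "row": delivery times in hours.
  val deliveryHours: Seq[Double] = Seq(1.0, 2.0, 4.0)

  // UDF-like: looks at one row at a time, independently of the others.
  def toMinutes(hours: Double): Double = hours * 60

  // map() keeps one output per input row.
  val perRow: Seq[Double] = deliveryHours.map(toMinutes)

  // UDAF-like: every row is folded into a single running value.
  // This is a hand-rolled COUNT...
  val count: Long = deliveryHours.foldLeft(0L)((acc, _) => acc + 1)

  // ...and a hand-rolled SUM.
  val sum: Double = deliveryHours.foldLeft(0.0)(_ + _)
}
```

The map() call gives back as many values as went in, while each foldLeft collapses the whole column into one value, which is the essential difference between a UDF and a UDAF.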

Normally things like SUM and COUNT will be built into the data manipulation framework you’re using; UDAFs come into their own for implementing custom logic in a reusable manner. If you and your team often need to generate a custom probability distribution for your warehouses’ delivery times, maybe you can implement it as a UDAF once and then everybody can get access to it without having to reimplement the logic over repeated queries.

UDAFs in Spark

Adding a UDF in Spark is simply a matter of registering a function. UDAFs, however, are a little more complicated. Instead of a function, you have to implement a class that extends UserDefinedAggregateFunction. Here’s a UDAF that implements harmonic mean:

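import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class HarmonicMean() extends UserDefinedAggregateFunction {

  // The same inputs always produce the same result
  def deterministic: Boolean = true
  // One input column of doubles, one double out
  def inputSchema: StructType = StructType(Array(StructField("value", DoubleType)))
  def dataType: DataType = DoubleType
  // Buffer: running sum of reciprocals, plus a count of rows seen
  def bufferSchema: StructType = StructType(Array(
    StructField("sum", DoubleType),
    StructField("count", LongType)
  ))
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.toDouble
    buffer(1) = 0L
  }
  // Fold one incoming row into the buffer
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getDouble(0) + (1 / input.getDouble(0))
    buffer(1) = buffer.getLong(1) + 1
  }
  // Combine two partially-aggregated buffers
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Harmonic mean = count / sum of reciprocals
  def evaluate(buffer: Row): Double = {
    buffer.getLong(1).toDouble / buffer.getDouble(0)
  }
}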

Let’s walk through the class to see what each of the methods does. Firstly, deterministic() is a simple flag that tells Spark if this UDAF will always return the same value if given the same inputs (in this example, the Harmonic Mean should always be the same given the same inputs, or else we’ve got bigger problems). The next two methods, inputSchema() and dataType(), specify the input and output data formats. We’re not doing anything crazy here - just requiring that our input column is a double and that our output will also be a double. You’re free to create UDAFs with weird and wonderful type signatures though, bringing in multiple columns and outputting anything you like.

With those out of the way, we now need to specify the schema of our buffer. The buffer is a mutable object that will hold our in-process calculations. For calculating the harmonic mean, we’re going to need a running sum of the reciprocals, plus another variable which can count the numbers we’ve seen for the final calculation. The bufferSchema is defined as a StructType, here with an array of two StructFields, one of type Double and the other of type Long.

Having finally set up all the types (come on, this is Scala! The typing is all the fun, right? Right? Anybody?), we can implement the methods that will calculate our mean. As you might expect, initialize() is called first. Here, we’re making sure that both the sum and the count fields are initialized with zero values.

update() and merge() are where your aggregations happen. update() takes two arguments, a MutableAggregationBuffer where aggregation has already taken place, and a new Row which needs to be processed. In this example, we add the reciprocal value of the incoming row to the buffer (note that we don’t do a reciprocal on the buffer because the contents of the buffer have already been processed).

merge(), on the other hand, merges two already-aggregated buffers together. This is needed because Spark will likely split the execution of the UDAF across many executors (which is what you’d want, of course!) and it needs a way of combining those aggregations for the result. Here, like in many UDAF examples that compute a mean, the merge() is very straightforward. We just need to sum the two different counts, and the two different sums of reciprocals. Your custom merging logic may be more complicated than this.[1]

Finally, there’s evaluate(). This gets called at the end of the UDAF’s processing. In this example, evaluate() actually produces the harmonic mean result we’re looking for by dividing the count by the sum of the reciprocals.
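To see why update() and merge() have to agree with each other, here is a rough Spark-free model of the same lifecycle. It is only a sketch: HarmonicMeanSketch, the Buffer case class and the aggregate() helper are my own stand-ins (an immutable buffer instead of MutableAggregationBuffer, nested Seqs instead of partitions), not anything Spark provides.

```scala
// A plain-Scala model of initialize/update/merge/evaluate (no Spark required).
// HarmonicMeanSketch, Buffer and aggregate are hypothetical illustration names.
object HarmonicMeanSketch {
  // Immutable stand-in for the (sum of reciprocals, count) aggregation buffer.
  final case class Buffer(sum: Double, count: Long)

  // initialize(): both fields start at zero.
  val initial: Buffer = Buffer(0.0, 0L)

  // update(): fold one raw input value into an existing buffer.
  def update(b: Buffer, value: Double): Buffer =
    Buffer(b.sum + 1.0 / value, b.count + 1)

  // merge(): combine two buffers that have already been aggregated.
  def merge(a: Buffer, b: Buffer): Buffer =
    Buffer(a.sum + b.sum, a.count + b.count)

  // evaluate(): count divided by the sum of reciprocals.
  def evaluate(b: Buffer): Double = b.count.toDouble / b.sum

  // Each inner Seq plays the role of one partition: update within a
  // partition, merge across partitions, then evaluate once at the end.
  def aggregate(partitions: Seq[Seq[Double]]): Double = {
    val partials = partitions.map(_.foldLeft(initial)(update))
    evaluate(partials.foldLeft(initial)(merge))
  }
}
```

Calling HarmonicMeanSketch.aggregate(Seq(Seq(1.0, 2.0), Seq(4.0))) and HarmonicMeanSketch.aggregate(Seq(Seq(1.0, 2.0, 4.0))) should give the same answer (12/7, roughly 1.714), which is exactly the property merge() needs: how the rows are split across partitions must not change the result.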

Using UDAFs

Having defined the UDAF, how do you actually use it? Well, as with UDFs, you get two choices. Firstly, there’s the fairly obvious method of using it in DataFrame aggregations, like this:

val hm = new HarmonicMean()
val df = sc.parallelize(Seq(1.0, 2.0, 4.0)).toDF("value")
df.agg(hm(df("value")).as("hm")).show()

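But you can also register the UDAF and use it transparently within SparkSQL queries (this assumes df has also been registered as a temporary table named df, e.g. with df.registerTempTable("df") in Spark 1.x):

sqlContext.udf.register("hm", hm)
sqlContext.sql("SELECT hm(value) AS hm FROM df").show()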

As you can imagine, the latter method is a great way of providing additional functionality to your Spark platform which can be introduced to your analytics team without having to step outside of their SQL comfort zone.

Behind the scenes

UDAFs are implemented as ScalaUDAF, a class that extends ImperativeAggregate. (This is one of the two kinds of AggregateFunction available in Spark; the other is DeclarativeAggregate, which works directly with Catalyst expressions rather than the row-based approach of ImperativeAggregate.)

You can trace through the AggregationIterator class to see how Spark walks through the execution of the aggregators - it’s not especially pretty, but it does work!

What Happened to Dataset Aggregators?

I spent a bit longer on the UDAFs than I planned, so I’ll do a separate follow-up post where I look at Dataset Aggregators.

  1. Essentially ‘may you live in interesting times’, but for Spark.

Kentucky! No, Ohio! No, Kentucky! No, Ohio!

Back from Cincinnati again, and this time I can say I saw more of the city. Including, perhaps, the most insane supermarket I have ever seen.

A photo posted by Ian Pointer (@carsondial) on

This is Jungle Jim’s, perhaps the only supermarket to think “I know, let’s jam another supermarket on to the end of the original one and go crazy!”

A photo posted by Ian Pointer (@carsondial) on

Yes, that’s Robin Hood standing over the British foods section. There’s also an island, a boat, a fire engine, singing cereal mascots…and…look, I can’t do it justice. It even imports Tesco and Sainsbury products, for goodness’ sake, and somehow manages to get proper Cadbury chocolate (don’t ask me how, given the Hershey clampdown). It is almost every American shopping stereotype brought to life…complete with a monorail out front. Simply wonderful.

Somewhat late to the party, I finished watching the first series of Detectorists this week as well. I had given it a wide berth because of Crook, and I was totally wrong. It turned out to be exactly the gentle comedy that I never thought could come from somebody involved with The Office. And I’ll confess that at times it even made me a little homesick. It might be nice to go home one year when it isn’t Winter.

One thing that has been nice about travelling so far this year is that I’m going to places where I actually know people (okay, so not Nashville or Atlanta, but my hit-rate is much better this year)! This time I got to meet up with Tammy to see where she’ll hopefully end up living in a couple of weeks’ time. While it’s technically in Kentucky, you can literally walk in a straight line for half-an-hour, go over a bridge and you’re back in Ohio. I could entertain myself for a fair portion of the afternoon simply by walking across states. I have issues.

Whilst you wouldn’t normally put ‘hipster’ and Cincinnati together, it is a decent-sized city and that means that they are present.

A photo posted by Ian Pointer (@carsondial) on

It’s a converted post office that is now a restaurant selling fried chicken! Surrounding it: high-end pet shop, hipster shops selling homewares, and of course the bars spilling out into the street and the Sunday sun. So yes, Over-The-Rhine is très hipster. But, you know what? It felt inclusive in a sense that you don’t often get in Durham. Plus the lunch at The Eagle cost less than half of what you’d pay here in the Triangle.

Oh, and the beer cellar where we played board games also has this shop for all your lederhosen needs. What more could you possibly want from a city?