Cleanse+ is complicated. Really complicated. From the moment we first started developing it, we knew its unique approach would demand some serious grunt. Each address that’s cleansed requires millions of calculations and access to gigabytes of data. This isn’t something that’s going to work on your average PC!
Our first attempts were far from optimal. It took almost 20 minutes to clean up a single address, so there was plenty to do before it could work in real life! It took a few weeks, but we got the time down to less than a second, which was much more sensible. However, we also knew we needed to clean lots of addresses – millions a day – and in that scenario, even a second per address is just too slow.
There are different approaches to this sort of problem – buy really, really fast servers, or buy lots of slow ones and make them work together. Unfortunately, because we need not only processor performance, but also lots of memory, lots of cheap servers weren’t an option. Moreover, buying servers is just the start: you need to house them, power them and suck out the heat they generate.
All this adds to the cost and influences the decision.
Serious machines, HyperThreading and 48 cores
For Cleanse+, we settled on a bunch of Dell servers, each with four processors, six cores per processor and just under 100GB of RAM (yes – you read that right!). These are serious machines, and with 24 cores available to us, we expected the system to run 24 times faster than on our dev machines using a single core. Turn on HyperThreading and Windows was reporting 48 cores – a techie’s dream! But that’s when the nightmare began…
Developers are taught that if you write nicely multi-threaded software, it’ll scale really well on bigger hardware. However, our first trial runs were extremely disappointing. We were expecting 100% processor utilisation but were actually seeing something closer to 15%. This was looking like an expensive mistake! More worryingly, there wasn’t any obvious reason for the sluggish performance.
Several hours of Googling later, a baffling picture of hardware complexity started to come into focus – but sadly, no easy answers. So why don’t things scale as you’d expect on these larger machines?
More cores, more processors
The answer is relatively simple… but the solution is more complex than perhaps it should be. Over recent years, chip manufacturers like Intel and AMD have been putting more cores into their processors. Extra cores don’t boost single-threaded performance, but they do give greater capacity and the ability to do more things at once – something that’s great for servers.
Processors also have memory cache built into them, to save trips to main memory for frequently-used data. This cache is typically shared across all the cores on a processor, so while things are great for tasks that aren’t memory-intensive, the impact can be severe if you find yourself moving large chunks of data around.
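As a rough illustration (not from Cleanse+ itself – the sizes and stride are arbitrary), the following Python sketch sums the same array twice using identical code, once walking memory in order and once with a stride large enough that almost every access lands on a fresh cache line:

```python
# A toy illustration of cache behaviour: summing the same array with a
# stride of 1 (cache-friendly) versus a large stride that touches a new
# cache line on almost every access.
from array import array
import time

N = 1 << 20      # ~1M doubles (8MB) - larger than most L2 caches
STRIDE = 4096    # far enough apart that consecutive accesses share no cache line

data = array("d", range(N))

def strided_sum(a, stride):
    # Visits every element exactly once; only the *order* changes.
    total = 0.0
    for start in range(stride):
        for i in range(start, len(a), stride):
            total += a[i]
    return total

t0 = time.perf_counter()
s1 = strided_sum(data, 1)
t1 = time.perf_counter()
s2 = strided_sum(data, STRIDE)
t2 = time.perf_counter()

print(f"stride 1: {t1 - t0:.3f}s   stride {STRIDE}: {t2 - t1:.3f}s")
assert s1 == s2  # same answer either way - only the speed differs
```

The effect is far more dramatic in a compiled language, where interpreter overhead doesn’t dilute the memory stalls, but the principle is the same: the data doesn’t change, only the access pattern does.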
This is especially true with a feature like HyperThreading. It makes the operating system believe there are twice as many cores present, with the side-effect of pushing even more work through the processor. This means the cache becomes ‘dirty’ faster, and the performance gains may be completely offset by time spent waiting to bring in data from main memory.
Things get even more complicated with multiple processors. While every processor can access all of the memory, each typically has faster access to some of it: a four-processor machine will usually have its memory arranged in four banks, each with fast access from one processor and slower access from the other three – an arrangement known as NUMA (non-uniform memory access). This becomes a problem for a single, large application, which spans all the processors without necessarily being aware of the underlying architecture.
Ignoring the hardware is a costly mistake, but if you’re sensitive to it, you can work with it to your benefit. Our own solution isn’t especially complicated, but it works surprisingly well.
Instead of a single monolithic application containing all the data and cleansing workers, we split things into six separate apps: a job manager, which takes the jobs and distributes them to four worker apps. Each worker takes no more than six addresses at a time to cleanse, binds itself to a single processor, and runs no more threads than that processor has cores, which helps reduce context switches. The final piece is a storage and database app that holds all the PAF® data in memory for exceptionally fast searching.
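The layout above can be sketched with Python’s multiprocessing module (purely illustrative – Cleanse+ isn’t Python, and the cleanse function here is a trivial stand-in). The six-address batch size and four workers come from our setup; the core-pinning uses os.sched_setaffinity, the Linux counterpart of the Windows affinity API:

```python
# Illustrative sketch of the manager/worker split: a job manager queues
# batches of at most six addresses, and each worker process pins itself
# to one processor's cores before cleansing.
import multiprocessing as mp
import os

BATCH_SIZE = 6        # per-worker batch limit
CORES_PER_WORKER = 6  # one six-core processor per worker

def cleanse(address):
    # Stand-in for the real cleansing logic.
    return address.strip().upper()

def worker(worker_id, jobs, results):
    # Bind this process to "its" processor's cores. sched_setaffinity is
    # Linux-only; on Windows the equivalent is SetProcessAffinityMask.
    if hasattr(os, "sched_setaffinity"):
        first = worker_id * CORES_PER_WORKER
        cores = set(range(first, first + CORES_PER_WORKER)) & os.sched_getaffinity(0)
        if cores:
            os.sched_setaffinity(0, cores)
    while True:
        batch = jobs.get()
        if batch is None:  # sentinel: no more work
            break
        results.put([cleanse(a) for a in batch])

def run_job(addresses, n_workers=4):
    jobs, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(i, jobs, results))
             for i in range(n_workers)]
    for p in procs:
        p.start()
    batches = [addresses[i:i + BATCH_SIZE]
               for i in range(0, len(addresses), BATCH_SIZE)]
    for b in batches:
        jobs.put(b)
    for _ in procs:
        jobs.put(None)   # one sentinel per worker
    cleaned = []
    for _ in batches:
        cleaned.extend(results.get())
    for p in procs:
        p.join()
    return cleaned

if __name__ == "__main__":
    print(run_job([" 10 downing street ", "221b baker st"]))
```

Because each worker is a separate process pinned to one processor, its threads and its working set stay on one NUMA node instead of bouncing between them.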
This approach worked well on a number of levels. Among other things, we had seen real performance problems with garbage collection under the single-app approach. The storage system contains a huge number of long-lived objects, while the cleanse system is constantly creating and destroying them. In a single application, the garbage collector seemed to struggle to keep track of all the objects and manage the memory properly.
Splitting these out means the garbage collector only really has to do anything in the workers, and they are a sensible, manageable size. Plus, workers can fail and be recycled, which greatly helps availability. Finally, this approach makes it easier to scale across many machines, because the job manager can use both local and remote workers.
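One way to get that ‘fail and recycle’ behaviour in a sketch: Python’s multiprocessing.Pool can retire each worker process after a fixed number of tasks via maxtasksperchild (illustrative only – this isn’t how Cleanse+ itself does it):

```python
# Sketch of worker recycling: maxtasksperchild=1 replaces each worker
# process after a single batch, so a worker that leaks memory or dies
# can't degrade the service for long.
import multiprocessing as mp
import os

def handle_batch(batch):
    # Return the worker's pid so we can see that processes were replaced.
    return os.getpid(), [a.strip().upper() for a in batch]

def run(batches):
    with mp.Pool(processes=2, maxtasksperchild=1) as pool:
        return pool.map(handle_batch, batches)

if __name__ == "__main__":
    results = run([["1 high st"], ["2 low rd"], ["3 mid ave"]])
    print("distinct worker pids:", len({pid for pid, _ in results}))
```

With three batches and maxtasksperchild=1, each batch is handled by a freshly-spawned process, so a crash or leak in one worker never outlives a single batch.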
The moral of the story
So what’s the moral of this story? Scaling things isn’t as easy as we’re led to believe and solving big problems often means thinking about things in quite different ways. These days it’s easy to assume that the boffins at Intel and Microsoft have successfully hidden all that ugly hardware from us. That’s just not the case – and if you’re working on a complex system you really need to understand what’s going on under the bonnet.