Those of you who tried to upload a photo of your dinner to Instagram last weekend, or a video of your ever-so-adorable cat to Vine, probably had a rough time doing so. Both services went offline for over an hour due to problems with Amazon Web Services, the world’s biggest and best-known cloud computing provider.
While the outage was resolved relatively quickly, it caused major downtime for several of the web’s biggest online businesses, including video-streaming service Netflix, accommodation-booking service Airbnb and platform-as-a-service provider Heroku.
It was the second service disruption to hit Amazon in a week, coming just days after its own shopping site went offline. Once again the problems can be traced back to its US-EAST data centre.
According to Amazon’s updates during the glitch, the problem lay with Elastic Block Store (EBS) volumes in its EC2 service and with the load balancers in a single availability zone:
‘We have identified and fixed the root cause of the connectivity issue affecting load balancers in a single availability zone. The connectivity impact has been mitigated for load balancers with back-end instances in multiple availability zones. We continue to work on load balancers that are still seeing connectivity issues.’
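Amazon’s note about ‘load balancers with back-end instances in multiple availability zones’ hints at the mitigation: spread a balancer’s back ends across zones so one zone’s failure leaves healthy instances elsewhere. As a rough illustration only (the balancer name, zones and instance IDs below are placeholders, not anything from the incident), a classic Elastic Load Balancer spanning two zones might be set up like this:

```shell
# Illustrative sketch: create a classic ELB that spans two availability
# zones, so losing one zone leaves healthy back ends in the other.
aws elb create-load-balancer \
  --load-balancer-name example-web-elb \
  --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
  --availability-zones us-east-1a us-east-1b

# Register instances from both zones behind it (IDs are placeholders).
aws elb register-instances-with-load-balancer \
  --load-balancer-name example-web-elb \
  --instances i-0aaa1111 i-0bbb2222
```

Of course, as the outage itself showed, multi-AZ only protects you against a single zone failing, not against region-wide or provider-wide problems.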
The outage leaves me and the rest of the tech community asking the same question: ‘why are so many prominent web companies so dependent on a single cloud provider, or on one giant, fallible data centre?’
The problem is, unfortunately, an age-old one – putting all your eggs in one basket. There are two ways to look at it: either get one amazingly strong, robust basket and trust that the eggs will never break, or get a few baskets, spread the eggs around and assume you’ll lose some along the way. Things always break, no matter how hard you try to stop them, so the best way to ‘survive’ is to assume failure will happen and tolerate it.
This second approach is gaining momentum in the tech world, but it’s hard to do: it requires a totally different way of building and maintaining systems. However, it’s much easier to manage and scale over the long term.
This is the approach we’ve taken: two completely autonomous data centres in different parts of the UK. Even the worst data-centre failure – earthquake, tsunami or even pesky youths breaking into the building and stealing wires (this actually happened to us once!) – can be tolerated without affecting your service availability, because traffic can be re-routed to the other data centre within 60 seconds.
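The failover described above boils down to a health-check loop: probe the primary site, and if it fails a few consecutive checks, route traffic to the standby. The sketch below is a minimal illustration of that idea only – the endpoint URLs, probe interval and failure threshold are assumptions for the example, not our actual routing logic:

```python
# Minimal sketch of active/passive failover between two data centres.
# All names and numbers here are illustrative assumptions.

PRIMARY = "https://dc1.example.co.uk/health"
SECONDARY = "https://dc2.example.co.uk/health"

CHECK_INTERVAL = 10    # seconds between health probes
FAILURE_THRESHOLD = 3  # consecutive failed probes before failing over
# Worst-case detection time: 10 s x 3 probes = 30 s, inside a 60-second budget.

def choose_endpoint(probe, primary=PRIMARY, secondary=SECONDARY,
                    threshold=FAILURE_THRESHOLD):
    """Return the endpoint traffic should be routed to.

    `probe` is any callable returning True when an endpoint is healthy;
    in production it would be an HTTP check run every CHECK_INTERVAL seconds.
    """
    for _ in range(threshold):
        if probe(primary):
            return primary
    # Primary failed `threshold` probes in a row: fail over if standby is up.
    return secondary if probe(secondary) else primary

# Simulated outage: the primary data centre stops answering health checks.
down = {PRIMARY}
print(choose_endpoint(lambda url: url not in down))  # routes to SECONDARY
```

In practice the re-routing itself is usually done at the DNS layer with a low TTL, or at the network layer, rather than in application code, but the detect-then-switch logic is the same.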
I imagine Amazon’s engineers were working frantically to restore service, but this is another reminder that even with the biggest budget and the most highly skilled technicians, trusting 100 per cent of your hosting to one supplier – no matter how big – creates a single point of failure that will always leave you and your data vulnerable.
At the end of the day, your data should be your most prized possession, so you need to wise up about where you locate it.