
Do Not Underestimate The Power of a Single Server


You don’t need a cluster. Really. A single machine is probably fine for your use case.

It’s a controversial opinion. Many people, rightly, throw their arms up in the air screaming “single point of failure” or “scaling” at the top of their lungs. They’re not wrong. That’s also not the point this post is trying to make. Do you need to care about those things, though? Why would you? In the end it boils down to what your application needs. And if it’s a commercial application, to what the business needs.

Let’s take a trip back down memory lane to create some context. Back in the early days of the (pre-)internet, almost every service was hosted on a single machine. A BBS was usually a single computer on somebody’s desk at home. Large businesses had their software running on a single machine. And if that machine couldn’t handle the load, you’d get a bigger one. The biggest challenge businesses faced with these single-machine setups was the reliance on a single piece of hardware. Putting all your bits in one basket is bound to go wrong at some point, and your business grinds to a halt when it does. This is why hardware manufacturers started building redundancy into their machines: redundant power supplies, network interfaces, storage, and even CPUs. Everything became hot-swappable. These huge machines came with two big downsides. Firstly, they were extremely expensive. Secondly, they were still only one machine. If there’s a power failure, or the building that houses the machine ceases to exist, none of that redundancy will help you.

Tech startups in the late nineties realized that it was much cheaper to run a rack of cheap x86 consumer-grade hardware than to buy these high-end machines. Sure, the failure rates of the individual machines were higher, but they were deployed redundantly anyway, and replacement parts cost only a fraction of the price. Each machine was much slower than a high-end server, but that didn’t matter either: if you can buy four machines with half the performance for the price of one high-end one, it’s still a net win. Basically all large internet-facing companies now run large clusters of cheap hardware, these days often of their own design.

The next logical step is to run your application across multiple racks. It is not unheard of for an entire rack to fail, usually due to power or networking. And what about data centers? If the entire data center loses connectivity or power, all your racks go offline. Multiple data centers, it is! If those data centers are too close together and there’s a regional issue like an extreme weather event, a prolonged blackout, or even a war, you’re going offline. If you want to be truly resilient, you’re going multi-region.

This is prohibitively expensive for most businesses. Only the extremely large ones can afford to set up this level of redundancy. Amazon is one of them. They quickly realized that many companies dreamt of this but couldn’t afford it. Amazon Web Services was born, and with it the era of cloud computing. (Granted, they were not the first, but they were certainly the biggest.) Every business now has the ability to run its applications in an extremely fault-resilient way by piggybacking on the enormous investment these large cloud providers have put into their infrastructure. For a price, of course. As an added benefit, you get the ability to scale your rented infrastructure footprint on the fly to match demand.

If scaling, redundancy, and fault-tolerance can be bought from cloud providers for a fraction of the cost of doing it yourself, why not do it? Every business wants six nines of uptime, right? Complexity. That’s why. Most software is stateful. The vast majority of business applications are some sort of CRUD system with business logic on top. Turning these applications into distributed systems that actually use cloud infrastructure effectively is a delicate and error-prone undertaking. Distributed systems are hard. The engineers who know how to properly design and maintain them are few and far between.

To give an example of a classic distributed systems problem: imagine you have two availability zones, both receiving half your incoming traffic. Due to limitations of how (most) databases work, only one zone can receive writes, while the other zone maintains a read-only replica. If the main zone goes down, the replica is automatically promoted to writer and business continues as usual. But what if only the link between the zones fails? Now each zone thinks the other has gone down, and both become writers. Half your updates go to one zone and the other half go to the other. This is the split-brain problem. The only way out is to either painstakingly weave the writes back together, or discard half of them. Usually only the latter is a viable option. This means data loss. Data loss caused by a system designed to prevent it. In general, every measure of redundancy added to a distributed system introduces more logic, which introduces its own failure modes. It’s a fine balancing act between making the system more resilient by adding redundancy and making it less stable by adding complexity.
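To make that failure mode concrete, here is a minimal, purely illustrative sketch of the naive failover rule described above (“if I can’t see my peer, take over writes”). The names (Zone, maybe_promote) are hypothetical and not taken from any real replication system.

```python
# Purely illustrative sketch of naive failover; not any real system's API.

class Zone:
    def __init__(self, name: str, is_writer: bool):
        self.name = name
        self.is_writer = is_writer

    def maybe_promote(self, link_to_peer_up: bool) -> None:
        # Naive rule: "if I can't see my peer, assume it's dead and take
        # over writes." This is exactly where split-brain sneaks in.
        if not self.is_writer and not link_to_peer_up:
            self.is_writer = True


primary = Zone("zone-a", is_writer=True)
replica = Zone("zone-b", is_writer=False)

# Only the link between the zones fails; both machines are still healthy.
replica.maybe_promote(link_to_peer_up=False)

# Both zones now accept writes: split-brain.
print(primary.is_writer, replica.is_writer)  # True True
```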

Of course there are solutions out there that handle multiple write locations and prevent this unfortunate split-brain problem. Especially since the advent of Raft, such systems have become plentiful. Unfortunately they are not immune to the CAP theorem either. There can be consistency, availability, and partition tolerance, but you may only pick two. Raft picks consistency and partition tolerance: if too many copies become unavailable, the system stops responding. Other combinations also exist. You can opt for eventual consistency, which means some instances of your application will be serving stale data for a while. Or, like the example above, you can give up partition tolerance and risk a split-brain problem. In all of these scenarios your application needs to be able to work with the limitations. It needs to be partition-aware. Or it needs to know it could be reading stale data, even though it just wrote new data. Or it needs to know that sometimes the data is unavailable until the cluster reaches consistency. Most off-the-shelf applications are not that clever.
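As a rough illustration of that consistency-over-availability trade-off, here is the back-of-the-envelope quorum arithmetic a majority-based system like Raft relies on. This is a sketch, not tied to any particular implementation.

```python
# Quorum arithmetic for a majority-based consensus system.

def quorum(cluster_size: int) -> int:
    # A majority of the cluster must agree before a write is committed.
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size: int) -> int:
    return cluster_size - quorum(cluster_size)

for n in (3, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, "
          f"can lose {tolerated_failures(n)} and keep accepting writes")

# 3 nodes: quorum=2, can lose 1 and keep accepting writes
# 5 nodes: quorum=3, can lose 2 and keep accepting writes
# 7 nodes: quorum=4, can lose 3 and keep accepting writes
#
# Lose more than that and the cluster stops accepting writes:
# consistency and partition tolerance, at the cost of availability.
```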

To set up a distributed cloud-based infrastructure, one needs to design it. Cloud providers offer everything one could ever want for every scenario. Choosing the right offerings and configuring them correctly to match your application’s needs is not easy. In fact, it’s very complex. This is why Infrastructure-as-Code has become mainstream: it’s so complex that a good engineer needs to write actual code to configure and maintain your infrastructure. And, you guessed it, these engineers are hard to come by as well.

What engineers often do to reduce the immediate complexity is build abstraction layers. The hardware is abstracted by the operating system. The operating system gets abstracted by some container runtime. The management of (virtual) machines and networks gets abstracted by the cloud provider’s API. On top of that we use Kubernetes to abstract all of that again. Sprinkle some custom Terraform on top and you’re five, six layers deep. Engineers end up managing a huge interconnected web of highly complex systems through ultimately leaky abstractions. The resulting stack becomes so complex that nobody can truly reason about its state or failure modes anymore. This is when bad things start to happen. The unknown unknowns come rearing their heads, especially when maintenance is required to keep the myriad of components updated. Read the post-mortems of large internet-facing businesses and you’ll see the same pattern: they drowned in complexity. All that investment into redundancy and flexibility, and what did it buy them? More downtime and higher costs.

To recap, we went from simple applications running on single machines to complex distributed systems that are expensive to maintain and often do not live up to the availability expectations.

Clearly, many businesses have gone too far in their pursuit of technical excellence. They’ve ended up on the wrong side of the balance: reduced availability and maintainability due to complexity. What’s the alternative? Can we do without all that complexity?

Back in 2006, yours truly worked for a medium-sized business with a fairly sizable custom software stack. Tailor-made for their particular use case. They were the market leader in their field. The hardware running all of this took up almost half a rack. 99.99% uptime was considered normal. Some years it reached more.

Recently I found a copy of that software in one of my old backups and decided to give it a spin. On a virtual machine. On my laptop. Boy, was that software quick. A quick hammering of the heaviest endpoints showed that my little virtual machine easily outperformed that entire rack. That company is still around. Their customer base is still roughly the same size. In theory, my laptop could handle their entire demand. With ease.

Imagine what a modern, beefy server machine can handle. CPUs with 128 cores. Terabytes of RAM. SSDs that handle over 50k IOPS and read over 10 GB/s. You can now buy a single machine that is as powerful as a mid-sized data center of yesteryear. The vast majority of businesses could be powered by a single machine, especially considering that software can be made much less complex when it runs on one. Much of the resource usage growth of business software over the past decade and a half can be directly attributed to making everything distributed. Language runtimes and frameworks have, roughly speaking, only become faster over time. Single machines were already handling millions of simultaneous clients over a decade ago.

Okay, so if you can run just about any business on a single machine, what about the single point of failure? Isn’t that still a problem? Yes, yes it is. However, it’s the wrong way to look at it. The only things that matter to a business in this regard are MTBF and MTTR: Mean Time Between Failures and Mean Time To Recovery. In other words, how often does it break and how quickly can you un-break it? The argument for a single machine over a highly complex distributed system, which in theory has higher fault-tolerance, is that the complexity of the distributed system actually decreases MTBF and increases MTTR. More moving parts means more things that can go wrong. And they do. Often. And if nobody truly understands all the possible failure modes of the system, recovery also becomes more of a challenge. Switching to a hot standby, or even setting up a new machine from backups, is often much less complicated than recovering from one of the many failures of a complex distributed system.
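To put some rough numbers on that argument, here is the standard availability calculation from MTBF and MTTR. The figures below are made up purely for illustration.

```python
# Availability = MTBF / (MTBF + MTTR). The numbers are illustrative only.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A simple single machine: breaks roughly twice a year, restored from a
# hot standby or backup within an hour.
single = availability(mtbf_hours=6 * 30 * 24, mttr_hours=1)

# A complex distributed setup: more moving parts, so an incident a month,
# and untangling it takes half a day.
distributed = availability(mtbf_hours=30 * 24, mttr_hours=12)

print(f"single machine: {single:.4%}")       # ~99.977%
print(f"distributed:    {distributed:.4%}")  # ~98.36%
```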

What about scaling, you ask? Nowadays you can scale a virtual machine in-place with very little downtime, and you can scale that instance all the way up to 11. If that’s still not big enough, you can switch to a beefy physical machine. Or two, for backup purposes. Once you really start growing out of your datacenter-in-a-box, you’ve got yourself a nice luxury problem on your hands. Unless you’re doing something outrageously computationally intensive for your customers, you should by then be making so much money that you can hire full-time distributed systems specialists to scale your software across multiple machines.

Of course there are always cases where you actually need a very fault-tolerant distributed system. It’s rare, but it happens. Some services simply cannot tolerate downtime: financial, communication, healthcare, or defense systems; basically anything that touches human lives or is otherwise critical to (global) security. If you build or maintain one of those systems, nothing in this article will have been new to you, because you should know what you’re doing. Other businesses operate at such vast scale that their software would never fit on a single machine. For at least 99% of businesses, however, none of this is the case.

TL;DR: Try to run your business from a single machine first, and for as long as possible. Do not forget to think about your recovery strategy. Only when you grow out of that, or when your business has extremely high uptime guarantees to meet, should you go for a distributed system.
