I agree. I think we know how to build a single piece of software correctly. The problems arise in integration.
Volatile and Decentralized: What I wish systems researchers would work on:
“Understanding interactions in a large, production system: The common definition of a ‘distributed system’ assumes that the interactions between the individual components of the system are fairly well-defined, and dictated largely by whatever messaging protocol is used (cf., two phase commit, Paxos, etc.) In reality, the modes of interaction are vastly more complex and subtle than simply reasoning about state transitions and messages, in the abstract way that distributed systems researchers tend to cast things.
Let me give a concrete example. Recently we encountered a problem where a bunch of jobs in one datacenter started crashing due to running out of file descriptors. Since this roughly coincided with a push of a new software version, we assumed that there must have been some leak in the new code, so we rolled back to the old version — but the crash kept happening. We couldn’t just take down the crashing jobs and let the traffic flow to another datacenter, since we were worried that the increased load would trigger the same bug elsewhere, leading to a cascading failure. The engineer on call spent many, many hours trying different things and trying to isolate the problem, without success. Eventually we learned that another team had changed the configuration of their system which was leading to many more socket connections being made to our system, which put the jobs over the default file descriptor limit (which had never been triggered before). The ‘bug’ here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.
Somehow there needs to be a way to perform offline analysis and testing of large, complex systems so that we can catch these kinds of problems before they crop up in production. Of course we have extensive testing infrastructure, but the ‘hard’ problems always come up when running in a real production environment, with real traffic and real resource constraints. Even integration tests and canarying are a joke compared to how complex production-scale systems are. I wish I had a way to take a complete snapshot of a production system and run it in an isolated environment — at scale! — to determine the impact of a proposed change. Doing so on real hardware would be cost-prohibitive (even at Google), so how do you do this in a virtual or simulated setting?
I’ll admit that these are not easy problems for academics to work on. Unless you have access to a real production system, it’s unlikely you’ll encounter this problem in an academic setting. Doing internships at companies is a great way to get exposure to this kind of thing. Replicating this problem in an academic environment may be difficult.”
(Via Volatile and Decentralized.)
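To make the file descriptor story in the quote concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is my own assumption, not something from the original post: the function names, the idea that each inbound connection costs exactly one descriptor, and all of the numbers are invented for illustration. The only real API used is resource.getrlimit with RLIMIT_NOFILE, which reports the per-process descriptor limit on Unix systems.

```python
import resource  # Unix only


def fds_needed(baseline_fds: int, clients: int, conns_per_client: int) -> int:
    """Descriptors the job needs: its own files plus one per inbound socket."""
    return baseline_fds + clients * conns_per_client


def headroom(limit: int, baseline_fds: int, clients: int, conns_per_client: int) -> int:
    """Descriptors left under the limit; a negative value means exhaustion."""
    return limit - fds_needed(baseline_fds, clients, conns_per_client)


if __name__ == "__main__":
    # The real soft limit for this process, shown only for reference.
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft RLIMIT_NOFILE for this process:", soft_limit)

    # Hypothetical numbers: the other team's config change raises the
    # connections each client opens from 2 to 8. Our own code is unchanged.
    limit = 1024
    before = headroom(limit, baseline_fds=100, clients=200, conns_per_client=2)
    after = headroom(limit, baseline_fds=100, clients=200, conns_per_client=8)
    print("headroom before the change:", before)
    print("headroom after the change:", after)
```

The point of the toy model is that neither side's inputs look wrong in isolation. The limit and the client behaviour are each reasonable, and the failure only shows up when you multiply them together, which is exactly the kind of cross-system arithmetic nobody owns.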