Woe (noun, literary): great sorrow or distress.
Anyone building products on the web with interactive, collaborative, real-time, or reactive features will realize that we live in a cold, dark world of anguish. My team's SOSP21 paper on BladeRunner glosses over the depth of this topic, and today I intend to walk through my thoughts as I must build yet another streaming proxy. Perhaps this wandering of thoughts will help you on your journey as well.
The first step of this journey starts with excitement because you have found a secret power. The powers of streaming from server to client are immense. You can optimize network transmission, minimize client CPU and battery, experience low latency with low overhead, invent new ways of sharing context between server and client to optimize data flow, and generally find joy. For the first time in your life, coding is joyful. Those of us that understand protocol design welcome you.
Sadly, the second step gets harder because you must take this journey alone. For every one person that designs rich protocols with sockets there are hundreds that simply use request-response. Of those hundreds, you will find your fallen family who denounce the complexity of using streams. Streams are a radically different beast than what you get out of the box. Even now, as I write this, I have a hope that I can spark the streaming renaissance! The revolution is near! Perhaps? No, not yet. There is much work to be done because in the land of the stream: there be dragons.
The first bit of sadness is that once you get your WebSocket server up and running, then what? Well, most of the software you can exploit to build a product is request-response. Are you seriously going to poll your data store? As if! This will be your most likely reality until you discover you need a messaging product. The good news is that there are plenty. The bad news is that they mostly suck for various reasons, and you will have to become an expert in a variety of messaging stacks. You will learn to love pub/sub at first, but it eventually becomes a bad time.
This bad time generally manifests as poor reliability for a variety of reasons. The first question is whether or not the messaging stack provides delivery guarantees. If it doesn't, then guess who has to figure out how to build a reliable product: you. Worse, even if the messaging stack provides guarantees, then guess who has to monitor that it is working within reasonable bounds: you.
Now, if it does provide guarantees, then poor reliability can still manifest via back-pressure. Those guarantees are not cheap, and if you apply too much pressure then the stack will push back, and your best choice is to give up at some point. If you don't give up, then everything may be at risk due to unbounded queues. Worse yet, pub/sub over-commits such that fan-out creates more work on distribution. While this fan-out can be made to scale, it rarely does so out of the box.
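To make that "give up at some point" concrete, here is a minimal sketch in Java (not tied to any particular messaging stack) of a bounded per-subscriber outbox: when the subscriber falls too far behind, the publish is refused and the caller must decide whether to drop, degrade, or disconnect rather than let the queue grow without bound.

```java
import java.util.concurrent.ArrayBlockingQueue;

// A sketch: a per-subscriber outbox that is bounded. When the subscriber
// can't keep up, the publish is refused instead of letting the queue grow
// without bound.
public class BoundedOutbox {
  private final ArrayBlockingQueue<String> queue;

  public BoundedOutbox(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Returns false when the subscriber is too far behind; the caller must
      decide to drop, degrade, or disconnect rather than buffer forever. */
  public boolean publish(String message) {
    return queue.offer(message);
  }

  /** The subscriber drains at its own pace. */
  public String poll() {
    return queue.poll();
  }
}
```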
Within your precious WebSocket, the risk of failure is ever present.
Stretch your imagination such that request handling goes from milliseconds to minutes, hours, and days.
We can contrast this with request-response, where a single failure can throw an exception and the client gets a precise failure. With WebSocket, failing the connection at the first sign of a problem may not be reasonable, as the service will most likely converge to a capacity which is not adequate for a connect storm. The connect storm emerges as connections fail and clients reconnect. The reconnect is not cheap, and this can create congestion leading to more failures, and the death spiral is imminent. Your protocol must honestly reflect realities between client and server, and your client must be respectful.
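What "respectful" means in practice is, at minimum, exponential backoff with jitter on reconnect, so a fleet of disconnected clients spreads its attempts out instead of stampeding the service at the same instant. A minimal sketch of such a policy (the class and its parameters are mine, not any particular library's):

```java
import java.util.concurrent.ThreadLocalRandom;

// A sketch of a "respectful" reconnect policy: exponential backoff with
// full jitter so disconnected clients spread their reconnects out rather
// than hammering the service in lockstep.
public class ReconnectPolicy {
  private final long baseMillis;
  private final long maxMillis;
  private int attempt = 0;

  public ReconnectPolicy(long baseMillis, long maxMillis) {
    this.baseMillis = baseMillis;
    this.maxMillis = maxMillis;
  }

  /** Call after each failed connection attempt; returns how long to wait. */
  public long nextDelayMillis() {
    long ceiling = Math.min(maxMillis, baseMillis * (1L << Math.min(attempt, 20)));
    attempt++;
    return ThreadLocalRandom.current().nextLong(ceiling + 1); // full jitter
  }

  /** Call once a connection is healthy again. */
  public void reset() {
    attempt = 0;
  }
}
```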
Your socket will potentially last a long time, and you will have to contend with that. Your WebSocket server can't lose track of anything. You can't leak memory. You can't forget to complete a promise. There are no time-outs to save you. Your code must work damn near perfectly. The key problem is that a broken stream looks a lot like an inactive stream, and knowing the difference is exceptionally hard.
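The only trick I know of to tell the two apart is an application-level heartbeat with a deadline: one side promises to say something within a window, and silence past that window is treated as a broken socket. A sketch, with the deadline left as a tuning knob:

```java
import java.util.concurrent.TimeUnit;

// A sketch: the peer sends application-level heartbeats, and silence past
// the deadline is treated as a broken stream rather than an inactive one.
public class HeartbeatMonitor {
  private final long deadlineNanos;
  private volatile long lastSeenNanos = System.nanoTime();

  public HeartbeatMonitor(long deadline, TimeUnit unit) {
    this.deadlineNanos = unit.toNanos(deadline);
  }

  /** Call whenever any frame (data or heartbeat) arrives. */
  public void touch() {
    lastSeenNanos = System.nanoTime();
  }

  /** Poll periodically; true means the stream is silent past the deadline
      and should be torn down and reconnected. */
  public boolean isBroken() {
    return System.nanoTime() - lastSeenNanos > deadlineNanos;
  }
}
```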
The second bitter taste of defeat is that the critics of using WebSocket are correct: stateful programming is hard. Did you think those benefits would not require sacrifice? Two key failure modes must be contended with: the loss of the connection and performance degradation. The reality is that you are developing your WebSocket in summer, but winter is coming for you and your sockets.
Let’s focus on the loss of the socket.
For a variety of reasons, the socket is broken and you have an interested client lost in the wind. This means that the client and server have some sense of shared state, but data loss happened between them. Simply reconnecting the client to server requires a negotiation of who has what and where to pick things back up.
Perhaps you solve both finding your server and renegotiation. What about when the server state is blown away due to a deployment, which is the happy and often expected form of server state loss? Well, now you have to reconstruct state in a manner the client expects. Does your protocol account for version changes of software? Did you synchronize everything of importance to a durable store prior to sending it to the client?
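One sketch of how that negotiation can look (illustrative only, not Adama's actual protocol, and the DurableLog interface is a made-up stand-in for whatever durable medium you trust): stamp every update with a monotonic sequence number, have the reconnecting client present the last one it applied, and let the server either replay the gap or admit defeat and ask for a full resync.

```java
// A sketch of a resume handshake: every update carries a monotonic
// sequence number; on reconnect the client presents the last one it
// applied, and the server either replays the gap or asks for a resync.
public final class ResumeHandshake {
  public interface DurableLog {
    long head();                                 // newest sequence the server knows about
    long oldestReplayableFrom();                 // smallest 'fromExclusive' the log can serve
    Iterable<String> replay(long fromExclusive); // every update after 'fromExclusive'
  }

  public static final class Decision {
    public final boolean fullResync;
    public final Iterable<String> updates;
    Decision(boolean fullResync, Iterable<String> updates) {
      this.fullResync = fullResync;
      this.updates = updates;
    }
  }

  /** Server side: decide how to catch a reconnecting client up. */
  public static Decision onReconnect(DurableLog log, long clientLastSeq) {
    boolean tooOld = clientLastSeq < log.oldestReplayableFrom(); // the gap fell off the log
    boolean tooNew = clientLastSeq > log.head();                 // server lost state (e.g. a deployment)
    if (tooOld || tooNew) {
      return new Decision(true, java.util.List.of()); // the only honest answer: start over
    }
    return new Decision(false, log.replay(clientLastSeq));
  }
}
```

The same handshake doubles as a periodic checkpoint: the client can present its sequence at any time, not just on reconnect, to confirm that nothing has silently gone missing.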
Clearly, handling socket loss is a big deal. However, the internet is a wild place, and your nice socket born in summer will have to deal with poor performance. Packets will get lost and networks will catch colds (i.e. get congested). Either your client or the server will try to send too much, and that pressure will build. While the network library protects itself with flow control, does your code adjust to the poor performance? Does it leverage any type of flow control? Or does it simply use an unbounded queue with a small prayer? Perhaps you will get lucky, and your queues will stay small.
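If, for example, your server sits on Netty (an assumption on my part), the channel already tells you when its outbound buffer is past its high-water mark; respecting that signal instead of queueing blindly is the difference between bounded memory and the small prayer. A sketch:

```java
import io.netty.channel.Channel;

// A sketch, assuming a Netty-based server: the channel knows when its
// outbound buffer is past the configured high-water mark. Respecting that
// signal keeps memory bounded under a slow or congested client.
public final class RespectfulWriter {
  private final Channel channel;

  public RespectfulWriter(Channel channel) {
    this.channel = channel;
  }

  /** Returns false when the client can't keep up; the caller should batch,
      coalesce, or drop rather than write anyway. */
  public boolean tryWrite(Object frame) {
    if (!channel.isWritable()) {
      return false; // back-pressure: buffer is past the high-water mark
    }
    channel.writeAndFlush(frame);
    return true;
  }
}
```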
So, you go on and make your protocol robust to both disconnects and congestion, but what about scale? What happens as you multiply out the number of WebSocket servers? Now, scale is fundamentally a game of indirection by sharding and routing via proxies, so what happens? Well, either you need to treat your WebSocket connection as “basically stateless” or congratulate yourself because you are on the path to building a database.
Let’s define “basically stateless” as only holding onto state that is easily reconstructed or easily lost. Technically speaking, a WebSocket connection is stateful as it is using TCP, and there is a dash of state to maintain that TCP connection. The question is what mess of code you put in place to manage that socket to make it useful, and if any of that state is important, then it should survive all the failure modes of the socket, the process, and the machine. Not only that, but the state on that machine must survive the failure modes of the proxy talking to it. That is, the state must be found.
Some state which is “basically stateless” is an optimization for better performance. For example, caches of immutable key-value pairings from your data store make a great thing to pair with a connection, as this lets you manifest some of the potential of that connection. However, this creates a debt such that a catastrophic socket loss creates tremendous reconnect pressure. The benefits of WebSocket do not come cheaply. The beautiful and cheap potential comes with a submarine of an issue which can be exceptionally problematic if you don’t account for it with back-pressure or spare capacity.
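A sketch of what "basically stateless" state can look like: a per-connection cache of immutable entries that can be thrown away at any moment and rebuilt from the durable store, so losing it costs performance (that reconnect pressure), never correctness. The loader function here is hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A sketch of "basically stateless" state: a per-connection cache of
// immutable entries. Losing it (socket death, process death, deployment)
// costs a round of reloads from the durable store, never correctness.
public final class ConnectionCache<K, V> {
  private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
  private final Function<K, V> loadFromStore; // the durable source of truth

  public ConnectionCache(Function<K, V> loadFromStore) {
    this.loadFromStore = loadFromStore;
  }

  public V get(K key) {
    return cache.computeIfAbsent(key, loadFromStore);
  }

  /** Safe to call at any time; the next get() simply pays the reload cost,
      which is exactly the reconnect pressure described above. */
  public void dropAll() {
    cache.clear();
  }
}
```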
If you deploy your application regularly, then you will have to decide how the proxy handles that. Does it proxy the loss of the socket up to the client, or does it retry locally? There are advantages to both approaches, but there be dragons with retrying locally, since this creates a natural data-loss hole in the stream. Recovering from that data-loss hole requires either additional signaling or a way to close the gap when connectivity is restored.
Adding signaling is exceptionally helpful for the engineers debugging why things go bump in the night, but such signaling does not make strong guarantees. At some point, you will need a protocol between client and server to checkpoint that everything between them is healthy.
Suppose you get a load balancer that has sticky routing. Well, I have yet to find a load balancer with reactive sticky routing. The load balancer makes a decision based on a header to route a WebSocket request to a machine. In modern cloud environments, machines are coming and going, and this invalidates those sticky decisions. Sticky load balancing should be considered optimistic for aligning request caches.
Again, we find that the long-lived nature of a WebSocket invalidates many assumptions baked into the things you can buy in a request-response world.
A key problem with maintaining a stateless ideology is that you must be mindful of it and always push your state out of your application and into a persistent medium. Unlike PHP or a framework that enforces a lifecycle of loss, the discipline to do the right thing is entirely on your shoulders.
This stateless ideology creates a large market of very interesting data stores, messaging stacks, loggers, queues, and other services because the stateless ideology has no state. And yet, state is the most interesting aspect of computing. Failure to outsource your state will be a disaster unless you commit yourself to building a database, but this diaspora of potential ways to store your data creates an interesting burden for building products. That is, how much of your code is simply reading and writing data versus doing the thing you need to do?
Suppose you discover a novel way to use a WebSocket, and there is nothing available to buy to really offload that burden. Great! There has never been a better time to learn about routing and consensus.
Routing is a tricky problem because modern cloud environments have machines coming and going. If you decide to treat your machines like cattle, then this requires figuring out consistency and precise stickiness. The key problem is giving the proxies an accurate map of where your state is, and then developing a protocol between the proxy and the database such that they can negotiate where the state actually is.
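The proxy side of that negotiation can be sketched as an advisory map: the proxy remembers where it last found a key's state, but the host remains the authority, and a "not here" answer causes the proxy to forget the entry, re-resolve, and retry. The names here (resolve, HostHandle) are hypothetical, not a real API:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A sketch of proxy-side routing: the map is advisory; the host is the
// authority on whether it actually owns the key's state.
public final class AdvisoryRouter {
  public interface HostHandle {
    boolean owns(String key); // ask the host, not the map
  }

  private final ConcurrentHashMap<String, HostHandle> map = new ConcurrentHashMap<>();
  private final Function<String, HostHandle> resolve; // e.g. a directory lookup

  public AdvisoryRouter(Function<String, HostHandle> resolve) {
    this.resolve = resolve;
  }

  public HostHandle route(String key) {
    for (int attempt = 0; attempt < 3; attempt++) {
      HostHandle host = map.computeIfAbsent(key, resolve);
      if (host.owns(key)) {
        return host;
      }
      map.remove(key, host); // stale entry; forget it and re-resolve
    }
    throw new IllegalStateException("unable to settle ownership for " + key);
  }
}
```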
Once you have machines coming and going, you now have to contend with the state held within your WebSocket service. Lo and behold, you can’t escape the need to read and write from a durable medium. Perhaps you use the local disk, and now you have to replicate (or erasure encode) it across multiple machines.
Given all these problems, should you abandon the benefits of WebSocket and join the critics?
If you lack an iron will, then you should give up now on a WebSocket. We don’t yet have good things to buy off the shelf, but I believe some good things are emerging (slowly).
Firstly, you really need to abandon the command pattern with pub/sub. You need to stop telling peers what to do and instead nudge them or share data. The core reason to abandon this pattern is that it is expensive to scale if you care about reliability. At its core, the command pattern relies on an implicit infinite queue which you can’t handle in finite time.
Something worth noting about BladeRunner: a large chunk (90%+) of messages were thrown away. In fact, the value proposition of BladeRunner was to accept as many messages as possible internally and throw most of them away as intelligently as possible before the last mile.
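The simplest form of that intelligence is last-value conflation, sketched below (illustrative, not BladeRunner's actual mechanism): keep only the newest message per key between deliveries, so a congested last mile receives the current state rather than a backlog of every intermediate update.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A sketch of last-value conflation: accept everything, but between
// deliveries keep only the newest message per key.
public final class Conflator {
  private final LinkedHashMap<String, String> latestByKey = new LinkedHashMap<>();

  /** Accept everything; newer updates silently replace older ones. */
  public synchronized void accept(String key, String message) {
    latestByKey.remove(key); // preserve arrival order of the surviving update
    latestByKey.put(key, message);
  }

  /** Drain whatever survived when the client is ready to receive. */
  public synchronized Map<String, String> drain() {
    Map<String, String> batch = new LinkedHashMap<>(latestByKey);
    latestByKey.clear();
    return batch;
  }
}
```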
An alternative approach is to listen to data, but the problem with listening to data is that it is even more expensive when hacked on top of existing databases and query languages. I have come to believe that the only way to focus on data is to layer reactivity as a core primitive at the bottom and then combine that reactivity with Calculus.
Having laid out the problems, I want to pivot and talk about Adama as a foundational infrastructure piece (albeit in its infancy). The first thing to realize is that request-response is so resilient that it demands our utmost respect, and we need to leverage it before we optimize it.
As much as this pains me, the ideal is to poll for data. Now, this is far from ideal on a technical front, as polling has problems. However, the spirit is that you ask for data and then you get data. You don’t interact with an agent or broker; you interact with data. This is why event streaming is so compelling with things like Kafka-SQL: they are resilient in many ways. Streaming systems are exceptionally hard. Fortunately, if you have a good streaming system, then when it fails, you simply ask for the data again. That’s the request-response resilience at play, since there is no mystery of where things stand: you ask for data, and you get it.
Adama embraces this such that you could just poll the state of a document. This yields incredible resilience. The WebSocket comes in as a downstream optimization such that updates flow over the socket in a differential format. You get all the wins, as differentials minimize CPU, network, battery, and even the cost of rendering those updates to the user. Given that differentials are in a standard format, they also handle congestion exceptionally well since they can batch on the server when flow control kicks in. The size of what is stored within the server asymptotically approaches the size of the data fetched during a poll.
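To make the differential idea concrete, here is a sketch in the spirit of a JSON merge patch (not Adama's actual wire format): the client holds the document as a tree, each update is a sparse tree of changed fields with null meaning "remove", and applying a batch is just applying the deltas in order.

```java
import java.util.Map;

// A sketch of applying a differential update to a document held as a
// tree of maps: null removes a field, nested maps merge recursively,
// everything else replaces the old value.
public final class Differ {
  @SuppressWarnings("unchecked")
  public static void apply(Map<String, Object> document, Map<String, Object> delta) {
    for (Map.Entry<String, Object> entry : delta.entrySet()) {
      Object value = entry.getValue();
      if (value == null) {
        document.remove(entry.getKey());
      } else if (value instanceof Map && document.get(entry.getKey()) instanceof Map) {
        apply((Map<String, Object>) document.get(entry.getKey()), (Map<String, Object>) value);
      } else {
        document.put(entry.getKey(), value);
      }
    }
  }
}
```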
On the upstream, Adama embraces request-response between the client and the document. Instead of publishing to a bunch of subscribers, you send a single message to a single document with traditional request-response. If it fails, then try again (idempotent behavior, i.e. effectively exactly once, handled via a unique marker). This has a remarkable advantage over pub/sub since there is no fan-out; that one-to-one relationship between client and document creates an easy-to-measure sense of reliability.
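A sketch of the unique-marker side of that (the bounded in-memory history is my assumption; a real document would persist what it has seen): the client attaches a marker to every message and retries freely, while the document remembers recent markers and silently acknowledges duplicates, turning at-least-once delivery into effectively-once processing.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// A sketch of deduplication by unique marker: remember the last N markers
// and treat repeats as already-processed retries.
public final class Deduper {
  private final Set<String> seen;

  public Deduper(int rememberLast) {
    this.seen = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>() {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > rememberLast;
      }
    });
  }

  /** Returns true if this marker is new and the message should be applied;
      false means it was already processed and the retry can simply be acked. */
  public synchronized boolean firstTime(String marker) {
    return seen.add(marker);
  }
}
```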
This is then the battlefront and helps outline the requirements for the streaming proxy which I have to build. The hardest problem on my shoulders at the moment is prioritizing how to address this given that I like to build solutions which scale.
However, the task on my shoulders is no longer herculean as I can focus on enabling request-response for everything except the document connection. And, when things go bump in the night on the document connection, then I can simply retry without worry. The hardest task I have is to ensure that I don’t lose the book-keeping game and leave the connection dangling. As such, I will bias initial implementations to signal all failures up to the client to retry.
Well, we live in a world with a lot of stuff to buy that just works, but there are always minority cases which require something a bit more. The problem is that taking the first step towards getting that bit more may be more expensive than it seems.
I always like to quip that innovation is ten steps forward and nine steps back.
With a WebSocket, that’s clearly true, and I believe the critics are right on many fronts. However, we should never let the critics force our hand either, and I believe there are new things we should build so that they become available to buy. There are a handful of key efforts that I’d like to see explored within the open source community.
The first is a protocol design language for bi-directional communication; I’ve got a start with my api-kit code generator, but it is being designed explicitly for my use-case. It may be worth generalizing at some point, as I am building a language for how I think about these things. It would be great if this protocol design language included connection management as well, rather than just message layout.
The second is a good streaming proxy which can provide many of the solutions to the mentioned problems. A key problem is that it will require contending with consensus and consistency issues, and I’m looking at Atomix as a possible building block.
Finally, I’m wondering if the asymmetric client/server model that Adama uses (i.e. the client sends request-response and the server emits a differential data stream) is worth generalizing on its own.