Rafał Studnicki

Software Engineer @ Whatnot

Rafał is a software engineer with 12 years of experience in C, Erlang, and Elixir. He has worked on various distributed systems, ranging from tiny clusters on microcontrollers to some of the largest chat servers in the world.

Having consulted on many real-world projects, he has come to believe that clean architecture, ruthless simplicity, and a principled stance towards testing for correctness are required for software to serve its business purpose successfully in the long run.

Currently, Rafał is a software engineer at Whatnot, a rapidly growing live shopping platform.

Deploying an Elixir cluster that keeps stateful connections with its clients and manages distributed state is usually much more challenging than deploying stateless services.

At Whatnot, we learned this the hard way.

With every deployment, there was a big risk of data inconsistencies that were very disruptive to auctions in progress. This, of course, led to buyers' dissatisfaction and sellers' financial losses. Consequently, we limited deployments to off-peak hours.

In this talk, we will present a case study of how we drastically increased the reliability of our Elixir service.

We did this by automatically verifying the system, under various conditions, against most of the problems we had been experiencing. We tested deployments and locally simulated scenarios in which nodes went up and down at random.

Having included these new tests in our CI pipeline, we gained enough confidence to deploy to production after every single commit, at any time of the day.

OBJECTIVES

- Why deploying a stateful Elixir service with zero downtime is challenging;
- Why it is even more challenging when the service runs on Kubernetes;
- A mini-survey of the available cluster state management tools and their tradeoffs;
- How to test your distributed systems against various safety and liveness properties;
- How to test your system upgrades and downgrades without doing an actual deployment;
- How to make the CI pipeline reliable so it doesn't slow down your deployments.

AUDIENCE

The talk is for everyone interested in building software in Elixir and building more reliable distributed systems.
