At Smartly.io, a new technology stack can be introduced in several ways, one of which is to address a particular challenge. Late last year, we were faced with the challenge of modernizing our notifications system for end-users, making it fit both the changing requirements of the industry and the needs of our internal development teams. In this post, I’ll walk you through some of the considerations and learnings we encountered along the way. It will not be a deep dive into Elixir and its surrounding ecosystem; instead, I’ll offer a glimpse into how we approach language and technology evaluation and adoption.
In its then-current state, our notification system was baked into some older parts of our service ecosystem and heavily resisted far-reaching changes. We also felt that utilising more modern, ubiquitous technologies like WebSockets would suit us better and provide our users with a more engaging experience – and well, let’s just say it had a host of other small problems that made it a prime candidate for a rewrite.
Now, a rewrite does not require a change in technologies: we could have stuck to the more or less blessed stack of Node.js/TypeScript for our back-end, but we saw an opportunity here to try something new in a part of the application that is not necessarily on the hot path.
In other words, we were in a fairly good position to experiment and explore.
We set off on a scoping and requirements gathering exercise, and managed to draw a circle around the features we wanted to build and deliver in due time. One of the themes that kept cropping up in our discoveries was user experience: the look and feel of the whole thing.
Our goal was to provide a slick, real-time experience to our users. There are a couple of technologies currently available that can facilitate this, but WebSockets was the most mature of them. The next step was to track down some language and framework combinations with significant support for long-running WebSocket connections.
Apart from UX demands, we also set some expectations from an engineering point of view:
- There need to be strong primitives in the framework - with strong community support - for both building and testing persistent connections, queuing, storage, etc.
- If we have to fight the framework, we’re doing it wrong. It should be opinionated enough, without getting in the way.
- If we can reduce moving parts, we’re doing it right.
With these requirements in hand, we opened up the floor for any suggestions and sure enough, one option quite quickly gained traction.
Elixir is a fairly new language, but it runs on extraordinarily robust and battle-tested concepts and technology, at least by modern-day web development standards. It builds and runs on top of the Erlang VM (the BEAM), doing so, in part, by leveraging Erlang/OTP.
OTP, the Open Telecom Platform, is a collection of tools, libraries and middleware that enable developers to build scalable, robust systems. It is the lower-level ecosystem supporting Erlang – and thus Elixir – development.
I consider Elixir to be a niche language, albeit a very mainstream and capable language within that particular niche. It might not be the fastest or most optimized, but it commits to a few core tenets – concurrency, fault tolerance and isolation of errors, persistent connections – and executes on those really well.
Ultimately, we selected Elixir for several reasons:
- Developer Experience – Elixir has a reputation for not only being developer-friendly, but having developer experience and productivity at its core.
- Integrated feedback loop – Coming from Node.js, having the One True Way to run unit and integration tests, and a very clean way of running them in parallel, is refreshing.
- Purpose – As mentioned above, our requirement was to support and build upon a system of near-real-time communication. Many other technologies do offer some level of library support, but we were hard-pressed to find a better solution than the primitives and higher-level constructs provided by Elixir and Phoenix.
- Interest – A very enthusiastic engineer within the team took ownership of this project.
Our first steps in Elixir
Once we had settled on Elixir, we were ready to roll up our sleeves and actually start building something. This is where the champion steps in – a person who not only shows interest in the topic but is also willing to coach and mentor their peers. In my experience, it’s quite risky to try to adopt something that no one in the team has any experience in.
In true champion spirit, one of our own team members took ownership of building the seed project. While he was busy tinkering away, we purchased some books, found some tutorials, installed the blessed extensions and started learning this shiny new stack. It did not take long for us to have something; it was missing a lot of functionality, but it was just enough to give us a taste of how the language is structured, how the web framework hangs together and how we deploy it.
Now of course, introducing a new stack is not all smooth sailing. It should come as no surprise that adopting something new comes with a substantial amount of minutiae that need addressing. We had established ways of running services within our infrastructure, and our job now was to slot our Elixir-based service into that infrastructure.
From getting logging to work and exposing a working Prometheus endpoint to integrating with our internal authentication mechanisms – let’s just say we had quite a few boxes to tick in order to get everything to play nicely.
But in the end, we have something that works! So let’s talk about some of our key learnings.
Clusters within clusters — Understand where Kubernetes ends and Erlang/OTP begins
There are competing, yet complementary philosophies at play here. The application is running on top of Kubernetes within Smartly, which brings with it a whole raft of *stuff* – a scheduler, failure detection, logging integration, load balancers... everything and the kitchen sink.
The application itself is running on top of Erlang/OTP, which in turn drags a fair bit of tooling and opinionation into the fray - failure detection, process management, supervision tree(s). You name it, it’s probably there somewhere.
José Valim, the creator of Elixir, has written in depth about how these two runtimes coexist, and by and large, the technical aspect of his argument is sound. But what we’ve observed is that there are a few challenges with regard to expectations that needed to be overcome.
Failure escalation — Come to grips with how failure occurs and propagates
Our main competency within the team is running Node.js applications, and while a lot can be said about Node.js, it is not a complex programming model. There’s a script, you run the script, it blows up or keeps on running. Everything else is yours to manage and control.
Elixir, or rather the Erlang VM underneath it, tries really hard not to crash the entire operating-system process. Instead, it lets parts of the application crash and burn, after which a supervisor restarts the affected processes in your supervision tree based on configuration. Kubernetes does not get involved here because the pod itself keeps running, so our regular approach of finding out that bad things are happening simply does not apply.
This happened recently when a critical part of the application was unable to talk to an underlying database, and instead of the usual symptom – pods crashing and burning left, right and centre, making our alerting system go off like nothing else – we were faced with an application that looked functional on the surface, but was heroically trying to stem the flood of dying processes internally. Needless to say, we added more monitoring and learnt our lesson.
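To make the restart behaviour concrete, here is a minimal sketch using only the Elixir standard library. The module names (`Demo.FlakyWorker`, `Demo.Wait`) are hypothetical and not taken from our code base; the worker merely stands in for something like a database-backed process.

```elixir
defmodule Demo.FlakyWorker do
  # Hypothetical worker; imagine it holds a database connection.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule Demo.Wait do
  # Polls until a process registered under `name` exists with a pid other than `old_pid`.
  # Returns the new pid, or nil if no restart was observed within the attempts.
  def for_restart(name, old_pid, attempts \\ 100)
  def for_restart(_name, _old_pid, 0), do: nil

  def for_restart(name, old_pid, attempts) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        for_restart(name, old_pid, attempts - 1)
    end
  end
end

# :one_for_one restarts only the child that died.
{:ok, _supervisor} = Supervisor.start_link([Demo.FlakyWorker], strategy: :one_for_one)

old_pid = Process.whereis(Demo.FlakyWorker)

# Simulate a crash. The VM (and therefore the pod) keeps running.
Process.exit(old_pid, :kill)

new_pid = Demo.Wait.for_restart(Demo.FlakyWorker, old_pid)
IO.puts("worker restarted under a new pid: #{new_pid != old_pid}")
```

From the outside nothing has happened: the operating-system process never stopped, so there is no pod restart for Kubernetes to report. Only if a child keeps crashing past the supervisor’s restart limit (`max_restarts`, by default 3 within 5 seconds) does the supervisor itself exit and push the failure further up the tree. That is why restarts inside the VM need their own monitoring.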
One is the loneliest number — Account for redundancy in a stateful system
Traditionally, one of the hardest steps to take when scaling your infrastructure is to go from one to two. When vertical scaling is no longer a viable or preferred option, teams need to look at horizontal scaling instead. In the current landscape of Kubernetes clusters and small, disposable services, horizontally scaling your application has even become the de facto standard.
This puts the onus on us to design a system that can work in a distributed way. For all intents and purposes this would be easy if we were planning to run a simple request-response system, but we’re not: we’re aiming to build a perpetually connected system with a distinct connection per user. So the one thing we really don’t want to deal with is duplicated WebSocket connections, or having to duplicate our message handling. In short, all copies of our application need to be able to communicate with one another and pass user-bound notifications to the copy of the application that originally established the WebSocket connection to said user.
Luckily, this is not an unsolved problem. Even more fortunately, a solution is provided within Phoenix, the de facto Elixir web application framework, itself. This solution takes the shape of Phoenix PubSub. At its core, it allows the copies of our application to communicate with one another. But while PubSub is the API through which applications communicate, it does not by itself provide the means by which information makes it from one application to another. This message transport back-end needs to be provided separately, and takes the form of adapters.
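The subscribe/broadcast shape can be sketched without any dependencies. Note the substitution: the block below uses Elixir’s built-in `Registry` as a stand-in, not Phoenix PubSub, and it only works within a single node. `Demo.PubSub` and the topic name are hypothetical. Phoenix PubSub exposes the same idea through `Phoenix.PubSub.subscribe/2` and `Phoenix.PubSub.broadcast/3`, with the adapter carrying messages between nodes.

```elixir
# A duplicate-key Registry lets many processes register under the same topic.
{:ok, _} = Registry.start_link(keys: :duplicate, name: Demo.PubSub)

topic = "user:42"

# "Subscribe": the current process stands in for the one holding a user's WebSocket.
{:ok, _} = Registry.register(Demo.PubSub, topic, [])

# "Broadcast": send the message to every process registered under the topic.
Registry.dispatch(Demo.PubSub, topic, fn entries ->
  for {pid, _value} <- entries, do: send(pid, {:notification, "Campaign approved"})
end)

# The subscriber finds the notification in its mailbox.
received =
  receive do
    {:notification, text} -> text
  after
    1_000 -> :timeout
  end

IO.inspect(received)
```

The publisher never needs to know which process, or in the distributed case which copy of the application, holds the connection; it only needs the topic.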
There are two adapters currently available, Redis and PG2. We could have chosen Redis, as we do have some expertise in-house on how to run this at scale. But that would have introduced another moving part to the equation, and we were keen to keep as much as we could contained within the application itself. So that brings us to PG2, a solution that leverages Distributed Erlang in allowing Erlang processes to communicate across servers. In order for us to have this ability, we have to lean on one final component within our toolbox, namely libcluster. This Erlang clustering library supports a Kubernetes-based discovery strategy. In practice, we configure the application in such a way that Kubernetes will supply and maintain a list of node IP addresses that are currently hosting a copy of our application to all copies of our application. Libcluster itself will then overlay an Erlang Cluster on top of the Kubernetes cluster.
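As an illustration, a libcluster topology using the Kubernetes strategy could look roughly like the following fragment. This is a sketch, not our production configuration: the topology name, the `notifications` base name, the label selector and the namespace are made-up placeholders.

```elixir
# config/runtime.exs (sketch; all names below are placeholders)
import Config

config :libcluster,
  topologies: [
    notifications: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        # Build Erlang node names from pod IPs, e.g. notifications@10.0.0.12
        mode: :ip,
        kubernetes_node_basename: "notifications",
        # Which pods belong to the cluster
        kubernetes_selector: "app=notifications",
        kubernetes_namespace: "default",
        # How often (in ms) to ask the Kubernetes API for the current set of pods
        polling_interval: 10_000
      ]
    ]
  ]
```

The topologies are then handed to `Cluster.Supervisor` in the application’s supervision tree, which connects and disconnects Erlang nodes as pods come and go.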
Ultimately, we’re running on clusters all the way down, but what struck us when we looked into this solution was the fact that – even with Elixir and Erlang being more niche than some of the other stacks we’re familiar with – this is a solved problem even when running on Kubernetes. Learnings like this give us hope and an insight that the community itself is not stagnant. One of the biggest risks – and quite frankly biggest fears – when adopting or trialing a new technology is finding yourself close to the finish line, but stuck because a key component you absolutely require to get the job done does not exist or is very poorly maintained. Sure, we can write it ourselves, but that carries with it its own risks.
So in short, we operate a Distributed Erlang cluster on top of/within our Kubernetes cluster, which lets us deliver notifications to our users without duplicates whilst retaining the ability to scale the back-end horizontally.
Function over Form? Commit to the functional idioms
There’s no skirting around this elephant in the room: Elixir is functional in most if not all ways that matter from a productivity standpoint. It doesn’t take strong stances with regard to purity, but it still encourages developers to adopt certain functional paradigms.
This means a shift in mindset, and when you’re bouncing between different stacks, those shifts add up. When you take into account that we’re also shifting paradigms, as it were, the effects compound. We have found that it makes the most sense to bundle different work-packages together where possible, in order to ensure that we don’t have to constantly juggle between stacks, and therefore philosophies.
What do we gain in doing so? A lot of confidence, at least – data is immutable, so if something looks like a thing, it’s practically guaranteed to remain that thing until further notice. The fact that we have dynamic typing but can also leverage a compiler to shout at us when we’re being particularly dense is a boon for productivity.
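A small illustration of that guarantee, with a hypothetical `Demo.Notification` module: because data is immutable, a function cannot change a value behind the caller’s back; it can only return a new one.

```elixir
defmodule Demo.Notification do
  # Returns a new map; the map passed in is left untouched.
  def mark_read(notification), do: %{notification | status: :read}
end

original = %{title: "Campaign approved", status: :unread}
updated = Demo.Notification.mark_read(original)

# `original` still carries its old status; only `updated` is marked as read.
IO.inspect(original.status)
IO.inspect(updated.status)
```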
Open Telecom Platform — Learn to use the tools made available to you
We mentioned the Open Telecom Platform earlier, and it is another learning curve one has to climb. OTP can be described as both a set of methodologies and guiding principles for building robust, scalable and well-architected systems, and the actual framework components that implement these guiding principles. One such example of an implemented best practice is the Generic Server (or GenServer colloquially). This component is readily available to developers, and the Elixir documentation describes it as follows:
A GenServer is a process like any other Elixir process and it can be used to keep state, execute code asynchronously and so on. The advantage of using a generic server process (GenServer) implemented using this module is that it will have a standard set of interface functions and include functionality for tracing and error reporting. It will also fit into a supervision tree.
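As a rough sketch of what that looks like in practice, here is a tiny GenServer that keeps a list of notifications as its state. `Demo.Inbox` and its functions are invented for this post and are not part of our actual service.

```elixir
defmodule Demo.Inbox do
  use GenServer

  # Client API: plain functions that hide the message passing.
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, [], opts)
  def push(server, notification), do: GenServer.cast(server, {:push, notification})
  def unread(server), do: GenServer.call(server, :unread)

  # Server callbacks: run inside the GenServer process, one message at a time.
  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_cast({:push, notification}, state), do: {:noreply, [notification | state]}

  @impl true
  def handle_call(:unread, _from, state), do: {:reply, Enum.reverse(state), state}
end

{:ok, inbox} = Demo.Inbox.start_link()

# Casts are asynchronous; the call below is synchronous and is handled after them.
Demo.Inbox.push(inbox, "Campaign approved")
Demo.Inbox.push(inbox, "Budget limit reached")

unread = Demo.Inbox.unread(inbox)
IO.inspect(unread)
```

Because `use GenServer` also defines a default child specification, a module like this can be listed directly as a child of a supervisor, which is what "fits into a supervision tree" means in the quote above.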
There is a plethora of components available to us, but as with learning any new framework, that richness comes with a burden to learn both how to best utilise these tools, and – sometimes even more importantly – when not to use them.
So now… WE REWRITE ALL THE THINGS?!
Short answer: No.
Long answer: It depends. This was a very good experiment to run, and there is nothing precluding us from considering Elixir for another use-case. If we find another problem where Elixir could provide an edge, it will most certainly be a contender, but there are always a handful of things that need to be taken into account. The maturity of the team and its willingness to support and own a project fully are hard requirements – we cannot throw this over the fence for another team to support. Fully understanding the scope of the problem is also imperative, and if Elixir turns out to be the wrong choice after all, a pivot to another stack should always remain a realistic option.
That being said, Elixir feels like a strong platform to build software on, and one that definitely inspired me to learn it and adopt it into my own set of tools.
Learn more about being a developer at Smartly.io: smartly.io/careers/developers.