Killing a Monolith — How Smartly.io Reworked their Architecture

Two years ago we realized that something had to be done. Our Engineering team had been growing rapidly, but the backend framework we were using was home-grown and lacked documentation, and it was a struggle to get new developers familiar with it.

Our system had grown in size and in the number of interconnected parts, which made it difficult to understand as a whole. The increased complexity was slowing down our product development and resulting in more and more bugs.

We knew that in the future we would still have to be able to make changes quickly and turn ideas into features. We also understood that both our engineering team and our code base would keep growing, so we needed an architecture that could support both.

Alternative paths forward

We identified two possible paths forward: we could either rewrite the entire system with a well-known framework, or split the system into smaller, independent services. Rewriting an application offers the opportunity to "get it right this time" and find clear boundaries between different modules which each team could develop further independently. In practice, though, larger rewrites tend to take ages.

Worse still, we wouldn’t be able to work on any new features during the rewrite. That’s not really an option for us in this fast-evolving industry, where customers expect to get their hands on the newest Facebook advertising tools as soon as they’re out on the API.

So, we decided to go with splitting the system into microservices instead. Extracting small pieces one at a time lets us ship features incrementally and allows time for gathering feedback from customers to help us keep on improving the product constantly.

We also wanted our development teams to be loosely coupled and able to work autonomously on their own areas of the product. If we succeeded in separating parts of the system into independent units, the teams would be able to develop, test and deploy code without constant coordination with other teams. This, in turn, speeds things up and minimizes risks as the number of development teams within Smartly.io grows.

Where to start?

Once we had decided to go with microservices, it was time to look for suitable components to extract as independent services. We looked for areas of the code base that interacted least with the rest of the system and that we expected to change in the future. Few interactions would make the extraction more straightforward, and we would benefit most from the new code base in the places where we needed to evolve it quickly.

Our product feed processing system seemed like a really good candidate. It is the part of our system that downloads our customers’ product feeds and transforms them into the format that Facebook expects. This allows our customers to connect product feeds in various formats to Facebook, which would not be possible otherwise. The system interacted with only a few core objects, and those interactions could be implemented with API calls. We also knew we would develop the feed processing a lot in the future.
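To make the transformation step concrete, here is a minimal sketch of mapping one customer's CSV columns onto a target format. It is an illustration only, not our production code: the source column names, the target field names, the `FIELD_MAP` table, the `transform_feed` function and the price format are all assumptions made for this example.

```python
import csv
import io

# Hypothetical mapping from one customer's column names to target field
# names. Both sides of the mapping are assumptions for this example.
FIELD_MAP = {
    "sku": "id",
    "name": "title",
    "url": "link",
    "image": "image_link",
    "price_eur": "price",
}


def transform_feed(source_csv: str) -> list:
    """Map each row of a customer's CSV feed onto the target field names."""
    reader = csv.DictReader(io.StringIO(source_csv))
    products = []
    for row in reader:
        # Rename the columns and trim stray whitespace from the values.
        product = {
            target: row[source].strip() for source, target in FIELD_MAP.items()
        }
        # Normalize the price to "<amount> <currency>" (assumed target format).
        product["price"] = f"{float(product['price']):.2f} EUR"
        products.append(product)
    return products


sample = (
    "sku,name,url,image,price_eur\n"
    "42,Red shoe,https://example.com/p/42,https://example.com/i/42.jpg,19.9\n"
)
result = transform_feed(sample)
```

A real feed processor would also have to handle downloads, other input formats, validation and malformed rows, all of which this sketch leaves out.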

We wanted to get something to production as soon as possible in order to collect feedback about the new design and verify that we were on the right track. We outlined the absolute minimum viable product and got to work.

After a month of coding we had a new service running that was able to download a simple CSV feed and transform it into a Facebook-compatible format. It was far from feature parity with the old system, but we had already implemented authentication, load balancing and logging for the new service. These things are vital when working with multiple services, but we did not have them with the old system. Over the following months we kept adding new features and re-implementing old ones, while moving customers to the new service as soon as it was possible.
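One way to move customers over gradually is to route each customer's requests to either the old code path or the new service. The sketch below shows that idea in its simplest form. It is a hypothetical illustration, not a description of how our routing actually works: `FeedRouter`, `legacy_process` and `new_service_process` are names invented for this example.

```python
from typing import Callable, Set

FeedHandler = Callable[[str], str]


def legacy_process(customer_id: str) -> str:
    # Stand-in for the feed processing code inside the old monolith.
    return f"legacy:{customer_id}"


def new_service_process(customer_id: str) -> str:
    # Stand-in for a call to the newly extracted feed service.
    return f"new-service:{customer_id}"


class FeedRouter:
    """Sends each customer to the old or the new implementation."""

    def __init__(self, legacy: FeedHandler, new: FeedHandler) -> None:
        self._legacy = legacy
        self._new = new
        self._migrated: Set[str] = set()

    def migrate(self, customer_id: str) -> None:
        # Move one customer once the new service covers what they need.
        self._migrated.add(customer_id)

    def roll_back(self, customer_id: str) -> None:
        # Send the customer back to the old path if a problem shows up.
        self._migrated.discard(customer_id)

    def process(self, customer_id: str) -> str:
        handler = self._new if customer_id in self._migrated else self._legacy
        return handler(customer_id)


router = FeedRouter(legacy_process, new_service_process)
before = router.process("acme")
router.migrate("acme")
after = router.process("acme")
```

Keeping the decision per customer means a customer can move as soon as the new service supports the features they use, and can be moved back without affecting anyone else.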

Working with small batch sizes to ship quickly

The decision between a bigger rewrite and extracting the pieces one by one was a risk-management issue. On the one hand, we wanted to restructure the architecture so that it would support us for years to come, but doing so in one go would have carried a high risk of never finishing. On the other hand, by splitting the problem into smaller pieces we were able to reduce that risk.

Even large greenfield projects that do not contain any existing code get harder to finish if their scope explodes. One reason for that is feature bloat. Because shipping the project becomes such a large obstacle, everyone involved wants to get their feature into it. Doing so extends the scope even more, the project is delayed even further, and the vicious cycle continues.

The solution to the problem is not to try to ship the project faster or to prevent any changes to the original plan. After all, those changes and new features were probably brought up because the team learnt something new about the problem the project was set to solve.

Instead, we should aim to reduce the size of the deliverables so that there are more opportunities to gather feedback from the end users and change plans if needed. When small parts of the project are constantly being shipped, the pressure to bundle multiple features into one big deliverable lifts. In effect, we are fighting feature creep not by fixing the goal but by learning what is important and really needed.

We applied the same idea of small batch sizes when figuring out what to do with the old monolithic system. Rewriting the system is analogous to starting a large project that lasts months in the best case and years in the worst. When working with smaller parts one at a time, we are able to ship small increments at a constant pace.

Experiences from extracting the first service

The best part of shipping functionality in small, incremental batches was the tight feedback loop it gave us between the production system and the team. It helped us spot and fix mistakes early in development. Had we held back the launch of the entire system until we reached feature parity with the old version, any mistakes we made would have surfaced much later. Fixing them would also have been harder, as we would have made more design choices based on false assumptions.

Building and shipping code in small batches turned out to be so successful that we started to incorporate it into everything we do. Since completing the feed processing service rewrite, we have extracted many other components, like our reporting system.

We’ve also extended the philosophy outside writing software. For instance, when we’re planning new features we try to break them into smaller pieces that can be built in just a few days, so we can start gathering feedback quickly. Similarly, in team retrospectives, we try to come up with small, incremental and measurable changes in how we work, and afterwards analyze what effect the change had and repeat the cycle.

Finishing a rewrite project is always a challenge; it does not happen without being actively pushed. Although it took a long time, over a year in total, to reach feature parity with the old system, we were able to shape the architecture and implement new features with the new system early on. All in all, working with small pieces allowed us to keep providing new features for our customers while reworking the architecture to better serve our future needs.

February 28, 2018