The fourth and biggest DevTalks took place two weeks ago, with a full house of 150 developers and two great speakers, Neal Ford and Jeremy Edberg. This time, the talks revolved around building evolutionary architectures and building stable, well-instrumented distributed systems.
DevTalks is a free event for developers who are hungry for learning, powered by the Smartly.io Engineering team in the spirit of our culture of maximizing learning. This blog post recaps the highlights of the two talks and gives some examples of how we are leveraging some of these learnings in our own product development.
Jeremy Edberg is an angel investor and advisor for various incubators and startups, and the CEO and founder of MinOps. He was the founding Site Reliability Engineer for Netflix and before that he ran ops for Reddit as its first engineering hire.
Jeremy Edberg: Building a stable and well instrumented distributed system
The key focus of Jeremy’s talk was how to ensure that a distributed system is stable and that its monitoring generates actionable insights to increase the system’s resilience, scalability, efficiency and velocity. His talk also included some exciting war stories from his time at Netflix, when they were moving from a monolith to microservices.
Jeremy emphasized that you should always aim to automate your infrastructure completely: the application startup, configurations, code deployment and system deployment. Other repeated tasks, such as those in DevOps runbooks, should be codified as well. Ideally, you would rely entirely on automated testing and, together with working monitoring, could start to test in production. He also pointed out that automated canary deployments are essential for mitigating the negative impact of possible mistakes.
Automating everything strikes a chord with us at Smartly.io. One of our core values is to Work Smartly, which essentially means that our goal is to automate manual work for both our customers and ourselves. For example, we have refined our process of deploying new code in order to make it as easy and automated as possible.
Previously, we used a chatbot to deploy new code: after merging your PR, you would simply tell the chatbot to deploy your code. Using a chatbot was more convenient and transparent than running deployment scripts, but it wasn’t entirely foolproof. Sometimes we would forget to deploy the merged changes. In the worst-case scenario, one of the merged PRs would have a bug and its developer would forget to deploy it. Another developer would later deploy a fix for a critical bug and, at the same time, unknowingly ship the earlier bug that had never been deployed. For that developer, it was a pain to find out which commit was causing the new issue before reverting it.
Today our new microservices are automatically deployed on merge in order to get immediate feedback about our changes. We rely heavily on automated testing and monitoring to find out about issues as quickly as possible.
Track the right metrics
Jeremy advocated for choosing business metrics over machine metrics, i.e., measuring whether customers are getting the desired service quality. He said that at Netflix this meant having alerts on the number of play starts instead of just the CPU, disk or memory usage of a single machine. My dev team at Smartly.io owns the Facebook campaign creation part of our tool, and to keep tabs on business metrics we follow, for example, the number of created ads and the number of ad videos or images uploaded to Facebook.
According to Jeremy, the goal of the DevOps team is to provide self-serve tooling for developer teams and let them decide what metrics they want to track. Developer teams are closest to the product, so they know best which metrics provide them with meaningful insights. Whenever we create a new microservice at Smartly.io, we follow a checklist that includes setting up a dashboard and configuring alerts. The dev teams own their microservices and can get help from the DevOps team for setting up their dashboards to measure whatever they choose.
Jeremy advised that, in order to get relevant metrics, you should set alerts on failures, not on the lack of success. For example, set alerts on an increase in HTTP calls returning status 500 rather than on a decrease in status 200s. The lack of 200s alone doesn’t always mean that something is wrong. It might have many causes: in the simplest case, fewer people are using the app for some unrelated reason. Failures, on the other hand, always indicate that something went wrong. Another metric that my dev team at Smartly.io is following is the number of 500s that our Facebook ad creation microservice returns.
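To make the idea concrete, here is a minimal Python sketch of an alert rule that looks only at failures. It is an illustration, not the rule we run in production: the function name, the 5% threshold and the sample traffic are all assumptions made up for this example.

```python
def should_alert_on_failures(status_codes, max_error_ratio=0.05):
    """Return True when the share of 5xx responses exceeds the threshold.

    The 5% default is an arbitrary, illustrative threshold.
    """
    if not status_codes:
        # No traffic at all is a lack of success, not a failure, so stay quiet.
        return False
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes) > max_error_ratio


# Hypothetical traffic samples: a quiet period and a busy period with errors.
quiet_period = [200] * 20
busy_with_errors = [200] * 90 + [500] * 10

print(should_alert_on_failures(quiet_period))
print(should_alert_on_failures(busy_with_errors))
```

The quiet period has far fewer 200s than the busy one, yet only the busy period triggers the alert, because only it contains actual failures.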
Jeremy urged us to avoid measuring average values and to track percentiles instead, such as the median (p50), p90 and/or p99, to reveal outliers. Developers should choose which of these alternatives they track based on business reasons. Focusing solely on averages might completely hide an issue that only the outliers are experiencing.
Also, for bimodal distributions, averages give a misleading value somewhere in between the two peaks. Some users might have speedy load times, while some have extremely slow ones. The mathematical average value for these cases is not included in either of the peaks and therefore does not represent the actual load time for any real user. Tracking percentiles, on the other hand, will show these bimodal distributions nicely. At Smartly.io, our team is following p90 for request times. We don’t feel tracking p50 is relevant in our use case as long as we can keep p90 on a reasonable level.
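A small Python sketch shows the effect with made-up numbers. The load times below are invented for illustration (80 fast requests and 20 slow ones), and the `percentile` helper uses the simple nearest-rank definition, which is one of several common ways to compute percentiles; monitoring tools may use a different one.

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical bimodal load times in milliseconds: two clear peaks.
load_times_ms = [100] * 80 + [2000] * 20

mean_ms = statistics.mean(load_times_ms)
p50_ms = percentile(load_times_ms, 50)
p90_ms = percentile(load_times_ms, 90)
p99_ms = percentile(load_times_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p90={p90_ms} p99={p99_ms}")
```

With this data the mean is 480 ms, a load time that no request actually had, while p50 (100 ms) and p90 (2000 ms) land on the two real peaks.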
How to deal with outages
One of the highlights of Jeremy’s talk was his detailed list of steps to take when dealing with outages.
- Hold post-mortems to find ways to avoid the same or similar issues in the future.
- Set up automated mitigation to achieve the best availability.
- Set up automated canary deployment: Automated rollout or rollback of canaries based on metrics.
- Avoid adding heavy policies as they will slow you down.
- Analyze graphs after the outage.
- Individual metrics alone are typically not that useful. Use a combination of correlating metrics instead.
- See if the outage could have been predicted from a set of correlating metrics.
For example, at Netflix, they noticed a correlation between their load balancer health, number of play starts and the customer support call volume.
- If it could have been predicted, set up an alert for when similar conditions take place in the future to react in time and avoid outages.
- Ideally set up automation to deal with similar cases in the future.
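As a rough illustration of the automated canary step in the list above, the following Python sketch decides between rollout and rollback by comparing one metric of the canary against the current baseline. Everything here is an assumption for the sake of the example: the function name, the choice of error rate as the metric, and the tolerance of one percentage point. A real setup would usually look at several correlating metrics.

```python
def canary_decision(baseline_error_rate, canary_error_rate, tolerance=0.01):
    """Return "promote" or "rollback" for a canary deployment.

    The canary is rolled back when its error rate is worse than the
    baseline by more than the (illustrative) tolerance.
    """
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "promote"


# Hypothetical error rates collected while the canary served some traffic.
print(canary_decision(baseline_error_rate=0.02, canary_error_rate=0.021))
print(canary_decision(baseline_error_rate=0.02, canary_error_rate=0.08))
```

The first canary stays within the tolerance and is promoted; the second is clearly worse than the baseline and is rolled back without anyone having to watch a dashboard.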
Neal Ford is a Director, Software Architect, and Meme Wrangler at ThoughtWorks, and he recently co-authored the book “Building Evolutionary Architectures”.
Neal Ford: Building evolutionary architectures
While Jeremy described which metrics are meaningful to track, which tools to use for tracking them, and other good monitoring practices, Neal took a broader view and discussed the importance of defining architectural characteristics and ensuring that they hold by setting up architectural fitness functions. Many of the metrics that Jeremy described in his talk would function well as architectural fitness functions.
A significant factor in software architecture is the set of architectural characteristics, or non-functional requirements, that the system needs to meet. These are, for example, scalability, elasticity, security, performance and many other -ilities. Architectural characteristics tend to affect each other, which turns finding the right ones and deciding how to implement them into quite a balancing act. Choosing too many architectural characteristics will overcomplicate the system.
One benefit of a microservice architecture is that it allows choosing the defining architectural characteristics for each microservice—optimizing the architecture, technologies and tools based on the requirements of that subsystem. Some parts of the system might need to be highly scalable, whereas others do not. One great point that came up in the Q&A, though, was that implementing microservices in different technology stacks will make finding the tools for fitness functions more difficult.
Instead of defining the required architectural characteristics of the system in a PowerPoint presentation or a Confluence page and having no control over whether or not the characteristics hold, Neal introduced the concept of architectural fitness functions as a way to objectively and in an automated manner assess whether the required characteristics apply for the system. In a nutshell, these are metrics on certain aspects of the architectural characteristics that, when exceeding a defined threshold, will either fail a build in CI or trigger an alert in the monitoring system. This way, you can ensure that the chosen characteristics hold now and in the future.
Here are some examples of architectural fitness functions that Neal mentioned in his talk:
- Cyclomatic complexity:
fail the build if the defined maximum complexity value is hit
- Ensure layered architecture:
Some packages should not directly call others e.g., DB should not be accessed by some packages
- Trigger alert when avg / max response time exceeds the defined threshold
(Although learning from Jeremy Edberg’s talk, let’s use percentiles (p50, p90, p99) instead of averages)
- Trigger alert when render time exceeds the defined threshold
- Response times with the number of concurrent users:
For example, trigger alert if the average response time increases by 10% when the number of users increases over 7%
- The legality of open source libraries:
Alerts when license files of used open source libraries change
- Trigger alert if a vulnerability for a used library is found
- Fail build if a library with a known vulnerability is added
- Chaos Monkey or other Simian Army tools, like Conformity Monkey, Security Monkey or Janitor Monkey
- Scientist is a Ruby library that allows experimenting with rewritten code paths without actually deploying them to users.
The tool will always return the result of the old path, but will also run the new path and compare the results and performance of the two paths
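To give a feel for what such a fitness function can look like, here is a minimal Python sketch of the cyclomatic complexity check from the list above. It is a simplification under stated assumptions: complexity is approximated as one plus the number of branching nodes found with the standard `ast` module, the maximum of 3 is arbitrary, and the `SAMPLE` source and function names are made up. Dedicated tools compute the metric more precisely.

```python
import ast

# Node types counted as decision points in this rough approximation.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)


def complexity_by_function(source):
    """Map each function name to 1 + the number of branching nodes in it."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(child, BRANCH_NODES)
                           for child in ast.walk(node))
            scores[node.name] = 1 + branches
    return scores


def violations(source, max_complexity):
    """Return the names of functions that exceed the allowed complexity."""
    scores = complexity_by_function(source)
    return sorted(name for name, score in scores.items()
                  if score > max_complexity)


# Hypothetical code under test.
SAMPLE = '''
def simple(x):
    return x + 1

def tangled(x):
    if x > 0:
        for i in range(x):
            if i % 2 and i > 3:
                return i
    return 0
'''

too_complex = violations(SAMPLE, max_complexity=3)
print(too_complex)
```

In a CI job, a non-empty result would be turned into a non-zero exit code so that the build fails, which is what makes the check a fitness function rather than a report nobody reads.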
At Smartly.io, our Data Science team has used Scientist a few times, for example when they experimented with the refactored implementation of getting revenue estimation data for campaigns, accounts or ad sets. The old implementation was getting data by directly accessing a general purpose database of the old monolith. The new implementation fetches the required data from a new microservice which has a dedicated new database for this purpose. Experimenting with the new implementation using Scientist turned out to be useful, as the team found bugs in the new implementation that they were then able to fix before actually switching to the new code path for our customers.
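The pattern behind this kind of experiment can be sketched in a few lines of Python. To be clear, this is not Scientist’s API (Scientist is a Ruby library) and not our actual code: `run_experiment` and the two estimate functions are hypothetical names invented here to illustrate the idea of running both paths while only ever returning the old result.

```python
import time


def run_experiment(control, candidate, *args):
    """Run both code paths, always return the control (old) result,
    and report whether the candidate (new) path agreed with it."""
    start = time.perf_counter()
    control_result = control(*args)
    observation = {"control_seconds": time.perf_counter() - start}

    try:
        start = time.perf_counter()
        candidate_result = candidate(*args)
        observation["candidate_seconds"] = time.perf_counter() - start
        observation["matched"] = candidate_result == control_result
    except Exception as error:
        # A crash in the new path must never affect the caller.
        observation["matched"] = False
        observation["candidate_error"] = repr(error)
    return control_result, observation


# Hypothetical old and new implementations; the new one has a deliberate bug.
def old_estimate(spend):
    return spend * 2


def new_estimate(spend):
    return spend * 2 + 1


result, observation = run_experiment(old_estimate, new_estimate, 10)
print(result, observation["matched"])
```

The caller still receives the old result, while the mismatch recorded in the observation points at the bug in the new path before any customer sees it.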
It seems that the most challenging part of fitness functions is to make sure the metrics chosen for the functions are relevant. Choosing the right metrics is always tricky and turns out to be more of a best effort than the absolute truth of the state of the system. That’s why you could argue that, for example, failing the build due to those metrics is not the right approach, as in some cases it might be necessary to “go around the threshold” to solve the problem at hand in a pragmatic way. Metrics don’t take into account the context of the change, which is why, in my opinion, we should treat these fitness functions more as guidelines than pass/fail metrics.
DevTalks will be back next fall
We thank Jeremy and Neal for the insightful talks they gave and all the DevTalks participants for the excellent questions and discussion we had after the talks. We hope you learned a lot and had tons of fun!
DevTalks will return next fall. Join our Meetup group to be the first to hear about it.