Since its earliest days, rendering images based on ever-changing product catalog data has been our largest service by traffic volume at Smartly.io. Customers can create templates in our web-based editor, combine them with their product catalog data, and automatically push endless variations of images to their advertising campaigns.
The original codebase has been aging over the 7+ years of its existence. By taking inspiration from our existing services, load testing extensively, and harnessing Kubernetes, we were able to meet the new demands on the system.
Aging Code vs Rapid Growth
One of the factors that led us to a re-write was the aging codebase. Our original Image Template service was scaling and working very well, requiring relatively modest sustaining work. However, we came to a point where supporting customers' increasingly complex needs started to make the whole system hard to maintain. Additionally, the re-write would allow us to align technologies with other teams: as Smartly.io grew, it had standardized on a different tech stack than the one used in the original service.
Project team members showcasing new Image Template previews with image variations in their native languages. Here you can find a more comprehensive look at the editor.
Over time, our customers' use cases grew more complex and demanding. For example, customers often wanted to customize a single template across multiple languages and geographic locations, which led to recurring feedback about our limited support for non-Latin characters and right-to-left text with custom fonts. Adding similar smaller features on top of the service made the code harder to maintain, and we had to find workarounds to support more sophisticated functions. With all these factors considered, we decided to embark on a re-write of the entire service.
Rendering Images and Videos in Headless Chrome
In 2019, we created an entirely new Video Template service that does for videos what our Image Template service does for images: it is a browser-based video editor that automatically generates endless variations of a video based on product catalog data. In our re-write of the Image Templates, we decided to build on top of our Video Templates codebase because it solved many of the issues that the original Image Templates service had.
The Video Template service uses TypeScript for both the frontend and the backend and, critically, uses the same rendering code for the editor and for server-side rendering. In addition, we use headless Chrome to render the videos, ensuring that the final rendered videos look the same as the previews in our editor.
Using browser technologies like React and CSS makes it easier to implement new features than implementing a rendering system from scratch - we don't need to write low-level graphics code to draw text, shapes, and images on a screen. For example, adding a new capability to the Image Template editor will automatically work in our renderer because both use a browser. The only issues we've had with this approach have been related to Chrome on Linux behaving differently to Chrome on other platforms. However, we have been able to work around these issues thus far.
Once we had validated Chrome-based rendering for videos, it seemed obvious that we should use it for images, too. In simple terms, we could take the video rendering stack and use it to render a single frame of video and voilà, Image Templates! As you'll see, this has proven to be a good approach, but it hasn't been without its challenges.
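To illustrate why sharing one rendering path between the editor and the server is attractive, here is a minimal sketch of a pure function that turns a template and one catalog row into markup. It is a simplified illustration rather than the production code: the `Template`, `TextLayer`, `ImageLayer`, and `renderToHtml` names and the `{{field}}` placeholder syntax are assumptions made for this example. The editor would show this markup as a preview, while on the server headless Chrome would load the same markup and capture a single frame as the image.

```typescript
// Hypothetical template model; the real template format is not described here.
interface TextLayer {
  kind: "text";
  text: string; // may contain {{field}} placeholders
  x: number;
  y: number;
}

interface ImageLayer {
  kind: "image";
  srcField: string; // name of the catalog field holding the image URL
  x: number;
  y: number;
}

interface Template {
  width: number;
  height: number;
  layers: Array<TextLayer | ImageLayer>;
}

type CatalogRow = Record<string, string>;

// Escape text so that catalog values cannot inject markup.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Escape the template text, then substitute escaped catalog values.
// Unknown fields are replaced with an empty string.
function fillPlaceholders(text: string, row: CatalogRow): string {
  return escapeHtml(text).replace(/\{\{(\w+)\}\}/g, (_match, field: string) =>
    escapeHtml(row[field] ?? "")
  );
}

// The same function can run in the editor preview and in the server renderer.
function renderToHtml(template: Template, row: CatalogRow): string {
  const layers = template.layers.map((layer) => {
    const position = `position:absolute;left:${layer.x}px;top:${layer.y}px`;
    if (layer.kind === "text") {
      return `<div style="${position}">${fillPlaceholders(layer.text, row)}</div>`;
    }
    const src = escapeHtml(row[layer.srcField] ?? "");
    return `<img style="${position}" src="${src}">`;
  });
  const frame = `position:relative;width:${template.width}px;height:${template.height}px`;
  return `<div style="${frame}">${layers.join("")}</div>`;
}
```

Because the function has no dependency on where it runs, any layer type added to it becomes available to both the preview and the final render at the same time.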
Artificial Load Tests for Better Optimization
Taking our video rendering system and replacing our Image Templates backend was simple conceptually, but some key differences between the two systems required special attention.
- Scale: Our existing Image Template service renders 50k+ images per second, while our video rendering system renders only a few thousand videos per day.
- Performance: Image Templates are rendered at HTTP-request-time, typically in less than one second. Video Templates are queued, rendered, and then pushed to our platform partners like Facebook, Pinterest, and Snapchat - a process that can take several minutes.
- Reliability: Image Templates need to be highly reliable because the platform partners we work with could stop or throttle the fetching of all images if we have too many failures. With videos, we can retry failed rendering jobs, and the only impact is slower render time.
- Revenue: More than 50% of Smartly.io's revenue depends on our Image Templating system. We can't risk significant downtime with such a vital service.
Replacing an existing production-hardened and highly scalable service with a new system based on Google Chrome and Node.js wasn't straightforward. We were confident that we could make it work, but we knew there would be hard-to-predict problems when the service was running at scale. So we decided to manage the risks in two ways:
- Open early access for a limited number of customers in a controlled Alpha phase to get their feedback and improve the service before making it generally available for all customers.
- Run artificial load tests to see how the service would behave under heavy use.
While Alpha testing is standard practice at Smartly.io, load testing is less common. We started by creating our own test data set to test the service with a realistic load. We took a random sample of existing customer templates and converted them into our new Image Template format. We knew these real templates might not use all the new system's features, but they would still have multiple images and fonts. It would be sufficient to break the system - and break it we did!
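Drawing a random sample of templates can be sketched as follows. This is an illustrative sketch only: the `createSeededRandom` and `sampleWithoutReplacement` helpers are hypothetical names, and seeding the generator is our assumption here, chosen so that the same sample can be re-created when a load test needs to be repeated.

```typescript
// Deterministic pseudo-random generator (Park-Miller LCG) so that a
// load-test sample can be reproduced from its seed.
function createSeededRandom(seed: number): () => number {
  // Keep the state in the range 1..2147483646, as the generator requires.
  let state = (Math.abs(Math.floor(seed)) % 2147483646) + 1;
  return () => {
    state = (state * 48271) % 2147483647;
    return (state - 1) / 2147483646; // value in [0, 1)
  };
}

// Pick `count` distinct items using a partial Fisher-Yates shuffle.
// The input array is left untouched.
function sampleWithoutReplacement<T>(
  items: readonly T[],
  count: number,
  random: () => number
): T[] {
  const pool = items.slice();
  const size = Math.max(0, Math.min(count, pool.length));
  for (let i = 0; i < size; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    const swapped = pool[i];
    pool[i] = pool[j];
    pool[j] = swapped;
  }
  return pool.slice(0, size);
}
```

Calling `sampleWithoutReplacement(templateIds, 1000, createSeededRandom(2021))` with a hypothetical list of template IDs would return the same 1,000 IDs on every run, which makes results from different load-test runs comparable.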
Pushing the system to its limit with dummy data was useful for generating and prioritizing our product backlog. Thanks to the tests, we knew exactly where we needed to optimize next. Sometimes we hit limits with network bandwidth between servers, and other times, we overwhelmed external systems with requests. Sometimes we solved these bottlenecks by adding caches. Other times we re-designed the service to increase its performance. We had a roadmap of catch-up features we knew we'd need based on operating the old Image Templates system, but having a good load test meant we could delay implementation until they would actually improve performance. In addition, delaying some of the work freed up some of our time to react to feedback from the Alpha customers, which allowed us to build an even better product.
Our load testing didn't come without mistakes, and we learned valuable lessons. Most critically, we failed to include broken Image Templates in the load test. When we moved from the Alpha to the Beta phase, the mistakes in real Image Templates produced far more errors than the artificial templates we had used in our testing. Unfortunately, the system we had built to collect, collate, and store errors couldn't cope with the load, and for a brief moment, we were only able to display a fraction of errors to some customers. Luckily, that was still the Beta phase, and our customers were willing to test our systems in exchange for early access and a chance to influence product development.
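In hindsight, a load-test data set can deliberately include a share of broken templates so that the error path is exercised as well. The sketch below is illustrative only: the `LoadTestTemplate` shape, the two ways of breaking a template, and the `injectBrokenTemplates` helper are all assumptions made for this example, not a description of the test tooling we used.

```typescript
// Hypothetical, simplified template shape used only for this sketch.
interface LoadTestTemplate {
  id: string;
  imageUrl: string;
  text: string;
}

// Two illustrative ways a template can be broken: an image that cannot be
// fetched (the .invalid TLD never resolves) and a placeholder that is never closed.
function breakTemplate(template: LoadTestTemplate): LoadTestTemplate {
  return {
    id: template.id + "-broken",
    imageUrl: "https://missing.invalid/image.png",
    text: template.text + " {{unclosed",
  };
}

// Break an evenly spread share of the templates. The fraction is clamped
// to the range 0..1, and the input array is left untouched.
function injectBrokenTemplates(
  templates: readonly LoadTestTemplate[],
  brokenFraction: number
): LoadTestTemplate[] {
  const fraction = Math.min(1, Math.max(0, brokenFraction));
  return templates.map((template, index) => {
    // A template is broken each time the running quota crosses a whole number.
    const shouldBreak =
      Math.floor((index + 1) * fraction) > Math.floor(index * fraction);
    return shouldBreak ? breakTemplate(template) : template;
  });
}
```

Spreading the broken templates evenly, rather than grouping them at the end, keeps the error rate roughly constant for the whole duration of a test run.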
Kubernetes the Enabler
At Smartly.io, almost all services run on Kubernetes. It has proven to be a robust and powerful platform for deploying and operating our 50+ services across hundreds of servers. One exception has been our old Image Template system, as it predates our use of Kubernetes and uses considerably more servers than the rest of our services combined. The scale and volume of traffic were something we had not tried to handle with Kubernetes before. When we chose Kubernetes for the new Image Template service, we knew it was an ambitious decision that would need extra engineering work to achieve the required scale.
There are many advantages of Kubernetes, but for us specifically, its value comes from:
- Configuration as code: developers can make large architectural changes quickly
- Integration with CI: deploy our stack automatically multiple times per day
- Self-healing: service recovers from hardware, network and application failures
- Mobility: deploy to clusters in different places/hosting providers/clouds
For example, we began with our rendering, caching and metadata services distributed across the cluster but later realized that networking bottlenecks limited scalability. So, we decided to switch to running our entire stack on each node and scale horizontally. We were able to make this big architectural change incrementally in production just by adjusting the Kubernetes manifests.
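To make the idea of "the entire stack on each node" more concrete, here is a rough sketch of a manifest that places the rendering, caching, and metadata containers together in one pod on every node. It rests on several assumptions: a DaemonSet is only one possible way to get one pod per node and may not be the workload type used in our clusters, and the `buildPerNodeStack` helper, the container names, images, and ports are all hypothetical. The manifest is built as a TypeScript object and serialized to JSON, which `kubectl apply -f` accepts alongside YAML.

```typescript
// Minimal subset of the Kubernetes apps/v1 DaemonSet schema used by this sketch.
interface ContainerSpec {
  name: string;
  image: string;
  ports: { containerPort: number }[];
}

interface DaemonSetManifest {
  apiVersion: "apps/v1";
  kind: "DaemonSet";
  metadata: { name: string };
  spec: {
    selector: { matchLabels: Record<string, string> };
    template: {
      metadata: { labels: Record<string, string> };
      spec: { containers: ContainerSpec[] };
    };
  };
}

// Build a DaemonSet that runs all given containers in one pod per node.
function buildPerNodeStack(
  name: string,
  containers: ContainerSpec[]
): DaemonSetManifest {
  // The selector must match the pod template labels.
  const labels = { app: name };
  return {
    apiVersion: "apps/v1",
    kind: "DaemonSet",
    metadata: { name },
    spec: {
      selector: { matchLabels: labels },
      template: {
        metadata: { labels },
        spec: { containers },
      },
    },
  };
}

// Hypothetical container names, images, and ports, for illustration only.
const manifest = buildPerNodeStack("image-templates", [
  { name: "renderer", image: "registry.example/renderer:1.0.0", ports: [{ containerPort: 8080 }] },
  { name: "cache", image: "registry.example/cache:1.0.0", ports: [{ containerPort: 8081 }] },
  { name: "metadata", image: "registry.example/metadata:1.0.0", ports: [{ containerPort: 8082 }] },
]);

// JSON output that could be written to a file and applied with kubectl.
const manifestJson = JSON.stringify(manifest, null, 2);
```

With a layout like this, the containers in a pod talk to each other over localhost, so traffic between rendering, caching, and metadata never crosses the network between servers, and capacity grows by adding nodes.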
We knew from the start our Image Template service would put extreme demands on Kubernetes due to the scale and ongoing growth of our business. We also knew that our existing self-hosted bare metal Kubernetes clusters would need extra work to scale and tune them for this application.
We would, in effect, experiment in production with different service architectures, ingress controllers, and tuning parameters. For this reason, we decided to use dedicated Kubernetes clusters, at least until we better understood our service design and requirements. We also decided to use at least two dedicated clusters for operational and implementation flexibility. Having two clusters means we can experiment with new Kubernetes releases, new Kubernetes features like topology-aware routing, and different ingress controllers.
End of the Tunnel?
We're still scaling up our new Image Template platform and slowly migrating customers to the new system. At the same time, usage of the old system continues to grow! Busy times ahead!
We've got a packed roadmap of improvements planned to handle the explosive growth from new and existing customers. A few examples:
- We've built a very cost-efficient platform based on bare metal servers. However, we currently provision additional hardware to handle spikes in load and a large amount of seasonal variation. To handle this better, we plan to experiment with autoscaling managed Kubernetes clusters in a public cloud.
- We've also been experimenting with caching resized images to speed up rendering. Initial results are promising, though we've yet to enable it globally. There are several opportunities for further optimizations we're excited to test out.
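As a rough illustration of the resized-image caching idea, the sketch below keeps resized images in a small in-memory LRU cache keyed by source URL and target size. It is only a sketch under stated assumptions: the `ResizedImageLru` class and the injected `ResizeFn` are hypothetical, and the post does not say where or how the real cache stores its entries.

```typescript
// Hypothetical resize function; the real resizing code is not shown here.
type ResizeFn = (
  sourceUrl: string,
  width: number,
  height: number
) => Promise<Uint8Array>;

// Small LRU cache for resized images. A Map iterates in insertion order,
// so its first key is always the least recently used entry.
class ResizedImageLru {
  private readonly entries = new Map<string, Promise<Uint8Array>>();
  private readonly resize: ResizeFn;
  private readonly maxEntries: number;

  constructor(resize: ResizeFn, maxEntries: number) {
    this.resize = resize;
    this.maxEntries = maxEntries;
  }

  get(sourceUrl: string, width: number, height: number): Promise<Uint8Array> {
    const key = `${width}x${height}|${sourceUrl}`;
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      // Re-insert the entry to mark it as most recently used.
      this.entries.delete(key);
      this.entries.set(key, cached);
      return cached;
    }
    // Store the promise so concurrent requests share a single resize.
    const pending = this.resize(sourceUrl, width, height);
    this.entries.set(key, pending);
    // Drop failed resizes so that a later request can try again.
    pending.catch(() => {
      if (this.entries.get(key) === pending) {
        this.entries.delete(key);
      }
    });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
      }
    }
    return pending;
  }
}
```

Caching the promise rather than the finished bytes means that many simultaneous requests for the same product image at the same size trigger only one resize.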
Now we need to get back to work to prepare for our busiest season of the year. Wish us luck!
Today, we have more engineers working on our Image and Video Templates than we had in the entire Product Development team when we first introduced the Image Templates solution in 2015. Earlier parts of the story of how we have scaled the service are in these blog posts:
We are also hiring new engineers to build even more exceptional services. If you are interested, we'd love to hear from you!