As an innovative SaaS tech company, we want to continuously build and improve our platform at a rapid pace, enabling our developers to run fast and be both creative and productive.
On the one hand, we want our users to enjoy and benefit from the improvements without delay. On the other hand, we want to make sure that the changes we ship don’t harm the stability of our system. These two goals are inherently in tension. How can you move fast without breaking things?
In my previous article, I described the term “Blast Radius” (a measure of the risk of a system failure and the extent of its impact) and how you should focus on designing your architecture to limit it.
We apply the same concept when we introduce new code, whether it’s a new feature or architecture/backend changes that are behind the scenes from the user’s perspective.
There is always a risk when deploying a change, no matter how thoroughly you test it, as it is practically impossible to test every edge case and reproduce the exact usage patterns of production. For this reason, we have created an automated process that controls gradual rollouts at the granularity of an individual feature or code change.
The high-level flow goes like this:
1. Assess the risk of the change based on well-defined criteria (Low/Medium/High). Medium- and High-risk changes must go through a gradual rollout; any exception must be approved by a dedicated forum. For Low-risk changes, a gradual rollout is preferred but depends on the coding effort required to do so (steps 2 and 3).
2. Using a feature flag management tool, create “if-else” flows in the relevant places in the code, separating the new code from the old code depending on whether the feature/code is activated for the user/client (this may mean code duplication).
3. Test the code with the flag both on and off.
4. Deploy to production with the new code disabled for everyone.
5. The Automated Gradual Rollout manager starts exposing the new code gradually, in stages, using dedicated strategies.
6. Track usage and monitor for failures, support cases, or unusual behavior. If an issue is detected, stop and turn the flag off (roll back). Otherwise, if enough usage has been tracked, increase exposure. If not, wait longer and repeat.
7. After two weeks of full rollout with no issues detected, create a tech debt task to delete the “else” branch (the old code) from the system and clean up the “dead code”.
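The exposure and monitoring steps above can be sketched as a small decision function. This is only an illustration of the logic described in the list, not our actual rollout manager: the stage percentages, the `min_usage` and `max_error_rate` thresholds, and all names (`StageMetrics`, `decide`, `next_exposure`) are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ROLL_BACK = "roll_back"
    ADVANCE = "advance"
    WAIT = "wait"


# Hypothetical exposure stages, as a percentage of users.
STAGES = [0, 1, 5, 25, 50, 100]


@dataclass
class StageMetrics:
    """Signals collected while the flag sits at one exposure stage."""
    usage_count: int    # times the new code path ran
    error_count: int    # failures attributed to the new code path
    support_cases: int  # support cases linked to the change


def decide(metrics: StageMetrics, min_usage: int = 1000,
           max_error_rate: float = 0.01) -> Action:
    """Roll back on any issue, advance after enough clean usage, else wait."""
    if metrics.support_cases > 0:
        return Action.ROLL_BACK
    if (metrics.usage_count > 0
            and metrics.error_count / metrics.usage_count > max_error_rate):
        return Action.ROLL_BACK
    if metrics.usage_count >= min_usage:
        return Action.ADVANCE
    return Action.WAIT  # not enough usage yet: keep the current exposure


def next_exposure(current: int, action: Action) -> int:
    """Return the exposure percentage to apply after a decision.

    `current` must be one of the values in STAGES.
    """
    if action is Action.ROLL_BACK:
        return 0  # flag off for everyone
    if action is Action.ADVANCE:
        idx = STAGES.index(current)
        return STAGES[min(idx + 1, len(STAGES) - 1)]
    return current
```

A scheduler would call `decide` periodically for each active flag and apply `next_exposure` to it. Keeping the decision a pure function of the collected metrics makes the policy easy to unit test separately from the monitoring integration.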
This process resembles Canary Deployment, where one has two versions of a specific service live and the traffic is gradually rerouted to the newer version while being monitored for errors.
However, in this case the granularity is at the feature level, and a single deployment may contain more than one change. Also, we can manually turn on a change for selected users who want early access, before the automated gradual rollout starts or reaches them.
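To make the feature-level granularity concrete, here is a minimal sketch of a per-user flag check that combines an early-access list with percentage-based exposure. It assumes a hash-based bucketing scheme, which is a common technique but not necessarily what our feature flag tooling does; the flag name, user IDs, and function names are hypothetical.

```python
import hashlib
from typing import AbstractSet


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, exposure_percent: int,
               early_access: AbstractSet[str] = frozenset()) -> bool:
    """Decide whether this user gets the new code path for this flag."""
    if user_id in early_access:
        return True  # manually enabled, regardless of the rollout stage
    # A user's bucket never changes, so raising the exposure only adds users.
    return bucket(flag, user_id) < exposure_percent


def export_report(user_id: str, exposure_percent: int) -> str:
    """The "if-else" flow: new code behind the flag, old code otherwise."""
    early_access = frozenset({"user-42"})  # hypothetical early-access user
    if is_enabled("new-report-exporter", user_id, exposure_percent, early_access):
        return "new exporter"
    return "old exporter"
```

Including the flag name in the hash means different flags expose different subsets of users, so the same people are not always first in line. In this sketch early access wins even at 0% exposure; a real system would probably want a separate kill switch that overrides it during a rollback.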
As always, we began with an MVP – so we didn’t fully automate the entire process from day one. We started by manually managing steps 5 and 6, learning from and improving the process before automating it.
Once the process was automated and its overhead dropped, we extended its internal use, and it has become our de facto strategy for any code change in our platform.
From the user’s perspective, one of the main advantages of a continuously updated SaaS product is the pace at which they get improvements without relying on their organization to install them. On the SaaS provider side, a lot of focus and effort should go into making sure that the provider can deliver on this promise while maintaining quality.
Applying this process allows us to quickly add code changes to the code base, improving development productivity while minimizing risk. If a new error appears as a result of a code change, we can detect it early, at a relatively small scale, and avoid a large Blast Radius that would otherwise affect many users.