Decoupling releases from deployment
Feature management system requirements, Part 2
Motivation & Example
If you are a single person working on a project, you might not have this requirement. But if there are multiple people, especially multiple teams, working on the same project, you might run into deployment synchronization issues.
Imagine four teams working on a project. There is a mobile team, a web team, a backend team, and a data science team. Or, if cross-functional teams are what you do, four cross-functional teams each take care of a subset of features or areas of a product. At some point, the teams will need to decide whether they are OK to deploy their code to production or not.
If they are completely isolated, meaning they edit only their own codebases and are never affected by the changes of the other teams, there are no deployment synchronization issues. But that is rarely the case.
The four teams want to release their code changes at the end of each sprint, with each sprint being two weeks long. The probability that a single team is able to deploy all its code changes to production at the end of the sprint might be, let’s say, about 60%.
That means that each team is OK to deploy the changed code to production in 6 out of 10 sprints.
To calculate the probability that all four teams are ready at the same time, we multiply the per-team probabilities, assuming each team's readiness is independent of the others. That calculation is 60% × 60% × 60% × 60%, which equals roughly 13% (12.96%). That gives us a small chance of a successful deployment, aka release.
Have a look at this table that shows the probability of successful production deployment, when the probability of each team being OK to deploy the code to production is 90%, 60%, or 40% per team.
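The multiplication described above can be reproduced with a few lines of Python. This is a minimal sketch that assumes every team has the same per-sprint probability and that the teams are ready or not ready independently of each other; the function name `joint_readiness` is made up for illustration.

```python
def joint_readiness(per_team: float, teams: int) -> float:
    """Probability that all teams are ready in the same sprint.

    Assumes every team has the same probability and that the
    teams are independent, so the probabilities simply multiply.
    """
    return per_team ** teams


if __name__ == "__main__":
    # Per-team probabilities discussed in the text: 90%, 60%, 40%.
    for p in (0.9, 0.6, 0.4):
        cells = ", ".join(
            f"{n} team(s): {joint_readiness(p, n):.1%}" for n in range(1, 5)
        )
        print(f"per-team {p:.0%} -> {cells}")
```

Even with a fairly optimistic 90% per team, four teams are all ready in only about two out of three sprints; at 40% per team, the joint probability drops below 3%.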
It is very improbable that there will be a moment when all the teams are ready to deploy their changes to production on the same day. If a single team is not ready, all the other teams have to wait for it. That causes delays, time pressure, and stress. Once the teams are under time pressure, they start making mistakes, both in code and in judgment. The probability of deploying code with errors into production increases.
Consequences of unhealthy time pressure and stress
Putting the teams under pressure causes spikes in the number of bugs. Forcing all the teams to release their features on the same day, by deploying the code changes on the same day, leads to a more unstable, lower-quality product.
The chart above suggests a solution. Do not deploy the code changes made by all the teams at the same time. Let each team deploy when it feels its changes are ready to be released.
Deploying changes at the right time
The way to release whenever a team feels ready is to use feature flags. We can wrap new features and breaking changes in a feature flag, deploy the code at any time, and release the feature only when we feel we are ready.
Separating the rollout (or release) of a feature from the actual deployment of code makes a few things possible:
- nobody needs to wait for anyone and no team is blocked by other teams, which makes it possible to deploy code at any time and feel safe about it
- code that has changed but is not ready for end users can go all the way to production; it stays hidden, so there is no harm in deploying it
- we can release a feature only to a few users, like the product team, beta users, or another set of users defined by any kind of criteria
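The three points above can be sketched as a tiny in-process feature flag. This is an illustrative sketch only, not the API of any particular feature management product: the `FeatureFlag` class, its fields, the group names, and the `checkout_page` function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """Hypothetical flag: deployed code stays hidden until the flag is enabled."""

    name: str
    enabled: bool = False  # False: code is in production but nobody sees it
    allowed_groups: set = field(default_factory=set)  # empty: no targeting

    def is_on(self, user_groups: set) -> bool:
        if not self.enabled:
            return False
        if not self.allowed_groups:
            # Enabled with no targeting: released to everyone.
            return True
        # Enabled with targeting: only users in an allowed group see it.
        return bool(self.allowed_groups & user_groups)


def checkout_page(flag: FeatureFlag, user_groups: set) -> str:
    # The new code path is deployed, but guarded by the flag.
    if flag.is_on(user_groups):
        return "new checkout"
    return "old checkout"


if __name__ == "__main__":
    flag = FeatureFlag("new-checkout")  # deployed, not released
    print(checkout_page(flag, {"beta"}))

    flag.enabled = True
    flag.allowed_groups = {"product-team", "beta"}  # partial release
    print(checkout_page(flag, {"beta"}))
    print(checkout_page(flag, {"regular"}))
```

Deploying the guarded code and flipping `enabled` are two separate steps, which is the whole point: the first can happen whenever a team is ready, the second whenever the feature is.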
Will it make the code too complex?
The first piece of advice if you have this worry, and I had it too, is: “Go and try it out first, it is not as crazy as it sounds. There might be some worries and questions for a few weeks or months, but everyone will find it is not that difficult and once they get used to it, they will never want to go back.”
Also, there are two types of feature flags: permanent and temporary. Every feature flag that is not supposed to exist permanently should be marked as temporary. Then, we need to periodically review all the temporary feature flags and clean them up once they are no longer needed. We know a feature flag is no longer needed when it has returned the same value for a long period.
The following image shows that a temporary feature flag was evaluated and it returned a false value for the past 2 months. This feature flag is a great candidate for removal.
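The review rule described above (a temporary flag that keeps returning the same value is a removal candidate) can be sketched as a small check over an evaluation log. This is a hedged sketch: the log format (a list of `(timestamp, value)` tuples), the function name `is_removal_candidate`, and the 60-day default window are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def is_removal_candidate(evaluations, now, window=timedelta(days=60)):
    """Return True if the flag returned a single value for the whole window.

    evaluations: list of (timestamp, value) tuples (assumed log format).
    A flag with less history than the window is not flagged, because
    we cannot yet tell whether its value is stable.
    """
    if not evaluations:
        return False
    oldest = min(ts for ts, _ in evaluations)
    if now - oldest < window:
        return False  # not enough history to judge
    recent_values = {value for ts, value in evaluations if now - ts <= window}
    # Exactly one distinct value inside the window means the flag is constant.
    return len(recent_values) == 1


if __name__ == "__main__":
    now = datetime(2024, 3, 1)
    # One evaluation every 10 days over the last 80 days, always False.
    log = [(now - timedelta(days=d), False) for d in range(0, 90, 10)]
    print(is_removal_candidate(log, now))
```

A flag that was not evaluated at all inside the window is deliberately not reported here; that case usually means dead code and deserves its own review.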