I still see a lot of fear in the development and operations communities about rolling back failed software deploys.
People talk about the difficulty of rolling back, the need to always be rolling forward in the case of error, and the difficulty in testing rollback procedures.
Having written software and been responsible for deploying it to very high availability environments, I haven’t experienced this world in the slightest.
In my experience:
- It’s been easier to roll back to a known state than to roll forward;
- It’s been riskier to roll forward if a deploy is going south than it was to roll backwards;
- It’s much easier to test and gain confidence in a rollback than it is in the roll forward.
I think a lot of the misconceptions people have come from the fact that the average development team simply does not give rollback enough importance and focus in its development and release management process.
This is a shame, as rolling back is potentially your best friend when it comes to improving the robustness of systems and keeping customers happy.
With a good rollback you can simply hit the red button at the first sign of trouble and have the system back up and running whilst you look into the situation. It may be that you’ve overreacted in pulling the release, but that’s generally better than breaking something.
With that said, here are a few tips that I’ve found to work well:
How To Do Rollback Well
The most important step is to implement an architecture that supports the need to roll back. For instance, componentised, service-based architectures lend themselves well to this. Persistent message queues and asynchronous services allow you to bring components down for rollback without impacting the main user base. Work towards a design in which your application can stay available whilst you are working on one half of the system.
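As a rough illustration of why a persistent queue helps here, the sketch below uses a SQLite file as a stand-in for a real message broker. The `DurableQueue` class, the file name and the message names are all hypothetical and only there to show the idea: the front end keeps accepting work while the worker is down for a rollback, and nothing is lost when the worker comes back.

```python
import os
import sqlite3
import tempfile


class DurableQueue:
    """A tiny disk-backed FIFO queue (hypothetical stand-in for a real broker)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.conn.commit()

    def put(self, body):
        # Committed to disk, so the message survives a consumer restart.
        self.conn.execute("INSERT INTO queue (body) VALUES (?)", (body,))
        self.conn.commit()

    def get(self):
        # Return the oldest message, or None if the queue is empty.
        row = self.conn.execute(
            "SELECT id, body FROM queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        self.conn.commit()
        return row[1]


path = os.path.join(tempfile.mkdtemp(), "queue.db")

# The worker is down for a rollback, but the front end keeps taking orders.
front_end = DurableQueue(path)
front_end.put("order-1")
front_end.put("order-2")

# The rolled-back worker starts up again and drains the backlog in order.
worker = DurableQueue(path)
drained = []
message = worker.get()
while message is not None:
    drained.append(message)
    message = worker.get()

print(drained)
```

In a real system the queue would be a proper broker rather than a SQLite file, but the property that matters is the same: producers and consumers do not have to be up at the same time.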
Test rollbacks as thoroughly as your roll outs via the QA process. Throughout development and testing, attempt to get as much, or even more, confidence in your rollback as you have in your release. Foster a mindset that change is good, but the route back to the working system is much more important. People generally prefer working software over new features.
Design the rollbacks and roll forwards so that both are idempotent. Ensure you can roll back a bad deploy and then roll it forward again when the time is right. Neither step should be destructive, since we should accept that a rollback is quite likely. Have your QA teams explicitly test roll out, roll back, roll out to gain further confidence and experience in the process. Any observations or problems should be fed back, and the change should not be signed off until the rollback is a high quality one.
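One way to picture an idempotent, non-destructive roll out and roll back is a deploy that only moves a "current" marker between releases that are already installed side by side. This is a minimal sketch under that assumption; the directory layout and the `install`, `activate` and `current` helpers are hypothetical, not a description of any particular deploy tool.

```python
import os
import tempfile


def install(root, version):
    """Place a release on disk next to the others; safe to repeat."""
    os.makedirs(os.path.join(root, "releases", version), exist_ok=True)


def activate(root, version):
    """Point the 'current' marker at an installed release; safe to repeat."""
    release_dir = os.path.join(root, "releases", version)
    if not os.path.isdir(release_dir):
        raise ValueError(f"release {version} is not installed")
    marker = os.path.join(root, "current")
    tmp = marker + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(version)
    # Swap the marker in one step; no release directory is modified or removed.
    os.replace(tmp, marker)


def current(root):
    """Return the name of the release that is currently active."""
    with open(os.path.join(root, "current")) as fh:
        return fh.read()


root = tempfile.mkdtemp()
install(root, "v1")
install(root, "v2")

activate(root, "v1")  # known good state
activate(root, "v2")  # roll out
activate(root, "v1")  # roll back
activate(root, "v2")  # roll out again
activate(root, "v2")  # repeating a step changes nothing

print(current(root))  # prints "v2"
```

Because rolling back is just `activate` with the previous version, the roll out, roll back, roll out cycle exercises the same code path every time, which makes it cheap for QA to run repeatedly.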
Document the rollback procedures. It’s likely there will be a degree of stress involved if we need to roll back the production system, so take the time before the release to write up how to run the rollback process, what to check, and what to do in potential failure situations. For example: if the database deploy fails, run script x.sql and check conditions a, b, c. If condition b has occurred, execute y.sql. Ten minutes spent anticipating failure modes before the change can save a lot of time during a crisis.
Take small, gradual steps in your releases. Release intermediate steps behind feature toggles so that we can slowly but surely gain confidence in the feature we are releasing whilst keeping any rollback as small as possible. Try to upgrade components individually rather than in parallel.
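A feature toggle can be as simple as a flag checked at the point where the old and new code paths diverge. The sketch below is illustrative only: the `new_checkout` flag, the pricing rule and the function names are made up for the example. The point is that backing out the new behaviour is a flag flip rather than a redeploy.

```python
def legacy_total(prices_in_cents):
    """Existing behaviour: a plain sum."""
    return sum(prices_in_cents)


def new_total(prices_in_cents):
    """New behaviour being released: 10% off orders over 100.00 (hypothetical rule)."""
    total = sum(prices_in_cents)
    if total > 10000:
        return total * 90 // 100
    return total


def checkout_total(prices_in_cents, flags):
    """Route to the new code path only when the toggle is switched on."""
    if flags.get("new_checkout", False):
        return new_total(prices_in_cents)
    return legacy_total(prices_in_cents)


flags = {"new_checkout": False}
basket = [7000, 5000]

total_before = checkout_total(basket, flags)   # legacy path

flags["new_checkout"] = True                   # release the feature
total_enabled = checkout_total(basket, flags)  # new path

flags["new_checkout"] = False                  # "roll back" by flipping the flag
total_after = checkout_total(basket, flags)    # legacy path again

print(total_before, total_enabled, total_after)  # prints "12000 10800 12000"
```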
Have documented steps in place to assert that the rollback process has put the system back into its original good state: something simple like checking the net file size, running a diff, or checking the number of database rows. You’ll then want to back this up with as many automated and manual sanity tests as possible to ensure the rollback was correct.
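The checks described above can be automated by capturing a snapshot of the system before the release and comparing it with one taken after the rollback. The following is a minimal sketch, assuming the state that matters is a directory of files plus row counts in a few database tables. The helper names, the `app.cfg` file and the `orders` table are hypothetical, and SQLite is used only so the example is self-contained.

```python
import hashlib
import os
import sqlite3
import tempfile


def capture_state(release_dir, conn, tables):
    """Record a SHA-256 digest per file and a row count per table."""
    state = {}
    for dirpath, _dirs, files in os.walk(release_dir):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            state["file:" + os.path.relpath(path, release_dir)] = digest
    for table in tables:
        # Table names come from our own trusted list, not from user input.
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        state["rows:" + table] = count
    return state


def diff_states(before, after):
    """Return the sorted keys that were added, removed or changed."""
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))


def write_config(release_dir, text):
    with open(os.path.join(release_dir, "app.cfg"), "w") as fh:
        fh.write(text)


release_dir = tempfile.mkdtemp()
write_config(release_dir, "timeout=30\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,)])

before = capture_state(release_dir, conn, ["orders"])

# Simulated deploy: a config change and a new row.
write_config(release_dir, "timeout=60\n")
conn.execute("INSERT INTO orders (id) VALUES (3)")
after_deploy = capture_state(release_dir, conn, ["orders"])

# Simulated rollback: restore the config and remove the row.
write_config(release_dir, "timeout=30\n")
conn.execute("DELETE FROM orders WHERE id = 3")
after_rollback = capture_state(release_dir, conn, ["orders"])

print(diff_states(before, after_deploy))    # prints "['file:app.cfg', 'rows:orders']"
print(diff_states(before, after_rollback))  # prints "[]"
```

An empty diff only shows that the things you chose to measure are back where they started, which is why it should sit alongside the sanity tests rather than replace them.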
You’ll notice that none of the above comes for free. Each of them takes time and effort, but in my experience this has always been worth it when it comes to moving the applications I’m responsible for forward slowly but surely and with confidence.