How To Do Rollback Well

I still see a lot of fear in the development and operations communities around rolling back failed software deploys.

People talk about the difficulty of rolling back, the need to always be rolling forward in the case of error, and the difficulty in testing rollback procedures.

Having written software and been responsible for deploying it in very high-availability environments, I haven’t experienced that world in the slightest.

In my experience:

  • It’s been easier to roll back to a known state than to roll forward;
  • It’s been riskier to roll forward if a deploy is going south than it was to roll backwards;
  • It’s much easier to test and gain confidence in a rollback than it is in the roll forward.

I think a lot of the misconceptions come from the fact that the average development team simply does not give rollback enough importance and focus in its development and release management process.

This is a shame, as rolling back is potentially your best friend when it comes to improving the robustness of your systems and keeping customers happy.

With a good rollback you can simply hit the red button at the first sign of trouble and have the system back up and running whilst you look into the situation. It may be that you’ve overreacted in pulling the release, but that’s generally better than breaking something.

With that said, here are a few tips that I’ve found to work well:

How To Do Rollback Well

The most important step is to implement an architecture that supports the need to roll back. For instance, componentised, service-based architectures lend themselves well to this: persistent message queues and asynchronous services allow you to bring components down for rollback without impacting the main user base. Work towards an architecture in which your application can stay available whilst you are working on one half of the system.
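
As a minimal sketch of that decoupling (the queue and handler names here are hypothetical, and a real system would use a durable broker such as RabbitMQ or SQS rather than an in-memory queue):

```python
import queue

# Stand-in for a persistent broker -- in-memory here purely to illustrate
# the decoupling; a real deployment would use a durable queue so messages
# survive restarts.
orders = queue.Queue()

def accept_order(order_id):
    # The user-facing side stays available: it only enqueues work.
    orders.put(order_id)

def drain(process):
    # The back-end worker can be stopped, rolled back, and restarted;
    # messages simply accumulate while it is down.
    processed = []
    while not orders.empty():
        processed.append(process(orders.get()))
    return processed

# The front end keeps taking orders while the worker is down for rollback...
for i in range(3):
    accept_order(i)

# ...then the worker comes back on the previous version and catches up.
result = drain(lambda o: f"order-{o} handled")
```

The point is only the shape: the component being rolled back is never in the user’s request path.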

Test rollbacks as thoroughly as your rollouts via the QA process. Throughout development and testing, attempt to get as much, or even more, confidence in your rollback as you have in your release. Foster a mindset that change is good, but the route back to the working system is much more important. People generally prefer working software over new features.

Design the rollbacks and roll forwards to both work idempotently. Ensure you can roll back a bad deploy and then roll it forward again when the time is right. Neither step should be destructive, as we should accept that rollback has a high probability of being needed. Have your QA teams explicitly test roll out, roll back, roll out again to further build confidence and experience in the process. Any observations, problems, etc. should be fed back, and the change should not be signed off until the rollback is a high-quality one.
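
A sketch of what idempotent, non-destructive steps mean in practice (the schema and column names are invented for illustration, with the schema modelled as a simple set of column names):

```python
# Each step checks the current state first, so running it twice is safe
# and the roll-out/roll-back cycle can be repeated at will.
schema = {"id", "name"}

def roll_forward(columns=schema):
    if "postcode" not in columns:   # only act if not already applied
        columns.add("postcode")
    return columns

def roll_back(columns=schema):
    if "postcode" in columns:       # only act if there is something to undo
        columns.discard("postcode")
    return columns

# Roll out, roll back, roll out again -- the cycle QA should exercise.
roll_forward(); roll_back(); roll_forward()
```

Because each function inspects the state before acting, running either step a second time is a no-op rather than an error.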

Document the rollback procedures. It’s likely there will be a degree of stress involved if we need to roll back the production system, so take the time before the release to write up how to run the rollback process, what to check, and what to do in potential failure situations. e.g. if the database deploy fails, run script x.sql and check conditions a, b, c; if condition b has occurred, execute y.sql. Ten minutes anticipating failure modes before the change can save a lot of time during a crisis.

Take small, gradual steps in your releases. Release intermediate steps behind feature toggles so that we can slowly but surely gain confidence in the feature we are releasing whilst keeping each rollback as small as possible. Try to upgrade components individually rather than in parallel.
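
As a hedged sketch of the toggle idea (the toggle and feature names are invented), rolling a feature out to a percentage of users gives you a rollback that is just a config change:

```python
import hashlib

# Each toggle maps to the percentage of users who should see the feature.
# Users are bucketed deterministically by hashing the user id, so a given
# user gets a stable experience as the percentage ramps up.
TOGGLES = {"new_signup_form": 10}

def is_enabled(feature, user_id):
    percent = TOGGLES.get(feature, 0)
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# "Rolling back" the feature is just setting the percentage to zero --
# no deploy required.
enabled_users = [u for u in range(1000) if is_enabled("new_signup_form", u)]
```

Ramping the percentage up (10, 50, 100) is the small-gradual-steps idea in code form.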

Have documented steps in place to assert that the rollback process has put the system back into its original good state – something simple like checking the net file size, running a diff, or checking the number of database rows. You’ll then want to back this up with as many automated and manual sanity tests as possible to ensure the rollback was correct.
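
A small sketch of that kind of assertion (the file names and row counts are invented): record fingerprints of the known-good state before the release, then diff them after rolling back:

```python
import hashlib
import pathlib
import tempfile

def fingerprint(path):
    # A simple content hash stands in for "net file size / diff" checks.
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def verify_rollback(before, after):
    # Returns the names of any checks that no longer match the
    # pre-release state.
    return [name for name in before if before[name] != after.get(name)]

with tempfile.TemporaryDirectory() as d:
    config = pathlib.Path(d) / "app.conf"
    config.write_text("version=1\n")
    known_good = {"app.conf": fingerprint(config), "user_rows": 42}

    # ... deploy happens, goes wrong, we roll back ...
    config.write_text("version=1\n")  # rollback restored the file
    current = {"app.conf": fingerprint(config), "user_rows": 42}
    failures = verify_rollback(known_good, current)
```

Crude checks like these catch the common “the rollback didn’t actually restore X” cases before the manual sanity tests even start.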

You’ll notice that none of the above comes for free. Each takes time and effort, but in my experience it has always been worth it, letting me move the applications I’m responsible for forward slowly, surely, and with confidence.

Posted in Software | Comments Off

Configuring Hadoop On EC2

I am about to go live with my first production Hadoop job for a client as a proof of concept.

I found that a lot of the documentation out there is quite text dense, unnecessarily detailed, or out of date, which is frustrating when you’re just trying to get your first cluster up and your first MapReduce job submitted.

For that reason, I’ve decided to write up the guide I wish I’d had when first getting up and running with Hadoop – a simple step-by-step guide to getting Hadoop up and running on my preferred host, Amazon EC2.

Please click here for the PowerPoint presentation. (I tried to use SlideShare but it’s dog slow for some reason!) I hope it’s helpful to someone.

Shameless plug – if you’re interested in Hadoop you may like my NOSQL Weekly Mailing list – ….

Launching DevOps Friday!

There’s so much innovation going on in the fast-moving DevOps world that I’ve decided to start a curated email newsletter summarising the best content from the week’s blogs, articles, and tweets.

The newsletter will be brief, non-commercial, and curated for quality. I’m aiming to send weekly every Friday.

You can see the first issue here – or sign up at .

The Key Factor For Viral Products

I love the part in the Facebook movie where the product hits exponential growth a few hours after it was released:

“Harvard at that time did not have a student directory with photos and basic information, and the initial site generated 450 visitors and 22,000 photo-views in its first four hours online.”

All of the features, designs, and A/B testing in the world wouldn’t have achieved that level of adoption on their own.

To me, this growth shows that Facebook clearly met some quite fundamental human needs that had been unfulfilled on the web until that time – perhaps the need for self-promotion, and some reinforcing of social status and groups by sharing a laugh at some of the early photos with your friends.

As features were added that gave people the ability to express themselves, form friendships and groups, and compare themselves with their peers, people obviously continued to adopt the product extremely quickly all across the world and across cultures. Again, this was for quite human rather than technical reasons.

If you’re looking for your product to spread or go viral, I believe that you need to appeal to people on a similarly primal level. Think about how your product can tap into people’s low-level needs, desires, fears, and wants. If you make users genuinely feel better or more successful after using the product, then they’re much more likely to return and spread the word.

GOTO Copenhagen – Day 2 Review

I spent most of day 2 of the conference on the NoSQL and big data track. It was a close call for me over the mobile development track, but big data just pips mobile to the post in relation to the type of work that I do. I wish I could have cloned myself today :-)

Jumping into summaries of today’s talks:

Stefan Edlich – Choose the right database

Stefan is obviously very knowledgeable about the broad and diverse NoSQL product world. His presentation covered how he walks clients through the process of selecting a NoSQL solution. In short, this involves looking at the characteristics of the data and using that to drill down into a category of products. For instance, for high-performance databases think (a, b, c), or for distributed processing think (d, e, f). He also identified a whole suite of evaluation criteria you could use against each product, though noted that a lot of subjectivity was involved here.

This all led on to a really interesting discussion about how we ‘convince the boss’ to adopt new persistence models. I could tell that Stefan and a few of the attendees felt that companies were too risk-averse in applying NoSQL, but I can appreciate why it’s a hard sell – relational databases are so successful – proven, well understood, general purpose – that polyglot persistence is a difficult case to make.

The second presentation dug into something called NewSQL, which is a term I have heard before but didn’t really know much about. Essentially, this is the fightback of relational databases, whereby they are using techniques such as better clustering and direct storage engine access to dramatically reduce latency and improve scalability. Apparently the performance metrics are very comparable now, potentially giving us a ‘best of both worlds’ solution.

Sergey Bushik – Evaluating NoSQL performance

This was an interesting topic which was about a suite of benchmarking exercises they had developed to simulate loads on various NoSQL products.

The one thing I took away from this was how difficult it is to compare NoSQL products! The differences between the products almost always end up compromising your tests, and by trying to make a consistent test, you end up moving away from real-world usage scenarios.

Chris Anderson – Couchbase

This was a fairly generic introduction to Couchbase, but one which sold the product really well in my opinion.

What I liked about the product was the visibility it gave you in terms of how it was operating – how many objects were stored in the write behind cache, how synchronisation was progressing after introducing a new node. We’ve had problems in the past with caching layers such as Coherence operating as black boxes with limited opportunity to capture metrics and certainly no out of the box GUI.

It was also nice to have a slightly more technical talk, touching on synchronous and asynchronous replication, commit models etc.

I had a play with Couchbase after the talk and was up and running and writing data into it from my Rails app within about 30 minutes which is impressive stuff.

Martin Ferro-Thomsen – Talk To Your Users

A change of pace from NoSQL now as I was interested in attending a few talks from product companies with regards to how they were building community and publicising their products.

I’m interested in moving back toward product design & development in the future, but I know that gaining traction and mindshare is a much harder challenge than the technical side.

This ended up being one of my favourite talks of the conference, as it touched on the various psychologies that were at play with your early stage users – keeping them happy, getting them to become ambassadors, and then nurturing the community.


Though I learnt a ton about NoSQL today, I actually came away less prepared to adopt it than when I started the day. Picking a product is a complex and multi-faceted problem, and the products are then hard to benchmark and compare. In the background, we also have vendors making big claims and omitting their downsides.

For this reason, I think I would like to have heard more practical experiences of real world implementations – what went well, what didn’t go so well etc.

Posted in Events | Comments Off

Metrics Driven Development

One of my main takeaways from day 1 of the conference was the importance of using metrics to monitor what our applications are doing, how they are running, and whether they are currently in some problematic or broken state.

Once we have good metrics and a good set of monitoring systems on top of them, we can be much more aggressive in pushing out changes due to the fact that this style of monitoring gives us a very effective early warning system with regards to bugs or breakages that have been introduced.

The problem is that internal application metrics can sometimes be hard to capture and slurp out into monitoring or graphing software.

Scraping events out of log files or polling them out of a database is sub-optimal, and even coding them into the application explicitly can be difficult if they were not accounted for during the early development phases.

A few speakers today made the point that generation of metrics should be included up front as part of the development of individual features. They could in fact be included in the ‘definition of done’, such that each feature is not considered complete unless appropriate metrics and monitoring are put in place off the back of it. By encouraging or enforcing metrics and monitoring in this way, we are likely to end up with a very rich insight into the application over time – at least if we are able to handle and dig into the data.
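
As an illustrative sketch (the metric names and the in-process registry are invented; in practice the increments would go to something like statsd/Graphite), building metric emission into the feature itself might look like:

```python
from collections import Counter

# A minimal stand-in for a metrics backend: every feature code path
# increments a named counter as part of its normal execution.
metrics = Counter()

def record(name, value=1):
    metrics[name] += value

def sign_up(email):
    # The metric calls ship with the feature -- part of its
    # "definition of done", not an afterthought.
    record("signup.attempt")
    if "@" not in email:
        record("signup.invalid_email")
        return False
    record("signup.success")
    return True

sign_up("alice@example.com")
sign_up("not-an-email")
```

Once every feature emits counters like these, the monitoring layer only has to graph and alert on names it already knows exist.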

Etsy are often held up as an example of a team making heavy use of metrics in this style. Indeed, Mike states in this presentation that ‘Metrics are a part of every feature’.

Metrics Driven Development?

One speaker today mentioned taking this idea further, and using application metrics to really drive the features that we implement and how we implement them. Even though it sounds a little frivolous at first, I actually think there is a fair amount of potential in the idea.

For instance, imagine the scenario where we’ve been asked to add a new field to a form, perhaps a postcode on the sign up form of some web application. This sounds like an innocuous requirement that we would usually just add and not think twice about. However, think of all of the metrics that we could capture before and after the change if we were to be aggressive with regards to application metrics and monitoring:

  • Form completion attempts % without new field
  • Form completion attempts % with new field
  • Form completion success % without new field
  • Form completion success % with new field
  • % time the new field is omitted
  • % time the new field is entered but is invalid
  • % time the new field is entered successfully

Metrics like this would give us great insight into whether the change was successful, whether it led to breakage, how it has changed user behaviour, and whether it is delivering net value to the business in terms of the A/B test. The bottom few in the list could be used to track and improve the field over time, assuming there is some benefit to the business in capturing the new data.
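
Purely as an illustration with made-up numbers, those percentages fall straight out of a handful of counters:

```python
from collections import Counter

# Hypothetical counters captured around the sign-up form, before and after
# the new postcode field went live (all figures invented).
counts = Counter({
    "attempts_without_field": 1000, "success_without_field": 900,
    "attempts_with_field": 1000, "success_with_field": 850,
    "field_omitted": 100, "field_invalid": 50, "field_valid": 850,
})

def pct(part, whole):
    # Guard against division by zero before any traffic has arrived.
    return round(100.0 * part / whole, 1) if whole else 0.0

success_before = pct(counts["success_without_field"], counts["attempts_without_field"])
success_after = pct(counts["success_with_field"], counts["attempts_with_field"])
invalid_rate = pct(counts["field_invalid"], counts["attempts_with_field"])
```

With these invented figures the success rate drops from 90% to 85% after the change, and 5% of submissions fail on the new field – exactly the kind of signal that would tell us whether the field is worth its cost.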

We could also make our metrics more granular so that we can pivot by, say, geography or mobile device to identify any problems in relation to those users. [This would create huge volumes of data for an absolutely trivial field, but I could nonetheless see net benefit whilst the feature is being rolled out.]

The point is that even in this trivial example, working in this style does actually sound like a very appealing and value driven way to deliver software:

  • Our implementation decisions should be data driven. We should take the smallest possible step and measure the impact in order to maximise value delivery;
  • We should only be implementing features if they deliver a net benefit. If we can’t measure some feature, how can we be sure it did provide a net benefit? Perhaps we should deliver something of more measurable value initially?;
  • If we can’t monitor some feature, how can we be sure it can reliably and consistently be deployed? Not being able to monitor and measure absolutely reduces the desire to deliver the feature. Perhaps we should deliver something we can more reliably deliver first?

My point is this: in a data-driven organisation, metrics and monitoring at least deserve to be brought forward in the process as a factor that heavily influences the development decisions we make. A new X-driven-development has been born!

Posted in Devops | Comments Off