Increasing System Robustness With A ‘Let It Crash’ Philosophy

Designing fault tolerant systems is extremely difficult.  You can try to anticipate and reason about all of the things that can go wrong with your software and code defensively for these situations, but in a complex system it is very likely that some combination of events or inputs will eventually conspire against you to cause a failure or bug in the system.

In certain areas of the software community such as Erlang and Akka, there’s a philosophy that rather than trying to handle and recover from all possible exceptional and failure states, you should instead simply fail early and let your processes crash, but then recycle them back into the pool to serve the next request.  This gives the system a kind of self healing property where it recovers from failure without ceremony, whilst freeing up the developer from overly defensive error handling.
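To make the idea concrete, here is a minimal sketch of a supervisor in Python. It is not how Erlang or Akka implement supervision; the `Worker` and `Supervisor` classes are hypothetical names invented for this illustration, and the "crash" is just an unhandled exception.

```python
import itertools


class Worker:
    """A disposable worker. It holds no state that must survive a crash."""

    _ids = itertools.count(1)

    def __init__(self):
        self.worker_id = next(Worker._ids)

    def handle(self, request):
        # No defensive checks: bad input simply raises and kills this worker.
        return 100 / request


class Supervisor:
    """Keeps one worker alive and replaces it whenever it crashes."""

    def __init__(self):
        self.worker = Worker()
        self.restarts = 0

    def serve(self, request):
        try:
            return self.worker.handle(request)
        except Exception:
            # Let it crash: throw the worker away and start a clean one
            # so the next request is served from a known good state.
            self.worker = Worker()
            self.restarts += 1
            return None


supervisor = Supervisor()
results = [supervisor.serve(request) for request in [4, 0, 5]]
```

The second request divides by zero and takes its worker down, but the third is served by a fresh worker as if nothing had happened. The error handling lives in one place, the supervisor, rather than being scattered through the worker code.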

I believe that implementing let it crash semantics and working within this mindset will improve almost any application – not just the real time Telecoms systems where Erlang was born.  By adopting let it crash, you bake redundancy and defence against errors into the architecture rather than trying to defensively anticipate scenarios right down in the guts of the code.  It will also encourage you to implement more redundancy throughout your system.

Also ask yourself, if the components or services in your application did crash, how well would your system recover with or without human intervention?  Very few applications will have a full automatic recoverability property, and yet implementing this feels like relatively low hanging fruit compared to writing 100% fault tolerant code.

So how do we start to put this into practice?

At the hardware level, you can obviously look towards the ‘Google model’ of commodity servers, whereby the failure of any given server supporting the system does not lead to a fatal degradation of service.  This is easier in the cloud world where the economics encourage us to use a larger number of small virtualised servers.  Just let them crash and design for the fact that servers can die at a moment’s notice.

Your application might be comprised of different logical services.  Think a user authentication service or a shopping cart system.  Design the system to let entire services crash.  Where appropriate, your application should be able to proceed and degrade gracefully whilst the service is not available, or to fall back onto another instance of the service whilst the first one is recycling.  Nothing should be in the critical code path because it might crash!
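As a rough sketch of that fallback behaviour, the Python below tries each instance of a service in turn and returns a degraded response if none of them is available. The function name `call_with_fallback` and the recommendations service are assumptions made up for this example, and a real system would also want timeouts and health checks.

```python
def call_with_fallback(instances, request, degraded_response):
    """Try each service instance in turn; degrade if all are down."""
    for instance in instances:
        try:
            return instance(request)
        except Exception:
            # This instance is crashed or recycling; move on to the next.
            continue
    return degraded_response


def crashed_recommendations(user_id):
    # Stands in for an instance that is currently restarting.
    raise ConnectionError("instance is recycling")


def healthy_recommendations(user_id):
    return ["item-1", "item-2"]


# A standby instance picks up the request when the first one is down.
with_standby = call_with_fallback(
    [crashed_recommendations, healthy_recommendations], "user-42", [])

# With every instance down, the page still renders, just without the extras.
all_down = call_with_fallback([crashed_recommendations], "user-42", [])
```

The caller never sees the exception. It either gets a real answer from whichever instance is up, or an empty list that the page can render around.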

Ideally, your distributed system will be organised to scale horizontally across different server nodes.  The system should load balance or intelligently route between processes in the pool, and different nodes should be able to join or leave the pool without too much ceremony or impact to the application.  When you have this style of horizontal scalability, let nodes within your application crash and rejoin the pool when they’re ready.

What if we go further and implement let it crash semantics for our infrastructure?

For instance, say we have some messaging system or message broker that transports messages between the components of our application.  What if we let that crash and come back online later?  Could we design the application so that this is not as fatal as it sounds, perhaps by allowing application components to write to or dynamically switch between two message brokers?

Distributed NoSQL data stores give us let it crash capability at the data persistence level.  Data will be stored in some distributed grid of nodes and replicated to at least 2 different hardware nodes.  At this point, it’s easier to let database nodes crash than try to achieve 100% uptime.

At the network level, we can design topologies such that we do not care if routers or network links crash because there’s always some alternate route through the network.  Let them crash and when they come back the optimal routes will be there ready for our application to make use of again in future.

Let it crash is more than simple redundancy.  It’s about implementing self recoverability of the application.  It’s about putting your site reliability efforts into your architecture rather than low level defensive coding.  It’s about decoupling your application and introducing asynchronicity in recognition that things go wrong in surprising ways.  Ironically, sitting back and coolly letting your software crash can lead to better software!

Posted in Software | Comments Off

Why DevOps Matters (To Developers)….

DevOps stems from the idea that developers and operations should work more closely together – communicating, knowledge sharing, and collaborating to increase the quality of the systems that we build and operate.

Though DevOps is an idea that is finding a lot of success and adoption, most of the enthusiasm and thought leadership appears to come from the Operations side of the fence.

This is of course understandable.  With Operations teams being on the front line and talking to end users daily, they have an obvious motivation not to upset customers through downtime, and an obvious personal motivation to avoid fire-fighting issues in favour of working on higher value projects.

However, as a developer who has always worked at this intersection of the two teams, I feel that developers should also sit up and give more credence to what is coming out of the DevOps community. By opening up communication paths and adopting Operations-like skills and mindsets, we can likely all benefit – both as individuals and as teams and in the quality of software that we deliver.

Here are some of the reasons why I think this is the case:

DevOps Increases The Focus On Production

Though software teams might divide themselves into development, QA, and operations, these can be slightly arbitrary distinctions. The business who are paying for all of this only care about the net output of what those three teams deliver – the value that the finished production software is bringing to the organisation.

Our goal as developers shouldn’t be just to deliver source code. Our goal should be to deliver a product or a feature or a system that is in production and that people are reliably using and gaining business benefit from.  Though we might personally get our motivation from cutting code, it is all for nothing if the work we do never makes it to production, or if the users of the application have a bad time once the work hits.

To my mind, the operational focus on production and delivery espoused by DevOps is a good thing which usually leads to much more net value being created for the business.  DevOps oriented development teams have a focus on value and their user base, rather than their code base.  

DevOps Helps You Improve Your Site Reliability

If your application has downtime, customers won’t care if it’s due to a software bug, or a hardware outage, or a failed rollout.  They won’t care if it was a silly human error or some arbitrary combination of events that could not be predicted.  All they care about is that they can’t use the system as they intend.

This might be a product of the systems on which I’ve worked, but with good unit and integration testing and good QA testing, it is possible to catch most software ‘bugs’ that would impact a large percentage of the user base.  However, where things more typically go wrong is when the system comes into contact with the real world.

For instance, we might find our code performs badly under real world load, that a disk fills up, or that a user uses the application in a way which we didn’t anticipate at all – as they are prone to do!  This results in some system failure or degradation.

A DevOps oriented developer or team has a much more stringent focus on these issues and general site reliability.  They’ll not only test their code, but they will think about failure scenarios and mitigate them before code is even released.  They’ll think carefully about detailed testing of their features to minimise the risk of them impacting the broader production system.  They will plan and stage their upgrades to de-risk releases, and always have a rollback strategy.  They will talk regularly with operations to ensure that they are fully taking into account their experience with keeping the site available.  In short, DevOps rightly places site reliability front and center, and almost all developers will benefit from this mindset.

This focus on site reliability might mean that sometimes we churn out fewer raw lines of code in a day, but it also means that we move forward steadily, predictably, and reliably – keeping the system stable and available.

DevOps Helps You Build Better Software

By being more operationally aware of the production context that our code lives within, developers will also design and build better software.

It might be something simple like choosing to add that additional logging statement that you know will make troubleshooting easier later on, or something more complex such as designing a component for horizontal scalability for future growth scenarios.  These kinds of ‘operationally aware’ decisions can lead to massive improvements in the net productivity of the team and the quality of their software.

It’s only by increasing communication with operations teams that we developers will learn about these concerns and incorporate them into our designs and everyday coding decisions.  Simple things such as joint production incident post-mortems or the inclusion of operations staff in your early design process can help you to move in the right direction.

Again, these practices are core to the DevOps philosophy.

DevOps Improves Your Career Prospects

In addition to making you a more well rounded developer with a focus on production, Operations skills such as system administration, monitoring, scripting, change management, and the broader knowledge and experience required to maintain and run complex systems are genuinely useful for developers to acquire.

In most of the software teams and hiring decisions I have been involved in, a developer with this profile would have been more attractive and more valuable than someone with superior coding skills but without the same degree of production awareness.

I believe that as a result of DevOps and other trends, this will continue, i.e. that the best and most valuable developers will increasingly be those who are the most operationally aware – those who can not only code, but also have the knowledge, skills and experience to reliably deliver a working production system over the long haul.

This is particularly true in these tough and resource constrained economic times.  With fewer people having the luxury of saying “it’s not my job”, the generalist will get ahead.

(I guess for some people, such as those in startups and small companies, it was ever thus, with developers pitching in on operations type stuff such as deployments and upgrades.)

DevOps Helps Developers To Own Their Platforms And Infrastructure

A big element of the DevOps movement is the idea of infrastructure as code. This is the idea that we can define our infrastructure and configuration in descriptive files and metadata, and then be able to test and repeatably deploy that infrastructure and our applications on top of it.
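A toy Python sketch of the declarative idea follows. It is not how Puppet or any other configuration management tool works internally; the `plan` and `converge` functions, the package names, and the versions are all assumptions invented for the illustration. The point is only that the desired state is data that can live in version control, and that applying it twice is harmless.

```python
def plan(desired, actual):
    """Return the changes needed to move `actual` to `desired`."""
    changes = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            changes.append(("install", name, version))
        elif actual[name] != version:
            changes.append(("upgrade", name, version))
    for name in sorted(actual):
        if name not in desired:
            changes.append(("remove", name, actual[name]))
    return changes


def converge(changes, actual):
    """Apply a plan to a copy of the actual state and return the new state."""
    state = dict(actual)
    for action, name, version in changes:
        if action == "remove":
            del state[name]
        else:
            state[name] = version
    return state


# Hypothetical description of a server, as it might be checked in with the code.
desired = {"nginx": "1.2", "openjdk": "7"}
server = {"nginx": "1.0", "telnetd": "0.17"}

first_plan = plan(desired, server)
server = converge(first_plan, server)
second_plan = plan(desired, server)  # nothing left to do
```

Because the second plan is empty, the same description can be applied to development, test, and production machines repeatedly and they all end up in the same state.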

This is such a compelling idea with many benefits, and yet developers do not always embrace and own configuration management tools as much as our operations colleagues.

By moving towards infrastructure as code and configuration management, developers are given the ability to own and bring under their control the infrastructure that their code runs on.

People often say that Apple computers are so reliable because they own the full hardware and software stack. Well, with infrastructure as code and repeatable deploys, developers also get to develop and deploy and own the whole platform on which their software is deployed.

‘It worked on my machine’ or ‘it worked in QA’ should be a thing of the past in a mature DevOps team making use of tools such as Vagrant and Puppet, because the development, test, and production environments should all be in line, and all infrastructure changes should also be versioned and tested alongside the code assets.

Doing this well removes so many unknowns and can lead to massive improvements in efficiency and quality of software development.

DevOps Helps You Manage Modern Infrastructure

DevOps has emerged at a time when cloud hosting, infrastructure as a service, and platform as a service are also reaching widespread adoption.

Cloud and PAAS make the hosting environment much more fluid.  For instance, over time operations might want to use these platforms to their full potential and scale up and scale down capacity dynamically.  To do that, they will need to be working with development much more closely to work out how to support this in the applications.

Because of this, I would argue that developers today need to be more aware of the operational environments within which their applications will operate.

Increasingly, we will also find that cloud infrastructure will be managed through software. For instance, the ability to provision new boxes via APIs or deploy applications onto a PAAS.  Managing large scale infrastructure in an automated fashion is likely to start to look more and more like development work.  Development and operations will increasingly start to look like one and the same role.

 

So these are just a few of the reasons why I think developers need to look at DevOps in a lot more detail.  Some of this is about a broadening of mindset from ‘my job responsibility is to deliver good code’ to ‘my job is to deliver and operate a successful system’.  Others are about acquiring the skills that will actually allow you to do that.  With Operations staff then also actively moving towards more of a developer mindset and skillset, DevOps is likely to continue to grow in importance.


How To Do Rollback Well

I still see a lot of fear out there in the development and operations communities with regards to being able to roll back failed software deploys.

People talk about the difficulty of rolling back, the need to always be rolling forward in the case of error, and the difficulty in testing rollback procedures.

However, having written software and been responsible for deploying it to very high availability environments, this isn’t a world I’ve experienced in the slightest.

In my experience:

  • It’s been easier to roll back to a known state than roll forward;
  • It’s been riskier to roll forward if a deploy is going south than it was to roll backwards;
  • It’s much easier to test and gain confidence in a rollback than it is in the roll forward.

I think a lot of the misconceptions people have come from the fact that the average development team simply does not give rollback enough importance and focus in its development and release management process.

This is a shame as rolling back is potentially your best friend with regards to improving robustness of systems and keeping customers happy.

With a good rollback you can simply hit the red button at the first sign of trouble and have the system back up and running whilst you look into the situation. It may be that you’ve overreacted in pulling the release, but that’s generally better than breaking something.

With that said, here are a few tips that I’ve found to work well:

How To Do Rollback Well

The most important step is to implement an architecture that supports the need to rollback. For instance, componentised, service based architectures lend themselves well to this. Persistent message queues and asynchronous services allow you to bring components down for rollback without impacting the main user base. Work towards something like the Blue-Green release pattern such that your application can stay available whilst you are working on one half of the system.
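The Blue-Green idea can be sketched in a few lines of Python. This is a simplified model under my own assumptions: the `BlueGreenRouter` class is hypothetical, and in practice the switch would be a load balancer or DNS change rather than an attribute on an object.

```python
class BlueGreenRouter:
    """Routes traffic to one of two environments; only one is live."""

    def __init__(self, blue_version, green_version):
        self.versions = {"blue": blue_version, "green": green_version}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version):
        # The live environment is untouched while the idle one is upgraded.
        self.versions[self.idle] = version

    def switch(self):
        # Cutting over and rolling back are the same cheap pointer flip.
        self.live = self.idle

    def serving(self):
        return self.versions[self.live]


router = BlueGreenRouter(blue_version="1.0", green_version="1.0")
router.deploy_to_idle("1.1")
before_cutover = router.serving()   # users still see the old version
router.switch()
after_cutover = router.serving()    # users now see the new version
router.switch()                     # first sign of trouble: switch back
after_rollback = router.serving()
```

The rollback does not redeploy anything. The old version was never touched, so going back to it is as quick and as well understood as the cutover itself.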

Test roll backs as thoroughly as your roll outs via the QA process. Throughout development and testing, attempt to get as much, or even more, confidence in your rollback as you have in your release. Foster a mindset that change is good, but the route back to the working system is much more important. People generally prefer working software over new features.

Design the rollbacks and roll forwards to both work idempotently. Ensure you can roll back a bad deploy and then roll it forward again when the time is right. Neither step should be destructive, as we should accept that a rollback has a high degree of probability. Have your QA teams explicitly test roll out, roll back, roll out to further gain confidence and experience in the process. Any observations, problems etc. should be fed back, and the change should not be signed off until the rollback is a high quality one.
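Here is a small Python sketch of what idempotent, non-destructive steps might look like. The `system` dictionary and the `roll_out` and `roll_back` functions are hypothetical stand-ins for a real deployment tool; the script at the bottom mirrors the roll out, roll back, roll out test suggested above.

```python
def roll_out(system, version):
    """Install `version` if needed and make it current. Idempotent."""
    system["installed"].add(version)
    if system["current"] != version:
        system["previous"] = system["current"]
        system["current"] = version


def roll_back(system, version):
    """Undo a roll out of `version`. Idempotent and non-destructive."""
    if system["current"] == version:
        system["current"] = system["previous"]
    # The rolled-back version stays installed so it can go out again later.


system = {"installed": {"1.0"}, "current": "1.0", "previous": None}

roll_out(system, "1.1")
roll_out(system, "1.1")      # repeating the step changes nothing
after_roll_out = system["current"]

roll_back(system, "1.1")
roll_back(system, "1.1")     # repeating the rollback changes nothing
after_roll_back = system["current"]

roll_out(system, "1.1")      # and the release can go forward again
after_second_roll_out = system["current"]
```

Each function checks the current state before acting, so running it twice under stress does no harm, and neither step deletes anything the other one needs.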

Document the roll back procedures. It’s likely there will be a degree of stress involved if we need to roll back the production system, so take the time before the release to write up how to run the rollback process, what to check, and what to do in potential failure situations. For example: if the database deploy fails, run script x.sql and check conditions a, b, c. If condition b has occurred, execute y.sql. Ten minutes spent anticipating failure modes before the change can save a lot of time during a crisis.

Take small gradual steps in your releases. Release intermediate steps behind feature toggles so that we can slowly but surely gain confidence in the feature that we are releasing whilst keeping any rollback as small as possible. Try to upgrade components individually rather than in parallel.
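A feature toggle can be as simple as the Python sketch below. The `is_enabled` function, the `new_checkout` feature, and the percentage scheme are assumptions chosen for illustration rather than any particular toggle library.

```python
import zlib


def is_enabled(toggles, feature, user_id):
    """Return True if `feature` is switched on for this user."""
    percentage = toggles.get(feature, 0)  # unknown features default to off
    # A stable hash puts each user in the same bucket on every request.
    bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
    return bucket < percentage


# The code is deployed but switched off for everyone.
toggles = {"new_checkout": 0}
dark_launch = is_enabled(toggles, "new_checkout", "user-42")

# Confidence grows, so the feature is opened up to all users.
toggles["new_checkout"] = 100
full_release = is_enabled(toggles, "new_checkout", "user-42")

# Pulling the feature is a configuration change, not a redeploy.
toggles["new_checkout"] = 0
pulled = is_enabled(toggles, "new_checkout", "user-42")
```

Intermediate percentages let a fraction of users see the feature first, and because the bucket is derived from a stable hash, a given user gets a consistent experience while the percentage stays the same.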

Have documented steps in place to assert that the rollback process has put the system back into its original good state – something simple like checking the net file size, running a diff, or checking the number of database rows. You’ll then want to back this up with as many automated and manual sanity tests as possible to ensure the rollback was correct.
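One way to automate such a check is to fingerprint the deployed files before the release and compare afterwards. The Python below is a sketch of that idea only: the `fingerprint` function and the `app.conf` file are hypothetical, and it uses a temporary directory so that it can be run anywhere.

```python
import hashlib
import os
import tempfile


def fingerprint(root):
    """Hash every file under `root` into one digest for the whole tree."""
    digest = hashlib.sha256()
    for directory, subdirs, files in os.walk(root):
        subdirs.sort()  # make the walk order deterministic
        for name in sorted(files):
            path = os.path.join(directory, name)
            digest.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as handle:
                digest.update(handle.read())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as app_dir:
    config = os.path.join(app_dir, "app.conf")
    with open(config, "w") as handle:
        handle.write("timeout=30\n")
    known_good = fingerprint(app_dir)          # recorded before the release

    with open(config, "w") as handle:
        handle.write("timeout=5\n")            # the release changes a file
    changed_by_release = fingerprint(app_dir) != known_good

    with open(config, "w") as handle:
        handle.write("timeout=30\n")           # the rollback restores it
    rollback_verified = fingerprint(app_dir) == known_good
```

A matching fingerprint only shows that the files are back as they were. Database state and running processes need their own checks, which is where row counts and sanity tests come in.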

You’ll notice that none of the above comes for free. Each of them takes time and effort, but in my experience this has always been worthwhile, as it lets me move the applications I’m responsible for forward slowly but surely and with confidence.


Configuring Hadoop On EC2

I am about to go live with my first production Hadoop job for a client as a proof of concept.

I found that a lot of the documentation out there is quite text dense, unnecessarily detailed, or out of date, which is frustrating when you’re just trying to get your first cluster up and your first MapReduce job submitted.

For that reason, I’ve decided to write up the guide I wish I had whilst first getting up and running with Hadoop – a simple step by step guide to getting Hadoop up on my preferred host, Amazon EC2.

Please click here for the PowerPoint presentation. (I tried to use SlideShare but it’s dog slow for some reason!) I hope it’s helpful to someone.

Shameless plug – if you’re interested in Hadoop you may like my NOSQL Weekly Mailing list – www.nosqlfriday.com….


Launching DevOps Friday!

There’s so much innovation going on in the fast moving DevOps world that I’ve decided to start a curated email newsletter that will summarise the best content from the week’s blogs, articles and Tweets.

The newsletter will be brief, non-commercial, and curated for quality. I’m aiming to send weekly every Friday.

You can see the first issue here – http://bit.ly/KFwhG9 or sign up at www.devopsfriday.com.


The Key Factor For Viral Products

I love the part in the Facebook movie where the product hits exponential growth a few hours after it was released:

“Harvard at that time did not have a student directory with photos and basic information, and the initial site generated 450 visitors and 22,000 photo-views in its first four hours online.” Wikipedia

All of the features, designs, and A/B testing in the world wouldn’t have achieved that level of adoption on their own.

To me, this growth shows that Facebook clearly met some quite fundamental human needs that had been unfulfilled on the web until that time – perhaps the need for self promotion and some reinforcing of social status and groups by sharing a laugh at some of the early photos with your friends.

As features were added that gave people the ability to express themselves, form friendships and groups, and compare themselves with their peers, people obviously continued to adopt the product extremely quickly all across the world and across cultures. Again, this was for quite human rather than technical reasons.

If you’re looking for your product to spread or go viral, I believe that you need to appeal to people on a similarly primal level. Think about how your product can tap into people’s low level needs, desires, fears, and wants. If you make the user genuinely feel better or more successful after using the product then they’re much more likely to return and spread the word.
