Jacques Mattheij

Technology, Coding and Business

Let's talk about your backups

Hard to believe, but the most frequently encountered ‘to-fix’ during due diligence has to do with what you’d think would be a formality: having proper backups.

On an annual basis I see quite a few companies from the inside and it is (still!) a total surprise to me how many companies have a pretty good IT operation and still manage to get the basics wrong when it comes to something as simple as backing up the data and software that is critical to running their business. Of course it’s easy to see why backups are an underappreciated piece of the machinery: you rarely need them. But when you really do need them, not having them can rapidly escalate a simple hardware failure into an existential crisis.

Here is a list of the most common pitfalls encountered in the recent past:

  • Not having any backups at all. Yep, you read that right. No backups, because supposedly they’re no longer needed: after all, we’ve got RAID, the cloud and so on. WRONG. Hosting your stuff ‘in the cloud’ or having a RAID setup is not a substitute for a backup system. Backups are there for when your normal situation turns abnormal; they allow you to recover from the unforeseen and unplanned for (usually called disaster recovery) and will, apart from maybe some time to order hardware, allow your business to keep functioning once the crisis has been dealt with.

  • Not having everything backed up. What should you back up? EVERYTHING. Documentation, build tools, source code, all the data, databases, configuration files and so on. Missing even a single element that could be backed up in seconds right now might cost you days or even weeks of downtime if something unforeseen happens. Making backups is very easy; re-creating lost software, or trying to re-create a database that was lost, especially when you’re missing some of the other components as well, is going to be many orders of magnitude harder than making a backup. A common way to get into this situation is to set up a good backup regime initially but not to update it when the service is upgraded or modified; over time the gap between what’s backed up and what actually runs the production environment slowly grows. (The first sketch after this list shows what a ‘back up everything’ job could look like.)

  • Not verifying backups. After you’ve made your backup you want to verify that it was made properly. The best way to do this is to restore the backup and to rebuild a system comparable to the live one with it, preferably from scratch, and then to verify that this system actually works. If you can do that, and you do it regularly (for instance: after every backup), you’ll have very high confidence that you can get your systems back online in case of an emergency. One good way to do this is to install your test system from the backup (after suitably anonymizing the data if you’re required to, or if that is policy, and it should be!). A restore-and-verify sketch follows the list below.

  • Not keeping your backups in a different location than your primaries. If your datacenter burns down it doesn’t help that your backup sat in another rack, let alone in the same rack, or on the same machine in a different VM. Backups should be separated from the originals immediately after creation (or even during it) by as large a gap as you can manage. (See the off-site copy sketch below for one way to do this right after every backup.)

  • Having all your backups and your originals accessible to the same individuals or roles. What this means is that you are exactly one bad leaver or one compromise away from being put out of business. The path through which backups can be erased should be a different one than the path through which they are created (and should involve different people!). Don’t end up like CodeSpaces.

  • Not cycling your backups. A single copy of your stuff is better than nothing, but a good backup strategy implies having multiple copies going back in time. Yes, this can get costly if you only do ‘full’ backups, but if you do incrementals with periodic full backups the cost can be kept under control, and it will allow you to recover from more complex problems than simply losing all your data and restoring it. This means you will have multiple full backups and multiple partials to get you as close to a given date, as far as the state of your systems is concerned, as you want. (The last sketch after this list shows a minimal retention policy.)

  • Having all the backups and originals on the same media. This is unwise for several reasons: for one, it implies that all the storage is rotating and online, which makes it vulnerable; second, it implies that if there is something wrong with any one of the copies there might be something wrong with all of the copies. In that case you have a problem.

  • Replication in lieu of backups. If you use a replicating file store (such as a NAS), even one from a top-tier vendor, you still need backups, no matter what the sales brochure says. Having your data replicated three or even five times is not the same as having a backup: one single command can wipe out all your data and all the replicas. Do not rely on replication as your backup strategy.

  • If backups are encrypted, store the decryption keys outside of the systems that are being backed up! It’s great that you have an internal wiki where you document all this stuff, but if you’re standing there with your system down that wiki might be down too, and it contains exactly what you need to get your data back. Print that stuff out and stick it in a vault or some other safe place so that you can reach it when you need it most. The same goes for account credentials, in case you use a third-party service for part of your backup strategy, and for a list of emergency telephone numbers.

  • Not having a disaster recovery plan. Go through the motions of restoring after a total crash, without allowing yourself to refer to the currently running live environment, document each and every little thing that you find missing, make sure you fix those issues, then test again until you can recover reliably. Re-do this periodically to make sure that nothing crept in that didn’t make it into the backup plan.
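
Below are a few minimal sketches (in Python) of what some of the above could look like in practice. First, the ‘back up everything’ job: a manifest-driven backup that bundles code, configuration, documentation, tooling and a database dump into one archive. All the paths, the database name ‘appdb’ and the use of pg_dump are made-up assumptions for illustration; the point is that the manifest lists everything the service needs and gets updated whenever the service changes.

    # Sketch: manifest-driven full backup (all names and paths are hypothetical).
    import datetime
    import subprocess
    import tarfile
    import tempfile
    from pathlib import Path

    MANIFEST = [                   # everything the service needs to run
        "/srv/app/source",         # source code (besides your VCS host)
        "/srv/app/config",         # configuration files
        "/srv/app/docs",           # documentation
        "/srv/app/build-tools",    # build scripts and tooling
        "/var/lib/app/uploads",    # user data that lives on disk
    ]
    BACKUP_DIR = Path("/backups")

    def run_backup() -> Path:
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        archive = BACKUP_DIR / f"full-{stamp}.tar.gz"
        with tempfile.TemporaryDirectory() as tmp:
            # Dump the database to a file so it ends up inside the same archive.
            db_dump = Path(tmp) / "appdb.sql"
            with db_dump.open("wb") as fh:
                subprocess.run(["pg_dump", "appdb"], stdout=fh, check=True)
            with tarfile.open(archive, "w:gz") as tar:
                tar.add(str(db_dump), arcname="appdb.sql")
                for path in MANIFEST:
                    tar.add(path, arcname=Path(path).name)
        return archive

    if __name__ == "__main__":
        print(run_backup())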
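
Second, verifying a backup by actually restoring it. This sketch assumes the archive layout from the previous one; the scratch database name and the smoke-test query (against a made-up ‘users’ table) are placeholders. A real verification should go further and rebuild a system that looks like production, then exercise it.

    # Sketch: restore a backup into a scratch environment and check it.
    import subprocess
    import sys
    import tarfile
    import tempfile
    from pathlib import Path

    EXPECTED = {"appdb.sql", "source", "config", "docs", "build-tools", "uploads"}

    def verify(archive: str) -> bool:
        with tempfile.TemporaryDirectory() as tmp:
            with tarfile.open(archive, "r:gz") as tar:
                tar.extractall(tmp)
            missing = EXPECTED - {p.name for p in Path(tmp).iterdir()}
            if missing:
                print(f"backup incomplete, missing: {missing}")
                return False
            # Load the dump into a throwaway database and run a trivial query
            # against a hypothetical table.
            subprocess.run(["createdb", "restore_test"], check=True)
            try:
                subprocess.run(["psql", "-q", "-d", "restore_test",
                                "-f", str(Path(tmp) / "appdb.sql")], check=True)
                subprocess.run(["psql", "-d", "restore_test",
                                "-c", "SELECT count(*) FROM users;"], check=True)
            finally:
                subprocess.run(["dropdb", "restore_test"], check=True)
        return True

    if __name__ == "__main__":
        sys.exit(0 if verify(sys.argv[1]) else 1)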
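
Third, getting a fresh backup off-site immediately after it is made. The host, user and target path are placeholders and rsync over ssh is just one option; object storage in another region, or media that physically leave the building, work as well.

    # Sketch: ship an archive off-site straight after creating it.
    import subprocess
    import sys

    OFFSITE = "backup@offsite.example.com:/srv/backups/"   # hypothetical target

    def ship_offsite(archive: str) -> None:
        # -a preserves attributes, -e selects ssh as the transport.
        subprocess.run(["rsync", "-a", "-e", "ssh", archive, OFFSITE], check=True)

    if __name__ == "__main__":
        ship_offsite(sys.argv[1])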
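
Finally, a minimal retention policy: keep the last N full backups and drop the rest. It assumes the ‘full-YYYYMMDD-HHMMSS.tar.gz’ naming from the first sketch; a real scheme would also keep incrementals in between and hold on to, say, one monthly full for much longer.

    # Sketch: prune old full backups, keeping the most recent KEEP_FULL of them.
    from pathlib import Path

    BACKUP_DIR = Path("/backups")
    KEEP_FULL = 8    # e.g. eight weekly fulls, with incrementals in between

    def prune() -> None:
        fulls = sorted(BACKUP_DIR.glob("full-*.tar.gz"))   # timestamps sort chronologically
        for old in fulls[:-KEEP_FULL]:
            print(f"removing expired backup {old}")
            old.unlink()

    if __name__ == "__main__":
        prune()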