This isn’t strictly one of those articles that exist only to emphasise the point that backups are useless unless you test them. I mean, okay, it does touch on that, but that’s not the whole story…
Planning for recovery and having documented processes for the many different disaster recovery scenarios and outage contexts are massively important to the role of a SQL Server DBA. These are just some of my thoughts about how best to embed recovery planning in day-to-day work so that it becomes a staple of the role.
Test your backups
It goes without saying that no backup is worth anything unless it allows you to restore and recover your environment as expected by both you and your stakeholders. Having a backup on disk is pointless unless you have complete faith in being able to use it. To instil and maintain that faith, it’s important to test your backups.
There are a couple of ways you can do this, but one of the best is to identify any test or development environments that would suit being restored daily from backups of the corresponding production environments. This way you’re constantly testing both the quality of your backups and the restore process in a controlled manner, and as a bonus you’re not letting your test or development systems drift too far from production.
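As a rough sketch, the nightly refresh might look something like this. The database name, logical file names and paths here are all made up for illustration; run RESTORE FILELISTONLY against your own backup to find the real logical names.

```sql
-- Refresh the dev copy from last night's production full backup,
-- relocating the files to the dev server's drive layout.
-- (SalesDB_Dev and all paths below are hypothetical.)
RESTORE DATABASE SalesDB_Dev
    FROM DISK = N'\\backupshare\SalesDB\SalesDB_full.bak'
    WITH REPLACE,
         MOVE N'SalesDB_Data' TO N'D:\Data\SalesDB_Dev.mdf',
         MOVE N'SalesDB_Log'  TO N'E:\Logs\SalesDB_Dev.ldf',
         STATS = 10;

-- A quick sanity check that the restored copy is consistent.
DBCC CHECKDB (SalesDB_Dev) WITH NO_INFOMSGS;
```

Scheduling this as an Agent job means a failed restore shows up as a failed job, so a bad backup gets noticed long before you actually need it.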
Simulate disaster recovery events
Sometimes simply testing the backup is not enough. You need to know you can use it, and use it correctly, in many different contexts and scenarios. That’s why I like to have a day or two per month to spend some time having fun and breaking things. I find it useful to simulate different disaster recovery scenarios to put into practice the method of restoring things to normality. It helps if you work in a virtualized environment, as you can clone VMs or create new ones for the purpose, but any unused test environment will suffice here. Remember to document and archive these methods and processes for future reference. Here are a few ideas as to what you could do at home or at work:
- Simulate drive failure. Corruption of disk volumes containing system databases, user databases, log files, TempDB and so on is always a fun thing to resolve. You can simulate this issue by removing drives, changing drive letters, renaming folders and so on, and becoming familiar with how SQL Server behaves (or doesn’t behave) in each scenario. Then you can think creatively about the best and fastest ways to solve these issues.
- Simulate data corruption. It’s fairly easy to create database page corruption or transaction log corruption (I’m planning a future article about how to do this). I would say this is a really important exercise for DBAs, as corruption can occur on even the most awesome, well-maintained database environments, and finding quick, efficient and effective ways to recover from it is a real challenge.
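Once you’ve injected some corruption, a drill might start from the commands below. Database and file names are illustrative; note that online page restore needs the full recovery model (and Enterprise Edition to do it while the database stays online).

```sql
-- Check a database for corruption (SalesDB is a made-up name).
DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Any damaged pages SQL Server has already encountered are logged here:
SELECT * FROM msdb.dbo.suspect_pages;

-- A single damaged page can be restored from a full backup
-- and then rolled forward with log backups:
RESTORE DATABASE SalesDB PAGE = '1:153'
    FROM DISK = N'\\backupshare\SalesDB\SalesDB_full.bak'
    WITH NORECOVERY;
-- ...then restore subsequent log backups WITH NORECOVERY,
-- take and restore a tail-log backup, and finish WITH RECOVERY.
```

Practising this flow until it’s boring is exactly the point: page-level restore is fiddly the first time and trivial the fifth.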
Maintain ‘maximum data loss’ information
This is all about understanding your environments as a DBA, and having a clear and respectful dialogue with your stakeholders, whether they are developers, users or service owners. I’m going to assume your backup strategy is already governed by organisational or service level agreements. In other words, your business has a clear policy on how and how often databases are backed up. If not, then skip this article completely and speak to your managers!
Having a backup policy is all good and well, but do you and your service owners really understand how that policy impacts individual systems? I’ve been there. I had a system that was backed up on the default “full backup once a day, hourly log backups in working hours” policy and the service owner did not understand the impact of that.
In fact, he started to use his system to log important automated tasks overnight that would be compiled manually in the morning via a reporting application and sent to a third party for further processing. Even worse, funding for his particular area relied on the quality and delivery of that data, and it was only in a passing conversation with him about applying a service pack to the database that I realised the default backup policy was not going to work for him. A disaster occurring during the latter part of a night could result in hours’ worth of important data lost, and as a result thousands in future funding for him and his department. Suffice it to say, we sat down and worked out a better backup policy tailored to that particular system.
The bottom line is this. Don’t define your environments solely by how often they’re backed up, but include what their maximum data loss is as a result of those backups, and whether that is acceptable by the service owners and users.
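A useful starting point for that conversation is to query the backup history that SQL Server already keeps in msdb. As a sketch, something like the following shows, per database, how long it has been since the last backup of any kind, which approximates the data you would lose if the disks failed right now:

```sql
-- For each user database: when was it last backed up, and how much
-- work is currently exposed? type 'D' = full backup, 'L' = log backup.
SELECT d.name,
       MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS last_full,
       MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS last_log,
       DATEDIFF(MINUTE, MAX(b.backup_finish_date), GETDATE()) AS minutes_exposed
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name
WHERE d.database_id > 4   -- skip the system databases for this view
GROUP BY d.name
ORDER BY minutes_exposed DESC;
```

Putting those numbers in front of a service owner makes “maximum data loss” concrete in a way a policy document never does.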
Getting that information together and agreeing it with all parties is tough, but once done the rewards will be three-fold.
- You’ll really understand your environments.
- Both you and your service owners will feel safe in the knowledge that there is no disaster you can’t adequately recover from.
- There’ll be mutual respect and confidence between you as a DBA and your customers.
Don’t put 100% trust into third party backup utilities
Third-party backup utilities like NetBackup, Tivoli and HP’s offerings, to name but a few, are widespread and well used by organisations big and small. Most likely they’re looked after and administered by your Data Center or Infrastructure team, and you, as a DBA, can sit back safe in the knowledge that your backups are being executed efficiently and effectively.
It’s hard to argue with that. I’ve used products like this for a long time and understand how effective they can be, and not just for SQL Server environments. Managing backups and restores outside of SQL Server can be a breeze, and most of the time it’s fine, but that’s not to say it’s always fine. I’ve experienced events that have made me question my trust in these products, particularly when it comes to rebuilding entire instances from backups alone.
Restoring system databases with one such product gave me hit and miss results for a long time, never knowing whether corruption to MSDB or Master was going to occur on restore. I’ve also had problems with third-party backup clients not playing ball when the database is in single user mode. Stuff like that shakes my faith.
I’d recommend at the very least taking local full backups daily of SQL Server system databases, and perhaps even copying those to a secure share, in the event that you need them and your backup application is no longer available.
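A minimal version of that safety net is a scheduled script along these lines (the local path is illustrative, and COMPRESSION assumes an edition that supports it):

```sql
-- Nightly local backups of the system databases, independent of any
-- third-party backup client.
BACKUP DATABASE master TO DISK = N'D:\LocalBackups\master.bak'
    WITH INIT, CHECKSUM, COMPRESSION;
BACKUP DATABASE msdb   TO DISK = N'D:\LocalBackups\msdb.bak'
    WITH INIT, CHECKSUM, COMPRESSION;
BACKUP DATABASE model  TO DISK = N'D:\LocalBackups\model.bak'
    WITH INIT, CHECKSUM, COMPRESSION;

-- Verify a backup is readable without actually restoring it.
RESTORE VERIFYONLY FROM DISK = N'D:\LocalBackups\master.bak' WITH CHECKSUM;
```

Copying those files to a share on another server closes the loop: if the backup application itself is part of the outage, you still have everything needed to rebuild the instance.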
At the end of the day all you need to do as a DBA is feel comfortable with your approach to backup and recovery. You need to know that what you have in place in terms of backups and recovery planning is enough to get things back on track, in a way that everyone expects, when things go wrong. I just try to remember these key points:
- Test my backups
- Be well practised in different DR scenarios
- Understand my environments
- Know my customers are comfortable with their levels of data loss
- Stay in control of my backups