As DBAs, we are accustomed to various recovery scenarios when restoring databases. Sometimes database restores are simple, planned recoveries. Hopefully only once in a while, recoveries are unplanned.
If you are reading this blog, I don’t have to tell you that DBAs perform a lot of database backups and a fair amount of database restores. We spend a lot of time thinking about backups, even dreaming about backups. Sometimes databases are restored as part of a server migration, which is a great way to test your backups, as a backup is only good if it’s restorable. In other cases, you might be restoring a database because of a badly written DML statement and you are only concerned with a single table. And hopefully only once in a great while, you are restoring a database because of a real disaster.
Regardless of the scenario, this is how I plan and organize myself for successful database disaster recoveries.
- Understand your risks
- Mitigate the risks
- Practice the recovery
- What to do when disaster strikes
Understand your risks
There are a lot of risks when it comes to operating a database that is part of a strategic business process. Make sure you document the risks and communicate them to your manager and upper management so that everyone in the decision-making process is well informed. Don’t be afraid to remind them once a quarter :)
Here are a few typical risk factors:
- Single Points of Failure (SPOF)
- Old\Aging Infrastructure
- Unmaintained Infrastructure
- Backup devices with limited storage
- Rapidly changing\unpredictable workloads (Capacity)
Mitigate the risks
There are two types of risks: the knowable and the unknowable.
Knowable risks:
- SPOF - Identify Single Points of failure and work towards redundancy.
- Old\Aging Infrastructure - If the infrastructure hosting your database is 2 years old, you should be planning and budgeting for replacement during year 3.
- Unmaintained Infrastructure - Is the firmware up to date, OS patches up to date, infrastructure under support?
- Backup devices with limited storage - Know your backup requirements and how much space your backups require
- Capacity - Does your infrastructure have enough capacity for the workload it currently supports?
SPOF
There are two aspects: hardware and software. You can reduce the hardware risk by implementing redundant database hardware and storage. From a software perspective, there are many solutions you can implement: Log Shipping, Replication, Multi-node clustering, etc. It is also a best practice to capture requirements to build redundancy within the application.
Old\Aging Infrastructure
You can reduce risks by planning, budgeting and communicating the need for new hardware. However, in the short term, you should keep additional capacity on standby in the event of a fatal infrastructure issue.
Unmaintained infrastructure
The first thing your hardware support person is going to ask when you file an incident is “Is your device up to date on firmware, patches, service packs, etc.?” You should be keeping up on this. Not doing so slows down any recovery, because updating is the first thing a support person is going to ask you to do. Also, not keeping up on this is usually a key finding in an RCA (Root Cause Analysis).
Backup devices with limited storage
Monitor your backup device for free disk space and alert on it. Build reporting and alerting on the success and failure of backups, and on the date of the last successful backup for each of your databases. I can’t tell you how many times I’ve seen a potential recovery situation where nobody knew when the last backup occurred.
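As a rough illustration of the two checks above, here is a minimal Python sketch. The function names, the 20% free-space threshold and the 24-hour maximum backup age are all assumptions made for the example, not recommendations; where the last successful backup time comes from (backup tool history, a log table, etc.) depends on your platform and is left to the caller.

```python
import shutil
from datetime import datetime, timedelta


def disk_usage_for(path):
    """Return (free_bytes, total_bytes) for the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free, usage.total


def check_backup_health(free_bytes, total_bytes, last_success, now,
                        min_free_ratio=0.20, max_age=timedelta(hours=24)):
    """Return a list of alert messages; an empty list means nothing to report.

    min_free_ratio and max_age are illustrative defaults (assumptions).
    last_success is a datetime, or None if no backup has ever succeeded.
    """
    alerts = []

    # Check 1: free space on the backup device.
    free_ratio = free_bytes / total_bytes
    if free_ratio < min_free_ratio:
        alerts.append(f"backup device low on space: {free_ratio:.0%} free")

    # Check 2: age of the last successful backup.
    if last_success is None:
        alerts.append("no successful backup on record")
    elif now - last_success > max_age:
        hours = (now - last_success).total_seconds() / 3600
        alerts.append(f"last successful backup is {hours:.0f} hours old")

    return alerts
```

In practice you would call `disk_usage_for` on the backup path, run the check per database on a schedule, and send any returned messages to whatever alerting channel your team already uses.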
Capacity
Monitoring and reporting on resource utilization is key to predicting when you will run into resource constraints. Publish weekly reports, and share monthly or quarterly reports with your manager and upper management. Publishing and reviewing reports regularly can help you tie a change in utilization to something that changed in the environment (software release, new customer with a large user base, etc.). Alerting on resource utilization is important as well.
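One simple way to turn those weekly reports into a prediction is to fit a straight line through the samples and see when it crosses your capacity. The sketch below does that with a plain least-squares fit. The function name and the assumption of evenly spaced weekly samples are mine, and a linear trend is only a first approximation for workloads that change rapidly or unpredictably.

```python
def weeks_until_full(weekly_used, capacity):
    """Project how many weeks remain until usage reaches capacity.

    weekly_used: usage samples, one per week, oldest first (same unit as capacity).
    Returns None when the trend is flat or shrinking.
    """
    n = len(weekly_used)
    if n < 2:
        raise ValueError("need at least two samples")

    # Ordinary least-squares fit of usage against week index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_used) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in enumerate(weekly_used))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    # Week index where the fitted line hits capacity, measured from the
    # most recent sample (index n - 1).
    weeks = (capacity - intercept) / slope - (n - 1)
    return max(weeks, 0.0)
```

For example, with usage of 100, 110, 120 and 130 GB over four weeks and 200 GB of capacity, the fitted growth is 10 GB per week, so the projection is about 7 weeks from the latest sample.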
Unknowable risks
- Quality of your most recent backup - Practice your recovery.
- Geographic based natural disasters - Have an off-premises plan in the event there is a geographic disaster where your database is hosted.
Quality of your most recent backup
There are a few sayings in the industry: “Your backup is only good if you can restore it”, “We have a backup, but we don’t know if it’s restorable” and so on. The point is that a great way to test your backup and have confidence in the process is to regularly plan and test your recoveries.
Geographic based natural disasters
Earthquakes, fires and hurricanes can all impact a region and cause outages. In a perfect world, you have a redundant data center or cloud in a separate geographic area (a different flood plain, tectonic plate, power grid, etc.). Many times you won’t have this luxury, so come up with a couple of different scenarios with cost, effort and downtime, get management sign-off on the approach, and implement it. At a minimum, make sure backups are copied to another facility (cloud, data center, device, etc.). For the most part, re-deploying or rebuilding the application is much easier than reconstructing the data.
Practice the Recovery
I touched upon this in a previous section, but it bears repeating. A great way to test your backup and have confidence in the recovery process is to regularly plan and test your recoveries. Practicing this can also give you a few more data points. 1) If you time the recovery process, you can communicate what downtime will look like in the future based on the amount of time it takes to recover. 2) Testing the recovery process after each software release helps you identify new areas that may require additional effort in the recovery process (engineering adds new database tech, a new database, etc.).
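To make the timing data point concrete, here is a small Python sketch of a stopwatch for a recovery drill. The `RecoveryDrill` class and the step names are hypothetical, and the sketch only records how long each step takes; the actual restore commands depend on your database platform and are not shown.

```python
import time
from contextlib import contextmanager


class RecoveryDrill:
    """Record how long each step of a practice recovery takes."""

    def __init__(self, clock=time.monotonic):
        # clock is injectable so the timing logic can be tested without waiting.
        self._clock = clock
        self.steps = []  # list of (step name, elapsed seconds)

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised an error.
            self.steps.append((name, self._clock() - start))

    def total_seconds(self):
        return sum(seconds for _, seconds in self.steps)

    def report(self):
        lines = [f"{name}: {seconds / 60:.1f} min" for name, seconds in self.steps]
        lines.append(f"total: {self.total_seconds() / 60:.1f} min")
        return "\n".join(lines)
```

You would wrap each stage of the drill in `with drill.step("restore full backup"):` and keep the output of `drill.report()` from every practice run. Comparing reports across releases shows whether your expected downtime is growing.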
What to do when disaster strikes
The hardest thing to do during a disaster is to communicate the status and answer questions while also focusing on the recovery itself. You could write an entire blog post about this, so I’ll summarize what I feel are the important talking points.
- Implement a status page - Post updates to the site. Get your teams used to checking the status page for uptime metrics, downtime for any releases or patches, etc.
- Nominate someone to organize the recovery - This person creates a Slack channel or Zoom meeting for the folks working on the technical side of the problem, starts an internal email thread, updates the status page, opens a ticket with the appropriate hardware vendor, etc.
- Communication talking points - There are usually three audiences here: the executive team, the customer success managers (or the customers themselves), and the technical team working on the recovery. Things to keep in mind: how long the recovery will take (you know this from practicing), what is impacted, who is impacted, the current status, and when the next communication will occur.
- Host an RCA Meeting (Root Cause Analysis) and come up with action items to reduce risk going forward.
I’ve given you a lot to think about. You may not be able to work on all of the recommendations above. And maybe the items above are not enough to help you recover successfully during your next outage. The most important things are to know and communicate your risks and how much it will cost to remediate them, and to show, communicate and document evidence of improvement.