In my previous post, SQL Server Database Corruption! Understanding and Diagnosing, we talked about database corruption. We discussed what it was so we could better understand the threat we face. Then we learned how to diagnose database corruption. Today we are going to talk about remediating and preventing database corruption.
If you haven’t already read the previous post please read it first.
Stop, No, and Don’t – What NOT to Do When Facing Database Corruption
Before we dive into fixing database corruption it is important to know what not to do. There are things we can do which will make a bad situation even worse.
- Don’t Panic – most sysadmins are used to having their cages rattled and can keep cool under duress. Jumping to action without a plan is not wise. Now is not the time to start trying things and performing thoughtless actions.
- Do NOT Detach the Database – we may never get it back again as it may be in a recovery pending state.
- No restarting SQL Services – databases may never get started up again just like the above.
- Don’t Reboot the Machine – same as above 2 points
- Don’t Start by Trying to Repair the Corruption – root cause analysis is critical to preventative measures
Rebooting 3 times won’t help us here. Don’t be like Chip in Sales Guy vs Web Dude. In general, try not to do anything that would remind you of this:
We need to understand the root cause so that we don’t perform useless or harmful actions.
Sometimes it helps to know what not to do before braving an unknown situation. Now let’s move onto fixing database corruption.
Remediating Database Corruption
Warning! There are no guarantees with database corruption! Make sure all of the stakeholders know the deal up front before a crisis happens. Additionally keep in mind there might be data loss at some point in the recovery.
There are no guarantees with database corruption – there may be data loss
Backups are key to dealing with corruption. If you have a good backup that is recent it is almost always a better option to restore than trying to repair corruption.
Preliminary Actions
Before we try to fix the corruption and get the database back into a consistent state we must do a few things:
- Stop any backups against known corrupt databases – it makes no sense to backup a corrupt database.
- Stop any backup deletion jobs – we don’t know how long the corruption has been with us so we need to keep our older backups just in case.
- If there is recent CHECKDB output then review it – remember CHECKDB tells us many details about the type of corruption, objects, and scope.
Engage your stakeholders in dialogue to make sure everyone understands the drill.
First and Best Option: Restoring a Good Backup
When it comes to fixing database corruption, restoring a backup is better than all other options. CHECKDB output will show us what databases are affected along with details about which objects are bad.
Your first response to fixing database corruption should be to restore a good backup
If system tables are damaged we cannot fix them. Only Microsoft can possibly help here. Our options are not good here – we need to either create a new database (new system tables) and then load the data into it OR start from scratch.
Fixing Indexes
Sometimes we get lucky and the corrupted object is an index. There are 2 kinds to consider:
- Clustered Index – since this represents the actual table we would need to completely create a new table and port as much data as we can to it. You may experience data loss and you likely will not be able to read all of the data in the table.
- Nonclustered Index – these can be dropped and recreated. In my experience these are the easiest to fix with the least likelihood of data loss.
I feel much better when I see the database corruption is limited to indexes.
Repair Option
This is a last resort before we destroy and rebuild. Remember, restoring from a backup is better and usually involves less data loss.
Please try not to jump to this before trying other options first. This is surgery with a chainsaw. Sometimes desperate times call for desperate measures; however, we need to make sure we truly are in such dire straits before trying to repair.
The command to do it is:
1 |
DBCC CHECKDB('MyDB') WITH REPAIR_ALLOW_DATA_LOSS; |
This will attempt to repair all the errors. However, it is critical for you to know that this will most certainly result in some kind of data loss. What happens internally is database pages are de-allocated. That is why data loss is a certainty.
Remediation Beyond Our Scope
Sometimes the fix is beyond our capabilities. In cases like this you can consider the following options:
- Open a support case with Microsoft
- There are some advanced restore techniques – no guarantees
- Corruption Specialists – engage a consulting expert or firm that deals with this
- Data Recovery Specialists – seek help from a firm that specializes in data retrieval from disk sectors
None of the above are guarantees of any success.
This truly highlights the importance of having a backup and restore strategy. I want to point out here the equal importance of testing your backups. It is not enough to simply take a backup. They should be regularly tested to ensure they are good i.e. they can be restored and are not corrupted. Typically when I see companies suffer from database corruption the backups (if any) are also bad. Get into the habit of restoring to another instance and running consistency checks.
Database Corruption Aftermath
Now that the database corruption is sorted away we can focus on preventative measures. Here is a list of things to consider:
- Patch the SQL Server up to the latest supported build i.e. SP and CU
- Run CHECKDB every night for a while
- Keep backups for longer if there is a short retention cycle
- If, for whatever reason, backups are not being taken please consider starting
- Create alerts for errors 823, 824, and 825
- Make sure page verification is using a checksum – this is the default behavior
- Make sure to take backups with checksum
- Restore backups and run consistency checks to ensure the backups are not corrupted
- If the problem was in the I/O Subsystem then engage the storage engineering team your organization to run checkdisk or vendor specific diagnostic utilities for the disks.
A simple checklist of the flow for responding to database corruption:
- Procure and analyze SQL Error Logs and CHECKDB output
- Determine the impact of corruption by running CHECK on every database on the target SQL Instance
- Try to restore a backup
- Try to resolve corruption based on what object(s) are corrupted
- Take steps listed in the aftermath above to prevent future corruption
In conclusion, I hope you take away some good ideas from this 2 post series. May all of your databases be consistent and your backups be reliable!
Thanks for reading!
If you liked this post then you might also like my series on SQL Server Environmental Diagnostics Guide.
[…] Jeff Mlakar walks us through troubleshooting an instance of database corruption: […]
[…] UPDATE: Part 2 on Fixing Database Corruption […]
[…] SQL Server Database Corruption! Fixing and Preventing […]
[…] SQL Server Database Corruption! Fixing and Preventing […]