Friday, May 6, 2011

Problem
I had a production problem recently, the transaction log grew 1 GB every minute and filled up the entire drive. The error message I got was: "The transaction log for database 'MyDB' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases. My SQL Server is clustered and the issue happened after a node failover occurred. In this tip I cover steps that I took to resolve this problem.

Solution
The message I got was the transaction log drive is full and users are unable to access the application, DBAs never want to hear this! The usual DBA question is "what was changed"?

My first priority was to fix the problem immediately, so users can access the application and then do root cause analysis. For the database in question, I changed the database recovery model to SIMPLE. Then I shrank the transaction log file to a reasonable size (note: you can't shrink it below its original size). After these changes, the log growth stopped, but this still did not tell me what the issue was. Something to note is that changing the recovery model to SIMPLE is not possible if Mirroring or Log Shipping is setup unless you remove these first, but these can be reconfigured later, the important thing is to stop the bleeding ASAP.

Other options would have been to create another transaction log file on a different drive that had plenty of space or to clean up files on the drive that was running out of space. But in my case the file was growing so rapidly that I needed to change the recovery model so the growth would stop until this issue was resolved.

Below are the steps that I took to check the size, change the recovery model and shrink the log file.
SELECT recovery_model_desc FROM sys.databases WHERE name = 'MyDB' --Before

ALTER DATABASE MyDB SET recovery simple

SELECT recovery_model_desc FROM sys.databases WHERE name = 'MyDB' --After

EXEC xp_fixeddrives --Check free drive space

EXEC sp_helpdb MyDB -- Note the size of the log before shrink

DBCC shrinkfile(MyDB_log, 1024) -- shrink log to 1 GB

EXEC sp_helpdb MyDB -- Note the size of the log after shrink

EXEC xp_fixeddrives -- Check free drive space


First aid is complete, database is up and users are able to use the application again.

Now I need to find out what was causing the log to grow so fast, fix the issue, put the database back in FULL recovery mode, do a full backup and make sure my log backups are working correctly.

The drive now has enough space. I put the database in Full recovery mode again to see if I could catch the offending transaction and I monitored drive space and transaction log size. I immediately noticed the transaction log growing about 1 GB per minute. I then executed the block of code below to find out what may be causing the log growth.

DBCC opentran --get the spid and feed it into the next query

DECLARE @handle BINARY(20)
DECLARE @SPID INT
SET @SPID = 100 -- 100 was the spid from the dbcc opentran

SELECT @handle = sql_handle
FROM MASTER..sysprocesses
WHERE spid = @SPID
SELECT [text] FROM ::fn_get_sql(@handle)
GO

The culprit was a stored procedure that was showing up every time I executed the above block of code, although the SPID changed each time. I looked into the stored procedure which was not very complicated, but does a bunch of deletes and inserts based on some condition. This procedure was called by an automated process every minute. What happens when a cluster node fail-over happens? The cache is flushed and all the query plans are gone. Since there has not been any issue with this procedure in the past, my first option was to recompile the procedure. Viola, it worked! The transaction log stopped growing so fast and I didn't see this procedure in the open transactions from the above code. So this simple stored procedure that worked fine suddenly caused a major production issue causing the application to shut down and the fix was as simple as a recompile. Quite interesting.

My next step was to start a full database backup ASAP. While the full database backup was running I kicked off transaction log backups every 5 minutes and they seemed to work fine. I changed the transaction log backup to its normal schedule. After the full backup was completed, I reviewed the sql log, drive space, transaction log size, open transactions and everything seemed fine. If it is a cluster, you could try failing back to the other node and see if that fixes the problem, which in my case was the need for a recompile. In a cluster, the storage is shared by the nodes, so there is no benefit from a storage point of view.

After this, I researched what I could have done better in this situation. I knew that the scripts I used were from SQL Server 2000. SQL Server 2005/2008 DMVs would have helped me to find the offending transaction. Another eventful day in a DBAs life. Feels good to have resolved the issue and at the same time,

No comments: