Recovered from The Crash


After a few evenings’ work, I’ve recovered from The Crash and have my server back up and running!

The problem started last week on Thursday when I found the server was not responding to pings, none of the services were available, and the fans were cranking at 100%. The server is headless, so I hooked up a monitor to see if anything strange was going on. Unfortunately, there was no video signal … and even the “raising elephants” magic key sequence didn’t do anything. My only option was to hard power off the box and restart it, at which point it seemed back to normal, except none of the system logs showed anything that would indicate what the problem was. A similar thing had also occurred back in December while we were up in Cleveland, which I had just written off as a random crash, but now this was the second time. I should have known better: Linux just doesn’t crash 😉 and sure enough, the box was locked up again Friday morning.

That evening, I pulled the server out of the closet and hooked it up at my desk along with monitor and keyboard so I could interact with it directly instead of over an SSH connection. I figured maybe this way I could see if anything strange was happening that wasn’t being reflected in the logs. It all seemed normal until finally I saw this:

Oh oh … that couldn’t be good. ata1.00 was obviously referring to the hard drive. A quick search on that { UNC } code showed “Uncorrectable error – often due to bad sectors on the disk” so it became apparent that the drive was failing. Presumably there was some bad area of the disk and when the server accessed that area it just locked up. At this point I tried running a disk check and started getting all sorts of I/O errors. Rather than push the drive to complete failure with the scan, I decided to rebuild the server on a spare drive and restore from backup.
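If you want to check your own logs for the same warning sign, here is a minimal Python sketch that scans log lines for an ATA device reporting the { UNC } flag. The function name `find_unc_errors` and the assumed line layout (a device name like `ata1.00:` followed later on the same line by a braced flag list) are mine for illustration, not something taken from my actual logs, so adjust the pattern to whatever your kernel prints.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout: a device name such as "ata1.00:" followed, later on the
# same line, by a braced flag list that contains UNC, e.g. "{ UNC }".
UNC_PATTERN = re.compile(r"\b(ata\d+\.\d+): .*\{[^}]*\bUNC\b[^}]*\}")


def find_unc_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, device) for every line that reports a UNC flag."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = UNC_PATTERN.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

You could feed it an open log file (for example `find_unc_errors(open("/var/log/kern.log"))`, assuming that is where your distribution keeps kernel messages). Any hit is a good reason to check your backups before doing anything else with the drive.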

The failing drive was an old 120GB and all I had around were older IDE 20GB, 30GB, and 40GB drives (no SATA controller in my “server”). I wasn’t using all 120GB, so moving to a smaller drive was an acceptable solution. My first thought was to make an image of the failing drive using EASEUS Todo Backup, which works great for imaging Windows PCs even when moving to smaller drives. Unfortunately the software doesn’t recognize the Linux file system so it could only do a sector-by-sector copy, which means I could only copy the image to another identical 120GB drive. With the image copy option off the table, I decided to just do a clean install of Karmic on the spare 30GB drive and then restore what I could from my backups. I hadn’t done a clean install of Linux since I put this server together back in 2007, so why not? I spent the rest of the weekend re-installing and re-configuring packages to put the server back the way I had it. Luckily I keep good notes of all the changes I make, so between that, being able to pull files (like the MySQL databases and configuration files) from the 120GB drive connected via a USB enclosure, and having the rest of the important stuff backed up on my two NAS devices and in the cloud (via Jungledisk), the restore process was relatively straightforward with no data lost.

Sunday night, I moved the rebuilt server back to its home in the closet. After a quick check to make sure everything was running, I went to bed. The next morning, I heard a loud fan noise coming out of the closet … the server had crashed again! 😮 Same symptoms, not good. I had to get to work so I just shut the server down and left. Monday evening, I started it up again and logged in via SSH (remember, it’s headless when it’s in the closet). As before, there was nothing in the logs to indicate a cause of the crash. The difference now, though, was that the system seemed to be crashing faster. After just a few minutes, it would be locked up like before the disk crash. So I pulled it out of the closet again and hooked it up to a keyboard and monitor so I could watch it crash first-hand … and it didn’t crash.

Google to the rescue! I found this post that seemed to describe my exact problem: a headless server running the latest 2.6.31.x kernel crashed after just 10 minutes. The cause? The screensaver! Or, rather, since the server isn’t running X windows, the screen blanker that turns off the monitor after 10 minutes. Apparently due to some bug in the interaction between the kernel and the basic video driver, when the system tried to turn off the monitor when no monitor was attached, the system would just hang. This is why the system was just fine as I worked on it all weekend, but then crashed as soon as I moved it back into the closet and it was headless again. I added the suggested init script for disabling the screen blanking to my system, rebooted, and waited. 11 minutes later, the server was still running. Eureka!
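For anyone curious what such a fix can look like, here is a rough Python sketch of one way to turn off console blanking. This is not the init script from the post I found. It just writes the Linux console escape sequences documented in console_codes(4) (`ESC [ 9 ; n ]` for the blank timeout and `ESC [ 14 ; n ]` for the VESA powerdown interval, both in minutes) with n set to 0. The function names and the default `/dev/tty1` path are my own assumptions.

```python
ESC = "\x1b"


def blanking_off_sequence() -> str:
    """Build the console escape sequences that switch off blanking."""
    # ESC [ 9 ; 0 ]  sets the screen blank timeout to 0 minutes (never blank).
    # ESC [ 14 ; 0 ] sets the VESA powerdown interval to 0 minutes (never).
    return f"{ESC}[9;0]{ESC}[14;0]"


def disable_blanking(console_path: str = "/dev/tty1") -> None:
    """Write the sequences to a console device (assumed to be /dev/tty1)."""
    with open(console_path, "w") as console:
        console.write(blanking_off_sequence())
```

Calling `disable_blanking()` as root at boot (from an init script or similar) should have roughly the same effect as running `setterm -blank 0 -powerdown 0` on that console. The setting does not survive a reboot, which is why it needs to run on every startup.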

The server has now been stable for a few days, so I finally seem to be past the problems. What bugs me now is the similarity of the two crashes. Obviously, according to my diagnostic tools, the original hard drive was failing in a bad way. Is it just a coincidence that the second video/screen blanking-related crash exhibited the same symptoms as the failing disk? That seems suspicious, but the server was running for a lot longer in between crashes, not failing after just 10 minutes of uptime. I was using a slightly different version of the 2.6.31 kernel on the old drive (I’m using 2.6.32 now) so maybe it was due to that different revision.

At any rate, after all of that, the server is finally back. In addition to giving me the “opportunity” to rebuild the server from scratch, this incident also helped to confirm that I’ve got a decent backup strategy in place. Sure, it took me a few days (instead of hours) to get everything back online, which wouldn’t be acceptable in a business situation, but at least I had zero data loss!
