Rain City Story

9 Jan 2006

Painful Server Woes

[Image: uptime graph]
You can see my full system performance graphs at http://www.capnstats.com/stats/

Ok, as if there is another kind of server woe...

Anyway, I apologize to all of my friends and customers for the recent outages and pledge that I'm doing everything within my power to fix the problems. Saturday night's incident was completely unacceptable and one of those "perfect storms" I have frequent nightmares about. Around 3:15 pm on Saturday 1/7, my server went offline. It went down so fast that syslogd and a maintenance script I have running couldn't send the "call for help" text message to my cell phone.

To make matters worse, my cell phone had died sometime early Saturday morning, so I also didn't receive the pages from my two third-party monitoring services (Alertra and websitepulse). And to compound the problem further, I was out shopping and away from the computer from about 9 am to 11:30 pm on Saturday (which is damn near a world record for me).

I wasn't aware of any problem until I checked my e-mail and had several hundred messages from my monitoring services and a message from Michael Hanscom of Eclecticism.

Also, my datacenter lost its internet connectivity, making it impossible for me to open a trouble ticket or reboot my server remotely (via a Cyclades unit). When connectivity was restored around 11:30 pm, I rebooted my server and attempted the autopsy. While I was copying the log files to a remote server I manage in Ashburn, VA, the server went down again, without logging anything, of course. This incident required console access by the datacenter.

Both the datacenter and I think it's a hardware issue, but we can't find the culprit. There are no memory errors, I/O errors or load errors (that would cause the CPU to overheat), so I still have absolutely no idea.

Please take comfort in the fact that your data and website structure are very safe. In addition to having full, off-site backups, I rsync your data to two locations every hour and have been since fall of 2004.
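For the curious, here is a rough sketch of what an hourly rsync to two locations could look like. It is only an illustration, not the actual script: the source path, the two destination hosts, and the choice of flags are all hypothetical, and the function names are made up for this example.

```python
import shlex
import subprocess

# Hypothetical values for illustration only; the real paths and hosts are not
# given in this post.
SOURCE_DIR = "/var/www/"
DESTINATIONS = [
    "backup@mirror-one.example.com:/backups/www/",
    "backup@mirror-two.example.com:/backups/www/",
]


def build_rsync_commands(source, destinations):
    """Return one rsync argument list per destination."""
    # -a preserves permissions, times and symlinks; -z compresses in transit;
    # --delete removes files on the mirror that no longer exist at the source.
    return [["rsync", "-az", "--delete", source, dest] for dest in destinations]


def sync_all(source, destinations, dry_run=True):
    """Run each rsync (or only print it when dry_run is set).

    Returns the list of destinations whose rsync exited with a non-zero status.
    """
    failed = []
    for cmd in build_rsync_commands(source, destinations):
        if dry_run:
            print(shlex.join(cmd))
            continue
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[-1])
    return failed


if __name__ == "__main__":
    # Dry run by default so the sketch is safe to execute anywhere.
    sync_all(SOURCE_DIR, DESTINATIONS, dry_run=True)
```

Scheduling something like this from cron at the top of every hour (a crontab time field of `0 * * * *`) would give the hourly cadence. Returning the failed destinations makes it easy to raise an alert when one of the two mirrors falls behind.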

Please bear with me during these trying times and take a peek at my server stats at http://www.capnstats.com/stats/.

Thanks for your patience, everyone,
Michael

Comments (1) Trackbacks (0)
  1. No worries, Michael. Normally you’re right on top of things, and the only reason I bothered with the e-mail was because I wasn’t sure if you’d gotten alerts or not. No complaints here, I’ve certainly taken my own servers down longer than that from time to time!