Monday, September 27, 2010

Outage on Sunday (502 Bad Gateway)

One of the two web servers got stuck on Sunday afternoon (around 13:00 UTC on 2010-09-26) and started giving out 502 Bad Gateway responses. About half of the web requests for the site ended up on the stuck server, so sometimes you got the correct response page and sometimes you didn't. For some reason the automatic failover did not work. In exactly this sort of situation it should magically disable the stuck server and point all requests to the working one.

I happened to be driving back from the countryside when the problem appeared. After getting home in the evening and analyzing the situation, I disconnected the stuck server from the network remotely by shutting down its switch port, so that it would not interfere and the other web server could handle all of the site visitors. At this point, when the server disappeared completely from the network, the other server automatically took over its IP address and the service started working perfectly.

The next day, 15 hours after the disconnection, I went to visit the hosting site to bring the server back online, and took a couple of iPhone video clips of the servers. First, the primary server (at the moment running the whole site and serving all visitors), and then the secondary server, which had just been brought up and was receiving the past 15 hours' worth of APRS data from the primary server. It was pretty busy at the time, but it took only an hour to copy over all the missing data from the primary.

After the replay was completed I started up the web service on the second server, too, and the service is now again running in a redundant configuration.

Root cause analysis continues. It was definitely a software glitch, probably something to do with me running a system process under gdb to analyze a SIGSEGV it gets every now and then. It might well have caused some services to freeze. The database didn't hang and continued to replicate until the disconnection, but the web service got stuck and I couldn't log in to the box using ssh, or through the serial console.

I'll also have to fix the health check used to trigger the automatic failover procedure, so that it will work next time this sort of problem appears.
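
For illustration only, a health check along these lines might probe each web server over HTTP and count a 502 response or a timeout as a failure. This is a minimal Python sketch, not the check actually used on the site: the names fetch_status, is_healthy and FailoverMonitor, the timeout and the failure threshold are all assumptions.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no usable response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 502.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return None


def is_healthy(status):
    """Treat only 2xx and 3xx answers as healthy; 502 and no answer are failures."""
    return status is not None and 200 <= status < 400


class FailoverMonitor:
    """Counts consecutive failed probes for one backend (hypothetical helper)."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed value, tune to taste
        self.failures = 0

    def record(self, status):
        """Feed one probe result; return True when the backend should be disabled."""
        if is_healthy(status):
            self.failures = 0
        else:
            self.failures += 1
        return self.failures >= self.threshold
```

The point of checking the actual HTTP response rather than just pinging the box is that a server can stay reachable on the network while its web service is stuck, which is the situation described above. Requiring a few consecutive failures before disabling a server avoids flapping on a single slow request.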

The trip to the computer room gave me an excuse to quickly try out Apple's iMovie for composing a little video. Appears to work, but I guess I still prefer Vegas. There isn't much point in this silly video, but maybe it's interesting for some fans. :)


jvdb said...

Nice work, thanks for!

(vintage Cobalt Raq 2 cameo spotted above the server?)

73s, ON3ASM

Hessu said...

(Good spotting, correct identification points earned!)