Sunday, November 4, 2007

aprs.fi down due to storage problems

Last night the EMC storage system (a half-rack full of hard disks, connected to the database server with Fibre Channel) where the aprs.fi database is stored returned some corrupted data, and the ZFS filesystem detected checksum errors. As a result, the database is now down, and I'm restoring a backup copy. This takes some time, as the database is rather large. The service should be back up later today.

This sucks. Seems I really need to set up online data replication to another database machine. As long as the database is down, no APRS-IS data is stored.

Sorry for the trouble!

10 comments:

Anonymous said...

hi
here is f4fdk phil

I know a ham who lives in Bayonne (southwest of France); he receives AIS at his place. We would like to send that to the net.
Can you tell us what to do?

friendly
my email
f4fdk09@gmail.com

bye

David said...

Have patience oh7lzb. We are all very grateful for what you are doing for us.

Good luck.

David
VE7EFQ

Matthew said...

Half rack of disks?

Do you have a technical description of what it takes to run aprs.he.fi anywhere?

People 'in the business' who are familiar with the EMC product line would like to know what you have, what you require, and what it would take to help you out.

Cheers!

Matt - N0GIK

oh7lzb said...

Hi Matt,

No, there's not much info online yet. The storage system is a CLARiiON CX600, but there are other services running on it too.

aprs.fi only requires about 4 SATA disks worth of I/O bandwidth in RAID10 configuration for the main database server. The database currently only takes some 30 GB (not counting backups or incremental snapshot space!), but it's growing, and I'd like to add more features which require more I/O capacity.

The database server is a Sun E4500 with 6*400 MHz CPUs, 6 GB RAM, running Solaris 10, and it's got other services running on it too. The web frontend together with some local static databases (like geonames) were moved to another Linux server recently, because the CPUs and memory were becoming a bottleneck on the Sun server.

After some investigation, it seems that the current problem is possibly not with the EMC: it reports no errors at all, and neither does PowerPath (the EMC Fibre Channel drivers which provide redundancy over dual fiber links between the server and the storage system). ZFS (the file system used for the database and its snapshots) reports checksum errors within the database file, so parts of it are not readable. The data might have been corrupted within the EMC, on the fiber links, or somewhere inside the operating system / ZFS.
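
To illustrate why a checksumming file system can flag damage that the storage array itself never reports, here is a minimal Python sketch of per-block checksum verification. It is only an illustration of the general idea, not ZFS's actual on-disk format or algorithm: the block size, the use of SHA-256, and the function names `checksum_blocks` and `find_corrupt_blocks` are all assumptions made for the example.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this example


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one SHA-256 hex digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data: bytes, expected: List[str],
                        block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the indexes of blocks whose checksum differs from the stored one."""
    actual = checksum_blocks(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

The checksums are recorded when the data is written and compared again when it is read back, so a block that changed anywhere between the two points (disk, cable, controller, or memory) shows up as a mismatch even if every component along the way reports success.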

I'm currently dumping the database in parts, figuring out how much of the up-to-date data can be recovered. The corruption seems to be in some older database blocks, so the ZFS snapshots taken during the last few days (being the most recent "backups") are unusable.
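
The general idea of dumping in parts can be sketched at the file level: read the source in fixed-size chunks, keep whatever reads cleanly, and note the ranges that fail. This is a hypothetical Python sketch and not the procedure actually used here (the real dump may well have been done through the database itself); the function name `salvage`, the chunk size, and the zero-filling of unreadable ranges are assumptions for the example.

```python
from typing import BinaryIO, List, Tuple


def salvage(src: BinaryIO, dst: BinaryIO, size: int,
            chunk_size: int = 128 * 1024) -> List[Tuple[int, int]]:
    """Copy size bytes from src to dst chunk by chunk.

    Chunks that raise OSError (or come back short) are replaced with
    zero bytes in dst. Returns a list of (offset, length) pairs for
    every chunk that could not be read.
    """
    bad: List[Tuple[int, int]] = []
    offset = 0
    while offset < size:
        length = min(chunk_size, size - offset)
        try:
            src.seek(offset)
            data = src.read(length)
            if len(data) != length:
                raise OSError("short read")
        except OSError:
            data = b"\0" * length  # placeholder so later offsets stay aligned
            bad.append((offset, length))
        dst.write(data)
        offset += length
    return bad
```

The returned list of bad ranges shows how much of the file is affected and where, which helps in deciding whether the damage falls in old or recent data.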

I've been thinking about setting up a database replica or two, so that I'd have a hot backup in case something like this happens. The software would support that. I just need the hardware for it: something like a current dual- or quad-core rack PC with RAID10 over 4 disks and 4 GB of RAM. I guess something slower would do for just running the write-only replica, just to get an on-line backup... 1 core, 1 GB of RAM, RAID1 over 2 disks.

oh7lzb said...

I managed to dump all of the location history database, which is good. Some data from the weather history, and today's collected data will be lost, which is not too bad.

Anonymous said...

No worries,
The site you have is great, keep up the good work.

de N5DRG

Anonymous said...

This comment has been removed by a blog administrator.

KC2EQ said...

Monumental task at the least. A large thanks
for all the work from us "users"

KC2EQ Elmira, NY

gyro said...

This comment has been removed by the author.

gyro said...

Keep up the good work. I also use Sun at work, so I can understand the frustrations.
There are a lot of people thankful to you for the work you do... me among them.
de AC6VW