Saturday, March 27, 2010

Slowdown on Friday 26th

Yesterday the APRS feed was slow for about an hour, between 15:00:55 and 16:11:03 UTC. One of the WXQA servers the service connects to was down, and the connection attempts timed out. Three things combined to slow down packet processing: the timeout was too long, the connection retry timer was too short, and the connect() attempt is a blocking call, so every failed attempt stalled everything else until it timed out. I knew about the potential problem, but hadn't bothered to fix it until now.

In the evening I implemented a 2-second connect timeout and an exponential backoff for the retry timer. The first reconnect attempts happen within seconds, and the interval then grows to about 2 minutes between retries. Using a non-blocking connect() would have been the correct fix, but this was a bit quicker. The problem should not appear again in this form.
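For illustration, the idea can be sketched roughly as follows in Python. This is not the service's actual code: the names (Reconnector, next_delay), the 1-second initial delay and the doubling factor are assumptions made for the example. Only the 2-second connect timeout and the roughly 2-minute ceiling come from the description above.

```python
import socket

CONNECT_TIMEOUT = 2.0   # seconds, as described in the post
INITIAL_DELAY = 1.0     # seconds; assumed starting value
MAX_DELAY = 120.0       # seconds; "about 2 minutes" ceiling


def next_delay(current):
    """Double the retry delay, capped at MAX_DELAY (doubling is an assumption)."""
    return min(current * 2, MAX_DELAY)


class Reconnector:
    """Hypothetical helper that decides when a reconnect may be attempted."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.delay = INITIAL_DELAY
        self.next_attempt = 0.0  # earliest time the next attempt is allowed

    def try_connect(self, now):
        """Return a connected socket, or None if it is too early or the attempt failed."""
        if now < self.next_attempt:
            return None  # still backing off, do not block the main loop
        try:
            # Still a blocking call, but it gives up after CONNECT_TIMEOUT seconds.
            sock = socket.create_connection(
                (self.host, self.port), timeout=CONNECT_TIMEOUT
            )
        except OSError:
            # Failed or timed out: schedule the next attempt further away.
            self.next_attempt = now + self.delay
            self.delay = next_delay(self.delay)
            return None
        self.delay = INITIAL_DELAY  # success resets the backoff
        return sock
```

The main loop would call try_connect(time.monotonic()) on each pass. With these assumed values the wait between failed attempts grows 1, 2, 4, ... seconds and levels off at 120 seconds, so a dead server costs at most 2 seconds of blocking every 2 minutes.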

It seems that no APRS data was lost or missed: it was collected in a buffer and processed once the connect attempts started working again. The following graph gives some idea of the relative changes in processing rate. At peak, about 10 megabytes of data was in the buffer.
