Tuesday, November 27, 2007

Lessons in network latency

At work we recently brought a new site online and connected it to HQ via our MPLS WAN. Almost immediately we began to notice odd behavior from this site related to the WAN connection. At times we were seeing ping times from our HQ to the site's router go over 1 second!

The serial interface at the remote site was clean and there were no errors on any interfaces. The problem appeared to be with the carrier. I did the standard routine: placed a trouble ticket and requested intrusive testing to see if they could detect any problems. The results: no problems detected. Yet we were still experiencing severe latency to this site, so what was going on? I decided that we needed to collect some hard numbers to see what was happening; simple ping output was not going to cut it for troubleshooting this problem. (I should mention that we have had 20 other sites on the same MPLS network for over 3 years, and none of them has ever reported a problem this severe.)

Two great open source tools helped track and visualize the problem:
  • Cacti (a PHP web front-end for RRDTool)
  • Smokeping (a deluxe latency measurement tool, written by the author of RRDTool)
I had already been using Cacti to monitor various system and network SNMP counters on much of our infrastructure, so I had a head start on seeing how much data was going into and out of our troubled remote site.
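
For anyone curious what those Cacti graphs boil down to, here is a minimal sketch of the underlying polling: read the IF-MIB inbound octet counter twice and convert the delta to bits per second. It assumes Net-SNMP's snmpget is installed and that you know the router's read community and the serial interface's ifIndex; the hostname, community string, and index below are placeholders, not our real values.

    #!/usr/bin/env python
    # Rough sketch of what the Cacti polling boils down to: read the IF-MIB
    # ifInOctets counter twice, take the delta, and convert it to bits/sec.
    # Assumes Net-SNMP's snmpget is installed; the hostname, community string
    # and ifIndex below are placeholders.
    import subprocess
    import time

    HOST = "remote-router.example.com"  # hypothetical router name
    COMMUNITY = "public"                # read-only community string
    IF_INDEX = 1                        # ifIndex of the serial interface
    OID = "1.3.6.1.2.1.2.2.1.10.%d" % IF_INDEX  # IF-MIB::ifInOctets

    def get_in_octets():
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, OID],
            universal_newlines=True)
        return int(out.strip())

    INTERVAL = 60  # seconds between samples, similar to a Cacti poll cycle
    first = get_in_octets()
    time.sleep(INTERVAL)
    second = get_in_octets()

    delta = second - first
    if delta >= 0:  # ignore 32-bit counter wraps in this sketch
        bps = delta * 8 / float(INTERVAL)
        print("inbound: %.0f bit/s (%.1f%% of a 1.544 Mbit T1)"
              % (bps, 100.0 * bps / 1544000))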

Smokeping was a tool I had used years ago but had not set up recently. The installation is simple and detailed on the Smokeping web site. A configuration was quickly thrown together to collect and plot the latency for all of our MPLS connected sites. A few multi-host graphs were built to compare the troubled site with other sites that were 'newer' and with sites that were geographically farthest from HQ. Now we wait for the problem...

We did not have to wait long. Once the problem began to occur I took a look at the Cacti graphs for the serial interface usage on the remote router (it was pegged at 1.5 Mbit inbound) and cross referenced that with our Smokeping latency graphs. Bingo! When utilization spiked, so did the latency. Well, this is no big surprise! The problem with this conclusion is that all 20 of our other locations also experience periods of circuit saturation but do not experience high latency during those times.
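
The graphs made the correlation obvious, but if you would rather eyeball raw numbers, the same cross-reference can be done by hand. The sketch below assumes each RRD has been dumped to a simple timestamp,value CSV (for example by post-processing rrdtool fetch output) and that both series use the same polling step; the file names and the 90% "busy" threshold are arbitrary.

    # Hand-rolled version of the graph cross-reference: given two CSV files of
    # unix_timestamp,value samples -- one for inbound bit/s on the serial
    # interface, one for ping RTT in ms -- report the latency seen while the
    # circuit was nearly full. Assumes both series use the same polling step
    # so the timestamps line up; file names and threshold are arbitrary.
    import csv

    T1_BPS = 1544000.0
    BUSY = 0.90  # treat > 90% of line rate as "saturated"

    def load(path):
        """Read timestamp,value rows into a dict, skipping blanks and NaNs."""
        samples = {}
        with open(path) as f:
            for row in csv.reader(f):
                if len(row) == 2 and row[1] not in ("", "nan", "NaN", "U"):
                    samples[int(float(row[0]))] = float(row[1])
        return samples

    util = load("serial0_in_bps.csv")     # hypothetical dump of the Cacti traffic RRD
    rtt = load("remote_site_rtt_ms.csv")  # hypothetical dump of the Smokeping RRD

    busy = [rtt[ts] for ts, bps in util.items() if ts in rtt and bps / T1_BPS > BUSY]
    idle = [rtt[ts] for ts, bps in util.items() if ts in rtt and bps / T1_BPS <= BUSY]

    for label, samples in (("circuit busy", busy), ("circuit idle", idle)):
        if samples:
            print("%s: %d samples, avg %.0f ms, max %.0f ms"
                  % (label, len(samples), sum(samples) / len(samples), max(samples)))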

To test the theory that high circuit utilization == high latency, I needed to perform some additional tests. Another open source tool to the rescue... Netperf. Netperf is a nice little application that lets you send various traffic patterns from a Netperf client to a Netperf server and provides very useful statistics. I was only interested in generating a consistent pattern of traffic from our HQ site to any one of our MPLS connected remote sites to see if it also experienced high latency during high circuit utilization. I ran tests against numerous sites at different times of the day and was able to max out the remote circuits in the inbound direction from our HQ. However, all the other sites experienced only minor bumps in latency during these high utilization periods. For example, a site with an average ICMP echo response time of 40 ms would jump to 60 or 80 ms, max. The problem site would jump from 30 - 40 ms to 1000 - 1500 ms!
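
For reference, the saturate-and-measure test can be boiled down to a short script like the one below. It is only a sketch: it assumes netperf is installed on an HQ host, netserver is listening on a machine behind the remote router, and plain ping is available; the two hostnames and the 60 second duration are placeholders rather than our actual setup.

    # Sketch of the saturate-and-measure test: push a bulk TCP stream at the
    # remote site with netperf while pinging its router, then compare the RTTs
    # against a baseline taken beforehand. Assumes netperf is installed here,
    # netserver is listening on a host behind the remote router, and plain
    # ping is available; hostnames and the duration are placeholders.
    import re
    import subprocess

    REMOTE_HOST = "remote-server.example.com"    # hypothetical netserver host
    REMOTE_ROUTER = "remote-router.example.com"  # hypothetical ping target
    DURATION = 60                                # seconds of generated load

    def ping_rtts(host, count):
        """Return the individual ICMP RTTs (in ms) reported by ping."""
        out = subprocess.check_output(["ping", "-c", str(count), host],
                                      universal_newlines=True)
        return [float(m) for m in re.findall(r"time=([\d.]+)", out)]

    baseline = ping_rtts(REMOTE_ROUTER, 10)

    # Saturate the circuit toward the remote site with a bulk TCP transfer.
    load = subprocess.Popen(["netperf", "-H", REMOTE_HOST,
                             "-l", str(DURATION), "-t", "TCP_STREAM"],
                            stdout=subprocess.PIPE)
    loaded = ping_rtts(REMOTE_ROUTER, 30)
    load.communicate()

    print("baseline:   avg %.0f ms, max %.0f ms"
          % (sum(baseline) / len(baseline), max(baseline)))
    print("under load: avg %.0f ms, max %.0f ms"
          % (sum(loaded) / len(loaded), max(loaded)))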

At this point I was convinced that we either had a configuration or hardware problem on the remote site's router, or a severe problem in the carrier's internal MPLS network. The network engineer who deployed the network at the remote site pored over the configuration, and a case was opened with Cisco TAC (we use all Cisco gear). No configuration or hardware problems were discovered. Now I had the unfortunate privilege of trying to convince the very large carrier that the problem might exist in their network.

To cut this very long and painful story short, it took almost a week and a half of 'working' with the carrier to get them to recognize that we had a severe discrepancy between how this site's circuit behaved under high load (show interface reported a load of 253/255, essentially 99% utilization) and how all of our other sites behaved. Once they recognized the problem and the ticket had been escalated many times over, a backend engineer got on the phone during a conference call and cleared the issue up in a matter of minutes... What was the problem, you ask? A problem with the routing in the cloud? A faulty card in the access router? Nope. The problem was that the hardware in the carrier's access router had recently been upgraded from a T3 to an OC48. That is a significant increase in hardware and bandwidth capacity. So why would we see poor performance on better hardware...?

The problem is that the bottleneck for the network had just been moved from the carrier over to us. The carrier is now able to get our packets from HQ (connected via a T3) to the remote site's access router faster than ever before, but we are basically taking the volume of Niagara Falls and trying to pipe it into a straw for general consumption. In researching this a little more, it was also discovered that our other locations were still riding on older access router hardware, which is why we could not reproduce the problem there.
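
To put some rough numbers on that straw: the sketch below is back-of-the-envelope arithmetic, not measured data. The link rates are the standard T3 and T1 figures, the quarter-second burst is an arbitrary example, and 1.5 seconds is simply the ping time we observed. It just shows how quickly traffic arriving at T3 speed stacks up in front of a T1, and that latency in the 1.5 second range is roughly what you would expect if a few hundred kilobytes end up queued at the access router ahead of each ping.

    # Back-of-the-envelope only: why faster carrier hardware can make latency
    # worse. Traffic now reaches the access router at up to T3 speed and has
    # to drain through the remote T1, so anything queued there shows up
    # directly as delay. Link rates are standard figures; the burst length is
    # an arbitrary example and 1.5 s is the ping time we actually observed.
    T3_BPS = 45000000.0   # HQ uplink (~45 Mbit/s)
    T1_BPS = 1544000.0    # remote site's circuit (~1.5 Mbit/s)

    # A short burst arriving at T3 speed...
    burst_seconds = 0.25
    burst_bits = T3_BPS * burst_seconds

    # ...takes much longer than that to drain back out over the T1.
    drain_seconds = burst_bits / T1_BPS
    print("a %.2f s burst at T3 speed needs %.1f s to drain over the T1"
          % (burst_seconds, drain_seconds))

    # Working backwards from the ~1.5 s ping times: that is roughly the delay
    # a queue of this depth in front of the T1 would add to every packet.
    observed_delay = 1.5
    queue_bytes = observed_delay * T1_BPS / 8
    print("~%.1f s of delay implies roughly %.0f KB queued ahead of each ping"
          % (observed_delay, queue_bytes / 1024))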