Last summer, I had some internet connectivity problems. Specifically, I had massive latency issues that affected my conversations on Skype and my efforts at online gaming, which are relatively pathetic even under the best of circumstances. It was driving me up the wall and I couldn't figure it out. It hadn't occurred earlier with the same ISP, so I thought it was just a temporary issue with the network. However, the problem went on for weeks, at various hours of the day.
I contacted customer service at my ISP and was dismissed as being crazy. Their website's ping test tool showed me having a ping of around 40 ms and they couldn't see any problem on their end. According to them, the issues I was having were just with the remote sites. The fact that something like 30 different websites or services had this problem never fazed the tech support guy.
I was frustrated, but I couldn't blame them. As far as they could see, no problem existed. And it was one of those evil connection problems that only acts up some of the time on some of the packets. Even when I was having slowdowns, I could open a terminal and run ping google.com (or ping some other site) and it would come back with very reasonable times. Some of the time (2-3% of all packets), however, it would throw huge pings on the order of 300 to 700 ms. The tests that they were doing (send a couple of packets and take the mean) would never find the problem. I needed to collect a lot of pings over a reasonably long period of time to be sure of catching and characterizing the problem.
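To see why a handful of pings rarely catches this kind of problem, here is a rough back-of-the-envelope sketch in Python. It assumes each packet is slow independently with a probability of 2.5% (the middle of the 2-3% range above). The independence assumption, the function name and the sample sizes are mine, for illustration only.

```python
def chance_of_missing_all_spikes(spike_rate: float, n_pings: int) -> float:
    """Probability that none of n_pings independent pings hits a slow packet."""
    return (1.0 - spike_rate) ** n_pings


# Assumed spike rate: 2.5%, the midpoint of the 2-3% observed above.
SPIKE_RATE = 0.025

for n in (4, 10, 100, 1000):
    p_miss = chance_of_missing_all_spikes(SPIKE_RATE, n)
    print(f"{n:>5} pings: {p_miss:.1%} chance of seeing no slow packet at all")
```

Under these assumptions, a 10-ping test has roughly a 78% chance of seeing no slow packet at all, while 1000 pings make a complete miss essentially impossible.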
I had done some research, and it seems there is a transmission robustness option for DSL called interleaving that, more or less, queues up packets before sending them. This is known to increase latency. With traceroute, I was able to see that the problem appeared to be on my ISP's network and not on my LAN or the remote host. I grabbed 4 IPs off the traceroute output (my router, the first “hop” onto my ISP's network, the second hop on the ISP's network and the remote host, which is my Linode VPS).
A quick Google search pulled up an implementation of ping in Python. I wrote a small collection of scripts that use this implementation to repeatedly ping the selected IPs. I went a bit overboard and hit each IP 1000 times.
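The collection scripts are not reproduced here, but the loop was roughly of the following shape. This is a sketch, not the original code: `collect_pings`, the `ping_once` callback and the `timestamp` column are hypothetical, and the `targetIP` and `ping` column names simply mirror the names used in the R analysis. The callback stands in for whatever ping implementation you use; it should return the round-trip time in ms, or `None` on a timeout.

```python
import csv
import io
import time
from typing import Callable, Iterable, Optional, TextIO


def collect_pings(
    targets: Iterable[str],
    ping_once: Callable[[str], Optional[float]],
    out: TextIO,
    rounds: int = 1000,
    pause_s: float = 0.0,
) -> int:
    """Ping every target `rounds` times and write one CSV row per attempt.

    `ping_once` is a placeholder for a real ping function; it should return
    the round-trip time in milliseconds, or None if the ping timed out.
    Timeouts are written as "NA" so that R reads them as missing values.
    Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(["timestamp", "targetIP", "ping"])
    rows = 0
    targets = list(targets)
    for _ in range(rounds):
        # Interleave the targets so every hop is sampled over the same period.
        for ip in targets:
            rtt = ping_once(ip)
            writer.writerow([time.time(), ip, "NA" if rtt is None else f"{rtt:.3f}"])
            rows += 1
        if pause_s:
            time.sleep(pause_s)
    return rows


if __name__ == "__main__":
    # Stand-in ping function for demonstration only: it never touches the
    # network. 203.0.113.7 is a documentation-only placeholder address.
    def fake_ping(ip: str) -> Optional[float]:
        return 0.5 if ip == "192.168.1.1" else None

    buffer = io.StringIO()
    collect_pings(["192.168.1.1", "203.0.113.7"], fake_ping, buffer, rounds=2)
    print(buffer.getvalue())
```

Cycling through all targets in each round, rather than finishing one IP before starting the next, means an intermittent problem shows up at every hop during the same time window, which makes the hops easier to compare.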
I took the data collected from the ping test and loaded it into R. Sure enough, there was some funky stuff going on.
source("~/personalProjects/feelingPingy/importPingData.R")
pingTimes <- importPingData("~/personalProjects/feelingPingy/hops3.csv")
pingTimes$prettyTargetIP <- ifelse(pingTimes$targetIP == "192.168.1.1", "router",
    ifelse(pingTimes$targetIP == "22.214.171.124", "firstHop",
        ifelse(pingTimes$targetIP == "126.96.36.199", "secondHop", "remoteHost")))
# just an fyi for if you are doing this, the name is targetIP and is the
# xxx.xxx.xxx.xxx IP address by default
by(pingTimes$ping, pingTimes$prettyTargetIP, sd, na.rm = TRUE)
## pingTimes$prettyTargetIP: firstHop
##  14.41
## --------------------------------------------------------
## pingTimes$prettyTargetIP: remoteHost
##  8.255
## --------------------------------------------------------
## pingTimes$prettyTargetIP: router
##  0.5125
## --------------------------------------------------------
## pingTimes$prettyTargetIP: secondHop
##  16.98
We need the na.rm = TRUE flag because some of the attempts to ping the various IPs actually timed out (ping >= 2,000 ms). We can readily see that the variability increases as soon as you move off the LAN: the first hop (me to my ISP's network) has a standard deviation of 14.4 ms and the second hop has a standard deviation of 17.0 ms. Considering that a good ping to the targeted IP is probably under 50 ms, this isn't good news, especially since the spread goes up by so much as soon as the packets leave the LAN. The packets are screwed out of the gate, so to speak. Let's look at this visually.
library(ggplot2)
ggplot(pingTimes) + geom_density(aes(x = ping, color = prettyTargetIP))
We can see the greater variance on the remote IPs here. More striking is that the distribution of ping times to the remote host is clearly bimodal (green). This would suggest that there are two different processes generating this data. One gives a low ping, the other gives a higher ping. If we look at the two IPs tested between me and the targeted remote host, we see that the first hop seems to be giving the shape to both densities (the second hop's ping is the first hop's ping plus some marginal addition). However, they are all kind of hard to see because of the very high density for the pings on the LAN. Let's redo this looking only at the IPs that aren't local.
ggplot(pingTimes[pingTimes$prettyTargetIP != "router", ]) + geom_density(aes(x = ping, color = prettyTargetIP))
Now that looks better. We can really see the bimodal, almost trimodal, nature of the ping times at the remote host (in green). We can also see that this shape is clear in the densities for the first and second hops (on my ISP's network). Some packets leave right away, some wait a bit longer and some seem to wait forever to make the hop from my modem to the first remote node. We see this shape show up again in the second hop (since the second hop is an additive function of the first, this is expected). If the second hop was also slow or had multiple processes going on, or if the problem was at my VPS, the curves at each node would look different. The fairly constant shape suggests that there is a single rate-limiting step that determines the distribution of ping times.
The fact that the second and later hops all have the same shape as the first hop suggests that the rate-limiting step is the transfer of the packets from my modem to my ISP's network. And the problem was real. The reason it wasn't showing up in their simple tests (the mean of about 10 pings) is clear in the ECDF.
ggplot(pingTimes[pingTimes$prettyTargetIP != "router", ]) + stat_ecdf(aes(x = ping, color = prettyTargetIP))
Not all the packets were affected; in fact, nearly half left without excessive delay. However, 25% took over 40 ms merely to move from my modem to the ISP's network. Given that 40 ms is a very long time for a single hop (and over half of what I would expect for a round trip time), the impact I was seeing on Skype and other places was real. Armed with the new data*, I was able to get my connection moved from interleaving to fastpath.
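The figures read off an ECDF plot are just the fraction of samples at or below a threshold. A minimal Python sketch of that calculation follows. The `pings_ms` list is made-up illustrative data, not my measurements, and `ecdf_at` is a hypothetical helper name.

```python
from typing import Optional, Sequence


def ecdf_at(samples: Sequence[Optional[float]], threshold_ms: float) -> float:
    """Empirical CDF: fraction of non-missing samples <= threshold_ms.

    Timed-out pings (None) are dropped, mirroring na.rm = TRUE in R.
    """
    observed = [s for s in samples if s is not None]
    if not observed:
        raise ValueError("no non-missing samples")
    return sum(s <= threshold_ms for s in observed) / len(observed)


# Made-up first-hop ping times in ms (illustrative only), with one timeout.
pings_ms = [9.8, 10.1, 10.4, 11.0, 24.9, 25.3, 26.0, 41.2, 55.7, None]

share_slow = 1.0 - ecdf_at(pings_ms, 40.0)
print(f"{share_slow:.0%} of pings took longer than 40 ms")
```

Unlike a mean over a few packets, this kind of tail fraction stays visible even when most packets are fast, which is why it makes a better summary for an intermittent problem.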
I figured I would post this simple analysis and the tools I used in case they can ever help anyone in the future. I can't be the only one who has ever had this type of problem!
*If you ever have internet connection issues that aren't being fixed by over-the-phone support or even tech visits, going to the social media teams (see your ISP's Facebook page or check them out on DSLReports) typically brings faster and better results. Once I knew that the problem was real, it took 2 emails and 18 hours to get it fixed via my ISP's social media support people. This was after 2-3 weeks of dealing with phone support and getting nowhere.