Ping Really Does Pong

Is ping a good way to assess the performance and reliability of a wireless network? The short answer is no, but I don’t like short answers.

For a little background, I was informed of an issue at work where a colleague was seeing high latency spikes in ping responses every 70 seconds, like a heartbeat. The RTT would spike from 10-15ms to a whopping 700ms for a single ping, drop to around 100ms for the next ping, and then return to 10-15ms.

A bit like this:

Reply from host: bytes=32 time=9ms TTL=64
Reply from host: bytes=32 time=10ms TTL=64
Reply from host: bytes=32 time=11ms TTL=64
Reply from host: bytes=32 time=11ms TTL=64
Reply from host: bytes=32 time=9ms TTL=64
Reply from host: bytes=32 time=714ms TTL=64
Reply from host: bytes=32 time=102ms TTL=64
Reply from host: bytes=32 time=10ms TTL=64
Reply from host: bytes=32 time=9ms TTL=64
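If you want to pick spikes like these out of a longer ping log, a few lines of Python will do it. This is only a rough sketch: it assumes the Windows "Reply from" output format shown above, and the function names and the "5x the median" threshold are my own arbitrary choices for illustration.

```python
import re
from statistics import median

# The ping output quoted above, in Windows "Reply from" format.
SAMPLE = """\
Reply from host: bytes=32 time=9ms TTL=64
Reply from host: bytes=32 time=10ms TTL=64
Reply from host: bytes=32 time=11ms TTL=64
Reply from host: bytes=32 time=11ms TTL=64
Reply from host: bytes=32 time=9ms TTL=64
Reply from host: bytes=32 time=714ms TTL=64
Reply from host: bytes=32 time=102ms TTL=64
Reply from host: bytes=32 time=10ms TTL=64
Reply from host: bytes=32 time=9ms TTL=64
"""

# Windows prints "time=9ms", or "time<1ms" for sub-millisecond replies.
TIME_RE = re.compile(r"time[=<](\d+)ms")


def parse_rtts(text):
    """Extract round-trip times in milliseconds, in order of appearance."""
    return [int(m.group(1)) for m in TIME_RE.finditer(text)]


def find_spikes(rtts, factor=5):
    """Return (index, rtt) pairs where the RTT exceeds factor x the median.

    The factor of 5 is an arbitrary threshold chosen for illustration.
    """
    baseline = median(rtts)
    return [(i, rtt) for i, rtt in enumerate(rtts) if rtt > factor * baseline]


rtts = parse_rtts(SAMPLE)
spikes = find_spikes(rtts)
print("baseline (median) RTT:", median(rtts), "ms")
print("spikes:", spikes)
```

On the sample above this flags the 714ms and 102ms replies. Using the median as the baseline means the spikes themselves don't drag the "normal" value upwards.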

“It’s just wireless,” I explained. “That’s how it works: it’s a shared medium and half duplex, so it has to play the game.”
That explanation wasn’t enough, and the more I thought about it, the odder it seemed.
The statements I made were true (wireless is a shared, half-duplex medium, as I’m sure you know), but they couldn’t quite account for the recurring pattern of two latency spikes.

So, I got to work. I suspected the cause was off-channel scanning by either the client or the access point, or perhaps an alignment of the stars where the ping traffic coincided perfectly with a beacon or similar management frame. But first I had to recreate the issue.

So, I cracked out a Windows 10 laptop and the pinging commenced, using a lab environment with the same access points and code versions as production. Like clockwork, there was a spike around every 60-70 seconds. What’s more, every time I clicked on the list of wireless networks, I saw a very similar spike:

[Animation: ping latency spiking when the list of wireless networks is opened]

So what’s actually going on?

The ultimate cause of the regular spikes is off-channel scanning. It works a little bit like this:

  • Your wireless client (and your access point) can only be on one channel at a time.
  • To understand what other SSIDs, BSSIDs, etc. are available, every so often (or on demand) your client will take its NIC off channel and send probe requests on the other channels to build up its understanding of the world around it.
  • Whilst the client is off channel, it can’t send or receive traffic. This is the reason you see the spike in latency.

This is one of the many reasons ping shouldn’t be used to assess the performance and reliability of wireless. The latency spike we saw wasn’t really a spike at all: it was a combination of ‘dead time’ (where the client was not actually in communication with the wireless LAN) plus the usual RTT. To your average Joe, though, it does appear to be a problem. While the client is off channel the packets queue, and they get sent as soon as the client is back on the channel and the medium allows.
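To make the "dead time plus usual RTT" idea concrete, here is a toy model in Python. It is a sketch under some loud assumptions, not a measurement: the 700ms off-channel time, the scan starting exactly as a ping is sent, and all of the names are hypothetical values I picked so the output resembles the pattern above. A ping sent while the client is off channel simply waits until the client returns, and then takes the normal RTT.

```python
BASE_RTT_MS = 10          # typical on-channel RTT, as seen in the ping output
SCAN_PERIOD_MS = 70_000   # the client scans roughly every 70 seconds
DEAD_TIME_MS = 700        # ASSUMED off-channel time per scan
PING_INTERVAL_MS = 1_000  # one ping per second
TOTAL_MS = 150_000        # simulate two and a half minutes


def dead_windows(first_start_ms, period_ms, duration_ms, total_ms):
    """List the (start, end) intervals during which the client is off channel."""
    return [(s, s + duration_ms) for s in range(first_start_ms, total_ms, period_ms)]


def observed_rtt(send_ms, windows, base_rtt_ms=BASE_RTT_MS):
    """RTT that ping would report for a request sent at send_ms.

    If the request is sent while the client is off channel, it queues until
    the window ends and then takes the normal RTT on top.
    """
    for start, end in windows:
        if start <= send_ms < end:
            return (end - send_ms) + base_rtt_ms
    return base_rtt_ms


# ASSUMPTION: the first scan starts at t=5s, exactly when a ping is sent.
windows = dead_windows(5_000, SCAN_PERIOD_MS, DEAD_TIME_MS, TOTAL_MS)
results = [(t, observed_rtt(t, windows)) for t in range(0, TOTAL_MS, PING_INTERVAL_MS)]
spikes = [(t, rtt) for t, rtt in results if rtt > BASE_RTT_MS]

for t, rtt in spikes:
    print(f"ping sent at {t / 1000:.0f}s reported {rtt}ms")
```

Nothing in this model is slow: the link always answers in 10ms. The only reason any ping reports roughly 700ms is that it spent most of that time waiting for the client to come back on channel.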

This behaviour is really difficult to capture, for pretty much the same reason it happens: when running a packet capture you can only capture one channel at a time (unless you have a dedicated NIC for each channel). Whilst writing this blog I had a single NIC on a MacBook capturing constantly, and it missed most probes. The NIC will dwell on a channel for only a short period of time before moving on, and unless the stars align, you’ll miss most of it.
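Here is a rough Python sketch of why a hopping capture NIC misses so much. All of the timings are made up for illustration (a client probing 11 channels at 20ms intervals, a capture NIC dwelling 250ms per channel in round-robin order), and the function names are hypothetical. A probe is only "captured" if some capture NIC happens to be tuned to the same channel at that moment.

```python
def sniffer_channel(t_ms, channels, dwell_ms, offset=0):
    """Channel a round-robin hopping capture NIC is tuned to at time t_ms.

    offset shifts the NIC's starting position in the channel list.
    """
    return channels[(t_ms // dwell_ms + offset) % len(channels)]


def captured_probes(probes, channels, dwell_ms, nic_offsets):
    """Return the (time, channel) probes seen by at least one capture NIC."""
    return [
        (t, ch)
        for t, ch in probes
        if any(sniffer_channel(t, channels, dwell_ms, off) == ch for off in nic_offsets)
    ]


# ASSUMED values, chosen only for illustration.
channels = list(range(1, 12))   # 2.4 GHz channels 1-11
probe_spacing_ms = 20           # client moves to the next channel every 20ms
dwell_ms = 250                  # capture NIC stays 250ms on each channel

# The client probes channel 1 at t=0, channel 2 at t=20ms, and so on.
probes = [(i * probe_spacing_ms, ch) for i, ch in enumerate(channels)]

one_nic = captured_probes(probes, channels, dwell_ms, nic_offsets=[0])
three_nics = captured_probes(probes, channels, dwell_ms, nic_offsets=[0, 3, 6])

print(f"1 NIC:  {len(one_nic)}/{len(probes)} probes captured: {one_nic}")
print(f"3 NICs: {len(three_nics)}/{len(probes)} probes captured: {three_nics}")
```

With these numbers the whole scan is over before the single capture NIC has left its first channel, so it sees one probe out of eleven. Three NICs started on different channels see three. A NIC per channel is the only way to see them all.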

[Screenshot: packet capture from the single MacBook NIC]

Using a WLAN Pi you can capture some of this too, but again, it’s quite difficult. If you throw more NICs at it, you will have a better chance. In this example, using Kismet on the WLAN Pi, we only captured a few probes on channel 1.

[Screenshot: Kismet on the WLAN Pi showing a few probes captured on channel 1]

The WLAN Pi does offer a good animation of the channel hopping happening:

[Animation: channel hopping on the WLAN Pi]

The test device in question was a Microsoft Surface Pro 2017. It probed approximately every 70 seconds whilst stationary and, for some reason, it probed twice per channel. Update: Andrew McHale (Mac-WiFi.com) kindly explained the two-probe scenario: probes are one of the few types of wireless frames that don’t get acknowledged, so two probes are sent to help ensure delivery.

The other important concept is that as a device’s signal deteriorates and it approaches the point where it needs to roam, it will probe more aggressively. Unfortunately, every device uses a different algorithm, but from the testing I have carried out with Windows 10, the probing interval seems to be around the 70 second mark whilst standing still.

As I mentioned, access points also go off channel for things like RRM, wIPS and rogue detection. With Cisco, for example, many APs follow a pattern similar to the one shown at the source below:

Source: https://mrncciew.com/2013/03/16/configuring-rrm/

All in all, ping is good at testing connectivity and at assessing wired performance, where you have a dedicated full-duplex medium. Once you use it with wireless, though, you’ll run into lots of issues!

I will try to throw some more NICs at a packet capture to get a fuller picture of the conversation as soon as I can.

Update:

As promised, I’ve added a few more NICs and completed a capture. Here you can see the probes more clearly, across more channels:

[Screenshot: multi-NIC capture showing probes across more channels]

6 thoughts on “Ping Really Does Pong”

  1. In addition to the great information you have included in this blog: if you are mobile or in a volatile state while pinging, you’ll also have potential changes in MCS. Modulation, coding, channel width and guard interval may all change. What about when you send Data-NACK, Data-NACK, Data-NACK… and then the algorithm chooses to change to try to find a better combination of MCS? All of those failures are missed by PING.

    That’s why we need to be using tools to capture retry rates, CRC rates, average MCS, average data rates, etc., to truly see what is going on in a WLAN. PING just masks all of those.

    1. Thanks Keith, some great points as usual. I’m actually going to do a follow-up to this blog, as it took longer than expected so I couldn’t get into any real detail!
