Saturday, August 10, 2013

Bro Knows Packet Loss

First things first, I'm not trying to steal anyone's thunder.  The blog post at the following link is both older and more thorough:
Flushing out Leaky Taps v2

My hope is that someone else might be able to benefit from this specific example, given the increasing ranks of new Bro users,  myself included.

We've suffered from a persistent packet loss problem with our full packet capture solution.  For several years!  It's embarrassing, I know, but there is only so much time in the day.  We chalked it up to the proprietary appliance we use, which is running an old flavor of Linux, and also appears to be using a commodity NIC.  We've been building a replacement, so worrying about something you can't really tinker with was just not that high of a priority.  Besides, none of our other tools, commercial or open source, indicated a problem.  So we just dealt with it.  It was frustrating though.  Every time you NEEDED to pull "that" exe out, "[XXXX bytes missing in capture file]" was interspersed throughout the stream.

Ahh, deployment day.  New hardware, high end capture cards, serious time spent tuning and tweaking.  That should solve the problem, right?  Nope, missing packets continued unabated. Aaarrgg!  This is maddening.  All the while our new Bro deployment was chugging along.  When we fist installed Bro, we randomly came across a thread on the Bro mailing list mentioning the capture loss script, "The misc/capture-loss.bro script can detect loss due to reasons beyond the monitoring host ...". Although enabled, we made the mistake of neglecting to pay attention to it.

One day I bugged Liam from Bro/Broala thinking I had some cool custom HTTP badness that was causing Bro to not see the traffic as HTTP, but no, he told me to look at my capture loss logs. It showed 10-20% packet loss.  I had seen the numbers, but I was in denial and not fully understanding what the log was telling me.  With his authoritative response, we got to work trying to narrow down the problem.  It was probably one of a handful of devices, but how do we prove it?  All the places we could look were off the same tap.  And only Bro indicated a problem, nothing else including the tap.  Hmm.  So we started pulling random conversations off the wire from two different taps at two different locations.

From the documentation in the policy/misc/capture-loss.bro script:
##! Reported loss is computed in terms of number of "gap events" (ACKs for a sequence number that's above a gap).

Seems like TCP ACKs is a good place to start, so using a cool option to tshark, "-R ack_lost_segment", we found a bunch of sessions in which a host is sending TCP acknowledgements for TCP sequence numbers that were never seen.  Here is some output:

# 16 sessions from two locations sniffed from two identical taps:
$ tshark -r capture01.pcap -R tcp.analysis.ack_lost_segment
 13   0.561659 192.168.0.1 -> 172.16.4.33 TCP 64 [TCP ACKed lost segment] smtp > 41469 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
 14   0.646359 192.168.0.1 -> 172.16.4.33 TCP 64 [TCP ACKed lost segment] smtp > 41469 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
 16   0.646460 192.168.0.1 -> 172.16.4.33 TCP 64 [TCP ACKed lost segment] smtp > 41469 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

 25   0.367882 192.168.1.1 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29204 [ACK] Seq=361 Ack=3018 Win=65535 Len=0
 26   0.424540 192.168.1.1 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29204 [ACK] Seq=361 Ack=5938 Win=65250 Len=0
 28   0.424645 192.168.1.1 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29204 [ACK] Seq=361 Ack=8858 Win=65535 Len=0

 40  15.069205 192.168.1.2 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 10983 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
 41  15.125764 192.168.1.2 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 10983 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
 43  15.125770 192.168.1.2 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 10983 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

 67  18.232630 192.168.1.3 -> 172.16.16.12 TCP 64 [TCP ACKed lost segment] smtp > 54197 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
 68  18.288941 192.168.1.3 -> 172.16.16.12 TCP 64 [TCP ACKed lost segment] smtp > 54197 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
 70  18.289085 192.168.1.3 -> 172.16.16.12 TCP 64 [TCP ACKed lost segment] smtp > 54197 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

 94  24.385143 192.168.1.4 -> 172.16.17.4 TCP 64 [TCP ACKed lost segment] smtp > 21183 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
 95  24.437401 192.168.1.4 -> 172.16.17.4 TCP 64 [TCP ACKed lost segment] smtp > 21183 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
 97  24.437405 192.168.1.4 -> 172.16.17.4 TCP 64 [TCP ACKed lost segment] smtp > 21183 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

120  26.313236 192.168.1.5 -> 172.16.16.12 TCP 64 [TCP ACKed lost segment] smtp > 54206 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
121  26.369046 192.168.1.5 -> 172.16.16.12 TCP 64 [TCP ACKed lost segment] smtp > 54206 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
123  26.369101 192.168.1.5 -> 172.16.16.12 TCP 64 [TCP ACKed lost segment] smtp > 54206 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

146  65.777993 192.168.1.6 -> 172.16.16.31 TCP 64 [TCP ACKed lost segment] smtp > 54254 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
147  65.835039 192.168.1.6 -> 172.16.16.31 TCP 64 [TCP ACKed lost segment] smtp > 54254 [ACK] Seq=361 Ack=8857 Win=62330 Len=0

171  66.970440 192.168.1.7 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11020 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
172  67.026886 192.168.1.7 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11020 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
174  67.027136 192.168.1.7 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11020 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

198  76.782464 192.168.1.8 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29252 [ACK] Seq=361 Ack=3018 Win=65535 Len=0
199  76.838431 192.168.1.8 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29252 [ACK] Seq=361 Ack=5938 Win=65250 Len=0
201  76.838633 192.168.1.8 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29252 [ACK] Seq=361 Ack=8858 Win=65535 Len=0

225  97.626805 192.168.1.9 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11043 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
226  97.683504 192.168.1.9 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11043 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
228  97.683571 192.168.1.9 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11043 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

252 100.657547 192.168.1.10 -> 172.16.16.31 TCP 64 [TCP ACKed lost segment] smtp > 54304 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
253 100.714278 192.168.1.10 -> 172.16.16.31 TCP 64 [TCP ACKed lost segment] smtp > 54304 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
255 100.714326 192.168.1.10 -> 172.16.16.31 TCP 64 [TCP ACKed lost segment] smtp > 54304 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

278 101.586561 192.168.1.11 -> 172.16.14.61 TCP 64 [TCP ACKed lost segment] smtp > 11049 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
279 101.643018 192.168.1.11 -> 172.16.14.61 TCP 64 [TCP ACKed lost segment] smtp > 11049 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
281 101.643217 192.168.1.11 -> 172.16.14.61 TCP 64 [TCP ACKed lost segment] smtp > 11049 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

305 103.643512 192.168.1.12 -> 172.16.17.5 TCP 64 [TCP ACKed lost segment] smtp > 21270 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
306 103.695079 192.168.1.12 -> 172.16.17.5 TCP 64 [TCP ACKed lost segment] smtp > 21270 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
308 103.695389 192.168.1.12 -> 172.16.17.5 TCP 64 [TCP ACKed lost segment] smtp > 21270 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

331 104.703383 192.168.1.13 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29271 [ACK] Seq=361 Ack=3018 Win=65535 Len=0
332 104.758992 192.168.1.13 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29271 [ACK] Seq=361 Ack=5938 Win=65250 Len=0
334 104.759181 192.168.1.13 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29271 [ACK] Seq=361 Ack=8858 Win=65535 Len=0

358 105.938441 192.168.1.14 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29272 [ACK] Seq=361 Ack=3018 Win=65535 Len=0
359 105.996401 192.168.1.14 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29272 [ACK] Seq=361 Ack=5938 Win=65250 Len=0
361 105.996412 192.168.1.14 -> 172.16.10.19 TCP 64 [TCP ACKed lost segment] smtp > 29272 [ACK] Seq=361 Ack=8858 Win=65535 Len=0

385 110.904544 192.168.1.15 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11058 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
386 110.961339 192.168.1.15 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11058 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
388 110.961345 192.168.1.15 -> 172.16.10.20 TCP 64 [TCP ACKed lost segment] smtp > 11058 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

412 130.757602 192.168.1.16 -> 172.16.17.5 TCP 64 [TCP ACKed lost segment] smtp > 21298 [ACK] Seq=361 Ack=3017 Win=65535 Len=0
413 130.809672 192.168.1.16 -> 172.16.17.5 TCP 64 [TCP ACKed lost segment] smtp > 21298 [ACK] Seq=361 Ack=5937 Win=65250 Len=0
415 130.809872 192.168.1.16 -> 172.16.17.5 TCP 64 [TCP ACKed lost segment] smtp > 21298 [ACK] Seq=361 Ack=8857 Win=65535 Len=0

There is a repeating pattern here.  Each conversation shows the same amount of missing bytes in the same location, which results in ACKs for sequence numbers 301[78], 593[78], and 885[78], but we never see any sequence numbers between 97 and 885[78].

Here is a snippet from tcpdump for the first session above.  The last sequence number we see 192.168.0.1 receive and acknowledge from 172.16.4.33 is 97, yet the next ACK sent from 192.xxx is 3017:
19:52:49.291559 IP 172.16.4.33.41469 > 192.168.0.1.25: Flags [P.], seq 91:97, ack 341, win 511, length 6
19:52:49.370657 IP 192.168.0.1.25 > 172.16.4.33.41469: Flags [P.], seq 341:361, ack 97, win 65439, length 20
19:52:49.453828 IP 192.168.0.1.25 > 172.16.4.33.41469: Flags [.], ack 3017, win 65535, length 0
19:52:49.538528 IP 192.168.0.1.25 > 172.16.4.33.41469: Flags [.], ack 5937, win 65250, length 0
19:52:49.538583 IP 192.168.0.1.25 > 172.16.4.33.41469: Flags [.], ack 5937, win 65250, length 0
19:52:49.538629 IP 192.168.0.1.25 > 172.16.4.33.41469: Flags [.], ack 8857, win 65535, length 0
19:52:49.539035 IP 172.16.4.33.41469 > 192.168.0.1.25: Flags [P.], seq 13237:14504, ack 361, win 511, length 1267
19:52:49.624887 IP 192.168.0.1.25 > 172.16.4.33.41469: Flags [.], ack 14504, win 59888, length 0
19:52:49.628183 IP 192.168.0.1.25 > 172.16.4.33.41469: Flags [.], ack 14504, win 65535, length 0
19:52:49.628379 IP 192.168.0.1.25 > 172.16.4.33.41469: Flags [P.], seq 361:438, ack 14504, win 65535, length 77
19:52:49.629974 IP 172.16.4.33.41469 > 192.168.0.1.25: Flags [P.], seq 14504:14510, ack 438, win 511, length 6

This has to be a software bug.  We pulled captures from several other devices and found the same pattern.  Ok seems like the tap.  Our tap was not under maintenance and several versions out of date.  Even so, the vendor was kind enough to help us by providing a more recent version of the OS to test our theory.  As soon as we upgraded the tap software, Bro capture loss stats dropped below 1%.

The following chart graphs 4+ months of Bro capture loss data.  The drop off is the day we upgraded.

Bro capture loss percentage

To enable the capture-loss.bro script, add the following to your local.bro file and be sure to monitor your capture_loss.log file in your Bro log directory:
@load misc/capture-loss.bro

Long story short, Bro rocks and don't neglect your tap.
Viva la Bro