Troubleshooting Common Networking Problems with Wireshark, Pt. 6: Finding the Source of Network Delays


NOTE: You can now take course by the author with video and example traces on Wireshark. Check this post for more details.

Author’s Note: This is the last part in a six-part series about finding and solving many networking anomalies using the Wireshark network protocol analyzer. If you are new to the series, you can find part 1 here, and the whole series here.
It may seem surprising in this age of ubiquitous Gigabit Ethernet connections, but many of the calls in Enterprise networking still revolve around network delay problems. Slow access to websites, slow logons, slow email, all of these issues can potentially be related to network delays. Unfortunately, tracking down the source of network delays can be very frustrating at times. With the right filtering, however, you can eliminate much of this frustration and drill down to the source.
In some cases, the delays will be easy to spot, showing up as a per-packet delay in the traces. In others, the per-packet latency may be very small, but the upper layer latency may be huge. The simplest check, however, is typically per packet latency. This can be done easily in Wireshark by filtering down to communications between the two problematic systems. Once this is done, ensure that your time display format is set to ‘Seconds since previous packet’ in the view menu, and click on the ‘Time’ column heading to sort by time. This will show you your longest delays between packets. If you have many packets with very long delays (0.2 seconds or longer), re-sort by frame number for several of these and examine the traffic. The type of traffic will determine if the delay is a problem. For instance, a long delay on a HTTP keep-alive is not a problem, but a long delay between an HTTP Get and a HTTP 200 response is.

Note: Make sure, when looking for delay issues, that the trace was captured with a minimal amount of capture filtering. The default setting of filtering out RDP is fine (unless you are getting RDP delays, of course), but any more than that is questionable. The reasoning for this is that if you are filtering out a large chunk of the traffic, you are in effect creating gaps in the capture, which invalidates the technique discussed above.

However, once you find the delay, you may still need to dig further to locate the source. For example, if you right click on an HTTP packet in Wireshark and choose the option ‘Follow TCP stream’, Wireshark will filter based on the source and destination sockets, giving you a good summary of the TCP conversation. An example of a delayed conversation is shown below:

No.
Time
Source
Destination
Protocol
Info
111
0.003251
160.207.226.48
160.207.12.41
HTTP
GET /default.aspx HTTP/1.1
112
0.000771
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 401 Unauthorized (text/html)
113
0.000533
160.207.12.41
160.207.226.48
HTTP
Continuation or non-HTTP traffic
114
0.000986
160.207.226.48
160.207.12.41
TCP
2456 > http [ACK] Seq=591 Ack=1677 Win=65520 Len=0
115
0.007079
160.207.226.48
160.207.12.41
HTTP
GET /default.aspx HTTP/1.1, NTLMSSP_NEGOTIATE
116
0.001008
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)
117
0.000562
160.207.12.41
160.207.226.48
HTTP
Continuation or non-HTTP traffic
118
0.000993
160.207.226.48
160.207.12.41
TCP
2456 > http [ACK] Seq=1283 Ack=3651 Win=65520 Len=0
119
0.005675
160.207.226.48
160.207.12.41
HTTP
GET /default.aspx HTTP/1.1, NTLMSSP_AUTH, User: GDAIS\Michael.Vogt
121
0.112828
160.207.12.41
160.207.226.48
TCP
http > 2456 [ACK] Seq=3651 Ack=2143 Win=64675 Len=0
166
1.634767
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (text/html)

Looking at this conversation, you may correctly surmise that the page in question takes a long time to retrieve (around 1.7 seconds), but we still do not truly know the source. Even worse, since an HTTP page can include multiple references to additional files (such as images), which can extend over multiple TCP sessions, looking at a single stream does not necessarily give us a true appreciation of the delay. To see the full delay, we must filter just on the two IP addresses in the conversation along with the protocol (HTTP), like so:
(ip.addr==160.207.226.48 and ip.addr==160.207.12.41) and http
With this filter, we can look for each HTTP Get request and response, and determine when the last request for the page is sent, and when the response for that request is returned. We find that the last request is for the file crossbullet.gif in frame 3962, which is returned in frame 4285. We know this is the last request because the Get request for default.aspx (and all the Get requests related to this page) repeats (indicating a new page request) after this Get request. Now that we know where the conversation ends, we can filter to just this HTTP conversation by using the following filter:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and http and frame.number <= 4285
With this filter applied, using the Summary command under the statistics menu gives us the following information:
image
As the summary shows, there is a 28.5 second delay in retrieving the page. This delay is huge, much worse than we previously anticipated, so now we need to dig in further to determine why. Another Wireshark filter can help here:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.number <= 4285
This filter is essentially filtering to all traffic between the two hosts that is also an HTTP request or response packet and has a frame number less than 4285. This filters the HTTP continuation packets out of the trace, dropping the trace from 689 frames to 182 frames. However, we can break it down further and look for just the high delay packets by adding frame.time_delta to the filter, like so:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.number <= 4285 and frame.time_delta >= 0.2
This filter is the same as the previous filter, but now we are adding the condition that all frames displayed must have a time delta from the previously displayed frame equal to or greater than 200ms. This filters the frame down to 27 high-latency frames:

No.
Time
Source
Destination
Protocol
Info
111
————
160.207.226.48
160.207.12.41
HTTP
GET /default.aspx HTTP/1.1
166
1.765202
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (text/html)
244
0.208087
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/1033/owsbrows.js HTTP/1.1
351
1.450349
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (text/css)
550
1.169115
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (text/css)
651
1.157058
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (application/x-javascript)
724
1.081409
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (application/x-javascript)
920
0.264385
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/1033/search.js HTTP/1.1
1136
1.315121
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (application/x-javascript)
1316
1.771438
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (text/css)
1382
0.206753
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/images/icongo02.gif HTTP/1.1
1576
2.916854
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (GIF89a)
1784
2.512169
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (application/x-javascript)
1951
0.20108
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/images/GDAISsource.gif HTTP/1.1, NTLMSSP_AUTH, User: GDAIS\Michael.Vogt
2002
1.160457
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (text/css)
2052
0.200265
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)
2158
1.139176
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (GIF89a)
2226
1.103149
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 304 Not Modified
2337
1.281311
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (text/x-component)
2469
1.267644
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (GIF89a)
2512
1.268341
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (GIF89a)
2551
0.217619
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (GIF89a)
2615
1.331625
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (GIF89a)
2693
1.114701
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (GIF89a)
3280
1.147739
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (GIF89a)
3944
0.352483
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/images/partgradp.gif HTTP/1.1, NTLMSSP_NEGOTIATE
4282
0.919761
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 404 Not Found (text/html)

At this point, we simply need to look for a pattern, and luckily, the filter shows us the pattern very quickly. It appears that nearly all of the HTTP responses have delays of 1 second or longer. This is a huge find, and points us very close to the cause.
Initially, we may be tempted to point to the server itself as the cause of the delay, but we can’t prove that without looking at the server side trace. So we will open the server side trace and attempt to find a conversation with similar delay. It is helpful if you can find the same conversation, but not absolutely necessary. If the problem is caused by a server side issue, any conversation that is going to the same URL on the same server at approximately the same time should show a similar delay, and for the same reasons.
To find a similar delay, we will simply modify the previous filter slightly, removing the frame number restriction, like so:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.time_delta >= 0.2
This filter will show us some frames with a very striking similarity to the ones seen in the previous trace:

No.
Time
Source
Destination
Protocol
Info
3273
————
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/1033/owsbrows.js HTTP/1.1
3751
1.095343
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (application/x-javascript)
3856
1.081487
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (application/x-javascript)
4088
0.265282
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/1033/search.js HTTP/1.1
4202
1.314339
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 200 OK (application/x-javascript)

This proves that the problem is not due to delay in the link, but is something related to the web server’s delivery of the requested page. To dig deeper, we now need to select an example of the problem, say the Get request for owsbrows.js and the response seen in frames 3273 and 3751, respectively, and filter to all traffic between this request and response. This is to see any additional traffic that our previous filters may have hidden. To do this, we will use the following filter:
frame.number >= 3273 and frame.number <= 3751
This filter filters down to all frames between these two numbers, including the first and last frames (request and response). However, this leaves us with 500 or so frames to look through, which is a bit more than we might wish. Looking at the first 36 packets, however, we can begin to determine what is occurring:

No.
Time
Source
Destination
Protocol
Info
3273
6.466433
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/1033/owsbrows.js HTTP/1.1
3274
0.000061
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [ACK] Seq=600692 Ack=27063 Win=65047 Len=1460
3275
0.000017
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [PSH, ACK] Seq=602152 Ack=27063 Win=65047 Len=298
3276
0.000018
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [ACK] Seq=27063 Ack=602450 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
3277
0.000348
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 401 Unauthorized (text/html)
3278
0.000009
160.207.12.41
160.207.226.48
HTTP
Continuation or non-HTTP traffic
3279
0.00151
160.207.226.48
160.207.12.41
TCP
2466 > http [ACK] Seq=742 Ack=1677 Win=65520 Len=0
3280
0.000026
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [PSH, ACK] Seq=27063 Ack=602450 Win=65535 [TCP CHECKSUM INCORRECT] Len=122
3281
0.000571
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [ACK] Seq=602450 Ack=27185 Win=64925 Len=1460
3282
0.000013
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [PSH, ACK] Seq=603910 Ack=27185 Win=64925 Len=298
3283
0.000019
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [ACK] Seq=27185 Ack=604208 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
3284
0.000721
160.207.12.1
Broadcast
ARP
Who has 160.207.12.184?  Tell 160.207.12.1
3285
0.000926
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [PSH, ACK] Seq=27185 Ack=604208 Win=65535 [TCP CHECKSUM INCORRECT] Len=122
3286
0.000439
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [ACK] Seq=604208 Ack=27307 Win=64803 Len=1460
3287
0.000032
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [PSH, ACK] Seq=605668 Ack=27307 Win=64803 Len=298
3288
0.000015
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [ACK] Seq=27307 Ack=605966 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
3289
0.002
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [PSH, ACK] Seq=27307 Ack=605966 Win=65535 [TCP CHECKSUM INCORRECT] Len=122
3290
0.000152
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/1033/owsbrows.js HTTP/1.1, NTLMSSP_NEGOTIATE
3291
0.000454
160.207.12.41
160.207.226.48
HTTP
HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)
3292
0.000011
160.207.12.41
160.207.226.48
HTTP
Continuation or non-HTTP traffic
3293
0.000017
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [ACK] Seq=605966 Ack=27429 Win=64681 Len=1460
3294
0.000016
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [PSH, ACK] Seq=607426 Ack=27429 Win=64681 Len=298
3295
0.000015
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [ACK] Seq=27429 Ack=607724 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
3296
0.001842
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [PSH, ACK] Seq=27429 Ack=607724 Win=65535 [TCP CHECKSUM INCORRECT] Len=122
3297
0.000052
160.207.226.48
160.207.12.41
TCP
2466 > http [ACK] Seq=1585 Ack=3651 Win=65520 Len=0
3298
0.000805
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [ACK] Seq=607724 Ack=27551 Win=64559 Len=1460
3299
0.000023
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [ACK] Seq=609184 Ack=27551 Win=64559 Len=1460
3300
0.000009
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [ACK] Seq=610644 Ack=27551 Win=64559 Len=1460
3301
0.000016
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [ACK] Seq=27551 Ack=612104 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
3302
0.00003
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [PSH, ACK] Seq=612104 Ack=27551 Win=64559 Len=1192
3303
0.002783
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [PSH, ACK] Seq=27551 Ack=613296 Win=64343 [TCP CHECKSUM INCORRECT] Len=122
3304
0.000144
160.207.226.48
160.207.12.41
HTTP
GET /_layouts/1033/owsbrows.js HTTP/1.1, NTLMSSP_AUTH, User: CORP\John.Doe
3305
0.000484
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [ACK] Seq=613296 Ack=27673 Win=64437 Len=1460
3306
0.000015
160.207.12.37
160.207.12.19
TCP
1292 > 4076 [PSH, ACK] Seq=614756 Ack=27673 Win=64437 Len=298
3307
0.000018
160.207.12.19
160.207.12.37
TCP
4076 > 1292 [ACK] Seq=27673 Ack=615054 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
3308
0.000032
160.207.12.19
160.207.230.36
DCERPC
Request: call_id: 4997 opnum: 39 ctx_id: 0

Looking at the bolded frames, we can see two things of importance. First, the client is requested for authentication from the server for the Get, to which it responds in a normal fashion (Get, 401, Get Negotiate, 401 Challenge, Get with Credentials). The interesting thing to note is the speed with which the server responds to each step of this process. You will notice that the server’s 401 responses occur consistently in less than 0.001 seconds. This is not consistent with our previous view of a web server performance problem. The second thing to notice is what the web server does in the last frame (3308): It makes an RPC call to a domain controller, presumably to verify the authentication provided by the client. This is perhaps the most important find in the trace, as it shows interaction with another server occurs before the client is validated and given the page he has requested.
Now that we have found this new piece of information, let’s filter to determine how long the response to this RPC request takes. This is most easily done by selecting the RPC request and right-clicking to choose ‘Follow TCP Stream’. Once this is done, you can very easily see the RPC request and response, identified by their Call ID:

No.
Time
Source
Destination
Protocol
Info
3308
————-
160.207.12.19
160.207.230.36
DCERPC
Request: call_id: 4997 opnum: 39 ctx_id: 0
3654
0.255042
160.207.230.36
160.207.12.19
TCP
1025 > 4003 [ACK] Seq=7680 Ack=1296 Win=64671 Len=0
3747
0.825159
160.207.230.36
160.207.12.19
DCERPC
Response: call_id: 4997 ctx_id: 0

The result of this final filter is interesting. Our previous analysis showed that there was a total delay between the HTTP Get and the Web Server’s response of 1.095 seconds (the delay between frames 3273 and 3751). The analysis of the RPC traffic shows that the delayed response from the domain controller accounts for 1.08 seconds of that delay! Using this information, we would be able to examine all of the other RPC calls to the DC and verify this finding, proving that the domain controller is the primary cause of the web server delay.
This information is the key to resolving the issue, since no amount of resource allocation towards the network infrastructure, web server, or other services (like SQL) would actually improve the situation at all. But an additional DC to handle the authentication traffic may completely resolve the issue.

, , , ,

  1. #1 by Sean Barmettler on April 7, 2013 - 10:49 pm

    i was using tcp.time_delay and wondering why this wasnt working for me. thanks. saved me an hour of banging my head against the wall.

  2. #2 by Brian Hill on April 8, 2013 - 1:01 am

    Cool, glad it helped 🙂

  3. #3 by Unknown on June 17, 2013 - 8:00 pm

    Great Post. I want to read the rest, but the links to the article series does not seem to be working.

  4. #4 by Brian Hill on June 17, 2013 - 8:16 pm

    Really sorry about that. For some reason, blogger doesn’t seem to be returning all of the results. I will work on this further, but until then, you can get to all of the posts from this link:
    http://goo.gl/7AAOH

  5. #5 by Unknown on June 22, 2013 - 1:48 am

    Awesome. Thanks for posting the work-around link

  6. #6 by viper ccie on January 13, 2014 - 7:45 am

    Thanks Brian, this is really a great post 🙂
    Actually I am wondering how you identified frame 3308? The source and destination of frame 3308 not showing web server IP address. Thanks

    The second thing to notice is what the web server does in the last frame (3308)

  7. #7 by Brian Hill on January 13, 2014 - 3:18 pm

    Viper CCIE, thanks for the comment. It’s been a long time since I worked this case, but to the best of my recollection the .19 IP was the primary IP on the web server.

(will not be published)