NOTE: You can now take course by the author with video and example traces on Wireshark. Check this post for more details.
Author’s Note: This is the last part in a six-part series about finding and solving many networking anomalies using the Wireshark network protocol analyzer. If you are new to the series, you can find part 1 here, and the whole series here.
It may seem surprising in this age of ubiquitous Gigabit Ethernet connections, but many of the calls in Enterprise networking still revolve around network delay problems. Slow access to websites, slow logons, slow email, all of these issues can potentially be related to network delays. Unfortunately, tracking down the source of network delays can be very frustrating at times. With the right filtering, however, you can eliminate much of this frustration and drill down to the source.
In some cases, the delays will be easy to spot, showing up as a per-packet delay in the traces. In others, the per-packet latency may be very small, but the upper layer latency may be huge. The simplest check, however, is typically per packet latency. This can be done easily in Wireshark by filtering down to communications between the two problematic systems. Once this is done, ensure that your time display format is set to ‘Seconds since previous packet’ in the view menu, and click on the ‘Time’ column heading to sort by time. This will show you your longest delays between packets. If you have many packets with very long delays (0.2 seconds or longer), re-sort by frame number for several of these and examine the traffic. The type of traffic will determine if the delay is a problem. For instance, a long delay on a HTTP keep-alive is not a problem, but a long delay between an HTTP Get and a HTTP 200 response is.
Note: Make sure, when looking for delay issues, that the trace was captured with a minimal amount of capture filtering. The default setting of filtering out RDP is fine (unless you are getting RDP delays, of course), but any more than that is questionable. The reasoning for this is that if you are filtering out a large chunk of the traffic, you are in effect creating gaps in the capture, which invalidates the technique discussed above.
However, once you find the delay, you may still need to dig further to locate the source. For example, if you right click on an HTTP packet in Wireshark and choose the option ‘Follow TCP stream’, Wireshark will filter based on the source and destination sockets, giving you a good summary of the TCP conversation. An example of a delayed conversation is shown below:
No.
|
Time
|
Source
|
Destination
|
Protocol
|
Info
|
111
|
0.003251
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /default.aspx HTTP/1.1
|
112
|
0.000771
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 401 Unauthorized (text/html)
|
113
|
0.000533
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
Continuation or non-HTTP traffic
|
114
|
0.000986
|
160.207.226.48
|
160.207.12.41
|
TCP
|
2456 > http [ACK] Seq=591 Ack=1677 Win=65520 Len=0
|
115
|
0.007079
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /default.aspx HTTP/1.1, NTLMSSP_NEGOTIATE
|
116
|
0.001008
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)
|
117
|
0.000562
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
Continuation or non-HTTP traffic
|
118
|
0.000993
|
160.207.226.48
|
160.207.12.41
|
TCP
|
2456 > http [ACK] Seq=1283 Ack=3651 Win=65520 Len=0
|
119
|
0.005675
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /default.aspx HTTP/1.1, NTLMSSP_AUTH, User: GDAIS\Michael.Vogt
|
121
|
0.112828
|
160.207.12.41
|
160.207.226.48
|
TCP
|
http > 2456 [ACK] Seq=3651 Ack=2143 Win=64675 Len=0
|
166
|
1.634767
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (text/html)
|
Looking at this conversation, you may correctly surmise that the page in question takes a long time to retrieve (around 1.7 seconds), but we still do not truly know the source. Even worse, since an HTTP page can include multiple references to additional files (such as images), which can extend over multiple TCP sessions, looking at a single stream does not necessarily give us a true appreciation of the delay. To see the full delay, we must filter just on the two IP addresses in the conversation along with the protocol (HTTP), like so:
(ip.addr==160.207.226.48 and ip.addr==160.207.12.41) and http
With this filter, we can look for each HTTP Get request and response, and determine when the last request for the page is sent, and when the response for that request is returned. We find that the last request is for the file crossbullet.gif in frame 3962, which is returned in frame 4285. We know this is the last request because the Get request for default.aspx (and all the Get requests related to this page) repeats (indicating a new page request) after this Get request. Now that we know where the conversation ends, we can filter to just this HTTP conversation by using the following filter:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and http and frame.number <= 4285
With this filter applied, using the Summary command under the statistics menu gives us the following information:
As the summary shows, there is a 28.5 second delay in retrieving the page. This delay is huge, much worse than we previously anticipated, so now we need to dig in further to determine why. Another Wireshark filter can help here:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.number <= 4285
This filter is essentially filtering to all traffic between the two hosts that is also an HTTP request or response packet and has a frame number less than 4285. This filters the HTTP continuation packets out of the trace, dropping the trace from 689 frames to 182 frames. However, we can break it down further and look for just the high delay packets by adding frame.time_delta to the filter, like so:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.number <= 4285 and frame.time_delta >= 0.2
This filter is the same as the previous filter, but now we are adding the condition that all frames displayed must have a time delta from the previously displayed frame equal to or greater than 200ms. This filters the frame down to 27 high-latency frames:
No.
|
Time
|
Source
|
Destination
|
Protocol
|
Info
|
111
|
————
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /default.aspx HTTP/1.1
|
166
|
1.765202
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (text/html)
|
244
|
0.208087
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/1033/owsbrows.js HTTP/1.1
|
351
|
1.450349
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (text/css)
|
550
|
1.169115
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (text/css)
|
651
|
1.157058
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (application/x-javascript)
|
724
|
1.081409
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (application/x-javascript)
|
920
|
0.264385
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/1033/search.js HTTP/1.1
|
1136
|
1.315121
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (application/x-javascript)
|
1316
|
1.771438
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (text/css)
|
1382
|
0.206753
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/images/icongo02.gif HTTP/1.1
|
1576
|
2.916854
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (GIF89a)
|
1784
|
2.512169
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (application/x-javascript)
|
1951
|
0.20108
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/images/GDAISsource.gif HTTP/1.1, NTLMSSP_AUTH, User: GDAIS\Michael.Vogt
|
2002
|
1.160457
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (text/css)
|
2052
|
0.200265
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)
|
2158
|
1.139176
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (GIF89a)
|
2226
|
1.103149
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 304 Not Modified
|
2337
|
1.281311
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (text/x-component)
|
2469
|
1.267644
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (GIF89a)
|
2512
|
1.268341
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (GIF89a)
|
2551
|
0.217619
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (GIF89a)
|
2615
|
1.331625
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (GIF89a)
|
2693
|
1.114701
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (GIF89a)
|
3280
|
1.147739
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (GIF89a)
|
3944
|
0.352483
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/images/partgradp.gif HTTP/1.1, NTLMSSP_NEGOTIATE
|
4282
|
0.919761
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 404 Not Found (text/html)
|
At this point, we simply need to look for a pattern, and luckily, the filter shows us the pattern very quickly. It appears that nearly all of the HTTP responses have delays of 1 second or longer. This is a huge find, and points us very close to the cause.
Initially, we may be tempted to point to the server itself as the cause of the delay, but we can’t prove that without looking at the server side trace. So we will open the server side trace and attempt to find a conversation with similar delay. It is helpful if you can find the same conversation, but not absolutely necessary. If the problem is caused by a server side issue, any conversation that is going to the same URL on the same server at approximately the same time should show a similar delay, and for the same reasons.
To find a similar delay, we will simply modify the previous filter slightly, removing the frame number restriction, like so:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.time_delta >= 0.2
This filter will show us some frames with a very striking similarity to the ones seen in the previous trace:
No.
|
Time
|
Source
|
Destination
|
Protocol
|
Info
|
3273
|
————
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/1033/owsbrows.js HTTP/1.1
|
3751
|
1.095343
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (application/x-javascript)
|
3856
|
1.081487
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (application/x-javascript)
|
4088
|
0.265282
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/1033/search.js HTTP/1.1
|
4202
|
1.314339
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 200 OK (application/x-javascript)
|
This proves that the problem is not due to delay in the link, but is something related to the web server’s delivery of the requested page. To dig deeper, we now need to select an example of the problem, say the Get request for owsbrows.js and the response seen in frames 3273 and 3751, respectively, and filter to all traffic between this request and response. This is to see any additional traffic that our previous filters may have hidden. To do this, we will use the following filter:
frame.number >= 3273 and frame.number <= 3751
This filter filters down to all frames between these two numbers, including the first and last frames (request and response). However, this leaves us with 500 or so frames to look through, which is a bit more than we might wish. Looking at the first 36 packets, however, we can begin to determine what is occurring:
No.
|
Time
|
Source
|
Destination
|
Protocol
|
Info
|
3273
|
6.466433
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/1033/owsbrows.js HTTP/1.1
|
3274
|
0.000061
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [ACK] Seq=600692 Ack=27063 Win=65047 Len=1460
|
3275
|
0.000017
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [PSH, ACK] Seq=602152 Ack=27063 Win=65047 Len=298
|
3276
|
0.000018
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [ACK] Seq=27063 Ack=602450 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
|
3277
|
0.000348
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 401 Unauthorized (text/html)
|
3278
|
0.000009
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
Continuation or non-HTTP traffic
|
3279
|
0.00151
|
160.207.226.48
|
160.207.12.41
|
TCP
|
2466 > http [ACK] Seq=742 Ack=1677 Win=65520 Len=0
|
3280
|
0.000026
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [PSH, ACK] Seq=27063 Ack=602450 Win=65535 [TCP CHECKSUM INCORRECT] Len=122
|
3281
|
0.000571
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [ACK] Seq=602450 Ack=27185 Win=64925 Len=1460
|
3282
|
0.000013
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [PSH, ACK] Seq=603910 Ack=27185 Win=64925 Len=298
|
3283
|
0.000019
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [ACK] Seq=27185 Ack=604208 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
|
3284
|
0.000721
|
160.207.12.1
|
Broadcast
|
ARP
|
Who has 160.207.12.184? Tell 160.207.12.1
|
3285
|
0.000926
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [PSH, ACK] Seq=27185 Ack=604208 Win=65535 [TCP CHECKSUM INCORRECT] Len=122
|
3286
|
0.000439
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [ACK] Seq=604208 Ack=27307 Win=64803 Len=1460
|
3287
|
0.000032
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [PSH, ACK] Seq=605668 Ack=27307 Win=64803 Len=298
|
3288
|
0.000015
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [ACK] Seq=27307 Ack=605966 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
|
3289
|
0.002
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [PSH, ACK] Seq=27307 Ack=605966 Win=65535 [TCP CHECKSUM INCORRECT] Len=122
|
3290
|
0.000152
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/1033/owsbrows.js HTTP/1.1, NTLMSSP_NEGOTIATE
|
3291
|
0.000454
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)
|
3292
|
0.000011
|
160.207.12.41
|
160.207.226.48
|
HTTP
|
Continuation or non-HTTP traffic
|
3293
|
0.000017
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [ACK] Seq=605966 Ack=27429 Win=64681 Len=1460
|
3294
|
0.000016
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [PSH, ACK] Seq=607426 Ack=27429 Win=64681 Len=298
|
3295
|
0.000015
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [ACK] Seq=27429 Ack=607724 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
|
3296
|
0.001842
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [PSH, ACK] Seq=27429 Ack=607724 Win=65535 [TCP CHECKSUM INCORRECT] Len=122
|
3297
|
0.000052
|
160.207.226.48
|
160.207.12.41
|
TCP
|
2466 > http [ACK] Seq=1585 Ack=3651 Win=65520 Len=0
|
3298
|
0.000805
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [ACK] Seq=607724 Ack=27551 Win=64559 Len=1460
|
3299
|
0.000023
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [ACK] Seq=609184 Ack=27551 Win=64559 Len=1460
|
3300
|
0.000009
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [ACK] Seq=610644 Ack=27551 Win=64559 Len=1460
|
3301
|
0.000016
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [ACK] Seq=27551 Ack=612104 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
|
3302
|
0.00003
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [PSH, ACK] Seq=612104 Ack=27551 Win=64559 Len=1192
|
3303
|
0.002783
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [PSH, ACK] Seq=27551 Ack=613296 Win=64343 [TCP CHECKSUM INCORRECT] Len=122
|
3304
|
0.000144
|
160.207.226.48
|
160.207.12.41
|
HTTP
|
GET /_layouts/1033/owsbrows.js HTTP/1.1, NTLMSSP_AUTH, User: CORP\John.Doe
|
3305
|
0.000484
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [ACK] Seq=613296 Ack=27673 Win=64437 Len=1460
|
3306
|
0.000015
|
160.207.12.37
|
160.207.12.19
|
TCP
|
1292 > 4076 [PSH, ACK] Seq=614756 Ack=27673 Win=64437 Len=298
|
3307
|
0.000018
|
160.207.12.19
|
160.207.12.37
|
TCP
|
4076 > 1292 [ACK] Seq=27673 Ack=615054 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
|
3308
|
0.000032
|
160.207.12.19
|
160.207.230.36
|
DCERPC
|
Request: call_id: 4997 opnum: 39 ctx_id: 0
|
Looking at the bolded frames, we can see two things of importance. First, the client is requested for authentication from the server for the Get, to which it responds in a normal fashion (Get, 401, Get Negotiate, 401 Challenge, Get with Credentials). The interesting thing to note is the speed with which the server responds to each step of this process. You will notice that the server’s 401 responses occur consistently in less than 0.001 seconds. This is not consistent with our previous view of a web server performance problem. The second thing to notice is what the web server does in the last frame (3308): It makes an RPC call to a domain controller, presumably to verify the authentication provided by the client. This is perhaps the most important find in the trace, as it shows interaction with another server occurs before the client is validated and given the page he has requested.
Now that we have found this new piece of information, let’s filter to determine how long the response to this RPC request takes. This is most easily done by selecting the RPC request and right-clicking to choose ‘Follow TCP Stream’. Once this is done, you can very easily see the RPC request and response, identified by their Call ID:
No.
|
Time
|
Source
|
Destination
|
Protocol
|
Info
|
3308
|
————-
|
160.207.12.19
|
160.207.230.36
|
DCERPC
|
Request: call_id: 4997 opnum: 39 ctx_id: 0
|
3654
|
0.255042
|
160.207.230.36
|
160.207.12.19
|
TCP
|
1025 > 4003 [ACK] Seq=7680 Ack=1296 Win=64671 Len=0
|
3747
|
0.825159
|
160.207.230.36
|
160.207.12.19
|
DCERPC
|
Response: call_id: 4997 ctx_id: 0
|
The result of this final filter is interesting. Our previous analysis showed that there was a total delay between the HTTP Get and the Web Server’s response of 1.095 seconds (the delay between frames 3273 and 3751). The analysis of the RPC traffic shows that the delayed response from the domain controller accounts for 1.08 seconds of that delay! Using this information, we would be able to examine all of the other RPC calls to the DC and verify this finding, proving that the domain controller is the primary cause of the web server delay.
This information is the key to resolving the issue, since no amount of resource allocation towards the network infrastructure, web server, or other services (like SQL) would actually improve the situation at all. But an additional DC to handle the authentication traffic may completely resolve the issue.
#1 by Sean Barmettler on April 7, 2013 - 10:49 pm
i was using tcp.time_delay and wondering why this wasnt working for me. thanks. saved me an hour of banging my head against the wall.
#2 by Brian Hill on April 8, 2013 - 1:01 am
Cool, glad it helped 🙂
#3 by Unknown on June 17, 2013 - 8:00 pm
Great Post. I want to read the rest, but the links to the article series does not seem to be working.
#4 by Brian Hill on June 17, 2013 - 8:16 pm
Really sorry about that. For some reason, blogger doesn’t seem to be returning all of the results. I will work on this further, but until then, you can get to all of the posts from this link:
http://goo.gl/7AAOH
#5 by Unknown on June 22, 2013 - 1:48 am
Awesome. Thanks for posting the work-around link
#6 by viper ccie on January 13, 2014 - 7:45 am
Thanks Brian, this is really a great post 🙂
Actually I am wondering how you identified frame 3308? The source and destination of frame 3308 not showing web server IP address. Thanks
The second thing to notice is what the web server does in the last frame (3308)
#7 by Brian Hill on January 13, 2014 - 3:18 pm
Viper CCIE, thanks for the comment. It’s been a long time since I worked this case, but to the best of my recollection the .19 IP was the primary IP on the web server.