The story of a strange troubleshooting scenario where devices stop communicating for no apparent reason, and how Wireshark packet captures were used to resolve it.
This was an interesting ticket from a few years back, in another life. After the success of my other troubleshooting post that also used Wireshark - The Case of the TCP Challenge ACK - I went back into the archives, by request, to find another of the unique network issues I have resolved. This story is true, but some of the details and information have been changed or redacted to protect the innocent :). I present it in a story format and go over how the issue was troubleshot to resolution, along with a few troubleshooting tips. I haven't posted in some time, and for that I apologize to the few who have been waiting! I promise to post more in the coming year of 2022!
Problem statement: Network video recorders (NVRs) for CCTV cameras at one location become unreachable from remote monitoring a few minutes after being connected to the network. The NVRs keep working fine when viewed on a locally attached monitor while the problem is occurring. The problem can be replicated by rebooting the NVRs.
Background
To begin, this was a brand new branch office location that had been active for a few weeks. I hadn't really heard anything about it, but I was aware of its existence as I had seen it on the change schedule and heard about it from the field techs. It had a couple hundred users with a fairly standard network setup. There were ~15 network devices: 2 distribution switches, numerous access switches, access points and a router for the WAN - all Cisco. The vendor and the specific switch model and code version will come into play later in this case.
The first I heard of the problem was when the ticket came in. This location needed a camera system, so we set up an isolated VLAN and network for the local guard and the remote security center to review footage - just like at other locations. Everything was deployed correctly and the information was provided to the end users. However, after a day or so word came back that once the NVRs were installed, the cameras would only be viewable for a couple of minutes before freezing on the remote monitoring screen, but not on the local device itself. Eventually they would show as disconnected and the NVRs would be unreachable on the network. After a restart they would come up again and the issue would repeat. Later we were able to replicate the problem by unplugging the NVR's Ethernet cable and plugging it back in.
The Case of the GARP
You've probably gotten a ticket or two for a device being unreachable on the network. Typically you check whether it's getting a DHCP address or whether its static IP info is correct: the subnet mask, the default gateway, the VLAN on the port and so on. Then you try to ping and traceroute to it. Well, here everything seemed to match. It had the correct static IP, but it wasn't pingable. So I logged into the local site router and checked the Address Resolution Protocol (ARP) table to see if the device was there. "Maybe ping isn't enabled," I thought. I did see an ARP entry, which means the device could talk to its default gateway and should therefore be able to communicate outside of its own network. So it should have been working fine.
Eventually the ARP entry would age out of the router's cache, indicating communication had stopped. Still, because there was an ARP entry on the router/gateway and the ops center was able to view the cameras briefly, we could assume the device was in the correct VLAN and had the correct IP and subnet mask. If I hadn't seen an ARP entry, I would have asked which switch and port the device was on to check the VLAN assignment. Routing on the WAN was working normally.
Now if you don't know what ARP is, it's the protocol used in IPv4 for a device to resolve an IP address to a MAC address. When a device needs to communicate within its own IP subnet, it broadcasts an ARP request for the other device directly, like "Who has 10.6.34.55? Tell 10.6.34.29". That request carries the source device's physical (MAC) address, and the reply contains the other device's MAC address. This is how devices learn each other's addresses and send traffic directly between themselves. Furthermore, when a device needs to communicate with an IP address outside of its own IP address range, it sends an ARP request for its default gateway so it knows where to send the traffic to reach the remote network. A Gratuitous ARP is an unsolicited announcement: a device broadcasts its own IP-to-MAC mapping, for example for tracking or to update its neighbors, rather than asking to resolve someone else's address like a standard request. This will come into play later.
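If it helps to see that request as bytes, here is a minimal Python sketch of the ARP payload for the example above. It is only an illustration of the standard IPv4-over-Ethernet ARP layout, not something taken from the captures in this story. The function names and the MAC address are made up, and nothing is actually sent on the wire.

```python
import socket
import struct

# Layout of an ARP payload for IPv4 over Ethernet (28 bytes, network byte order).
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_REQUEST = 1


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Pack an ARP request asking who owns target_ip."""
    return struct.pack(
        ARP_FORMAT,
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # MAC address length
        4,                            # IPv4 address length
        ARP_REQUEST,                  # opcode: request
        sender_mac,                   # sender MAC (the device asking)
        socket.inet_aton(sender_ip),  # sender IP
        b"\x00" * 6,                  # target MAC: unknown, that is the question
        socket.inet_aton(target_ip),  # target IP being resolved
    )


def describe(payload: bytes) -> str:
    """Render a request the way a capture summary line reads."""
    fields = struct.unpack(ARP_FORMAT, payload)
    sender_ip = socket.inet_ntoa(fields[6])
    target_ip = socket.inet_ntoa(fields[8])
    return f"Who has {target_ip}? Tell {sender_ip}"


asker_mac = bytes.fromhex("001122334455")  # hypothetical MAC address
request = build_arp_request(asker_mac, "10.6.34.29", "10.6.34.55")
print(len(request), describe(request))
```

The point to take away is that the sender's MAC and IP ride along inside every request, so every device that hears the broadcast can learn something about the sender, not just the target.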
Back to the troubleshooting. When working with the user remotely, after unplugging and re-plugging the Ethernet cable or rebooting the device, I was suddenly able to ping the NVRs and reach them via web browser. The user confirmed the cameras would be reachable and operate normally for the guard on the local LAN and for the remote SOC, but as mentioned, after a few minutes they would disconnect. At this point I roped in a colleague, who also had no idea what was going on after reviewing the information at hand.
Now I know there are some folks out there who would close this ticket and say "Not a network problem, problem is with the device." We networkers know it's very common for the network team to solve other departments' issues, and I wanted to figure this out. I couldn't leave the user hanging. Maybe my memory deceives me, but I do remember this dragging on with some back-and-forth finger pointing and assistance from desktop support until we decided we had to go on site and figure out what was going on!
Visiting the location
With the blessing of our supervisor, a coworker and I went to the branch location to see the issue firsthand and troubleshoot further. After meeting with the user, replicating the problem and verifying the behavior, we decided to try a few things.
First we tried plugging it into an adjacent access switch in the same VLAN - same problem. Then we gave it a new static IP in another VLAN/network, the one used for standard users, to see if the issue was related to the isolated security VLAN - the issue still happened. We checked the device settings and nothing stood out. Firmware was on the latest version.
I recall at this point we were at a loss and getting desperate to isolate the exact root cause. I'm sure you or anyone would be as well - there has to be a reason for this to be happening! "This NVR works fine at X site," the user would tell us, "so it can't be the device itself, something is different at this location." Then the forbidden words were spoken: "it must be the network" (triggered).
So we got the NVRs moved and plugged the device into the upstream switch that the router was connected to (recall the example topology shown above has an upstream switch connected to the router, then the 2 distribution switches, then the rest of the downstream access switches), but the problem was still present. Finally it was suggested that we create a new test VLAN only on that switch connected to the router and put the NVR directly on it. After doing this, the NVR did not drop at all from the upstream ping or the viewing station! Problem solved, correct? Just leave it on that switch, right? WRONG!
It was time to get a Wireshark capture to try and figure out what was going on at the packet level. First I did a capture within the test VLAN to get a look at how things behaved there, then we would do a capture while replicating the problem to compare the difference.
Wireshark is the best tool you can use to see what is actually happening on the network. Being able to read packet captures (aka pcaps) is a valuable skill for any network engineer/administrator. Understand that performing a capture north of the WAN router/gateway would not have identified this problem; it was only found after reviewing a capture from the local LAN. Therefore, as I've said before in other posts, it's better to get both local and upstream captures when attempting an in-depth diagnosis.
PCAPs or it didn't happen
What we did next was set up a SPAN port to replicate all the traffic from the device port to our port, where a PC was running Wireshark. We would capture traffic in our new test VLAN where things were working normally, and then revert the NVR back to the original switch and port to analyze traffic there, where the issue was present.
You can see here I was able to get the ARP of the device, and I saw some traffic on the specific NVR TCP port. The capture would continue until the NVR stopped communicating, and as expected the traffic would stop in the capture. Thus communication was genuinely stopping here. Everything else appeared to be normal at first glance...
Note: The "Zhejiang" entries are the NVRs (Wireshark shows the vendor name it resolves from the MAC address prefix). As I mentioned, the network equipment here is Cisco.
Here is a quick screenshot of some back and forth traffic on the NVR port 37777. Things appear to be normal until traffic just inexplicably stops.
After this we left the site with our captures and went back to the office, leaving the devices in the problem state hoping to find a fix at a later time.
I spent some time combing over the captures trying to figure out what was going on, looking for an anomaly or something. Staring and comparing, filtering out the noise, looking at them packet by packet, flow by flow. At the same time I was going over the steps a device uses to communicate: "Okay, first it comes online, makes sure no other devices are using its own IP via ARP announcement, then it sends an address resolution request to the gateway," I was thinking.
After reviewing the captures in detail (5 in all: 1 where it was good and 4 replicating the problem, 2 for each NVR), I started to notice something interesting. There appeared to be a few different Cisco MAC addresses showing up in the ARP traffic. In addition, there seemed to be a lot more ARP traffic than expected for a Virtual LAN with only a couple of devices on it.
You might be thinking, well, you have multiple Cisco devices so this is expected. However, this is not the case, as there were no other devices configured for layer 3 on this LAN, so we shouldn't see any other layer 3 type communication from a Cisco device here (ARP is a prerequisite to layer 3 communication). All the switches were on their own management subnet, so they didn't have IP addresses on the security or user subnets. Only the router did.
I was also seeing a lot of Gratuitous ARP, and also ARP requests from a Cisco MAC with no sender IP address. They would say "Who has <IP>? Tell 0.0.0.0", which is strange, as an ARP request usually carries the requester's IP address. In a minute you will see what I was seeing in the capture. (Did you notice it in the picture above?)
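To make the difference between these ARP flavors concrete, here is a small Python sketch that labels an ARP payload by its opcode and its sender/target IP fields. This is my own rough classification for illustration, not a tool we used on the ticket: the helper names and the placeholder MAC are made up, and the "gratuitous" rule (sender IP equals target IP) is the common rule of thumb rather than an exhaustive definition.

```python
import socket
import struct

ARP_FORMAT = "!HHBBH6s4s6s4s"  # IPv4-over-Ethernet ARP payload, 28 bytes


def make_arp(opcode: int, sender_ip: str, target_ip: str) -> bytes:
    """Build a sample payload; the sender MAC is a made-up placeholder."""
    return struct.pack(
        ARP_FORMAT, 1, 0x0800, 6, 4, opcode,
        bytes.fromhex("02aabbccddee"), socket.inet_aton(sender_ip),
        b"\x00" * 6, socket.inet_aton(target_ip),
    )


def classify_arp(payload: bytes) -> str:
    """Label an ARP payload as probe, gratuitous, request, reply or other."""
    fields = struct.unpack(ARP_FORMAT, payload[:28])
    opcode = fields[4]
    sender_ip = socket.inet_ntoa(fields[6])
    target_ip = socket.inet_ntoa(fields[8])
    if opcode == 1 and sender_ip == "0.0.0.0":
        return "probe"        # "Who has <IP>? Tell 0.0.0.0"
    if sender_ip == target_ip:
        return "gratuitous"   # device announcing its own address
    return {1: "request", 2: "reply"}.get(opcode, "other")


for pkt in (
    make_arp(1, "0.0.0.0", "10.6.34.55"),
    make_arp(1, "10.6.34.55", "10.6.34.55"),
    make_arp(1, "10.6.34.29", "10.6.34.55"),
):
    print(classify_arp(pkt))
```

In Wireshark itself, a display filter along the lines of arp.src.proto_ipv4 == 0.0.0.0 should isolate the same "Tell 0.0.0.0" requests without any scripting.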
Continuing on, I logged into the router, which had the same MAC on all of its subinterfaces for each network. I recorded that and started trying to figure out what the other devices were. "Why am I seeing this?" I searched our monitoring software and logged in to each device at the location. Eventually I identified that the MAC addresses from the unexpected ARP requests matched the distribution switches at the location, which were Cisco Catalyst 3850 series.
Root Cause
"Perhaps I have found a bug," I was thinking, getting fun thoughts of grandeur at the idea of having found a Cisco bug! This was not the case though. After doing some Google searching and working with Cisco TAC, it turned out that this was a feature of the 3850 called IP Device Tracking (IPDT). It was expected behavior to see ARP probes with a source IP of 0.0.0.0 on VLANs attached to the 3850 (strictly speaking these are ARP probes rather than true gratuitous ARPs, but I'll keep calling them GARPs here). The switch used them to track the L2/L3 addresses of all devices within its reach. I can only speculate why the NVRs stopped working because of this; likely it had something to do with a poorly coded TCP/IP stack on them. But this is why the test VLAN that existed only on the upstream switch didn't manifest the problem - the VLAN wasn't extended to the distribution switches, so their probes never found their way to the NVR.
My first guess is duplicate IP address detection: the device thought its IP was already in use because it was seeing ARP probes for its own IP address. Alternatively, maybe the network video recorder was somehow replacing the MAC address of the correct default gateway with the 3850's MAC after receiving the device tracking GARP. Either way, it would seemingly stop using the IP address that was statically configured on it. I wasn't about to try and engage a CCTV vendor about a bad TCP/IP stack on their hardware, so I focused on the 3850 path to resolution. Maybe we could do something to stop this behavior from the switch, since we didn't even use that feature.
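For the duplicate-detection theory, here is a rough Python sketch of how I read the address conflict rules in RFC 5227 (IPv4 Address Conflict Detection). It is my paraphrase of the RFC, not the NVR's actual logic, which I never got to see. The class, the function name and the MAC addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ArpPacket:
    sender_mac: str
    sender_ip: str
    target_ip: str


def is_conflict(pkt: ArpPacket, my_mac: str, my_ip: str, still_probing: bool) -> bool:
    """Decide whether a received ARP packet signals an address conflict."""
    if pkt.sender_mac == my_mac:
        return False                  # our own packet reflected back at us
    if pkt.sender_ip == my_ip:
        return True                   # someone else claims to own our IP
    is_probe = pkt.sender_ip == "0.0.0.0"
    # A foreign probe for our IP only counts while we are still probing
    # for that address ourselves, i.e. right after the link comes up.
    return still_probing and is_probe and pkt.target_ip == my_ip


nvr_mac, nvr_ip = "00:11:22:33:44:55", "10.6.34.55"  # hypothetical NVR
switch_probe = ArpPacket("02:aa:bb:cc:dd:ee", "0.0.0.0", nvr_ip)

print(is_conflict(switch_probe, nvr_mac, nvr_ip, still_probing=True))
print(is_conflict(switch_probe, nvr_mac, nvr_ip, still_probing=False))
```

Under this reading, a tracking probe from the switch should be harmless once the device has claimed its address, and only risky if it lands while the device is still running its own probes. A stack that treats every probe for its own IP as a conflict, regardless of timing, would behave a lot like what we saw, but again that is speculation on my part.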
Within a minute of getting the GARP/device tracking probe from the 3850 (MAC a6:c in the screenshot is one) the NVR would become unresponsive on the network. You can see 4 different Cisco MAC addresses as well. None of these are the Cisco router/gateway.
I was informed by TAC that the code we were running was older and had IPDT enabled by default, but that in later releases it was turned off by default. Additionally, there were 3850s at other locations touching thousands of users which were running the same code, so it really didn't affect other devices negatively as far as we knew. That is how I came to the conclusion of a poorly coded TCP/IP stack on the NVR.
Later, after getting the proper approval, we made the change on the 2 distribution switches at the location to disable the IP device tracking feature (TAC said it was non-service-affecting and we didn't use anything that needed it). After a reboot of the NVR, lo and behold, the problem went away and the devices never went offline again! I don't recall the commands 100%, but after looking at the guides again I believe they were "nmsp attach suppress" and "no ip device tracking".
Finally, down the road we upgraded all the 3850s on the network to a higher code version that had bug fixes and the IPDT feature disabled. If we had actually been using that feature, we probably would have proceeded with the isolated VLAN on only the one switch as a workaround, or disabled IPDT on the special security VLAN only.
I never heard of another problem with that department's CCTV systems across the network again.
Takeaways and Tips
A couple of takeaways I have: you can't stop troubleshooting or driving for a resolution. The end user depends on you for help and not to give up. Don't be afraid to ask for suggestions from peers. Packets are the ultimate way to look at networks. Understand your protocols and develop the analytical mindset to spot the anomaly. Consider what is different between locations and compare settings and configurations between them. Also watch out for unexpected behavior caused by default settings.
Sometimes you have to go to the location (or use remote hands) to see the behavior authentically. Whether it's a performance issue or a connectivity issue like this one, being there is valuable because you get to talk to the users and see exactly what they are doing and what is happening. It also gives you an opportunity to perform some of the isolation steps we did in troubleshooting.
A lot of the time in Wireshark you will be looking at IP addresses. However, in this scenario it was essentially a Layer 2 local LAN issue, so for something like this it's good to filter in Wireshark by MAC address instead of IP, just to get a different look.
The filter would be eth.addr == <MAC address>, for example eth.addr == 00:11:22:33:44:55 (a made-up address). You can also expand the Ethernet header, right click on the MAC address of a frame, then select "Apply as Filter" in the pop-up menu and then "Selected". Lastly, you can also filter by protocol. In one of the screenshots you might notice I just typed "arp" in the filter to see all the ARP traffic. You can do this for other protocols like HTTP as well. Try out different filters and experiment a bit to help remove the noise or focus on something specific in your capture.
I hope you enjoyed reading this and that it helps you! Also consider subscribing to the mailing list so you never miss a post. Thank you!