Background
When working with networking vendors these days, AI and ML are terms that get thrown around as features that add value to your team by proactively monitoring your environment and alerting you to interesting or problematic things that you might not know exist. At face value, that sounds great! You don’t have to constantly check your network for issues because the system will bubble them up for you. And in an environment with close to 310,000 APs across multiple orgs and sites, it’s pretty much table stakes. The good news is that I’ve seen this work, as long as you’re checking those alerts. Today, these systems will most likely alert on their findings but not automatically remediate them; remediation still requires human intervention in most cases. However, what happens if the vendor’s AI/ML solution isn’t trained to find something in your environment that it should, or something that you might be interested in looking at? That’s where knowing how to code and work with APIs really comes in handy. I’m going to walk you through two different scenarios where writing my own code against one of our WLAN vendor’s APIs led to insightful discoveries in our environment and helped find a few needles in a massive haystack. One of the scenarios even led to the vendor implementing their own proactive checks in the platform, which has sparked an auto RMA process for us.
Disclaimers
Disclaimer #1: This post is not meant to bash on Mist. Every vendor that I’ve ever worked with has had problems and at our scale, we’re bound to run into a few issues. It just so happens that the issue found in scenario 1 was within our Mist environment.
Disclaimer #2: This is not a post where I teach you how to write code. I’ll include some examples and screenshots, but learning how to code takes time. I’ve been writing code in Python in some way since 2017 and in Go since last year. My experience as a kid writing my own webpages in HTML or building mIRC scripts has helped me quickly learn, but I still struggle. I’m still not a great coder and this has a lot to do with the fact that it’s not my primary responsibility. I write code when I can and when I need it to help make my job and life or my team’s lives easier.
Disclaimer #3: Every vendor that I’ve worked with has some sort of limitation on the number of API calls you can send. Make sure you are aware of those limitations before you start banging away at their API. Please API responsibly.™️
Scenario 1
Someone from our NOC reached out about an AP that was online, but not accepting any clients on the 5GHz radio. After further examination, I noticed that the AP’s 5GHz radio didn’t show a BSSID or any information about the channel or power when viewing from the dashboard (Fig. 1).
Knowing that the UI sometimes “hides” certain details with the goal of making management easier on less knowledgeable admins, I decided to take a peek at the API for any available details that might shed some light on what the problem was. The API call I needed was /api/v1/sites/:site_id/stats/devices, where “:site_id” is the character string you’ll typically see at the end of the URL when viewing the dashboard.
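For anyone who wants to try that call from code, here is a minimal sketch in Python using only the standard library. The host `api.mist.com` and the `Authorization: Token ...` header format are assumptions on my part (the host differs between cloud regions), so check both against the API documentation for your own environment.

```python
import json
import urllib.request

# Assumption: global-cloud host; other cloud regions use a different hostname.
BASE_URL = "https://api.mist.com"


def stats_url(site_id: str) -> str:
    """Build the device-stats URL for one site."""
    return f"{BASE_URL}/api/v1/sites/{site_id}/stats/devices"


def build_request(site_id: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) the GET request for a site's device stats."""
    # Assumption: API tokens are sent as "Authorization: Token <token>".
    return urllib.request.Request(
        stats_url(site_id),
        headers={"Authorization": f"Token {token}", "Accept": "application/json"},
    )


def fetch_device_stats(site_id: str, token: str) -> list:
    """Send the request and decode the JSON body (assumed to be a list of devices)."""
    with urllib.request.urlopen(build_request(site_id, token), timeout=30) as resp:
        return json.load(resp)
```

Calling `fetch_device_stats(site_id, token)` with a real site ID and token should return one entry per device at that site. Splitting out `build_request` makes it possible to inspect the URL and headers without touching the network.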
Now, there’s more than one way to interact with an API, especially Mist’s. You could enter the API call directly into your web browser, use an app like Postman, or write code in your favorite language to interact with it (e.g. using the requests library in Python). I’m not going to go into extreme detail on how to do any of those since there are likely hundreds of blog posts and articles out there that can teach you better than I can. However, if you have specific questions and you think I can help, please feel free to reach out and I’ll do the best that I can. My process for interacting with APIs is fairly simple:
- For quick and well-known things, I’ll use a web browser.
- If I’m trying to proof of concept something or I’m working with an API for the first time, I’ll typically use Postman.
- Once I’m ready to automate and scale up, I’ll begin to write code.
Below (Fig. 4) is what I found when I ran the above API call for the entire site. I found the AP’s name in question and then scrolled up to the radio_stat JSON object and focused on the band_5 object. Notice the channel as well as the bandwidth and power are all showing as 0? Seems odd, doesn’t it? Also notice that there’s a disabled key that’s showing as true even though the device profile is configured to enable the radio. For some reason, I don’t remember that particular key being present back when we first discovered this issue in October of 2022, but my memory is too bad to say that definitively and I probably don’t have the original screenshots from back then to prove it either. Either way, this is a good indicator that something’s wrong!
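The check described above can be sketched in a few lines of Python. The `radio_stat` and `band_5` names and the `channel`, `bandwidth`, `power`, and `disabled` keys come from the API response discussed here; the `band_24` name and the sample device are made up for illustration, and this is not the exact logic from my script.

```python
def radio_looks_dead(radio: dict) -> bool:
    """True when a radio explicitly reports channel, bandwidth, and power as 0."""
    return all(radio.get(key) == 0 for key in ("channel", "bandwidth", "power"))


def dead_bands(device: dict) -> list:
    """Return the names of the bands in radio_stat that look dead."""
    radio_stat = device.get("radio_stat") or {}
    return [
        band
        for band, radio in radio_stat.items()
        if isinstance(radio, dict) and radio_looks_dead(radio)
    ]


# Made-up device entry shaped like the response described in the text.
sample = {
    "name": "AP-EXAMPLE-01",
    "radio_stat": {
        "band_24": {"channel": 6, "bandwidth": 20, "power": 8},  # assumed band name
        "band_5": {"channel": 0, "bandwidth": 0, "power": 0, "disabled": True},
    },
}

print(dead_bands(sample))  # ['band_5']
```

Only an explicit 0 counts as a hit, so a device that is offline and returns no `radio_stat` at all is not reported as having a dead radio.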
To try and fix the issue, we rebooted the AP to see if that would help, and it didn’t. Without any other known tools available to help with this radio’s state, a ticket was opened with support, but at that point, curiosity struck me. I wondered how many other APs were lurking in our environment in a condition like this. So I set off to write a script that would scan our entire environment, which includes over 4,000 sites, to see how many APs were in this condition. After a few days of on/off coding and an initial scan that took 1.5 hours, the result at that time was 173 out of 300,000+ APs. Needles, meet giant haystack. Unless someone had directly come across those 173 APs, they would have just sat there, wasting away and not servicing clients on any non-working radios. We pay these APs too dang well for them to not work! You might be wondering if there was anything else I learned besides the number of APs in this state… Here’s a list:
- Some APs only had the 2.4 GHz radio in this non-working state while others only had the 5 GHz radio affected
- Some APs had both radios in this state
- In some cases, multiple reboots of the AP would eventually clear and fix the condition while in the majority of the cases no amount of reboots would help; the APs that couldn’t be fixed through software or reboots were marked for RMA
- After some of the RMA’d APs made it back to Mist, water damage was found in some of them and this issue was likely the result of that damage
- There was mention of a radio defect with the chipset in use, but to this day, we still don’t have a clear answer on what that defect is
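The environment-wide scan behind those findings boils down to a loop over sites. Below is a rough sketch of that shape, not my actual script: `scan_sites`, its parameters, and the fake data are all hypothetical, and the fetch function is passed in so the example runs without touching a real API.

```python
import time
from typing import Callable, Iterable


def scan_sites(site_ids: Iterable[str],
               fetch: Callable[[str], list],
               check: Callable[[dict], list],
               pause_seconds: float = 0.0) -> list:
    """Fetch device stats per site and collect every device that fails `check`."""
    findings = []
    for site_id in site_ids:
        for device in fetch(site_id):
            bands = check(device)
            if bands:
                findings.append({
                    "site_id": site_id,
                    "name": device.get("name"),
                    "model": device.get("model"),
                    "bands": bands,
                })
        if pause_seconds:
            time.sleep(pause_seconds)  # stay under the vendor's API rate limit
    return findings


def zeroed_bands(device: dict) -> list:
    """Hypothetical check: bands whose channel, bandwidth, and power are all 0."""
    stats = device.get("radio_stat") or {}
    return [band for band, radio in stats.items()
            if all(radio.get(k) == 0 for k in ("channel", "bandwidth", "power"))]


# Fake two-site "API" so the sketch is runnable offline.
fake_api = {
    "site-a": [{"name": "ap1", "model": "X",
                "radio_stat": {"band_5": {"channel": 0, "bandwidth": 0, "power": 0}}}],
    "site-b": [{"name": "ap2", "model": "X",
                "radio_stat": {"band_5": {"channel": 36, "bandwidth": 40, "power": 15}}}],
}

results = scan_sites(fake_api, fake_api.__getitem__, zeroed_bands)
print(results)
```

In a real run, `fetch` would wrap the stats API call for one site, and `pause_seconds` is one simple way to respect the rate limits mentioned in Disclaimer #3.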
Mist eventually wrote some logic in their backend to proactively discover these APs and began generating automatic RMAs for them on a bi-weekly basis which was extremely helpful, but the lack of visibility into what APs were potentially affected in our environment and having to wait up to 2 weeks for them to send a list of the auto RMAs bothered me. So I refactored my code to do a few things:
- I rewrote the code in Golang to take advantage of Golang’s concurrency. This drove the script run time down from about 1.5 hours to about 30 minutes because orgs were given access to a dedicated Go channel which allowed the orgs to be scanned in parallel instead of sequentially.
- Instead of printing the results to the screen like I typically do for quick scripts, I decided to store them into a PostgreSQL database. This way I could keep track of my daily scan results much easier and also perform queries against the database looking for different trends like model type, hardware revision, or even version of code (Fig. 5).
- I wanted to make the results easy to view for my team and anyone else who wanted to see them. So I created a Grafana dashboard hosted on another team’s Grafana instance and queried the database to display certain results (Fig. 6). This is where the true value came in. Now I could send anyone in our org to a URL and they could view the data without ever needing to come to me. All I had to do was make sure the script ran daily!
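To give a feel for the first two changes in that list, here is a small sketch. It is deliberately not my Go code: it uses Python’s `ThreadPoolExecutor` as a stand-in for Go’s concurrency and an in-memory SQLite database as a stand-in for PostgreSQL. The table name, column names, model names, and scan date are all made up for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def scan_orgs_concurrently(org_ids, scan_org, max_workers=8):
    """Run scan_org(org_id) for every org in parallel threads and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_org = pool.map(scan_org, org_ids)  # results come back in input order
        return [row for rows in per_org for row in rows]


def store_findings(conn, findings, scan_date=None):
    """Insert one row per affected radio, stamped with the scan date."""
    scan_date = scan_date or date.today().isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS radio_findings ("
        "scan_date TEXT, org_id TEXT, ap_name TEXT, model TEXT, band TEXT)"
    )
    conn.executemany(
        "INSERT INTO radio_findings VALUES (?, ?, ?, ?, ?)",
        [(scan_date, f["org_id"], f["name"], f["model"], f["band"]) for f in findings],
    )
    conn.commit()


# Fake per-org scan results so the sketch runs offline.
fake = {
    "org-1": [{"org_id": "org-1", "name": "ap1", "model": "M1", "band": "band_5"}],
    "org-2": [{"org_id": "org-2", "name": "ap2", "model": "M1", "band": "band_24"},
              {"org_id": "org-2", "name": "ap3", "model": "M2", "band": "band_5"}],
}

conn = sqlite3.connect(":memory:")
findings = scan_orgs_concurrently(fake, fake.get)
store_findings(conn, findings, scan_date="2023-01-01")

# The kind of trend query a dashboard panel could run.
by_model = conn.execute(
    "SELECT model, COUNT(*) FROM radio_findings GROUP BY model ORDER BY model"
).fetchall()
print(by_model)  # [('M1', 2), ('M2', 1)]
```

Because the scan is mostly waiting on HTTP responses, threads are a reasonable fit in Python; keeping the date on every row is what makes day-over-day trends possible in a dashboard.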
Scenario 2
Once again, our NOC reached out to us with a problem. Long story short, the problem appeared to be caused by some local interference on a channel, and we were able to confirm that by statically setting the affected AP to a different channel. With as many APs as we have, we have chosen to use the RRM solution from whatever vendor we use for the large majority of our deployments. There have to be unique circumstances or design challenges for us to use static channel plans. So the goal was for that break/fix change to a static channel to be temporary, and for the AP’s radio to eventually be configured back to leveraging RRM. When I checked back on that same AP nearly 3 weeks later, its radio config was still statically set. Once again, a lightbulb went off in my head. If this AP was forgotten about, how many other times has this happened, where an AP’s radio(s) was set to a specific channel and/or power level for troubleshooting purposes and then left that way indefinitely? With my script from scenario 1 in mind, I decided to refactor the code to check for this particular circumstance as well. But first, I had to figure out what a statically configured AP radio looked like from the API’s perspective.
Using the same API call from scenario 1 above, I was able to get the desired information. The only difference was that I had to look at the radio_config JSON object instead of radio_stat. As you can see in the screenshot below (Fig. 8), just like in scenario 1, focusing on the channel key was important. In instances where a radio is using RRM, there is no channel key present in the radio’s config object. In instances where a radio is statically configured, the channel key will be an integer above 0 (it will actually be a valid channel like 1, 6, 11, 36, 52, or 149, but checking for anything above 0 worked for my script’s logic).
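That rule translates into a short check. The sketch below follows the logic described above (a channel key above 0 means static, no channel key means RRM); the band names and the sample device are assumptions for illustration rather than a copy of my script.

```python
def static_bands(device: dict) -> list:
    """Return the bands in radio_config that carry a channel value above 0."""
    radio_config = device.get("radio_config") or {}
    found = []
    for band, cfg in radio_config.items():
        if not isinstance(cfg, dict):
            continue  # radio_config may also hold settings that are not per-band
        channel = cfg.get("channel")
        # bool is a subclass of int in Python, so rule it out explicitly.
        if isinstance(channel, int) and not isinstance(channel, bool) and channel > 0:
            found.append(band)
    return found


# Made-up device: 2.4 GHz pinned to channel 6, 5 GHz left to RRM (no channel key).
sample = {
    "name": "AP-EXAMPLE-02",
    "radio_config": {
        "band_24": {"channel": 6, "power": 10},
        "band_5": {"power": 12},
    },
}

print(static_bands(sample))  # ['band_24']
```

Any device where this returns a non-empty list has at least one statically configured radio, which can then be compared against the list of APs that are supposed to be static.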
With the script refactored, I was able to find 2,190 APs with at least one statically configured radio. Of those 2,190, 92 weren’t expected to be configured in that way. At the end of the day, that’s not a lot (about 0.03%) when you compare it to the overall 310,000, right? However, my opinion is that consistency is key in any environment, especially one this big. When you start to get sloppy and off course, it starts to creep into other areas and bad things tend to start happening. Besides, it gave me an opportunity to write code and sharpen that particular toolset, which is not something I get to do a lot of. That’s always a good thing, because you never know what problem you’ll be trying to solve in the future that will require a programmatic approach!
Neither one of these scenarios could have been easily found at the scale we operate at by just cruising through the dashboard. It’s simply not feasible with that many APs across that many sites. Nor were these scenarios something the vendor provided alerts or visibility on through their AI/ML solution. Obtaining the answers to both of the questions that I had about our environment was only possible with code that could go out and do the work for us. Again, not some pretty dashboard, not AI, just a few hundred lines of code. And therein lies the beauty of knowing how to write code. Even if you’ll never be a coding rockstar, you can still do some amazing things!
I’ll leave you with this question… What types of problems are lurking in your environment that could be found or highlighted programmatically?
As always, please reach out on LinkedIn or Twitter if you want to continue the conversation!