Automating Cisco IOS updates with Unimus - Part 2

Intro

In Part 1 of our Cisco IOS upgrade automation series, we focused on a simple and quick solution to upgrade (or downgrade) Cisco IOS devices with just a couple of commands in Unimus, deployed through our Mass Config Push functionality.

Today we will continue this endeavor and show you a more detailed and advanced solution. This article attempts to create a one-stop solution by leveraging TCL scripting to make updating all your IOS devices easy. All IOS-powered devices should be able to update using this guide and script, regardless of whether they are a router or a switch, and also regardless of the product series. All at the same time.

Let's start with what we need. In Part 1 we already described that we need a server/device to source upgrade FW images from, Cisco IOS devices and Unimus. Here is an abstract component diagram:

We will be using the same "topology" as in the previous article, but we will build upon it with some additions:

  • FW image source - this time it will store not just the IOS FW images, but also two scripts. One is a bash script that generates the list of FW images in which devices look for an upgrade candidate; the other is a dedicated upgrade TCL script.
  • Cisco IOS devices. These will be downloading the TCL upgrade script, list of available images, and lastly the FW image file to update to (assuming the devices find an update candidate in the image list).
  • Unimus and our Mass Config Push feature (version 2.1.0 and newer) with a slightly different set of commands to push compared to Part 1's preset.

Before we start, let us add a disclaimer. Large scale infrastructure automation is almost never easy. While working on this script, I encountered a number of quirks that were unique to one device but not to another. Keep in mind that your Mass Config Push preset may well finish with errors on some devices - I expect that to happen.

If errors happen, don't hesitate to check out the troubleshooting FAQ at the end or contact us. It will take a number of picky devices to iron out all weird behaviors and inconsistencies across all the various IOS versions. At least that was definitely my own experience.

And lastly - don't forget to test everything in a lab environment before deploying to your production network.

Preparing the Image / FW source

Contrary to Part 1, we will have higher requirements for the image source server. Once again, we will use a VM running Linux for our showcase. As outlined above, we will be generating a list of available FW images with their MD5 sums, which the upgrade TCL script uses to verify each downloaded file. Checking image validity before deploying it is one of the first big features of this upgrade process.

Let's reiterate. The server VM will now "serve" the following purposes:

  • We will serve our IOS images from this machine. Note that we are focusing only on IOS images in .bin format, not archived images in .tar format. This is done so that we can avoid as many issues as possible with devices with insufficient free space, or ones with smaller flash capacity. These wouldn't be able to fit multiple IOS images even if their flash was empty.
  • We will generate a sorted list (supporting IOS versioning) of FW images with their respective MD5 sums and make this available for our devices to download.
  • We will serve our upgrade TCL script. Our IOS devices will download and execute this to find an update image candidate, download it, verify it, and set it as the boot image.

FW Transfer methods

Just like last time, we are focusing on the SCP and HTTP protocols. Usually we see networkers using TFTP or FTP, but we seldom see anyone choosing SCP or HTTP. SCP and HTTP are much more robust protocols, and we wanted to showcase these less popular options.

If you are interested in more information on both, including some steps that are probably necessary to make SCP work with all your devices, please refer to the Part 1 article. Alternatively, jump to the protocol error entry of the troubleshooting FAQ at the end of this article, where we describe possible issues with SCP and how to deal with them.

With that said, let's prepare our FW image source server. First, where to put the files:

SCP

If you opt for SCP, you will be placing your files into the home folder of a user you choose. Let's say we create a dedicated user called unimus for this. By default, that user's home folder is /home/unimus.

HTTP

If you opt for HTTP, you will be using your site's root folder. In our case, we use NGINX, and the root folder for the web location we will use is /var/www/ciscoiosupgrade.netcore.internal.

Choose the directory according to the method you chose for the IOS image hosting. Note that in both cases the script only supports files placed in the main (root) directory from which they are served, not in sub-directories.
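To catch images that were accidentally left in a sub-directory, a small check like the one below can help. It is only a sketch: the helper name find_nested_images and the scratch directory are hypothetical, and the file names are empty dummies created for the demo.

```shell
#!/bin/bash
# Sketch: list .bin images that sit in sub-directories of the served directory.
# Such files would be missed, because only the top level is supported.
# The function name and the demo paths are hypothetical.

find_nested_images() {
    # -mindepth 2 skips files located directly in the served directory
    find "$1" -mindepth 2 -type f -name '*.bin' | sort
}

# Demo in a scratch directory with empty dummy files
serve_dir=$(mktemp -d)
mkdir -p "$serve_dir/archive"
: > "$serve_dir/c3750e-universalk9-mz.152-4.E10.bin"
: > "$serve_dir/archive/c3750e-universalk9-mz.122-55.SE5.bin"
find_nested_images "$serve_dir"
```

Any path printed by the helper points at an image that should be moved up into the served directory, or removed.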

In these directories you also need to place both scripts, which you can find below. And with that being said, let's introduce both scripts and look at the whys and whats of each of them.

Overview of the FW image parsing bash script

This is a very basic script, but still neat. It takes all files with a .bin extension in its own directory, runs them through the Linux md5sum command, sorts the output with sort -V so that version numbers in file names are ordered correctly, and writes the result to a file called fwlist.

#!/bin/bash

#Change into the directory this script is stored in, so *.bin is resolved there
cd "$(dirname "$0")" || exit 1

#Extract file names and MD5 sums, sort them and output into a file
md5sum *.bin | sort -V -k2 > fwlist

Here is an example of what the finished list can look like:

a63c90cc3684ad8b0a2176a6a8fe9005  c180x-advipservicesk9-mz.151-4.M12a.bin
6d0bb00954ceb7fbee436bb55a8397a9  c1900-universalk9_npe-mz.SPA.158-3.M7.bin
28518159ba5f75ef0eeb9617fd35e2ba  c2800nm-advipservicesk9-mz.124-24.T4.bin
441018525208457705bf09a8ee3c1093  c3750e-ipbasek9-mz.122-55.SE5.bin
862dec5c27142824a394bc6464928f48  c3750e-universalk9-mz.122-55.SE5.bin
fd4b38e94292e00251b9f39c47ee5710  c3750e-universalk9-mz.152-4.E10.bin
1f94dacb4faf2829b0ffbb25ebd62e2e  c3750-ipbasek9-mz.150-2.SE5.bin
b28cf0ed5cc0d1928ea4f6656e1c8dde  c3750-ipservicesk9-mz.122-55.SE12.bin
871bdd96b159c14d15c8d97d9111e9c8  cat4500-ipbasek9-mz.150-2.SG11.bin
3287282fa1a1523a294fb018e3679872  s72033-adventerprise_wan-vz.122-33.SXI14.bin
a302a771ee0e3127b8950f0a67d17e49  s72033-ipbase-mz.151-2.SY16.bin
bbf7c6077962a7c28114dbd10be947cd  s72033-ipservicesk9-mz.151-2.SY16.bin
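Because fwlist is in the regular md5sum output format, it can also be checked against the files on the server itself. The following is a minimal sketch assuming GNU coreutils (md5sum -c and sort -V); the helper names make_fwlist and verify_fwlist are hypothetical, and the image files are small dummies created in a scratch directory.

```shell
#!/bin/bash
# Sketch: build fwlist with the same pipeline as the script above, then verify it.
# Assumes GNU coreutils. Helper names and dummy files are illustrative only.

# Build the list inside the given directory
make_fwlist() {
    (cd "$1" && md5sum -- *.bin | sort -V -k2 > fwlist)
}

# Re-check every listed file against its recorded MD5 sum.
# Returns non-zero if a file is missing or was changed after the list was built.
verify_fwlist() {
    (cd "$1" && md5sum -c --quiet fwlist)
}

demo_dir=$(mktemp -d)
printf 'old' > "$demo_dir/c3750e-universalk9-mz.122-55.SE5.bin"
printf 'new' > "$demo_dir/c3750e-universalk9-mz.152-4.E10.bin"
make_fwlist "$demo_dir"
cat "$demo_dir/fwlist"
verify_fwlist "$demo_dir" && echo "fwlist matches the files on disk"
```

Rerunning the list generation after every image change keeps the sums current; a failing check usually means an image was replaced without regenerating fwlist.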

Overview of the upgrade TCL script

tclsh

#Load and store list of available FW images
set fwrawlist [read [open fwlist]]

#Retrieve FW type of the current device for further processing
set devicefwdirty [exec show version | include System image file is]
#Get a full name of the current FW
regexp -all {:(.*?)\"} $devicefwdirty junk devicefwfull
#Get a FW type with a major release version, this prevents issues with some devices which might not be compatible (mainly storage constraints) with the next major release version
regexp -all {:(.*?-.*?-.*?\.\d\d).*?.bin} $devicefwdirty junk devicefwrelease

#Find FW update candidate
puts "Finding a viable FW update candidate..."
#Process the list, return the latest match (it will be the latest FW image)
set fwlistparsed [regexp -all -line "\\s{2}($devicefwrelease.+?.bin)$" $fwrawlist junk down_file]
#If no match is found, abort the script
if {$fwlistparsed == 0} {
    puts "List of available FWs does not contain any update candidate. Aborting..."
    return
}
#Run fullname string comparison, if matched, current and matched FW are identical, abort the script
if {[string compare $devicefwfull $down_file] == 0} {
    puts "Current and matched FW image are identical. Aborting..."
    return
}
#If the current and matched FW are not identical, start comparing them on each level
#Compare major release version
regexp -all {\.([0-9]{1,4})\-} $devicefwfull junk curfwmatch1
regexp -all {\.([0-9]{1,4})\-} $down_file junk newfwmatch1
if {$curfwmatch1 == $newfwmatch1} {
    #Major release version is identical, compare minor release version
    regexp -all {\-([0-9]{1,3})\.[a-zA-Z]*?[0-9]*?[a-zA-Z]*?\.bin} $devicefwfull junk curfwmatch2
    regexp -all {\-([0-9]{1,3})\.[a-zA-Z]*?[0-9]*?[a-zA-Z]*?\.bin} $down_file junk newfwmatch2
    if {$curfwmatch2 == $newfwmatch2} {
        #Minor release version is also identical, compare revision
        regexp -all {\-[0-9]{1,3}\.[a-zA-Z]*?([0-9]*?)[a-zA-Z]*?\.bin} $devicefwfull junk curfwmatch3
        regexp -all {\-[0-9]{1,3}\.[a-zA-Z]*?([0-9]*?)[a-zA-Z]*?\.bin} $down_file junk newfwmatch3
        if {$curfwmatch3 == $newfwmatch3} {
            #Revision is also identical, which suggests some other problem in versioning or FW naming
            puts "Current and matched FW image and their (numeric) version seem identical, but full names are not. Aborting..."
            return
        } elseif {$curfwmatch3 > $newfwmatch3} {
            puts "Current FW image is newer than the matched one from the list of available FWs. No update candidates were found. Aborting..."
            return
        } elseif {$curfwmatch3 < $newfwmatch3} {
            puts "Update candidate found."
        } else {
            puts "Unknown error occurred during FW matching. Aborting..."
            return
        }
    } elseif {$curfwmatch2 > $newfwmatch2} {
        puts "Current FW image is newer than the matched one from the list of available FWs. No update candidates were found. Aborting..."
        return
    } elseif {$curfwmatch2 < $newfwmatch2} {
        puts "Update candidate found."
    } else {
        puts "Unknown error occurred during FW matching. Aborting..."
        return
    }
} elseif {$curfwmatch1 > $newfwmatch1} {
    puts "Current FW image is newer than the matched one from the list of available FWs. No update candidates were found. Aborting..."
    return
} elseif {$curfwmatch1 < $newfwmatch1} {
    puts "Update candidate found."
} else {
    puts "Unknown error occurred during FW matching. Aborting..."
    return
}

#Download FW update
#Read common arguments, abort if mandatory arguments are missing, and decide which protocol will be used
set down_prot [lindex $argv 0]
if {[string length $down_prot] == 0} {
    puts "No argument was defined, please add arguments to your MCP in Unimus where you execute this command. Aborting..."
    return
}
set down_addr [lindex $argv 1]
if {[string length $down_addr] == 0} {
    puts "Second argument (address) is missing. Aborting..."
    return
}
#Read HTTP specific arguments and build download URL
if {[string compare http $down_prot] == 0} {
    set down_port [lindex $argv 2]
    if {[string length $down_port] == 0} {
        #Use port 80 if no custom port is defined
        set down_port "80"
    }
    set down_url "http://$down_addr:$down_port/$down_file"
#Read SCP specific arguments and build download URL
} elseif {[string compare scp $down_prot] == 0} {
    set down_user [lindex $argv 2]
    if {[string length $down_user] == 0} {
        puts "Third argument (user) is missing. Aborting..."
        return
    }
    set down_pass [lindex $argv 3]
    if {[string length $down_pass] == 0} {
        puts "Fourth argument (password) is missing. Aborting..."
        return
    }
    set down_url "scp://$down_user:$down_pass@$down_addr/$down_file"
} else {
    puts "Unrecognized protocol. Aborting..."
    return
}
puts "Downloading firmware..."
set down_result [exec copy $down_url flash:]
#Evaluate download result
if {[regexp {bytes copied} $down_result]} {
    puts "Update FW image was downloaded successfully."
} elseif {[regexp {Not enough space} $down_result]} {
    puts "Error occurred during download - insufficient space left on device. Aborting..."
    return
} elseif {[regexp {Protocol error} $down_result]} {
    puts "Error occurred during download - protocol error. Aborting..."
    return
} elseif {[regexp {busy} $down_result]} {
    puts "Error occurred during download - device is busy. Aborting..."
    return
} else {
    puts "Unknown error occurred during download. Aborting..."
    return
}

#Validate MD5 of the downloaded FW image
puts "Validating integrity..."
#Run validation for the downloaded FW image
set down_file_md5check [exec verify /md5 $down_file]
regexp -all -line "=\\s{1}(.+?)$" $down_file_md5check junk down_file_md5
regexp -all -line "(.+?)\\s{2}$down_file" $fwrawlist junk fwmd5tocomp
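#Compare both MD5 sums (trimmed, as device output may carry trailing whitespace); string compare returns 0 when they are equal
if {[string compare [string trim $down_file_md5] [string trim $fwmd5tocomp]] == 0} {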
    puts "Update FW image validated successfully."
} else {
    puts "Unknown error occurred when validating update FW image integrity, MD5 sums do not match. Aborting..."
    return
}

#Set up system boot image with the downloaded update FW image
puts "Updating..."
ios_config "boot system flash:$down_file"
puts "Update is ready. Please run your reload MCP preset..."

Here is a breakdown of the workflow of this script:

  • Script ingests a list of available FW images from the pre-generated list we prepared earlier (with the bash script, the fwlist file).
  • Script checks the current version of IOS running on the device.
  • Script compares the current IOS version to all available FW images - note we are matching only the same major release version (if your device is running a 12.X release, you will be able to upgrade only to a newer version of the 12.X release, not to a newer major release like 15.X or 17.X).
  • If the script matches multiple upgrade candidate image files, it will always choose the last match, which will be the latest FW image (thanks to version-aware sorting in the list of FW images).
  • Script processes input arguments, evaluates them and builds a download URL for the device to download the new FW image.
  • A new FW image is downloaded. Its integrity is verified by comparing its MD5 sum with the known good sum from the image server.
  • If the integrity checks out, the script sets up the FW image as the boot image and returns a final message informing the user to reload the device.
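The candidate selection step can be illustrated outside of IOS with standard Linux tools. The sketch below is not the TCL script's code: pick_candidate is a hypothetical helper that approximates the same idea (keep the image name up to the two-digit major release, then take the last matching line of the version-sorted list). It does not perform the later version comparison.

```shell
#!/bin/bash
# Sketch of the candidate selection described above, using sed and awk.
# pick_candidate is a hypothetical helper, not part of the TCL script.

# $1 = name of the image currently running, $2 = path to fwlist
pick_candidate() {
    local current=$1 list=$2 prefix
    # Keep "platform-featureset-type." plus the two-digit major release,
    # e.g. "c3750-ipservicesk9-mz.12" from "c3750-ipservicesk9-mz.122-55.SE5.bin"
    prefix=$(printf '%s\n' "$current" \
        | sed -E 's/^([^-]+-[^-]+-[^-]*\.[0-9]{2}).*$/\1/')
    # fwlist is version-sorted, so the last line with that prefix is the newest
    awk -v p="$prefix" 'index($2, p) == 1 { name = $2 }
        END { if (name != "") print name }' "$list"
}

# Demo with a shortened list taken from the example output above
list=$(mktemp)
cat > "$list" <<'EOF'
862dec5c27142824a394bc6464928f48  c3750e-universalk9-mz.122-55.SE5.bin
fd4b38e94292e00251b9f39c47ee5710  c3750e-universalk9-mz.152-4.E10.bin
b28cf0ed5cc0d1928ea4f6656e1c8dde  c3750-ipservicesk9-mz.122-55.SE12.bin
EOF
pick_candidate "c3750-ipservicesk9-mz.122-55.SE5.bin" "$list"
```

A device running a 12.X image of one feature set only ever matches 12.X images of that same platform and feature set, which is why the c3750e entries are ignored in the demo.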

We intentionally don't reload devices here. In a real network, a switch higher in the topology could finish and reload while devices that depend on it are still downloading; their downloads would fail, and the Mass Config Push from Unimus along with them.

As you can see in the script, we tried our best to add a number of comments to describe parts of the code and what they do to make it easier for anyone to understand and even modify the code according to their needs.

Preparing Unimus and Mass Config Push presets for IOS upgrade

Here are the Config Push presets we will be using to pull the FW images and perform the upgrade and reload the devices:

Config Push preset 1 - Upgrade devices

Just like the last time, we will run the upgrade in a single Config Push preset. This preset will download the necessary files. Note the use of tclsh, or TCL shell, and log_user command set to 0 before running copy commands - we did this to suppress the outputs of those two commands, which would otherwise generate some unpredictable output, creating unwanted output groups. This way, we make sure that any actual output will be generated by the script itself.

SCP

tclsh
log_user 0
exec "copy scp://SCP_USER:SCP_PASS@FW_SRC_ADDR/fwlist flash:"
exec "copy scp://SCP_USER:SCP_PASS@FW_SRC_ADDR/ios_upgrade.tcl flash:"
tclquit
tclsh ios_upgrade.tcl PROTOCOL FW_SRC_ADDR SCP_USER SCP_PASS

Where:

SCP_USER - SCP user
SCP_PASS - SCP password
PROTOCOL - scp or http - protocol used to download an FW image
FW_SRC_ADDR - IP or hostname of FW image source device

Please replace all the example values with your actual ones. Don't forget to check Require "enable" (privileged-exec) mode for this Config Push.

HTTP

tclsh
log_user 0
exec "copy http://FW_SRC_ADDR/fwlist flash:"
exec "copy http://FW_SRC_ADDR/ios_upgrade.tcl flash:"
tclquit
tclsh ios_upgrade.tcl PROTOCOL FW_SRC_ADDR FW_SRC_PORT

Where:

PROTOCOL - scp or http - protocol used to download an FW image
FW_SRC_ADDR - IP or hostname of FW image source device
FW_SRC_PORT - OPTIONAL - define this argument only if your webserver is listening on a port other than 80, otherwise remove this argument altogether, script will default to port 80

Please replace all the example values with your actual ones. Don't forget to check Require "enable" (privileged-exec) mode for this Config Push.
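To make the meaning of the preset arguments easier to follow, here is a sketch of how they turn into a download URL. It mirrors the argument handling of the TCL script in shell form; build_url is a hypothetical helper, and the file name, address and credentials are the example values used in this article.

```shell
#!/bin/bash
# Sketch: assemble the download URL from the arguments passed in the preset.
# build_url is a hypothetical helper; the TCL script does this internally.

# Usage: build_url FILE http ADDR [PORT]
#        build_url FILE scp  ADDR USER PASS
build_url() {
    local file=$1 prot=$2 addr=$3
    if [ -z "$prot" ] || [ -z "$addr" ]; then
        echo "protocol and address are mandatory" >&2
        return 1
    fi
    case $prot in
        http)
            # The port is optional and falls back to 80
            local port=${4:-80}
            echo "http://$addr:$port/$file"
            ;;
        scp)
            local user=$4 pass=$5
            if [ -z "$user" ] || [ -z "$pass" ]; then
                echo "SCP needs a user and a password" >&2
                return 1
            fi
            echo "scp://$user:$pass@$addr/$file"
            ;;
        *)
            echo "unrecognized protocol: $prot" >&2
            return 1
            ;;
    esac
}

build_url fwimage.bin http 10.30.50.70
build_url fwimage.bin http 10.30.50.70 8080
build_url fwimage.bin scp 10.30.50.70 unimus scppass8520
```

Note that with SCP the credentials end up in the URL, and therefore in the preset, in plain text.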

Config Push preset 2 - Reload devices

tclsh
exec "reload in 3"
tclquit

This simple preset schedules a reload of the devices in 3 minutes. Feel free to change it to your needs: the reload in command accepts two formats, MMM or HHH:MM, or you can switch to reload at and define a specific date and time for the reload instead.

Let me quickly address why we use the TCL shell to execute a simple reload - in my testing I encountered an inconsistent sequence of prompts when sending the reload command, and I can imagine there can be even more variations between other devices and IOS versions. Use of the TCL shell handles the issue - it just works without needing to handle various IOS inconsistencies.

Example of a successful run

And here's how we want the upgrade Config Push preset to go:

While we wish everyone had exactly the same results as we did, some of your devices might not finish successfully. While most script-specific errors are self-explanatory, we have also added a troubleshooting FAQ below to help you understand and hopefully fix most of the errors reported by Unimus or the script.

Troubleshooting FAQ

Unimus returned INTERACTION_ERROR

This error is caused by Unimus not receiving any recognizable output from a device before a timeout runs out, which is 20 seconds by default for Cisco devices.

We already touched on overriding timeouts via Advanced settings in Part 1, but we would like to add some more information here. How do you find the right value for you?

One of the possible ways to go about this is the trial-and-error approach. You can try to increase it gradually (1 minute, 2 minutes, etc.). An alternative would be to try a larger value (e.g. 5 minutes like in my case) right away. We would recommend the latter, as you can have hundreds of devices fail with this error, and choosing random devices and trying to time image download manually to find out if a certain timeout is enough would be time-consuming.
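If you would rather start from an estimate than a guess, a rough lower bound can be derived from the size of the largest image and an assumed transfer rate. This is only a sketch: estimate_timeout_seconds is a hypothetical helper, and the 200 KB/s rate is an illustrative assumption, not a measured value for any device.

```shell
#!/bin/bash
# Sketch: rough lower bound for the timeout, based on the largest .bin image
# in a directory and an ASSUMED sustained transfer rate.
# The helper name and the 200 KB/s figure are illustrative assumptions.

# $1 = directory with .bin images, $2 = assumed throughput in KB/s
estimate_timeout_seconds() {
    local dir=$1 rate=$(( $2 * 1024 )) largest
    # Size in bytes of the largest .bin file (0 if none are found)
    largest=$(find "$dir" -maxdepth 1 -type f -name '*.bin' -exec wc -c {} + \
        | awk '$2 != "total" && $1 + 0 > max + 0 { max = $1 } END { print max + 0 }')
    # Integer division, rounded up to whole seconds
    echo $(( (largest + rate - 1) / rate ))
}

# Demo with zero-filled dummy images in a scratch directory
img_dir=$(mktemp -d)
head -c 1000 /dev/zero > "$img_dir/small.bin"
head -c 3000000 /dev/zero > "$img_dir/large.bin"
estimate_timeout_seconds "$img_dir" 200
```

Real transfer rates vary widely between device models and protocols, so treat the result as a floor and add a generous margin on top of it.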

One of our Config Push features which will come in useful is using the context menu of any push output group. You can choose the option to rerun the preset only on devices in that one output group. You can progressively try increasing the timeout and keep adjusting until there is no device left in the error output group. Alternatively, clone the preset with devices from the error output group to continue tuning only those devices.

If you get INTERACTION_ERROR from the timeout while the FW image was downloading, then unfortunately your device(s) will continue downloading the file in the background after Unimus terminates the session. So if you quickly increase your timeout and rerun this preset on these devices, the ones with such a background download still running will finish this push with a device is busy error (see the section below) returned by the script.

Keep in mind that INTERACTION_ERROR could also include some outliers, such as devices that use a different syntax for commands or syntax for response prompts that the script might not catch, so don't go to extreme lengths with this timeout. If there are devices that you can't fix by increasing the timeout, then stop and let us know. As mentioned in the preface, you should expect some errors and tuning to make large scale automation work seamlessly across a large fleet of different IOS devices.

Script returned error message Error occurred during download - device is busy. Aborting...

In the preceding section, we described a specific case in which device(s) with INTERACTION_ERROR can create an output group with this script-specific error after a quick rerun of the preset.

This error can show up when a device could not download the chosen FW image in time and you rerun your preset on it before the download had a chance to finish. The IOS copy command returns a Device or resource busy error, as the file is already being downloaded in the background. Time is the best cure here. Leave devices in this output group for a couple of minutes and try again.

Script returned error message Error occurred during download - insufficient space left on device. Aborting...

Your device doesn't have enough free space left on it. In the section above, we also mentioned a potential issue when the script failed due to INTERACTION_ERROR, but the device might have kept downloading the firmware. Alternatively, your device may effectively not be able to store more than a single FW image file as its flash is already consumed by your current running FW image.

If that is your case, add the command del /force *.bin to the Config Push preset just after log_user 0 and before the copy command for downloading the fwlist file. This will cause any file with a .bin extension to be removed from the root of your flash (it is not recursive).

Be careful not to delete anything important, though. This will delete any file with .bin extension. This has (hopefully) obvious drawbacks - it can leave you with an unbootable device if the currently configured boot IOS image gets deleted and a power failure occurs during the update to a new image. Caution recommended.

Script returned error message Error occurred during download - protocol error. Aborting...

This script-specific error most likely points to an SSH-related problem: unsupported KEX algorithms, Ciphers or Host Key Algorithms offered by your IOS device and rejected by your FW image source server. You might never have seen such a problem when interacting with Unimus. That is because Unimus is, by default, more liberal with KEX and other crypto than a typical OpenSSH installation (we know many networking devices use older crypto protocols).

If this happens to you, I would recommend adding the diffie-hellman-group1-sha1 KEX to your SSH server's config file. This should resolve the error on most devices. If it does not, you may want to try manually connecting from your Cisco device to the FW image source server, or you can turn on terminal monitor and debug for SCP with these commands:

terminal monitor
debug scp all

and try to manually download, for example, the fwlist file from the image source server with:

copy scp://unimus:scppass8520@10.30.50.70/fwlist flash:

Here's an example of the output you can expect:

cisco#copy scp://unimus:scppass8520@10.30.50.70/fwlist flash:
Destination filename [fwlist]?
%Error opening scp://unimus:scppass8520@10.30.50.70/fwlist (Protocol error)
cisco#
*Feb  5 05:31:45.181: SSH2 CLIENT 0: kex algo not supported: client diffie-hellman-group1-sha1, server curve25519-sha256,curve25519-sha256@libssh.org
cisco#

From the information in the example above, you would then add diffie-hellman-group1-sha1 to your server's SSH configuration.
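On an OpenSSH server, one way to do this is a KexAlgorithms line with a leading "+", which (in OpenSSH 7.0 and newer) adds the algorithm to the default set instead of replacing it. The sketch below works on a scratch file; enable_legacy_kex is a hypothetical helper, and the location of the real config file (usually /etc/ssh/sshd_config) is an assumption about your server.

```shell
#!/bin/bash
# Sketch: append the legacy KEX algorithm to an sshd_config-style file, once.
# enable_legacy_kex is a hypothetical helper name.

enable_legacy_kex() {
    local conf=$1 line='KexAlgorithms +diffie-hellman-group1-sha1'
    # Do nothing if the exact line is already present
    grep -qxF -- "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
}

# Demo on a scratch file instead of the real server configuration
scratch=$(mktemp)
enable_legacy_kex "$scratch"
enable_legacy_kex "$scratch"   # second call changes nothing
cat "$scratch"
```

If the real file already contains a KexAlgorithms line or ends with a Match block, edit it by hand instead of appending. Running sshd -t validates the configuration before the SSH daemon is reloaded. Keep in mind that this re-enables a weak algorithm, so consider limiting it to the image server.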

Script returned an error message Unknown error...

If you encounter this error, let us know. This indicates an unexpected error and will require some debugging.

There is also a topic on our forums you can use to report any issues, provide feedback or ask questions: https://forum.unimus.net/viewtopic.php?f=11&t=1426