
Wednesday, June 3, 2015

ACI Example - Simplified Infrastructure Upgrades

Most of the ACI literature tends to focus on application automation: the ability to define application connectivity using a declarative model that states the needs of the various application components. With ACI you can continue doing networking the way you did before, or move to a new way of doing networking where concepts such as VLANs and subnets are not required in the same way.

But with all these conversations some of the basics about ACI get lost. And the basics alone are very important. In the little blogging that I get to do, I have already written about how on-boarding new devices into a fabric becomes a very simple task when you use ACI (see here). This is also true for replacing hardware, should you need to do an RMA for a device. 

Another advantage that ACI brings over traditional networking is simplified software management. The fabric software, the bits that run on both the APIC controller and the ACI switches, is referred to as firmware in APIC.

In this blog post I describe how you can completely upgrade an entire fabric including the controllers without service disruption. This is a very important part of maintaining and operating an infrastructure.

Customers who are looking to deploy any SDN solution should look at how upgrades are done and how much operational complexity is involved. I believe they will find that this is another area where the integrated overlay has significant advantages over a server-only overlay.

Let's look at the steps for doing this in ACI.

(1) Add software to the Firmware Repository

There are various ways to do this. The simplest is to configure a download task that downloads the controller .iso and the switch .bin images into the firmware repository, which is accessible in the FIRMWARE section under the ADMIN tab.




Once you set it up, you have to verify the operational status. It should be downloading:


The above tasks can be done through the GUI as shown in the pictures, through the REST API, or via the CLI on the APIC with admin privileges. When the download reaches 100%, the firmware has been downloaded and added to the Firmware Repository. You can then click there to confirm:
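For readers who prefer the REST API route, here is a minimal sketch of what the body of a download task might look like. It only builds and prints the JSON; it does not contact an APIC. The class name `firmwareOSource`, its `name`/`proto`/`url` attributes, and the target path `/api/node/mo/uni/fabric/fwrepop.json` are my assumptions about the object model and should be checked against the APIC REST API documentation for your release. The file server URL and task name are hypothetical.

```python
import json
from urllib.parse import urlparse


def build_download_task(name: str, image_url: str) -> dict:
    """Build a JSON body describing a firmware download source.

    Assumption: the APIC models a download task as a firmwareOSource
    object with name, proto and url attributes. Verify before use.
    """
    parsed = urlparse(image_url)
    if parsed.scheme != "http":
        raise ValueError("this sketch only handles http:// sources")
    if not parsed.path.endswith((".iso", ".bin")):
        raise ValueError("expected a controller .iso or a switch .bin image")
    return {
        "firmwareOSource": {
            "attributes": {"name": name, "proto": parsed.scheme, "url": image_url}
        }
    }


if __name__ == "__main__":
    # Hypothetical file server and image name, for illustration only.
    body = build_download_task(
        "apic-image", "http://fileserver.example.com/images/apic.iso"
    )
    # Assumed target: POST to /api/node/mo/uni/fabric/fwrepop.json after login.
    print(json.dumps(body, indent=2))
```

You would build a second body the same way for the switch .bin image.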



You have to repeat the above to add the .bin image for upgrading the switches. Once done, we move to step 2.

(2) Upgrade the Controller Firmware. 

Again, you have to go to the ADMIN area and click on the Firmware tab. You will see the options below, and when you right-click on Controller Firmware you can select to upgrade the controller.

A window will open, where you must select the desired firmware level from the drop-down menu. Choose whether to apply the upgrade now or at a later time, then submit.






Because we selected "Apply Now", the upgrade process begins. The upgrade status can be seen by clicking on the "controller firmware" option:

It is important to notice that while the upgrade goes on, the fabric is fully operational and traffic flows through without any problem.

Eventually, the APIC controller that is currently upgrading will reboot. You will see a reboot message on the console if you are connected to it, or else you may see an error in your browser indicating that the session is closed.

After a few minutes, the controller finishes rebooting and you can log in again. You can then check if the upgrade was successful:



(3) Upgrade the Fabric nodes.

Now we have to upgrade the fabric nodes. Let's check that the switch .bin is also in the firmware repository (it needs to be added too, using the same Download Tasks procedure as for the .iso):

On the right hand side options under ADMIN -> Firmware you see the Fabric Node Firmware menu. We right click on Fabric Node Firmware, and select "Firmware Upgrade Wizard" and we see something like below:

We are going to create a firmware group with all switches, and select the right firmware level that we want for all of them (partial upgrades are possible too):


Then, very important, we are going to create two maintenance groups, one for odd numbered switches, one for even numbered switches (remember that when you commission a switch in the fabric you have to assign it a node ID):
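The odd/even split can be worked out from the node IDs alone. The following is a small sketch of that grouping logic; the node IDs and the group names are illustrative, not taken from a real fabric, and the code does not talk to the APIC.

```python
from typing import Dict, Iterable, List


def split_by_parity(node_ids: Iterable[int]) -> Dict[str, List[int]]:
    """Group switch node IDs into an odd and an even maintenance group."""
    ids = list(node_ids)
    return {
        "odd-switches": sorted(n for n in ids if n % 2 == 1),
        "even-switches": sorted(n for n in ids if n % 2 == 0),
    }


if __name__ == "__main__":
    # Hypothetical node IDs: four leaf switches and two spines.
    groups = split_by_parity([101, 102, 103, 104, 201, 202])
    print(groups)
    # {'odd-switches': [101, 103, 201], 'even-switches': [102, 104, 202]}
```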



Now that both maintenance groups have been created, we can roll out the upgrade first on the odd numbered switches, then on the even numbered switches. Because our servers and external routers are all dual homed, doing it this way ensures no service interruption. We click on the Maintenance Group for odd-switches and click on upgrade now:


After one final confirmation, the upgrade process begins:


The upgrade process progresses:


The switches on the odd-switch group have been upgraded:



It is important to mention that during the upgrade we had a ping running between two VMs on different servers, as well as a ping running to the default gateway from one of the VMs (the default gateway is, of course, on the ACI leaf switches):






Of course, when the leaf switches reboot, the end hosts will see a link down, so in order to avoid service interruptions they must be dual homed (one port to a switch with an even ID, one to a switch with an odd ID, hence our upgrade policy). In our case, the hosts are running ESXi, and we see the link down flagged as an alarm:
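Before rolling out the upgrade, it can be worth checking that every host really does have one uplink in each group. Below is a rough sketch of such a check. It assumes you already have an inventory that maps each host to the node IDs of the leaf switches it connects to; the host names and node IDs are hypothetical.

```python
from typing import Dict, List

SINGLE_HOMED = "single-homed"
SAME_GROUP = "all uplinks in the same maintenance group"


def find_upgrade_risks(host_uplinks: Dict[str, List[int]]) -> Dict[str, str]:
    """Return hosts that would lose connectivity during an odd/even rollout."""
    risks = {}
    for host, leaves in host_uplinks.items():
        if len(set(leaves)) < 2:
            risks[host] = SINGLE_HOMED
        elif len({leaf % 2 for leaf in leaves}) < 2:
            # Both leaves are odd, or both are even: they reboot together.
            risks[host] = SAME_GROUP
    return risks


if __name__ == "__main__":
    # Hypothetical inventory: host name -> leaf node IDs it is cabled to.
    inventory = {
        "esx-01": [101, 102],
        "esx-02": [101, 103],
        "esx-03": [104],
    }
    for host, reason in find_upgrade_risks(inventory).items():
        print(f"{host}: {reason}")
```

Any host that shows up in the result should be re-cabled or moved before you upgrade either group.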

Our upgrade is now complete:



When the upgrade is complete, clicking on the Fabric Node Firmware will show the new release for all fabric nodes:
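If you collect the running release per node (from the GUI, the CLI, or the REST API), the same check can be done in a few lines. This sketch assumes the versions are already available as a simple mapping; the node IDs and release labels are placeholders, not real release names.

```python
from typing import Dict, List


def nodes_not_on_target(running: Dict[int, str], target: str) -> List[int]:
    """Return the node IDs whose running release differs from the target."""
    return sorted(node for node, version in running.items() if version != target)


if __name__ == "__main__":
    # Placeholder data: node ID -> running release label.
    running = {101: "release-B", 102: "release-B", 201: "release-A"}
    stragglers = nodes_not_on_target(running, "release-B")
    if stragglers:
        print("Still on an older release:", stragglers)
    else:
        print("All fabric nodes are on the target release")
```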





And that is it!  

Compared to a traditional network built of individually managed switches, there is no need to set up any TFTP servers, download the new code to each switch, script your way into automating every switch to upgrade and reboot, etc.



1 comment:

  1. It really depends on how many spines you have. In general, I'd recommend having the spines in separate maintenance groups, so you can roll out the upgrades one by one. When one spine reloads, there should be no impact on traffic. If you have four spines, maybe you want to do 2 and 2 to save some time, but one by one is probably advisable. In my lab, I upgrade the spines at the same time as the leafs, so I literally upgrade half the fabric, then the other half: one maintenance group for odd-switches and one for even-switches. This is fastest, but for production you need other considerations; for example, perhaps your border leaf switches should have a dedicated maintenance group.

    About the 2 different code levels, this is a common question. You can run 2 different code versions, but the expectation is that you will do so as a temporary situation while completing an upgrade. I do that in the lab when I have to test something on a new release but I don't want to fully upgrade all switches.
