Translate

Tuesday, March 21, 2017

A response to - RUNNING VSPHERE ON CISCO ACI? THINK TWICE…


Ivan, I believe you are misinformed. I am not talking specifically about the vSphere API that AVS uses, and the constant rumours about it, but more about ACI in general. 

It is true that VMware is becoming a company where their technology is increasingly vertically integrated and eventually they want that if you use their hypervisor you have to use their SDN solution and their orchestration system and their VDI and ... I think the market will steer things otherwise … but we shall see! 😃 

I don’t know how much of an opportunity you had to have hands-on with ACI recently. I’d love to spend time showing you how some of the things we do with ACI work. Meanwhile I provide here my respectful feedback. 

I definitely disagree with some comments that you made. Some are in fact open for debate, for instance:

You can’t do all that on ToR switches, and need control of the virtual switch”.

I wouldd say you can do a lot of what you need to do for server networking on a modern ToR. And yet you are right, and you do need a virtual switch. That is clear. How much functionality you put on one vs. another is a subject for debate with pros and cons.

But in an SDN world with programmable APIs, it does not mean you need YOUR virtual switch in order to control it. You just need one virtual switch that you can program. That is all.

There’s a lot that we can do on AVS that we can do on OVS too. And we do it on OVS too. There’s a lot with do on AVS that we can’t do with VDS. But there’s enough that we can do in VDS so that when combining it with what we do on the ToR we deliver clever things (read more below).

The below comment on the other hand is imho misinformed:

That [running without AVS] would degrade Cisco ACI used in vSphere environments into a smarter L2+L3 data center fabric. Is that worth the additional complexity you get with ACI? "

First, it is wrong to assume that in a vSphere environment ACI is no more than smarter L2+L3 data center fabric. But even if it was only that … it is a WAY smarter L2+L3 fabric.

And that leads me to your second phrase. What is up with the “additional complexity you get with ACI?”.

This bugs me greatly. ACI has a learning curve, no doubt. But we need to understand that the line between “complex” and “different” is crossed by eliminating ignorance. 

Anyone that has done an upgrade on anymore than a handful of switches from any vendor and then conducts a network upgrade (or downgrade) on dozens and dozens of switches under APIC control will see how it becomes much simpler.

Configuring a new network service involving L2 and L3 across dozens and dozens of switches on multiple data centers is incredibly simpler. Presenting those new networks to multiple vCenters? Piece of cake. Finding a VM on the fabric querying for the VM name? … done. Reverting the creation on the previously created network service is incredibly simpler too. I mean … compared to traditional networking … let me highlight: INCREDIBLY simpler.
 
Changing your routing policies to announce or not specific subnets from your fabric, or making a change in order to upgrade storm control policies to hundreds or thousands of ports, … or - again - reverting any of those changes, becomes real simple. Particularly when you think how you were doing it on NX-OS or on any other vendor’s box-by-box configuration system.

And the truth is that APIC accomplishes all of that, and more, with a very elegant architecture based on distributed intelligence in the fabric combined with a centralised policy and management plane on a scale-out controller cluster. 

Other vendors require combining six different VM performing three different functions just to achieve a distributed default gateway that is only available to workloads running on a single-vendor hypervisor. Now that’s complex, regardless of how well you know the solution.

Finally, when it comes to using VDS and ACI, we can have multiple EPGs mapped to a single vSphere PortGroup and associate the VM to the right EPG (and correspondingly the right policy) dynamically using vCenter Attributes. To do that we leverage PVLAN on the VDS, but just at the base PortGroup level, not across the infrastructure. In a way, we use an isolated PVLAN to make the VDS perform as a FEX, so that we can ensure that all VM traffic can be seen, and classified, on the ToR. And that works very well. And we are not the only ones doing that … btw. 

The clear disadvantage of that approach is that you lose local switching on the hypervisor. This may be a show stopper in some cases. However, for organisations running DRS clusters it is rarely an issue from a practical standpoint since in any case, it’s difficult to predict how much traffic is switched locally anyways and the switching through the ToR has often times only marginal impact on application performance.

On the other hand, the advantage of this approach is that you keep your hypervisor clean of any networking bits and with the simplest configuration, thus making operations simpler because network maintenance does not need to upset the DRS cluster, and vice versa.

The other advantage is that you get programmable network and security at the VM level without having to pay per-socket licenses and service fees. And that you can extend all of that to other environments running Hyper-V, KVM, etc.




Monday, March 20, 2017

ACI and OpenStack Deep Dive - Part I

The integration between ACI with OpenStack leveraging Opflex has been available for some time now. The following white paper explains this integration and its benefits:


And the documentation explains how to install the Cisco plugin and configure Opflex to function in either ML2, or GBP (or more recently, any of the two using the Unified Plugin). Here is for instance the installation for Red Hat OpenStack:


However, there’s little literature that deep dives into the inner implementation to help operations and troubleshooting. In this post I will try to provide a deeper insight that should help people that are building OpenStack clouds with Cisco ACI troubleshoot their environments.

I will illustrate the concepts with a basic diagram involving two nodes running RDO. In the following diagram, centos-01 runs an all-in-one OpenStack installation, therefore it is the neutron server and also an openstack nova compute node. We complement it with centos-02, running only nova compute node.

We are leveraging the APIC neutron plugin in ML2 mode for SDN integration. This means that we implement distributed routing, distributed switching, distributed SNAT and Floating IP via APIC and the ACI fabric.




In this post I do not cover the installation or configuration. However, the configuration has been done so that we also leverage distributed DHCP and Metadata:



We can use ’neutron agent-list’ to review the list of running agents:


Notice that Open vSwitch agent is disable for both nodes, and that L3 Agent is also disabled since they are not required when using ACI OpFlex integration. Notice also that DHCP agent only runs on the centos-01 node (our neutron server) which is our OpenStack controller, but it is not required on the compute nodes.
On the diagram above you can see a router connected to the fabric via leaf5 and leaf6 and an external switch. We will use this connection to configure an L3Out on the common tenant that will serve for all OpenStack external communications. The following is a snapshot of the L3Out configuration on APIC:




We also have a VMM domain on APIC that allows the fabric administrator to have greater visibility of the OpenStack cloud, here we can see the two hypervisors configured, and an inventory of the networks under the distributed virtual switch implemented by Open vSwitch controlled by APIC:



On OpenStack, I have configured a project called NewWebApps. On the OpenStack Project NewWebApps we have the following topology configured (note that in OpenStack a project is also called a tenant):


We have two neutron networks and a neutron router. We also have an external network called OS_L3Out (matching the L3Out under common tenant). By default, a new ACI tenant is created for each OpenStack Project. Also by default, in ML2 mode, for each tenant a VRF is created and each neutron network will create a Bridge Domain and a matching EPG, with the L3Out having a relation in this case with the common-tenant L3Out via a “shadow L3Out” as shown below:



The neutron network topology from above is further represented by the following ACI Application profile where we can again see the connection to the shadow L3Out:





The neutron subnets are configured under the corresponding Bridge Domain. For instance, for MyWebApp we can see neutron network with subnet 10.100.100.0/24:




So to recap once again, each neutron network will trigger the creation of a Bridge Domain and an EPG under the project/tenant VRF. The neutron subnet will be configured under the corresponding Bridge Domain, and if the neutron network is connected to a neutron router this will be simulated by creating a contract that allows all traffic and associating the corresponding EPGs with that contract.


Endpoint Connectivity in Detail


First let’s look at how instances can get their IP addresses. We have the following instances configured.




Understanding DHCP with OpenStack


If we look at our VMM Domain, we can see for each of the hypervisors the list of running virtual machines. For instance we see something like this for both centos-01 and centos-02:



We can recognise there some of our instances. However on centos-01 there are some additional endpoints that aren’t so obvious to identify at a first glance. For instance, those with VM name beginning with ‘dhcp|_RDO-ML2 ...’. Those are the endpoints that will allow us to run a dhcp server for each of the networks, and we will see one per neutron network.

Each of these dhcp instances will connect with its own network namespace in the linux hypervisor, for each of the subnets on which we are running DHCP. In our case, we see five networks. But why five? On our topology we had only three ...

… because on the neutron topology above we were looking at one project. But running the ‘neutron net-list’ command as admin, and on the VMM domain information we are looking at all tenants and there are additional networks configured for other tenants. Let’s focus on the dhcp endpoints and use ‘ip netns list’ to list on the centos-01 the list of network namespaces:



It is easy to correlate the neutron ID with the network namespace. For instance, MyWebApp has ID 6fccd59e-76e0-49de-9434-020fef1c22d9, and we can see the network namespace matching it: qdhcp-6fccd59e-76e0-49de-9434-020fef1c22d9.

We can look at the IP Address running on this namespace:



We can see interface ‘tap6ebbc085-b4’ running with IP address 10.100.100.2. That is the IP address and the tap interface that we see from the VMM domain as well. This IP Address is also seen in the fabric as just any other endpoint. We just need to check it on the corresponding EPG (in our case, MyWebApp EPG):




And we can see also on centos-01 what processes are using this if we type ‘ps auxwww | grep 6fccd59e-76e0-49de-9434-020fef1c22d9’. We will see as per below the dnsmasq process running for this namespace with the DHCP subnet 10.100.100.0 allocated to it:



So what happens here is that Neutron will use this dnsmasq process to obtain an IPv4 address for instances launched into that neutron network.

But we are running distributed DHCP, so … do instances talk to THAT dnsmaq process? … they don’t. We will see how this works soon.

Let’s first track one of our instance endpoints in the fabric.


Understanding OpenStack Instance Networking


If we look at the neutron diagram from above, we can see that in MyWebApp we have two instances, one called ‘test-instance’ and the other ‘web-server—1’. Let’s see in greater detail how these are connected and how they obtain their IP Address(es).
We begin by looking for the instance ID and the neutron ID. We can check this with ‘nova list’:



Now we may want to find out where these instances are running, on which hypervisor. One way to look at this from OpenStack is with ‘nova show’ command for a particular instance ID. On APIC we find the hypervisor for an endpoint by checking the VMM domain information, by looking at the EPG operational tab or - if you dont know to which EPG or project it belongs to - by looking for the endpoint with endpoint tracker. Let’s see them all together below:



To continue the example, let’s now focus on one instance. We will look for one that is running on centos-02 to make it simpler (looking at a node that is only nova compute node).

Let’s focus on the 'test-instance’, with IP address 10.100.100.17 that it is running on centos-02, our compute node (let’s remember that centos-01 is both compute node and controller).

We need to find out the neutron port ID for this instance. Let’s list the neutron ports using ‘neutron port-list’:


The neutron port ID for our instance is 43591b4f-8635-4ad6-b8d7-c6510dd618ee. We can look at the first few characters only (43591b4f-86) and find out the linux bridge in front of this VM. We can see this by using ‘brctl show’ and looking for a port that is named ‘qbr’ followed by the first characters of the neutron port ID:


On the output from above, notice that there are three Linux bridges, there will be one for each VM running on the host.
The Linux Bridge we are looking for is qbr43591b4f-86. This bridge has two ports:
  • tap43591b4f-86: connects to our virtual machine test-instance virtual NIC.
  • qvb43591b4f-86: this is one end of a vEth pair, the other end is on the integration bridge (bri-int) configured on the OVS.

To confirm that we are on the right track and for troubleshooting, we will start a ping from ‘test-instance’ to ‘web-server—1’ (from 10.100.100.17 to 10.100.100.14).

We can capture packets sent/received by the VM by using tcpdump on the tap interface at the KVM host (centos-02). For instance ‘tcpdump -i tap43591b4f-86 -n -e icmp’ where we can see that our ‘test-instance’ is currently pinging our ‘web-server—1’ instance:



That tap interface is very important, because this is where security groups are implemented using IP Tables.

But for now, we are only trying to see all data path elements between our VM and the OVS.

Where is this vEth qvb43591b4f-86 connected? we can complete the picture by looking at the command ‘ovs-vsctl show’ and matching the neutron port ID characters again.

We will see that another way would have been to look at the VMM domain information, where we see the OVS port under br-int to which the vEth for this instance is attached:



So we can confirm that the other end of the vEth from the VM uses port qvo43591b4f-86 on br-int.
Now we can see all components in the data path at the host level, and how we can identify them using APIC as well as the openstack CLI.

The following diagram puts it all together, where the centos-02 host connects to the fabric using an etherchannel based off physical NICs enp8s0 and enp9s0 over which runs our VXLAN interface that is the uplink of our br-int to the ACI fabric:




Understanding Distributed DHCP


Now we can look at how the 'test-instance’ VM managed to get its IP address delivered. The instance is configured to boot and use DHCP to get an IP address by default. We can monitor how our instance gets an IP address by using tcpdump again on the tap interface that we already knew:

‘tcpdump -i tap43591b4f-86 port 67 or port 68'


And that help us capture the request for an IP address, and the response from our DHCP server running on 10.100.100.2. Who is that server? Since we have distributed DHCP configured, that server is actually the ‘agent-ovs’ and we had seen it above before as the endpoint with name ‘dhcp|_RDO-ML2_NewWebApps|RDO-ML2|NewWebApp:



Let’s check the port-id on the OVS that corresponds to our instance. We know that the instance vEth is connected on OVS port qvo43591b4f-86. So we can find the id for that port with this command:

'ovs-ofctl show br-int'


The ID is 37. So we need to look for entries on the OVS OpenFlow table that use port ’37’. When the instance sends a DHCP packet, it will hit our OVS port and in there it will match this rule (notice we are looking for in_port=37 and matching on UDP 68):



What does this rule do? That rule will send the BOOTP packet to the controller, which is the agent-ovs running on the node.

So is agent-ovs running a DHCP Server? … Not exactly. The DHCP server runs on the dnsmaq process we have seen earlier. Neutron will leverage that to find the right IP address for the instance and the neutron-opflex-agent will receive what this IP address should be. The neutron-opflex-agent will create a file for each endpoint (neutron port) under '/var/lib/opflex-agent-ovs/endpoints/‘.

Let’s look at our endpoint’s file. Let us remember that the neutron port ID is: 43591b4f-8635-4ad6-b8d7-c6510dd618ee. On the centos-02 node, neutron-opflex-agent has created an endpoint file for this neutron-port:




As you can see on that file we will have all required information: IP address, default gateway, DNS server IP address, etc. It is from this file that ‘agent-ovs’ will collect the necessary information for the DHCP response. The response will be sent directly to the OVS port connecting to the VM.

We can correlate this information with the output of ‘neutron port-show 43591b4f-8635-4ad6-b8d7-c6510dd618ee’:




As well as with the metadata corresponding to the Nova instance which would be visible under ‘/var/lib/nova/instances/<instance-id>/libvirt.xml



Security Groups Implementation - IP Tables


One of the key differences between using GBP mode vs ML2 mode is the security model. In ML2 mode we leverage standard neutron security groups, implemented as IP Tables on the linux bridge tap interface.

Once again, we can use the ‘nova show’ command to look at the instance details, including the configured Security Groups:



This instance is part of two security groups: AllowPublicAccess and MyWebservers. Using the ‘neutron security-group-list’ command we can see what’s configured for those:


From there one, you can use regular IPtables troubleshooting. The rules are created by the neutron-opflex-agent and they can be identified, again, by matching on the neutron port ID characters we have seen before. For instance below you can quickly spot the rules of the MyWebServers security group for tcp/22, tcp/80 and ICMP:


On the next post we will look at NAT and Floating IP.