
Wednesday, August 31, 2011

A new way to do L2 connectivity in a cloud: VXLAN

I have written before about the issues of L2 and L3 in the context of DC infrastructure design. As people look to build denser DCs, they are faced with running a network with a large number of ports and devices. Clearly this was a no-go for STP-based networks, and people recognized it years ago. But then the challenge was: we need L2 to run distributed clusters, do VM vMotion and so many other things ... the DC network HAS TO BE L2! ...

And we learned and discussed TRILL, whose only shipping incarnation I know of is Cisco's FabricPath (pre-standard, but TRILL-like, with enhancements over the standard). It is true that FabricPath and TRILL let us run larger L2 networks, with more VLANs and higher bandwidth than ever before with STP (which had no multi-pathing at all).

But then there is another challenge, or a few at least ... (1) will we really be able to scale L2 networks, even TRILL/FabricPath-based ones, to the levels of L3? ... (2) how do we implement multi-tenancy, which essentially boils down to creating L2 overlays that can support multi-VLAN environments inside the overlay?

Or better said: if we MUST create L2 overlays on top of the physical infrastructure to segment the virtual machines running on hypervisors across that infrastructure, does it make sense to make the infrastructure itself L2? What for? ...

Yesterday's news was good in this sense, as VMware and Cisco announced the first implementation of a new emerging standard called VXLAN. I like this standard a lot at first glance because it decouples the two sides of the problem: (1) how to build the DC networking infrastructure and (2) how to provide L2 segmentation to enable multi-tenancy. VXLAN solves the second, and does so in a way that works with either approach to the first: L2 or L3 designs.

VXLAN implements a tunneling mechanism using MAC-in-UDP to transport L2 frames over an L3 network (and using an L4 encapsulation brings great benefits, as we will see below). It also defines a new header field as part of the VXLAN encapsulation: the Virtual Network Identifier (VNI), a 24-bit field which identifies the L2 segment, giving room for about 16 million segments instead of 4,096 VLANs. A scalable solution at first glance.
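To make the encapsulation concrete, here is a minimal sketch (my own illustration in Python, not anything taken from the draft or the Nexus 1000V) of how a VTEP could build the 8-byte VXLAN header around an original Ethernet frame; the outer Ethernet/IP/UDP headers, carrying the VEMs' addresses, would then be added on top:

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" bit: the VNI field carries a valid segment ID

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header to an original L2 frame.

    Header layout: 8 bits of flags, 24 reserved bits, 24-bit VNI, 8 reserved bits.
    The result is then carried inside outer UDP/IP/Ethernet headers added by
    the sending VTEP (outer IPs = the VEMs' addresses).
    """
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit field (0..16777215)")
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame

# Example: wrap a dummy 64-byte frame into segment 5000
packet = vxlan_encapsulate(b"\x00" * 64, vni=5000)
assert len(packet) == 8 + 64
```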

Flooding is elegantly solved by mapping each VNI to an IP multicast group, a mapping which must be provided by a management layer (i.e. vCenter or vCloud Director, etc.). The soft-switch element (VEM) in each hypervisor then knows this mapping for the VNIs that are relevant to it (something the management layer knows at all times). The VEM only has to generate IGMP joins towards the upstream physical network, which must of course provide robust multicast support. Flooded traffic is then optimally distributed through the upstream network to interested VEMs only, which rely on it to learn the source MAC addresses inside each VNI and map them to the sending VEM's source IP address. Beautiful. I can see a great use case for BiDir-PIM here (since each VEM is both a source and a receiver for every group it joins), a protocol which is well implemented in hardware and software on Cisco's switching platforms.
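A rough sketch of the state a VEM keeps, with hypothetical names of my own (not the Nexus 1000V's actual data structures): a VNI-to-group table pushed down by the management layer, and a MAC table learned from the outer source IP of received VXLAN packets:

```python
# Pushed down by the management layer (vCenter / vCloud Director):
vni_to_group = {
    5000: "239.1.1.1",   # tenant A's segment floods to this multicast group
    5001: "239.1.1.2",   # tenant B's segment
}

# Learned in the data path: (VNI, inner source MAC) -> outer IP of the remote VEM
mac_table = {}

def on_vxlan_receive(vni, inner_src_mac, outer_src_ip):
    """Learn which remote VEM is behind a given MAC within a VNI."""
    mac_table[(vni, inner_src_mac)] = outer_src_ip

def next_hop(vni, inner_dst_mac):
    """Unicast to the learned VEM if known; otherwise flood to the VNI's group."""
    return mac_table.get((vni, inner_dst_mac), vni_to_group[vni])
```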

Because the encapsulation is MAC-in-UDP, it also lends itself very well to optimal bandwidth utilization: modern switches can use the L4 fields to perform better load balancing across ECMP paths or port bundles (Cisco switches certainly do this nicely).
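The trick is that the sending VTEP can derive the outer UDP source port from a hash of the inner frame's headers, so different inner flows get different outer 5-tuples and spread across the available paths. A hypothetical sketch (the hash functions here are illustrative, not what any switch actually uses):

```python
import zlib

def outer_udp_source_port(inner_src_mac, inner_dst_mac, vni):
    """Derive the outer UDP source port from the inner flow (illustrative hash)."""
    key = f"{inner_src_mac}-{inner_dst_mac}-{vni}".encode()
    return 49152 + (zlib.crc32(key) % 16384)   # keep it in the ephemeral range

def ecmp_member(outer_src_ip, outer_dst_ip, sport, dport, n_links):
    """Pick an uplink the way a switch hashing on the L3/L4 5-tuple might."""
    key = f"{outer_src_ip}-{outer_dst_ip}-{sport}-{dport}".encode()
    return zlib.crc32(key) % n_links
```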


Moreover, the first implementation, based on the Cisco Nexus 1000V, also leverages NX-OS QoS. With VXLAN's UDP encapsulation, DSCP marking can be used to provide differentiated services per VNI, so a cloud provider could offer different SLAs to different tenants.
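In practice this could be as simple as a per-tenant mapping from VNI to DSCP applied when the outer IP header is built; the sketch below is purely illustrative, not an NX-OS configuration:

```python
# Illustrative only: map each tenant's VNI to a DSCP codepoint for the outer IP header.
vni_to_dscp = {
    5000: 46,  # "gold" tenant: EF
    5001: 26,  # "silver" tenant: AF31
}

def outer_tos_byte(vni, default_dscp=0):
    """DSCP occupies the upper 6 bits of the outer IP ToS byte."""
    return vni_to_dscp.get(vni, default_dscp) << 2
```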

The final nice thing about VXLAN is the industry support. While unveiled at VMworld, and currently only supported on vSphere 5 with the Nexus 1000V, the IETF draft is backed by Cisco, VMware, Citrix, Broadcom, Arista and other relevant industry players.

I personally like this approach very much, both for its industry-level support and from a technology point of view. I have blogged before that I believe an L3 design down to the ToR is great for building the DC fabric, leveraging L2-in-L3 for the VM segmentation overlays, so this fits nicely.

The only drawback I can see is the need to run multicast in the infrastructure. As an IP head and long-time CCIE I think this is just fine, but I know many customers aren't currently running multicast extensively in the DC and don't (yet) have the skills to properly run a scalable multicast network. On the other hand, as a Cisco employee and shareholder, I welcome this very much, for Cisco has probably best-in-class multicast support across its routing and switching portfolio and can bring a lot of value to customers here.

... when I have time, I'll also write up other approaches to solving these problems of cloud L2 segmentation, and how I see SDN/OpenFlow playing in this space, but so far ... time is scarce ...


Saturday, August 27, 2011

An example of poor network design, or how to look for trouble

Everybody these days is trying to do more with less. This is fine, but the challenge is that sometimes you end up doing less with more. I have recently found a paper from a networking vendor with a recommendation which falls into that category and I wanted to write about it.

Companies are looking to upgrade their infrastructure from GE to 10GE, and many are considering whether to upgrade existing modular switches or deploy new ones. There are many variables to consider here but, clearly, newer switches are denser in terms of 10GE (as one would expect from the natural evolution of technology). Two platforms in particular, the Cisco Catalyst 4500 and 6500, have shown impressive evolution over time. I've known many customers using the latter in Data Center environments, and when I look back five years, I am sure those customers can recognize that they made the right investment in the Catalyst 6500. I believe NONE of the high-end modular switches the Catalyst 6500 was competing with five years ago is still a valid option in the marketplace. Customers who chose to go with Nortel Passports, Foundry BigIrons or MLXs, Force10 ... would find themselves today with platforms that have had no future for a couple of years already, very limited upgrades, and poor support. The Catalyst 6500, on the other hand, still offers bleeding-edge features and options for software and hardware upgrades to enhance performance.

But it is clear that there are way denser switches for DC 10GE deployments, starting with the Cisco Nexus 7000 of course. Customers need to evaluate what is best for them, and each case is different.

An idea which occurs to some, and is recommended by at least one network vendor as I wrote earlier, is to front-end the existing switches (which are less dense in 10GE port count) with low-cost 10GE switches to provide a low-cost, high-density 10GE fan-out. The following picture shows this "design" approach:



In my opinion this is a bad idea, very bad network design, and it is looking for trouble. Moreover, I think this is an approach which may end up being "doing less with more", even if, at first glance, it may look "cheap" to build. In this post I will try to explain the reasons why I think this way.

Technical Reasons

Multicast Performance - Switches constrain multicast flooding by implementing IGMP snooping. In short: as multicast receivers send IGMP join messages to signal they want to join a group, the switch's control plane snoops that traffic and programs the hardware, installing an entry (hopefully with S,G information) which points to the port on which the IGMP join was received. Without IGMPv3 support, upon receiving a leave message, the switch in the middle must forward it upstream, but only after it has sent a group-specific query downstream to ensure no receivers are left on the downstream switch. All in all, this adds complication, potential points of control-plane failure, and inevitable latency to the join and leave processes. In other words: a receiver will take longer to get multicast traffic after requesting it, and the network will take longer to prune multicast traffic when the receiver leaves a group.
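A toy model of what the snooping control plane does, just to make the join/leave latency point concrete (a simplification of my own, not any vendor's implementation):

```python
from collections import defaultdict

# (VLAN, group) -> set of ports with interested receivers
snoop_table = defaultdict(set)

def on_igmp_join(vlan, group, port):
    snoop_table[(vlan, group)].add(port)       # start forwarding the group to this port

def on_igmp_leave(vlan, group, port):
    # Without IGMPv3 explicit tracking, the switch must first send a group-specific
    # query on the port and wait for the response timer to expire before pruning;
    # every extra switch in the path adds that delay to the join/leave process.
    snoop_table[(vlan, group)].discard(port)
    if not snoop_table[(vlan, group)]:
        del snoop_table[(vlan, group)]         # last receiver gone: send leave upstream
```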

Multicast Scalability - This is a bigger issue than the previous one. Networks built with Catalyst 6500 modular switches can scale up to 32K multicast entries in hardware. Typical ToR switches support between 1K and 2K entries. These figures are fine when a switch with 48 ports connects to 48 servers. But if you use that switch to connect 48 access switches, each with 48 servers, you are looking at providing connectivity for over 2,000 devices. The math makes it clear how much this design limits you.
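Spelling out that math, using the rough entry counts quoted above:

```python
servers_behind_tor = 48 * 48          # 48 access switches x 48 servers = 2304 devices

tor_mcast_entries = 2_000             # typical fixed ToR figure
modular_mcast_entries = 32_000        # Catalyst 6500 class

print(tor_mcast_entries / servers_behind_tor)      # ~0.87 (S,G) entries per device
print(modular_mcast_entries / servers_behind_tor)  # ~13.9 entries per device
```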

Buffering - Fixed form factor switches are typically designed to connect end points. A ToR is designed to connect servers. The buffers available in those switches are sized for that application, expecting little to no contention on server-facing ports and, for practical purposes, the only contention point to be on the uplinks. When a 48-port ToR switch is used in the access layer, it deals with the traffic and burstiness of 48 servers or fewer (duh!) ... but if you use it to aggregate 48 access switches, each with 48 ports, then it aggregates the traffic of 2,000+ servers. It is VERY obvious, without much more analysis, that you'll end up in trouble and silent drops.
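A back-of-the-envelope illustration of the incast problem (all figures below are assumptions of mine, just to show the order of magnitude):

```python
# Hypothetical figures: a fixed ToR with a small shared packet buffer,
# absorbing a synchronized burst from many downstream access switches.
shared_buffer_bytes = 2 * 1024 * 1024      # ~2 MB shared buffer (assumed, typical class)
senders = 16                               # access switches bursting towards one egress port
burst_per_sender = 512 * 1024              # 512 KB each (e.g. backup or storage traffic)

arriving = senders * burst_per_sender      # 8 MB arrives nearly at once
# Worst case: the egress port is already saturated, so nothing drains during the burst
dropped = max(0, arriving - shared_buffer_bytes)
print(dropped / arriving)                  # ~0.75: three quarters of the burst silently dropped
```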

The latter is probably the most serious limitation of this design, because it will affect every possible type of deployment. There are others, such as how to deal with QoS, LACP or spanning-tree compatibility, and the impact on convergence in failure scenarios ...

Operational Reasons

Software Upgrades - Since you are adding one more layer to manage, you now have more devices to upgrade, potentially with different software and upgrade procedures, and you need to evaluate the impact of upgrades on convergence. Apart from the Cisco Nexus 5000 series, no ToR on the market provides ISSU support.

Provisioning - Setting up VLANs, policy and many other network tasks gets more complicated because you need to provision them on more devices. If spanning tree needs to run (and it is recommended as a safeguard mechanism on L2 networks using multi-chassis EtherChannel), you also need to consider that ToR switches may support only a limited number of spanning-tree instances.

Support - Troubleshooting gets more complicated. Instrumentation will likely differ between ToR switches and high-end modular devices. If the ToR is from a different vendor than the rest of the network, this gets even more complicated, because the technical support services of two or more vendors may need to be involved to deal with complex troubleshooting issues.

The list could go on ... I sure hope customers consider things carefully. Something which is cheap and may "just work" today could be extremely expensive in the long run, or simply no longer work for (near-)future requirements.


I believe that in most cases poor network design comes as a result of lack of knowledge. Many people still think that to build networks you just need switches and ports: count how many ports you need, how many switches, and off you go ...

Lack of knowledge is not a bad thing (I mean, nobody is born knowing this stuff), and it can be fixed with reading, training, etc. Now, when the poor network design comes from a networking vendor's own document, you have to ask yourself how much you can trust them ... for THEY should not lack the knowledge to do things right in networking.