Wednesday, August 31, 2011

A new way to do L2 connectivity in a cloud: VXLAN

I have written before about the issues of L2 and L3 in the context of DC infrastructure design. As people look to build denser DCs, they are faced with running a network with a very large number of ports and devices. Clearly this was a no-go for STP-based networks, and people recognized it years ago. But then the challenge was: we need L2 to run distributed clusters, do VM vMotion and so many other things ... the DC network HAS TO BE L2! ...

And we learned about and discussed TRILL, of which the only shipping incarnation I know of is Cisco's FabricPath (pre-standard, but TRILL-like, with enhancements to the standard). It is true that FP and TRILL let us run larger L2 networks, with more VLANs and higher bandwidth than was ever possible with STP (which had no multi-pathing at all).

But then there is another challenge, or a few at least ... (1) will we really be able to scale L2 networks, even TRILL/FP-based ones, to the levels of L3? ... (2) how do we implement multi-tenancy, which essentially reduces to creating L2 overlays that can support multi-VLAN environments inside the overlay?

Or, better said: if we MUST create L2 overlays over the physical infrastructure to segment the virtual machines running on the hypervisors across that infrastructure, does it make sense to make the infrastructure itself L2? What for? ...

Yesterday's news was good in this sense, as VMware and Cisco announced the first implementation of an emerging standard called VXLAN. I like this standard a lot at first glance because it decouples the two sides of the problem: (1) how to build the DC networking infrastructure and (2) how to provide L2 segmentation to enable multi-tenancy. VXLAN solves the second, and does so in a way that works with either approach to the first: L2 or L3 designs.

VXLAN implements a tunneling mechanism using MAC-in-UDP to transport L2 over an L3 network (and the L4 encapsulation brings great benefits, as we will see). It also defines a new header field as part of the VXLAN encapsulation: the Virtual Network Identifier (VNI), a 24-bit field which identifies the L2 segment. That allows for about 16 million segments, against the 4096 VLANs of 802.1Q: a scalable solution at first glance.
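To make the encapsulation concrete, here is a minimal sketch of mine (not from the draft, and with made-up function names) of packing and unpacking the 8-byte VXLAN header that sits between the outer UDP header and the inner Ethernet frame: flags (8 bits, where the I flag marks a valid VNI), 24 reserved bits, the 24-bit VNI, and 8 more reserved bits:

```python
import struct

VXLAN_FLAG_VALID_VNI = 0x08  # "I" flag: the VNI field carries a valid value

def encode_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags(8) | rsvd(24) | VNI(24) | rsvd(8)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # First 32-bit word: flags in the top byte, reserved bits zero.
    # Second 32-bit word: VNI in the top 24 bits, low byte reserved.
    return struct.pack("!II", VXLAN_FLAG_VALID_VNI << 24, vni << 8)

def decode_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the I flag."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VALID_VNI:
        raise ValueError("VNI-valid flag not set")
    return word2 >> 8
```

The full packet on the wire is then outer Ethernet / outer IP / outer UDP / this header / the original tenant frame.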

Flooding is elegantly solved by mapping each VNI to an IP multicast group, a mapping which must be provided by a management layer (i.e. vCenter or vCloud Director, etc.). The soft-switch element (VEM) in each hypervisor thus knows this mapping for the VNIs that are relevant to it (which the management layer knows at all times). The VEM then only has to generate IGMP joins towards the upstream physical network, which must of course provide robust multicast support. Flooded traffic is then optimally distributed through the upstream network to interested VEMs only, which rely on it to learn the source MAC addresses inside each VNI and map them to the source IP address of the originating VEM. Beautiful. I can see a great use case for BiDir-PIM (since each VEM is both source and receiver for each group), a protocol which is implemented well in hardware and software on Cisco's switching platforms.
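As a rough illustration of the forwarding state this implies, here is a toy Python sketch of mine (class and method names entirely hypothetical, nothing to do with actual N1KV code): the management layer populates the VNI-to-group mapping, known inner MACs are sent unicast to the owning VEM's IP, unknown/broadcast traffic is flooded to the VNI's group, and learning happens on decapsulation:

```python
class VemForwarding:
    """Toy model of per-VNI forwarding state in a VXLAN soft switch."""

    def __init__(self, vni_to_group):
        # VNI -> IP multicast group, provisioned by the management layer
        self.vni_to_group = dict(vni_to_group)
        # (VNI, inner MAC) -> outer IP of the remote VEM, learned from traffic
        self.mac_table = {}

    def outer_dst(self, vni, dst_mac):
        """Pick the outer destination IP for an inner frame.

        Known MACs go unicast to the VEM that owns them; unknown unicast,
        broadcast and multicast are flooded to the VNI's multicast group.
        """
        return self.mac_table.get((vni, dst_mac), self.vni_to_group[vni])

    def learn(self, vni, src_mac, outer_src_ip):
        """On decapsulation, bind the inner source MAC to the sending VEM."""
        self.mac_table[(vni, src_mac)] = outer_src_ip
```

This is the same learn-on-the-data-path model classic L2 switches use, just keyed by outer IP instead of physical port.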

Because the encapsulation is L4 (MAC-in-UDP), it lends itself very well to optimal bandwidth utilization: modern switches can use the L4 fields to perform better load balancing across ECMP paths or port bundles (certainly Cisco switches do this nicely).

Moreover, the first implementation, based on the Cisco Nexus 1000V, also leverages NX-OS QoS. With VXLAN's UDP encapsulation, DSCP marking on the outer header can be used to provide differentiated services per VNI, so a cloud provider could offer different SLAs to different tenants.
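As a toy illustration of such a per-tenant policy (the VNI values and the mapping are entirely made up, and this is not how N1KV configures it), each VNI gets a DSCP codepoint stamped on the outer IP header at encapsulation time:

```python
# Standard DSCP codepoints (RFC 4594 naming)
DSCP_EF, DSCP_AF31, DSCP_BE = 46, 26, 0

# Hypothetical per-tenant policy: premium tenant on VNI 5001,
# business tenant on VNI 5002, everyone else best-effort.
VNI_DSCP_POLICY = {5001: DSCP_EF, 5002: DSCP_AF31}

def outer_dscp(vni):
    """DSCP to mark on the outer IP header for a given VNI."""
    return VNI_DSCP_POLICY.get(vni, DSCP_BE)
```

The key point is that the marking lives on the outer header, so the transit L3 network can honor it without ever looking inside the tunnel.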

The final nice thing about VXLAN is the industry support. While it was unveiled at VMworld, and is currently only supported by vSphere 5 with the Nexus 1000V, the IETF draft is backed by Cisco, VMware, Citrix, Broadcom, Arista and other relevant industry players.

I personally like this approach very much, both for the industry-level support and from a technology point of view. I have blogged before that I believe an L3 design down to the ToR is great for building the DC fabric, with L2-in-L3 overlays on top for VM segmentation, so this fits nicely.

The only drawback I can see is the need to run multicast in the infrastructure. As an IP head and long-time CCIE I think this is just fine, but I know many customers aren't currently running multicast extensively in the DC and don't (yet) have the skills to properly run a scalable multicast network. On the other hand, as a Cisco employee and shareholder, I welcome this very much, since Cisco has probably best-in-class multicast support across its routing and switching portfolio and can bring a lot of value to customers here.

... when I have time, I'll also write about other approaches to solving these problems of cloud L2 segmentation, and about how I see SDN/OF playing in this space, but so far ... time is scarce ...
