
Friday, October 21, 2011

On Catalyst 6500 investment protection

I read an article in Network World about the real investment protection of the Catalyst 6500. I think the article is lousy and biased, focusing on an analysis of speeds and feeds and assuming that the only reason for network upgrades is raw throughput. Moreover, some comments on that article pointed in the same direction. Everything is a matter of opinion; here's my own:

Anyone who has run a network knows it's not just about how many ports a switch has and how many packets per second it moves. The article is simplistic in that regard ... and so are some of the comments below it ...

Yes, an Arista 7050 can do 64 line-rate 10GE ports in one RU. It had better ... it was designed in 2011, I'd expect it to do that. But can it do that while running inter-VLAN routing for 1,000+ different VLANs concurrently, hundreds of HSRP groups, ACLs with tens of thousands of lines applied to the SVIs, BFD on uplinks providing sub-second convergence, routing for 10K multicast groups or more, etc.? ... Can it be used to extend LANs over an MPLS backbone? ... Can it ...? The list would be TOO long. Way too long. And it would apply to switches from other vendors too.

Mid-size to large companies, and many small ones, NEED those capabilities. Running a network means running hundreds of VLANs with policy applied to them, etc. If a company doesn't need any of that, or just needs a dozen VLANs with static routing, sure, they can use a lower-end switch. And they do. And that's fine. And that's why Cisco also has other products in the portfolio.

The reality is, if a customer made an investment five or six years ago in a Catalyst 6500, he may also have considered at the time a Nortel 8600 (now virtually dead), an F10 E600 (no comment), a Foundry BigIron 8000 or an RX-16. Had they gone with one of those, look at how many hardware (or even software) upgrade options are left for them ...

But no. Most of them went with a Catalyst 6500, and as much as it hurts other vendors (and some people on the grey market, for reasons that escape my understanding), those customers know they made the right choice. And today, those customers have options for upgrading, increasing the value of their investment and solving new network problems. Those upgrade options may be great for some, and maybe insufficient for others (which is why Cisco also has other products in the portfolio).

And Art, another angle you are totally missing is the operational aspect. Running a Cat6500 with a Sup2-T presents no operational change at all for customers: no re-learning, no re-scripting, etc. THAT is also investment protection. THAT would have been impossible had they invested in another vendor when they chose to trust Cisco.



Finally, I'd like to add that the Sup2-T enhances the performance of a Catalyst 6500 in ways that go beyond pps. Any Catalyst 6500 running combinations of 67xx, 61xx and other cards gets, just by replacing the Sup with a new Sup2-T, up to 4x more VLANs, 4x more entries in the NetFlow TCAM, Flexible NetFlow, VPLS available on every port, SGT/CTS and many more features.


So yes, there is investment protection. And even more when you compare with other products that have been in the market for a while, no matter which vendor you pick.

Saturday, October 15, 2011

On SDN, and OpenFlow - Are we solving any real problem?

I must admit that I started looking at SDN and OpenFlow with a lot of skepticism. It is not the first time I have faced a networking technology which proposed a centralized intelligence to set up the network paths for traffic to go through. Such thinking always reminds me of old circuit-based networks and the Newbridge 46020, which brought TDM-style path setup to FR and ATM and, when making inroads into ADSL, was offered as the solution to all evils. ATM LANE was, in a different way, another "similar" attempt.

Being a long-term CCIE makes me an IP-head, no doubt about it, and a controller-based network is something I really need to open my mind to in order to even consider it. I am trying, though ...

The way I look at it, the question about SDN and OpenFlow alike is: what problem are we solving?

So far, I see three potential problems we are trying to solve:

-  scale: help building more scalable networks spending less money
-  complexity: simplify running a complex network
-  agility: simplify or speed up the implementation of network services & network virtualization

In this blog post I try to look at SDN/OF through those three potential problems. I make little to no distinction between SDN and OF, which is not strictly right because they aren't quite the same thing. Things like 802.1X, or even LISP to some extent, could somehow be considered Software Defined Networking, in that the forwarding and/or the policy is defined in a "central" software engine or database lookup.

But for the purpose of this post, I really look at OF as THE way for implementing SDNs. But before I begin ...

A Common Comparison I consider flawed ...

Many people point to existing wireless LANs as proof that the controller-based approach works. Sure, look at wireless networks today, almost all of them use a controller-based approach ... well ... yes, but no. The biggest difference with OF from a networking perspective is that the WLAN controller FORWARDS all the traffic, which is tunneled in an overlay from each of the access points. So the controller IS part of the datapath, an intrinsic part of it. In fact, it can be the bottleneck.

Moreover, the density of attachment points in a WLAN is orders of magnitude lower than that of a datacenter fabric, so any analogy is, IMHO, flawed.

Scalability

From a scalability perspective, my first take at SDN and OpenFlow was focused on two points which I looked at as big limitations:

1. a totally decoupled control plane (perhaps centralized, albeit distributed in a cluster) requires an out-of-band management network, which could limit scale and reliability (plus add to the cost)
2. programming "flows" on device TCAMs using OF will not scale, or at least will not provide any savings in the CAPEX driven by the networking hardware itself

I see point one above as less of a limitation now, so long as we really succeed at achieving a large simplification of the network in all other areas (beyond management) thanks to the SDN approach.
It isn't unusual to have an OOB management network in tier-1 infrastructures anyway. In the SDN/OF approach, however, the OOB network is a truly critical asset (even more than critical ...). It must have redundancy with fast convergence built into it and be built to scale as well. We also need to factor in the cost of running and operating this network. And as the policy and/or number of flows grows, the cost of the controller cluster itself may be non-negligible.

Point two above is still one where I need to better understand how things will be done. At first glance, I thought this was going to be a big limitation because I assumed each network forwarding element would be managed like a remote linecard, programmed perhaps using OpenFlow. In that case, the hardware table sizes of each network element would limit the entire infrastructure, because for L3 forwarding you want to keep consistent FIBs, and for L2 forwarding even more so, to minimize flooding. Hence, if your forwarding elements are limited to, say, 16K IPv4 routes, that's the size of your OpenFlow network ... There are ways to optimize that by programming only part of the FIB, which is possible since the controller knows the topology. But then, if there are flow exceptions ...

But then of course, things change if you consider ... why would you need to do "normal" L2 or L3 lookups for switching packets at all? You can (potentially) forget about L2 and L3 altogether. And then, I assume the "controller" could keep track of the capabilities of each node, including table sizes, and program state only as needed and where needed. This adds complexity to the controller, but should help scaling.

But can this scale if hardware is programmed per flow?

I still don't really see this happening. There are two issues here: scaling the flow setup (a software process), and scaling the hardware flow tables. I understand the flow setup is not necessarily THE problem, but still, let's review it. Let's say the first packet is sent to the controller for the flow to be set up, all of this over the OOB network. This will add delay to the initial communication and put load on the controller, but there is no reason why this can't scale out with multiple controller servers in a cluster, or by splitting the forwarding elements between different instances of the controller. All of this adds to the cost of the solution though (and adds management complexity too).

But what I don't see is this scaling at the forwarding chip level. I wonder how to program hardware with the SDN approach. The OpenFlow way seems to be to do it on a per flow basis, leveraging table pipelining.

Of course it all depends on what we call a flow. If we take source/destination IP addresses plus TCP/UDP ports, any aggregation switch will easily see hundreds of thousands of flows at any given time. Even at the ToR level the number of flows will ramp up rapidly. This would easily overwhelm the best silicon available from vendors such as Broadcom, Fulcrum or Marvell. We can indeed limit the definition to source/destination MAC addresses, or IP addresses for that matter, but that limits a lot of what you CAN do with the packet flows. So if host A wants to communicate with host B, that is two flows, a->b and b->a, if you define a flow by source/destination MAC address. In this case, let's assume you have 48 servers connected to a ToR switch. Let's say there are 10 VMs per server (40 is very common in today's enterprises, by the way). Let's say each VM needs to talk to two load balancers and/or default gateways plus 10 other VMs. This means each VM would generate 12-14 flows. So each ToR switch would see 48 x 10 x 12 = 5,760 flows (times two, because they have to be bi-directional). Now that isn't too much; chips like Trident can fit 128K L2 entries, which in this case would mean flows if we define them per MAC address. But think of the aggregation point which has 100+ ToR switches connected. Those switches need to handle 576,000 flows (times two). Way more if you assume more than one vNIC per VM.
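
Spelling that arithmetic out (the per-VM flow count and the 100-ToR aggregation figure are the assumptions from the paragraph above):

```python
# Rough flow-count estimate for a MAC-to-MAC flow definition.
# All inputs are the assumptions used in the text above.
servers_per_tor = 48
vms_per_server = 10
peers_per_vm = 12            # gateways/load balancers plus other VMs each VM talks to

flows_per_tor = servers_per_tor * vms_per_server * peers_per_vm
print(flows_per_tor)         # 5760 unidirectional flows, ~11520 counting both directions

tors_per_aggregation = 100
print(flows_per_tor * tors_per_aggregation)   # 576000 flows at the aggregation point, times two
```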

At any rate, if you want to handle overlapping addresses, you also need to add some other field to the flow mask ... So I still don't see how this can scale at all, certainly not using "cheap" hardware.

In the end, if you want to run fast, you'll need to pay for a Ferrari, whether you drive it yourself or have a machine do it for you.

But I do see a benefit if we can run the network forwarding elements in a totally different way than we do today. The options for virtualizing the network can be much richer, this is true, but that would come at a cost (discussed below). I also see the point of scaling the network beyond what current network protocols allow, which can be interesting. I certainly understand the interest from companies running very large datacenters, which tend to have very standardized network topologies that can benefit a lot from the SDN approach. But at the same time, I do not see why standard routing protocols can't be made to scale to larger networks too ...

I doubt OpenFlow will be the right approach in the enterprise for quite a while, because in that world you rarely build 100% from scratch. There is always a need to accommodate legacy, and this will mandate "traditional" networking for quite a while, no doubt (if SDN ever really takes off, that is). Sure, I know companies like Big Switch are looking at ways to use OF as an overlay on existing networks. We will have to wait and see how this works ...


Simplify Running a Complex Network

This point is a tough one. What is simple to some is complex to others. Someone with a networking background will not consider running an ISIS network which implements PIM-SM difficult, while someone with a software development background will see it as very complicated. Likewise, the same software developer may think that running a cluster of servers which controls other "servers" (that each do packet forwarding) is very simple.

SDN looks, on PowerPoint, very promising for network management simplification. But when you begin to dig into the details, on how to do troubleshooting, how to look deep into the hardware forwarding tables, how to ensure path continuity or simply test it, etc., you begin to see that what was simple in concept becomes more complex.

Simplify Network Services

A lot of the writing I have seen around this focuses on the idea that once we have a "standard" way of programming the forwarding hardware (OpenFlow, that is), then all forwarding actions become sort of like instruction sets on a general purpose CPU. Hence, all network problems become solvable by writing programs that operate on the network using such instructions.

I have typically seen two examples given as quick wins of this approach: load balancing and network virtualization. Both hot topics in any data center. Others point to fancier ones, like unequal load balancing, load-based routing, or even shutting down unused nodes. The latter speaks for itself and is foolish thinking ...

All the others CAN be done with traditional networking technologies, and if they are not implemented, it is typically for very good reasons.

On the point of load balancing and network virtualization, what I have seen so far are discussions at a very high level, which show how this can indeed be done. OpenFlow-heads praise how this is going to be not only simple, but even free! ...

Implement a load balancer? Nothing simpler or cheaper. The fantastic OF controller will simply load balance flows depending on the source address, for instance. Done. Zero dollars to it. Of course, this ignores the point of hardware (and software) flow table scalability, already mentioned above. Of course it ignores the fact that a load balancer does A LOT MORE than push packets out of various ports depending on source address ... it keeps track of real IP addresses, polls the application to measure load, offloads server networking tasks, etc. There's a reason why people invest in appliances from Cisco or F5 to do load balancing. Switches (from multiple vendors) have been able to do server load balancing for a long time, but what you can do there just isn't enough for most applications. OF changes nothing there.

Network virtualization is another one where the complex becomes oh so simple thanks to SDN and OF. I admit to writing here in ignorance of the actual work of companies like Big Switch or Nicira, of course. But most of what I read boils down to implementing an overlay network with a control plane running in software. Nothing different from many other approaches today, or from what would be done using VXLAN or NVGRE. At any rate, I would argue LISP is a better choice, but alas, it does not solve the L2 adjacencies which are required for clustering and other reasons (which OF doesn't solve either).

I have seen many others propose that, thanks to OF, one can easily program the controller to use fields like MPLS tags, or VLAN tags, etc. to do segmentation à la carte. This is true, and fine. But I wonder, how is this good?!

And how is it different from doing traditional networking? Sure, Cisco, Brocade, Juniper, F10 and others could have decided to change the semantics of existing network fields to implement segmentation and many other features. And sometimes we have seen this done. But in so doing, they become proprietary. They don't interoperate.

IF a controller software vendor X provides a virtualization solution that works that way (by redefining the semantics of existing fields), it offers a solution that locks the customer in with that software vendor. A solution for which a tailor-made gateway will be required to connect to a standards-based network, or to a network from any other company.

Imagine company Z, which runs a DC with a controller from software vendor X. Imagine company Z merges with company Y, which runs a controller from a different vendor ... imagine the trouble. Today, company Z running Juniper merges with company Y running Cisco and they connect their networks using OSPF or BGP, or just plain STP if they are not very skilled ...

I am sure I am missing the obvious when so many bright people praise OpenFlow. I just can't see how it solves problems that we can't solve today, or how it does so better. Or how it will really make for a better industry. Many would like to see a world where networking isn't dominated by two vendors. OF, at best, could change who those two vendors are, nothing more.

Conclusion ... (for now)

I think OpenFlow is a very interesting technology, and the SDN paradigm one that can contribute many good things to the industry. But up until today, almost everything that I have read about both topics is very high level, and idealistic to the point of being naive. In my opinion, a lot of the assumptions made about the problems that OpenFlow will solve are made without knowledge of other existing solutions. I have yet to see an expert in IP and MPLS praise OpenFlow for solving problems that we couldn't solve with those two, or for solving them in a clearly advantageous way.

So far to me, in my admitted ignorance of things, OpenFlow looks like just a different way to do the same. And for that ... what for?



Wednesday, September 7, 2011

SDN won't solve network problems if you just try to dismiss them

I am very excited overall about SDN, and recently about VXLAN, a protocol that is well suited for delivering L2 broadcast domains to support IaaS and that sort of follows the SDN paradigm (a software controller instantiating the L2 domains as needed).

But I still have the perception that certain vendors, VMware in particular, show a lack of interest and knowledge in anything related to networking. The blog post on VXLAN by Allwyn Sequeira has some comments (and some minor mistakes) which feed my perception.

To begin with, there is the almost constant mantra that networks need to become "fast, fat and flat". For years networks have been faster than most applications could leverage (we have had 10GE for years, but no server was capable of using that capacity until a couple of years ago, and even today many servers being deployed do not have that capacity). The other part of the mantra (fat and flat) just shows ignorance, and I am sorry I can't be polite about it. Even more so when put in the context of VXLAN.

Engineering a network to perform, scale and provide fast convergence isn't an easy task. Period. VXLAN is cool, yes, but it looks like Allwyn, and many others too, forget the minor detail that for it to work, it requires a (very well performing) L3 multicast network. Of course this comes as no surprise from someone who writes "[...] tenant broadcasts are converted to IP multicasts (Protocol Independent Multicast – PIM).". IP multicast and PIM are two different (but of course tightly related) things. It is funny that people see "24-bit ID, so I can run millions of VNs" ... so cool ... has anybody thought about running millions of (S,G) entries in the network? ...

Of course I imagine you can (and will) group several VNIs mapped to a single (S,G), but still ... Anybody with networking experience knows that running a multicast network with thousands of entries isn't a simple task, even less so if you want to achieve sub-second convergence on any network failure (hint: there is no fault tolerance built into VXLAN, but this is a minor detail for the VMware folk ...).
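
As a minimal sketch of that grouping idea: hash the 24-bit VNI down to a bounded pool of multicast groups, so millions of VNIs never become millions of (S,G) entries in the core. The pool size and group range below are made-up illustration values, not anything taken from the VXLAN draft.

```python
import ipaddress

GROUP_POOL_SIZE = 512                                    # illustrative: bounded state the fabric must carry
BASE_GROUP = int(ipaddress.IPv4Address("239.1.0.0"))     # illustrative admin-scoped range

def vni_to_group(vni: int) -> str:
    """Map a 24-bit VNI onto one of GROUP_POOL_SIZE multicast groups."""
    assert 0 <= vni < 2**24
    return str(ipaddress.IPv4Address(BASE_GROUP + (vni % GROUP_POOL_SIZE)))

# Millions of VNIs, but the multicast state in the core stays bounded:
print(vni_to_group(5000))                      # some group inside the 239.1.0.0/23 pool
print(vni_to_group(5000 + GROUP_POOL_SIZE))    # the same group: both VNIs share the multicast state
```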

Has anybody considered that current ToRs from merchant silicon vendors can't run more than 1K-2K mroutes in hardware? Or that they don't support BiDir PIM? ...

Bottom line, running networks isn't complicated because network-heads are evil and want to ruin the happiness of application writers. There's more to it than evilness ...

I write this with the utmost respect for Allwyn and VMware in general. I just wish that one day I'll see bright people from the application world be open and willing to work with the stuff they don't know or don't understand, as opposed to simply dismissing it and expecting it to be fast, fat and flat (... and dumb, they'd gladly add, for sure).

VXLAN is a cool protocol that does not solve any network complexity problems, but provides a great way to abstract L2 edge domains in virtual environments.

Wednesday, August 31, 2011

A new way to do L2 connectivity in a cloud: VXLAN

I have written before about the issues of L2 and L3 in the context of DC infrastructure design. As people look to build denser DCs, they are faced with having to run a network with a large number of networking ports and devices. Clearly this was a no-go for STP-based networks, and people recognized it years ago. But then, the challenge was: we need L2 to run distributed clusters, do VM vMotion and so many other things ... the DC network HAS TO BE L2! ...

And we learned and talked about TRILL, of which the only shipping incarnation I know of is Cisco's FabricPath (pre-standard, but TRILL-like, with enhancements to the standard). True, FP and TRILL enable us to run larger L2 networks, with a larger number of VLANs and higher bandwidth than ever before with STP (which had no multi-pathing at all).

But then there is another challenge, or a few at least ... (1) will we really be able to scale L2 networks, even based on TRILL/FP, to the levels of L3? ... (2) how do we implement multi-tenancy, which essentially boils down to creating L2 overlays which can support multi-VLAN environments inside the overlay?

Or better said: if we MUST create L2 overlays over the physical infrastructure, to provide segmentation of the virtual machines running on the hypervisors across the infrastructure, does it make sense to make that infrastructure L2? What for? ...

Yesterday's news was good in this sense, as VMware and Cisco announced the first implementation of a new emerging standard called VXLAN. I like this standard a lot at first glance because it decouples the two sides of the problem: (1) how to build the DC networking infrastructure and (2) how to provide L2 segmentation to enable multi-tenancy. VXLAN provides a solution to the second, and one which works with either approach to the first: L2 or L3 designs.

VXLAN implements a tunneling mechanism using MAC-in-UDP to transport L2 over an L3 network (and by using an L4 encapsulation there are great benefits). It also defines a new header field as part of the VXLAN encapsulation: the Virtual Network Identifier (VNI), a 24-bit field which identifies the L2 segment. A scalable solution at first glance.
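
For those who like to see the bits, here is a rough sketch of what the encapsulation looks like: the UDP payload starts with an 8-byte VXLAN header carrying the 24-bit VNI, followed by the original L2 frame (outer MAC/IP/UDP headers omitted). This is only an illustration of the header layout, not a reference implementation.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags byte with the 'I' bit set, 24-bit VNI, reserved bits zero."""
    assert 0 <= vni < 2**24
    flags = 0x08                                   # 'I' flag: the VNI field is valid
    return struct.pack("!B3xI", flags, vni << 8)   # VNI sits in the upper 24 bits of the last 32-bit word

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """UDP payload = VXLAN header + original L2 frame (outer headers not shown)."""
    return vxlan_header(vni) + inner_frame

payload = encapsulate(b"\xff" * 60, vni=4096)
print(len(payload))   # 68 bytes: 8-byte VXLAN header + 60-byte inner frame
```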

Flooding is elegantly solved by mapping VNIs to IP multicast groups, a mapping which must be provided by a management layer (i.e. vCenter or vCloud Director, etc.). So the soft-switch element (VEM) in each hypervisor will know this mapping for the VNIs that are relevant to it (which is known at all times by the management layer). The VEM then has but to generate IGMP joins towards the upstream physical network, which must of course provide robust multicast support. Flooding is then optimally distributed through the upstream network to interested VEMs only, which rely on it to learn the source MAC addresses inside each VNI and map them to the source VEM's address. Beautiful. I can see a great use case for BiDir PIM (since each VEM is both a source and a receiver for each group), a protocol which is well implemented in hardware and software on Cisco's switching platforms.

Because the encapsulation is L4 (MAC-in-UDP), it lends itself very well to optimal bandwidth utilization, because modern switches can use the L4 fields to perform better load balancing across ECMPs or port bundles (certainly Cisco switches work nicely for this).
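
The reason the L4 encapsulation helps: the encapsulating endpoint can derive the outer UDP source port from a hash of the inner frame, so different inner flows get different outer 5-tuples and spread across ECMP paths or port-channel members. A sketch of that idea (the hash function and the port range are illustrative choices, not mandated by the draft):

```python
import zlib

def outer_udp_source_port(inner_src_mac: bytes, inner_dst_mac: bytes) -> int:
    """Derive an outer UDP source port from a hash of the inner frame headers.

    Different inner flows -> different outer source ports -> different ECMP/port-channel members.
    """
    entropy = zlib.crc32(inner_src_mac + inner_dst_mac)
    return 49152 + (entropy % 16384)     # keep it in the ephemeral range (illustrative choice)

print(outer_udp_source_port(b"\x00\x11\x22\x33\x44\x55", b"\x00\xaa\xbb\xcc\xdd\xee"))
print(outer_udp_source_port(b"\x00\x11\x22\x33\x44\x66", b"\x00\xaa\xbb\xcc\xdd\xee"))
```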


Moreover, the first implementation, based on the Cisco Nexus 1000V, also leverages NX-OS QoS. With VXLAN's UDP encapsulation, DSCP marking can be used for providing differentiated services for each VNI. So a cloud provider could provide different SLAs to different tenants.

The final nice thing about VXLAN is the industry support. While unveiled at VMWorld, and currently only supported by vSphere 5 with Nexus 1000V, the IETF draft is backed by Cisco, VMware, Citrix, Broadcom, Arista and other relevant industry players.

I personally like this approach very much, both for the industry-level support and from a technology point of view. I have blogged before that I believe an L3 design to the ToR is great for building the DC fabric, and then leveraging L2-in-L3 for building the VM segmentation overlays, so this fits nicely.

The only drawback I can see is the need to run multicast in the infrastructure. As an IP-head and long-time CCIE I think this is just fine, but I know many customers aren't currently running multicast extensively in the DC and don't (yet) have the skills to properly run a scalable multicast network. On the other hand, as a Cisco employee and shareholder, I welcome this very much, for Cisco has probably the best-in-class multicast support across its routing and switching portfolio, so it can bring a lot of value to customers.

... when I have time, I'll also write up other approaches to solving these problems of cloud L2 segmentation, and how I see SDN/OF playing in this space, but so far ... time is scarce ...


Saturday, August 27, 2011

An example of poor network design, or how to look for trouble

Everybody these days is trying to do more with less. This is fine, but the challenge is that sometimes you end up doing less with more. I have recently found a paper from a networking vendor with a recommendation which falls into that category and I wanted to write about it.

Companies are looking to upgrade their infrastructure from GE to 10GE, and many are considering whether to upgrade existing modular switches or deploy new ones. There are many variables to consider here but, clearly, newer switches are denser in terms of 10GE (as one could expect from the natural evolution of technology). Two platforms in particular, the Cisco Catalyst 4500 and 6500, have demonstrated impressive evolution over time. I've known many customers using the latter in data center environments, and when I look back five years, I am sure those customers can recognize that they made the right investment in the Catalyst 6500. I believe NONE of the high-end modular switches which the Catalyst 6500 was competing with five years ago is still a valid option in the marketplace. Customers who chose to go with Nortel Passports, Foundry BigIrons or MLXs, Force 10 ... would find themselves today with platforms which have had no future for a couple of years already, very limited upgrades, and poor support. On the other hand, the Catalyst 6500 still offers bleeding-edge features and options for software and hardware upgrades to enhance performance.

But it is clear that there are way denser switches for DC 10GE deployments, starting with the Cisco Nexus 7000 of course. Customers need to evaluate what is best for them, and each case is different.

An idea which comes to some, and is recommended by at least one network vendor as I wrote earlier, is to front-end existing switches (which are less dense in 10GE port count) with low-cost 10GE switches to provide a low-cost, high-density 10GE fan-out. The following picture shows this "design" approach:



In my opinion this is a bad idea, very bad network design, and it is looking for trouble. Moreover, I think this is an approach which may end up being "doing-less-with-more", even if, at first glance, it may look "cheap" to build. In this post I will try to explain the reasons why I think this way.

Technical Reasons

Multicast Performance - Switches constrain multicast flooding by implementing IGMP snooping. In short: as multicast receivers send IGMP join messages to signal they want to join a group, the switch's control plane snoops the traffic and programs the hardware, installing an entry (hopefully with the S,G information) which points to the port on which the IGMP join was received. Without IGMPv3 support, upon receiving a leave message, the switch in the middle must forward the message upstream, but only after it has sent a query downstream to ensure no more receivers are left on the downstream switch. All in all, this adds complication, potential points of control plane failure, and inevitable latency to the join and leave processes. That means: a receiver will take longer to get multicast traffic after requesting it, and the network will take longer to prune multicast traffic when the receiver leaves a group.
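
A toy model of what that snooping control plane does, just to make the join/leave mechanics concrete (a deliberately simplified sketch; real implementations track IGMP versions, timers, queriers and much more):

```python
from collections import defaultdict

class IgmpSnoopingSwitch:
    """Toy model: map multicast group -> set of ports with interested receivers."""

    def __init__(self, uplink_port: str):
        self.uplink = uplink_port
        self.group_ports = defaultdict(set)

    def on_join(self, group: str, port: str):
        # Constrain flooding: forward this group only to ports that asked for it (plus the uplink).
        self.group_ports[group].add(port)

    def on_leave(self, group: str, port: str):
        # Without fast-leave, a real switch first queries the port before pruning it.
        self.group_ports[group].discard(port)
        if not self.group_ports[group]:
            del self.group_ports[group]        # last receiver gone: prune and signal upstream

    def egress_ports(self, group: str):
        return self.group_ports.get(group, set()) | {self.uplink}

sw = IgmpSnoopingSwitch(uplink_port="Uplink1")
sw.on_join("239.1.1.1", "Eth1")
print(sw.egress_ports("239.1.1.1"))   # forwards to Eth1 plus the uplink only
```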

Multicast Scalability - This is a bigger issue than the former one. Networks built with Catalyst 6500 modular switches can scale up to 32K multicast entries in hardware. Typical ToR switches support between 1K and 2K entries {CHECK FOR ARISTA}. These figures are OK when a switch with 48 ports connects to 48 servers. But if you use that switch to connect 48 access switches, each with 48 servers, then you are looking at providing connectivity for in excess of 2,000 devices. The math is clear on how much you are limiting yourself with this design.
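
The arithmetic behind that limitation, using the figures above (the groups-per-server figure is an illustrative assumption):

```python
# Hardware multicast route capacity vs. the receivers behind an aggregation point.
mroute_capacity_modular = 32_000     # Catalyst 6500 class, per the text above
mroute_capacity_tor = 2_000          # typical fixed ToR, per the text above

access_switches = 48
servers_per_access_switch = 48
servers_behind_aggregation = access_switches * servers_per_access_switch
print(servers_behind_aggregation)    # 2304 servers behind one 48-port "aggregation" ToR

groups_per_server = 4                # illustrative assumption
entries_needed = servers_behind_aggregation * groups_per_server
print(entries_needed > mroute_capacity_tor)       # True: 9216 entries vs ~2K available in a ToR
print(entries_needed > mroute_capacity_modular)   # False: fits comfortably in 32K
```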

Buffering - Fixed form factor switches are typically designed to connect end-points. A ToR is designed to connect servers. The buffering available in those switches is therefore suited to that application, and expects little to no contention on server-facing ports, with the only practical contention point being the uplinks. When a 48-port ToR switch is used in the access, it deals with the traffic and burstiness of 48 servers or fewer (duh!) ... but if you use it to aggregate 48 access switches, each with 48 ports, then it aggregates the traffic of 2,000+ servers. It is VERY obvious, without much more analysis, that you'll end up in trouble, with silent drops.

The latter is probably the most serious limitation of this design, because it will affect all possible types of deployments. There are others, such as how to deal with QoS, LACP or spanning-tree compatibility, the impact on convergence in failure scenarios ...

Operational Reasons

Software Upgrades - Since you are adding one more layer to manage, you now have more devices to upgrade, potentially with different software and upgrade procedures, and you need to evaluate the impact of upgrades on convergence. Apart from the Cisco Nexus 5000 series, no ToR in the market provides ISSU support.

Provisioning - Setting up VLANs, policy and many other network tasks gets more complicated because you need to provision them on more devices. If spanning-tree needs to run (and this is recommended on L2 networks using multi-chassis etherchannel, as a safeguard mechanism), you also need to consider that ToR switches may only support a limited number of spanning-tree instances.

Support - Troubleshooting gets more complicated. Instrumentation will likely be different between ToR switches and high-end modular devices. IF the ToR is from a different vendor than the rest of the network, this gets even more complicated, because the technical support services of two or more vendors may need to be involved to deal with complex troubleshooting issues.

The list could go on ... I sure hope customers consider things carefully. Something which is cheap today and may "just work" today could be extremely expensive in the long run, or simply no longer work for (near) future requirements.


I believe that in most cases poor network design comes as a result of lack of knowledge. Many people still think that to build networks you just need switches and ports. So all that counts is how many ports you need, how many switches, and off you go ...

Lack of knowledge is not a bad thing (I mean, nobody is born knowing stuff), and it can be solved with reading, training, etc. Now, when the poor network design comes from a networking vendor's document, you have to ask yourself how much you can trust them ... for THEY should not lack the knowledge to do things right in networking.

Sunday, July 24, 2011

Yes, the math is misleading ...

I have had more people reading this than I anticipated, and a couple of good comments which I want to reply to in a proper way, so I decided to make it another post ...


Mark, thank you for your comments. They really add to my post as they provide more insight and context, which was lacking in the interview: the hypothetical 3MW DC is for a cloud provider, built on L3 from the access.

And in that sense, I stand by my comment: the math presented in that interview is indeed misleading. I will explain below why I say this ... But first, to your comments :-)

1. When I mentioned "claim to support 384 10GE ports" I did not imply that I doubt it to be true. I am writing here based on my knowledge and experience only :-) I take your word for it, of course.
2. Good catch on the Arista 7050-64 power consumption. Believe me it was not intentional. I have corrected this :-)
3. Agreed that cloud providers like standardization and could prefer to operationalize cabling with a pair of ToRs per rack. Again, the interview didn't mention which kind of customer we are talking about ... Enterprise customers could think differently and would very much welcome racking ToRs every other rack, because it simplifies network operation by quite a bit (managing 84 devices vs 250).
4. There was no "Cisco defense" because I do not take this as any attack :-) ... and I write here for the fun of it. I simply stated the fact that you picked platforms which don't allow a fair comparison. You say that Cisco's best-practice design is with the 5548 and FCoE ... Where did you read a paper from Cisco that recommends such a thing for an MSDP? ... However, an Enterprise datacenter with a high degree of virtualization will in many cases require support for storage targets on FC, and FCoE is a great solution to optimize the cost of deploying such infrastructure. L2 across the racks is also a common requirement in this case, not just for vMotion, but for supporting many cluster applications. L3 access in most Enterprise DCs simply (and sadly) does not apply ...

As I said, your comment brings insight, because in the interview there is no mention of what the hypothetical datacenter was for, or what kind of logical design you would use. No context at all. You say it is to be built using L3 at the access layer. I concur with you that the limitations I mentioned do not necessarily apply then, for the access switches won't need to learn that many ARP entries, and the distribution switches can work with route summarization from the access, so smaller tables could do the job.

I am well aware that most MSDPs use L3 from the ToR layer. I am sure that you are also well aware that most Enterprise virtualized and non-virtualized datacenters do not use L3 from the ToR. A 5,000-server DC could be found in either market space. I am sure you are also well aware that Cisco's recommended design with the Nexus 5500 and Nexus 7000 (also leveraging FCoE) is primarily intended for Enterprise datacenters, where it provides a lot of flexibility.

I still can't see what the topology looks like in your design anyway, with 12 spine switches. I just cannot see how to spread 16 uplinks per ToR/leaf across 12 spine switches and keep even ECMP ... I guess the design is made up of pairs, so six pairs of 7500s, each pair then aggregating about 40 ToRs, with 8 links from each ToR to each of the 7500s. But if this is the case, each 7500 pair has no connection to the other pairs unless a core layer is deployed, which perhaps is assumed in the example (but not factored into the power calculations??).
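
The arithmetic behind that guess, using only the assumptions stated above:

```python
# Six pairs of 7500s, ~250 ToRs total, 16 uplinks per ToR split 8+8 across a pair.
tor_switches = 250
spine_pairs = 6
tors_per_pair = round(tor_switches / spine_pairs)
print(tors_per_pair)                   # ~42 ToRs hanging off each 7500 pair

uplinks_per_tor_per_spine = 8
ports_needed_per_spine = tors_per_pair * uplinks_per_tor_per_spine
print(ports_needed_per_spine)          # ~336 ports per 7500, which fits under its claimed 384
```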

So again, I think your comment confirms my point, while also acknowledging that for non-L3 designs, the limitations that I mentioned do apply.

What do I mean by saying you confirm my point? The math in the interview is presented without context, so it creates the perception that the comparison applies to any and every environment. In fact, it only applies in a specific environment (an L3 design), and completely reverses in others (an L2 design ... probably the more common one, btw). So you picked an area to your advantage, exaggerated it (by picking Cisco's most sub-optimal platform for such a design), and presented it as general. That is misleading.

To answer Doug ... Let me first say that I respect you a lot, and I have very, very much respect for Arista. It is a company with a lot of talent. But talent lives in many places, including Cisco too. My blog post was to state that the math is misleading. Nothing more. I still believe it is, as stated above. You say "Nillo had to significantly modify the design, away from the specification" ... what? :-) ... what specification? None was provided during the interview ... how could I change it?!

To your challenge, two things to say:

1. Can YOU spec a 3MW datacenter where you can do live vMotion between any two racks and allow any server to mount any storage target from any point in the network? :-D
2. I write here for the fun of it and in my free time ... oh ... and I choose what I write about ;-) (not really the "when", because my free time isn't always mine ;-) ).

... I thought nobody would care a bit about what I write ... wow ...

Wednesday, July 20, 2011

Building a network takes more than switches and ports (or why Arista's math is wrong ...)

I couldn't help reading a recent interview with Arista's CEO on NetworkWorld. A very good interview, with nice insight on certain topics by the always bright Jayshree Ullal. However, there were claims in that interview which are ... misleading, to be polite.

The interview can be found here: http://www.networkworld.com/community/blog/QA-with-jayshree-ullal-ceo-arista-networks

In it, they talk of a hypothetical network to support 5,000 servers. The number comes from a hypothetical 3MW datacenter, assuming all servers are of the same kind, each consuming 600W. In a perfect world, that's 5,000 servers. They go on to say this fits in 125 cabinets, so putting 40 servers per cabinet and therefore considering 24kW of power per cabinet ... Pretty aggressive in conventional datacenters, but certainly possible.

So considering standard 42RU cabinets, the math goes that you need two 1RU ToR switches per cabinet, and the desired oversubscription at the first hop is 3:1. No doubt the hypothetical scenario was chosen to obtain the best fit for Arista's boxes, but even so, I think the math fails.
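
The interview's sizing, just spelled out:

```python
# The interview's sizing, spelled out.
dc_power_w = 3_000_000        # 3 MW
server_power_w = 600
servers = dc_power_w // server_power_w
print(servers)                # 5000 servers (in a perfect world: all the power goes to servers)

cabinets = 125
servers_per_cabinet = servers // cabinets
print(servers_per_cabinet)                    # 40 servers per cabinet
print(servers_per_cabinet * server_power_w)   # 24000 W = 24 kW per cabinet
```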

Before I begin, a couple considerations with this approach:

  • at least in the EMEA region, I have seen few designs request such low oversubscription ratios. 4:1 is very common in scenarios where local switching at the ToR isn't even required, with 8:1 being common where local switching is available. But OK, large SPs deploying dense cloud infrastructure would push for lower oversubscription.
  • there are other ways to handle cabling than to put a pair of ToR switches per cabinet. With CX-1 cabling to the server, there are cables which go up to 10 meters, which allows for several cabinets of distance, so it isn't uncommon to see a 2RU device used every other rack, which reduces the number of managed devices and also power consumption.

Why the math is misleading 

Arista considers using a 7050-64 (64x10GE, 125 Watts), which is good enough for 40 ports to the servers and 16 10GE uplinks, leaving some spare ports unused. The ToR devices will be aggregated on Arista's 7500. With 16 uplinks per ToR, 2 ToRs per cabinet and 125 cabinets, you need 4,000 10GE aggregation ports in the upper layer. The statement goes that you need 12 Arista 7500 series switches at the aggregation layer and 250 Arista 7050 ToR switches. A figure of 75.6kW worth of power is quoted: the main variable here.

(I would again highlight that few organizations today deploy 4,000 10GE aggregation ports to provide connectivity to 5,000 servers ...)

How are the ToR devices connected to the 12 upper-layer devices? This isn't mentioned. The Arista 7500 claims to support 384 10GE ports. On a two-spine design, you can therefore support 48 ToR switches. On a four-spine design, 96 ToR switches. On an eight-spine design, 192 switches. A 12-spine design doesn't divide 16 uplinks evenly, so it cannot keep an even topology ...
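
The fan-out math behind those numbers:

```python
# Uplink and spine fan-out math from the figures above.
cabinets = 125
tors_per_cabinet = 2
uplinks_per_tor = 16
print(cabinets * tors_per_cabinet * uplinks_per_tor)   # 4000 10GE aggregation ports needed

spine_ports = 384                      # claimed per Arista 7500
for spines in (2, 4, 8):
    uplinks_per_spine = uplinks_per_tor // spines
    print(spines, spine_ports // uplinks_per_spine)    # 2 -> 48 ToRs, 4 -> 96, 8 -> 192
# 16 uplinks do not divide evenly across 12 spines, hence the uneven-topology remark.
```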

Another thing isn't mentioned: how do they keep redundancy and multi-pathing? M-LAG currently works across two spines ... No TRILL or equivalent yet. Where do they do L3?

Finally: 5,000 servers means a minimum of 5,000 MAC addresses and IP addresses. In fact more, because you will also have at least a management interface. If you run VMs on each server, then each VM is a minimum of one MAC and one IP. As per the current datasheet, the Arista 7500 supports only 16K MAC addresses and 8K ARP entries. At the first-hop router you must be able to keep one ARP entry per IP in the subnets. If each server has one IP for management and one for traffic (again, it will be more), that's already 10K ARP entries ...

So, how does this "green" network that Arista depicts work at all?! It can't keep all the information in the forwarding tables, and it can't manage redundancy across so many paths ...

If we do the math in reverse, with 10 VMs per server and two vNICs per VM, you need at least 150K ARP and MAC address entries.

With current Arista numbers, best you can do is around 800 servers.
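
The reverse math, spelled out with the same assumptions (10 VMs per server, 2 vNICs per VM, 16K MAC / 8K ARP on the 7500):

```python
# Endpoint-count math against the published 7500 table sizes.
mac_table = 16_000
arp_table = 8_000

servers = 5_000
vms_per_server = 10
vnics_per_vm = 2
macs_per_server = vms_per_server * vnics_per_vm   # VM vNICs alone; physical NICs and
                                                  # management interfaces add more
print(servers * macs_per_server)                  # 100000+ entries, heading towards the 150K above

# Reverse: how many servers actually fit within the 16K MAC table?
print(mac_table // macs_per_server)               # 800 servers
# The 8K ARP table at the first-hop router is an even tighter bound on the L3 side.
```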

Why the Cisco math reported by Arista is also misleading ...

For the Cisco example, they picked a design with the Nexus 5548. This ignores the fact that we also have a 1RU box with 64 10GE ports: the Nexus 3064. Had they chosen that box, we would use the same number of devices at the ToR and produce a similar power consumption figure at the edge (which is the bulk of the power number).

But we could also say that you can use the Nexus 5596, which gives us greater density. Use 60 server-facing ports and 20 uplinks, and rack them every other rack. This means we only need 84 devices at the access layer, which further reduces power consumption and management complexity.

For redundancy between the spine and access layers we could propose FabricPath, but this isn't yet available on the Nexus 5500 (although it will be in the months to come). It also scales better in terms of MAC address table explosion, because it uses conditional MAC learning. And the Nexus 7000 also scales far better in the number of L3 entries, capable of storing up to 1M of them.

So if we do the math by building a network that actually works ... Arista's would stop at around 800 servers and Cisco's could indeed scale to 5,000 servers.

Monday, July 11, 2011

One layer, two layers … how many do you really need?

I do not know how many times over the last few months I have had to deal with the "we want a network with fewer layers, one layer is the ideal" comment. In fact it is a recurring question. The number of customers that over the past 10 years have asked me why we need a core, or whether they can collapse aggregation and access, is significant.

In the end, it comes down to the perception that by collapsing layers you may save money, on smaller deployments, and simplify the network, on larger ones. I think that this perception is often wrong, and the truth is that these days several networking vendors are trying to exploit it in a misleading way.

Why and when do you need more than one layer?

I will look at this from a data center perspective, so I will consider that end-points connecting to the network are primarily servers. But the analysis is similar for campus, with a different set of endpoints (desktops, laptops, wireless access points, printers, badge readers, etc.) which have very different bandwidth & security requirements.

If the number of servers you have fits in a single switch, then you can do with a one layer approach [I know, some people may read this and be thinking "some solutions out there convert the network in a single switch" … I'll get to those in other blog posts too].

So from a physical point of view this is obvious: if you can connect all your servers to one single switch, you need one layer. Now, you want redundancy, so you will want each server to connect to at least two switches. So you have two switches which then connect to the rest of your network.

If you have 40 servers with just two NICs per server, you can probably do with a pair of 48-port switches. What if you have 400 instead of 40? Then you use a switch with 400+ ports and you are done. But then you realize that you have to put that big switch somewhere far from your servers, and you spend lots of money on fiber cables and also complicate operations.

What if the switch needs to be replaced/upgraded? You need to impact ALL your servers ... And also, what if I need 4,000 servers now instead of 400?

So you say it is best to use smaller switches and place them closer to your servers. Top of Rack is where most people put them. Why is this good?
- can use less fiber between racks
- can use copper within the rack (cheaper)
- simplifies (physical) operations [i.e. when you need to upgrade/replace the switch you impact fewer devices]

So most people would connect servers to a pair of smaller switches placed at the Top of the Rack (ToR). Now you have to interconnect those ToR switches together. So we need a 2-layer network. The switch that we use to interconnect the ToR switches would be called a "distribution" or "aggregation" layer switch. Because you want redundancy, you would put two of those.


Pretty basic so far. In practical terms, there is almost no way to get away from a 2-layer network.

Ok, but more than 2 layers is really an overkill! This is just vendors trying to rip money off of people!

The same principle applies to begin with. A pair of distribution switches will have a limited number of ports. Let's say for simplicity that you use two uplinks per ToR switch, one to each distribution switch. If you have 40 ToR switches you MAY do with, say, a 48-port switch for distribution [notice the MAY in uppercase, for there are considerations other than the number of switchports here].

Say you have 400 ToR devices now ... you need a pair of 400+ port switches in the distribution layer. Say you have 4,000 ToR devices ... Not so easy to think of a single device with 4,000 ports. But the key point now is: say you have 40,000, or more ... How do you make sure that, whatever the number, you know what to do and can grow the network?

This is where a third layer and a hierarchical modular design MAY come into play [notice the uppercase MAY, for there are other options here, discussed below]. By adding another layer, you multiply the scalability. Now you can design PODs of distribution pairs with a known capacity (based on the number of ports and other parameters like the size of the forwarding tables, the scalability of the control plane, etc.) which you can interconnect with one another through a core layer.
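
A simple illustration of how the extra layer multiplies scale; all the figures below are made-up round numbers, only there to show the structure of the calculation:

```python
# Illustrative 3-tier scaling (all figures are made-up round numbers).
servers_per_tor = 40
dist_ports = 400                       # ports per distribution switch
dist_ports_to_core = 16                # reserved on each distribution switch for core uplinks
tors_per_pod = dist_ports - dist_ports_to_core       # one downlink per ToR per distribution switch
servers_per_pod = tors_per_pod * servers_per_tor
print(servers_per_pod)                 # 15360 servers per POD

core_ports = 400                       # ports per core switch
pods = core_ports // dist_ports_to_core              # each POD consumes 16 ports on a core switch
print(pods, pods * servers_per_pod)    # 25 PODs -> 384000 servers, in pure port math
```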



So, do you need a core layer? Depends on the size of your network. But the point is that regardless of what any vendor may claim today, as the network grows above the physical limits of the devices you use to aggregate access switches, you need to add another layer to make the network grow in a scalable way.


OK, but I can also grow with a 2-layer design ...

Yes, and this is becoming more and more popular. In such an architecture the ToR switch is typically defined as a 'leaf' switch, which connects to a 'spine' switch (the spine being similar to the distribution switch). You can have multiple 'spine' switches, so you can scale the network to a higher bandwidth for each ToR. If a ToR requires more than two uplinks, you can add more spine switches to provide more bandwidth.

But the size of the network, measured by the number of leaf/ToR switches, is still limited by the maximum port density of the spine/distribution switch. When you exceed that number you are bound to use another layer.
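
In pure port-count terms, the limit looks like this (illustrative numbers):

```python
# Leaf/spine size limited by spine port density (illustrative numbers).
spine_ports = 384                 # ports per spine switch
spines = 4
uplinks_per_leaf = 4              # one uplink to each spine keeps ECMP even
leaf_ports_per_spine = uplinks_per_leaf // spines      # each leaf consumes 1 port on every spine
max_leaves = spine_ports // leaf_ports_per_spine
leaf_server_ports = 48
print(max_leaves, max_leaves * leaf_server_ports)      # 384 leaves -> 18432 server ports
# Past that many leaves you either need bigger spines or another layer.
```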

More reasons to consider multiple layers ...

So far we have seen the obvious physical reasons calling for a 2- or 3-layer design. Needless to say, as platforms become denser in the number of supported ports with each product generation, what before required 3 layers may later be accomplished in 2. But at scale, you always need to fall back to multiple physical layers.

But there are other reasons why you want to use multiple layers, and those are operational more than anything. If you have all your servers connected to a pair of switches, or for that matter all your ToRs connected to a pair of switches which ALSO connect to the rest of the network, maintenance, replacements, and upgrades on those switches may have a big impact on the overall datacenter.

So in practical terms, it may be interesting to reduce the size of a distribution pair even if that requires adding a third layer, so that you reduce the failure domain. What I mean is: let's say you have 4,000 servers and you can hook them all off a single pair of distribution switches. It may be interesting to think of doing two PODs of 2,000 each, even if that requires using double the number of distribution switches. Why? Because a serious problem in your pair of switches will not bring down all your computing capacity.

Of course people will argue "hey, I already have TWO distribution switches for THAT reason". Fair enough. But if you botch an upgrade, or you are hit by a software bug that is data-plane related (i.e. related to traffic through the switches), there is a chance that you will affect both. This is rare, but it happens.

The 2-layer approach already shields you at that level: when you need to upgrade your access switches, say to deploy a new feature or correct a software bug, you can test on a sandbox switch first, and then deploy the upgrade serially so you make sure there is no impact on the network.

There are other reasons to have a hierarchical network (with 2 or 3 layers depending on size), such as network protocol scaling by allowing aggregation (again, it comes down to minimizing the size of the failure domain, this time in the L3 control plane), applying policy in a hierarchical way (which allows scaling hardware resources) and others, but I am not touching on any of these on purpose. The reason is that some would now dispute that if you do NOT rely on network control protocols you get away from their limitations. This is certainly one of the ideas behind most people working on SDN and OpenFlow, as well as on some vendor proprietary implementations.

What I wanted to do with this post is show that regardless of the network control plane approach (standards-based with distributed control - OSPF, ISIS, TRILL -, standards-based with centralized control - i.e. OpenFlow -, or proprietary implementations - i.e. BrocadeOne or JNPR's QFabric), at scale you ALWAYS need more than one physical network layer, and in large-scale datacenters it is almost impossible to move away from three.



Wednesday, June 29, 2011

On L2 vs. L3 vs. OpenFlow for DC Design



The L2 vs. L3 debate is old and almost religious. One good thing about OpenFlow and more general Software Defined Networking is that perhaps it ends the debate: L2 or L3? Neither (or both?).

The problem in the end is that people perceive L2 as simpler than L3. I say perceive because it really depends on where you want to face complexity. From a simple-to-establish-connectivity perspective, L2 is easy. From a scaling perspective, L3 is simple (at least, simpler).

People working with servers and applications have traditionally had minimal networking knowledge, which has led them to rely too much on L2 connectivity, and today many application and clustering technologies won't work without L2 connectivity between the involved parties. The same can be said about virtualization technologies. The easiest way to make sure you can live-migrate a VM between two hosts is to assume they can both be on the same subnet and L2 broadcast domain.

For VM mobility, as well as for many clustering technologies, it is important to avoid IP re-addressing, and in this sense it does not help that the IP address represents both identity and location (by belonging to a subnet which is located in a particular place). This is why LISP is so cool: it splits the two intrinsic functions of the IP address, identity and location.

When looking at building a datacenter, and in particular a datacenter which will support some form of multi-tenancy and potentially can be used to host virtual datacenters (i.e. private cloud type of services), how do we want to do it? Do we use L2 or L3? Or is the solution to consider OpenFlow? Tough one.

For the past two or three years there has been some level of consensus that you must have large L2 domains, and that with newer protocols such as TRILL we will be able to build very large L2 networks; hence, that was the way forward. The reality is, most MSDPs today, to the best of my knowledge, are based on L3: because it works and scales.

The reality is also that for delivering IaaS you will need some way of creating overlay L2 domains on a per-virtualDC basis. Sort of like delivering one or more L2 VPNs per virtualDC. Why? Because the virtualDCs will have to host traditional workloads (virtualized) and legacy applications which are not cloud-aware or cloud-ready. From a networking point of view this means each virtualDC will have to have its own VLANs, subnets and policies.

VLANs are commonly used to provide application isolation or organizational isolation. So in a DC, this means you use dedicated VLANs for, say, all your Exchange servers, all your SAP servers, etc. Or you may use different VLANs for different areas of the company, which then may share various applications; or you combine both, giving a range of VLANs to each organization/tenant and then further dividing by application. This needs to be replicated in each virtualDC.

At the infrastructure level, relying on current virtualization offerings, you may have dedicated VLANs for your virtual servers: VLANs for the management of the virtual servers, for allowing VM mobility, or for running NFS or iSCSI traffic (also for running FCoE).

Do you want to use the same VLAN space for the infrastructure and for the virtualDCs? Probably not. Then the question is whether it is best to rely on an L2 infrastructure over which you deliver L2 VPNs for each virtualDC, or to build an L3 infrastructure over which you deliver those L2 VPNs.

The latter one does not have, today, a standards based approach. The former has at least the known option of QinQ (with all its caveats). Some would argue that combining this with SPB or TRILL you have a solution. Maybe.

But I think the real way forward for scalability is to build an L3 network, which can by the way accommodate any topology and provides excellent multicast, and then build L2 overlays from the virtual switching layer.

And then a question is whether this is all easier to do with OpenFlow. I think not, because in the end the control plane isn't really the problem. In other words: networks aren't more or less flexible because of the way the control plane is implemented (distributed vs centralized) but because of poor design and trying to stretch too much out of L2 (IMHO). 

I do not doubt you could fully manage a network from an OF controller (although I have many questions about the scalability, reliability and convergence times), but I don't really see the benefit of doing that. The only benefit I see is avoiding the L2 vs. L3 question, because at the controller you could bypass the "normal" forwarding logic completely and make an L2 or L3 decision on the fly regardless of topology. But the question is how to scale that at the data plane, and also, in order to do that, you must offload the first-packet lookup to the controller, and THAT won't fly.

So there it is … I think that with modern IGP implementations we can build networks with thousands of ports in a very reliable and easy to manage way, and by building a Mac-in-IP overlay from the virtualization layer, provide the L2 services required. That would be my bet on how to do things :-)

Friday, June 17, 2011

Why FC has a brighter future ahead thanks to FCoE

I must admit that I had been a sceptic about FCoE since Cisco introduced it in 2008, up until recent times. With limited storage knowledge and background, I thought that going forward NAS would prevail in most environments, and iSCSI would be the winning option for block-based storage needs. Now I think there's going to be space for all of these, but FC has a better chance when dealing with block storage.

Why am I changing my mind? Well, a bit more knowledge, but also recognizing the facts. The overall FC market is still growing. Some analysts estimated 9-10% Y/Y growth at the end of 2010 for the total FC market in vendor-reported revenue. And it is remarkable that FCoE is becoming a larger part of that market. Depending on the report, it would now be up to 10% of the total FC market (considering adapters, switches and controller ports). And the really important thing is that this contribution to the total market has more than doubled vs one year ago.

Those are some facts. About knowledge: I now recognize that FC networks bring not only the stability and performance required for Tier 1 applications and mission-critical environments, but also offer management over the SAN that is unmatched by other SAN protocols. The problem for many customers was that deploying FC was prohibitively expensive: the need to deploy a completely separate network (which usually had to be design-stamped by the storage vendor) and the lack of FC interfaces on low-end (and even mid-range) arrays made it impossible for small and mid-sized organizations to even consider it.

This is where I see a new angle now. FCoE is going to make FC as ubiquitous as Ethernet, and almost as affordable.

Customers jumping on iSCSI would normally use the same switches for LAN and iSCSI (perhaps dedicating ports to the latter, and certainly dedicating separate VLANs, subnets, etc.). The server side? Just another pair of GE ports. Usually not adding much to the cost if you were already considering quad-NIC cards on top of whatever comes as LOM.

FC would have added a lot. A pair of 4Gbps ports would mean an HBA at about $2K, and then you had to add separate, dedicated fabric switches at the access layer.

Well, with FCoE, you put FC at the same level as deploying iSCSI for many people. First of all, FCoE now comes standard in NICs (CNAs) from many vendors, including Emulex, QLogic, Intel, Cisco and Broadcom. Intel's OpenFCoE approach adds no cost on top of deploying plain 10GE, and brings the price per port to the $400 range. Such CNAs are already certified by the most relevant storage vendors, including EMC and NetApp. Vendors which, btw, are also adding FCoE support to their mid-range and high-end arrays.

Now the network piece. 10GE ToR ports are about $500-600 a port. Cisco's solutions there, thanks to the FEX approach, allow you to deploy a very cost-effective 10GE access network where all ports can support FCoE. So, in essence, you are in the same position as with iSCSI now: you can use affordable ports at the server, have support for the most common OSes, can use the same network infrastructure, and can pick from multiple storage vendors. A nice thing on top? You can get management tools like Cisco's DCNM to give you visibility of LAN and SAN traffic (which you do not get with iSCSI).
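
Putting the rough per-server attach numbers from the last few paragraphs side by side (all prices are the ballpark figures quoted above, nothing official):

```python
# Ballpark per-server attach cost, using the figures quoted in the text above.
fc_hba_pair = 2000              # ~$2K for a pair of 4Gbps FC HBA ports, plus a dedicated FC fabric
cna_10ge_port = 400             # ~$400 per 10GE CNA port (OpenFCoE style)
tor_10ge_port = 550             # ~$500-600 per 10GE ToR port, carrying LAN *and* SAN traffic

ports_per_server = 2            # dual-homed server
fcoe_attach = ports_per_server * (cna_10ge_port + tor_10ge_port)
print(fcoe_attach)              # ~1900, but this buys the whole 10GE LAN attach, not just the SAN
print(fc_hba_pair)              # ~2000 on top of whatever LAN connectivity you buy anyway
```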

Bottom line, I think FCoE is making FC easier and cheaper to deploy, so I expect more and more customers to consider using this technology, on top of those already using FC adopting FCoE as well (using less power, lower capex, simplified operations, etc.).

Of course there is space for NAS deployments as well, and iSCSI will continue to grow, but I think FCoE makes FC much more competitive as a technology than it ever was for deploying a SAN.

Ok, me too, I am now blogging ...

Again. Yes, I had some blogs in the past, mainly to keep family and friends updated with certain important events in my life. But then ... well. Then for numerous reasons I had no time and was not in the mood to write at all.

Now that's changing. I have more and more contact with remote people. This is not just friends and family (having my parents and brothers living in three different countries helps), but also colleagues and other people. Facebook is cool for keeping in touch with many of them. Twitter too. But none of them allow you to really express what you think or feel about something you want to share. So yes. Me too, I am blogging (again). If nobody ever reads this? Who cares :-)

Language? ... hmmm, now that is an issue. I have friends and family who can only read Spanish. Also many who read English but can't read Spanish, and some who would perhaps welcome French. I choose English for writing most of the time. It is what I feel most comfortable with nowadays. That said, I suppose that if I decide to write about whatever is happening in Spain, or about the endless disputes between Real Madrid and Barcelona, I will probably do it in Spanish. We will see. In any case, I don't care if anybody really reads it, right?