
Sunday, July 24, 2011

Yes, the math is misleading ...

I have had more people reading this than I anticipated, and a couple of good comments which I want to reply to properly, so I decided to make it another post ...


Mark, thank you for your comments. They really add to my post as they provide more insight and context, which was lacking in the interview: the hypothetical 3MW DC is for a cloud provider built on L3 from the access.

And in that sense, I stand by my comment: the math presented in that interview is indeed misleading. I will explain below why I say this ... But first, to your comments :-)

1. When I mentioned "claim to support 384 10GE ports" I did not imply that I doubt it to be the reality. I am writing here based on my knowledge and experience only :-) I take your word for it, of course.
2. Good catch on the Arista 7050-64 power consumption. Believe me it was not intentional. I have corrected this :-)
3. Agreed that cloud providers like standardization and could prefer to operationalize cabling with a pair of ToRs per rack. Again, the interview didn't mention which kind of customer we are talking about ... Enterprise customers could think differently and would very much welcome racking ToRs every other rack because it simplifies network operations by quite a bit (managing 84 devices vs. 250).
4. There was no "Cisco defense" because I do not take this as any attack :-) ... and I write here for the fun of it. I simply stated the fact that you picked platforms which don't allow a fair comparison. You say that Cisco's best practice design is with the 5548 and FCoE ... Where did you read a paper from Cisco that recommends such a thing for an MSDP? ... However, an Enterprise datacenter with a high degree of virtualization will in many cases require support for storage targets on FC, and FCoE is a great solution to optimize the cost of deploying such infrastructure. L2 across the racks is also a common requirement in this case, not just for vMotion, but for supporting many cluster applications. L3 access in most Enterprise DCs simply (and sadly) does not apply ...

As I said, your comment brings insight because the interview gives no mention of what the hypothetical datacenter was for, or what kind of logical design you would use. No context at all. You say it is to be built using L3 at the access layer. I concur with you that the limitations I mentioned do not necessarily apply, for the access switches won't need to learn that many ARP entries, and the distribution switches can work with route summarization from the access, so smaller tables could do the job.

I am well aware that most MSDPs use L3 from the ToR layer. I am sure that you are also well aware that most Enterprise virtualized and non-virtualized datacenters do not use L3 from the ToR. A 5,000-server DC could be found in either market space. I am sure you are also well aware that Cisco's recommended design with the Nexus 5500 and Nexus 7000 (also leveraging FCoE) is intended primarily for Enterprise datacenters, where it provides a lot of flexibility.

I still can't see what the topology looks like in your design anyway, with 12 spine switches. I just cannot see how to spread 16 uplinks per ToR/leaf across 12 spine switches and keep even ECMP ... I guess the design is made up of pairs: six pairs of 7500s, each aggregating about 40 ToRs, with 8 links from each ToR to each of the two 7500s. But if that is the case, each 7500 pair has no connection to the other pairs unless a core layer is deployed, which perhaps is assumed in the example (but not factored into the power calculations??).

So again, I think your comment confirms my point, while also acknowledging that for non-L3 designs, the limitations I mentioned do apply.

What do I mean when I say that you confirm my point? The math in the interview is presented without context, so it creates the perception that the comparison applies to any and every environment. In fact, it only applies in a specific environment (an L3 design), and completely reverses in others (an L2 design ... probably the more common one, btw). So you picked an area to your advantage, exaggerated it (by picking Cisco's most sub-optimal platform for such a design), and presented it as general. That is misleading.

To answer Doug ... Let me first say that I respect you a lot, and I have very, very much respect for Arista. It is a company with a lot of talent. But talent lives in many places, including Cisco too. My blog post was to state that the math is misleading. Nothing more. I still believe it is, as stated above. You say "Nillo had to significantly modify the design, away from the specification" ... what? :-) ... what specification? None was provided in the interview ... how could I change it?!

To your challenge, two things to say:

1. Can YOU spec a 3MW datacenter where you can do live vMotion between any two racks and allow any server to mount any storage target from any point in the network? :-D
2. I write here for the fun of it and in my free time ... oh ... and I choose what I write about ;-) (not really the "when", because my free time isn't always mine ;-) ).

... I thought nobody would care a bit about what I write ... wow ...

Wednesday, July 20, 2011

Building a network takes more than switches and ports (or why Arista's math is wrong ...)

I couldn't help reading a recent interview with Arista's CEO on NetworkWorld. Very good interview, and nice insight on certain topics from the always bright Jayshree Ullal. However, there were claims in that interview which are ... misleading, to be polite.

The interview can be found here: http://www.networkworld.com/community/blog/QA-with-jayshree-ullal-ceo-arista-networks

In it, they talk of a hypothetical network to support 5,000 servers. The number comes from a hypothetical 3MW datacenter, assuming all servers are of the same kind with consumption at 600W. In a perfect world, that's 5,000 servers. They go on to say this fits in 125 cabinets, putting 40 servers per cabinet and therefore 24KW of power per cabinet ... Pretty aggressive in conventional datacenters, but certainly possible.
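A quick sanity check of that capacity math (using the figures quoted above: a 3MW budget, 600W per server, 125 cabinets):

```python
# Back-of-the-envelope check of the interview's numbers.
total_power_w = 3_000_000
server_power_w = 600
cabinets = 125

servers = total_power_w // server_power_w                        # 5,000 servers
servers_per_cabinet = servers // cabinets                        # 40 per cabinet
cabinet_power_kw = servers_per_cabinet * server_power_w / 1000   # 24.0 kW

print(servers, servers_per_cabinet, cabinet_power_kw)
```

The arithmetic itself holds; it is the 24KW per cabinet that is the aggressive assumption.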

So considering standard 42RU cabinets, the math goes that you need two 1RU ToR switches per cabinet, with a desired oversubscription at the first hop of 3:1. No doubt the hypothetical scenario was chosen as the best fit for Arista's boxes, but even so, I think the math fails.

Before I begin, a couple of considerations with this approach:

  • at least in the EMEA region, I have seen few designs request such low oversubscription ratios. 4:1 is very common in scenarios where local switching at the ToR isn't even required, with 8:1 being common where local switching is available. But OK, large SPs deploying dense cloud infrastructure would push for lower oversubscription.
  • there are other ways to handle cabling than putting a pair of ToR switches in every cabinet. With CX-1 cabling to the server, cables go up to 10 meters, which allows for several cabinets of distance, so it isn't uncommon to see a 2RU device used every other rack, which reduces the number of managed devices and also power consumption.

Why the math is misleading 

Arista considers using a 7050-64 (64x10GE, 125 watts), which is good enough for 40 ports to the servers and 16 10GE uplinks, leaving some spare ports unused. The ToR devices will be aggregated on Arista's 7500. With 16 uplinks per ToR, 2 ToRs per cabinet and 125 cabinets, you need 4,000 10GE aggregation ports in the upper layer. The statement goes that you need 12 Arista 7500 series at the aggregation layer and 250 Arista 7050 ToR switches. A figure of 75.6kW worth of power is quoted: the main variable here.
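Spelling out that aggregation-port count (the uplink, ToR and cabinet figures are the ones stated in the interview):

```python
# Aggregation-port math implied by the interview's figures.
uplinks_per_tor = 16
tors_per_cabinet = 2
cabinets = 125

tor_count = tors_per_cabinet * cabinets    # 250 ToR switches
agg_ports = tor_count * uplinks_per_tor    # 4,000 10GE aggregation ports

print(tor_count, agg_ports)
```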

(I would again highlight that few organizations today deploy a 4,000-port 10GE aggregation layer to provide connectivity to 5,000 servers ...)

How are the ToR devices connected to the 12 upper-layer devices? This isn't mentioned. The Arista 7500 claims to support 384 10GE ports. On a two-spine design, you can therefore support 48 ToR switches. On a four-spine design, 96 ToR switches. On an eight-spine design, 192 switches. A 12-spine design doesn't divide 16 uplinks evenly, so you cannot keep an even topology ...
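This is my reading of how the ToR counts fall out, assuming 384 ports per spine and each ToR spreading its 16 uplinks evenly across all spines (the interview states neither):

```python
# ToR capacity per spine count, under the assumptions stated above.
SPINE_PORTS = 384
UPLINKS_PER_TOR = 16

def max_tors(spines):
    """ToR count an even ECMP fabric supports, or None if uneven."""
    if UPLINKS_PER_TOR % spines != 0:
        return None  # uplinks can't be spread evenly: no even topology
    return SPINE_PORTS // (UPLINKS_PER_TOR // spines)

for n in (2, 4, 8, 12):
    print(n, max_tors(n))  # 12 spines yields None: 16 isn't divisible by 12
```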

Another thing isn't mentioned: how do they keep redundancy and multi-pathing? M-LAG currently works across two spines ... no TRILL or equivalent yet. Where do they do L3?

Finally: 5,000 servers means a minimum of 5,000 MAC addresses and IP addresses. In fact more, because you will also have at least a management interface. If you run VMs on each server, then each VM means a minimum of one MAC and one IP. As per the current datasheet, the Arista 7500 supports only 16K MAC addresses and 8K ARP entries. At the first-hop router you must be able to keep one ARP entry per IP in the subnets. If each server has one IP for management and one for traffic (again, it will be more), that's already 10K ARP entries ...
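Comparing that demand against the datasheet limit cited above, under the conservative assumption of just two IPs per server:

```python
# First-hop ARP demand vs the 8K limit quoted above.
SERVERS = 5_000
IPS_PER_SERVER = 2      # one management + one traffic IP, at minimum
ARP_LIMIT = 8_000       # per the 7500 datasheet figure cited above

arp_needed = SERVERS * IPS_PER_SERVER   # 10,000 entries
print(arp_needed, arp_needed > ARP_LIMIT)
```

Even before counting a single VM, the table is already 25% over its limit.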

So, how does this "green" network that Arista depicts work at all?! It can't keep all the information in the forwarding table, and it can't manage redundancy for so many paths ...

If we do the math in reverse, with 10 VMs per server and two vNICs per VM, you need at least 150K ARP and MAC address entries.

With current Arista numbers, the best you can do is around 800 servers.

Why the Arista reported Cisco math is also misleading ...

For the Cisco example, they pick a design with the Nexus 5548. This ignores that we also have a 1RU box with 64 10GE ports: the Nexus 3064. Had they chosen that box, we would use the same number of devices at the ToR and produce a similar power consumption figure at the edge (which is the bulk of the power number).

But we could also say that you can use the Nexus 5596, which gives us greater density. Use 60 server-facing ports and 20 uplinks, and rack them every other rack. This means we only need 84 devices at the access layer, which further reduces power consumption and management complexity.
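One way to arrive at that 84-device figure (my reconstruction, not spelled out above): divide the 5,000 access ports across switches with 60 server-facing ports each. A 60:20 split also preserves the 3:1 oversubscription target.

```python
import math

# Access-switch count with 60 server-facing ports per device
# (assumes one access port per server, as in the figure above).
SERVERS = 5_000
SERVER_PORTS_PER_SWITCH = 60

access_switches = math.ceil(SERVERS / SERVER_PORTS_PER_SWITCH)
print(access_switches)  # 84
```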

For redundancy between the spine and access layers we could propose FabricPath, but this isn't yet available on the Nexus 5500 (although it will be in the months to come). It also scales better in terms of MAC address table explosion because it uses conditional MAC learning. And the Nexus 7000 also scales far better in the number of L3 entries, capable of storing up to 1M of them.

So if we do the math by building a network that actually works ... Arista's would stop at around 800 servers, and Cisco's could indeed scale to 5,000 servers.

Monday, July 11, 2011

One layer, two layers … how many do you really need?

I do not know how many times over the last few months I have had to deal with the "we want a network with less layers, one layer is the ideal" comment. In fact, it is a recurring question. The number of customers that over the past 10 years have asked me why we need a core, or whether they can collapse aggregation and access, is significant.

In the end, it comes down to the perception that by collapsing layers you may save money (on smaller deployments) or simplify the network (on larger ones). I think this perception is often wrong, and the truth is that these days several networking vendors are trying to exploit it in a misleading way.

Why and when do you need more than one layer?

I will look at this from a data center perspective, so I will consider that end-points connecting to the network are primarily servers. But the analysis is similar for campus, with a different set of endpoints (desktops, laptops, wireless access points, printers, badge readers, etc.) which have very different bandwidth & security requirements.

If the number of servers you have fits in a single switch, then you can do with a one layer approach [I know, some people may read this and be thinking "some solutions out there convert the network in a single switch" … I'll get to those in other blog posts too].

So from a physical point of view this is obvious: if you can connect all your servers to one single switch, you need one layer. Now, you want redundancy, so you will want each server to connect to at least two switches. So you have two switches, which then connect to the rest of your network.

If you have 40 servers with just two NICs per server, you can probably do with a pair of 48-port switches. What if you have 400 instead of 40? Then you use a switch with 400+ ports and you are done. But then you realize that you have to put that big switch somewhat far from your servers, and you spend lots of money on fiber cables and also complicate operations.

What if the switch needs to be replaced or upgraded? You impact ALL your servers ... And what if I need 4,000 servers now instead of 400?

So you say it is best to use smaller switches and place them closer to your servers. Top of Rack is where most people put them. Why is this good?
- you can use less fiber between racks
- you can use copper within the rack (cheaper)
- it simplifies (physical) operations [i.e. when you need to upgrade/replace the switch you impact fewer devices]

So most people would connect servers to a pair of smaller switches placed at the Top of the Rack (ToR). Now you have to interconnect those ToR switches. So we need a 2-layer network. The switch used to interconnect the ToR switches is called a "distribution" or "aggregation" layer switch. Because you want redundancy, you would put two of those.


Pretty basic so far. In practical terms, there is almost no way to get away from a 2-layer network.

Ok, but more than 2 layers is really overkill! This is just vendors trying to rip people off!

The same principle applies to begin with. A pair of distribution switches has a limited number of ports. Let's say for simplicity that you use two uplinks per ToR switch, one to each distribution switch. If you have 40 ToR switches, you MAY do with, say, a 48-port switch for distribution [notice the MAY in uppercase, for there are considerations other than the number of switch ports here].

Say you have 400 ToR devices now ... you need a pair of 400+-port switches at the distribution layer. Say you have 4,000 ToR devices ... not so easy to think of a single device with 4,000 ports. But the key point now is: say you have 40,000, or more ... How do you make sure that, whatever the number, you know what to do and can grow the network?

This is where a third layer and a hierarchical modular design MAY come into play [notice the uppercase MAY, for there are other options here, discussed below]. By adding another layer, you multiply the scalability. Now you can design PODs of distribution pairs with a known capacity (based on the number of ports and other parameters like the size of the forwarding tables, scalability of the control plane, etc.) which you can interconnect with one another through a core layer.
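A rough sketch of how the multiplication works, with all numbers purely illustrative (no specific product implied):

```python
# Illustrative POD math: a core layer multiplies the reach of the design.
TORS_PER_POD = 40      # ToRs one distribution pair aggregates (assumed)
CORE_PORTS = 256       # ports per core switch (assumed)
UPLINKS_PER_POD = 8    # distribution uplinks into each core switch (assumed)

pods = CORE_PORTS // UPLINKS_PER_POD   # 32 PODs off one core layer
total_tors = pods * TORS_PER_POD       # 1,280 ToRs vs 40 without a core

print(pods, total_tors)
```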



So, do you need a core layer? Depends on the size of your network. But the point is that regardless of what any vendor may claim today, as the network grows above the physical limits of the devices you use to aggregate access switches, you need to add another layer to make the network grow in a scalable way.


OK, but I can also grow with a 2-layer design ...

Yes, and this is becoming more and more popular. In such an architecture the ToR switch is typically called a 'leaf' switch, which connects to a 'spine' switch (the spine would be similar to the distribution switch). You can have multiple 'spine' switches, so you can scale the network to higher bandwidth per ToR. If a ToR requires more than two uplinks, you add more spine switches to provide more bandwidth.

But the size of the network, measured by the number of leaf/ToR switches, is still limited by the maximum port density of the spine/distribution switch. When you exceed that number you are bound to add another layer.
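The ceiling can be expressed in one line (numbers illustrative, not tied to any product):

```python
# Two-layer ceiling: leaf count is capped by total spine ports
# divided by the uplinks each leaf spreads across the spines.
def max_leaves(spine_ports, spines, uplinks_per_leaf):
    return spine_ports * spines // uplinks_per_leaf

print(max_leaves(384, 2, 4))   # 192 leaves
print(max_leaves(384, 4, 4))   # 384 leaves: more spines, bigger fabric
```

More or denser spines raise the cap, but past it you need another layer.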

More reasons to consider multiple layers ...

So far we have seen the obvious physical reasons calling for a 2- or 3-layer design. Needless to say, as platforms become denser in the number of supported ports with each product generation, what before required 3 layers may later be accomplished in 2. But at scale, you always need to fall back on multiple physical layers.

But there are other reasons why you want to use multiple layers, and those are operational more than anything. If you have all your servers connected to a pair of switches, or for that matter all your ToRs connected to a pair of switches which ALSO connect to the rest of the network, maintenance, replacements, and upgrades on those switches may have a big impact on the overall datacenter.

So in practical terms, it may be interesting to reduce the size of a distribution pair even if that requires adding a third layer, so that you reduce the failure domain. What I mean is: let's say you have 4,000 servers and you can hang them all off a single pair of distribution switches. It may be worth doing two PODs of 2,000 each, even if that requires double the number of distribution switches. Why? Because a serious problem in one pair of switches will not bring down all your computing capacity.

Of course people will argue "hey, I already have TWO distribution switches for THAT reason". Fair enough. But if you goof an update, or you are hit by a software bug that is data-plane related (i.e. related to traffic through the switches), there is a chance that you will affect both. This is rare, but it happens.

The 2-layer approach already shields you at that level when you need to upgrade your access switches, say to deploy a new feature or correct a software bug: you can test on a sandbox switch first, and then roll out the upgrade serially to make sure there is no impact on the network.

There are other reasons to have a hierarchical network (with 2 or 3 layers depending on size), such as network protocol scaling by allowing aggregation (again, it comes down to minimizing the size of the failure domain, this time on the L3 control plane), applying policy in a hierarchical way (which allows scaling hardware resources) and others, but I am not touching on any of these on purpose. The reason is that some would now dispute that if you do NOT rely on network control protocols, you get away from their limitations. This is certainly one of the ideas behind most of the work on SDN and OpenFlow, as well as some vendor proprietary implementations.

What I wanted to do with this post is show that regardless of the network control plane approach (standards-based with distributed control - OSPF, ISIS, TRILL -, standards-based with centralized control - i.e. OpenFlow -, or proprietary implementations - i.e. BrocadeOne or JNPR's QFabric), at scale you ALWAYS need more than one physical network layer, and in large-scale datacenters it is almost impossible to move away from three.