Wednesday, July 20, 2011

Building a network takes more than switches and ports (or why Arista's math is wrong ...)

I couldn't help reading a recent interview with Arista's CEO on NetworkWorld. Very good interview, and nice insight on certain topics by the always bright Jayshree Ullal. However, there were claims in that interview which are ... misleading, to be polite.

The interview can be found here: http://www.networkworld.com/community/blog/QA-with-jayshree-ullal-ceo-arista-networks

In there, they talk of a hypothetical network to support 5,000 servers. The number comes from a hypothetical 3MW datacenter, assuming all servers are of the same kind and draw 600W each. In a perfect world, that's 5,000 servers. They go on to say this fits in 125 cabinets, which means 40 servers per cabinet and therefore 24 kW of power per cabinet ... Pretty aggressive in conventional datacenters, but certainly possible.
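
A quick back-of-the-envelope check of those figures (the arithmetic is mine; the inputs are the ones stated in the interview):

```python
# Inputs taken from the interview; variable names are mine.
total_power_w = 3_000_000          # 3 MW of critical load
server_power_w = 600               # per-server draw
cabinets = 125

servers = total_power_w // server_power_w                           # 5,000 servers
servers_per_cabinet = servers // cabinets                           # 40 per cabinet
power_per_cabinet_kw = servers_per_cabinet * server_power_w / 1000  # 24 kW per cabinet

print(servers, servers_per_cabinet, power_per_cabinet_kw)
# 5000 40 24.0
```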

So considering standard 42RU cabinets, the math goes that you need two 1RU ToR switches per cabinet, and the desired oversubscription at the first hop is 3:1. No doubt the hypothetical scenario was chosen as a best fit for Arista's boxes, but even so, I think the math fails.

Before I begin, a couple of considerations about this approach:

  • at least in the EMEA region, I have seen few designs request such low oversubscription ratios. 4:1 is very common in scenarios where local switching at the ToR isn't even required, with 8:1 being common where local switching is available. But OK, large SPs deploying dense cloud infrastructure would push for lower oversubscription.
  • there are other ways to handle cabling than putting a pair of ToR switches in every cabinet. With CX-1 cabling to the server, there are cables which go up to 10 meters, which allows for several cabinets of distance, so it isn't uncommon to see a 2RU device used for every other rack, which reduces the number of managed devices and also power consumption (see the quick device count after this list).
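
As a rough sketch of what that alternative does to the number of managed access devices (my own count, assuming the same 125 cabinets and comparing only device placement, not port capacity):

```python
cabinets = 125

# Option A (the interview's layout): a pair of 1RU ToR switches in every cabinet.
paired_tor_count = cabinets * 2                 # 250 managed access devices

# Option B: one larger 2RU device serving every other rack over ~10 m CX-1 cables.
shared_device_count = -(-cabinets // 2)         # ceiling division -> 63 devices

print(paired_tor_count, shared_device_count)    # 250 63
```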

Why the math is misleading 

Arista considers using a 7050-64 (64x10GE, 125 Watts), which is good enough for 40 ports to the servers and 16 10GE uplinks, leaving some spare ports unused. The ToR devices will be aggregated on Arista's 7500. With 16 uplinks per ToR, 2 ToRs per cabinet and 125 cabinets, you need 4,000 10GE aggregation ports in the upper layer. The statement goes that you need 12 Arista 7500 series switches at the aggregation layer and 250 Arista 7050 ToR switches. A figure of 75.6 kW worth of power is quoted: the main variable here.
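
The port count is easy to verify from the numbers above (my arithmetic; note that 40 server ports over 16 uplinks actually works out to 2.5:1, within the 3:1 target):

```python
# Inputs as stated in the interview and above; variable names are mine.
cabinets = 125
tors_per_cabinet = 2
uplinks_per_tor = 16
server_ports_per_tor = 40

tor_switches = cabinets * tors_per_cabinet              # 250 ToR switches
aggregation_ports = tor_switches * uplinks_per_tor      # 4,000 10GE ports upstream
tor_oversubscription = server_ports_per_tor / uplinks_per_tor   # 2.5 (i.e. 2.5:1)

print(tor_switches, aggregation_ports, tor_oversubscription)
# 250 4000 2.5
```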

(I would again highlight that few organizations deploy a 4,000-port 10GE aggregation layer to provide connectivity to 5,000 servers today ...)

How are the ToR devices connected to the 12 upper-layer devices? This isn't mentioned. The Arista 7500 claims to support 384 10GE ports. On a two-spine design, you can therefore support 48 ToR switches. On a four-spine design, 96 ToR switches. On an eight-spine design, 192 ToR switches. A 12-spine design doesn't work out, because 16 uplinks can't be spread evenly across 12 spines to keep an even topology ...
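
Here is the fan-out arithmetic behind those figures, assuming each ToR spreads its 16 uplinks evenly across the spines (a sketch of my own, using only the 384-port figure above):

```python
spine_ports = 384      # 10GE ports claimed per Arista 7500 chassis
uplinks_per_tor = 16

for spines in (2, 4, 8, 12):
    if uplinks_per_tor % spines:
        print(f"{spines} spines: 16 uplinks do not divide evenly")
        continue
    uplinks_per_spine = uplinks_per_tor // spines     # uplinks each ToR lands on one spine
    max_tors = spine_ports // uplinks_per_spine       # ToRs that fit in one spine chassis
    print(f"{spines} spines: {max_tors} ToR switches")

# 2 spines: 48 ToR switches
# 4 spines: 96 ToR switches
# 8 spines: 192 ToR switches
# 12 spines: 16 uplinks do not divide evenly
```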

Another thing isn't mentioned: how do they keep redundancy and multi-pathing? M-LAG currently works across two spines ... No TRILL or equivalent yet. Where do they do L3?

Finally: 5,000 servers means a minimum of 5,000 MAC addresses and IP addresses. In fact more, because each server will also have at least a management interface. If you run VMs on each server, then each VM adds a minimum of one MAC and one IP. As per the current datasheet, the Arista 7500 supports only 16K MAC addresses and 8K ARP entries. At the first-hop router you must be able to keep one ARP entry per IP in the subnets. If each server has one IP for management and one for traffic (again, it will be more), that's already 10K ARP entries ...
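
To make the first-hop argument concrete, here is the ARP count with just two IPs per server, against the table size quoted above (my arithmetic):

```python
servers = 5000
ips_per_server = 2            # one management IP plus one traffic IP -- a low estimate
arp_table_limit = 8000        # Arista 7500 datasheet figure cited above

arp_entries_needed = servers * ips_per_server               # 10,000 ARP entries
print(arp_entries_needed, arp_entries_needed > arp_table_limit)
# 10000 True  -> already over the 8K limit
```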

So, how does this "green" network that Arista depicts work at all?! It can't keep all the information in the forwarding tables, and it can't manage redundancy across so many paths ...

If we do the math in reverse, with 10 VMs per server and two vNICs per VM, you need at least 150K ARP entries and MAC address entries.

With Arista's current numbers, the best you can do is around 800 servers.
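
This is how I arrive at the ~800-server figure: the 16K MAC table divided by the 20 MACs per server that the 10-VM, two-vNIC assumption implies. Running the same assumption forward over 5,000 servers already gives 100K+ entries before any physical or management interfaces are counted:

```python
vms_per_server = 10
vnics_per_vm = 2
mac_table_limit = 16000        # Arista 7500 datasheet figure cited above

macs_per_server = vms_per_server * vnics_per_vm         # 20 MACs per server from VMs alone
servers_supported = mac_table_limit // macs_per_server  # 800 servers

# Forward direction: 5,000 such servers need 100,000+ MAC/ARP entries
# before counting physical NICs and management interfaces.
entries_for_5000 = 5000 * macs_per_server               # 100000

print(servers_supported, entries_for_5000)              # 800 100000
```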

Why the Cisco math Arista reports is also misleading ...

For the Cisco example, they pick a design with the Nexus 5548. This overlooks the fact that we also have a 1RU box with 64 10GE ports: the Nexus 3064. Had they chosen that box, we would use the same number of devices at the ToR and produce a similar power consumption figure at the edge (which is the bulk of the power number).

But we could also say that you can use the Nexus 5596, which gives us greater density. Use 60 server-facing ports and 20 uplinks, and rack them every other rack. This means we only need 84 devices at the access layer, which further reduces power consumption and management complexity.
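
The access-layer count follows directly from those numbers (my arithmetic; 84 is the ceiling of 5,000 servers over 60 server-facing ports, and 60:20 keeps the 3:1 oversubscription target):

```python
servers = 5000
server_ports_per_switch = 60       # server-facing ports used on each Nexus 5596
uplink_ports_per_switch = 20

access_switches = -(-servers // server_ports_per_switch)              # ceiling -> 84
oversubscription = server_ports_per_switch / uplink_ports_per_switch  # 3.0 (3:1)

print(access_switches, oversubscription)   # 84 3.0
```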

For redundancy between the spine and access layers we could propose FabricPath, but this isn't yet available on the Nexus 5500 (although it will be in the months to come). This also scales better in terms of MAC address table explosion because it uses conditional MAC learning. And the Nexus 7000 also scales far better in the number of L3 entries, capable of storing up to 1M L3 entries.

So if we do the math by building a network that actually works ... Arista's would stop at around 800 servers and Cisco's could indeed scale to 5,000 servers.

5 comments:

  1. Nillo – Nice blog post; thank you for your analysis of Jayshree's conversation with Network World. Please allow me to point out a few things about your post.

    (in the interest of full disclosure, I am an employee of Arista Networks)

    In regard to your Arista analysis:
    • You state "The Arista 7500 claims to support 384 10GE ports". Just to set the record straight, the 7508 does in fact support 384 wirespeed L2/L3 10GbE ports: no games, no gimmicks, and no 'local' switching numbers
    • The 7050's typical power draw is 125W, not 260W as you indicate in your post
    • The proposed design is one that uses L3 ECMP down to the ToR; this is a popular design in many of the large data centers as it provides scale and multi-pathing capabilities using the proven technology provided by IPv4
    • Based on an L3 ECMP design, the rest of your points about why the Arista design will not work are not applicable.

    Your Cisco defense:
    • The Cisco Nexus 3064 was not used because, according to Cisco, it is not intended for general-purpose data center designs; note Jayshree's statement: "using their best practice design guide which recommends the Cisco 5500 with FCoE and the Nexus 7000"
    • Most hosters at this scale like to create a repeatable unit of measure; typically this is a rack. Your example of using a 5596 to aggregate multiple racks of equipment would not work operationally
    • According to your design, FabricPath could be used once it is supported on the 5500. This would not be possible if you used the Nexus 3064 as you discussed earlier in your post; as you know, it cannot support FabricPath, or FCoE for that matter, two critical components of Cisco's proprietary vendor lock-in strategy

    In summary, you can build the network Jayshree describes with either Arista or Cisco products, but doing it with Arista minimizes your network's power, space and cooling footprint. This is not a theoretical exercise; we have people doing it today. At Arista we are trying to solve real-world customer challenges in an open, standards-based manner. We welcome you and others to come join us, best regards…

    -mark

  2. I find it odd that Nillo had to significantly modify the design, away from the specification used by many of the largest cloud providers, in order to find a corner case where his products would possibly work, and even then Nillo still failed to provide ANY quantitative data on power draw and operational cost.

    Nillo, can you spec out a network to support a 3MW critical load of servers using your model? And of course, to be safe, be sure to use the max power draw numbers of your products so you don't cause a brownout for your customers.

    dg (note: also an Arista employee)

  3. Mark, Doug, thank you for your comments. I wanted to reply properly, which was taking more space than normally fits in here, so I made my replies into another post instead ...

  4. This comment has been removed by the author.

  5. This comment has been removed by the author.
