by Glen Kemp  |  May 26, 2015  |  Filed in: Industry Trends

[Editor’s Note: While this post is primarily concerned with Fortinet firewalls, particularly the FortiGate 3700D, it also serves as useful background for anyone considering different deployment scenarios for data center firewalls.]

Nestling somewhere near the top of the Fortinet product portfolio is the FortiGate 3700D. Whilst it is generally pitched as a data center firewall, it is a good deal more flexible than your average big dumb packet pusher. It may be deployed in ultra-low latency, very high capacity, and dedicated Unified Threat Management (UTM) configurations, depending on your requirements. Furthermore, if your requirements are neither fish nor fowl, it is also possible to create a balanced configuration.

My last post, Optimizing Your Network Design with the NP6 Platform, focused on how to get the best performance for North/South traffic flows (TL;DR: all flows should traverse a single Network Processing Unit (NPU)). If you want to turn on all the security bells and whistles on the FortiGate data center devices, this design could potentially leave NPUs idle.

So how do we tune the platform for high capacity/throughput designs? Well, with good design of course.

NPU Refresher

But first, a small refresher: why is port mapping important? If you recall, the FortiGate firewall’s performance figures come from a combination of general-purpose x86-family processors and two Application Specific Integrated Circuits (ASICs), the NP6 and CP8. When a new session hits the firewall’s physical interface, the first packet is switched into the CPU for the three-way handshake. Once the session is established, it is installed into the ASIC. All further packets in the same flow go straight from the switch fabric into the NPU for processing until the session times out or is cleanly dropped. From a capacity point of view, each NPU is bound to a given set of CPU cores.
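As an aside, you can check whether a given flow has actually made it into the fast path by inspecting the session table. A minimal sketch, with a purely illustrative filter value; the exact output varies by FortiOS release:

    diagnose sys session filter clear
    diagnose sys session filter dport 443
    diagnose sys session list

On NP6-based units, offloaded sessions should carry an "npu info" section in the listing; sessions stuck in the CPU slow-path will not.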

For relatively low throughputs (and I’m talking about multiples of 1Gbps bundles) this isn’t really an issue. However, in the context of multiple 10Gbps bundles, congestion may occur if all interfaces are attached to the same NPU; clearly we cannot go beyond the capacity of a single NPU. It would be difficult to reproduce this in a production environment, and even in a lab you’d need a lot of test gear. However, CPU-intensive flows (such as some IPS tasks or content filtering) or sudden load spikes (a DC failover, for example) may create a temporary bottleneck.

The total number of sessions must also be considered; each NP6 has sufficient resources to accelerate between 10 and 15 million sessions. If 20 million sessions were piled onto a single NPU, traffic would still flow, but many sessions would be placed into the CPU “slow-path”. Clearly, this would not be ideal. Limiting ourselves to a single NPU also constrains the resources available during traffic burst events.

Extreme throughput scenario

How do we design for such an extreme scenario? By balancing the physical connections across each of the NPUs. There are a couple of ways this could be achieved, but the preferred option is to use 802.3ad Link Aggregation Groups (LAGs), or bundles of interfaces. This allows us to trade the low-latency “in and out of the same NPU” design for maximum capacity.

As you would expect, there is some science to this. Using our gargantuan 3700D as an example, we have four NPUs at our disposal and a whole bunch of 10GbE SFP+ sockets. There are many combinations possible, but I’ll pick two common examples.
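Before committing to a cable plan, it is worth confirming the front-panel port to NP6/XAUI mapping on your own unit. The NP6 diagnostics will show it; output is omitted here, as the format varies between models and FortiOS releases:

    diagnose npu np6 port-list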

Requirement: Two 4x10Gbps Bundles + Four 2x10Gbps Bundles

This scenario provides a mixture of 4x and 2x LAGs: two large North/South bundles of 40Gbps each, and four 20Gbps bundles for East/West zones. This might suit a large data center with multiple server farms. The “Big” 4x10Gbps bundles will be balanced across all four NPUs, whilst the small bundles will be split across two, like so (a configuration sketch follows the port lists):

Big_LAG1:

  • NP6-0: Port26 (XAUI-0)
  • NP6-1: Port29 (XAUI-1)
  • NP6-2: Port7 (XAUI-2)
  • NP6-3: Port18 (XAUI-3)

Big_LAG2:

  • NP6-0: Port25 (XAUI-1)
  • NP6-1: Port32 (XAUI-2)
  • NP6-2: Port8 (XAUI-3)
  • NP6-3: Port15 (XAUI-0)

Small_LAG1:

  • NP6-2: Port5 (XAUI-0)
  • NP6-3: Port16 (XAUI-1)

Small_LAG2:

  • NP6-2: Port6 (XAUI-1)
  • NP6-3: Port17 (XAUI-2)

Small_LAG3:

  • NP6-0: Port28 (XAUI-2)
  • NP6-1: Port31 (XAUI-3)

Small_LAG4:

  • NP6-0: Port27 (XAUI-3)
  • NP6-1: Port30 (XAUI-0)
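To make this concrete, here is roughly what Big_LAG1 and Small_LAG1 from the lists above would look like in the CLI. Treat this as a minimal sketch rather than a finished build: the root VDOM, LACP mode, and the usual lowercase portN interface naming are assumptions to adapt to your own environment, and the exact options vary between FortiOS releases.

    config system interface
        edit "Big_LAG1"
            set vdom "root"
            set type aggregate
            set member "port26" "port29" "port7" "port18"
            set lacp-mode active
        next
        edit "Small_LAG1"
            set vdom "root"
            set type aggregate
            set member "port5" "port16"
            set lacp-mode active
        next
    end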

Requirement: Eight 2x10Gbps Bundles

This is a slightly more conventional design that might suit a managed service deployment with physically separated tenants (again, a configuration sketch follows the port lists).

Small_LAG1:

  • NP6-0: Port26 (XAUI-0)
  • NP6-1: Port29 (XAUI-1)

Small_LAG2:

  • NP6-0: Port25 (XAUI-1)
  • NP6-1: Port32 (XAUI-2)

Small_LAG3:

  • NP6-0: Port28 (XAUI-2)
  • NP6-1: Port31 (XAUI-3)

Small_LAG4:

  • NP6-0: Port27 (XAUI-3)
  • NP6-1: Port30 (XAUI-0)

Small_LAG5:

  • NP6-2: Port5 (XAUI-0)
  • NP6-3: Port16 (XAUI-1)

Small_LAG6:

  • NP6-2: Port6 (XAUI-1)
  • NP6-3: Port17 (XAUI-2)

Small_LAG7:

  • NP6-2: Port7 (XAUI-2)
  • NP6-3: Port18 (XAUI-3)

Small_LAG8:

  • NP6-2: Port8 (XAUI-3)
  • NP6-3: Port15 (XAUI-0)
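In a managed-service build, each of these bundles would typically land in its own tenant VDOM. A hedged sketch for the first two bundles follows; the VDOM names are invented for illustration and would need to exist before the interfaces can be assigned to them:

    config system interface
        edit "Small_LAG1"
            set vdom "tenant-a"
            set type aggregate
            set member "port26" "port29"
            set lacp-mode active
        next
        edit "Small_LAG2"
            set vdom "tenant-b"
            set type aggregate
            set member "port25" "port32"
            set lacp-mode active
        next
    end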

Other things to consider

The design assumes that your high throughput requirements stem from a large number of discrete sessions; a single flow in a 4x10Gbps bundle will be limited to the speed of a single physical interface. The second flow (assuming Layer 4 information is used in the hash) should hash out to a different physical interface and be able to saturate the next 10Gbps link, and so on.

In most cases, large-scale data center firewalls will be deployed as clustered pairs. This means we also need to think about where we deploy heartbeat and session synchronisation interfaces. Whilst we have a pair of CPU-bound interfaces for management purposes, in high throughput environments you’ll likely elect to use at least some NPU-bound interfaces. In reality, the traffic created by session sync between cluster members is minimal, but it still needs to be considered in your physical design. There are a few options available for tuning what and when traffic is synced, and your use case may make it unnecessary (for example, large numbers of short-lived flows).
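For reference, the knobs involved are the heartbeat device, session pickup, and (optionally) a dedicated session synchronisation interface. This is a rough sketch only; the group name and interface choices are purely illustrative:

    config system ha
        set group-name "dc-cluster"
        set mode a-p
        set hbdev "mgmt1" 50
        set session-pickup enable
        set session-sync-dev "port14"
    end

Where available, the session-pickup-delay option only synchronises sessions that have been established for a while, which is one way of dealing with those large numbers of short-lived flows.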

Of course, if you need truly ludicrous speeds then we can use the 40GbE QSFP connections. From a cabling perspective, 40GbE is “easy”: there are four QSFPs on a 3700D, and they get an NPU each. This means that bundles of 2x40GbE are merely an extension of the design proposed above. Beyond that, we of course have the 100GbE-capable, what’s-a-word-bigger-than-gargantuan 3810D.

When should we use it?

So when should we use this connection-balancing approach? Environments that push more sessions or more throughput than a single NPU can handle certainly qualify. Within the data center, such challenging throughputs are more likely, but CPU-intensive features (such as UTM) are less likely to be deployed. Furthermore, there is probably a more pragmatic solution to policing east-west traffic.

For the vast majority of customers, the simpler “in and out of the same NPU” design is going to offer low latency, matched to good throughput. And, of course, when you present such a cable plan to your data center team, they’ll think you’ve gone nuts. But as you can see, there is method behind the madness.

Thanks to VJ for his significant research and contribution