With the recent announcement of the new FortiGate 300D, 500D, and new blades for the 5000 series chassis, Fortinet continues its roll-out of the NP6 processor across the FortiGate portfolio. The NP6 processor is the "beating heart" of the FortiGate firewall; like something from science fiction, there is often more than one of them. While the headline performance figures for the NP6 are impressive, good design and pragmatic implementation are the keys to a successful deployment.
What Makes the NP6 Special or Important?
The demands placed upon a modern firewall are diverse. Different processors are better suited to handle different types of traffic, with some tasks such as anti-virus scanning best handled by a general-purpose processor. Threats and counter-measures are constantly evolving, requiring constant tweaks to the engine. Conversely, multicast protocols can generate large volumes of data, but computationally speaking only a limited number of operations need to be supported - this makes them ideal for optimization in hardware.
With this in mind, the FortiGate is considered an ASIC firewall - an Application Specific Integrated Circuit (ASIC) performs the heavy lifting of network throughput. However, the NP6 ASIC is only half of the picture. There are in fact two distinct ASIC designs in most FortiGate firewalls:
- A Network Processor (NP) that accelerates IPv4, IPv6, IPsec encryption, and multicast traffic. The current generation is the NP6 and replaces the widely-deployed NP4.
- A Content Processor (CP) that offloads a variety of CPU-intensive security services used by the IPS, bulk crypto, and authentication engines. The current generation CP8 is deployed in many Fortinet products.
In addition to the ASICs, most devices also carry a general-purpose processor with multiple, and sometimes many, cores. For devices where low power consumption and a smaller footprint are bigger considerations than raw throughput, the entry-level and mid-range appliances contain either a “System on a Chip” (SoC) or an NP4Lite (a “diet” version of the NP4 processor), usually in conjunction with a content processor and/or a general-purpose processor. However, the recently released 300D, 500D, and new 5000 series blades join the existing 1500D and 3700D firewalls with the “perfect trifecta” of NP6, CP8, and conventional CPU.
For appliances such as the 1500D and 3700D that house multiple NP6 processors, some consideration must be given to the physical layout. Each NP6 has a maximum throughput of 40Gbps and supports up to 10 million sessions. It is easy to see, then, how the 1500D with its two NP6s can achieve 80Gbps and the 3700D with its four can achieve 160Gbps. Once a session has been established through the firewall, the CPU offloads it to the NPU, and all subsequent packets in the flow are handled by the NPU. This is the fast-path technique used, in various flavors, by many firewall vendors. Once a session is in the fast-path, the general-purpose CPU is free for tasks more complex than simple traffic forwarding; some flows will still be directed to the CPU and the content processors.

It is relatively straightforward to balance sessions across the processors to ensure optimal usage of the available resources. It is, however, more difficult to balance ingress traffic across the network processors, as they are directly attached to the internal switch fabric. Consider the network interfaces on the firewall to be lanes on a motorway/freeway/autobahn approaching a major city. Each city has a fixed number of lanes with congestion management. While it is possible to manage the traffic within the lanes to a point, if traffic is not evenly distributed across the lanes and the cities, one can become congested while another idles. If an NPU becomes saturated (for example, four 2x10Gbps 802.3ad Link Aggregation Groups (LAGs) attached to a single NPU), new sessions cannot be installed on that NPU and will instead be serviced by the CPU. The firewall will still function as it should, but performance will inevitably be suboptimal.
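A quick way to confirm whether a given flow has made it into the fast-path is to inspect the session table: offloaded sessions carry NPU details in their output. A rough sketch (the destination address below is purely illustrative, and the exact output fields vary between FortiOS releases):

```
# Filter the session table down to the flow of interest
diagnose sys session filter dst 203.0.113.10
diagnose sys session list

# An offloaded session includes NPU detail lines such as:
#   npu_state=0x000c00
#   npu info: offload=8/8 ...
# A session with no npu info lines is being forwarded by the CPU.
```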
VDOMs and LAGs
Another important operational consideration is VDOM and LAG configuration. Many customers use VDOMs as an easy (and free!) method of virtualizing traffic on their firewalls; a common deployment pattern is one VDOM per application or tenant. Traffic transiting the firewall should ideally arrive and depart on the same NPU, for the same reason that in air travel you usually want to return to the same city you departed from: to avoid an additional “hop” between cities/NPUs. There are, of course, circumstances where traffic must cross between VDOMs and NPUs; hardware-accelerated inter-VDOM links are provided expressly for this purpose. For maximum performance, though, north-south traffic should enter and leave via the same NPU. Where LAGs are used, the member interfaces should be sequential for optimal hardware acceleration: for LAG1, port1 and port2 should be used; for LAG2, ports 3 and 4. You get the idea.
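The sequential-member guidance might look like this in FortiOS (the VDOM name is an example, and you should check the port-to-NPU mapping for your particular platform before choosing members):

```
config system interface
    edit "LAG1"
        set vdom "VDOM-Live"          # example VDOM name
        set type aggregate
        set member "port1" "port2"    # sequential ports on the same NPU
    next
    edit "LAG2"
        set vdom "VDOM-Live"
        set type aggregate
        set member "port3" "port4"
    next
end
```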
The diagram below shows a very stylized example of a typical (well, at least to me) deployment.
This particular scenario shows VDOM-Live and VDOM-Test spread across two NPUs. This is not configured directly on the firewall, but follows by implication: all network interfaces and LAGs associated with each VDOM are connected to the same NPU. Each LAG is associated with at least one zone (GI-DMZ, Mobile Zone, Untrust), and the client VLANs are then tiered from each zone. In this case, hardware inter-VDOM links are configured to provide an accelerated path for east-west traffic between the VDOMs and, by extension, the NPUs. The root VDOM is connected only to CPU-bound interfaces and doesn't actually perform any significant processing in this context.
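On NP6 platforms, each NPU exposes a pair of virtual link interfaces (npu0_vlink0 and npu0_vlink1 for the first NPU, and so on) that form the hardware-accelerated inter-VDOM link. A sketch of wiring one pair between the two VDOMs above (the /30 addressing is purely illustrative):

```
config global
config system interface
    edit "npu0_vlink0"
        set vdom "VDOM-Live"
        set ip 10.255.0.1 255.255.255.252    # example point-to-point /30
    next
    edit "npu0_vlink1"
        set vdom "VDOM-Test"
        set ip 10.255.0.2 255.255.255.252
    next
end
```

With both ends addressed, a firewall policy and route in each VDOM then allow east-west traffic to flow across the link without leaving the NPU complex.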
How do I find out which port is connected to which NPU?
There are two methods of finding this out. Using the Command Line Interface (CLI), just run the following commands:
```
config global
diagnose npu np6 port-list
```
And you will be shown the interface connection table for each NPU:
```
FOOFIREWALL2 (global) # diag npu np6 port-list
Chip   XAUI Ports   Max   Cross-chip
                    Speed offloading
------ ---- ------- ----- ----------
np6_0  0    port26  10G   Yes
       1    port25  10G   Yes
       2    port28  10G   Yes
       3    port27  10G   Yes
       0-3  port1   40G   Yes
------ ---- ------- ----- ----------
np6_1  0    port30  10G   Yes
       1    port29  10G   Yes
       2    port32  10G   Yes
       3    port31  10G   Yes
       0-3  port3   40G   Yes
------ ---- ------- ----- ----------
np6_2  0    port5   10G   Yes
       0    port9   10G   Yes
       0    port13  10G   Yes
       1    port6   10G   Yes
       1    port10  10G   Yes
       1    port14  10G   Yes
       2    port7   10G   Yes
       2    port11  10G   Yes
       3    port8   10G   Yes
       3    port12  10G   Yes
       0-3  port2   40G   Yes
------ ---- ------- ----- ----------
np6_3  0    port15  10G   Yes
       0    port19  10G   Yes
       0    port23  10G   Yes
       1    port16  10G   Yes
       1    port20  10G   Yes
       1    port24  10G   Yes
       2    port17  10G   Yes
       2    port21  10G   Yes
       3    port18  10G   Yes
       3    port22  10G   Yes
       0-3  port4   40G   Yes
```
Alternatively, do what I do: open the FortiOS Handbook Hardware Acceleration Guide and look for the diagram with the tiny port numbers written above the colour-coded physical interfaces:
It is often said that power is useless without control. The FortiGate platform and the NP6 processor have both in spades: the control comes at the design stage, when you map physical interfaces to zones, and in turn to VDOMs and NPUs. The exceptional performance of the FortiGate firewall comes from its hybrid design of conventional CPUs and ASICs, and optimal performance management is simply a case of understanding how traffic enters the firewall and how it leaves. While one is not expected to understand the intricate mechanics of a supercar, one does need to know what the loud pedal is for and where the fuel goes.