By Darren Giacomini
A quick overview of any Video Management System (VMS) platform will reveal a myriad of options for sustaining operational status: primary and secondary databases, redundant and co-located recording platforms, and critical service failover. The evolution of VMS platforms lends itself to sustaining critical operations and, in some cases, delivering five-nines (99.999%) availability for live and archived video.
While not every physical security application requires high availability, environments such as hospitals, critical infrastructure, transportation, correctional facilities, and hospitality often do demand such uptime. In each of these critical applications, the VMS is capable of delivering high availability but remains dependent on the underlying network infrastructure. If the network infrastructure lacks the resilience to provide continued operation, most VMS platforms will fail, despite their best efforts.
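To put the five-nines figure in perspective, the short Python sketch below converts an availability percentage into a yearly downtime budget and shows how a VMS that depends on the network inherits the network's weaker availability. The three-nines network figure is a hypothetical input chosen for illustration, and the serial model assumes that the VMS is only usable when both the application and the network are up.

```python
# Minutes in a non-leap year.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime allowed by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of components that must all be up at the same time."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    five_nines = 0.99999
    budget = downtime_minutes_per_year(five_nines)
    print(f"Five-nines budget: {budget:.2f} minutes per year")

    # Hypothetical figures: a five-nines VMS running on a three-nines network.
    combined = serial_availability(0.99999, 0.999)
    print(f"Combined availability: {combined:.5f}")
    print(f"Combined downtime: {downtime_minutes_per_year(combined):.0f} minutes per year")
```

Five-nines leaves a budget of roughly 5.26 minutes per year, so in this simple model the network, not the VMS, sets the ceiling on what the end user actually experiences.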
In this article, we will review the technologies that provide core network resilience and discuss some of the common pitfalls associated with them.
Resilience at the Network Core:
In the world of physical security, a collapsed core design is very common. A collapsed core design combines the roles of the “distribution” and “core” switches into a single entity. In other words, servers, storage, and uplinks to intermediate distribution frame (IDF) locations all terminate at a single location that represents your collapsed core.
Terminating all of these entities on a single core switch leaves the design without resiliency: the switch becomes a single point of failure and can cause excessive downtime. You can achieve high availability by avoiding a single core switch.
Consider the following when designing your network core:
- Failure of a single core switch, in most cases, will result in a complete loss of operational status.
- Software updates (or maintenance windows) will result in a complete loss of operational status.
- Failure of any uplink from an IDF location will result in regional loss of data from the IDF.
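The considerations above can be checked with a small reachability model. The sketch below is a minimal illustration, not a design tool: the node names and the four-link topology are hypothetical, and a failure is modeled simply by removing a switch or a link from the graph.

```python
from collections import deque


def reachable(links, start, failed=frozenset()):
    """Return the set of nodes reachable from start, skipping failed nodes."""
    if start in failed:
        return set()
    neighbors = {}
    for a, b in links:
        if a in failed or b in failed:
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical collapsed core: everything terminates on one switch.
SINGLE_CORE = [
    ("core", "server"),
    ("core", "storage"),
    ("core", "idf1"),
    ("core", "idf2"),
]

if __name__ == "__main__":
    # Healthy network: idf1 reaches every other node.
    print(sorted(reachable(SINGLE_CORE, "idf1")))
    # Core failure (or a maintenance reboot): idf1 reaches only itself.
    print(sorted(reachable(SINGLE_CORE, "idf1", failed={"core"})))
    # Uplink failure: idf2 is cut off from the server and storage.
    without_uplink = [link for link in SINGLE_CORE if link != ("core", "idf2")]
    print(sorted(reachable(without_uplink, "idf2")))
```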
Mechanisms to Provide Resilience at the Core:
– Switch Stacking:
Switch stacking is a mechanism that allows multiple switches to be connected, via stacking cables, to represent a single logical switch for both management and operation. One of the major drawbacks of switch stacking is resource depletion.
When you stack multiple switches together, you increase the port density of the managed switch, but the available resources are still defined by the limits of a single switch entity. Increasing the number of terminations at the core, without increasing resource capacity, can severely limit the scalability of the physical security network.
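A rough way to see the resource depletion problem is to divide a fixed, shared resource by a growing port count. The numbers below are purely hypothetical (a 16,000-entry table and 48-port stack members), and the table stands in for whichever resource a given switch shares across the stack.

```python
def entries_per_port(table_capacity: int, ports_per_switch: int, stack_members: int) -> float:
    """Average share of a fixed-size table available to each port in the stack."""
    total_ports = ports_per_switch * stack_members
    return table_capacity / total_ports


if __name__ == "__main__":
    # Hypothetical values: the table size stays fixed while ports are added.
    TABLE_CAPACITY = 16_000
    PORTS_PER_SWITCH = 48
    for members in range(1, 5):
        share = entries_per_port(TABLE_CAPACITY, PORTS_PER_SWITCH, members)
        print(f"{members} stack member(s): {PORTS_PER_SWITCH * members} ports, "
              f"{share:.0f} table entries per port")
```

Each added stack member increases the port count, but the per-port share of the shared resource shrinks accordingly.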
Resiliency from the edge of the network will be dependent on link aggregation. Since the switch stack is seen as a single logical switch, each IDF closet effectively has only a single logical uplink to the core. To provide resiliency from the edge, multiple physical uplinks are bundled together with Link Aggregation Control Protocol (LACP) to provide resiliency at the link level. The limitations of resource depletion, and resiliency only at the link level, make this a less than optimal solution.
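To illustrate link-level resiliency, the sketch below spreads flows across the member links of an aggregated bundle and then re-maps them when one member fails. It is a simplified model: the CRC32 hash and the choice of flow fields are assumptions made for the example, since real switches use vendor-specific hash inputs, and the addresses and link names are hypothetical.

```python
import zlib
from collections import Counter


def pick_link(flow, active_links):
    """Map a flow to one active member link with a deterministic hash."""
    if not active_links:
        raise RuntimeError("all member links are down")
    key = "|".join(str(field) for field in flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]


if __name__ == "__main__":
    # Hypothetical camera streams: (source IP, destination IP, destination port).
    flows = [(f"10.0.1.{n}", "10.0.0.10", 554) for n in range(1, 17)]

    # Both member links up: flows are spread across the bundle.
    links = ["uplink-1", "uplink-2"]
    print(Counter(pick_link(flow, links) for flow in flows))

    # uplink-1 fails: every flow is re-mapped to the surviving member.
    survivors = ["uplink-2"]
    print(Counter(pick_link(flow, survivors) for flow in flows))
```

The bundle survives the loss of a member link, but if both members terminate on the same logical switch, the bundle offers no protection against the loss of that switch.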
– Dual Core Layer 2 Resiliency:
To avoid the limitations associated with a single core or switch stack, multiple core switches can be deployed to serve as termination points for local resources and IDF locations.
The topology of the network will consist of a link between the two core switches and a link from each of the core switches out to each IDF closet. With each IDF closet having redundant fiber connections back to both cores, mechanisms must be put in place to ensure a Layer 2 loop does not occur on the network.
Spanning tree is the usual mechanism. Execution of the spanning-tree algorithm will place one of the uplinks to the core in a blocking state, while the other uplink forwards the data back to the core.
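The blocking behavior can be sketched for a dual-core triangle. This is a heavily simplified stand-in for spanning tree, not the protocol itself: it assumes equal link costs, uses the lowest switch name in place of the lowest bridge ID to pick the root, and builds the tree with a breadth-first search. The switch names are hypothetical.

```python
from collections import deque


def spanning_tree(links):
    """Return (forwarding, blocked) link sets for an equal-cost topology."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    root = min(neighbors)  # lowest name stands in for lowest bridge ID
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbors[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    forwarding = {
        frozenset((child, up)) for child, up in parent.items() if up is not None
    }
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


if __name__ == "__main__":
    triangle = [("core1", "core2"), ("core1", "idf1"), ("core2", "idf1")]
    forwarding, blocked = spanning_tree(triangle)
    print("blocked:", [sorted(link) for link in blocked])

    # The forwarding uplink fails: the tree is recomputed and the
    # previously blocked uplink now carries the traffic.
    degraded = [("core1", "core2"), ("core2", "idf1")]
    forwarding, blocked = spanning_tree(degraded)
    print("forwarding after failure:", sorted(sorted(link) for link in forwarding))
```

With core1 as root, the idf1 uplink to core2 is the one left out of the tree, which mirrors the single forwarding uplink described above.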
It should be noted that while there are redundant links back to the core of the network, only one link will be forwarding data at any given time. The main limitation of this type of design is the convergence time upon core or uplink failure.
When the primary forwarding uplink fails, or the primary core fails, the network will need to converge to the new primary path. In a simple unicast network, this can result in outages lasting tens of seconds. On a multicast or complex network, this convergence time can be significantly longer.
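The tens-of-seconds figure is consistent with the default timers of classic IEEE 802.1D spanning tree (a 20-second maximum age and a 15-second forward delay), assuming that is the variant in use; Rapid Spanning Tree typically converges faster. The sketch below works through that arithmetic and estimates the video lost during the outage. The camera count and frame rate are hypothetical.

```python
# Default timers of classic IEEE 802.1D spanning tree, in seconds.
MAX_AGE_S = 20
FORWARD_DELAY_S = 15


def direct_failure_outage(forward_delay=FORWARD_DELAY_S):
    """Outage when the switch sees its own link go down: listening plus learning."""
    return 2 * forward_delay


def indirect_failure_outage(max_age=MAX_AGE_S, forward_delay=FORWARD_DELAY_S):
    """Outage when stored topology information must age out first."""
    return max_age + 2 * forward_delay


def frames_lost(outage_s, fps, cameras):
    """Video frames that never reach the recorder during the outage."""
    return outage_s * fps * cameras


if __name__ == "__main__":
    print("Direct failure:", direct_failure_outage(), "seconds")
    print("Indirect failure:", indirect_failure_outage(), "seconds")
    # Hypothetical site: 100 cameras streaming at 30 frames per second.
    print("Frames lost:", frames_lost(direct_failure_outage(), fps=30, cameras=100))
```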
It is important to keep in mind that this convergence time is for the network alone. Each VMS platform has its own tolerance to outages, and it is not uncommon for the VMS to fail even once the network has recovered.
Due to potential outages during convergence times, this is not the optimal solution for resiliency.
– Dual Core Layer 3 Resiliency:
In the Layer 2 resilient design, convergence time upon failure was a significant weakness. The stability of the VMS is dependent upon a consistent exchange of data between components and, more often than not, an interruption to that data flow can be terminal.
While each VMS has its own tolerance level to network outages, shorter is always better. Layer 3 routing protocols like OSPF, EIGRP, and IS-IS can be used to configure your network uplinks as Layer 3 routed links that operate in an “Active-Active” environment.
Active-Active uplinks are not susceptible to Layer 2 loops and can actively pass data across both uplinks to the core. Because both uplinks are already active and load balancing, convergence time on link or core failure is typically sub-second and does not affect the overall operation of the VMS.
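The Active-Active behavior comes from equal-cost paths in the routing table. The sketch below is a minimal shortest-path calculation, not an implementation of OSPF, EIGRP, or IS-IS: it finds every first hop that lies on a least-cost path, which is what lets a router install both uplinks at once. The topology, link costs, and a server reachable through both cores are all assumptions made for the example.

```python
import heapq


def ecmp_next_hops(graph, source, dest):
    """Return the first hops from source that lie on a least-cost path to dest."""
    best = {source: 0}
    hops = {source: set()}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            first = {nxt} if node == source else hops[node]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                hops[nxt] = set(first)
                heapq.heappush(heap, (new_cost, nxt))
            elif new_cost == best[nxt]:
                hops[nxt] |= first  # another equal-cost path
    return hops.get(dest, set())


def without_node(graph, failed):
    """Copy of the graph with one node and all of its links removed."""
    return {
        node: {n: w for n, w in edges.items() if n != failed}
        for node, edges in graph.items()
        if node != failed
    }


# Hypothetical dual-core topology with equal link costs.
GRAPH = {
    "idf1": {"core1": 10, "core2": 10},
    "core1": {"idf1": 10, "core2": 10, "server": 10},
    "core2": {"idf1": 10, "core1": 10, "server": 10},
    "server": {"core1": 10, "core2": 10},
}

if __name__ == "__main__":
    print(sorted(ecmp_next_hops(GRAPH, "idf1", "server")))
    print(sorted(ecmp_next_hops(without_node(GRAPH, "core1"), "idf1", "server")))
```

Because the second path is already installed before the failure, losing core1 only removes a next hop rather than forcing the network to discover a new path.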
The operational efficiency of this solution does, however, come at the cost of increased complexity. The complexity and expertise required to deploy and maintain this solution make it impractical for the majority of physical security applications.
– Multi-Chassis LAG or MLAG Virtualized Cores:
To deliver the uncompromised performance of Layer 3 resiliency while avoiding excessive complexity, some switch vendors support virtualization of the core of the network. Multi-Chassis LAG, or MLAG, allows two switches to be virtualized into a single core, while relying on link aggregation from the edge of the network to provide resiliency.
A simple LACP aggregated link is configured at the IDF switch and the uplinks are split between the two virtual cores. Both links are active and load balancing at all times, with convergence in the sub-second range for uplink and core failure. The virtualized chassis solution allows for Active-Active resilient paths, load balancing, and failures that are hitless to the VMS deployed on them.
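The difference between an ordinary aggregated link and a multi-chassis one can be shown with a small survivability check. This is a conceptual sketch only: the link and chassis names are hypothetical, and the model treats a bundle as usable while at least one member link and the chassis it terminates on are both up.

```python
def bundle_is_up(member_links, failed_chassis=frozenset(), failed_links=frozenset()):
    """Return True if at least one member link and its chassis are still up.

    member_links maps each link name to the chassis it terminates on.
    """
    return any(
        chassis not in failed_chassis and link not in failed_links
        for link, chassis in member_links.items()
    )


# Both members terminate on the same core switch.
SINGLE_CHASSIS_LAG = {"uplink-1": "core-a", "uplink-2": "core-a"}
# Members are split between the two virtualized cores.
MULTI_CHASSIS_LAG = {"uplink-1": "core-a", "uplink-2": "core-b"}

if __name__ == "__main__":
    for name, bundle in (("single-chassis", SINGLE_CHASSIS_LAG),
                         ("multi-chassis", MULTI_CHASSIS_LAG)):
        link_ok = bundle_is_up(bundle, failed_links={"uplink-1"})
        core_ok = bundle_is_up(bundle, failed_chassis={"core-a"})
        print(f"{name}: survives link failure={link_ok}, survives core failure={core_ok}")
```

Both bundles survive the loss of one uplink, but only the multi-chassis bundle survives the loss of a core switch.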
– 802.1aq Shortest Path Bridging:
Shortest Path Bridging technology allows for the best of both worlds. All uplinks from the IDF locations are Active-Active, and each of the core switches maintains independent resources for enhanced scalability. Extensive testing has shown that tens of thousands of streams can converge sub-second, and configuration in most cases can be automated to reduce deployment errors.
While VMS platforms offer multiple layers of resilience at the application level, the functionality of these mechanisms is dependent on a network that provides high availability and fast convergence. With the ever-increasing number of applications that require “always on” video, resiliency at the core of the network is the first step to providing an underlying network infrastructure that can sustain five-nines of availability.
About The Author
Darren Giacomini is the Director of Networking at BCDVideo and has over 16 years of networking experience.