3 tips to reduce cooling-related network downtime

Cooling system failure can lead to potentially disastrous downtime for one's business.

The data center is one place where you never want to lose your cool. It takes a level head to manage the operational intricacies that so many businesses rely on.

Also, you literally cannot lose your cool. An overheated switch or server will induce downtime, and, once that happens, all bets are off. You'll figuratively and literally have to sweat it out as you struggle to get mission-critical equipment back online.  

Keep that from happening by adhering to these three cooling best practices:

1. Properly ventilate network switches

Network switches don't fail often, but when they do, they take any downstream equipment connected to them offline. Data center managers increasingly use top-of-rack (ToR) switches instead of end-of-row switches because they're relatively affordable and require less wiring. But again, if a ToR switch overheats, the entire rack goes down with it.

You can prevent this from happening by ensuring that ToR switches are properly ventilated. This can be challenging given that so many switches are mounted backwards for easier maintenance-aisle access to the switch ports. Not to mention, ToR switches are farther away from the perforated tiles in a raised-floor cooling environment.

This problem can be solved by installing hardware at the top of racks that maintains front-to-back airflow regardless of the orientation of the switch or its distance from the cool-air source. Treated air is drawn in from the cool aisle, channeled into the switch fans, and then expelled into the hot aisle.

2. Use active containment for high-density racks

As average rack densities increase, so do cooling requirements. More power means more heat, and more heat means greater chances of hot spots in your high-density racks. All it takes is a few minutes in ambient temperatures ranging from 86 to 95 degrees Fahrenheit for a CPU to fry itself. Once that happens, server downtime ensues.

Active containment is an airflow management mechanism that uses containment chambers placed above racks to facilitate the flow of hot air into return plenums. This preempts the development of hot spots or overheating incidents in several important ways:

  1. Actively monitors air pressure inside the containment chamber and increases or decreases internal fan speeds accordingly. This ensures hot air doesn't stagnate around racks, especially during peak utilization times when servers are at their hottest. 
  2. Reduces the amount of heat that reaches the maintenance aisle. Since air is pulled up into the return plenum instead of flowing into a hot aisle, data center staff will have more pleasant working conditions should they need to perform any maintenance on racks. 
  3. Buys you some time in the event of a computer room air conditioner (CRAC) outage. Should your cooling system fail, active containment may give you some precious minutes to get your redundant CRACs online, or at the very least, to come up with a contingency plan to avoid downtime. It may not seem like much now, but in a crisis, every minute matters.
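
To make the first point more concrete, here is a minimal sketch of how a pressure-driven fan adjustment might work. It is not taken from any particular containment product: the function name, the proportional gain, the target pressure, and the speed limits are all hypothetical values chosen for illustration.

```python
def next_fan_speed(current_speed_pct, pressure_pa, target_pa=0.0,
                   gain=5.0, min_pct=20.0, max_pct=100.0):
    """Return a new fan speed (percent) from one chamber pressure reading.

    Assumptions (all hypothetical): pressure is measured in pascals relative
    to the room, a positive reading means hot air is backing up inside the
    chamber, and a simple proportional step is good enough for a sketch.
    """
    error = pressure_pa - target_pa              # positive: speed up, negative: slow down
    proposed = current_speed_pct + gain * error  # proportional adjustment
    return max(min_pct, min(max_pct, proposed))  # keep within the fan's limits


# Walk through a few made-up pressure readings, as a controller loop would.
speed = 50.0
for reading_pa in [2.0, 4.0, 1.0, -3.0]:
    speed = next_fan_speed(speed, reading_pa)
    print(f"pressure {reading_pa:+.1f} Pa -> fan speed {speed:.0f}%")
```

A real controller would likely smooth the readings and limit how quickly the fans ramp, but the basic idea is the same: more pressure in the chamber means faster fans.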

3. Focus on performance costs  

Power usage effectiveness (PUE), while valuable, won't always tell the full story when it comes to cooling costs. Yes, you absolutely should be trying to cut down on the amount of energy you use to keep equipment cool. Active containment helps with that by facilitating better airflow, as does letting equipment run at higher temperatures.

However, letting servers run at the higher end of ASHRAE's recommended temperature range means temperatures will reach dangerous levels much faster in the event of a cooling failure. So, while it's wise to operate with an eye on PUE, we also recommend tracking:

  • IT thermal conformance: The percent of IT equipment operating within ASHRAE's recommended temperature range while cooling systems are fully functioning
  • IT thermal resilience: The percent of IT equipment operating within ASHRAE's recommended temperature range after cooling systems have failed

These two metrics, along with PUE, are the main components of The Green Grid's Performance Indicator metric. Whereas PUE, the ratio of total facility power to IT equipment power, reflects how much power goes to non-IT equipment (e.g., CRACs), Performance Indicator can be used to justify a slightly higher PUE if it means greater resilience. In this way, it can reduce the odds of downtime stemming from cooling failures.
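
For a rough sense of how these three numbers could be calculated, consider the sketch below. The function names and sample readings are hypothetical, and the 18 to 27 degrees Celsius default is an assumption standing in for whichever ASHRAE recommended range applies to your equipment. The readings labeled as "failure" would come from a cooling-failure test or simulation, not from normal operation.

```python
def pue(total_facility_kw, it_equipment_kw):
    """PUE: total facility power divided by IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def percent_in_range(inlet_temps_c, low_c=18.0, high_c=27.0):
    """Percentage of equipment whose inlet temperature falls in the range.

    The default bounds are an assumption; swap in the range you track.
    """
    if not inlet_temps_c:
        raise ValueError("need at least one temperature reading")
    in_range = sum(1 for t in inlet_temps_c if low_c <= t <= high_c)
    return 100.0 * in_range / len(inlet_temps_c)


# Made-up inlet temperatures (degrees Celsius) for four pieces of equipment.
normal_temps = [21.0, 23.5, 26.0, 28.0]    # cooling fully functioning
failure_temps = [25.0, 29.0, 31.5, 33.0]   # after a cooling failure

print("PUE:", pue(1500.0, 1000.0))
print("IT thermal conformance (%):", percent_in_range(normal_temps))
print("IT thermal resilience (%):", percent_in_range(failure_temps))
```

Tracked side by side, the three figures show the trade-off described above: a change that improves PUE but drags down thermal resilience may not be worth making.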