So, most people know that I work on some pretty big computers for a living. What’s really strange is how fragile those computers can be sometimes.

Here is the scenario. I’m called into work on a Sunday afternoon because our production cluster is having issues. Jobs will not run; everything is at a standstill. I come in, poke around for a while, and finally find the culprit: a $1 fan on a 10 Gigabit Ethernet card has failed, causing the card to shut itself off before it melts. Losing the Ethernet card takes one of my management DNS servers down. With that DNS server down, the compute nodes have to wait out a timeout before switching to the secondary, and that timeout is long enough that nothing runs. And I do mean nothing. 350+ compute nodes, all making heat and keeping floor tiles down. All because of one little fan in just the right spot.
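To put a rough number on that timeout, here is a minimal sketch of how a resolver that tries its DNS servers in order loses time on a dead primary. It is a simplified model, not the actual resolver code: the function name `lookup_delay` is made up for illustration, and the 5 second timeout and 2 attempts are assumed values borrowed from common resolver defaults, not the settings on this cluster.

```python
def lookup_delay(servers_up, timeout_s=5.0, attempts=2):
    """Seconds spent waiting on dead servers before the first answer.

    servers_up: list of booleans in resolver order (True = server answers).
    timeout_s and attempts are assumed values, not measured ones.
    Returns None if no server answers within all attempts.
    """
    waited = 0.0
    for _ in range(attempts):
        for up in servers_up:
            if up:
                return waited
            waited += timeout_s  # a dead server costs one full timeout
    return None


# Primary down, secondary healthy: each lookup stalls for one timeout first.
per_lookup = lookup_delay([False, True])
print(f"extra wait per lookup: {per_lookup} s")
```

Under these assumptions every single name lookup pays the full timeout before the secondary is even asked, and a job that does many lookups pays it over and over.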

 

[Image: HCA_fan]

Here you can see the card the fan was on.

[Image: 2014-08-10 15.22.46]
