Xeon Heat Management
Last week I wrote about the trouble I had getting some of my older 32-bit Athlon processors to run in a low-power, low-heat mode during idle conditions. As I said then, being able to switch into this mode when the operating system isn't busy gets you most of the way toward decent power and thermal management, although sometimes you need to do other things as well, like use better fans or heatsinks. To illustrate just how much extra effort can sometimes be required, I thought I would describe my efforts to get one of my Xeon-based servers to run at reasonable temperatures.
The adventure all started when I picked up a couple of 3.4 GHz Xeon processors to use for application and network testing under VMware. I wanted to load the system with multiple Gigabit Ethernet cards, so I bought a SuperMicro X6DHE-XG2 dual-Xeon motherboard with multiple 64-bit PCI slots. I also picked up a Chenbro RM-216 2U chassis that came bundled with a 460-watt Zippy P2G-6460P power supply and four generic high-speed intake cooling fans, along with a pair of Swiftech MCX603-V low-profile Xeon heatsinks, which fit inside the 2U chassis and come with their own factory-supplied cooling fans. Lastly, I got hold of four Intel PRO/1000 MT dual-port server NICs, which came with low-profile brackets that fit inside the chassis.
Once everything was assembled and the base software installed, the system seemed to run fine, but it had a couple of problems. For one thing, all the fans made the server VERY LOUD, especially since it was sitting in my office a few feet from my desk; I literally couldn't talk on the telephone while the server was running. Second, and much more worrisome, the system would periodically reboot or shut itself down with no apparent warning. After some poking and probing, I discovered that the source of both problems was extreme heat from the CPUs. The system was running at 70 degrees Celsius at idle and would sometimes climb into the 90-degree range when the VMs were busy. That in turn caused the system BIOS to run all the fans at their highest speed (the primary source of the noise) and also caused the system to shut itself off when things got too hot.
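As an aside, if you want to keep an eye on temperatures like these over time rather than eyeballing the BIOS screen, a small script that polls the kernel's sensor interface works well. The sketch below is just that, a sketch: it assumes a Linux host with lm-sensors/hwmon exposing readings under /sys/class/hwmon, and the 80-degree alert threshold and five-second polling interval are placeholders I made up, not values from this particular board.

```python
#!/usr/bin/env python3
"""Log hwmon temperature readings and flag anything above a threshold.

A minimal sketch, assuming a Linux host with lm-sensors/hwmon sensors
under /sys/class/hwmon; the threshold and interval are illustrative.
"""
import glob
import time

ALERT_C = 80.0     # placeholder: complain above this temperature
INTERVAL_S = 5     # placeholder: seconds between polls

def read_temps():
    """Return a list of (sensor path, degrees Celsius) tuples."""
    readings = []
    for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
        try:
            with open(path) as f:
                millideg = int(f.read().strip())   # kernel reports millidegrees
        except (OSError, ValueError):
            continue
        readings.append((path, millideg / 1000.0))
    return readings

if __name__ == "__main__":
    while True:
        stamp = time.strftime("%H:%M:%S")
        for path, celsius in read_temps():
            flag = "  <-- HOT" if celsius >= ALERT_C else ""
            print(f"{stamp}  {path}  {celsius:5.1f} C{flag}")
        time.sleep(INTERVAL_S)
```

Logging to a file and graphing the output makes it easy to see whether the spikes line up with the unexplained reboots.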
The first thing I tried was simply reseating the heatsinks with better thermal paste, but that had only a minor effect, so I started experimenting with different fans and placements.
The X6DHE-XG2 motherboard has headers for eight fans (two for the CPUs, two for the rear exhaust, and four for intakes right behind the drives), and the RM-216 chassis has cutouts for the same intake and exhaust positions. Because I wasn't using the rear exhausts, adding those seemed like the most obvious solution. However, since the chassis is a 2U low-profile unit, the cutouts for the two exhaust fans sit above the motherboard's rear-panel connector block, meaning they can only accommodate 40 mm fans. A decent 80 mm fan (the most common size) can move a reasonable amount of air at a reasonable decibel level, but a 40 mm fan has to really crank up the RPMs to move even a modest amount of air, and that means more noise. Even after trying several different brands and models, the temperature either didn't drop by any significant amount or actually got worse from the fans restricting airflow. In the end, the only noticeable change was that the system got a lot louder.
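A rough rule of thumb explains why those little fans are fighting a losing battle. The standard fan-affinity approximation (which ignores blade design and static pressure, so take it loosely) says airflow scales with rotational speed times the cube of the fan diameter, so matching an 80 mm fan's output with a 40 mm fan takes something like eight times the RPM:

```latex
% Rough fan-affinity scaling: Q = airflow, N = RPM, D = fan diameter.
% An approximation only; blade design and static pressure are ignored.
Q \propto N D^{3}
\qquad\Longrightarrow\qquad
\frac{N_{40}}{N_{80}} \approx \left(\frac{D_{80}}{D_{40}}\right)^{3}
  = \left(\frac{80}{40}\right)^{3} = 8
```

And since fan noise climbs steeply with RPM, most of that extra speed shows up as noise rather than cooling.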
For the next attempt, I tried to tackle the problem head-on by increasing the airflow around the CPUs themselves. The MCX603-V heatsinks have pretty good ratings on various message boards, and the bundled Delta AFB0812M fans have decent airflow and decibel ratings. However, the fans are also 20 mm thick, and in this application they appeared to be pressing against the lid of the chassis, which restricted airflow and also caused some baffling vibration problems that produced extra noise.
To eliminate these problems, I ordered a set of Evercool EC8015H12B 15 mm-thick high-speed fans and swapped them in for the Deltas. The CPU temperature immediately dropped by a good 10 degrees just from the extra half-centimeter of clearance above the heatsinks. I was able to lower the temperature further by pointing the fans upwards, which reinforced the natural convection off the CPUs and heatsinks and allowed the chassis intake fans to push hot air off the top of the heatsinks at the same time they were blowing cool air onto the CPUs.
At this point I was pretty pleased with the results, but the noise from all the fans was getting on my nerves, so I went back to experimenting with replacement fans for the chassis intakes. I must have gone through a half-dozen vendors and fan models ("smart" variable-speed fans, high-density fans, you name it), but I eventually settled on an intake arrangement consisting of the two Delta fans that came from Swiftech and a pair of Unincom U8025 sleeve-bearing fans, which together provided a good combination of targeted airflow and quiet operation. I left the exhaust ports open and unused, because none of the 40 mm fans I tried ever seemed to do anything positive.
The biggest reduction in noise, however, came from replacing the power supply. One day I was reading about home-theater PC gear and stumbled across a site for a custom rackmount chassis that used Enermax EG451P-VD 2U power supplies. I figured that if they were quiet enough for home-theater gear, they had to be quieter than the Zippy power supply I was using, so I ordered one to test. Not only did the noise all but disappear, but the system temperature dropped another 10 degrees as well. These power supplies have separate intake and exhaust fans, each of which runs at a lower rotational speed than the single fan in a typical unit, which improves airflow while keeping the noise down. Unfortunately, these power supplies appear to be either discontinued or never widely released in the U.S. market, and I've bought all the units I've been able to find for sale (I use them in all my rackmount systems now). [Update 3 (May 27): they have been discontinued; iStar seems to have some fairly quiet, dual-fan rackmount power supplies though.]
All told, my system now runs at a respectable 44 degrees Celsius at idle and only goes up into the low 70s when multiple VMs are active and busy (which isn't very common in the kind of testing I usually do). It even runs cool enough that I'm able to use the BIOS "workstation" fan setting, which uses a lower fan voltage during normal operation but turns up the speed when the system starts to get warm. Overall, the energy requirements are about as low as can be expected (not very), there are no heat problems to speak of, and best of all I can't hear the server running unless I really put the system to work (thus causing the fans to spin up), even though it's only a few feet from my desk. However, getting to this point took almost a year of experimentation and several dozen pieces of trial-and-error gear, most of which didn't help and some of which produced unexpectedly good results.
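For what it's worth, that "workstation" setting behaves like a simple two-level thermostat: hold the fans at a reduced voltage while the system is cool, then run them at full speed once a trip temperature is crossed. The sketch below is only my guess at the general idea, not SuperMicro's actual firmware logic, and the thresholds, duty levels, and hysteresis band are all invented for illustration.

```python
# A guess at the general idea behind a "workstation" BIOS fan profile:
# two fan levels with a hysteresis band so the fans don't cycle constantly.
# The thresholds and duty levels below are invented for illustration; this
# is NOT SuperMicro's actual firmware logic.

LOW_DUTY, HIGH_DUTY = 0.60, 1.00   # fraction of full fan voltage/speed
TRIP_C, RELEASE_C = 65.0, 55.0     # speed up above TRIP_C, slow down below RELEASE_C

def next_duty(cpu_temp_c: float, current_duty: float) -> float:
    """Return the fan duty to apply for the given CPU temperature."""
    if cpu_temp_c >= TRIP_C:
        return HIGH_DUTY          # getting warm: full speed
    if cpu_temp_c <= RELEASE_C:
        return LOW_DUTY           # comfortably cool: quiet mode
    return current_duty           # in the hysteresis band: hold current speed

if __name__ == "__main__":
    duty = LOW_DUTY
    for temp in [44, 58, 66, 72, 60, 54, 45]:   # a made-up temperature trace
        duty = next_duty(temp, duty)
        print(f"{temp:5.1f} C -> fan duty {duty:.0%}")
```

The hysteresis band (slow back down only below 55 rather than right at 65) is what keeps the fans from constantly ramping up and down around the trip point.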
On a broader note, this kind of story is most useful for illustrating just how difficult it can be to get heat management under control with the current generation of processors, and why it's so important for vendors and users to pay attention to this stuff. Without this effort, the system would have been unusable for its intended purpose because of the instability caused by the out-of-control thermal issues, while the ham-handed fix of sticking more powerful fans into the box would only have produced a louder system that had to be isolated farther from the operators. This is all very bad. The good news is that chip designers have begun to realize that this is a dead-end path, and that the real fix is to lower the power and temperature demands of the CPUs to begin with. There's also some interesting technology in the gaming and high-end PC sector that will probably find its way into servers pretty soon. I'll keep an eye on this area and let you know what I find.