I've got a Debian GNU/Linux lenny installation (2.6.26-2-vserver-amd64 kernel) running on a Dell Poweredge 2950 with BIOS 2.0.1 (2007-10-27).
It has two Intel(R) Xeon(R) CPU 5160 @ 3.00GHz processors (according to /proc/cpuinfo), 8 1GiB 667MHz DDR2 ECC modules (part number HYMP512F72CP8N3-Y5, according to dmidecode), and an Intel Corporation 5000X Chipset Memory Controller Hub (rev 12).
The machine has been running stably for many months.
On the morning of March 31st, i started getting the following messages from the kernel, on the order of one pair of lines every 3 seconds:
Mar 31 07:04:38 zamboni kernel: [16883514.141275] EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x800
Mar 31 07:04:38 zamboni kernel: [16883514.141278] EDAC i5000: NON-Retry Errors, bits= 0x800
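To put a number on "one pair of lines every 3 seconds", here is a rough sketch that pulls these reports out of /var/log/kern.log and averages the gap between them using the bracketed kernel timestamps. The function names (`edac_reports`, `mean_interval`, `set_bits`, `summarize`) are my own, and the regex assumes the exact message wording shown above; a different kernel version may word it differently.

```python
import re

# Matches only the "1st NON-FATAL Err Reg" line of each pair; the bracketed
# value is the kernel's seconds-since-boot timestamp.
EDAC_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] EDAC i5000 MC(?P<mc>\d+): "
    r"NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= (?P<reg>0x[0-9a-fA-F]+)"
)


def edac_reports(lines):
    """Return (uptime_seconds, controller_index, register_value) per report."""
    reports = []
    for line in lines:
        m = EDAC_RE.search(line)
        if m:
            reports.append(
                (float(m.group("uptime")), int(m.group("mc")), int(m.group("reg"), 16))
            )
    return reports


def mean_interval(reports):
    """Average seconds between consecutive reports, or None if fewer than two.

    Assumes all reports come from the same boot, since the timestamp
    restarts from zero after a reboot.
    """
    if len(reports) < 2:
        return None
    times = [r[0] for r in reports]
    return (times[-1] - times[0]) / (len(times) - 1)


def set_bits(value):
    """List the bit positions that are set in a register value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def summarize(path="/var/log/kern.log"):
    """Return (report count, mean interval, distinct register values)."""
    with open(path, errors="replace") as fh:
        reports = edac_reports(fh)
    regs = sorted({hex(r[2]) for r in reports})
    return len(reports), mean_interval(reports), regs
```

Calling `summarize()` on the box shows whether the register value is always 0x800 (bit 11 only, per `set_bits(0x800)`) and whether the rate is steady. What that bit means still has to be looked up in the i5000 driver source or the chipset datasheet; this sketch does not try to decode it.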
A bit of digging turned up a Red Hat bug report that seems to suggest that these warnings are just noise and can be ignored. Another link thinks it's a conflict with IPMI; this machine does have an IPMI subsystem, though i am not making use of it.
However, i also notice from munin logs that at the same time the error messages started, the machine exhibited a marked change in CPU activity (including in-kernel activity) and local timer interrupts:
I also note that more rescheduling interrupts started happening, and fewer megasas interrupts at about the same time. I'm not sure what this means.
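To cross-check the munin graphs against live numbers, something like the following could sample /proc/interrupts twice and report per-second rates for each interrupt line (local timer, rescheduling, megasas, and so on). This is only a sketch: the helper names `parse_interrupts`, `rates` and `watch` are made up for illustration, and the parser assumes the usual layout of a CPU header row followed by `label: counts... description` rows.

```python
import time


def parse_interrupts(text):
    """Map (label, description) to the interrupt count summed over all CPUs."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Some rows (ERR, MIS) carry fewer counters than there are CPUs.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        desc = " ".join(fields[len(counts):])
        totals[(label.strip(), desc)] = sum(counts)
    return totals


def rates(before, after, seconds):
    """Interrupts per second for each line between two snapshots."""
    return {key: (after[key] - before.get(key, 0)) / seconds for key in after}


def watch(interval=5.0, path="/proc/interrupts", top=10):
    """Take two snapshots `interval` seconds apart and print the busiest lines."""
    with open(path) as fh:
        before = parse_interrupts(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_interrupts(fh.read())
    per_second = rates(before, after, interval)
    for (label, desc), rate in sorted(per_second.items(), key=lambda kv: -kv[1])[:top]:
        print(f"{label:>5} {rate:10.1f}/s  {desc}")
```

Running `watch()` a few times would show which interrupt sources are busiest right now, which can then be compared with what munin recorded before March 31st.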
A review of other logs and graphs on the system turns up no other evidence of interaction that might cause this kind of elevated activity.
One thought was that the elevated activity was just due to writing out a bunch more logs. So i tried removing the i5000_edac module just to keep /var/log/kern.log cleaner. Leaving that turned off doesn't lower the CPU utilization or change the interrupts, though.
Any suggestions on what might be going on, or further diagnostics i should run? The machine is in production, and I'd really rather not take down the machine for an extended period of time to do a lengthy memory test. But i also don't want to see this kind of extra CPU usage (more than double the machine's baseline).
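One check that should not need downtime is reading the error counters that the EDAC subsystem exposes through sysfs. The sketch below assumes that the i5000_edac module is loaded again and that the counters live under /sys/devices/system/edac/mc/mcN/ as ce_count (corrected errors) and ue_count (uncorrected errors); i haven't verified that layout on this particular kernel, and `read_edac_counts` is just a name picked for illustration.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs.

    A value of None means the counter file was missing or unreadable,
    for example because the EDAC driver is not loaded.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as fh:
                    entry[name] = int(fh.read().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts
```

If `read_edac_counts()` returns an empty dict, no memory controller is registered with EDAC, which is what i'd expect while the module is unloaded. Counters that stay at zero while the log messages keep coming would at least be consistent with the "just noise" theory, though it wouldn't prove it.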