I’ve got a Supermicro-based server that I’m in the process of setting up for Xen hosting. After 3 or 4 days of uptime under light load (it’s not in production yet), sitting in its rack in a datacentre, weird things start to happen.
I get random kernel panics and oopses; it locks up or spontaneously reboots. When I power cycle it, the serial console becomes garbled and slow once it gets past the BIOS screen, and it rarely manages to boot a kernel at that point. If I turn it off for several hours and try again, I can usually get it booted. The BIOS event log contains multibit ECC errors.
Some of this sounds like overheating (nothing will boot, yet it “gets better” after a few hours without power), but the ECC errors suggest bad RAM. The server has 4x1GB of DDR2-533.
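If I can keep it up long enough under Linux, the kernel’s EDAC subsystem might narrow the ECC errors down to a specific DIMM pair. Here’s a minimal sketch of what I have in mind; it assumes a kernel with an EDAC driver that actually supports this chipset (which driver, if any, is an open question), and the sysfs paths are just the standard EDAC ones:

#!/usr/bin/env python3
# Sketch: dump the ECC error counters that Linux's EDAC subsystem exposes
# under /sys/devices/system/edac/. Assumes an EDAC driver for this board's
# chipset is loaded; per-csrow counts can point at a specific DIMM pair.
import glob
import os

EDAC_MC = "/sys/devices/system/edac/mc"

for mc in sorted(glob.glob(os.path.join(EDAC_MC, "mc*"))):
    # Check the controller-wide counters, then each chip-select row.
    for path in [mc] + sorted(glob.glob(os.path.join(mc, "csrow*"))):
        for counter in ("ce_count", "ue_count"):  # correctable / uncorrectable
            f = os.path.join(path, counter)
            if os.path.exists(f):
                with open(f) as fh:
                    print(os.path.relpath(path, EDAC_MC),
                          counter, fh.read().strip())

A steadily climbing ue_count on one csrow would be a much stronger pointer at a particular pair of sticks than the BIOS event log gives me.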
Earlier this evening I booted the server into memtest86+ and left it running with default settings for over 5 hours. In that time it completed 4 passes without error. I know it can take a long time for memtest86+ to find errors, so I may let it run for a few days.
I did get curious though and poked around in memtest86+’s configuration. When I change the memory map from “BIOS-std” to “BIOS-all” I get this:
Memtest86+ v1.70           | Pass  1%
Pentium D (65nm) 3192 MHz  | Test 50% ###################
L1 Cache:   16K 22322MB/s  | Test #0  [Address test, walking ones]
L2 Cache: 2048K 17348MB/s  | Testing: 120K - 4096M 4112M
Memory  : 4112M  3117MB/s  | Pattern: 00000000
Chipset :                  |
------------------------------------------------------------------------------
 WallTime  Cached  RsvdMem  MemMap    Cache  ECC  Test  Pass  Errors  ECC Errs
 --------- ------  -------  --------  ----- ---   ----  ----  ------  --------
  0:00:01  4112M   0K       e820-All  on    off   Std   0     22      0
------------------------------------------------------------------------------
Tst  Pass  Failing Address           Good      Bad       Err-Bits  Count  Chan
---  ----  -----------------------   --------  --------  --------  -----  ----
  0     0  000e7f00000 -  3711.0MB   ffffffff  00000000  e7f04000      1
  0     0  000e7f00000 -  3711.0MB   ffffffff  00000000  e7f08000      1
  0     0  000e7f00000 -  3711.0MB   ffffffff  00000000  e7f10000      1
  0     0  000e7f00000 -  3711.0MB   ffffffff  00000000  e7f20000      1
  0     0  000e7f00000 -  3711.0MB   ffffffff  00000000  e7f40000      1
  0     0  000e7f00000 -  3711.0MB   ffffffff  00000000  e7f80000      1
  0     0  000ff000000 -  4080.0MB   ffffffff  00000000  ff000004      1
  0     0  000ff000000 -  4080.0MB   ffffffff  00000000  ff000008      1
  0     0  000ff000000 -  4080.0MB   ffffffff  00000000  ff000010      1
  0     0  000ff000000 -  4080.0MB   ffffffff  00000000  ff000020      1

(ESC)Reboot (c)configuration (SP)scroll_lock (CR)scroll_unlock          LOCKED
Instant errors, and all in the top ~400M of RAM.
But is this just a misconfiguration of memtest86+ on my part? Is it expected that this should fail? Should I be taking out the last stick of RAM to see if life gets better? Or is there some PAE issue where memtest86+ can’t address RAM that high?
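If the box can be coaxed into booting Linux at all, one way to sanity-check the “is BIOS-all expected to fail?” theory would be to compare those failing addresses against the kernel’s own memory map. A rough sketch (the addresses are copied from the table above; /proc/iomem is the standard Linux view, nothing here comes from memtest86+ itself):

#!/usr/bin/env python3
# Sketch: check whether the addresses memtest86+ complained about fall
# inside "System RAM" according to the kernel (/proc/iomem), or inside
# reserved/device regions that were never RAM to begin with.
FAILING = [0xe7f00000, 0xff000000]  # from the memtest86+ error list above

regions = []
with open("/proc/iomem") as fh:  # may need root on some kernels
    for line in fh:
        span, _, name = line.partition(" : ")
        start, _, end = span.strip().partition("-")
        regions.append((int(start, 16), int(end, 16), name.strip()))

for addr in FAILING:
    hits = [n for s, e, n in regions if s <= addr <= e]
    print(hex(addr), "->", ", ".join(hits) or "not in any region")

If those addresses turn out to sit in reserved or PCI regions rather than System RAM, then the BIOS-all errors might just be memtest86+ reading hardware that isn’t memory, rather than evidence of a bad stick.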
My situation is made more difficult by the fact that this server is in a datacentre in San Francisco and I am in the UK; my only means of interacting with it are the serial console and a remote PDU for power cycling if necessary. Graham’s going there for me in a couple of days and Paul may go in a few weeks, so I’d like to be able to suggest some things they could try when they get there.
Any ideas?