Dear Lazyweb, am I using memtest86+ correctly?

I’ve got a Supermicro-based server that I’m in the process of setting up for Xen hosting. After 3 or 4 days of uptime under light load (it’s not in production yet), sitting in its rack in a datacentre, weird things start to happen.

I get random kernel panics and oopses; it locks up or spontaneously reboots. When I power cycle it, the serial console gets a bit garbled and slow once it’s past the BIOS screen, and it rarely manages to boot a kernel at that point. If I turn it off for several hours and try again, I can usually get it booted. The BIOS event log contains multi-bit ECC errors.

Some of this sounds like overheating (nothing will boot, yet it “gets better” after a few hours without power), but the ECC errors suggest bad RAM. The server has 4x1GB of DDR2-533.
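One thing I may try the next time the box stays up long enough: if there’s an EDAC driver for this chipset, the ECC error counts should be readable from Linux over the serial console rather than only from the BIOS event log. A rough sketch of what I have in mind (assuming an EDAC module for this board exists, is loaded, and exposes the usual counters under /sys/devices/system/edac/mc/):

    #!/usr/bin/env python
    # Sketch: dump EDAC corrected/uncorrected ECC counters from sysfs.
    # Assumes an EDAC driver matching the chipset is loaded and exposes
    # the usual /sys/devices/system/edac/mc/mcN/csrowN/ hierarchy.
    import glob
    import os

    EDAC_ROOT = "/sys/devices/system/edac/mc"

    def read_count(path):
        with open(path) as f:
            return int(f.read().strip())

    counters = sorted(glob.glob(os.path.join(EDAC_ROOT, "mc*", "csrow*", "*e_count")))
    if not counters:
        print("no EDAC counters found under %s (driver not loaded?)" % EDAC_ROOT)
    for counter in counters:
        # csrowN/ce_count = corrected (single-bit) errors,
        # csrowN/ue_count = uncorrected (multi-bit) errors for that row
        print("%-60s %d" % (counter, read_count(counter)))

If that works at all, a climbing ce_count (or any ue_count) should at least say whether errors are accumulating, and on which row, without another reboot into memtest86+.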

Earlier this evening I booted the server into memtest86+ and left it running with the default settings for over 5 hours. In that time it completed 4 test passes without error. I know it can take a long time for memtest86+ to find errors, so I may let it run for a few days.

I did get curious, though, and poked around in memtest86+’s configuration. When I change the memory map from “BIOS-std” to “BIOS-all”, I get this:

      Memtest86+ v1.70      | Pass  1%
Pentium D (65nm) 3192 MHz   | Test 50% ###################
L1 Cache:   16K 22322MB/s   | Test #0  [Address test, walking ones]
L2 Cache: 2048K 17348MB/s   | Testing:  120K - 4096M 4112M
Memory  : 4112M  3117MB/s   | Pattern:   00000000
Chipset :


 WallTime   Cached  RsvdMem   MemMap   Cache  ECC  Test  Pass  Errors ECC Errs
 ---------  ------  -------  --------  -----  ---  ----  ----  ------ --------
   0:00:01   4112M       0K  e820-All    on   off   Std     0      22        0
 -----------------------------------------------------------------------------
Tst  Pass   Failing Address          Good       Bad     Err-Bits  Count Chan
---  ----  -----------------------  --------  --------  --------  ----- ----
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f04000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f08000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f10000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f20000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f40000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f80000      1
  0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000004      1
  0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000008      1
  0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000010      1
  0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000020      1
(ESC)Reboot  (c)configuration  (SP)scroll_lock  (CR)scroll_unlock         LOCKED

Instant errors, and all in the top ~400M of RAM.

But is this just a misconfiguration of memtest86+ on my part? Is it expected that this should fail? Should I be taking out the last stick of RAM and seeing if life gets better? Is there some PAE issue here where memtest86+ can’t properly address RAM above that point?
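While I wait for answers, one thing I can check from here is whether those failing addresses actually fall inside regions the BIOS marks as reserved in the e820 map; my guess is that “BIOS-all” tests those whereas “BIOS-std” skips them. Something like this quick parse of the BIOS-e820 lines from dmesg should tell me (a sketch; it assumes the boot-time e820 lines are still in the ring buffer or a saved log, and the script name is just for illustration):

    #!/usr/bin/env python
    # Sketch: report which e820 region a physical address falls in, using
    # the "BIOS-e820:" lines the kernel prints at boot.  Usage, e.g.:
    #   dmesg | python e820_lookup.py 0xe7f00000 0xff000000
    import re
    import sys

    E820_RE = re.compile(r"BIOS-e820:\s*([0-9a-fA-F]+)\s*-\s*([0-9a-fA-F]+)\s*\((.+)\)")

    regions = []
    for line in sys.stdin:
        m = E820_RE.search(line)
        if m:
            start, end, kind = int(m.group(1), 16), int(m.group(2), 16), m.group(3)
            regions.append((start, end, kind))

    for arg in sys.argv[1:]:
        addr = int(arg, 16)
        hits = [(s, e, k) for (s, e, k) in regions if s <= addr < e]
        if hits:
            for start, end, kind in hits:
                print("%#x falls in %#x-%#x (%s)" % (addr, start, end, kind))
        else:
            print("%#x is not in any e820 region (a hole?)" % addr)

If 0xe7f00000 and 0xff000000 turn out to be reserved (or in a hole entirely), that would at least support the misconfiguration theory.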

My situation is made more difficult by the fact that this server is in a datacentre in San Francisco and I am in the UK; my only means of interacting with it are the serial console and a remote PDU to power cycle it if necessary. Graham’s going there for me in a couple of days and Paul may go in a few weeks, so I’d like to be able to suggest some things they could try while they’re there.

Any ideas?