Dear Lazyweb,
You will probably want to skip this if you have no knowledge of server motherboards and RAM and/or no interest in helping me.
I have a new server based on a Tyan S3970 motherboard with four DIMMs. It was assembled by the supplier and subjected to a burn-in test. It seems, however, that they did not look at the BIOS event log before shipping, because when the server arrived the log was full of messages about single-bit memory errors, with date stamps stretching back through the previous week. It is plausible that the burn-in still passed, since ECC RAM corrects single-bit errors transparently.
Anyway, I assumed it would be a single bad DIMM, so I turned off ECC in the BIOS and broke out memtest86+.
What I found was that after approximately 90 minutes, memtest86+ reported errors across the entire memory range (implicating all DIMMs). Here’s an example:
WallTime   Cached  RsvdMem   MemMap   Cache  ECC  Test  Pass  Errors ECC Errs
---------  ------  -------  --------  -----  ---  ----  ----  ------ --------
  1:59:27   8192M     160K  e820-Std    on   off   Std     0      80        0
-----------------------------------------------------------------------------
Tst  Pass   Failing Address          Good       Bad     Err-Bits  Count Chan
---  ----  -----------------------  --------  --------  --------  ----- ----
  7     0  000974b216c -  2420.1MB  9c9a2a71  9c9a0a71  00002000      1
  7     0  00126a320cc -  4714.1MB  819293fd  8192b3fd  00002000      1
  7     0  00114c0012c -  4428.0MB  8e8b2ec2  8e8b0ec2  00002000      1
  7     0  001115920ec -  4373.1MB  652557a0  652577a0  00002000      1
  7     0  00165b030cc -  5723.1MB  86cb57f6  86cb77f6  00002000      1
  7     0  0016069710c -  5638.4MB  b59513f4  b59533f4  00002000      1
  7     0  0014969e0ec -  5270.8MB  15be53f9  15be73f9  00002000      1
  7     0  001094370cc -  4244.4MB  2b779fdd  2b77bfdd  00002000      1
  7     0  00139f8d0ec -  5023.8MB  1c54d9dd  1c54f9dd  00002000      1
  7     0  001568ad0cc -  5480.8MB  318657e8  318677e8  00002000      1
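The Err-Bits column is the XOR of the Good and Bad words, so it shows which bits flipped. As a rough cross-check, here is a minimal Python sketch that recomputes it for the ten rows above. The name `flipped_bits` is made up for illustration and is not part of memtest86+.

```python
# (address, good, bad) values copied from the memtest86+ output above.
FAILURES = [
    ("000974b216c", "9c9a2a71", "9c9a0a71"),
    ("00126a320cc", "819293fd", "8192b3fd"),
    ("00114c0012c", "8e8b2ec2", "8e8b0ec2"),
    ("001115920ec", "652557a0", "652577a0"),
    ("00165b030cc", "86cb57f6", "86cb77f6"),
    ("0016069710c", "b59513f4", "b59533f4"),
    ("0014969e0ec", "15be53f9", "15be73f9"),
    ("001094370cc", "2b779fdd", "2b77bfdd"),
    ("00139f8d0ec", "1c54d9dd", "1c54f9dd"),
    ("001568ad0cc", "318657e8", "318677e8"),
]


def flipped_bits(good_hex, bad_hex):
    """Return the bit positions (0 = least significant) where two 32-bit words differ."""
    diff = int(good_hex, 16) ^ int(bad_hex, 16)
    return [bit for bit in range(32) if (diff >> bit) & 1]


# Print the flipped bit positions for every failing address.
for address, good, bad in FAILURES:
    print(address, flipped_bits(good, bad))
```

Every row comes out as bit 13 alone, which matches the 00002000 that memtest86+ reports: the same single bit is wrong in each failure, regardless of address.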
At this point I of course began to suspect the motherboard, but in the interest of thorough testing I decided to try just one pair of DIMMs. These tested for over 3 hours without a problem. I thought perhaps that this pair was good whereas the other pair might be bad, so I swapped them over. The other pair then tested for over 8 hours without error. So it’s definitely not the DIMMs.
I checked that the DIMMs are all identical (they are) and studied the motherboard manual closely:
http://www.tyan.com/manuals/m_s3970_110.pdf
The memory section on page 28 states:
For optimal dual-channel DDR operation, always install memory in pairs beginning with P1_DIMM7 and P1_DIMM8. Refer to the following table for supported DDRII populations.
The table then shows that you should install the DIMMs in pairs, starting with slots 7 and 8. So, 7→8, 5→8, 3→8 and 1→8 are the only supported configurations.
The server had been delivered with slots 1→4 populated. I have just changed that to 5→8 and it’s now over 4 hours into a test without an error, which is the most I’ve achieved with all 4 DIMMs installed. If we assume that no more errors are encountered, would you be satisfied with this conclusion?
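To make the population rule concrete, here is a small Python sketch of the check as I read it from the manual's table. It assumes the slots are simply numbered 1 to 8 and that the four populations listed above are the only supported ones. `SUPPORTED_POPULATIONS` and `is_supported` are hypothetical names, not anything from Tyan.

```python
# Supported populations per the manual's table as summarised above:
# slots 7-8, 5-8, 3-8 and 1-8 (always filled in pairs, ending at slot 8).
SUPPORTED_POPULATIONS = [frozenset(range(first, 9)) for first in (7, 5, 3, 1)]


def is_supported(populated_slots):
    """Return True if the given set of populated slot numbers is a supported layout."""
    return frozenset(populated_slots) in SUPPORTED_POPULATIONS


print(is_supported({1, 2, 3, 4}))  # layout the server was delivered with
print(is_supported({5, 6, 7, 8}))  # layout after moving the DIMMs
```

Under those assumptions the delivered layout is rejected and the rearranged one is accepted.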
I am still a little bit worried that the motherboard is faulty in some way, because the same single bit flipping every time (Err-Bits is always 00002000 in the output above) seems like a really weird outcome for running in an unsupported configuration. I would have thought that it would either not detect the RAM at all, or work fine. Is this behaviour something you would expect?
I’ve opened a support request with Tyan to ask if this is normal behaviour, but I’ve no idea when or if they will respond.