Dear Lazyweb,
You will probably want to skip this if you have no knowledge of server motherboards and RAM and/or no interest in helping me.
I have a new server based on a Tyan S3970 motherboard with four DIMMs. It was assembled by the supplier and subjected to a burn-in test. It seems, however, that they did not look at the BIOS event log before shipping it, because when I got it the log was full of messages about single-bit memory errors, with date stamps stretching back through the previous week. This is plausible, since ECC RAM corrects single-bit errors, so the burn-in test could have passed without anyone noticing.
I assumed it would be a single bad DIMM, so I turned off ECC in the BIOS and broke out memtest86+.
What I found was that after approximately 90 minutes, memtest86+ reported errors across the entire memory range (implicating all DIMMs). Here’s an example:
WallTime   Cached  RsvdMem   MemMap   Cache  ECC  Test  Pass  Errors  ECC Errs
---------  ------  -------  --------  -----  ---  ----  ----  ------  --------
  1:59:27   8192M     160K  e820-Std     on  off   Std     0      80         0
------------------------------------------------------------------------------
Tst  Pass  Failing Address          Good      Bad       Err-Bits  Count  Chan
---  ----  -----------------------  --------  --------  --------  -----  ----
  7     0  000974b216c - 2420.1MB   9c9a2a71  9c9a0a71  00002000      1
  7     0  00126a320cc - 4714.1MB   819293fd  8192b3fd  00002000      1
  7     0  00114c0012c - 4428.0MB   8e8b2ec2  8e8b0ec2  00002000      1
  7     0  001115920ec - 4373.1MB   652557a0  652577a0  00002000      1
  7     0  00165b030cc - 5723.1MB   86cb57f6  86cb77f6  00002000      1
  7     0  0016069710c - 5638.4MB   b59513f4  b59533f4  00002000      1
  7     0  0014969e0ec - 5270.8MB   15be53f9  15be73f9  00002000      1
  7     0  001094370cc - 4244.4MB   2b779fdd  2b77bfdd  00002000      1
  7     0  00139f8d0ec - 5023.8MB   1c54d9dd  1c54f9dd  00002000      1
  7     0  001568ad0cc - 5480.8MB   318657e8  318677e8  00002000      1
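As a quick sanity check on the output above, the Err-Bits column is just the XOR of the Good and Bad words, so it can be recomputed to see which bit is being flipped. The following is only a minimal sketch: the `READINGS` list is copied by hand from the ten rows shown, and the helper names `flipped_bits` and `distinct_patterns` are made up for illustration.

```python
# (Good, Bad) word pairs copied by hand from the memtest86+ rows above.
READINGS = [
    (0x9c9a2a71, 0x9c9a0a71),
    (0x819293fd, 0x8192b3fd),
    (0x8e8b2ec2, 0x8e8b0ec2),
    (0x652557a0, 0x652577a0),
    (0x86cb57f6, 0x86cb77f6),
    (0xb59513f4, 0xb59533f4),
    (0x15be53f9, 0x15be73f9),
    (0x2b779fdd, 0x2b77bfdd),
    (0x1c54d9dd, 0x1c54f9dd),
    (0x318657e8, 0x318677e8),
]


def flipped_bits(good, bad):
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = good ^ bad
    return [bit for bit in range(32) if diff & (1 << bit)]


def distinct_patterns(readings):
    """Return the set of distinct flipped-bit patterns seen across all readings."""
    return {tuple(flipped_bits(good, bad)) for good, bad in readings}


if __name__ == "__main__":
    for good, bad in READINGS:
        print(f"{good:08x} -> {bad:08x}  err-bits {good ^ bad:08x}  bits {flipped_bits(good, bad)}")
    print("distinct patterns:", distinct_patterns(READINGS))
```

For these ten rows every pair differs in bit 13 only (0x00002000), and that bit is sometimes cleared and sometimes set, so it does not look like a bit that is simply stuck at one value.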
At this point I of course began to suspect the motherboard, but in the interest of thorough testing I decided to try just one pair of DIMMs. These tested for over 3 hours without a problem. I thought perhaps that this pair was good whereas the other pair might be bad, so I swapped them over. The other pair then tested for over 8 hours without error. So it’s definitely not the DIMMs.
I checked that the DIMMs are all identical (they are) and studied the motherboard manual closely:
http://www.tyan.com/manuals/m_s3970_110.pdf
The memory section on page 28 states:
For optimal dual-channel DDR operation, always install memory in pairs beginning with P1_DIMM7 and P1_DIMM8. Refer to the following table for supported DDRII populations.
The table then shows that you should install the DIMMs in pairs, starting with slots 7 and 8. So, 7→8, 5→8, 3→8 and 1→8 are the only supported configurations.
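The population rule as I read it can be written down in a few lines. This is only a sketch of my reading of the table, not anything taken from Tyan: the names `SUPPORTED` and `is_supported` are hypothetical, and it assumes the slots are simply numbered 1 to 8 (P1_DIMM1 to P1_DIMM8).

```python
# Supported populations as I read the manual's table: DIMMs are added in
# pairs, always filling downwards from slot 8 (7-8, 5-8, 3-8, 1-8).
SUPPORTED = [frozenset(range(start, 9)) for start in (7, 5, 3, 1)]


def is_supported(populated_slots):
    """Return True if the given set of slot numbers matches a supported layout."""
    return frozenset(populated_slots) in SUPPORTED


if __name__ == "__main__":
    for slots in ({7, 8}, {5, 6, 7, 8}, {1, 2, 3, 4}):
        print(sorted(slots), "supported" if is_supported(slots) else "NOT supported")
```

By this check, four DIMMs in slots 5 to 8 are a supported layout, whereas four DIMMs in slots 1 to 4 are not.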
The server had been delivered with slots 1→4 populated. I have just changed that to 5→8 and it’s now over 4 hours into a test without an error, which is the most I’ve achieved with all 4 DIMMs installed. If we assume that no more errors are encountered, would you be satisfied with this conclusion?
I am still a little bit worried that the motherboard is faulty in some way, because giving a consistent single-bit memory error seems like a really weird outcome for running in an unsupported configuration. I would have thought that it would either not detect the RAM, or it would be OK. Is this behaviour something you would expect?
I’ve opened a support request with Tyan to ask if this is normal behaviour, but I’ve no idea when or if they will respond.
When I’ve seen misconfigured memory, the BIOS alerted me to that effect in the POST (and stopped at that point). It was very clear.
Thanks, Adrian.
Perhaps I should ask the supplier to lend me another four 2GB DIMMs so I can test the server fully populated. If there are still no errors then I’ll know it was just down to misconfiguration. If it has errors then I’ll know it’s the motherboard or CPU. Either way I would then return the extra DIMMs.
Does that seem too cheeky? I can’t help thinking they screwed up by not checking the BIOS event log.
Cheers,
Andy