February 3, 2008 – The ongoing struggle

Dear Lazyweb,

You will probably want to skip this if you have no knowledge of server motherboards and RAM and/or no interest in helping me.

I have a new server based on a Tyan S3970 motherboard with four DIMMs. It was assembled by the supplier and subjected to a burn-in test. It seems, however, that they did not look at the BIOS event log before shipping it, because when it arrived the log was full of messages about single-bit memory errors, with date stamps stretching back through the previous week. It is plausible that the burn-in still passed, since ECC RAM corrects single-bit errors transparently.

Anyway, I assumed it would be a single bad DIMM, so I turned off ECC in the BIOS and broke out memtest86+.

What I found was that after approximately 90 minutes, memtest86+ reported errors across the entire memory range (implicating all DIMMs). Here’s an example:

 WallTime   Cached  RsvdMem   MemMap   Cache  ECC  Test  Pass  Errors ECC Errs
 ---------  ------  -------  --------  -----  ---  ----  ----  ------ --------
   1:59:27   8192M     160K  e820-Std    on   off   Std     0      80        0
 -----------------------------------------------------------------------------
Tst  Pass   Failing Address          Good       Bad     Err-Bits  Count Chan
---  ----  -----------------------  --------  --------  --------  ----- ----
  7     0  000974b216c -  2420.1MB  9c9a2a71  9c9a0a71  00002000      1
  7     0  00126a320cc -  4714.1MB  819293fd  8192b3fd  00002000      1
  7     0  00114c0012c -  4428.0MB  8e8b2ec2  8e8b0ec2  00002000      1
  7     0  001115920ec -  4373.1MB  652557a0  652577a0  00002000      1
  7     0  00165b030cc -  5723.1MB  86cb57f6  86cb77f6  00002000      1
  7     0  0016069710c -  5638.4MB  b59513f4  b59533f4  00002000      1
  7     0  0014969e0ec -  5270.8MB  15be53f9  15be73f9  00002000      1
  7     0  001094370cc -  4244.4MB  2b779fdd  2b77bfdd  00002000      1
  7     0  00139f8d0ec -  5023.8MB  1c54d9dd  1c54f9dd  00002000      1
  7     0  001568ad0cc -  5480.8MB  318657e8  318677e8  00002000      1
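
The Good and Bad columns above can be checked mechanically rather than by eye. The following is a minimal Python sketch, with the ten value pairs copied from the output above; the function names are made up for illustration. It XORs each pair to find which bits differ and in which direction they flipped.

```python
# (good, bad) pairs copied from the Good and Bad columns of the memtest86+ output.
ROWS = [
    ("9c9a2a71", "9c9a0a71"),
    ("819293fd", "8192b3fd"),
    ("8e8b2ec2", "8e8b0ec2"),
    ("652557a0", "652577a0"),
    ("86cb57f6", "86cb77f6"),
    ("b59513f4", "b59533f4"),
    ("15be53f9", "15be73f9"),
    ("2b779fdd", "2b77bfdd"),
    ("1c54d9dd", "1c54f9dd"),
    ("318657e8", "318677e8"),
]


def flipped_bits(good_hex, bad_hex):
    """Return the bit positions (0 = least significant) that differ."""
    diff = int(good_hex, 16) ^ int(bad_hex, 16)
    return [bit for bit in range(32) if (diff >> bit) & 1]


def summarize(rows):
    """Collect every flipped bit position and count the flip directions."""
    positions = set()
    set_to_clear = 0  # expected 1, read back 0
    clear_to_set = 0  # expected 0, read back 1
    for good_hex, bad_hex in rows:
        good = int(good_hex, 16)
        for bit in flipped_bits(good_hex, bad_hex):
            positions.add(bit)
            if (good >> bit) & 1:
                set_to_clear += 1
            else:
                clear_to_set += 1
    return sorted(positions), set_to_clear, clear_to_set


positions, set_to_clear, clear_to_set = summarize(ROWS)
print(positions, set_to_clear, clear_to_set)
```

For these ten rows the only position reported is bit 13, which matches the Err-Bits value of 00002000, and the flips go both ways: two rows read a 0 where a 1 was expected, and eight rows read a 1 where a 0 was expected.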

At this point I of course began to suspect the motherboard, but in the interest of thorough testing I decided to try just one pair of DIMMs. These tested for over 3 hours without a problem. I thought perhaps that this pair was good whereas the other pair might be bad, so I swapped them over. The other pair then tested for over 8 hours without error. Both runs went well past the roughly 90 minutes it took for errors to appear with all four DIMMs installed, so the DIMMs themselves do not appear to be at fault.

I checked that the DIMMs are all identical (they are) and studied the motherboard manual closely:

http://www.tyan.com/manuals/m_s3970_110.pdf

The memory section on page 28 states:

For optimal dual-channel DDR operation, always install memory in pairs beginning with P1_DIMM7 and P1_DIMM8. Refer to the following table for supported DDRII populations.

The table then shows that you should install the DIMMs in pairs, filling downward from slots 7 and 8. So the only supported configurations are slots 7→8, 5→8, 3→8 and 1→8, where "5→8" means slots 5 through 8 inclusive.
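
As a quick sanity check, the rule can be written down as a lookup. This is only a sketch in Python: the slot numbers 1 to 8 and the four allowed populations are taken from the reading of the table given above, not from the manual directly, and `is_supported` is a made-up helper name.

```python
# Supported populations as read from the table: pairs filled downward from slot 8.
# These four sets are an assumption based on that reading, not on the manual text.
SUPPORTED = [frozenset(range(first, 9)) for first in (7, 5, 3, 1)]


def is_supported(populated_slots):
    """Return True if the populated slot numbers form a supported layout."""
    return frozenset(populated_slots) in SUPPORTED


print(is_supported({5, 6, 7, 8}))  # slots 5 through 8
print(is_supported({1, 2, 3, 4}))  # slots 1 through 4
```

Under this reading, four DIMMs in slots 5 through 8 are accepted, while four DIMMs in slots 1 through 4 are rejected because slots 7 and 8 are empty.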

The server had been delivered with slots 1→4 populated. I have just changed that to 5→8 and it’s now over 4 hours into a test without an error, which is the most I’ve achieved with all 4 DIMMs installed. If we assume that no more errors are encountered, would you be satisfied with this conclusion?

I am still a little bit worried that the motherboard is faulty in some way, because errors that always hit the same single bit (Err-Bits 00002000 in every failing word) seem like a really weird outcome for running in an unsupported configuration. I would have thought that it would either not detect the RAM, or it would be OK. Is this behaviour something you would expect?

I’ve opened a support request with Tyan to ask if this is normal behaviour, but I’ve no idea when or if they will respond.

Possible hardware issue with Tyan S3970 motherboard
