Yes, we know

Thanks for that, noodles! I’ve been wanting to say it myself for some time but for some reason decided against it for fear of being castigated as a grumpy old man. I shall not hold back from the ranting in future.

On the one hand maybe there are some people out there who read blogs yet somehow haven’t discovered xkcd. On the other hand, the more places I see it the more mainstream it feels, and we can’t have that!

What Would Lazyweb Do?

In relation to my recent hardware issues I now have a bit of a dilemma, although it’s not a bad kind of dilemma to have.

Yesterday afternoon memtest86+ locked up after 10h28m running. It didn’t report any memory errors but clearly there are hardware problems there if even memtest can lock it up. So I got myself into the mindset of returning the whole server and getting a refund, going with a different vendor.

I asked for recommendations of who else to try, and someone suggested a company I will call Vendor C (I don’t know why I am hiding the identities of the vendors involved but it feels like the right thing to do until this is all sorted out).

I sent off a mail to Vendor C, showing them my quote for the system I’ve bought from Vendor S and asking them if they can match it. Here is the spec of the hardware in question:

  • 1U generic Supermicro chassis, 4xSATA hotswap
  • Pentium D 940 3.2GHz dual core
  • 4x1GB ECC DDR2-667
  • 4xWestern Digital 250GB SATA RE 16MB cache 7.2KRPM
  • DVD-ROM
  • IPMI 2.0

Vendor C mailed back within 30 minutes to confirm that they could match Vendor S’s quote, and also suggested that I might like a Core 2 Duo-based system instead, as it would be a more modern CPU that uses less power. Their suggested spec for a system based on a Core 2 Duo Conroe 2.13GHz comes in slightly cheaper than the Pentium D 3.2GHz dual core.

I was a little concerned that a 2.13GHz Core 2 Duo would not compete performance-wise with a Pentium D 3.2GHz dual core (Intel claims that Core 2 Duo is 40% faster than Pentium D at the same clock speed and uses 40% less energy). So I had a look, and a system from Vendor C based on the 2.4GHz Core 2 Duo (which also has 4MiB cache as opposed to 2MiB) is only slightly more expensive than Vendor S’s original quote. I’d be willing to pay a few more dollars for that.

So at this point my mindset was “return server under RMA, get refund by yelling as loud as I need to, then buy 2.4GHz Core 2 Duo-based system from Vendor C”. I composed the email to Vendor S informing them of the bad news that the server’s hardware is flaky and that I am returning it under warranty. I did not indicate whether I wanted a refund, as I wanted to see how they dealt with the matter first.

I was expecting Vendor S to be unhappy and give me a little bit of a hard time in returning the server. I was mentally preparing myself for a battle. Their reply then came as a bit of a shock, but a good one. They have been very reasonable about the whole thing and have offered me three choices:

  • Return the server and get an immediate refund.
  • Return the server and have them try and identify and replace broken parts.
  • Return the server and have the updated model from their range shipped to me instead at no extra charge. The updated model is based on a Core 2 Duo 1.8GHz.

The second one doesn’t sound too appealing to me, but to have the other two options immediately offered to me is really good customer service. So much so that I would feel a little bad just asking for my money back.

I have a slight issue with the third option in that I don’t think a 1.8GHz Core 2 Duo is going to be comparable in performance to a Pentium D 3.2GHz dual core. It may well compete in performance/Watt but if it hasn’t got the overall performance I need then it’s useless to me. Ideally I’d want the 2.4GHz Core 2 Duo here.
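For what it’s worth, here is a back-of-envelope sketch of that comparison in Python. It assumes Intel’s 40% per-clock claim holds across the board (a marketing figure, not a benchmark) and simply scales each Core 2 Duo clock speed by it. The function name and constants are made up for illustration.

```python
# Back-of-envelope only: scale each Core 2 Duo clock by Intel's claimed
# 40% per-clock advantage and compare with the 3.2GHz Pentium D.
INTEL_CLAIMED_SPEEDUP = 1.40  # assumption: Intel's claim, not a measurement
PENTIUM_D_GHZ = 3.2


def pentium_d_equivalent_ghz(core2_ghz, speedup=INTEL_CLAIMED_SPEEDUP):
    """Return the Pentium D clock a Core 2 Duo would roughly match."""
    return core2_ghz * speedup


for ghz in (1.8, 2.13, 2.4):
    equiv = pentium_d_equivalent_ghz(ghz)
    verdict = "ahead of" if equiv > PENTIUM_D_GHZ else "behind"
    print(f"Core 2 Duo {ghz}GHz ~ Pentium D {equiv:.2f}GHz "
          f"({verdict} the 3.2GHz part)")
```

Taken at face value, that puts the 1.8GHz and 2.13GHz parts a little behind the 3.2GHz Pentium D and the 2.4GHz part slightly ahead, though real workloads (and the larger cache) could easily move those numbers.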

It appears that my options now pan out to:

  • Get refund, spend it with Vendor C to get 2.4GHz Core 2 Duo-based system.
  • Have Vendor S send me the 1.8GHz Core 2 Duo-based system and hope it performs well enough.
  • Be a ballbreaker and ask for a 2.4GHz Core 2 Duo-based system from Vendor S.
  • Be a bit cheeky and ask Vendor S how much extra for an upgrade from 1.8GHz to 2.4GHz.

I should probably point out that although Vendor S came highly recommended, and have always been polite and helpful, they already annoyed me by taking my order and saying they could supply within one week but actually taking almost four weeks to deliver what has turned out to be broken hardware, all the while giving me excuse after excuse. Do they deserve a second chance? They are clearly eager to maintain a good relationship if possible.

Lazyweb, what would you do?

Dear Lazyweb, am I using memtest86+ correctly?

I’ve got a Supermicro-based server that I’m in the process of setting up for Xen hosting purposes. After 3 or 4 days of uptime and light load (because it’s not in production yet), sitting in its rack in a datacentre, weird things start to happen.

I get random kernel panics and OOPSes, it locks up or spontaneously reboots. When I power cycle it then the serial console gets a bit garbled and slow once it gets past the BIOS screen, and it rarely manages to boot a kernel then. If I turn it off for several hours and try again then I can usually get it booted. The BIOS event log contains multibit ECC errors.

Some of this sounds like overheating (the fact that nothing will boot yet this “gets better” after a few hours without power), but the ECC errors suggest bad RAM. The server has 4x1GB DDR2-533.

Earlier this evening I booted the server into memtest86+ and left it running with default settings for over 5 hours. In this time it completed 4 test runs without error. I know it can take a long time for memtest86+ to find errors so I may let it run for a few days.

I did get curious though and poked around in memtest86+’s configuration. When I change the memory map from “BIOS-std” to “BIOS-all” I get this:

      Memtest86+ v1.70      | Pass  1%
Pentium D (65nm) 3192 MHz   | Test 50% ###################
L1 Cache:   16K 22322MB/s   | Test #0  [Address test, walking ones]
L2 Cache: 2048K 17348MB/s   | Testing:  120K - 4096M 4112M
Memory  : 4112M  3117MB/s   | Pattern:   00000000
Chipset :


 WallTime   Cached  RsvdMem   MemMap   Cache  ECC  Test  Pass  Errors ECC Errs
 ---------  ------  -------  --------  -----  ---  ----  ----  ------ --------
   0:00:01   4112M       0K  e820-All    on   off   Std     0      22        0
 -----------------------------------------------------------------------------
Tst  Pass   Failing Address          Good       Bad     Err-Bits  Count Chan
---  ----  -----------------------  --------  --------  --------  ----- ----
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f04000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f08000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f10000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f20000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f40000      1
  0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f80000      1
  0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000004      1
  0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000008      1
  0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000010      1
  0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000020      1
(ESC)Reboot  (c)configuration  (SP)scroll_lock  (CR)scroll_unlock         LOCKED

Instant errors, and all in the top ~400M of RAM.
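A quick way to check that observation is to convert the two distinct failing addresses from the screen above into MiB and see how far below the 4GiB mark they sit. This is only a sketch of the arithmetic in Python. The helper name is made up, and it says nothing about why those addresses fail.

```python
MIB = 1024 * 1024
FOUR_GIB = 4 * 1024 * MIB

# The two distinct "Failing Address" values from the memtest86+ screen.
FAILING_ADDRESSES = [0x0E7F00000, 0x0FF000000]


def describe(addr):
    """Return (offset in MiB, distance below 4GiB in MiB) for an address."""
    offset_mib = addr / MIB
    below_4g_mib = (FOUR_GIB - addr) / MIB
    return offset_mib, below_4g_mib


for addr in FAILING_ADDRESSES:
    offset_mib, below_4g_mib = describe(addr)
    print(f"{addr:#011x}: {offset_mib:.1f}MiB in, "
          f"{below_4g_mib:.1f}MiB below the 4GiB mark")
```

The offsets agree with the 3711.0MB and 4080.0MB that memtest86+ prints, and the lowest failure is 385MiB below 4GiB, which is where the “top ~400M” figure comes from.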

But is this just a misconfiguration on my part of memtest86+? Is it expected that this should fail? Should I be taking out the last stick of RAM and seeing if life gets better? Is there some PAE issue here where memtest86+ can’t address higher than that amount of RAM?

My situation is made more difficult by the fact that this server is in a datacentre in San Francisco and I am in the UK; my only means of interaction with it is by serial console and remote PDU to power cycle if necessary. Graham’s going there for me in a couple of days and Paul may go there in a few weeks so I’d like to be able to make some suggestions of things they could try when they get there.

Any ideas?

Links for 2007-01-05

Ubuntu’s Launchpad and releasing source code

Uraeus, correct me if I am wrong, but as far as I was aware Launchpad is not released software, i.e. no one but Canonical uses it. Therefore IMHO there isn’t really a big issue about whether its source code should be released or not, since there is no one else actually running it to take advantage of that. This assertion of course only follows if you believe that the main point of Free software is to empower the people running it, and not simply to share coding techniques.

If Canonical were to package Launchpad as a general purpose issue tracker, support tool etc. and other people were to start using it then yes I believe it would be extremely important for the source to be available, but as it stands I see it as an internal project and piece of infrastructure. Freely distributing its sources may enable faster improvements to Launchpad itself, but that would probably be hard work for someone outside Canonical and people inside presumably have access to its source anyway.

Argh when will this end?

I realise that being on so many LUG and similar lists means I’m setting myself up for this but seeing the same conversation play out 10 times already is really doing my head in!


$ cd ~/Maildir
$ find . -type f | xargs grep -il 'http://petitions.pm.gov.uk/softwarepatents'
./lug-hampshire/cur/1167746700.27551_0.bitfolk.com:2,S
./lug-hampshire/cur/1167748676.30387_0.bitfolk.com:2,S
./debian-uk/cur/1167842646.25530_0.bitfolk.com:2,S
./lug-aberdeen/cur/1167904508.25092_0.bitfolk.com:2,S
./lug-aberdeen/cur/1167878764.2444_0.bitfolk.com:2,S
./lug-aberdeen/cur/1167873078.30123_0.bitfolk.com:2,S
./lug-gl/cur/1167917038.6388_0.bitfolk.com:2,S
./lug-gl/cur/1167919940.10510_0.bitfolk.com:2,S
./lug-gl/cur/1167918956.9559_0.bitfolk.com:2,S
./lug-surrey/cur/1167838262.19494_0.bitfolk.com:2,S
./lug-surrey/cur/1167837614.17607_0.bitfolk.com:2,S
./lug-sb/cur/1167906342.26657_0.bitfolk.com:2,S
./lug-sussex/cur/1167845280.28760_0.bitfolk.com:2,S
./lug-sussex/cur/1167827746.4426_0.bitfolk.com:2,S
./lug-sussex/cur/1167914202.4414_0.bitfolk.com:2,RS
./lug-sussex/cur/1167827668.4349_0.bitfolk.com:2,S
./lug-sussex/cur/1167828364.4997_0.bitfolk.com:2,S
./lug-master/cur/1167906018.26248_0.bitfolk.com:2,S
./cur/1167748049.28558_0.bitfolk.com:2,S
./cur/1167815237.20322_0.bitfolk.com:2,S

Pity the petition reads like it was written by a frothing loon!

PS Yes I have signed it.