Problem solved?

Weird… I’m going to quote from a mail I sent:

In any case, these have been some of the weirdest hardware moments I’ve ever had: it seems that an existing 32MB DIMM was right on the edge of giving up at the moment that I installed a new 128MB DIMM in the firewall and converted it to reiserfs.

It was on the edge in such a manner that it survived 12 consecutive passes of complete memtest86 test-sets (and 40 kernel compiles, each with 3 parallel processes, without a sig11)… This morning, however, the BIOS caught a memory error at bootup (the machine had crashed during the night); a subsequent run of memtest86 caught the error within the first 30 seconds. I’ve established with some DIMM-swapping that the problem is with the DIMM and not the slot it was in.

So, the DIMM was in the process of dying, as it were… it was being sporadically flaky at a time when many variables were being added to the equation, which made the fault all the more difficult to confirm.

Err, I was wrong.

So the RAM is fine, if we can trust memtest86. Why am I still getting oopses?! I sent a mail to the reiserfs mailing list and got the typical reiserfs answer: your hardware is borked, it _can’t_ be our code. Hmmmm… right. I’m sure the reiserfs code is well-tested, but it’s still not a good approach to blame the hardware _immediately_, especially if tests indicate otherwise. What the hell, I’m going to try and reproduce this. At the moment the oopses are so sporadic (sometimes days apart) that finding the problem is going to take weeks. If I can reproduce the oops, however, that’s going to make things much easier.
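One way to go about reproducing it, as a rough sketch only: hammer the filesystem with write-then-verify cycles and watch the kernel log while it runs. Everything here is an assumption made for illustration (the `stress` helper, the file names, the sizes); it is not a reiserfs tool and not what the mailing list suggested. Note that the read-back will usually be served from the page cache, so this mostly exercises the RAM and the filesystem code paths rather than the disk itself.

```python
import hashlib
import os
import tempfile


def stress(directory: str, rounds: int = 100, size: int = 1 << 20) -> int:
    """Write random files, read them back, and count checksum mismatches.

    Hypothetical helper: 'directory' should live on the filesystem under
    suspicion. Returns the number of files whose contents changed.
    """
    mismatches = 0
    for i in range(rounds):
        data = os.urandom(size)
        expected = hashlib.sha256(data).hexdigest()
        path = os.path.join(directory, f"stress-{i}.bin")

        # Write the data and ask the kernel to push it out to disk.
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

        # Read it back and compare against what we meant to write.
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            mismatches += 1

        os.remove(path)
    return mismatches


if __name__ == "__main__":
    # A temporary directory stands in for a path on the suspect partition.
    with tempfile.TemporaryDirectory() as scratch:
        print("mismatches:", stress(scratch, rounds=10))
```

A non-zero count would point at corruption somewhere between RAM and disk; a clean run that still ends in an oops would at least give a way to trigger the oops on demand.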

Unstable firewall

Well children, my firewall had gone weirdly unstable after I’d added 128MB of new PC133 RAM. It’s a Celery 300A (@450 of course) and it was oopsing miserably after years of faithful service. After memtesting, it seemed that suddenly there were errors in one of the _existing_ DIMMs. Hmmm, it turns out that the CAS wait state was set to auto in the BIOS. I’m theorising that the BIOS somehow based its CAS on the new RAM and that the old RAM couldn’t quite accommodate that. So, I set the CAS ws to 3 and 17 hours of memtesting (12 passes) reported 0 errors. The lesson we learn: know your CAS and know your memtest.
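To put rough numbers on the CAS theory, here is a small back-of-the-envelope sketch. It assumes the memory bus runs at 100 MHz (a 300A at 450 is 4.5 × 100 MHz) and that auto might have picked CAS 2 for the newer DIMM; that second part is a guess, not something the BIOS actually reported. The function name `cas_latency_ns` is made up for the illustration.

```python
def cas_latency_ns(cas_cycles: int, bus_mhz: float) -> float:
    """Time the RAM gets to answer a read, in nanoseconds."""
    cycle_ns = 1000.0 / bus_mhz  # one bus clock period in ns
    return cas_cycles * cycle_ns


# Assumption: 100 MHz memory bus (Celeron 300A overclocked to 450).
BUS_MHZ = 100.0

for cas in (2, 3):
    latency = cas_latency_ns(cas, BUS_MHZ)
    print(f"CAS {cas} at {BUS_MHZ:.0f} MHz: {latency:.0f} ns")
```

Under those assumptions, going from CAS 2 to CAS 3 gives the DIMM one extra clock, about 10 ns more, to come up with its data, which is the kind of margin an older stick might need.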

memtest86 is a brilliant tool: