Diagnostics and the Infinite Loop
About a month ago I attempted to set up my primary Gentoo Linux machine with two keyboards, two mice and two video cards. There would be two X sessions running and a separate login screen for each monitor. In essence it would seem like I had two computers even though both of them were running off the same box. With computers becoming as insanely powerful as they have, it seemed only appropriate.
I started by using my regular GeForce 6800 card combined with an old GeForce 4 MX card a friend of mine lent me. Using a modification of the configurations I found in a Linux Gazette article, combined with some configuration information for the new evdev driver (I can’t remember the site but I think it was in the Gentoo forums), I set up a working xorg.conf and custom gdm configuration that gave me two stable login screens.
I had to use the non-3D “nv” driver for the GeForce 4 since the GeForce 4 uses legacy 3D drivers that can’t be installed in Linux alongside the current drivers for the GeForce 6800 without doing some hacking and driver renaming.
So in an attempt to get 3D on both screens, I bought a PCI-E GeForce 6600 under the false assumption that a 16x PCI-E card would fit into a 1x slot with all the extra pins just hanging off the end (I could have sworn I read that somewhere!). Yeah, that didn’t work, so now I had a spare PCI-E card.
I then bought a GeForce FX PCI card. I wanted two graphically accelerated desktops, although I knew the PCI card would be much slower than the PCI-E card. That’s when I started running into stability problems.
I had to revert to the GeForce 4 card using the non-3D driver. Eventually I tried the GeForce FX again in non-3D mode. After it ran stable for a while, I tried once again to enable 3D on both cards. Amazingly it worked and was stable for well over a week. A random crash forced me to reboot, and suddenly the stability problem reappeared (along with a nasty evdev driver bug from the 1.2 release that was reverted the next day by the Gentoo team, but not in time to stop something similar to Command Line Fu from happening to me during my date with Alex).
I struggled with the stability problem, turning off 3D on one device and then the other. Eventually I looked at possible problems with ACPI and the BIOS. I may have had a buggy ACPI implementation or interrupt problems. I attempted a BIOS update on my Abit AL8-V board. When I ran the BIOS update tool, I didn’t use the BAT file that came with it. I just ran the EXE and didn’t realize there were about eight or nine command line arguments that went after it. The BIOS update failed. The machine booted to a “failed checksum” message and would only allow me to boot from a floppy (I did the previous BIOS update from a CD). I have run into this before on an old Tyan board.
I made a late night trip to the evil Wal-Mart, which I had successfully boycotted until I needed a floppy drive at 9pm. I restarted the flash update using the BAT file, which set all the correct arguments. It was looking good, the update was proceeding. I went to my kitchen, came back, and the screen was blank and the system would never boot again. It also turns out that my Abit AL8-V, being less than two years old, was already out of warranty.
I bought an Abit AB9-QuadGT board which wouldn’t boot anything when I first got it. One BIOS update later (and having to repurchase the floppy drive I returned to Wal-Mart), it would randomly crash and segfault in both Windows and Linux. I sent it back and exchanged it for a Gigabyte GA-P35-DS3P. I installed the board and it wouldn’t turn on. I couldn’t believe I had gotten two bad boards in a row. Well, I hadn’t. The board was fine, but the power supply, only a few months old, was dead.
A trip to a local store for an overpriced 450W the next day got me up and running again. The system booted, and after a quick kernel compile, I had networking, sound and everything once again, and now I had two 16x PCI-E slots so I could use that old 6600 I bought earlier.
Then it started locking up again. It made no sense. The system ran stable for over a week before I got into this whole mess, and I had replaced the motherboard and video card! It wasn’t a hardware problem. I started tracking back through my emerge logs to try to find the real cause. My systems automatically pull all their updates from Gentoo portage each morning at 4am.
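For anyone wanting to do the same kind of digging, here is a minimal sketch of how a log like that could be searched. It assumes the log records finished merges as lines of the form `<epoch seconds>:  ::: completed emerge (N of M) category/package-version to /`. The sample lines, package versions and the `completed_merges` helper are made up for illustration and are not taken from my actual log.

```python
import re
from datetime import datetime, timezone

# Assumed line shape for a finished merge:
#   "<epoch>:  ::: completed emerge (1 of 2) category/package-version to /"
COMPLETED = re.compile(r"^(\d+):\s+::: completed emerge \(\d+ of \d+\) (\S+) to ")


def completed_merges(lines, start, end):
    """Return (timestamp, package) pairs merged with start <= timestamp < end."""
    merges = []
    for line in lines:
        match = COMPLETED.match(line)
        if not match:
            continue  # skip "Started emerge", ">>> emerge" and other entries
        stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        if start <= stamp < end:
            merges.append((stamp, match.group(2)))
    return merges


# Hypothetical sample data standing in for a real log file.
SAMPLE_LOG = [
    "1190000000: Started emerge on: Sep 17, 2007 03:33:20",
    "1190000100:  >>> emerge (1 of 2) x11-drivers/xf86-input-evdev-1.2.0 to /",
    "1190000200:  ::: completed emerge (1 of 2) x11-drivers/xf86-input-evdev-1.2.0 to /",
    "1190000900:  ::: completed emerge (2 of 2) media-libs/mesa-7.0.1 to /",
    "1190090000:  ::: completed emerge (1 of 1) x11-base/xorg-server-1.3.0.0 to /",
]

if __name__ == "__main__":
    # Look only at the merges from one suspicious update run.
    window_start = datetime.fromtimestamp(1190000000, tz=timezone.utc)
    window_end = datetime.fromtimestamp(1190001000, tz=timezone.utc)
    for stamp, package in completed_merges(SAMPLE_LOG, window_start, window_end):
        print(stamp.isoformat(), package)
```

Narrowing the window to the update run just before the first lockup gives a short list of suspects instead of weeks of log entries.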
I thought it might have been a problem with the evdev driver. Several downgrades and upgrades later, I moved on to the nvidia drivers. Downgrading the drivers meant also downgrading xorg, which led to dependency hell. I did notice one thing: xorg refused to compile with different nptl USE flag settings than MesaGL. The older nvidia drivers refused to compile at all, claiming my kernel configuration was corrupt.
Eventually I unmasked everything and upgraded everything back to its current version. Taking a chance, I reloaded X. It has been stable for several hours!
The only thing I can think of is that xorg problem with MesaGL. They weren’t configured with the right USE flags to work with each other as far as threading was concerned (nptl). I did have nptl set in my make.conf, but looking at Mesa, it turns out nptl isn’t supported on my amd64 platform. I have no idea how xorg correctly emerged without an error the first time.
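To make that theory concrete, here is a rough sketch of how two packages could end up built with different threading settings even though the flag is set globally. It is a simplified model, not how Portage is implemented: it assumes a package is built with the requested flags it knows about, minus any that are masked for the platform. The package entries, the `opengl` flag and both helper functions are hypothetical.

```python
def effective_flags(requested, iuse, masked):
    """Flags a package is actually built with in this simplified model."""
    return (requested & iuse) - masked


def mismatched(flag_sets, must_agree):
    """Return the flags in must_agree that are not set the same way everywhere."""
    return {
        flag
        for flag in must_agree
        if len({flag in flags for flags in flag_sets.values()}) > 1
    }


# Stand-in for the USE line in make.conf (hypothetical values).
REQUESTED = {"nptl", "opengl"}

# Hypothetical package data: nptl is assumed to be masked for Mesa on amd64.
PACKAGES = {
    "xorg-server": {"iuse": {"nptl", "opengl"}, "masked": set()},
    "mesa": {"iuse": {"nptl", "opengl"}, "masked": {"nptl"}},
}

built = {
    name: effective_flags(REQUESTED, info["iuse"], info["masked"])
    for name, info in PACKAGES.items()
}

if __name__ == "__main__":
    for name, flags in built.items():
        print(name, sorted(flags))
    print("flags that disagree:", sorted(mismatched(built, {"nptl"})))
```

In this model the global setting is honoured by one package and silently dropped by the other, which is the kind of disagreement I suspect I had.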
So after nearly three weeks and lots of parts upgrades, I have the exact same system with an extra PCI-E card, a slightly better motherboard and no significant performance increases other than video. At least I can watch HD movies on both monitors now and I could go up to a Quad Core, like I’d ever need one.
I’m a few weeks behind on several other things, including my thesis, but putting all this time into the current problem has helped me focus both at work and when I get home. My apartment is a mess and the energy rush I’ve had all week has turned into sleeping in until about 1pm today. I’m trying desperately not to think about the fact that all of this was caused by a software upgrade. It’s almost enough for me to jump the Gentoo ship…but not quite yet.