NAS Server Testing, Part 5: Real data and another 64 bit problem
We are testing the following hardware as a NAS server in our "archive" level of NAS service (Even at an "archive" SLA, we prefer they stay up!):
- Tyan Thunder K8SR mainboard, dual AMD Opteron 200 series ready (socket 940)
- SMDC IPMI daughtercard for Lights Out operation
- 2 AMD Opteron processors
- 1GB RAM
- 3ware 9500 SATA controller card
- 2 SATA disks (mirrored with Linux md0 driver)
- 12 400 GB SATA data disks, RAID 5 with hot spare, 4 Terabytes usable
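The 4 Terabyte figure follows from simple arithmetic. Here is a minimal sketch of it, assuming a single RAID 5 array, one of the 12 disks held back as the hot spare, and decimal gigabytes; the function name `raid5_usable_gb` is just for illustration.

```python
def raid5_usable_gb(total_disks, disk_gb, hot_spares=1):
    """Usable capacity of one RAID 5 array, in the same units as disk_gb."""
    members = total_disks - hot_spares      # disks actually in the array
    if members < 3:
        raise ValueError("RAID 5 needs at least three member disks")
    # RAID 5 gives up one disk's worth of space to parity.
    return (members - 1) * disk_gb


usable = raid5_usable_gb(total_disks=12, disk_gb=400)
print(usable, "GB usable, or about", usable / 1000, "TB")
```

So 12 disks minus the spare leaves 11 in the array, and 10 of those carry data.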
The OS we are testing is a fully patched Fedora Core 4, with the 2.6.15 kernel. All unneeded services are turned off, and no workstation-centric packages are installed beyond the minimum set needed to configure the server.
Part of the reason for all this is to get past all the early 2.6 kernel issues documented in the entries around the time of "When Linux Breaks". Short version: Our best guess from reading at kernel.org is that 2.6.10 or above is where we need to be to not have SLAB Cache issues, which we are having like crazy on 2 different 2.6.5 kerneled systems. We would *really* like to be on 2.6 to get at its speed improvements, but not at the cost of any more downtime. If we can't get 2.6.15 working well, we'll drop back to 2.4.31 everywhere and wait some more.
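The rule of thumb above can be written down as a quick check. This is only a sketch of our best guess, not anything confirmed at kernel.org: the 2.6.10 cutoff is the guess from the paragraph above, and the helper names `parse_kernel` and `in_suspect_range` are made up for illustration.

```python
import re

# Our best-guess first 2.6 kernel without the SLAB cache issues.
FIRST_GOOD_26 = (2, 6, 10)


def parse_kernel(version):
    """Turn '2.6.15' (or a string with a vendor suffix after it) into (2, 6, 15)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized kernel version: %r" % version)
    return tuple(int(part) for part in match.groups())


def in_suspect_range(version):
    """True for 2.6 kernels below the guessed cutoff; 2.4 kernels are not flagged."""
    return (2, 6, 0) <= parse_kernel(version) < FIRST_GOOD_26


for version in ("2.6.5", "2.6.15", "2.4.31"):
    print(version, "suspect" if in_suspect_range(version) else "ok by this rule")
```

Note that 2.4.31 is deliberately not flagged: the concern is the early 2.6 tree, not everything numbered below 2.6.10.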
We have beaten it to death with everything I have documented here in this series: Netperf, iozone, and connectathon. It passed every test. So our next test was to move real data from the old server we are migrating off of onto the new NAS. And it hung up. The NFS client (we like to pull data rather than push) was just stone cold frozen.
Getting on a 3rd server, our very fast TruCluster, we tried to read and copy. That worked better, but it still hung.
We were puzzled. Naturally there is more talk on the team of forgetting 2.6 kerneled Linux as ever being good as a NAS server. Go back to 2.4.31 where we were so happy! One thing is bothersome though: You'd think if it was really that bad we would have heard more about that by now: that we'd stumble across it in Googlespace when we are searching for clues. (Insert your jokes about searching for or catching clues here ---> . <--- )
We did what any good CE would do. Since this is a hang, we have *no* useful diagnostics in the logs, so we started changing things one at a time to see what happens:
- Removed the IPMI card (since it watches NIC IP traffic). NADA.
- We tried replacing the NIC by installing an Intel E1000 card to bypass the Tyan's onboard Broadcom. Nothing different.
- We dropped the FC version of 2.6.15 for a plain one from kernel.org. No Change.
- We dropped FC4 for RH Advanced Server 4.0, with its 2.6.9 kernel. Now it all works fine. Hmm: what *did* the RH folks fix for the production version of this OS?
We would stay here on RH EL 4.0 but we are concerned that the 2.6.9 kernel level will have the SLAB cache issues still, and don't want to go into production on a kernel that far downlevel in the 2.6 tree. There is some speculative talk that if RedHat fixed one thing like the hangs in EL 4.0, they might have pulled down the SLAB cache fixes from up the kernel road too.
Along the way, little clues have now accumulated. Dan kept doing research at each step. His intuition and late night tossing and turning had him convinced that the problem was with the Network cards, even though we would seem to have eliminated them. He is correct. He turns up this:
http://bugs.centos.org/view.php?id=1121
Well fine. A problem with checksums on NICs, but only on 64 bit Opteron processors. Fine. Great. Back to the crossroads. There is a workaround, but only for the Intel E1000, or for the Broadcom when running the BCM drivers rather than the tg3 drivers: you can turn hardware checksum offload off with a parameter in /etc/modprobe.conf:
    alias eth0 e1000
    options e1000 XsumRX=0
That is for an Intel E1000 card, obviously...
This does not currently work for the tg3 drivers because they do not accept turning off checksumming as a parameter. Nothing says you can't go into the source and fix it, though. We suspect, without knowing (because we didn't look), that the RedHat EL 4.0 drivers for tg3 and E1000 turn this off by default.
Back on went FC4, all the patches, and one change: the one above in modprobe.conf. And a re-copy of the production data went as smooth as silk.
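Since the whole fix hangs on one line in modprobe.conf, it may be worth checking for it before kicking off a long copy. The sketch below is one way to do that, assuming the setting is written in the usual `options <module> <parameter>=<value>` form; the function name `rx_checksum_disabled` and the sample text are hypothetical, and the check only reads text, it does not ask the running driver what it is actually doing.

```python
def rx_checksum_disabled(conf_text, module="e1000", param="XsumRX"):
    """Return True if an 'options <module>' line in conf_text sets param to 0."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # ignore comments and padding
        fields = line.split()
        if len(fields) < 3 or fields[0] != "options" or fields[1] != module:
            continue
        for setting in fields[2:]:
            name, _, value = setting.partition("=")
            if name == param:
                # A comma-separated value sets the parameter once per adapter,
                # so every entry has to be 0.
                return all(v.strip() == "0" for v in value.split(","))
    return False


# Hypothetical sample; on a real box you would read /etc/modprobe.conf instead.
sample = """\
alias eth0 e1000
options e1000 XsumRX=0
"""
print(rx_checksum_disabled(sample))
```

To run it against the live file, pass in `open("/etc/modprobe.conf").read()` instead of the sample.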
We found all this because we were using a copy of production data: no amount of beating with the synthetic tests made it fail, but the second we started moving a copy of production data, it triggered the special set of circumstances that led to the NFS client and NFS server hangs. If we had not been testing real data, the first time we would have found this was when we already were in "production" (as much as Archive data is production anyway).
It was fixed because Dan went back and questioned his assumptions, and also trusted his instincts: he just could not believe NAS was this bad on a post 2.6.10 kernel.
We'll test more before we go into production, to be sure.
