Kernel Hackage
Since I last posted here about our CentOS NAS cluster, we have been in the weeds. Our hopes for Linux being able to deal with this Enterprise-class level of support have been shaken *and* stirred. I will let Dan Goetzman tell the story in a sec, but first some background since my last post.
When we first released the CentOS server, we did not go straight into a full-production, move-everything-off-the-Tru64-server mode. We were more cautious than that. The Tru64 file server, despite being out of support and now running on hardware with no support contract, was still not causing any problems. Not any *new* ones anyway. So we migrated our group's home directories first, and then a few "lower availability required" file systems, and then sat back and evaluated.
At first it looked like we would go ahead and live with the Sun NFSv2 Stale Handle problem (noted in my first post), but then a raft of patches to the kernel came out, and quite a number of them hit areas of the kernel that were of interest to us, specifically in NFS and GFS.
Dan and I talked about it, and decided to try the new kernel. That meant re-certification but we decided to try it on the test hardware. Immediately Dan found a problem with HP-UX clients, and it was *deadly*. Worse, we found out the old server had this problem too! We had not actually tested the entire mix of HP-UX clients possible.
The HP-UX Problem
UNIX and Linux have the concept of bits set to define the read and write ability of a file. If I own file 'xyz' and the write bits are turned off, I cannot write to the file even though I own it. I can use the 'chmod' command to turn the write bit on, and then I can write to it.
The funny thing about that is that by using the chmod command, I am technically writing to the file, or more precisely to the inode of the file. That means that there is a bit of code someplace that makes sure I own the file and am allowed to do it.
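The chmod behavior described here can be tried from userspace. The following is a minimal sketch, not taken from the original investigation: the function name `owner_enables_write` and the scratch path used with it are made up for illustration. It creates a file with the owner write bit off, turns the bit on with chmod(2), and then writes to the file.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): create `path` read-only for the
 * owner, enable the owner write bit with chmod, then write one byte.
 * Returns 0 if the chmod and the following write both worked, -1 otherwise.
 */
int owner_enables_write(const char *path)
{
	int fd;
	int ok;

	assert(path != NULL);

	unlink(path);                    /* start from a clean slate */
	fd = open(path, O_CREAT | O_EXCL | O_RDONLY, 0400);
	if (fd < 0)
		return -1;
	close(fd);

	/*
	 * chmod changes the inode, not the file data. It is allowed here
	 * because we own the file, even though the write bit is off.
	 */
	if (chmod(path, 0600) != 0) {
		unlink(path);
		return -1;
	}

	fd = open(path, O_WRONLY);
	ok = (fd >= 0 && write(fd, "x", 1) == 1);
	if (fd >= 0)
		close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```

Calling `owner_enables_write("/tmp/some-scratch-file")` on a local file system should return 0 for the owner of the file. The HP-UX case in this post is the same idea with the metadata change arriving over NFS instead of from a local chmod.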
With GFS, and *only* GFS, as the backing store, and HP-UX, and only certain
versions of HP-UX, as the client, accessing via NFS, we went down a code path
where HP-UX would attempt to create a file, and then get rejected when it tried
to write to the file.
Dan's initial look at this came back with the theory that the GFS team had begun to use certain generic kernel file system semantics, and that other file systems like EXT3 and XFS had not.
This was a show stopping problem. Our environment is far too heterogeneous to
work with a gaping hole like this. We talked about it some more. Dan's
research had found one other post about this issue, meaning we were out
someplace in the code that very few had followed into. That was going to mean
that the "Many Eyes, Shallow Bugs" leverage of Open Source was not working to
our advantage.
Having the source code meant that we could see if this was something that we could fix, but Dan told me at least three times that he was not a kernel guy, and that he was not even sure what the POSIX-compliant behavior should be. He decided to take a swing at it anyway. I turn it over to him here:
Dan's Kernel Story
I finally have "hacked together" a fix for the "HP-UX NFS client
on a el5 based NFS server with GFS filesystems" problem!
After adding a bunch of "printk's" to the kernel and many kernel builds, I was
able to trace down the kernel function that was at the root of our problem. It
seems that NFSD calls vfs_create (and that returns OK)
and then calls nfsd_setattr to set the file attributes
correctly. nfsd_setattr does a few things and ends up
calling notify_change, and down the road a bit more will
end up calling gfs_setattr and then farther down the path
will end up calling generic_permission (a regular kernel
routine).
It's this generic_permission call that returns
-EACCES, apparently because the file was
created with the correct owner, but with NO access
permissions in the case of the HP-UX NFS client. Interestingly, this
generic_permission call is supposed to replace the
gfs_permission call that was the way it was done in the
pre-2.6.10 days. Apparently GFS is the only filesystem (as of the el5 vintage)
that has made this change. ext3 does not yet call
generic_permission. I found patches to make this change
to XFS, but a trace of XFS on el5 reveals it does not call
generic_permission at this time. So, that's why it only
fails on GFS on CentOS5!
Not really wanting to change a kernel function that other things might call, I
elected to change where the nfsd layer in the kernel gets the error returned
(by notify_change). My hack simply checks if
notify_change returns error=-EACCES
and then IF (NFS uid == inode uid) resets the error var to
0. That is, if the owner of the inode is the same as the calling owner uid
then allow access. I added a printk at the kernel.debug level so I can see
this via syslog if I have the kernel.debug level set to log, to verify it
works...
Initial tests indicate success. I have all 3 nodes on the cluster up on this
"BMCFIX" kernel now. HP-UX NFS clients seem to work AOK now.
I posted this to the bugs.centos.org case, to have the experts look at how to
provide a more permanent fix, as I am not really up on things like POSIX
compliance and all. This is just a hack to prove that I am on the correct path
and that it will resolve the problem.
Anyhow, here is the patch if you are interested in the exact code that was
added to fs/nfsd/vfs.c to the nfsd_setattr function:
+++ vfs.c 2008-01-18 13:18:40.000000000 -0600
@@ -348,6 +348,11 @@
 	if (!check_guard || guardtime == inode->i_ctime.tv_sec) {
 		fh_lock(fhp);
 		err = notify_change(dentry, iap);
+		/* Allow access override if owner for HP-UX NFS client bug on GFS */
+		if (err == -EACCES && (current->fsuid == inode->i_uid)) {
+			printk(KERN_DEBUG "nfsd_setattr: Bug detected! Ignoring -EACCES error for owner\n");
+			err = 0;
+		}
 		err = nfserrno(err);
 		fh_unlock(fhp);
Dan's bug is number 2583 at http://bugs.centos.org.
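For readers who want to poke at the decision the hack makes without building a kernel, it can be modeled as a small pure function in userspace. This is only a sketch: `override_owner_eacces` and `override_self_check` are hypothetical names, and the uids are passed in as plain arguments instead of being read from the calling task and the inode as the kernel code does.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/*
 * Hypothetical userspace model of the override: if the attribute change
 * failed with -EACCES and the caller owns the inode, treat it as success.
 * Every other result is passed through unchanged.
 */
int override_owner_eacces(int err, uid_t caller_fsuid, uid_t inode_uid)
{
	if (err == -EACCES && caller_fsuid == inode_uid)
		return 0;
	return err;
}

/* Small usage example covering the cases the model distinguishes. */
void override_self_check(void)
{
	/* owner hits -EACCES: overridden */
	assert(override_owner_eacces(-EACCES, 500, 500) == 0);
	/* non-owner hits -EACCES: still refused */
	assert(override_owner_eacces(-EACCES, 501, 500) == -EACCES);
	/* any other error is left alone, even for the owner */
	assert(override_owner_eacces(-ENOENT, 500, 500) == -ENOENT);
}
```

The point of keeping the check this narrow is that only one error code for one class of caller is changed, so non-owners and unrelated failures behave exactly as before.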
We Admit: It's a Hack
I post this all here in the spirit of openness, should anyone follow us out here to the bare edge of GFS based NAS servers. We do not know what the right way to really fix this problem would be, but we looked at it as being like my example: owning a file carries an implicit authority to at least write to the inode.
Dan stood this code up four days ago, and so far, so good. In fact, we know that it is doing what we want in terms of being a cluster because a "network burp" caused the NFS service to migrate from one node to another. We only knew about it because we saw it in the log. The customer-facing service kept right on running.
Open Source
This problem illustrates all sorts of things about the advantages and disadvantages of Open Source, all wrapped into one neat bug number.
- By having the source code, and a guy good enough to read and understand it, we were able to fix a severe problem in-house, without relying on anyone.
- Because we were on the bleeding edge where very few folks appear to be, we were on our own. The "Many Eyes, Shallow Bugs" principle does not work when there are not many sets of eyes looking at all the possible cases.
- Linux is great for a heterogeneous environment, as long as one is willing to put in the time and effort sometimes. Along the way of shooting this bug, Dan was laughing about some of the code comments about all the other patches in Linux to deal with various corner cases for things like Irix and other more obscure combinations of problems. It is easy to see why the embedded market loves Linux.
- By choosing CentOS, we chose not to have a support option, but one way out of this would be to use the equivalent version of Red Hat and take out a support contract. That back door possibility was part of the attraction of CentOS.
- By tripping over this now, and documenting it, we have hopefully made life easier for whoever comes this way next: Dan notes that XFS is getting ready to start using the kernel-provided file system semantics, so they would have seen this next.

I'm not saying that commercial vendors can't always deliver but that sometimes it's good to have the source IF you have the guys or gals that can hack it ;)