Thursday 25 November 2010

netbooting Ubuntu, a cautionary tale of NFS3 and tcp_wrappers...

One of the things I like to have available on my LAN is a decent live Linux system that I can network boot into.  It's incredibly handy for those occasions when you need to do something to, or with, an installed system that requires it not to be running.

This little tale starts a few days ago when I wanted to upgrade my Fedora 13 box to Fedora 14.  No big deal, but I keep a copy of the old root filesystem on a separate partition in case of problems, said copy typically being made (e.g. with dd, and filesystem UUIDs, etc. adjusted as necessary) from a live Ubuntu.  Why Ubuntu?  Because it's really easy to network boot it, competent/complete enough for every issue I've needed it to fix so far... and sometimes it's just nice to see how 'the other side' does things.

So, anyway: my backup in preparation for my Fedora 14 upgrade.  I told my box to boot from the network, selected Ubuntu 10.04.1 from the menu and watched the kernel/initrd load as normal, but then something I'd not seen before happened.  Immediately after mounting the root filesystem, and instead of the normal live Ubuntu startup messages, there was just a string of short read: 24 < 28 messages dumped to the console, one per second.  After a while it just gave up and complained that it could find no usable root filesystem.  Same story for all the other live Ubuntus I had handy.

Uh-oh.  Thing is, this used to work just fine.  And, of course, I hadn't changed anything.  Now, nobody has ever changed anything, right?  Of course I had changed something, but I'm getting ahead of myself.

A quick google turned up something interesting about the order of the kernel arguments mattering (it shouldn't) and leading to that peculiar short read: 24 < 28 message.  "Funny," I thought, "It never mattered before...  And I don't remember changing them in the PXE menu config or whatnot either."  Nonetheless, I fiddled with them and got precisely nowhere.  Something was causing that NFS mount to bork, and I had no idea what.

So it was time for a different approach, and perhaps to get some more idea of what's going wrong...  in the pursuit of which I tried mounting the NFS export on my box, which failed.  Specifically it failed with an error message that left me even more puzzled: mount.nfs: Argument list too long.  "WTF?!" I said, nonplussed.  Naturally, I tried googling that and got pretty much nowhere.  It's not exactly a 'normal' error to get from a simple mount command and, come to think of it, it's not an error I've seen in a long time -- typically it's caused by insanely successful wildcard matches in shell commands, etc., not by mount 192.168.3.154:/export/netboot/ubuntu-10.04.1-i386 /mnt/tmp.
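As an aside on where that odd wording comes from: "Argument list too long" is the standard system description of the errno value E2BIG, so mount.nfs was presumably reporting an error code rather than literally objecting to its arguments.  Why that particular code surfaced here is a guess on my part.  A tiny sketch that just looks the string up:

```python
import errno
import os

# Look up the system's text for E2BIG; this is the same wording that
# mount.nfs printed, which hints that it was relaying an errno value.
message = os.strerror(errno.E2BIG)
print(f"E2BIG ({errno.E2BIG}): {message}")
```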


By that point I was scratching my head and thinking "This used to work.  Clearly something has changed.  Something unexpected, perhaps.  Now...  What have I changed since I last definitely had this working?"

And then it hit me:  My network had changed since I last knew for sure that it was working.  Specifically, I added an extra router to provide wifi for guests, etc.  There's more to the why than that, but that's a long, miserable story in its own right, starring a Netgear D843G router as the villain eventually exposed by a plucky young Android phone and a single remote wget command.  Anyway, the relevant part is that I now have an extra router/AP that thinks my LAN is the internet.  It has a LAN of its own, of course, and that's actually a good thing since it means I can very effectively keep 'guest' wifi use under control.  There's a difference between "Sure, use my wifi." and "Sure, use my wifi... and maybe rummage around on my fileserver as well." after all.

With that in mind, I'd added some rules to my hosts.allow and hosts.deny to explicitly allow NFS 3 access from my regular LAN but not from the 'guest' LAN.  Shouldn't be relevant, right?  Oh, but it was...  and not just because I'd botched it.

Here's the thing:  there are a few ways you can specify a range of IP addresses.  192.168.3.0/24 means the same as 192.168.3.0/255.255.255.0, which means the same as 192.168.3. (note the trailing dot), and I'd gone with the /24 version in my config.  Error number one, right there:  tcp_wrappers is old.  Old and unloved.  It does understand that form of netmask, but only for IPv6 addresses... and only just.
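To convince yourself that the three notations really do describe the same set of addresses, here is a minimal sketch using Python's standard ipaddress module.  It says nothing about how tcp_wrappers parses them; the matches_prefix helper is my own illustration of the trailing-dot form as a plain string-prefix test.

```python
import ipaddress

# Two netmask notations for the same network; ipaddress accepts both.
cidr = ipaddress.ip_network("192.168.3.0/24")
dotted = ipaddress.ip_network("192.168.3.0/255.255.255.0")

def matches_prefix(host: str, prefix: str = "192.168.3.") -> bool:
    """Trailing-dot form: a plain string-prefix test on the address."""
    return host.startswith(prefix)

same_network = cidr == dotted
sample = "192.168.3.154"  # the NFS server address from the mount command
in_cidr = ipaddress.ip_address(sample) in cidr
in_prefix = matches_prefix(sample)
print(same_network, in_cidr, in_prefix)
```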

"Hooray!" I cried, having found the problem.  But would you care to guess what happened when I corrected my netmasks by changing them to the 255.255.255.0 format?  That's right: nothing changed.  Still that short read: 24 < 28.

That's just not fair, really, is it?  I mean, sure that config cockup shouldn't have caused those problems, but it was what had changed since the thing last worked properly...  and, well, "Bother," said Pooh.

For the sake of testing, I threw an ALL : ALL rule into my hosts.allow and...  suddenly...  everything was working as it should.  Live Ubuntu, mounting from a normal box:  no problems at all.  Aaaaargh.  So it was tcp_wrappers that was causing the problem!  But... WTF?  That's when I decided to try the 'trailing dot' method of specifying the network...  and...  that worked.

So, to recap:
192.168.3.0/24 -- bad.  That style of netmask is only for IPv6.  Oops, my mistake there.
192.168.3.0/255.255.255.0 -- bad.  No idea what tcp_wrappers' problem is with that.  Smells of bug to me.
192.168.3. -- good.
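If you have a pile of rules written in either netmask style, a small helper can rewrite them in the trailing-dot form that worked for me.  This is only a sketch: to_trailing_dot is a made-up name, and it only handles networks that end on an octet boundary (/8, /16, /24), since nothing else has a trailing-dot equivalent.

```python
import ipaddress

def to_trailing_dot(cidr: str) -> str:
    """Convert an octet-aligned IPv4 network to trailing-dot form."""
    net = ipaddress.ip_network(cidr)
    if net.version != 4 or net.prefixlen not in (8, 16, 24):
        raise ValueError(f"{cidr} has no trailing-dot equivalent")
    octets = str(net.network_address).split(".")
    keep = net.prefixlen // 8  # number of leading octets that are fixed
    return ".".join(octets[:keep]) + "."

print(to_trailing_dot("192.168.3.0/24"))
print(to_trailing_dot("192.168.3.0/255.255.255.0"))
```

Both calls should give 192.168.3. as the result; anything like a /25 raises ValueError rather than guessing.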

Oh, for completeness' sake, tcp_wrappers also allows the use of asterisks as wildcards in some circumstances and I could potentially have tried 192.168.3.* but since it was now working and I had things to do, I didn't get around to trying that.

So, long-story-short version:  for some reason a perfectly normal and valid netmask in hosts.allow was causing NFS to throw wobblies leading to that peculiar short read: 24 < 28 message and the equally confusing mount.nfs: Argument list too long.  Who'dathunkit.

But do you want to know what the real kicker was in all of this?  Sure you do, if you've read this far.  And it's a real facepalm-worthy one too...

The kicker is that the whole thing was an exercise in Forgetting Something Important:  That extra router/AP is doing NAT, so everything 'behind' it appears to come from it... and it's part of my regular LAN.  In other words: there's no bloody point in trying to restrict access by 'guest' IP address; what I actually need is to restrict access from the 'guest' router's IP address.
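A toy model of why the address-range rule was pointless, under made-up numbers: the guest LAN range and the router's address on my LAN below are hypothetical, and the functions are illustrations rather than anything tcp_wrappers does.  The point is only that the server sees the router's address, never the guest's.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.3.0/24")              # regular LAN
GUEST_LAN = ipaddress.ip_network("192.168.10.0/24")       # hypothetical
GUEST_ROUTER_WAN = ipaddress.ip_address("192.168.3.250")  # hypothetical

def source_seen_by_server(client: str):
    """NAT rewrites guest traffic so it appears to come from the router."""
    addr = ipaddress.ip_address(client)
    return GUEST_ROUTER_WAN if addr in GUEST_LAN else addr

def allowed(client: str, deny_router: bool = False) -> bool:
    """Allow anything that appears to be on the LAN, optionally
    refusing the guest router's own address first."""
    src = source_seen_by_server(client)
    if deny_router and src == GUEST_ROUTER_WAN:
        return False
    return src in LAN

print(allowed("192.168.10.23"))                    # guest slips through
print(allowed("192.168.10.23", deny_router=True))  # guest blocked
```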

Wait, it gets worse.

All this tcp_wrappers malarkey is only relevant to NFS3... and the only things I export via NFS3 are my local package repositories and the necessary stuff to network boot things - all read-only, and none of it even vaguely in need of protecting from accidental guest access.  All the 'real' stuff is exported via Samba and NFS4.  For what it's worth, Samba is easy enough to lock down sufficiently, and I've already done so.  NFS4 is slightly more fun, and doesn't use tcp_wrappers at all, but a couple of simple firewall rules suffice.

Anyway, given the dearth of relevant google hits when searching for the two specific error messages I was seeing, I figured it might be worth sharing the story.