Waiting For Sun
Last updated: Tue, 26 Dec 2006 11:01:00 GMT
In the shadow of Earth.
This one might ramble. Lay in coffee and carbohydrates.
I've often marveled at the kind of people who build and program extremely remote systems. Space, aeronautics, deep sea -- environments so distant, devices so critical, human interaction so difficult, that mistakes simply cannot be tolerated. I wonder what sort of insanely meticulous, anal people these must be. How great it must be to know your craft so deeply.
Well, now I know. Not how great it must be, but I know that I'm not one of those meticulous people.
Before I moved out here, I worried that being so remote from my machines would make interactive use next to impossible. I fretted that the ~350ms latency would make SSH unusable. I fretted for no reason. It turns out that 350ms, while noticeable, isn't so painful. It feels no worse than a 9600 serial line, and large volumes of output are, naturally, less hampered by latency problems. I've done my fair share of administration over serial, so I feel at home. My connection's at least as stable as any I've had previously, too, although I still use screen when I'm doing something that I really don't want interrupting.
I made efforts to test more thoroughly. It's easy to become a little slapdash when you're ten minutes from the machine room, and there are half a dozen other people waiting to help out. I've got myself a test box, I've repeated these things again and again, ironed out all of the unforeseens.
Well, the foreseeable unforeseens.
The one thing my test box doesn't have is RAID. The remote server in question, the only one I don't have more than one of, has six uSCSI discs, set up in mirrored pairs. I didn't want the extra expenditure of hardware SCSI RAID, and the 'board already had a good Adaptec two-channel SCSI controller on there, so I went with software. Software RAID1 may be cheap and cheerful, but it performs fairly well. It doesn't suck the lifeblood out of your machine, like software RAID5 might, and Solaris' DiskSuite (now free with the OS and rebranded Solaris Volume Manager, I recall) is one of the better implementations.
DiskSuite might be pretty good, but Solaris 10 is, how shall we say, more patchy. I've had trouble with kernel patches from day one, with Solaris 10 x86. It doesn't play well with zones (I may have mentioned) and the "upgrade" to GRUB was problematic. One of the other things it doesn't do is play nicely with DiskSuite. Specifically, it has real difficulty telling which real device I may have booted off.
This is a known issue. Sun apparently have no intention of fixing the problem. The approved solution is to break software mirroring, destroy the table of contents on one half of the mirror, so that the partition sizes do not confuse the OS by appearing to be the same, and install your patch. Afterwards, reinstate the table of contents, rebuild your metadevices, resynch.
That's an ugly solution by anyone's standards.
And so, the fateful time came, and I sat around twiddling my thumbs, waiting for T-zero, liftoff. I busied myself preparing by instructing my host not to boot from the mirror metadevice -- one command, one edit of vfstab.
I rebooted.
The host came back up, and immediately something was wrong. For some reason, my root partition was not writable. Half of my services failed to start. I could not log in on the console. I could log in via SSH, strangely. The mount command told me that / was mounted read-write, and from the right device, but I couldn't write anything. I couldn't remount /, I couldn't mount the other half of the mirror anywhere, because it was apparently already mounted somewhere. Except that it wasn't.
No logs -- nowhere to write them.
It occurred to me too late that, blasé as I'd been, after all that testing, and all those times I've done these kinds of things before, I had forgotten to separate the other half of the mirror before rebooting. This, I fear, is probably my error. I suspect that the other half of the mirror has synchronised itself over the live half, although what it should have had to write, and when, I can't imagine.
With a partially functional machine, I was loath to drop to single user so that I might try and correct whatever nastiness I was falling foul of, because I couldn't log in to the console when the machine booted as far as it could into multi-user. I dreaded rebooting to single-user to find that I still couldn't talk to the console.
Marooned.
GRUB to the rescue, perhaps. A fail-safe boot.
It was at that point that I discovered that, from my iBook, over SSH to a FreeBSD server, using cu to connect, with the Sun host's BIOS redirected to serial, something somewhere had forgotten what a cursor keypress looked like. And those cursor keys were absolutely vital, because without them I couldn't choose which option I'd like to boot.
I've never needed to, and I've never (I kick myself now) learned enough about terminals and terminfo to know where to start digging. Panicked times like these aren't conducive to study. Whenever I've been crippled like this before, there's been a fallback: cat and redirection, sed, vi and "hjkl", rdfile and wrfile.
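For reference, a cursor keypress on a VT100-style terminal is not a single character but a short escape sequence, and any hop in a chain like the one above can mangle, split, or reinterpret it. The sketch below, in Python, shows the two common encodings, assuming the usual VT100/xterm conventions; what each link in this particular chain actually sent or expected is not known, and the function names are made up for illustration.

```python
# Byte sequences commonly sent for cursor keys by VT100/xterm-style terminals.
# Normal cursor mode uses the CSI prefix (ESC [); application cursor mode
# uses the SS3 prefix (ESC O). Names here are illustrative only.
ESC = b"\x1b"
CURSOR_KEYS = {"up": b"A", "down": b"B", "right": b"C", "left": b"D"}


def cursor_key_bytes(key, application_mode=False):
    """Return the bytes a terminal typically sends for the given cursor key."""
    prefix = ESC + (b"O" if application_mode else b"[")
    return prefix + CURSOR_KEYS[key]


def describe(seq):
    """Render a byte sequence readably, with ESC spelled out."""
    return " ".join("ESC" if b == 0x1B else chr(b) for b in seq)


if __name__ == "__main__":
    for name in CURSOR_KEYS:
        normal = describe(cursor_key_bytes(name))
        app = describe(cursor_key_bytes(name, application_mode=True))
        print(f"{name:<5} normal: {normal:<8} application: {app}")
```

If the receiving end expects one encoding and gets the other, or sees the ESC arrive separately from the rest, the keypress is not recognised as a cursor key at all.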
I tried a few things -- no luck. By now, two hours had gone and I was past worrying about getting the patching done and more worried about just restoring service.
I could get to a GRUB commandline, and I had a test box on my desk, so I decided I'd do the job of booting to fail-safe manually. I know nothing about GRUB, and not as much as I should about OBP, but I know plenty about sash, from my IRIX days, and it's not like I haven't had to do similar before.
I booted my test box back to GRUB and examined the three commands that comprise the fail-safe boot option:
    root (hd0,0,a)
    kernel /boot/multiboot kernel/unix -s
    module /boot/x86.miniroot-safe
Line by line, I typed these into the GRUB commandline on my remote server. And then, after consulting the list of available commands, I typed boot. Just as my index finger hit return, I thought "maybe I should have tried this on the test box first?"
Hang.
For the record, there is a way to get a Solaris 10 x86 machine to boot to anything you want, from GRUB, without the cursor keys.
Here are the notes from my journal:
1) boot to grub menu, hit 'e', to edit menu
2) hit 'd' until all lines are deleted
3) using 'o' (add a line, cursor moves down to the newly entered line) then 'e' (edit this line) type the boot commands in. for reference, here's what booting to single user looks like on my test host:

       root (hd0,0,a)
       kernel /platform/i86pc/multiboot -s
       module /platform/i86pc/boot_archive

   and here's boot to fail-safe:

       root (hd0,0,a)
       kernel /boot/multiboot kernel/unix -s
       module /boot/x86.miniroot-safe

4) hit 'b'
You know, those guys who program space vehicles don't always get it right. Some probes just cartwheel into the surface; they land the wrong way up, beached, slowly discharging their batteries, unable to open solar panels. They run off the end of a loop and just spin away, into the shadow of the planet they're supposed to be watching.
Billions of dollars, gone.
People laugh at them, at the expensive mistakes they make, unable to hide from the media glare. I never have, and I'm glad of that now. I know how they feel.
I sit here, on the sunny side of the planet, waiting for the UK to spin out of the shadow of Earth. I'm waiting for people there to stir, and wend their way to work, where someone might power-cycle my bloody server.
And then I'll see if I can fix it.