On a Crash Course

OK, I got back to work after What The Hack and I am now the senior sysadmin, as my boss has left. I’m a reasonable sysadmin but I’m still learning the ropes with regard to what I have actually taken over, so guess what happens.

First morning back, when I’m in need of a gentle day browsing the net, writing email and reading my planets of choice, the main monitoring machine, running Nagios, goes down. Hard disk completely fucked, isn’t even recognised by the BIOS. Is it backed up? Is it my arse. So I spend the rest of the week flying blind, not knowing when things go down, while I try to get my machine back up. This is not as easy as it sounds: I don’t know Nagios, so it’s a crash course. I’m nearly there now. I just have to finish getting gnokii to send me SMS messages, separate the config out into different files and add the rest of the machines and services to the monitoring. The only problem is that we had a script and a database that created our DNS config files, which, as far as I can tell, were also not backed up and, like Nagios, went down with the disk. So next time I have to add a new domain I’m going to have to go on a BIND crash course. Arse.

I begin to devise my own very comprehensive backup and mirroring scheme. All of our clients’ data is safe as houses of course, but I fear for the configuration of the infrastructure machines that I have inherited, and I don’t want to have to rewrite it from scratch. I don’t want my BIND crash course to be a live crash course. Paranoia is a good trait in a sysadmin. Lesson 1: legislate for absolute catastrophe.

So, while I’m getting there with that, one of my tape backup drives dies and none of the spares work. So I’m waiting for the new drive to arrive before I start my Amanda crash course.

I was expecting the new drive today, but there was an error with the retailer’s stock levels: they didn’t have the number we ordered, so they didn’t ship it. So maybe tomorrow.

Anyway, I’ve got ahead of myself a little here. That was sometime this afternoon. Last night I was helping my friend Dan move some stuff out of his old house, as he has to clear the place before the landlord goes in. As gnokii isn’t texting me yet, I was unaware until this morning that one of our servers had gone down and wouldn’t come back up. When I found out, I tried to do what I could, but there was no way in remotely so, as is my responsibility, I raced into work early to sort it out. Disk errors. Lots of ’em. I fixed them, but there were some complicated replication services on other machines that depended on this machine being available, so this problem made a mess of quite a few things. My boss managed to sort that mess out and we then spent the rest of the day building a replacement machine to take the weight of the one that went down. Hmmm, a crash course in installing a binary kernel module on the only supported Linux version, with a huge list of kernel boot parameters, on an unwilling RAID system. It took all day to get right, including trying out all of the available options that would have avoided the binary-only kernel module. Why these people don’t open source their driver I don’t know. It would mean someone could maintain the driver, bring it up to date with current mainline kernels and maybe get it into the kernel proper. It would also give their device the selling point of being supported by the Linux kernel proper, and free them, in some sense, of developer responsibility.

Anyway, I have to spend tomorrow installing all of the necessary, moving over the necessary from the other machine and then replacing the unnecessary, which will be a 5am start on Thursday morning :s

Just a hint to the readers: this is not indicative of the stability or quality of my employer’s servers, of the quality of service my company provides or the ability of the former or current sysadmins. You just don’t hear sysadmins whining out loud like this that often 😉