
Last Friday after lunch, I got an alarm that a certain system was down. It wasn’t one with direct impact on clients, as it’s a backend system mostly used for running scripts and collecting application info. An antique, so to speak… I checked our Nagios and Cacti and, sure enough, we had lost contact with that system. I finished what I was doing and went to the datacenter to check it out, thinking it would be a quick fix.

I got there and, after connecting to the console, I got the expected kernel dump. The words “out of memory” immediately came to mind. The system was completely unresponsive, so I rebooted it. And that’s when the fun began…

It needed a file system check. Said file system check aborted at about 80%, prompting me to enter single-user mode and run the fsck myself, which I did. I confirmed the device’s name, entered the command with the ‘-y’ option to answer yes to every prompt, and pressed enter. It started to chug along, spat out the usual messages about fixing inodes, and then it crashed again. It just said “e2fsck exited with signal 11”. Things were not looking good. Signal 11 is a SEGFAULT, and that usually involves memory…
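
For the record, the attempt looked more or less like this (the device name below is just a placeholder, not the actual one on that box):

# Run from the single-user shell; /dev/sda1 is only an example device name.
fsck -y /dev/sda1    # answer 'yes' to every repair prompt
echo $?              # a non-zero exit status means fsck ran into trouble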

Since the system had 2 disks in RAID 1, I broke the mirror and removed one of those disks as a backup in case things got even messier. I booted with just one disk and tried again. Still no luck: fsck still segfaulted, which is something I had never seen before. I googled for it and all I got were a few old pages (dating from 1999 to 2002, if I remember correctly). Some pointed to memory problems, others suggested disk corruption. By that time I was getting pissed; this was taking too long. I returned upstairs, got a Finnix CD from a colleague, and this time I brought 2 new disks similar to the ones installed in the server. I used these disks to build a clean mirror, and then fooled the system into booting from one of the problematic disks. It started to mirror that disk onto the blank one. I now had a backup of sorts 🙂
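
I don’t remember every detail of that array anymore, but on a box using Linux software RAID (md) the break-and-rebuild dance goes roughly like this (array and device names here are placeholders, not the real ones):

# Assuming Linux software RAID; /dev/md0, /dev/sdb1 and /dev/sdc1 are placeholders.
mdadm --detail /dev/md0                              # check the current state of the mirror
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1   # break the mirror: drop one half
# ...later, add a blank disk and let it resync:
mdadm /dev/md0 --add /dev/sdc1
cat /proc/mdstat                                     # watch the rebuild progress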

Then I tried booting from the CD and running fsck from there… it still broke. With help from my colleague, who was also curious about what had happened, we tried a couple of options. It kept breaking: it would start running the fsck, fix lots of errors, and then segfault. Then, out of sheer curiosity, we tried to mount the disk. It mounted! We tried reading from it and it all looked like gibberish. We agreed that disk was destroyed.

I was already considering reinstalling the system when my colleague wanted to try booting from the CD and reading the other disk. I wasn’t too happy. That was my backup; I didn’t want to mess with it. He argued he’d mount it read-only, and we went ahead. This copy also mounted, just like the first one, but now we could actually see lots of files. Apparently everything was there… We quickly used the information on that disk to configure and mount one of the NFS mounts the system was already allowed to access, and then copied everything onto that storage. It took a little time, but it finished without a single error. I checked several files and they all looked perfect.

I rebuilt the mirror using this copy and one of the blank disks, and then ran another fsck. Once again I got the usual screenfuls of errors, but this time it finished without crashing… I admit I was a little suspicious as I removed the CD and booted the system, but it worked.
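
In shorthand, the salvage boiled down to something like this (the device, mount points and NFS export below are made up, not the real names):

# All names below are placeholders.
mkdir -p /mnt/rescue
mount -o ro /dev/sdb1 /mnt/rescue                 # mount the surviving copy read-only
mkdir -p /mnt/nfs
mount -t nfs nfsserver:/export/backup /mnt/nfs    # the NFS share the box was already allowed to reach
cp -a /mnt/rescue /mnt/nfs/oldbox-copy            # copy everything, preserving ownership and permissions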

I was ready to trash the system and reinstall it from scratch, but a little patience and some stubbornness sometimes pay off. It was Friday, and if I had reinstalled, the system would probably not have been ready before everyone left, and I’m not sure I could have found out everything needed to restore it to its production state.

Lessons learnt:

  • Make sure every system is fully documented. This one wasn’t, and that made us all the more willing to try hard to recover the disks.
  • Make sure you don’t panic, and take the extra seconds to make sure you don’t mess up. Our priority was to salvage the data. I tried to achieve that by breaking the mirror and keeping a copy. Turns out that one of the disks was really messed up. By breaking the mirror we preserved a salvageable copy.
  • Don’t try to do it all alone. By working with a colleague, you can discuss important steps and minimize risk.
  • Lastly… don’t just give up.

Of course, things could have turned out worse. The disks could have been irrecoverably damaged, and the only way out would have been installing new disks and reinstalling from backup… But then again, when I first went downstairs I was expecting a 10-minute stay, and I think I ended up staying there for over 3 hours.

Today I glanced through an ITWorld newsletter with some Unix tips. I knew them all except one.

Well, actually it’s not exactly a tip or even a new command. It’s just a new way of using a command that is quite well known. We’re all familiar with the echo command, I’m sure, but I was quite surprised when I saw this:

$ echo *

The wildcard will be expanded and all the files in the current directory will be printed out on a single line, separated by spaces.

I stared at it for a moment. It is so brilliantly simple that it’s really amazing.

Of course, you won’t see any files that begin with a ‘.’, and any symbolic links will be listed along with your other files. The biggest problem is filenames with spaces in them. Not that any unix or linux guy would do that, right? But if you do, they’ll show up mixed in with the rest of the files.
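
Here’s a quick made-up example: a directory holding just notes.txt and an ill-advised ‘my file.txt’:

$ echo *
my file.txt notes.txt

From that output alone you can’t tell whether those are two files or three, which is exactly the catch.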

It’s not perfect, but it’s just so amazingly simple. The echo command has been around since the early versions of unix, and so have shell wildcards; I just never thought of putting the two together. Old dogs can still learn old tricks after all.

Whenever I upgrade or reinstall my 64-bit Kubuntu PC, I always have issues with the Flash player, and I end up copying libraries and plugins from one place or another. This time I found a cool site (http://blog.mattrudge.net) that has this precious bit of information:
Installing Flash Player on Ubuntu 10.04 64-bit.


It worked like a charm 🙂


This is something I found online, on a linux site whose name I have since forgotten. (Ping me if you know its origin, so I can give proper credit to the author.)
Anyway, I have it in my shell’s aliases file:

alias nn='netstat -an | grep ESTABLISHED | awk '\''{print $5}'\'' | awk -F: '\''{print $1}'\'' | sort | uniq -c | awk '\''{ printf("%s\t%s\t",$2,$1); for (i = 0; i < $1; i++) {printf("*")}; print ""}'\'''
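
Spread out over several lines, with a comment on each stage, the same pipeline reads like this:

netstat -an |              # list all sockets, with numeric addresses
  grep ESTABLISHED |       # keep only established TCP connections
  awk '{print $5}' |       # grab the remote address:port column
  awk -F: '{print $1}' |   # strip the port, keep just the IP
  sort | uniq -c |         # count connections per remote IP
  awk '{ printf("%s\t%s\t", $2, $1); for (i = 0; i < $1; i++) printf("*"); print "" }'   # print IP, count, and one '*' per connection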

It’s rather useful, as I can quickly see a graphic representation of my TCP connections in the ESTABLISHED state.
The output is something like this:

~$ nn
10.XXX.XX.XX    1       *
10.XXX.XX.XX    5       *****
10.XXX.XX.X     2       **
143.XXX.XXX.XX  1       *
192.XXX.X.XX    1       *
208.XX.XXX.XX   1       *
212.XXX.XXX.XX  1       *
213.XX.XXX.XX   1       *

For security reasons I mangled the IP addresses with X’s. This was taken on my work PC, but imagine running it on a webserver. It might just help you figure out who’s sucking up your connections (or not).
I recently had a problem with a webserver, and this little snippet led me to find out that a certain IP address had 250 established connections to each of the frontend servers of that particular service. One iptables command later, and we could breathe again…
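
I won’t reproduce the exact rule here, but it was something along these lines (the address is obviously made up):

iptables -I INPUT -s 203.0.113.45 -j DROP    # drop all traffic from the offending IP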