
I’ve been taking a look at our resolvers and I was surprised by some of the results I found.
I ran a tcpdump for 10 minutes capturing packets sent to one of our resolvers and extracted the names being queried.
During those 10 minutes, that particular resolver answered 1.25 million queries for 250 thousand distinct names.
Looking through the list, there were many names that result from misconfigured equipment and other mistakes, but those sit at the low end of the query counts. It’s the high end, the most frequently resolved names, that actually interests us.
The list is topped by a name that is hard-coded into some of our clients’ routers. Having several hundred thousand of those devices out there making queries does skew the results, so I ignored those queries and just looked at the rest of the names. I ordered them by frequency, and here is a brief analysis of the top 50 names.
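(For anyone curious to reproduce this, here is a rough sketch of the counting step. I’m not claiming this is exactly what I ran; the capture file name and the use of scapy are just illustrative, and plain tcpdump output piped through the usual text tools works just as well.)

```python
# Rough sketch: count the most-queried names in a saved capture.
# Assumes the capture was written to resolver.pcap (e.g. with
# `tcpdump -w resolver.pcap port 53`) and that scapy is installed.
from collections import Counter

from scapy.all import rdpcap
from scapy.layers.dns import DNSQR

counts = Counter()
for pkt in rdpcap("resolver.pcap"):        # loads the whole capture into memory
    if pkt.haslayer(DNSQR):                # only packets carrying a DNS question
        name = pkt[DNSQR].qname.decode("ascii", "replace").rstrip(".")
        counts[name.lower()] += 1

for name, hits in counts.most_common(50):  # the top 50 discussed below
    print(f"{hits:8d}  {name}")
```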

As one might expect, at the top of the list comes ‘www.facebook.com’, but I was actually surprised to find so many names related to Facebook. There are also ‘static.ak.fbcdn.net’, ‘apps.facebook.com’, ‘profile.ak.fbcdn.net’, ‘pixel.facebook.com’, ‘creative.ak.fbcdn.net’, ‘platform.ak.fbcdn.net’, ‘external.ak.fbcdn.net’, ‘static.ak.connect.facebook.com’, ‘photos-g.ak.fbcdn.net’, ‘photos-b.ak.fbcdn.net’, ‘photos-e.ak.fbcdn.net’, ‘static.ak.facebook.com’, ‘photos-c.ak.fbcdn.net’, ‘photos-a.ak.fbcdn.net’, and if I had dug deeper, I would certainly have found more.
In case you haven’t figured it out, fbcdn stands for Facebook content delivery network, and ak means Akamai.
Out of the top 50 names queried, 15 belong to or are related to Facebook. That is impressive.

The second most popular name being queried was a root server. Not sure I understand why, but there were many, many queries resolving ‘a.root-servers.net’. A close third was Google’s ‘www.google-analytics.com’. No surprise here, as it is probably the most widely used analytics solution today.
Fourth place was taken by our own VoIP proxy, which is always nice to see 🙂
In fifth place we have ‘google.com’, followed by ‘www.youtube.com’, and ‘www.google.com’. Funny that our local ‘www.google.pt’ only made 13th place.
Also related to YouTube are names like ‘i1.ytimg.com’, ‘i2.ytimg.com’, ‘i3.ytimg.com’, and ‘i4.ytimg.com’, which show up at the lower end of the top 50.
There is also ‘googleads.g.doubleclick.net’, and ‘pagead2.googlesyndication.com’ which are self-explanatory.

Then there are a couple of NTP servers, and at least one anti-virus name I recognize.

This was just a trial run, and I found the results pretty interesting.
Maybe I can automate this and see what other surprises are lurking in the data.

Last Friday after lunch, I got an alarm that a certain system was down. It wasn’t one with direct impact on clients, as it’s a backend system mostly used for running scripts and collecting application info. An antique, so to speak… I checked our Nagios and Cacti and, sure enough, we had lost contact with that system. I finished what I was doing and went to the datacenter to check it out, thinking it would be a quick fix.

When I got there and connected to the console, I was greeted by the expected kernel dump. The words “out of memory” immediately came to mind. The system was completely unresponsive, so I rebooted it. And that’s when the fun began…

It needed a file system check. Said file system check aborted at about 80%, prompting me to enter single-user mode and run the fsck myself, which I did. I confirmed the device’s name, entered the command with the ‘-y’ option to answer yes to every prompt, and pressed enter. It started to chug along, spat out the usual messages about fixing inodes, and then it crashed again. All it said was “e2fsck exited with signal 11”. Things were not looking good. Signal 11 is SIGSEGV, a segmentation fault, and that usually points to memory…

Since the system had two disks in RAID 1, I broke the mirror and removed one of the disks as a backup, in case things got even messier. I booted with just one disk and tried again. Still no luck: fsck still segfaulted, which is something I had never seen before. I googled it, and all I got were a few old pages (dating from 1999 to 2002, if I remember correctly). Some pointed to memory problems, others suggested disk corruption. By then I was getting pissed; this was taking too long. I went back upstairs, got a Finix CD from a colleague, and this time brought down two new disks similar to the ones installed in the server. I used these disks to build a clean mirror, and then fooled the system into booting from one of the problematic disks. It started to mirror that disk onto the blank one. I now had a backup of sorts 🙂

Then I kept at it, booting from the CD and running fsck from there… it still broke. With help from my colleague, who was also curious as to what had happened, we tried a couple of options. It kept breaking: it would start running the fsck, fix lots of errors, and then segfault. Then, out of sheer curiosity, we tried to mount the disk. It mounted! We tried reading from it, and it all looked like gibberish. We agreed that disk was destroyed. I was already considering reinstalling the system when my colleague suggested booting from the CD and reading the other disk. I wasn’t too happy about that; it was my backup, and I didn’t want to mess with it. He argued he’d mount it read-only, so we went ahead. This copy also mounted, as before, but now we could actually see lots of files. Apparently everything was there… We quickly used the information on that disk to configure and mount one of the NFS mounts the system was already allowed to access, and then copied everything onto that storage. It took a little time, but it finished without a single error. I checked several files and they all looked perfect.

I rebuilt the mirror using this copy and one of the blank disks, and then ran another fsck. Once again I got the usual screenfuls of errors being fixed, but this time it ran all the way through and exited cleanly… I admit I was a little suspicious as I removed the CD and booted the system, but it worked.

I was ready to trash the system and reinstall it from scratch, but a little patience and some stubbornness sometimes pay off. It was Friday, and if I had reinstalled, the system would probably not have been ready before everyone left, and I’m not sure I could have found out everything needed to restore it to its production state.

Lessons learnt:

  • Make sure every system is fully documented. This one wasn’t, and that made us all the more willing to try hard to recover the disks.
  • Don’t panic, and take the extra seconds to make sure you don’t mess up. Our priority was to salvage the data, and I tried to achieve that by breaking the mirror and keeping a copy. It turned out that one of the disks really was messed up; by breaking the mirror we preserved a salvageable copy.
  • Don’t try to do it all alone. By working with a colleague, you can discuss important steps and minimize risk.
  • Lastly… don’t just give up.

Of course, things could have turned out worse. The disks could have been irrecoverably damaged, and the only way out would have been installing new disks and reinstalling from backup… But then again, when I first went downstairs I was expecting a 10-minute stay, and I think I ended up there for over 3 hours.

One of the things I have on my (short) list to blog about is SNMP.

I’ve been thinking about how to approach it, and today one of the sysadmins I follow on Twitter (@standaloneSA) tweeted that he had written an entry about SNMP. I went to his blog to check it out, and I highly recommend it. It’s much better than what I would have written, so I’ll just point people his way.

You can reach his blog entry here. Well done, Matt.

As every sysadmin knows, the systems we manage are subject to change. New applications are installed, existing applications have their workload changed, and other similar events are all part of an evolving system. I’d venture that very few systems don’t grow or evolve during their lifetime. Sometimes you know the expected evolution right at the beginning of the system’s life, but more often things aren’t planned well enough to give you that foresight. We need a way to keep up with the system’s behavior, and to use that knowledge to forecast how it will behave in a month’s or a year’s time.

There are a variety of tools out there, but my favorites are Nagios and Cacti. Both are well known in the industry, actively developed, and able to use SNMP to collect their information. For this article I’m mostly referring to Cacti, as it allows us to gather information and display it as graphs. Nagios is used to generate alarms when (some of these) values exceed certain thresholds, but that’s a different article altogether…

From the very beginning of the system’s life, we should start collecting performance values. These include basic information such as CPU load, memory usage, network traffic (for every interface), and disk usage (for every partition in use), as well as information related to how the system is used. Depending on what applications we have running, this can be quite a mix of metrics to keep up with.

For example, for a webserver we should monitor the number of processes, the number of requests/s, the average time per request, and any other value that helps us understand the system’s performance. For a database system, we should monitor the number of queries/s, the number of open files, and the average time per query, among others. Database systems have many items we can monitor, and the more we track, the better the picture we will have of our system.
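To make that concrete, here is a hedged sketch of the kind of collector script I have in mind. Cacti data input methods can run a script and graph the space-separated field:value pairs it prints; the mod_status URL and the field names below are assumptions to adapt to your own setup.

```python
#!/usr/bin/env python3
# Sketch of a Cacti data input script: print space-separated field:value
# pairs that Cacti can turn into data sources and graph over time.
# The mod_status URL below is an assumption; adjust it to your own setup.
import os
import re
import urllib.request

def load_average():
    # 1-minute load average (Unix only)
    return os.getloadavg()[0]

def memory_used_kb():
    # Very rough "used" figure straight out of /proc/meminfo
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])   # values are reported in kB
    return info["MemTotal"] - info["MemFree"]

def apache_total_accesses(url="http://localhost/server-status?auto"):
    # Cumulative request counter from mod_status; graph it as a COUNTER
    # so the requests/s rate is derived for you.
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode()
    match = re.search(r"Total Accesses:\s*(\d+)", text)
    return int(match.group(1)) if match else 0

print(f"load:{load_average():.2f} "
      f"mem_used:{memory_used_kb()} "
      f"requests:{apache_total_accesses()}")
```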

Mixing system information with application performance data is very useful, as we’ll see next. At first this information doesn’t seem to help much, but it is the system’s baseline, akin to its heartbeat, showing you how the system behaves as a whole and how it evolves as you add new clients or new functionality.

As you gather data, you will see trends emerging from the graphs. In all probability you shouldn’t expect many flat lines; you’ll have a lot more sloped ones. These allow you to forecast your system’s behavior into the future. This is invaluable information, allowing you to pinpoint when you will run into capacity problems, and giving you data to back up your request for new servers. You can also show management what will happen if you don’t get them.
Correlating application metrics with system metrics is what allows you to really make an informed decision regarding capacity. You can ask questions like “what will happen if the requests per second increase by 20% or 40%?” This is the main advantage of baselines for management: they give you a visual representation of how your systems behave and where you stand in terms of capacity.
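To give a rough idea of that kind of forecasting (the numbers below are made up purely for illustration), a simple straight-line fit over recent samples is often enough for a first estimate of when a partition will fill up:

```python
# Sketch: fit a straight line to disk-usage samples and estimate when the
# partition fills up. Assumes one sample per day, exported from your graphs.
import numpy as np

# Made-up daily usage percentages, purely for illustration.
usage = [61.0, 61.4, 62.1, 62.5, 63.2, 63.8, 64.1, 64.9, 65.3, 66.0]
days = np.arange(len(usage))

slope, intercept = np.polyfit(days, usage, 1)   # least-squares linear fit
if slope > 0:
    days_until_full = (100.0 - usage[-1]) / slope
    print(f"Growing about {slope:.2f}% per day; "
          f"roughly {days_until_full:.0f} days until the partition is full.")
else:
    print("No upward trend in this window.")
```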

Baselines also serve another purpose, one much dearer to us sysadmins: they show us what our systems look like under “normal” usage. If there is a problem somewhere, you will probably see a deviation in the metrics you are collecting, and sometimes that will help you pinpoint the exact cause. Have you been slashdotted? That would show up as a significant increase in TCP connections, with higher network traffic leading to higher system load, and very probably using up all your memory. Having a ‘picture’ of the system under normal load is something you truly appreciate once you’re in trouble wondering what is happening; by then, it is too late to start collecting data to analyze. So start now, start building your baseline today, and I assure you it will help you in the future.
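As a last illustration of spotting deviations from the baseline, here is a minimal sketch. The history would come from whatever you are already collecting, and the 3-sigma threshold is just an assumption to start from, not a rule.

```python
# Sketch: flag a metric sample that strays too far from its baseline.
from statistics import mean, stdev

def looks_abnormal(history, latest, sigmas=3.0):
    """Return True if `latest` is more than `sigmas` standard deviations
    away from the mean of the historical samples."""
    if len(history) < 2:
        return False                    # not enough data for a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline       # perfectly flat baseline
    return abs(latest - baseline) > sigmas * spread

# Example: established TCP connections, made-up numbers for illustration.
normal_samples = [120, 131, 118, 125, 129, 122, 127, 130]
print(looks_abnormal(normal_samples, 126))   # False: within the usual range
print(looks_abnormal(normal_samples, 900))   # True: a slashdotting, perhaps?
```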