Some time ago system administrators bragged about how much uptime their systems had. It was seen by many as a reflection of their skills in keeping their systems up and running without needing a reboot.

There were even some web sites dedicated to displaying such information. People do like to brag 🙂

Then one day someone realized that having uptime that covered multiple years was probably not such a good thing after all… People started worrying what would happen if the system actually needed to be shut down. Would it come up again on its own? And if it didn’t, where were the people that had installed it and were familiar with its particular quirks? Most had probably moved on to other positions or had even left the company.

And what about security? How many kernel exploits have been fixed since that particular version that is 4 or 5 years old? Long uptimes became increasingly unpopular.

In today’s world a long uptime is usually only found in very special installations, and not frequently in systems that are connected to the internet, the potentially lethal realm where hackers (or even script kiddies) can wreak havoc on an unpatched system in a matter of minutes (or is it seconds?).

Yes, maybe there are firewalls and IDSs between the system and the internet, but hey, if your systems are unpatched, how up to date are your security systems?

I bet there are two kinds of people reading this. The first is the one that smiles and nods in agreement, and the other has already stopped reading and is busy checking uptimes 🙂

Recently a friend of mine told me how worried and how nervous she was about an impending audit of her systems. I tried to help calm her fears, and that led to this blog post about how to react to a systems audit.

I am by no means an expert, but I have had my systems audited more than once, and with the proper attitude, it can actually be a positive thing, not necessarily a nuisance.

First of all, it’s your systems that are being audited. It isn’t about you. Of course, the results, good or bad, will reflect on you, but that is a fact of life, and now isn’t the time to worry about it. You should of course be familiar with your systems, their state of compliance with the law and with your company’s rules and guidelines, and if there are any known deviations, you should be prepared to explain them. Very few people run perfectly tight ships, and there is always room for improvement. Take this audit as a chance to identify any weak spots and thus help you fix them. If there were any previous audits, review them. If they recommended actions you should take, be prepared to either show you have implemented them, or to explain why they haven’t been implemented.

Clearly understand the objectives

Before the audit even starts you should have a meeting with your management. What do they expect from the audit, and what limitations, if any, are in place? Depending on the size of your infrastructure, and the scope of the audit, you may be working solo, or heading a team. Make sure management is aware that time and resources will have to be allocated to the audit.

Then you should schedule a meeting with the auditing team. Sit down with them and make sure you are both working on common ground. Both teams must agree on what will be audited and how, and for how long. This doesn’t have to be set in stone, but it helps if both teams have a clear understanding of what they are doing before they actually start. This sometimes is a tricky situation. The auditors don’t always have a clear understanding of your system, and it is up to you to enable them to understand your system and how they can perform their audit without disrupting the operation of the system.

Use this meeting to make sure the audit team has all it needs to start working.

Agree on methodology

Usually the audit team will propose a methodology to use. Make sure you understand it, and are comfortable with it. If you have any questions ask them now, and they’ll be happy to explain, and possibly adjust to your concerns.

This is where you’ll define one of the most important features: who will actually access the data. Some audit teams insist on accessing the systems themselves, others are willing to ask you for the information they need. Both approaches have pros and cons. Having some other team accessing the systems can potentially disrupt daily activities. It will however grant them access to the data so they can collect exactly what they need. On the other hand, if you agree all requests come through you, you can collect the data they need whenever it is most appropriate, but that will add to your workload, and to your involvement in the audit.

Starting the audit

If this is a formal audit and you have a large team, it’s a good idea to schedule an informal meeting to present the auditors to the rest of the team. Explain what they will be doing, and make sure you identify anyone responsible for any major parts of your infrastructure. Make sure everyone is aware of the audit and if necessary instruct them on how to cooperate with the audit team.

Answer questions in a truthful manner

You are the expert when it comes to your systems. You can’t expect the auditors to get to know them as well as you do in a couple of days. Just think how long it took you to get to know it all. That means they will have questions, and you should be prepared to answer them. And, as with everything in life, you should give honest answers. Don’t lie to the auditors. At the very least you will lose self-respect, and at worst you can get fired, and possibly prosecuted. Be professional and they will respect that, even if there are some problems in your system.

As a side note, one of the best auditors I’ve worked with started his work with some very, very basic questions that could lead you to believe he wasn’t very knowledgeable about what he was auditing. As the audit progressed his questions became much more specific and detailed, showing us that he did indeed fully understand the system, and that the initial questions were just part of his method.

Review the results (and act upon them)

The audit will produce at the very least a report on its findings. At the end of the audit you should have another meeting in which the auditors present their report. If possible try to get a draft copy of the report so you can be prepared for the meeting. The auditors will explain their findings and you have a chance to ask questions about any issues. Make sure you understand their findings.

After the audit

Use the information in the report to take another look at your systems, now through the auditor’s eyes. Prepare an answer for every issue they raised. Maybe it isn’t an issue after all, and there is some reason for its existence? Then document it. If it is a real issue you should address, work the solution into your todo list, and write down that the issue is being addressed. This will help your system when it comes to the next audit, and it will also leave you in control.

Prepare another meeting with management, who probably already received a copy of the report, and present your answers to the issues within it. This may be the end of the audit, but not necessarily of your involvement with it. Present your plan to solve the issues in the report, and agree on a timeline to address those issues.


I ended up writing more than I intended, and I’m certain this is by no means an exhaustive list, but I think the basic ideas are present, and you can adapt them to your own situation. There is no reason to panic because of the audit. A set of fresh eyes looking at things can sometimes be very useful.

I’ve been taking a look at our resolvers and I was surprised by some of the results I found.
I ran a tcpdump for 10 minutes capturing packets sent to one of our resolvers and extracted the names being queried.
During those 10 minutes that particular resolver answered 1.25 million queries for 250 thousand distinct names.
Looking through the list there were many names that result from misconfigured equipment and other mistakes, but that’s on the low end of queries. It’s the higher end, with the most commonly resolved names, that actually interests us.
The list is topped by a name that is hard-coded into some of our clients’ routers. Having several hundred thousand of those devices out in the open making queries does skew the results so I ignored those queries, and just looked at the rest of the names. I ordered them by frequency and here is a brief analysis of the top 50 names.
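
For anyone wanting to repeat this, the counting step could be sketched roughly like this in Python. It assumes the queried names have already been extracted from the capture into a list with one entry per query. The function name, the sample names and the ignored name are all made up for illustration, and this is not the exact procedure I used.

```python
from collections import Counter


def top_names(names, ignore=(), limit=50):
    """Return the most frequently queried names as (name, count) pairs.

    names  -- iterable with one entry per query seen in the capture
    ignore -- names to leave out (e.g. one hard-coded into client routers)
    limit  -- how many of the top names to return
    """
    # Normalise the ignore list the same way as the queried names
    ignored = {n.strip().lower().rstrip(".") or "." for n in ignore}
    counts = Counter()
    for name in names:
        stripped = name.strip()
        if not stripped:
            continue  # skip blank entries
        # DNS names are case-insensitive and may carry a trailing dot;
        # a bare "." (the root) is kept as "."
        key = stripped.lower().rstrip(".") or "."
        if key not in ignored:
            counts[key] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical sample, not taken from the real capture
    sample = [
        "router.example.net.",
        "www.example.com",
        "WWW.example.com.",
        "cdn.example.org",
        "router.example.net",
    ]
    for name, count in top_names(sample, ignore=["router.example.net"]):
        print(f"{count:8d}  {name}")
```

Folding case and trailing dots before counting keeps the same name from showing up as several entries in the ranking.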

As one might expect, at the top of the list comes ‘’ but I was actually surprised to find so many names related to facebook. There are also ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, and if I had dug deeper, I would certainly have found more names.
In case you haven’t figured it out, fbcdn stands for Facebook content delivery network, and ak means Akamai.
Out of the top 50 names queried, 15 belong or are related to Facebook. That is impressive.

The second most popular name being queried was a root server. Not sure I understand why, but there were many, many queries resolving ‘’. A close third was Google’s ‘’. No surprise here, as it is probably the most widely used analytics solution today.
Fourth place was taken by our own VoIP proxy, which is always nice to see 🙂
In fifth place we have ‘’, followed by ‘’, and ‘’. Funny that our local ‘’ only made 13th place.
Also related to youtube are some names like ‘’, ‘’, ‘’, ‘’ that show up at the lower end of the 50.
There is also ‘’, and ‘’ which are self-explanatory.

Then there are a couple of NTP servers, and at least 1 anti-virus name I recognize.

This was just a trial run, and I found the results pretty interesting.
Maybe I can automate this, and see what other surprises are lurking in the data.

Last Friday after lunch, I got an alarm that a certain system was down. It wasn’t one that had direct impact on clients, as it’s a backend system mostly used for running scripts and collecting application info. An antique, so to speak… I checked our Nagios and also Cacti and sure enough, we had lost contact with that system. I finished what I was doing and went to the datacenter to check it out, thinking it would be a quick fix. I got there and after connecting to the console I got the expected kernel dump. The words “out of memory” immediately came to mind. The system was completely unresponsive, so I rebooted it. And that’s when the fun began…

It needed a file system check. Said file system check aborted at about 80%, prompting me to enter single user mode and run the fsck myself, which I did. I confirmed the device’s name, entered the command with the ‘-y’ option to answer yes to every prompt and pressed enter. It started to chug along and spat out the usual messages about fixing inodes, and then it crashed again. It just said “e2fsck exited with signal 11”. Things were not looking good. Signal 11 is SIGSEGV, a segmentation fault, and that usually involves memory…

Since the system had 2 disks in RAID 1, I broke the mirror, and removed one of those disks as a backup in case it got even messier. I booted with just one disk and tried again. Still no luck. fsck still segfaulted, which is something I had never seen before. I googled for it and all I got were a few old pages (dating from 1999 to 2002 if I remember correctly). Some pointed to memory problems, others suggested disk corruption. By that time I was getting pissed. This was taking too long. I returned upstairs, got a Finnix CD from a colleague and this time I brought 2 new disks similar to the ones installed in the server. I used these disks to build a clean mirror, and then fooled the system into booting from one of the problematic disks. It started to mirror that disk onto the blank one. I now had a backup of sorts 🙂

Then I continued trying to boot from the CD, and running fsck from there… it still broke. With help from my colleague, who was also curious as to what had happened, we tried a couple of options. It kept breaking. It would start running the fsck, fix lots of errors, then segfault. Then out of sheer curiosity we tried to mount the disk. It mounted! We tried reading from it and it looked like gibberish. We agreed that disk was destroyed.

I was already considering reinstalling the system when my colleague suggested booting from the CD and trying to read the other disk. I wasn’t too happy. That was my backup. I didn’t want to mess with it. He argued he’d mount it read-only and we went ahead. This copy also mounted as before, but now we could actually see lots of files. Apparently everything was there… We quickly used the information on that disk to configure and mount one of the NFS mounts the system was already allowed to access. We then copied everything onto that storage. It took a little time, but finished without a single error. I checked several files and they all looked perfect.

I rebuilt the mirror using this copy and one of the blank disks, and then ran another fsck. Once again I got the usual screenfuls of errors but this time it ran to completion… I admit I was a little suspicious as I removed the CD and booted the system, but it worked.

I was ready to trash the system, and reinstall it from scratch, but a little patience and some stubbornness sometimes pay off. It was Friday, and if I had reinstalled, it would probably not have been ready before everyone left, and I’m not sure I could have found out everything needed to restore the system to its production state.
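
Checking the copy was done by eye, on a handful of files. A more systematic check could compare checksums of every file on the rescued disk against the copy on the storage. The Python sketch below is only an illustration of that idea, not something we ran that day, and the function names are my own invention.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, copy_root):
    """Return (relative path, problem) pairs for files that are
    missing from the copy or whose contents differ."""
    source_root, copy_root = Path(source_root), Path(copy_root)
    problems = []
    for src in sorted(source_root.rglob("*")):
        if src.is_symlink() or not src.is_file():
            continue  # only regular files are compared in this sketch
        rel = src.relative_to(source_root)
        dst = copy_root / rel
        if not dst.is_file():
            problems.append((str(rel), "missing"))
        elif file_digest(src) != file_digest(dst):
            problems.append((str(rel), "differs"))
    return problems
```

Calling `compare_trees()` with the read-only mount point and the directory on the NFS storage (both paths depend on your setup) returns an empty list when every regular file made it across intact.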

Lessons learnt:

  • Make sure every system is fully documented. This one wasn’t, and that made us more willing to try harder to recover the disks.
  • Make sure you don’t panic, and take the extra seconds to make sure you don’t mess up. Our priority was to salvage the data. I tried to achieve that by breaking the mirror and keeping a copy. Turns out that one of the disks was really messed up. By breaking the mirror we preserved a salvageable copy.
  • Don’t try to do it all alone. By working with a colleague, you can discuss important steps and minimize risk.
  • Lastly, don’t just give up.

Of course things could have turned out worse. The disks could have been irrecoverably damaged, and the only way out would have been installing new disks and restoring from backup… But then again when I first went downstairs, I was just expecting a 10 min stay, and I think I stayed there for over 3 hours.

I’ve been reading a copy of Web Operations by John Allspaw, Jesse Robbins and a bunch of other equally knowledgeable people.

My boss lent me his copy, and I found it so good that less than halfway through the book, I decided I had to get my own copy. Yes, I think it’s that good!

If you’ve been paying any attention to how any reasonably large company creates and deploys services for the web, you probably have heard about devops. The concept isn’t new, but lately it has been getting more attention. It has found some success in bridging the gap between developers and operations people. It’s about time we stop blaming one another 🙂

The book is great, especially in the way it is written, with lots of real life stories, some good, some bad, and you can actually feel empathy for some of the problems they faced.

Most of the stuff in the book (up to what I’ve read) isn’t new. But seeing it there, printed on paper, allowing you to read about other people’s hard-earned lessons is pretty darn good. I can relate to some of those issues as I’ve lived through similar problems, even if at a smaller scale.

Seeing that our solutions were similar to the pros’ does make one feel warm and fuzzy.

This is one book everyone working in operations should read. And also most of the guys developing for the web.

The good news is that my copy arrived today, so I’ll be returning my boss’ copy tomorrow. Let’s see what he has to say when he starts to dig into it 🙂