maintenance

As every sysadmin knows, the systems we manage are subject to change. New applications are installed, existing applications have their workload changed, and other similar events are all part of an evolving system. I’d venture that very few systems don’t grow or evolve during their lifetime. Sometimes you know the expected evolution right at the beginning of the system’s life, but most often things aren’t planned well enough to give you that foresight. We need a way to keep up with the system’s behavior, and to use this knowledge to forecast how it will behave in a month or in a year’s time.

There are a variety of tools out there, but my favorites are Nagios and Cacti. Both are well known in the industry, both are actively developed, and both can use SNMP to collect their information. For this article I’m mostly referring to Cacti, as it allows us to gather information and display it as a graph. Nagios is used to generate alarms when (some of these) values exceed certain thresholds, but that’s a different article altogether…

From the very beginning of the system’s life, we should start collecting performance values. These include basic information such as CPU load, memory usage, network traffic (for every interface) and disk occupation (for every partition in use), and also information that is related to the system’s use. Depending on what applications we have running, this can be a mix of metrics to keep up with. For example, for a webserver we should monitor the number of processes, the number of requests/s, the average time per request, and any other value that can help us understand the system’s performance. For a database system, we should monitor the number of queries/s, the number of open files and the average time per query, among others. Database systems have a number of items that we can monitor, and the more we use, the better the picture we will have of our system.
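To make the idea of "collecting performance values" concrete, here is a minimal sketch in Python that takes one sample of two basic metrics (load average and disk occupation) using only the standard library. It is not how Cacti or SNMP gather data; the function name `sample_basic_metrics` and the dictionary keys are my own assumptions, chosen purely for illustration.

```python
import os
import shutil
import time


def sample_basic_metrics(path="/"):
    """Return one timestamped sample of a few basic system metrics.

    Hypothetical helper for illustration; a real setup would let
    Cacti/SNMP do the polling and storage.
    """
    usage = shutil.disk_usage(path)
    sample = {
        "timestamp": time.time(),
        # Disk occupation of the partition holding `path`, in percent.
        "disk_used_pct": 100.0 * usage.used / usage.total if usage.total else 0.0,
    }
    # os.getloadavg() only exists on Unix-like systems and can fail there too.
    try:
        sample["load_1min"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return sample


if __name__ == "__main__":
    print(sample_basic_metrics())
```

Run from cron every few minutes and appended to a file, samples like this would already give you a crude baseline to graph.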

Mixing system information with application performance is very useful, as we’ll see next. At first this information doesn’t seem to help much, but this is the system’s baseline, akin to its heartbeat, showing you how the system behaves as a whole and how it evolves as you add new clients or new functionality.

As you gather data you can see some trends emerging from the graphs. You shouldn’t expect a lot of flat lines; you’ll have a lot more sloped ones. These allow you to forecast your system’s behavior into the future. This is invaluable information, allowing you to pinpoint when you will have capacity problems, and giving you data to back up your request for new servers. You can also show management what will happen if you don’t get them.
Correlating application metrics with system metrics is what allows you to really make an informed decision regarding capacity. You can ask questions like “what will happen if the requests per second increase by 20% or 40%?” This is the main advantage of baselines for management. They give you a visual representation of how your systems behave and where you stand in capacity terms.
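The forecasting and "what if" questions above can be sketched with a simple least-squares line. This is only an illustration under a strong assumption, namely that the metric grows roughly linearly; real systems often stop behaving linearly as they approach saturation. The function names and all the sample numbers below are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days, values, threshold):
    """Days after the last sample until the fitted trend reaches threshold.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = fit_line(days, values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


def predict(xs, ys, x_new):
    """Predict y at x_new from the line fitted through (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_new + intercept


if __name__ == "__main__":
    # Hypothetical weekly samples of disk occupation (%).
    days = [0, 7, 14, 21, 28]
    disk_pct = [40, 42, 44, 46, 48]
    print("days until 90% full:", round(days_until_threshold(days, disk_pct, 90)))

    # Hypothetical correlation of requests/s with CPU usage (%).
    rps = [100, 200, 300, 400]
    cpu_pct = [20, 35, 50, 65]
    for growth in (1.2, 1.4):
        print(f"CPU at {growth:.0%} of current peak load:",
              round(predict(rps, cpu_pct, rps[-1] * growth)))
```

With these invented numbers the disk trend reaches 90% roughly 147 days after the last sample, and the CPU estimate lands at about 77% and 89% for 20% and 40% more requests per second.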

Baselines also serve another purpose, one much dearer to us sysadmins. They show us what our systems are like under “normal” usage. If there is a problem somewhere, you will probably see a deviation in the metrics you are collecting. Sometimes it will help you pinpoint the exact cause of the problem. Have you been slashdotted? That would show up as a significant increase in TCP connections, with higher network traffic, leading to higher system load, and very probably using up all your memory. Having a ‘picture’ of the system under normal load is something you truly appreciate once you’re in trouble, wondering what is happening. But by then it is too late to collect data to analyze. So start now, start building your baseline today, and I assure you it will help you in the future.
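As a rough illustration of spotting a deviation from "normal", the sketch below compares a current reading against the mean and standard deviation of baseline samples. The three-sigma cutoff, the function name and the sample values are assumptions of mine, not something Cacti or Nagios prescribes; a real metric with daily or weekly cycles would need a smarter comparison.

```python
import statistics


def deviates_from_baseline(baseline, current, max_sigmas=3.0):
    """True if `current` is more than `max_sigmas` standard deviations
    away from the mean of the `baseline` samples (at least two needed)."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # Perfectly flat baseline: any change counts as a deviation.
        return current != mean
    return abs(current - mean) > max_sigmas * stdev


if __name__ == "__main__":
    # Hypothetical counts of open TCP connections on a normal day.
    normal_connections = [120, 130, 125, 118, 127, 122]
    print(deviates_from_baseline(normal_connections, 124))  # ordinary reading
    print(deviates_from_baseline(normal_connections, 900))  # slashdot-style spike
```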

As per my previous entry, I decided to clean up my PC and removed a ton of dust. I also removed the CPU cooler in order to clean it. After I put it back together, I started thinking about the thermal compound that was on the CPU. It looked good, but I’ve had the PC for 2 years and never changed it, even though I removed the cooler a couple of times. So I thought it might be a good idea to get some new compound and see what effect it had.

So today I bought a syringe with 2.5 g of Arctic Silver Céramique. I got home (through the hottest day of the year), opened up the box again and replaced the thermal compound.

There was a slight improvement. Honestly I was expecting a little more, but that’s what the values read.
The system was left idle for both measurements, just as I had done yesterday.

So repeat after me: A little maintenance can help your system. Clean those dustbunnies, and renew the thermal compound while you’re at it 🙂

Actually I’m a bit embarrassed by this article. I live in Portugal, and we are now experiencing summer in all its glory, meaning long days and high temperatures everywhere. It’s quite normal to exceed 35°C, or 95°F for you yanks 🙂

I’ve been postponing cleaning my CPU Box for some time now. Today, after lunch, I decided I’d get the job done. And I was in for quite a surprise.

My home system is a Quad Core Q6600 running at 2.40GHz, and I thought it would be interesting to measure the idle temps before and after the operation:

[Images: temps before cleaning; temps after cleaning]

The thing is, I found out my system was actually full of dust. And I mean full of it. Every single fan had dust on it, every surface had dust on it. The CPU’s cooler had dust on it. The CPU cooler’s fan had dust on it… I mean lots of dust, and dust everywhere, which is not a good thing when you want to run a healthy system.

So I took apart some of the major components and cleaned them as best I could short of rinsing them under a running tap 🙂

Here are some pictures I took with my cellphone showing the dust, and also how the system looked when I managed to remove the dust:

Finally here’s a picture of the pile of dust I collected:

So, the bottom line is: clean your systems if you want them to run smoothly. As you can see, my temperatures dropped by 7 degrees (Celsius) on average just by cleaning away the dust.

I didn’t disassemble the whole rig. If I had done so, I would have been able to clean it much better, but it would also have taken much, much longer.