Fun with rrdtool
RRDtool is a graphing tool. It contains tools that allow you to create a thing called a “round-robin database” (hence RRD), update it regularly with fresh data, and plot graphs based on that data.
- 1. Introduction to RRDtool
- 2. RRDtool graph gallery
- 2.1. CPU usage
- 2.2. Load levels
- 2.3. RAM usage
- 2.4. Temperatures and fan speeds
- 2.5. Network traffic
- 2.6. Disk space and activity
1. Introduction to RRDtool
What sets RRDtool apart is that it’s meant for data that changes continuously. A computer system offers plenty of data you can collect this way: processor temperature, the amount of data transferred over network connections, how much RAM or other resources are in use at any given time, disk space and activity, ping response times and so on. If you can get figures for it, you can plot it.
Typical usage of RRDtool involves three steps.
First, you define the data structure for a specific purpose. For instance, if I intend to follow the network traffic on a specific interface, I need to record two data entries: the amount of data going into the interface and the amount going out of it. The definition step also involves setting limits, such as the upper and lower bounds for what constitutes valid data (negative values have no meaning in this case), and choosing a minimum step, which later helps the tool decide how long a gap without data is too long and whether it should reuse the last figure or report “N/A”.
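As a rough illustration of the definition step, here is a minimal Python sketch that only builds the argument list for `rrdtool create`; it does not run anything. The file name `traffic.rrd`, the data source names `in` and `out`, and the step, heartbeat and row values are all assumptions chosen for the example, not the settings behind the graphs in this article.

```python
def build_create_args(rrd_path, step=60, heartbeat=120, rows=1440):
    """Return an argument list for `rrdtool create` (nothing is executed)."""
    args = ["rrdtool", "create", rrd_path, "--step", str(step)]
    for name in ("in", "out"):
        # DS:<name>:<type>:<heartbeat>:<min>:<max>
        # min=0 rejects negative values, max=U means "no upper limit".
        args.append(f"DS:{name}:COUNTER:{heartbeat}:0:U")
    # RRA:<consolidation>:<xff>:<steps>:<rows>
    # One averaged row per step; 1440 rows of 60 s cover one day.
    args.append(f"RRA:AVERAGE:0.5:1:{rows}")
    return args


print(" ".join(build_create_args("traffic.rrd")))
```

The heartbeat plays the role of the "how much time without data is too much" setting described above: if no update arrives within that many seconds, the value for that interval is recorded as unknown.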
Once you’ve defined your database, it’s time to start collecting data into it. You will most likely do this periodically, at short intervals, usually via cron. How often you do this depends on how fine a grain you want in the results. Usually this step involves reading figures from the relevant system reporting tools, such as files under /proc, and entering them in the database once a minute (or more often, or less often).
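The collection step could look something like the sketch below, which builds the arguments for `rrdtool update`. The file name and the sample values are made up for illustration, and the crontab line in the comment points at a hypothetical script path.

```python
def build_update_args(rrd_path, values, timestamp="N"):
    """Return an argument list for `rrdtool update` (nothing is executed).

    `timestamp` defaults to "N", which rrdtool interprets as "now".
    """
    joined = ":".join(str(value) for value in values)
    return ["rrdtool", "update", rrd_path, f"{timestamp}:{joined}"]


# A hypothetical crontab entry that runs a collector script every minute:
#   * * * * * /usr/local/bin/collect_traffic.py
print(" ".join(build_update_args("traffic.rrd", [123456, 654321])))
```

The values must be given in the same order as the data sources were defined when the database was created.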
Note that RRDtool is sophisticated enough to cope with different types of data, such as always-incrementing counters like traffic amounts (it even takes counter wrap-around into account) or gauges that go up and down all the time, such as temperatures.
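To make the counter idea concrete, here is a simplified Python sketch of turning two readings of an always-incrementing counter into a rate, including a single wrap-around. It is an illustration of the principle, not RRDtool's actual code; the function name and the 32-bit default are assumptions.

```python
def counter_rate(previous, current, seconds, bits=32):
    """Per-second rate between two readings of an incrementing counter.

    If the counter appears to have gone backwards, assume it wrapped
    around once at 2**bits and correct for that.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = current - previous
    if delta < 0:
        delta += 2 ** bits
    return delta / seconds


print(counter_rate(1000, 1600, 60))       # ordinary case
print(counter_rate(4294967290, 10, 8))    # counter wrapped between readings
```

A gauge needs none of this: the reading itself (a temperature, say) is the value to store.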
Finally, once you’ve collected some data, it’s time for the fun part: plotting the graphs. Again, RRDtool’s capabilities are quite interesting, and if you add your own ideas to the mix you can obtain very cool results. RRDtool allows you to “zoom” in and out on any time interval, as long as it is covered by collected data and the step you collected at is fine enough. It even provides a CGI that makes it easier to offer the graphs on the Web, and there are many third-party tools out there to help with this.
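A graphing call might be assembled along the lines of the sketch below. Again it only builds the `rrdtool graph` arguments. The file names, the data source names `in` and `out`, the variable names `inbytes` and `outbytes`, and the colors and legends are assumptions for the example.

```python
def build_graph_args(png_path, rrd_path, start="-4h", end="now"):
    """Return an argument list for `rrdtool graph` (nothing is executed)."""
    return [
        "rrdtool", "graph", png_path,
        # Changing --start/--end is how you "zoom" in or out in time.
        "--start", start, "--end", end,
        # DEF:<vname>=<rrd file>:<data source>:<consolidation>
        f"DEF:inbytes={rrd_path}:in:AVERAGE",
        f"DEF:outbytes={rrd_path}:out:AVERAGE",
        # AREA/LINE1:<vname>#<rrggbb>:<legend>
        "AREA:inbytes#0000FF:Downloaded",
        "LINE1:outbytes#FF0000:Uploaded",
    ]


print(" ".join(build_graph_args("traffic.png", "traffic.rrd")))
```

Calling it with `start="-30min"` would produce a closer view of the same data, provided the database keeps rows at a fine enough step.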
2. RRDtool graph gallery
Enough theory, time for some eye-candy. The RRDtool gallery holds a handful of interesting examples.
As for myself, I’ve posted below some of the more colorful results from my own desktop system, with explanations next to each one.
The system in question is a regular i386 desktop system, using Debian Linux and a recent 2.6 Linux kernel. The graphs are all based on the same 4-hour period, depicting typical daily desktop usage on a Linux system: web browsing, watching movies, TV, text and image editing, file transfers and so on.
2.1. CPU usage

This is the kind of information you see, for example, in the top console tool. It is based on information taken from /proc/stat, which lists for each CPU the time it has spent in each of several states (user, nice, system and idle), in measuring units called jiffies.
You can see it has been a particularly full 4-hour period, with the CPU kept busy to some degree by one task or another and seldom fully idle. This is normal, since desktop computers are meant to be used.
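For the curious, here is a small Python sketch of how the /proc/stat figures can be turned into percentages. It parses text passed in as a string, so the two samples below are invented; on a real system you would read the file twice, some seconds apart. Only the first four fields of the aggregate `cpu` line are used, and the function names are my own.

```python
def parse_cpu_line(stat_text):
    """Extract user, nice, system and idle counters from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":  # aggregate line, not cpu0, cpu1...
            user, nice, system, idle = (int(x) for x in fields[1:5])
            return {"user": user, "nice": nice, "system": system, "idle": idle}
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Share of time spent in each state between two samples, in percent."""
    deltas = {state: after[state] - before[state] for state in before}
    total = sum(deltas.values())
    if total == 0:
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * delta / total for state, delta in deltas.items()}


first = parse_cpu_line("cpu  100 0 50 850 0 0 0\ncpu0 100 0 50 850 0 0 0\n")
second = parse_cpu_line("cpu  150 10 70 970 0 0 0\ncpu0 150 10 70 970 0 0 0\n")
print(cpu_percentages(first, second))
```

With a counter-type data source, RRDtool can work out the per-second differences itself, so a collector may simply store the raw counters.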
2.2. Load levels

You may already be familiar with the concept of “load level”. It is more specific to Linux (and UNIX) systems and seldom used for Windows systems.
If you don’t know what it means, here’s a simplistic explanation. The load level is the number of processes that have had to be put “on hold” by the CPU at the same time in the most recent time period. “On hold” means they are attempting to do something but the CPU is currently busy and will get back to them as soon as possible.
A load level of 1 is considered a sort of landmark, and you can see it has been marked with a cyan line on the graph. Staying below this line means “no process was kept waiting” and is something to be desired.
Of course, in practice this is not always attainable, nor is it a bad thing if it doesn’t happen. Desktop systems routinely see loads of 3 or more. It doesn’t mean that the system becomes unresponsive when a high load is reached. A good CPU scheduling mechanism will make sure that doesn’t happen even under high loads.
You will notice that there are 3 load plots: one for the last 60 seconds, one for 5 minutes and one for 15. The data was collected from /proc/loadavg.
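Reading the three load figures is about as simple as it gets. The sketch below parses the contents of /proc/loadavg given as a string; the sample line is invented.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


print(parse_loadavg("0.42 0.37 0.30 1/123 4567\n"))
```

These are gauge-type values: they go up and down, so they are stored as they are rather than as differences.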
2.3. RAM usage

Here you can see how the Linux kernel deals with memory allocation. I have 768 MB of RAM in this system and you can see it being wisely used almost entirely for something. This is a good thing, since RAM is the fastest storage medium in today’s computers, and the more you keep in RAM, the less you need to fetch from slower media such as fixed disks.
You can see that the actual amount of RAM used by the currently running applications hovers between the 400 and 500 MB marks (the brown area). The orange areas show how the kernel puts the rest of the RAM to good use by loading it chock full of cached data and disk buffers. Only a relatively tiny area (the light yellow) is kept really free, in order to accommodate any sudden jumps in memory requirements.
I haven’t plotted swap usage on this graph because 768 MB of RAM is plenty for my desktop and swap never seems to pass 1 or 2 MB, which would be practically invisible at this scale.
The data was taken from /proc/meminfo.
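Here is a sketch of how the areas on such a graph could be derived from /proc/meminfo. It parses text passed in as a string, with invented sample figures, and assumes that "used by applications" is simply the total minus free, buffers and cached memory, which is a simplification.

```python
def parse_meminfo(text):
    """Map field names in /proc/meminfo text to their values in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def memory_breakdown(meminfo):
    """Split total RAM into applications, buffers, cached and free (kB)."""
    free = meminfo["MemFree"]
    buffers = meminfo["Buffers"]
    cached = meminfo["Cached"]
    apps = meminfo["MemTotal"] - free - buffers - cached
    return {"apps": apps, "buffers": buffers, "cached": cached, "free": free}


sample = (
    "MemTotal:       786432 kB\n"
    "MemFree:         20480 kB\n"
    "Buffers:         51200 kB\n"
    "Cached:         254752 kB\n"
)
print(memory_breakdown(parse_meminfo(sample)))
```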
2.4. Temperatures and fan speeds

This graph depicts all the temperature sensors in my desktop computer, as well as some fan speeds. First we have the temperatures from the CPU, the motherboard chipset and the inside of the case, respectively, as the green areas, collected from the output of the sensors tool.
Then come the temperatures of the three fixed disks in the system, which are read from their SMART parameters and shown as the thick orange and red lines.
Finally, we have the speeds of the front fan and the CPU cooler fan, as the blue and magenta lines. These are obviously not to scale, since the rest of the graph is in Celsius degrees, but have been included here only so that I can keep an eye out for any sudden drops or increases.
Some of you can probably tell that, temperature-wise, this is a decent setup I have managed to put together.
I have a Nexus 90mm fan in front, blowing over the fixed disks and keeping them at a comfortable 30-something Celsius (around 90F) even under load. There’s a 120mm Nexus in the back, but unfortunately it’s not hooked up to a speed sensor. The chipset stays under 40 Celsius at all times (104F), while the CPU goes from under 40 to 45 Celsius under load (113F). There’s a bit of wire management that helps the air go from front to back unobstructed.
Note that this is an Athlon XP and that these temperatures are the result of several not-so-ordinary measures I have taken: a good-quality CPU cooler (Arctic Cooling Silent Copper 3); undervolting the core (to 1.35V, from 1.55V); motherboard power saving; and, last but not least, active FSB throttling.
2.5. Network traffic

Next we come to the amounts of data transferred via the network, and I trust the graph is fairly self-explanatory. You can see the amounts downloaded as the blue area and the amounts uploaded as the red line.
Once again, I took inspiration from a graph I saw in the RRDtool gallery and also plotted a “base” line which stays green when the computer is online and turns red when the connection drops for some reason.
The nice thing about network graphs is that you can get RRDtool to compute totals over the time period that is displayed, and therefore you can get an idea of how much traffic you generate. This is especially useful if your ISP has a monthly or daily quota and you’d like to double-check their claims.
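The idea behind such a total is simple enough to sketch in Python: each stored value is an average rate over one step, so multiplying by the step length and adding everything up gives the amount transferred. As far as I know, RRDtool's graph command offers this through the TOTAL operator of a VDEF; the function below is only my own illustration of the arithmetic, with made-up rates.

```python
def total_bytes(rates, step_seconds):
    """Sum average byte rates (bytes/s) over fixed steps into a total.

    Unknown intervals are passed as None and contribute nothing.
    """
    return sum(rate * step_seconds for rate in rates if rate is not None)


# Three one-minute intervals, the middle one with no data.
print(total_bytes([100.0, None, 50.0], 60))
```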
The data was collected from /proc/net/dev. Please note that this particular graph shows the traffic on my LAN interface (the internal house network), not the outside traffic I make via my ISP.
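For reference, here is a sketch of pulling the two byte counters for one interface out of /proc/net/dev. It works on text passed in as a string; the interface name `eth0` and the sample figures are assumptions. After the interface name, the first field is bytes received and the ninth is bytes transmitted.

```python
def parse_net_dev(text, interface):
    """Return (bytes_received, bytes_transmitted) for one interface."""
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        if name.strip() == interface:
            fields = data.split()
            return int(fields[0]), int(fields[8])
    raise KeyError(interface)


sample = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
    "    lo:  1000 10 0 0 0 0 0 0  1000 10 0 0 0 0 0 0\n"
    "  eth0:123456 200 0 0 0 0 0 0 654321 300 0 0 0 0 0 0\n"
)
print(parse_net_dev(sample, "eth0"))
```

Both figures are ever-growing counters, so they suit a counter-type data source rather than a gauge.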
2.6. Disk space and activity

Finally we have some disk-related statistics. First we have a graph which combines used and free space on the 3 main partitions I like to keep an eye on: the one containing the operating system files, the home partition and the Windows partition.
Used space is shown with a darker color, while the free space appears lighter. You can see a variation in the space on my home partition during the plotted time. You can also see that I’m slowly running out of space. :(

And here is data collected from /sys/block/*/stat files, which shows how much work my three fixed disks are doing, based on sectors read and written.
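The per-disk stat files hold a single line of counters. In the sketch below, which parses a string with invented figures, the third field is taken as sectors read and the seventh as sectors written; the function name is my own.

```python
def parse_block_stat(text):
    """Return (sectors_read, sectors_written) from a /sys/block/*/stat line."""
    fields = text.split()
    return int(fields[2]), int(fields[6])


sample = "   12345      678   987654     4321    23456      789   456789     8765        0     5432    13086\n"
print(parse_block_stat(sample))
```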
hde and hdg together form a RAID1 array and therefore they always work together when it comes to writing data to them. On the other hand, hde seems to be preferred a bit more when reading data, a fact backed up by the SMART figures, which list slightly higher figures for certain readings.
This particular graph is somewhat out of the ordinary. Normally, the RAID disks would work together, as you see in the left part of the graph: either both writing or both reading. But on the right you can see one of them doing massive reads and the other doing massive writes. It’s proof that having these graphs around is actually useful, because I could tell the instant I saw this that something was up. Indeed, upon checking /proc/mdstat it became apparent that there was a RAID resync in progress.
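A check like that can also be automated. The sketch below looks for a resync or recovery progress line in /proc/mdstat text. The sample is an invented approximation of what the file looks like during a resync, and the regular expression is my own guess at a reasonable pattern, so treat it as a starting point.

```python
import re

# Matches progress lines such as "resync = 12.6% (...)".
RESYNC_RE = re.compile(r"(resync|recovery)\s*=\s*([\d.]+)%")


def find_resync(mdstat_text):
    """Return (kind, percent_done) if a resync or recovery is running."""
    match = RESYNC_RE.search(mdstat_text)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


sample = (
    "Personalities : [raid1]\n"
    "md0 : active raid1 hdg1[1] hde1[0]\n"
    "      78148096 blocks [2/2] [UU]\n"
    "      [==>..................]  resync = 12.6% (9863424/78148096)"
    " finish=42.1min speed=27000K/sec\n"
    "unused devices: <none>\n"
)
print(find_resync(sample))
```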
