Create a single RRD file for each LUN. This is ugly, but it works.

I forgot to mention: RRD stands for Round Robin Database. More information can be found at http://oss.oetiker.ch/rrdtool/

I chose to create a subdirectory for each I/O node and put the RRD files in those directories.

So.

  • mkdir /gpfs/scratch/frodo-io1
  • mkdir /gpfs/scratch/frodo-io2
  • mkdir /gpfs/scratch/frodo-io3
  • mkdir /gpfs/scratch/frodo-io4

I then created a short Perl script to create the database files.

```perl
#!/usr/bin/perl
#-------------------------------------------------------------------------
# Author Richard Hickey
#-------------------------------------------------------------------------

use strict;
use warnings;
use RRDs;

print `clear`, "\n";

# One RRD file per LUN: dm-0 through dm-91 (92 files)
for my $rrd_file (0 .. 91) {
    RRDs::create("/gpfs/scratch/temp/dm-$rrd_file.rrd",
        "--start", 1393346138,
        "--step",  300,                # expect one update every 5 minutes
        # DS:<name>:GAUGE:<heartbeat>:<min>:<max>, one per iostat column
        'DS:rrqms:GAUGE:1200:U:U',
        'DS:wrqms:GAUGE:1200:U:U',
        'DS:rps:GAUGE:1200:U:U',
        'DS:wps:GAUGE:1200:U:U',
        'DS:readMBs:GAUGE:1200:U:U',
        'DS:writeMBs:GAUGE:1200:U:U',
        'DS:avgrqsz:GAUGE:1200:U:U',
        'DS:avgqsz:GAUGE:1200:U:U',
        'DS:await:GAUGE:1200:U:U',
        'DS:svctm:GAUGE:1200:U:U',
        'DS:util:GAUGE:1200:U:U',
        # RRA:AVERAGE:<xff>:<steps per row>:<rows>
        'RRA:AVERAGE:0.5:1:288',       # 5-minute averages for 1 day
        'RRA:AVERAGE:0.5:3:672',       # 15-minute averages for 7 days
        'RRA:AVERAGE:0.5:24:730',      # 2-hour averages for about 60 days
    );
    my $err = RRDs::error;
    if ($err) { print "problem creating dm-$rrd_file.rrd: $err\n"; }
}
```
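The three RRA lines decide how much history each file keeps. With a 300-second step, each row of an archive covers step × steps-per-row seconds, and the archive holds that many rows. The heartbeat of 1200 seconds means that if no update arrives within 20 minutes the value is stored as unknown, and the xff of 0.5 marks a consolidated row as unknown when more than half of its 5-minute points are unknown. The arithmetic below is a small worked sketch; the helper name rra_span is made up for illustration.

```perl
#!/usr/bin/perl
# Worked arithmetic for the RRA lines in the create script:
#   seconds per row = step * steps-per-row
#   history kept    = seconds per row * rows
use strict;
use warnings;

my $step = 300;   # --step 300 from the create script

# rra_span($step, $steps_per_row, $rows) -> (seconds per row, seconds kept)
# (hypothetical helper, for illustration only)
sub rra_span {
    my ($step, $steps_per_row, $rows) = @_;
    my $resolution = $step * $steps_per_row;
    return ($resolution, $resolution * $rows);
}

# [steps per row, rows] taken from RRA:AVERAGE:0.5:<steps>:<rows>
for my $rra ([1, 288], [3, 672], [24, 730]) {
    my ($resolution, $kept) = rra_span($step, @$rra);
    printf "RRA:AVERAGE:0.5:%d:%d -> %d-minute averages, %.1f days of history\n",
        @$rra, $resolution / 60, $kept / 86400;
}
```

This prints 5-minute averages for 1.0 days, 15-minute averages for 7.0 days, and 120-minute averages for 60.8 days.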
Running the creation script produced 92 RRD files, dm-0 through dm-91. I then copied these files into each of the four I/O node subdirectories, which gave me the Round Robin Databases I could start populating.
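The mkdir and copy steps can also be scripted. The following is a sketch, not the author's procedure: the helper distribute_templates is hypothetical, and it assumes the template files are named dm-N.rrd and that each I/O node gets its own subdirectory under a common base directory.

```perl
#!/usr/bin/perl
# Sketch (hypothetical helper): create one subdirectory per I/O node and
# copy the template RRD files into each of them.
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(make_path);

# distribute_templates($src_dir, $dst_base, \@nodes, $first, $last)
# copies $src_dir/dm-$first.rrd .. dm-$last.rrd into $dst_base/<node>/
sub distribute_templates {
    my ($src_dir, $dst_base, $nodes, $first, $last) = @_;
    for my $node (@$nodes) {
        my $dir = "$dst_base/$node";
        make_path($dir);                  # like mkdir -p; fine if it exists
        for my $n ($first .. $last) {
            my $file = "dm-$n.rrd";
            copy("$src_dir/$file", "$dir/$file")
                or die "copy $src_dir/$file to $dir failed: $!";
        }
    }
}

# Example call matching the layout in this post (not executed here):
#   distribute_templates('/gpfs/scratch/temp', '/gpfs/scratch',
#                        [map { "frodo-io$_" } 1 .. 4], 0, 91);
```

Because RRD files are fixed-size binary files, a plain file copy gives every node an identical, empty database to start from.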

To populate the databases and start collecting the information, I used the following Perl script and added a crontab entry for it in /etc/cron.d so that it runs once a day. Each run gathers statistics every 5 minutes, 288 times: 288 × 5 minutes = 1,440 minutes = 24 hours.
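Files in /etc/cron.d use crontab syntax with an extra user field, so the Perl script itself lives elsewhere and the cron.d file only schedules it. A minimal entry might look like this; the file name, script path, and the choice of root as the user are assumptions.

```
# /etc/cron.d/iostat2rrd   (hypothetical file name)
# minute hour day-of-month month day-of-week user command
0 0 * * * root /usr/local/bin/iostat2rrd.pl
```

Started at midnight, iostat prints its first report immediately and its 288th report 287 × 300 s = 86,100 s (23 h 55 min) later, so each day's run finishes before the next one starts. Keep in mind that iostat's first report covers the time since boot rather than the preceding 5 minutes, so that first value stored each day is a since-boot average.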

```perl
#!/usr/bin/perl
#-------------------------------------------------------------------------
# Author Richard Hickey
# Date 25 February 2014
#-------------------------------------------------------------------------

use strict;
use warnings;
use RRDs;

# No `clear` here: the script runs from cron, without a terminal.

#-------------------------------------------------------------------------
# layout of iostat data
# lun rrqms wrqms rps wps readMBs writeMBs avgrqsz avgqsz await svctm util
#-------------------------------------------------------------------------

#-------------------------------------------------------------------------
# set up some variables to use
#-------------------------------------------------------------------------
my $hostname = `/bin/hostname -s`;
chomp($hostname);
my @devices = map { "dm-$_" } 1 .. 91;     # the LUNs to monitor

#-------------------------------------------------------------------------
# run iostat (300-second interval, 288 reports = 24 hours) and read it
#-------------------------------------------------------------------------
open(my $iostat, '-|', '/usr/bin/iostat', '-dmtx', @devices, 300, 288)
    or die "Can't run iostat: $!";

#-------------------------------------------------------------------------
# walk through the output and parse the data
#-------------------------------------------------------------------------
while (my $line = <$iostat>) {
    chomp $line;
    next unless $line =~ /^dm-/;            # skip timestamps, headers, blanks

    my $now = time;                          # seconds since the epoch
    my ($lun, @stats) = split ' ', $line;    # LUN name + 11 statistics
    my $rrd = "/site/GPFS/iostats/$hostname/$lun.rrd";
    # print "$rrd $now:", join(':', @stats), "\n";   # uncomment to debug

    #---------------------------------------------------------------------
    # update the rrd databases
    #---------------------------------------------------------------------
    RRDs::update($rrd, join(':', $now, @stats));
    my $err = RRDs::error;
    if ($err) { print "problem updating $lun.rrd: $err\n"; }
}
close $iostat;
```
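The parsing step in the loop can be pulled out into a small function that is easy to test without iostat or RRDs installed. This is a sketch: the helper name iostat_to_update is hypothetical, and it assumes the 12-column iostat -x layout listed in the script's comment (the device name plus 11 statistics, one per data source). Newer sysstat releases print a different set of -x columns, so the column-count check makes a layout change show up as skipped lines instead of values landing in the wrong data sources.

```perl
#!/usr/bin/perl
# Sketch: turn one line of `iostat -dmx` output into a LUN name and an
# rrdtool update string "timestamp:v1:...:v11".
use strict;
use warnings;

# iostat_to_update($line, $timestamp) -> ($lun, $update) or empty list
# (hypothetical helper; assumes device + 11 statistics per line)
sub iostat_to_update {
    my ($line, $timestamp) = @_;
    return unless $line =~ /^dm-/;          # skip headers, timestamps, blanks
    my ($lun, @stats) = split ' ', $line;   # split on runs of whitespace
    return unless @stats == 11;             # must match the 11 DS definitions
    return ($lun, join(':', $timestamp, @stats));
}

# Possible use inside the collector loop (not executed here):
#   if (my ($lun, $update) = iostat_to_update($line, time)) {
#       RRDs::update("/site/GPFS/iostats/$hostname/$lun.rrd", $update);
#   }
```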

Great. Now I am gathering the I/O statistics for each LUN on each I/O node at 5-minute intervals. A nice property of RRD files is that they never grow in size: each archive has a fixed number of rows set when the file is created, and new data overwrites the oldest rows. That is one of the main reasons to use them.
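To illustrate why the files stay the same size, here is a toy round-robin store in plain Perl: a fixed array of slots where each new value overwrites the oldest one. It only illustrates the idea; rrdtool's on-disk format and its consolidation logic are more involved, and the helper rr_store is hypothetical.

```perl
#!/usr/bin/perl
# Toy round-robin archive: a fixed number of slots, where each new value
# overwrites the oldest one. Illustration only, not rrdtool's file format.
use strict;
use warnings;

# rr_store(\%archive, $value): write $value into the next slot and advance,
# wrapping to slot 0 after the last slot.
sub rr_store {
    my ($archive, $value) = @_;
    $archive->{slots}[ $archive->{next} ] = $value;
    $archive->{next} = ($archive->{next} + 1) % $archive->{size};
}

# 288 slots, like RRA:AVERAGE:0.5:1:288 (one day of 5-minute values)
my %archive = (size => 288, slots => [], next => 0);
rr_store(\%archive, $_) for 1 .. 1000;    # far more samples than slots

# Still 288 slots; the oldest surviving value sits at $archive{next}.
printf "slots: %d, oldest: %d, newest: %d\n",
    scalar @{ $archive{slots} },
    $archive{slots}[ $archive{next} ],
    $archive{slots}[ ($archive{next} - 1) % $archive{size} ];
```

After 1,000 samples this prints "slots: 288, oldest: 713, newest: 1000": storage stays at 288 entries no matter how many samples arrive.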

Next we’ll go over how to pull all this data together in a nice graphical form.
