You are currently browsing the tag archive for the ‘gpfs’ tag.

Here are a couple of examples of what the charts look like. Each file system has similar charts covering 12 hours, 24 hours, 48 hours, 1 week, and 1 month.

I’ve removed part of the LUN names for obfuscation purposes.

[Chart: 24 hour graph of LUN utilization]

[Chart: 24 hour graph of average wait time]

 


Here is the main script I use to parse the /gpfs/scratch/*.tmp files for which I/O nodes have which dm's associated with which file system, and then use that data to create the multitude of graphs.

I make graphs for 12 hours (one Navy watch), 24 hours, 48 hours, one week, and one month. There are two main graphs I'm creating right now: the average wait graph and the % utilization graph. Also, if you delve into the code you will see I search for 'data' in the LUN name so that the metadata LUNs don't get added to the charts. It just keeps things cleaner.

FYI: I've modified the scripts to remove any reference to the systems where I work. I don't think I've introduced any errors in the process, but it's definitely possible that I have.
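For orientation, here is the directory layout the script assumes (reconstructed from the paths in the code below; adjust for your own site):

/gpfs/scratch/<filesystem>.tmp                  # LUN-to-node mapping files created by the bash script further down
/gpfs/scratch/<io-node>/<dm>.rrd                # one RRD per dm device, kept in a subdirectory per I/O node
/var/www/html/iostats/{12,24,48,week,month}/    # output PNGs, one subdirectory per time span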

#!/usr/bin/python
# written by Richard Hickey
# 20 March 2014
# This script will read the lun layout files /gpfs/scratch/*.tmp
# and then create the utilization and average wait graphs in /var/www/html/iostats

import re
import sys
import rrdtool

#-------------------------------------------------------------------------------
# Set up an array with all of the file systems to parse through
# Set up a dictionary called filesystem for human readable names
#-------------------------------------------------------------------------------
myGPFSArray = ["gpfs_alpha", "gpfs_beta", "gpfs_ops", "gpfs_scratch"]
filesystem = {'gpfs_alpha': 'Alpha', 'gpfs_beta': 'Beta', 'gpfs_ops': 'Ops', 'gpfs_scratch': 'Scratch'}

#-------------------------------------------------------------------------------
# This function opens the gpfs lun mapping configuration file
# and fills in a data array with LUN, host, dm, and state (state isn't used)
#-------------------------------------------------------------------------------
def getData(GPFSFileSystem):
    try:
        myFile = open('/gpfs/scratch/' + GPFSFileSystem + '.tmp', 'r') # open the config file
        myConfigArray = [] # initialize the array
        for line in myFile: # walk through the file line by line
            line = line.strip() # remove the newline character
            myline = line.split() # break the line into pieces on whitespace
            myConfigArray.append(myline)
        myFile.close() # close the config file
        return myConfigArray # return an array consisting of the data from the config file

    except IOError:
        print 'Could not open file /gpfs/scratch/' + GPFSFileSystem + '.tmp'
        return [] # return an empty list so the main loop can carry on

#-------------------------------------------------------------------------------
# Create the Graph routine
#-------------------------------------------------------------------------------
def GraphCreate(lunData, areaData, graphtype):
    title = ['12 Hours', 'One Day', '2 Days', 'One Week', 'One Month']
    subpath = ['12', '24', '48', 'week', 'month']
    path = '/var/www/html/iostats/'
    start = ['-12h', '-24h', '-48h', '-1w', '-1m']
    horizontalRule = 'HRULE:90#000000:'

    #---------------------------------------------------------------------------
    # set some parameters based on the graph type
    #---------------------------------------------------------------------------
    if graphtype == 'await':
        verticalLabel = 'Milliseconds'
        subtitle = ' Average Wait '
        filename = GPFSFileSystem + '_await.png'
        upperLimit = '80'
        lowerLimit = '0'
    if graphtype == 'util':
        verticalLabel = '%'
        subtitle = ' % Utilization '
        filename = GPFSFileSystem + '_data.png'
        upperLimit = '100'
        lowerLimit = '0'

    #---------------------------------------------------------------------------
    # Create the Graph
    #---------------------------------------------------------------------------
    for count in range(5): # walk through the five chart types
        fullpath = path + subpath[count] + '/' + filename
        fulltitle = '/gpfs/' + filesystem[GPFSFileSystem] + subtitle + title[count]
        rrdtool.graph(fullpath,
            '--title', fulltitle,
            '--imgformat', 'PNG',
            '--width', '800',
            '--height', '400',
            '--vertical-label', verticalLabel,
            '--start', start[count],
            '--upper-limit', upperLimit,
            '--lower-limit', lowerLimit,
            horizontalRule,
            lunData,
            areaData)


#-------------------------------------------------------------------------------
# Main routine
#-------------------------------------------------------------------------------
for GPFSFileSystem in myGPFSArray:
    myConfigArray = getData(GPFSFileSystem)
    print 'Doing ' + GPFSFileSystem

    #---------------------------------------------------------------------------
    # Pull the individual components out of each line of the config file
    #---------------------------------------------------------------------------
    utilData = []
    awaitData = []
    areaData = []
    for line in myConfigArray: # each line is an array with LUN HOST DM STATE
        lunType = re.search(r'data', line[0]) # only the data LUNs; skips the metadata LUNs
        if lunType:
            tmplun = line[0]
            lun = tmplun.split('_')
            node = line[1]
            tmpdm = line[2]
            dm = tmpdm.split('/')
            x = 'DEF:' + lun[0] + '_' + lun[1] + '=/gpfs/scratch/' + node + '/' + dm[2] + '.rrd:util:AVERAGE'
            utilData.append(x) # this creates the utilData array with the DEF lines of the rrdgraph
            y = 'DEF:' + lun[0] + '_' + lun[1] + '=/gpfs/scratch/' + node + '/' + dm[2] + '.rrd:await:AVERAGE'
            awaitData.append(y) # this creates the awaitData array with the DEF lines of the rrdgraph

            # The following populates the AREA portion of the rrdgraph array named areaData
            # The primary reason to break these apart is just to set the colors differently
            if node == 'frodo-io3':
                z = 'AREA:' + lun[0] + '_' + lun[1] + '#421c52:' + lun[0] + '_' + lun[1]
                areaData.append(z)
            if node == 'frodo-io4':
                z = 'AREA:' + lun[0] + '_' + lun[1] + '#005500:' + lun[0] + '_' + lun[1]
                areaData.append(z)
            if node == 'frodo-io5':
                z = 'AREA:' + lun[0] + '_' + lun[1] + '#21b6a8:' + lun[0] + '_' + lun[1]
                areaData.append(z)
            if node == 'frodo-io6':
                z = 'AREA:' + lun[0] + '_' + lun[1] + '#3300ff:' + lun[0] + '_' + lun[1]
                areaData.append(z)

    #---------------------------------------------------------------------------
    # Call the function that creates the graphs
    #---------------------------------------------------------------------------
    GraphCreate(utilData, areaData, 'util') # call the graph creating function
    GraphCreate(awaitData, areaData, 'await') # call the graph creating function

Now it’s time to get to the meat of things. Here is a bash script that will create some tmp files containing which dm’s on which I/O nodes go with which file systems.

#!/bin/bash

/usr/lpp/mmfs/bin/mmlsconfig|grep /dev/|awk -F\/ '{print $3}'|while read fs
do
    echo "Creating the tmp file for ${fs}"
    /usr/lpp/mmfs/bin/mmlsdisk ${fs} -M |grep frodo > ${fs}.tmp
done

A few notes to make this easier to understand. The first major line is:

/usr/lpp/mmfs/bin/mmlsconfig|grep /dev/|awk -F\/ '{print $3}'|while read fs

mmlsconfig gives way more data than just a list of file systems, and I only want the file system names to feed into a different command. I could make a static list, but then if something changed it would take manual intervention to get it correct again. Better to do a few extra steps now and automate it. Since mmlsconfig gives too much information, I grep for /dev, which leaves just the file systems (e.g. /dev/gpfs_scratch). I then use awk -F\/ to split the line on the / character (the \ is so that awk doesn't treat the / as a special character) and grab the third field, which is just the file system name (gpfs_scratch).
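To make that concrete, here is roughly what the pipeline produces on a cluster like the one described in these posts (illustrative output; the file system names are the ones used in the graphing script above):

/usr/lpp/mmfs/bin/mmlsconfig | grep /dev/ | awk -F\/ '{print $3}'
# gpfs_alpha
# gpfs_beta
# gpfs_ops
# gpfs_scratch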

Now that I have just the file system name, I feed it into the mmlsdisk command. The -M option displays the underlying disk name on the I/O server node. I then output that information into a temp file, e.g. gpfs_scratch.tmp.

/usr/lpp/mmfs/bin/mmlsdisk ${fs} -M |grep frodo > ${fs}.tmp
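Each line of the resulting tmp file looks like the mmlsdisk -M listing shown elsewhere in these posts, something along these lines:

ddn7_data40_nsd frodo-io3               /dev/dm-70         up
ddn7_data41_nsd frodo-io4               /dev/dm-34         up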

Easy peasy. Now I have my configuration files containing which dm on which I/O node goes with which gpfs file system. It’s now time to write a script to pull all this information together and make a nice pretty graph out of it.

Create a single RRD file for each LUN. This is ugly but works.

I forgot to mention. RRD is Round Robin Database. Information can be found at http://oss.oetiker.ch/rrdtool/

I chose to create a subdirectory for each I/O node and put the RRD files in those directories:

  • mkdir /gpfs/scratch/frodo-io1
  • mkdir /gpfs/scratch/frodo-io2
  • mkdir /gpfs/scratch/frodo-io3
  • mkdir /gpfs/scratch/frodo-io4

I then created a short perl script to create the database files.

 

#!/usr/bin/perl
#-------------------------------------------------------------------------
# Author Richard Hickey
#-------------------------------------------------------------------------

use RRDs;
use strict;
use warnings;

print `clear` , "\n";

my $rrd_file;

for ($rrd_file = 0; $rrd_file <= 91; $rrd_file++) {
    RRDs::create("/gpfs/scratch/temp/dm-$rrd_file.rrd",
        "--start", 1393346138,
        "--step", 300,                   # one sample every 5 minutes
        'DS:rrqms:GAUGE:1200:U:U',
        'DS:wrqms:GAUGE:1200:U:U',
        'DS:rps:GAUGE:1200:U:U',
        'DS:wps:GAUGE:1200:U:U',
        'DS:readMBs:GAUGE:1200:U:U',
        'DS:writeMBs:GAUGE:1200:U:U',
        'DS:avgrqsz:GAUGE:1200:U:U',
        'DS:avgqsz:GAUGE:1200:U:U',
        'DS:await:GAUGE:1200:U:U',
        'DS:svctm:GAUGE:1200:U:U',
        'DS:util:GAUGE:1200:U:U',
        'RRA:AVERAGE:0.5:1:288',         # 5-minute averages for 1 day
        'RRA:AVERAGE:0.5:3:672',         # 15-minute averages for 1 week
        'RRA:AVERAGE:0.5:24:730',        # 2-hour averages for roughly 2 months
    );
    my $err = RRDs::error;
    if ($err) { print "problem creating dm-$rrd_file.rrd: $err\n"; }
}
This created 92 separate RRD files, dm-0 through dm-91. I then copied these files into each of the four I/O node subdirectories. This gave me the Round Robin Databases that I could then start populating.
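The copy step might look something like this sketch (paths as used elsewhere in these posts; the per-node directory names need to match the node names that mmlsdisk -M reports, since that is how the graphing script finds the RRDs):

# Illustrative: copy the freshly created template RRDs into each per-I/O-node directory
for nodedir in /gpfs/scratch/frodo-io*/; do
    cp /gpfs/scratch/temp/dm-*.rrd "${nodedir}"
done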

To populate the databases and start collecting the information, I used the following Perl script and drove it from /etc/cron.d so that it would run once a day; each run gathers statistics every 5 minutes, 288 times (288 × 5 minutes = 24 hours).
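A minimal /etc/cron.d entry for this could look like the following (the file name, script path, and log path here are hypothetical):

# /etc/cron.d/gpfs-iostats  -- illustrative example
# Start the collector once a day at midnight; the "300 288" arguments to iostat
# inside the script keep it sampling every 5 minutes for the next 24 hours.
0 0 * * * root /root/bin/gpfs_iostat_collect.pl >> /var/log/gpfs_iostat_collect.log 2>&1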

#!/usr/bin/perl

#-------------------------------------------------------------------------
# Author Richard Hickey
# Date 25 February 2014
#-------------------------------------------------------------------------

use RRDs;
use strict;
use warnings;
use POSIX qw(strftime);

print `clear` , "\n";

#-------------------------------------------------------------------------
# layout of iostat data
# lun rrqms wrqms rps wps readMBs writeMBs avgrqsz avgqsz await svctm util
#-------------------------------------------------------------------------

#-------------------------------------------------------------------------
# set up some variables to use
#-------------------------------------------------------------------------
my @get_data;
my $hostname = `/bin/hostname -s`; chomp($hostname);
my $err;

#-------------------------------------------------------------------------
# run iostat and pipe into IOSTAT
# (sample dm-1 through dm-91 every 300 seconds, 288 times = 24 hours)
#-------------------------------------------------------------------------
my $devices = join(' ', map { "dm-$_" } (1 .. 91));
open(IOSTAT, "/usr/bin/iostat -dmtx $devices 300 288 |") || die "Can't open iostat- $!";

#-------------------------------------------------------------------------
# walk through the output and parse the data
#-------------------------------------------------------------------------
while (<IOSTAT>) {
    chomp;
    if (/^dm-/) {
        my $now_string = strftime("%s", localtime(time));
        s/\s+/,/g;
        @get_data = split(/,/);
        # print "/gpfs/scratch/$hostname/$get_data[0].rrd $now_string:$get_data[1]:$get_data[2]:$get_data[3]:$get_data[4]:$get_data[5]:$get_data[6]:$get_data[7]:$get_data[8]:$get_data[9]:$get_data[10]:$get_data[11]\n";

        #-------------------------------------------------------------------
        # update the rrd databases
        # (this path matches where the graphing script reads the RRDs from)
        #-------------------------------------------------------------------
        RRDs::update("/gpfs/scratch/$hostname/$get_data[0].rrd",
            "$now_string:$get_data[1]:$get_data[2]:$get_data[3]:$get_data[4]:$get_data[5]:$get_data[6]:$get_data[7]:$get_data[8]:$get_data[9]:$get_data[10]:$get_data[11]");
        $err = RRDs::error;
        if ($err) { print "problem updating $get_data[0].rrd: $err\n"; }
    }
}
close IOSTAT;

Great. Now I am gathering the I/O statistics for each LUN on each I/O node at 5-minute intervals. The nice thing about RRD files is that they never grow in size: each archive holds a fixed number of samples (here 288 five-minute averages for a day, 672 fifteen-minute averages for a week, and 730 two-hour averages for roughly two months), and old data is simply overwritten. That's one of the best reasons to use them.

Next we’ll go over how to pull all this data together in a nice graphical form.

 

So, I'm going to put this up on my site just so that I have a record of it and so that others can use these scripts as an example. Understand, these scripts are crude to say the least, but they work. The final goal is to automatically create graphs showing the current and historical performance of our GPFS file systems on a disk-by-disk basis. I've decided to use some Perl, Python, Bash, and RRDtool to do this. Ya, go figure.

This is going to end up being several posts long. There is a lot of data. First the background on what and why.

Here is the scenario. I have a large Linux cluster running IBM GPFS. Picture 300+ nodes connecting across QDR InfiniBand to 4 I/O nodes, each connected to the storage subsystems with two 8Gb fibre links. Each storage subsystem also has two controller heads for redundancy, so there are up to 4 different routes to each storage LUN from each I/O node. Each GPFS file system has between 4 and 16 LUNs, and there are 4-8 file systems per cluster. So 4 routes × 4 I/O nodes × 16 LUNs × 8 file systems = up to 2,048 paths. A big mess.

Now, Red Hat does try to make it a bit easier with something called dynamic multipathing. Basically, it assigns a "dm" name to each LUN and hides all the different pathing options. Here's an example of what one looks like:

mpathbd (360001ff08020b000000002e469560164a) dm-53 DDN,SFA 10000
size=2.1T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| `- 4:0:12:120 sdsj 135:368 active ready  running
|-+- policy='round-robin 0' prio=90 status=enabled
| `- 3:0:14:120 sdjl 8:496   active ready  running
|-+- policy='round-robin 0' prio=20 status=enabled
| `- 3:0:5:120  sdmm 69:480  active ready  running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 4:0:4:120  sdwz 70:752  active ready  running
What this shows is that there are 4 paths to the 2.1TB LUN. The system (without multipathing) can access them as /dev/sdsj, /dev/sdjl, /dev/sdmm, and /dev/sdwz, or collectively as /dev/dm-53. You might be wondering why bother with multipathing at all. Well, what happens if I have a fibre link go down? I lose 2 of the 4 /dev/sdxx devices. If I pointed to them directly, I'd have a disk failure. However, multipathing automagically load balances and fails over to a working path in case of a failure.
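For reference, a listing like the one above is the kind of output the stock RHEL multipath tool prints; something along these lines should reproduce it (illustrative, run as root):

# Show the full multipath topology for every LUN
multipath -ll
# Or limit it to a single device if you know its mpath name
multipath -ll mpathbd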

Okay. Enough about multipathing and why we have it; suffice it to say that we do. So, easy peasy, right? Ya, not so much. Since dynamic multipathing is "dynamic", the /dev/dm-xx name can change on reboot or when we make any major changes to the system. This means that the dm for a LUN in the beta file system today may end up being a dm in the scratch file system after a reboot, or not. Really? Really? Why?

However, all is not lost. GPFS has a nice little command you can run (it's slow, so beware) that gives you a mapping of all the dm numbers, by I/O server, per file system.

/usr/lpp/mmfs/bin/mmlsdisk gpfs_scratch -M

Disk name       IO performed on node    Device             Availability
--------------  ----------------------  -----------------  ------------
ddn7_data40_nsd frodo-io3               /dev/dm-70         up
ddn7_data41_nsd frodo-io4               /dev/dm-34         up
ddn7_data42_nsd frodo-io5               /dev/dm-66         up
ddn7_data43_nsd frodo-io6               /dev/dm-63         up
ddn7_data91_nsd frodo-io5               /dev/dm-35         up
ddn7_meta11_nsd frodo-io6               /dev/dm-46         up
ddn7_meta12_nsd frodo-io3               /dev/dm-12         up

This shows the LUN name (ddn7_data40_nsd), the I/O node it's talking to (frodo-io3), the dm device on that node (/dev/dm-70), and the status (up).

Now, we understand how dynamic multipathing works, and we now know a way to get GPFS to show us which dm goes to which lun on which I/O node. We’re making progress here.

So, at this point we have the ability to figure out which LUN on which I/O node goes to which GPFS file system. Let's start gathering data. I found it easiest to gather statistics on every dm on each I/O node and then separate them out by file system later. The next post covers how I did that.

 

Just a quick follow-on to the last post. If you try to use native IB support with GPFS instead of IP over IB, you need to set verbsRdma to enable. You also need to set the RDMA device to use with the verbsPorts setting. If you get the following error in /var/mmfs/gen/mmfslog, then you didn't properly set verbsPorts:

VERBS RDMA starting.
VERBS RDMA library libibverbs.so (version >= 1.1) loaded and initialized.
VERBS RDMA library libibverbs.so unloaded.
VERBS RDMA failed to start, no verbsPorts defined.

Here is what I used for our single port HBA Infiniband setup.

mmchconfig verbsPorts="ib0/1"
mmchconfig verbsRdma=enable

Looking at my configuration, things now show up correctly:

[root@topaz-m1 ~]# mmlsconfig
Configuration data for cluster aaa.bbb.navy.mil:
----------------------------------------------------
clusterName aaa.bbb.navy.mil
clusterId 72452797724XXXXXXXX
clusterType lc
autoload no
minReleaseLevel 3.2.1.5
dmapiFileHandleSize 32
dataStructureDump /scratch/root/GPFS_Dump
maxblocksize 512K
maxFilesToCache 7000
maxMBpS 2048
maxStatCache 28000
pagepool 512M
verbsPorts ib0/1
verbsRdma enable

File systems in cluster aaa.bbb.navy.mil:
---------------------------------------------
/dev/gpfs_a
/dev/gpfs_b
/dev/gpfs_c
/dev/gpfs_d

Success. I am now running GPFS natively across InfiniBand on Linux; performance is much better this way.

In case you are wondering, here is our basic configuration:

40 Dell 1950 and 2950 nodes.

  • 1 head node (2950)
  • 4 I/O nodes (2950) with HBAs and HCAs
  • 32 compute nodes with HBAs
  • several misc nodes with HBAs

QLogic single-port HBAs
Red Hat Enterprise Linux 5.1
GPFS 3.2.1.9
DDN 9900
