So, I'm going to put this up on my site just so that I have a record of it and so that others can use these scripts as an example. Understand, these scripts are crude to say the least, but they work. The final goal here is to automatically create graphs showing the current and historical performance of our GPFS file systems on a disk-by-disk basis. I've decided to use some Perl, Python, Bash, and RRD tools to do this. Ya, go figure.

This is going to end up being several posts long; there is a lot of data. First, the background on what and why.

Here is the scenario. I have a large Linux cluster running IBM GPFS. Picture 300+ nodes connecting across QDR InfiniBand to 4 I/O nodes, each of which is connected to the storage subsystems with two 8Gb fibre links. Each storage subsystem also has 2 heads for redundancy, so there are up to 4 different routes to each storage LUN from each I/O node. Each GPFS file system has between 4 and 16 LUNs, and there are 4-8 file systems per cluster. So 4 routes times 4 I/O nodes times 16 LUNs times 8 file systems = big mess.

Now, Red Hat does try to make this a bit easier with something called dynamic multipathing. Basically, it assigns a "dm" name to each LUN and hides all the different pathing options behind it. Here's an example of what one looks like:

mpathbd (360001ff08020b000000002e469560164a) dm-53 DDN,SFA 10000
size=2.1T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=100 status=active
| `- 4:0:12:120 sdsj 135:368 active ready  running
|-+- policy='round-robin 0' prio=90 status=enabled
| `- 3:0:14:120 sdjl 8:496   active ready  running
|-+- policy='round-robin 0' prio=20 status=enabled
| `- 3:0:5:120  sdmm 69:480  active ready  running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 4:0:4:120  sdwz 70:752  active ready  running

What this is showing is that there are 4 paths to the 2.1TB LUN. The system (without multipathing) can access it as /dev/sdsj, /dev/sdjl, /dev/sdmm, or /dev/sdwz, or simply as /dev/dm-53. You might be wondering why bother with multipathing at all. Well, what happens if a fibre link goes down? I lose 2 of the 4 /dev/sdxx devices. If I pointed to them directly, I'd have a disk failure. Multipathing, however, automagically load balances across the paths and fails over to a working path if one of them dies.
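
If you want that mapping in script form, here's a rough Python sketch of the idea (not one of my actual scripts); it just assumes `multipath -ll` output formatted like the example above:

#!/usr/bin/env python
# Rough sketch: map each dm device to the sd devices behind it by
# parsing `multipath -ll` output like the example above.
import re
import subprocess

def dm_to_paths():
    """Return a dict like {'dm-53': ['sdsj', 'sdjl', 'sdmm', 'sdwz'], ...}"""
    out = subprocess.check_output(['multipath', '-ll']).decode()
    mapping = {}
    current_dm = None
    for line in out.splitlines():
        # Header lines start at column 0: "mpathbd (36000...) dm-53 DDN,SFA 10000"
        match = re.search(r'\b(dm-\d+)\b', line)
        if match and not line.startswith((' ', '|', '`')):
            current_dm = match.group(1)
            mapping[current_dm] = []
            continue
        # Path lines carry the SCSI H:C:T:L tuple followed by the sd name:
        # "| `- 4:0:12:120 sdsj 135:368 active ready running"
        match = re.search(r'\d+:\d+:\d+:\d+\s+(sd\w+)', line)
        if match and current_dm:
            mapping[current_dm].append(match.group(1))
    return mapping

if __name__ == '__main__':
    for dm, paths in sorted(dm_to_paths().items()):
        print('%s %s' % (dm, ' '.join(paths)))

Nothing fancy; it just keys off the fact that header lines start at column zero and path lines carry the SCSI H:C:T:L tuple.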

Okay, enough about multipathing and why we have it; suffice it to say that we do. So, easy peasy, right? Ya, not so much. Since dynamic multipathing is "dynamic," the /dev/dm-xx name can change on reboot or whenever we make any major changes to the system. This means the dm that belongs to a LUN in the beta file system today may end up belonging to a LUN in the scratch file system after a reboot, or not. Really? Really? Why?

However, all is not lost. GPFS has a nice little command you can run (it's slow, so beware) that will give you a mapping of all the dm devices, by I/O node, for a given file system:

/usr/lpp/mmfs/bin/mmlsdisk gpfs_scratch -M

Disk name        IO performed on node    Device        Availability
---------------  ----------------------  ------------  ------------
ddn7_data40_nsd  frodo-io3               /dev/dm-70    up
ddn7_data41_nsd  frodo-io4               /dev/dm-34    up
ddn7_data42_nsd  frodo-io5               /dev/dm-66    up
ddn7_data43_nsd  frodo-io6               /dev/dm-63    up
ddn7_data91_nsd  frodo-io5               /dev/dm-35    up
ddn7_meta11_nsd  frodo-io6               /dev/dm-46    up
ddn7_meta12_nsd  frodo-io3               /dev/dm-12    up

This shows the LUN name (ddn7_data40_nsd), the I/O node it's talking to (frodo-io3), the dm device on that node (/dev/dm-70), and its status (up).
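
Since we'll want this mapping in our scripts later, here's another rough sketch (again, not my production code) that scrapes that output into a list of tuples; the only assumption is that the columns look like the example above:

#!/usr/bin/env python
# Rough sketch: turn `mmlsdisk <fs> -M` output into (nsd, io_node, device) tuples.
import subprocess

MMLSDISK = '/usr/lpp/mmfs/bin/mmlsdisk'

def nsd_map(filesystem):
    """Parse the (slow) mmlsdisk -M output for one file system."""
    out = subprocess.check_output([MMLSDISK, filesystem, '-M']).decode()
    disks = []
    for line in out.splitlines():
        fields = line.split()
        # Data rows have exactly 4 columns with a /dev/ device in the third.
        if len(fields) == 4 and fields[2].startswith('/dev/'):
            nsd, io_node, device, availability = fields
            if availability == 'up':
                disks.append((nsd, io_node, device))
    return disks

if __name__ == '__main__':
    for nsd, node, dev in nsd_map('gpfs_scratch'):
        print('%s %s %s' % (nsd, node, dev))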

Now we understand how dynamic multipathing works, and we know a way to get GPFS to show us which dm goes to which LUN on which I/O node. We're making progress here.

So, at this point we have the ability to figure out which LUN on which I/O node belongs to which GPFS file system. Now let's start gathering data. I found it easiest to just gather the statistics on every dm on each I/O node and then separate them out into individual file systems later. The next post covers how I did that.
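
Just to give a rough idea of the kind of raw numbers we're after (this is not necessarily how I collected them; that's for the next post), the per-dm sector counters are sitting right there in /proc/diskstats:

#!/usr/bin/env python
# Raw per-dm counters straight from the kernel: in /proc/diskstats the
# third field is the device name, the sixth is sectors read, and the
# tenth is sectors written.

def dm_stats():
    """Return {'dm-53': (sectors_read, sectors_written), ...}"""
    stats = {}
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            if fields[2].startswith('dm-'):
                stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats

if __name__ == '__main__':
    for dm, (rd, wr) in sorted(dm_stats().items()):
        print('%s read_sectors=%s write_sectors=%s' % (dm, rd, wr))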

 
