difference to values in IB monitor #1

Open
mkluge opened this issue Sep 13, 2013 · 4 comments

Comments

@mkluge

mkluge commented Sep 13, 2013

Hi John,

do you still maintain xltop? I installed it on a Lustre 2.1.3 cluster and, in parallel, have a small script running that queries the IB ports. I see a large difference between the throughput reported by the IB monitor and by "xltop u s". The sum of the throughput over all IB ports of all OSS servers matches the sum of the throughput for all servers as reported by xltop; the numbers are just distributed differently across the servers. As the IB monitor only runs "perfquery -r" once a second, I trust its data more than xltop's. Do you have any idea how to debug this?

Regards, Michael
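
For readers who want to reproduce the IB side of the comparison, here is a minimal sketch of how one counter could be pulled out of `perfquery -r` output. It is not Michael's monitor script, which is not shown in this thread. The function name `counter_bytes` is hypothetical, and the counter name passed to it (for example `PortXmitData`) is an assumption that may differ between infiniband-diags versions. The sketch relies on the dot-padded `Name:....value` line layout and on IB data counters being counted in 4-byte words.

```shell
#!/bin/bash
# counter_bytes NAME  (hypothetical helper, not part of any tool)
# Reads perfquery-style "Name:.....value" lines on stdin and prints the
# value of counter NAME converted to bytes. IB data counters count
# 4-byte words, hence the factor of 4. Fails if NAME is not found.
counter_bytes() {
    awk -F: -v name="$1" '
        $1 == name {
            gsub(/\./, "", $2)          # strip the dot padding before the value
            printf "%.0f\n", $2 * 4     # words -> bytes, printed without exponent
            found = 1
        }
        END { exit !found }
    '
}
```

Since `perfquery -r` resets the counters after each read, a call such as `perfquery -r | counter_bytes PortXmitData` once per second would yield roughly bytes per second for that port, under the assumptions above.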

@jhammond
Owner

Hi Michael,

Sorry for the delay. I thought I responded to you over the weekend but it must have never been sent.

I haven't had any reason to update xltop in some time so I haven't been actively maintaining it.

The difference you see may be explained by the difference in sampling intervals or by the fact that xltop uses a moving average whereas perfquery -r will give you counter deltas. What values are you using for the tick, window, and interval in your xltop-master.conf?

Best,

John
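
To illustrate John's point about moving averages versus counter deltas, here is a minimal sketch that smooths a series of per-second deltas. The exact averaging that xltop performs is not described in this thread, so the plain windowed mean below is an assumption made purely for illustration, and `moving_average` is a hypothetical helper name.

```shell
#!/bin/bash
# moving_average WINDOW DELTA...  (hypothetical helper)
# Prints, for each sample, the integer mean of the last WINDOW deltas
# (fewer at the start of the series). A plain windowed mean is an
# assumption for illustration; it is not taken from the xltop source.
moving_average() {
    local window=$1; shift
    local -a samples=("$@")
    local -a out=()
    local i j start sum count
    for ((i = 0; i < ${#samples[@]}; i++)); do
        start=$((i - window + 1))
        ((start < 0)) && start=0
        sum=0
        for ((j = start; j <= i; j++)); do
            sum=$((sum + samples[j]))
        done
        count=$((i - start + 1))
        out+=("$((sum / count))")
    done
    echo "${out[*]}"
}
```

For example, `moving_average 5 0 0 5000 5000 5000 5000 5000` prints `0 0 1666 2500 3000 4000 5000`: the averaged value lags the raw deltas while the load ramps up, but converges to the same number once the load has been steady for a full window.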

@mkluge
Author

mkluge commented Sep 18, 2013

Hi John,

tick = 2
window = 5
interval=5

The benchmark runs for quite a long time (> 5 min) and shows the same values throughout. I have 4 servers, for which xltop shows 7, 5, 5, and 3 GB/s, while the live IB monitor and the live OST monitor (little scripts I wrote) both show 5 GB/s on all servers all the time.

Regards, Michael

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone: (+49) 351 463-34217
Fax: (+49) 351 463-37773
e-mail: [email protected]
WWW: http://www.tu-dresden.de/zih

@jhammond
Owner

Thanks Michael. I'll take a look at the code. In the meantime, could you try again with (tick, window, interval) = (1, 5, 5) and (5, 5, 5)?

@mkluge
Author

mkluge commented Sep 19, 2013

Hi John,

did that. I took a couple of screenshots (attached). The upper part of
the image shows "xltop u s", the middle part a 1 second interval sum of
the values in /proc/fs/lustre/obdfilter/scratch-*/stats on taurusoss2
and the lower part a 1 second interval dump of both IB interfaces of
taurusoss2 as well.

The oss2_write_phase* screenshots are very interesting. They were taken
at intervals of about 15-30 seconds, starting about 30 seconds after the
benchmark began writing. The values that xltop shows for taurusoss2 are
somehow never more than a factor of 2 away from the real value. I'm
attaching the oss monitor script, just to make sure ...

Regards, Michael

--- 8< -----------------------------------------------
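#!/bin/bash
# Print the summed write throughput of all OSTs on this OSS once per second.

# Last seen write_bytes value per OST, keyed by its /proc path.
# Must be an associative array because the keys are path strings.
declare -A OLD

for OST in /proc/fs/lustre/obdfilter/scratch-* ; do
    OLD[$OST]=0
done

while true ; do
    sleep 1
    SUM=0
    for OST in /proc/fs/lustre/obdfilter/scratch-* ; do
        NAME=$(echo "$OST" | cut -d / -f 6)
        # Field 7 of the write_bytes line is the cumulative number of bytes written.
        VAL=$(grep write_bytes "$OST/stats" | awk '{print $7}')
        OV=${OLD[$OST]}
        # Delta since the previous sample, converted from bytes to MB.
        # The first pass reports the total since the counters were last reset.
        DIFF=$(( (VAL - OV) / 1024 / 1024 ))
        #echo "$NAME: $DIFF MB/s"
        OLD[$OST]=$VAL
        SUM=$((SUM + DIFF))
    done
    echo "SUM : $SUM MB/s"
done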
--- 8< -----------------------------------------------
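
One thing worth ruling out when comparing monitors: the script above treats every loop iteration as exactly one second, although reading the stats files adds a little time on top of the `sleep 1`. A minimal sketch of a rate calculation that divides by the measured elapsed time instead is shown here. The helper name `rate_mb_per_s` and its millisecond argument are assumptions for illustration and are not part of Michael's script.

```shell
#!/bin/bash
# rate_mb_per_s OLD_BYTES NEW_BYTES ELAPSED_MS  (hypothetical helper)
# Prints the throughput in MB/s between two cumulative byte counter
# readings taken ELAPSED_MS milliseconds apart (integer arithmetic).
rate_mb_per_s() {
    local old=$1 new=$2 elapsed_ms=$3
    if (( elapsed_ms <= 0 )); then
        echo "elapsed time must be positive" >&2
        return 1
    fi
    # Scale to bytes per second first, then convert bytes to MB.
    echo $(( (new - old) * 1000 / elapsed_ms / 1024 / 1024 ))
}
```

The timestamps could come, for instance, from GNU `date +%s%3N` taken right next to each counter read. With 5 GiB written between two reads, `rate_mb_per_s 0 5368709120 1000` prints `5120`, while the same delta over 1250 ms gives `4096`.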
