Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-9618: Bake in knowledge for rate based stats to FTDC parsing. #4658

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

dgottlieb
Copy link
Member

No description provided.

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 30, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 30, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 30, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 31, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 31, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 31, 2024
@dgottlieb dgottlieb requested a review from benjirewis December 31, 2024 19:35
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 2, 2025
@dgottlieb
Copy link
Member Author

Sample output
plot

Copy link
Member

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Some initial comments.

// ratioMetricToFields is a global variable identifying the metric names that are to be graphed as
// some ratio. The two members (`Numerator` and `Denominator`) refer to the suffix* of a metric
// name. For example, `UserCPUSecs` will appear under `proc.viam-server.UserCPUSecs` as well as
// `proc.modules.<foo>.UserCPUSecs`. If the `Denominator` is the empty string, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart to use suffixes here to allow one mapping in ratioMetricToFields to handle both viam-server and module CPU usage calculations...

//
// When computing rates for metrics across two "readings", we simply subtract the numerators and
// denominator and divide the differences. We use the `windowSizeSecs` to pick which "readings"
// should be compared. This creates a sliding window. We (currently) bias this window to better
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bias this window

What does this mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you understand the idea of windowing -- but because the usage of bias is heavily influenced, I'll deconstruct the general idea first:

So when we create a datapoint for a graph that is stating "what percent of the CPU this is process using", it's not helpful to talk about at that specific instance in time. It's either running right now (i.e: "100%") or it's not (i.e: "0%"). So instead we talk about CPU usage over some "window" of time. When one does ps aux, the CPU percentage there is over the lifetime of the process.

So if a viam-server runs for 9 minutes doing nothing, that would say a number close to 0%. And if at that point a user then opened a video stream that stayed on the CPU for an entire minute (modulo some system scheduling overhead), ps aux wouldn't report a usage of 100%, but rather slowly go up from 0% -> 10% over the course of that one minute.

So regarding the window, we have 2 extremes. What's happening this very instant and what's happened over the lifetime of the process. I don't like either of those for this case, so I chose something in between. I felt like a process that flips from completely idle to max CPU should see that reflected in a graph over the coarse of 5 seconds.

I feel that choice skews more towards the "what's happening this instant" side of the scale. So "bias this window" refers to two properties:

  • This is a human choice. There are other valid options.
  • The choice is not a binary A or B. But rather a relatively continuous range of options to select from.

The context where I hear "bias" used in this way is (not coincidentally) from my college OS class talking about the (in theory) linux CPU scheduler. Where it wants to prioritize running programs that used its entire time slice to do work. And it similarly has to age out older program behavior. Which gets modeled by some variable r[ecency bias]:

new_priority = (1-r)old_priority + (r)percent_of_timeslice_used_in_last_run

So choosing r=1 gives a priority influenced completely by the last run. And choosing smaller values believe that a program's future behavior is better predicting by its more historical behavior.

Sorry -- this was huge. Probably should have clarified offline. Definitely recommend a change (maybe just avoid mentioning what our window size preference is?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big thanks for the thoughtful explanation here. The concept generally makes sense to me and I don't think I need much more in the way of documentation. I think if you're finding that 5s is a time window that's providing reasonable output wrt CPU usage, then it seems like a perfectly valid choice to me.

}

func (rr ratioReading) toValue() (float32, error) {
if math.Abs(rr.Denominator) < 1e-9 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Do you have a pre-existing epsilon constant lying around in this package somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right in that I have one (that I completely forgot about!)

But it's (right now) in a different package

ftdc/cmd/parser.go Outdated Show resolved Hide resolved

value, err := diff.toValue()
if err != nil {
// The denominator did not change -- divide by zero error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Might as well report the created error here, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

forCompare := idx - windowSizeSecs
if forCompare < 0 {
// If we haven't
forCompare = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this create a divide by zero issue? Comparing with self?

Copy link
Member Author

@dgottlieb dgottlieb Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition (and the above one) isn't written in the most readable way: https://github.com/viamrobotics/rdk/pull/4658/files#diff-a6de85d19062a282c0d7f2c427b26f6598fcf073ef34752a97ec02b053a62f18R295-R296

edit the above was perhaps cryptic -- my point is idx == forCompare can not happen. Because the above if-statement skips the whole idx == 0 case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sat and thought about how to change this to make it more obvious. I failed.

I did realize that the abstraction of ratioMetrics (as I describe them) is incorrect. All of this windowing code only makes sense for time-based things. And in that case we can be (fairly) confident the denominator is always increasing.

There can be other "metrics that are computed as a ratio of things" that are not time based. And consequently don't need this windowing logic. And therefore need not worry about a denominator being 0 after some subtraction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha maybe you can confirm my understanding: is this lie setting the value we should compare to to the earliest possible recorded value in the case that 5 seconds have not actually elapsed?

As for the concern about ratioMetric vs something like rateMetric, I don't think it matters too much for now. If we start adding more non-time-based ratio metrics, then we could probably revisit this (unexported) naming...

Co-authored-by: Benjamin Rewis <[email protected]>
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 2, 2025
Copy link
Member Author

@dgottlieb dgottlieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will push some changes tomorrow

}

func (rr ratioReading) toValue() (float32, error) {
if math.Abs(rr.Denominator) < 1e-9 {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right in that I have one (that I completely forgot about!)

But it's (right now) in a different package

forCompare := idx - windowSizeSecs
if forCompare < 0 {
// If we haven't
forCompare = 0
Copy link
Member Author

@dgottlieb dgottlieb Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition (and the above one) isn't written in the most readable way: https://github.com/viamrobotics/rdk/pull/4658/files#diff-a6de85d19062a282c0d7f2c427b26f6598fcf073ef34752a97ec02b053a62f18R295-R296

edit the above was perhaps cryptic -- my point is idx == forCompare can not happen. Because the above if-statement skips the whole idx == 0 case.


value, err := diff.toValue()
if err != nil {
// The denominator did not change -- divide by zero error.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

@viambot viambot removed the safe to test This pull request is marked safe to test from a trusted zone label Jan 3, 2025
@dgottlieb dgottlieb requested a review from benjirewis January 3, 2025 20:15
@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Jan 3, 2025
Copy link
Member

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % my one question to check my understanding.

//
// When computing rates for metrics across two "readings", we simply subtract the numerators and
// denominator and divide the differences. We use the `windowSizeSecs` to pick which "readings"
// should be compared. This creates a sliding window. We (currently) bias this window to better
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big thanks for the thoughtful explanation here. The concept generally makes sense to me and I don't think I need much more in the way of documentation. I think if you're finding that 5s is a time window that's providing reasonable output wrt CPU usage, then it seems like a perfectly valid choice to me.

forCompare := idx - windowSizeSecs
if forCompare < 0 {
// If we haven't
forCompare = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha maybe you can confirm my understanding: is this lie setting the value we should compare to to the earliest possible recorded value in the case that 5 seconds have not actually elapsed?

As for the concern about ratioMetric vs something like rateMetric, I don't think it matters too much for now. If we start adding more non-time-based ratio metrics, then we could probably revisit this (unexported) naming...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants