-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSDK-9618: Bake in knowledge for rate based stats to FTDC parsing. #4658
base: main
Are you sure you want to change the base?
Changes from 8 commits
1cf65bd
5ee4e7a
3882e54
bb70f7e
61da580
56a4664
41d71a3
f8a1c9c
17f1461
df0896b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -132,10 +132,197 @@ func (gpw *gnuplotWriter) addPoint(timeSeconds int64, metricName string, metricV | |
writelnf(gpw.getDatafile(metricName), "%v %.5f", timeSeconds, metricValue) | ||
} | ||
|
||
func (gpw *gnuplotWriter) addFlatDatum(datum ftdc.FlatDatum) { | ||
// ratioMetric describes which two FTDC metrics that should be combined to create a computed | ||
// value. Such as "CPU %". Can also be used to express "requests per second". | ||
type ratioMetric struct { | ||
Numerator string | ||
// An empty string Denominator will use the datum read timestamp value for its denominator. For | ||
// graphing a per-second rate. | ||
Denominator string | ||
} | ||
|
||
// ratioMetricToFields is a global variable identifying the metric names that are to be graphed as | ||
// some ratio. The two members (`Numerator` and `Denominator`) refer to the suffix* of a metric | ||
// name. For example, `UserCPUSecs` will appear under `proc.viam-server.UserCPUSecs` as well as | ||
// `proc.modules.<foo>.UserCPUSecs`. If the `Denominator` is the empty string, the | ||
// `ratioReading.Time` value will be used. | ||
// | ||
// When computing rates for metrics across two "readings", we simply subtract the numerators and | ||
// denominator and divide the differences. We use the `windowSizeSecs` to pick which "readings" | ||
// should be compared. This creates a sliding window. We (currently) bias this window to better | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What does this mean? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you understand the idea of windowing -- but because the usage of So when we create a datapoint for a graph that is stating "what percent of the CPU this is process using", it's not helpful to talk about at that specific instance in time. It's either running right now (i.e: "100%") or it's not (i.e: "0%"). So instead we talk about CPU usage over some "window" of time. When one does So if a viam-server runs for 9 minutes doing nothing, that would say a number close to 0%. And if at that point a user then opened a video stream that stayed on the CPU for an entire minute (modulo some system scheduling overhead), So regarding the window, we have 2 extremes. What's happening this very instant and what's happened over the lifetime of the process. I don't like either of those for this case, so I chose something in between. I felt like a process that flips from completely idle to max CPU should see that reflected in a graph over the coarse of 5 seconds. I feel that choice skews more towards the "what's happening this instant" side of the scale. So "bias this window" refers to two properties:
The context where I hear "bias" used in this way is (not coincidentally) from my college OS class talking about the (in theory) linux CPU scheduler. Where it wants to prioritize running programs that used its entire time slice to do work. And it similarly has to age out older program behavior. Which gets modeled by some variable
So choosing r=1 gives a priority influenced completely by the last run. And choosing smaller values believe that a program's future behavior is better predicting by its more historical behavior. Sorry -- this was huge. Probably should have clarified offline. Definitely recommend a change (maybe just avoid mentioning what our window size preference is?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Big thanks for the thoughtful explanation here. The concept generally makes sense to me and I don't think I need much more in the way of documentation. I think if you're finding that 5s is a time window that's providing reasonable output wrt CPU usage, then it seems like a perfectly valid choice to me. |
||
// portray "recent" resource utilization. | ||
var ratioMetricToFields = map[string]ratioMetric{ | ||
"UserCPU": {"UserCPUSecs", "ElapsedTimeSecs"}, | ||
"SystemCPU": {"SystemCPUSecs", "ElapsedTimeSecs"}, | ||
// PerSec ratios use an empty string denominator. | ||
"HeadersProcessedPerSec": {"HeadersProcessed", ""}, | ||
} | ||
|
||
// ratioReading is a reading of two metrics described by `ratioMetric`. This is what will be graphed. | ||
type ratioReading struct { | ||
GraphName string | ||
// Seconds since epoch. | ||
Time int64 | ||
Numerator float32 | ||
Denominator float64 | ||
|
||
// `isRate` == false will multiply by 100 for displaying as a percentage. Otherwise just display | ||
// the quotient. | ||
isRate bool | ||
} | ||
|
||
func (rr ratioReading) toValue() (float32, error) { | ||
if math.Abs(rr.Denominator) < 1e-9 { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [nit] Do you have a pre-existing epsilon constant lying around in this package somewhere? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You're right in that I have one (that I completely forgot about!) But it's (right now) in a different package |
||
return 0.0, fmt.Errorf("divide by zero error, metric: %v", rr.GraphName) | ||
} | ||
|
||
if rr.isRate { | ||
return float32(float64(rr.Numerator) / rr.Denominator), nil | ||
} | ||
|
||
// A percentage | ||
return float32(float64(rr.Numerator) / rr.Denominator * 100), nil | ||
} | ||
|
||
func (rr *ratioReading) diff(other *ratioReading) ratioReading { | ||
return ratioReading{ | ||
rr.GraphName, | ||
rr.Time, | ||
rr.Numerator - other.Numerator, | ||
rr.Denominator - other.Denominator, | ||
rr.isRate, | ||
} | ||
} | ||
|
||
// pullRatios returns true if any of the `ratioMetrics` match the input `reading`. If so, a new | ||
// `ratioReading` is added to the `deferredReadings`. | ||
dgottlieb marked this conversation as resolved.
Show resolved
Hide resolved
|
||
func pullRatios( | ||
reading ftdc.Reading, | ||
readingTS int64, | ||
ratioMetrics map[string]ratioMetric, | ||
outDeferredReadings map[string]*ratioReading, | ||
) bool { | ||
ret := false | ||
for ratioMetricName, ratioMetric := range ratioMetrics { | ||
if strings.HasSuffix(reading.MetricName, ratioMetric.Numerator) { | ||
ret = true | ||
|
||
// `metricIdentifier` is expected to be of the form `rdk.foo_module.`. Leave the | ||
// trailing dot as we would be about to re-add it. | ||
metricIdentifier := strings.TrimSuffix(reading.MetricName, ratioMetric.Numerator) | ||
// E.g: `rdk.foo_module.User CPU%'. | ||
graphName := fmt.Sprint(metricIdentifier, ratioMetricName) | ||
if _, exists := outDeferredReadings[graphName]; !exists { | ||
outDeferredReadings[graphName] = &ratioReading{GraphName: graphName, Time: readingTS, isRate: ratioMetric.Denominator == ""} | ||
} | ||
|
||
outDeferredReadings[graphName].Numerator = reading.Value | ||
if ratioMetric.Denominator == "" { | ||
outDeferredReadings[graphName].Denominator = float64(readingTS) | ||
} | ||
|
||
continue | ||
} | ||
|
||
if ratioMetric.Denominator != "" && strings.HasSuffix(reading.MetricName, ratioMetric.Denominator) { | ||
ret = true | ||
|
||
// `metricIdentifier` is expected to be of the form `rdk.foo_module.`. Leave the | ||
// trailing dot as we would be about to re-add it. | ||
metricIdentifier := strings.TrimSuffix(reading.MetricName, ratioMetric.Denominator) | ||
// E.g: `rdk.foo_module.User CPU%'. | ||
graphName := fmt.Sprint(metricIdentifier, ratioMetricName) | ||
if _, exists := outDeferredReadings[graphName]; !exists { | ||
outDeferredReadings[graphName] = &ratioReading{GraphName: graphName, Time: readingTS, isRate: false} | ||
} | ||
|
||
outDeferredReadings[graphName].Denominator = float64(reading.Value) | ||
continue | ||
} | ||
} | ||
|
||
return ret | ||
} | ||
|
||
func (gpw *gnuplotWriter) addFlatDatum(datum ftdc.FlatDatum) map[string]*ratioReading { | ||
// deferredReadings is an accumulator for readings of metrics that are used together to create a | ||
// graph. Such as `UserCPUSecs` / `ElapsedTimeSecs`. | ||
deferredReadings := make(map[string]*ratioReading) | ||
|
||
// There are two kinds of metrics. "Simple" metrics that can simply be passed through to the | ||
// gnuplotWriter. And "ratio" metrics that combine two different readings. | ||
// | ||
// For the ratio metrics, we use a two pass algorithm. The first pass will pair together all of | ||
// the necessary numerators and denominators. The second pass will write the computed datapoint | ||
// to the underlying gnuplotWriter. | ||
// | ||
// Ratio metrics are identified by the metric suffix. E.g: `rdk.custom_module.UserCPUSecs` will | ||
// be classified as a (numerator in a) ratio metric. We must also take care to record the prefix | ||
// of the ratio metric, the "metric identifier". There may be `rdk.foo_module.UserCPUSecs` in | ||
// addition to `rdk.bar_modular.UserCPUSecs`. Which should create two CPU% graphs. | ||
for _, reading := range datum.Readings { | ||
// pullRatios will identify if the metric is a "ratio" metric. If so, we do not currently | ||
// know what to graph and `pullRatios` will accumulate the relevant information into | ||
// `deferredReadings`. | ||
isRatioMetric := pullRatios(reading, datum.ConvertedTime().Unix(), ratioMetricToFields, deferredReadings) | ||
if isRatioMetric { | ||
// Ratio metrics need to be compared to some prior ratio metric to create a data | ||
// point. We do not output any information now. We instead accumulate all of these | ||
// results to be later used. These are named "deferred values". | ||
continue | ||
} | ||
|
||
gpw.addPoint(datum.ConvertedTime().Unix(), reading.MetricName, reading.Value) | ||
} | ||
|
||
return deferredReadings | ||
} | ||
|
||
// Ratios are averaged over a "recent history". This window size refers to a time in seconds, but we | ||
// actually measure with respect to consecutive FTDC readings. The output value will use the system | ||
// clock difference to compute a correct rate. We just accept there may be fuzziness with respect to | ||
// how recent of a history we're actually using. | ||
// | ||
// Consider adding logging when two FTDC readings `windowSizeSecs` apart is not reflecting by their | ||
// system time difference. | ||
const windowSizeSecs = 5 | ||
|
||
// The deferredValues input is in FTDC reading order. On a responsive system, adjacent items in the | ||
// slice should be one second apart. | ||
func (gpw *gnuplotWriter) writeDeferredValues(deferredValues []map[string]*ratioReading, logger logging.Logger) { | ||
for idx, currReadings := range deferredValues { | ||
if idx == 0 { | ||
// The first element cannot be compared to anything. It would create a divide by zero | ||
// problem. | ||
continue | ||
} | ||
|
||
// `forCompare` is the index element to compare the "current" element pointed to by `idx`. | ||
forCompare := idx - windowSizeSecs | ||
if forCompare < 0 { | ||
// If we haven't | ||
forCompare = 0 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Won't this create a divide by zero issue? Comparing with self? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This condition (and the above one) isn't written in the most readable way: https://github.com/viamrobotics/rdk/pull/4658/files#diff-a6de85d19062a282c0d7f2c427b26f6598fcf073ef34752a97ec02b053a62f18R295-R296 edit the above was perhaps cryptic -- my point is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I sat and thought about how to change this to make it more obvious. I failed. I did realize that the abstraction of There can be other "metrics that are computed as a ratio of things" that are not time based. And consequently don't need this windowing logic. And therefore need not worry about a denominator being 0 after some subtraction. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Gotcha maybe you can confirm my understanding: is this lie setting the value we should compare to to the earliest possible recorded value in the case that 5 seconds have not actually elapsed? As for the concern about |
||
} | ||
|
||
prevReadings := deferredValues[forCompare] | ||
for metricName, currRatioReading := range currReadings { | ||
var diff ratioReading | ||
if prevratioReading, exists := prevReadings[metricName]; exists { | ||
diff = currRatioReading.diff(prevratioReading) | ||
} else { | ||
logger.Infow("Deferred value missing a previous value to diff", | ||
"metricName", metricName, "time", currRatioReading.Time) | ||
continue | ||
} | ||
|
||
value, err := diff.toValue() | ||
if err != nil { | ||
// The denominator did not change -- divide by zero error. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [nit] Might as well report the created error here, no? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good idea |
||
continue | ||
} | ||
gpw.addPoint(currRatioReading.Time, metricName, value) | ||
} | ||
} | ||
} | ||
|
||
// Render runs the compiler and invokes gnuplot, creating an image file. | ||
|
@@ -216,11 +403,14 @@ func main() { | |
graphOptions := defaultGraphOptions() | ||
for { | ||
if render { | ||
deferredValues := make([]map[string]*ratioReading, 0) | ||
gpw := newGnuPlotWriter(graphOptions) | ||
for _, flatDatum := range data { | ||
gpw.addFlatDatum(flatDatum) | ||
deferredValues = append(deferredValues, gpw.addFlatDatum(flatDatum)) | ||
} | ||
|
||
gpw.writeDeferredValues(deferredValues, logger) | ||
|
||
gpw.Render() | ||
} | ||
render = true | ||
|
@@ -283,6 +473,8 @@ func main() { | |
case strings.HasPrefix(cmd, "reset range"): | ||
graphOptions.minTimeSeconds = 0 | ||
graphOptions.maxTimeSeconds = math.MaxInt64 | ||
case cmd == "refresh" || cmd == "r": | ||
nolintPrintln("Refreshing graphs with new data") | ||
case len(cmd) == 0: | ||
render = false | ||
default: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart to use suffixes here to allow one mapping in
ratioMetricToFields
to handle bothviam-server
and module CPU usage calculations...