Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-9618: Bake in knowledge for rate based stats to FTDC parsing. #4658

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
196 changes: 194 additions & 2 deletions ftdc/cmd/parser.go
Original file line number Diff line number Diff line change
Expand Up @@ -132,10 +132,197 @@ func (gpw *gnuplotWriter) addPoint(timeSeconds int64, metricName string, metricV
writelnf(gpw.getDatafile(metricName), "%v %.5f", timeSeconds, metricValue)
}

func (gpw *gnuplotWriter) addFlatDatum(datum ftdc.FlatDatum) {
// ratioMetric describes which two FTDC metrics that should be combined to create a computed
// value. Such as "CPU %". Can also be used to express "requests per second".
type ratioMetric struct {
Numerator string
// An empty string Denominator will use the datum read timestamp value for its denominator. For
// graphing a per-second rate.
Denominator string
}

// ratioMetricToFields is a global variable identifying the metric names that are to be graphed as
// some ratio. The two members (`Numerator` and `Denominator`) refer to the suffix* of a metric
// name. For example, `UserCPUSecs` will appear under `proc.viam-server.UserCPUSecs` as well as
// `proc.modules.<foo>.UserCPUSecs`. If the `Denominator` is the empty string, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart to use suffixes here to allow one mapping in ratioMetricToFields to handle both viam-server and module CPU usage calculations...

// `ratioReading.Time` value will be used.
//
// When computing rates for metrics across two "readings", we simply subtract the numerators and
// denominator and divide the differences. We use the `windowSizeSecs` to pick which "readings"
// should be compared. This creates a sliding window. We (currently) bias this window to better
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bias this window

What does this mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you understand the idea of windowing -- but because the usage of bias is heavily influenced, I'll deconstruct the general idea first:

So when we create a datapoint for a graph that is stating "what percent of the CPU this is process using", it's not helpful to talk about at that specific instance in time. It's either running right now (i.e: "100%") or it's not (i.e: "0%"). So instead we talk about CPU usage over some "window" of time. When one does ps aux, the CPU percentage there is over the lifetime of the process.

So if a viam-server runs for 9 minutes doing nothing, that would say a number close to 0%. And if at that point a user then opened a video stream that stayed on the CPU for an entire minute (modulo some system scheduling overhead), ps aux wouldn't report a usage of 100%, but rather slowly go up from 0% -> 10% over the course of that one minute.

So regarding the window, we have 2 extremes. What's happening this very instant and what's happened over the lifetime of the process. I don't like either of those for this case, so I chose something in between. I felt like a process that flips from completely idle to max CPU should see that reflected in a graph over the coarse of 5 seconds.

I feel that choice skews more towards the "what's happening this instant" side of the scale. So "bias this window" refers to two properties:

  • This is a human choice. There are other valid options.
  • The choice is not a binary A or B. But rather a relatively continuous range of options to select from.

The context where I hear "bias" used in this way is (not coincidentally) from my college OS class talking about the (in theory) linux CPU scheduler. Where it wants to prioritize running programs that used its entire time slice to do work. And it similarly has to age out older program behavior. Which gets modeled by some variable r[ecency bias]:

new_priority = (1-r)old_priority + (r)percent_of_timeslice_used_in_last_run

So choosing r=1 gives a priority influenced completely by the last run. And choosing smaller values believe that a program's future behavior is better predicting by its more historical behavior.

Sorry -- this was huge. Probably should have clarified offline. Definitely recommend a change (maybe just avoid mentioning what our window size preference is?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big thanks for the thoughtful explanation here. The concept generally makes sense to me and I don't think I need much more in the way of documentation. I think if you're finding that 5s is a time window that's providing reasonable output wrt CPU usage, then it seems like a perfectly valid choice to me.

// portray "recent" resource utilization.
var ratioMetricToFields = map[string]ratioMetric{
"UserCPU": {"UserCPUSecs", "ElapsedTimeSecs"},
"SystemCPU": {"SystemCPUSecs", "ElapsedTimeSecs"},
// PerSec ratios use an empty string denominator.
"HeadersProcessedPerSec": {"HeadersProcessed", ""},
}

// ratioReading is a reading of two metrics described by `ratioMetric`. This is what will be graphed.
type ratioReading struct {
GraphName string
// Seconds since epoch.
Time int64
Numerator float32
Denominator float64

// `isRate` == false will multiply by 100 for displaying as a percentage. Otherwise just display
// the quotient.
isRate bool
}

func (rr ratioReading) toValue() (float32, error) {
if math.Abs(rr.Denominator) < 1e-9 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Do you have a pre-existing epsilon constant lying around in this package somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right in that I have one (that I completely forgot about!)

But it's (right now) in a different package

return 0.0, fmt.Errorf("divide by zero error, metric: %v", rr.GraphName)
}

if rr.isRate {
return float32(float64(rr.Numerator) / rr.Denominator), nil
}

// A percentage
return float32(float64(rr.Numerator) / rr.Denominator * 100), nil
}

func (rr *ratioReading) diff(other *ratioReading) ratioReading {
return ratioReading{
rr.GraphName,
rr.Time,
rr.Numerator - other.Numerator,
rr.Denominator - other.Denominator,
rr.isRate,
}
}

// pullRatios returns true if any of the `ratioMetrics` match the input `reading`. If so, a new
// `ratioReading` is added to the `deferredReadings`.
dgottlieb marked this conversation as resolved.
Show resolved Hide resolved
func pullRatios(
reading ftdc.Reading,
readingTS int64,
ratioMetrics map[string]ratioMetric,
outDeferredReadings map[string]*ratioReading,
) bool {
ret := false
for ratioMetricName, ratioMetric := range ratioMetrics {
if strings.HasSuffix(reading.MetricName, ratioMetric.Numerator) {
ret = true

// `metricIdentifier` is expected to be of the form `rdk.foo_module.`. Leave the
// trailing dot as we would be about to re-add it.
metricIdentifier := strings.TrimSuffix(reading.MetricName, ratioMetric.Numerator)
// E.g: `rdk.foo_module.User CPU%'.
graphName := fmt.Sprint(metricIdentifier, ratioMetricName)
if _, exists := outDeferredReadings[graphName]; !exists {
outDeferredReadings[graphName] = &ratioReading{GraphName: graphName, Time: readingTS, isRate: ratioMetric.Denominator == ""}
}

outDeferredReadings[graphName].Numerator = reading.Value
if ratioMetric.Denominator == "" {
outDeferredReadings[graphName].Denominator = float64(readingTS)
}

continue
}

if ratioMetric.Denominator != "" && strings.HasSuffix(reading.MetricName, ratioMetric.Denominator) {
ret = true

// `metricIdentifier` is expected to be of the form `rdk.foo_module.`. Leave the
// trailing dot as we would be about to re-add it.
metricIdentifier := strings.TrimSuffix(reading.MetricName, ratioMetric.Denominator)
// E.g: `rdk.foo_module.User CPU%'.
graphName := fmt.Sprint(metricIdentifier, ratioMetricName)
if _, exists := outDeferredReadings[graphName]; !exists {
outDeferredReadings[graphName] = &ratioReading{GraphName: graphName, Time: readingTS, isRate: false}
}

outDeferredReadings[graphName].Denominator = float64(reading.Value)
continue
}
}

return ret
}

func (gpw *gnuplotWriter) addFlatDatum(datum ftdc.FlatDatum) map[string]*ratioReading {
// deferredReadings is an accumulator for readings of metrics that are used together to create a
// graph. Such as `UserCPUSecs` / `ElapsedTimeSecs`.
deferredReadings := make(map[string]*ratioReading)

// There are two kinds of metrics. "Simple" metrics that can simply be passed through to the
// gnuplotWriter. And "ratio" metrics that combine two different readings.
//
// For the ratio metrics, we use a two pass algorithm. The first pass will pair together all of
// the necessary numerators and denominators. The second pass will write the computed datapoint
// to the underlying gnuplotWriter.
//
// Ratio metrics are identified by the metric suffix. E.g: `rdk.custom_module.UserCPUSecs` will
// be classified as a (numerator in a) ratio metric. We must also take care to record the prefix
// of the ratio metric, the "metric identifier". There may be `rdk.foo_module.UserCPUSecs` in
// addition to `rdk.bar_modular.UserCPUSecs`. Which should create two CPU% graphs.
for _, reading := range datum.Readings {
// pullRatios will identify if the metric is a "ratio" metric. If so, we do not currently
// know what to graph and `pullRatios` will accumulate the relevant information into
// `deferredReadings`.
isRatioMetric := pullRatios(reading, datum.ConvertedTime().Unix(), ratioMetricToFields, deferredReadings)
if isRatioMetric {
// Ratio metrics need to be compared to some prior ratio metric to create a data
// point. We do not output any information now. We instead accumulate all of these
// results to be later used. These are named "deferred values".
continue
}

gpw.addPoint(datum.ConvertedTime().Unix(), reading.MetricName, reading.Value)
}

return deferredReadings
}

// Ratios are averaged over a "recent history". This window size refers to a time in seconds, but we
// actually measure with respect to consecutive FTDC readings. The output value will use the system
// clock difference to compute a correct rate. We just accept there may be fuzziness with respect to
// how recent of a history we're actually using.
//
// Consider adding logging when two FTDC readings `windowSizeSecs` apart is not reflecting by their
// system time difference.
const windowSizeSecs = 5

// The deferredValues input is in FTDC reading order. On a responsive system, adjacent items in the
// slice should be one second apart.
func (gpw *gnuplotWriter) writeDeferredValues(deferredValues []map[string]*ratioReading, logger logging.Logger) {
for idx, currReadings := range deferredValues {
if idx == 0 {
// The first element cannot be compared to anything. It would create a divide by zero
// problem.
continue
}

// `forCompare` is the index element to compare the "current" element pointed to by `idx`.
forCompare := idx - windowSizeSecs
if forCompare < 0 {
// If we haven't
forCompare = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this create a divide by zero issue? Comparing with self?

Copy link
Member Author

@dgottlieb dgottlieb Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition (and the above one) isn't written in the most readable way: https://github.com/viamrobotics/rdk/pull/4658/files#diff-a6de85d19062a282c0d7f2c427b26f6598fcf073ef34752a97ec02b053a62f18R295-R296

edit the above was perhaps cryptic -- my point is idx == forCompare can not happen. Because the above if-statement skips the whole idx == 0 case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sat and thought about how to change this to make it more obvious. I failed.

I did realize that the abstraction of ratioMetrics (as I describe them) is incorrect. All of this windowing code only makes sense for time-based things. And in that case we can be (fairly) confident the denominator is always increasing.

There can be other "metrics that are computed as a ratio of things" that are not time based. And consequently don't need this windowing logic. And therefore need not worry about a denominator being 0 after some subtraction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha maybe you can confirm my understanding: is this lie setting the value we should compare to to the earliest possible recorded value in the case that 5 seconds have not actually elapsed?

As for the concern about ratioMetric vs something like rateMetric, I don't think it matters too much for now. If we start adding more non-time-based ratio metrics, then we could probably revisit this (unexported) naming...

}

prevReadings := deferredValues[forCompare]
for metricName, currRatioReading := range currReadings {
var diff ratioReading
if prevratioReading, exists := prevReadings[metricName]; exists {
diff = currRatioReading.diff(prevratioReading)
} else {
logger.Infow("Deferred value missing a previous value to diff",
"metricName", metricName, "time", currRatioReading.Time)
continue
}

value, err := diff.toValue()
if err != nil {
// The denominator did not change -- divide by zero error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Might as well report the created error here, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

continue
}
gpw.addPoint(currRatioReading.Time, metricName, value)
}
}
}

// Render runs the compiler and invokes gnuplot, creating an image file.
Expand Down Expand Up @@ -216,11 +403,14 @@ func main() {
graphOptions := defaultGraphOptions()
for {
if render {
deferredValues := make([]map[string]*ratioReading, 0)
gpw := newGnuPlotWriter(graphOptions)
for _, flatDatum := range data {
gpw.addFlatDatum(flatDatum)
deferredValues = append(deferredValues, gpw.addFlatDatum(flatDatum))
}

gpw.writeDeferredValues(deferredValues, logger)

gpw.Render()
}
render = true
Expand Down Expand Up @@ -283,6 +473,8 @@ func main() {
case strings.HasPrefix(cmd, "reset range"):
graphOptions.minTimeSeconds = 0
graphOptions.maxTimeSeconds = math.MaxInt64
case cmd == "refresh" || cmd == "r":
nolintPrintln("Refreshing graphs with new data")
case len(cmd) == 0:
render = false
default:
Expand Down
1 change: 1 addition & 0 deletions ftdc/custom_format.go
Original file line number Diff line number Diff line change
Expand Up @@ -329,6 +329,7 @@ func flatten(datum datum, schema *schema) ([]float32, error) {
//
// [ Reading{"webrtc.connections", 5}, Reading{"webrtc.bytesSent", 8096} ].
type FlatDatum struct {
// Time is a 64 bit integer representing nanoseconds since the epoch.
Time int64
Readings []Reading
}
Expand Down
2 changes: 1 addition & 1 deletion module/modmanager/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -1343,7 +1343,7 @@ func (m *module) registerProcessWithFTDC() {
return
}

m.ftdc.Add(fmt.Sprintf("modules.%s", m.process.ID()), statser)
m.ftdc.Add(fmt.Sprintf("proc.modules.%s", m.process.ID()), statser)
}

func getFullEnvironment(
Expand Down
2 changes: 1 addition & 1 deletion robot/impl/local_robot.go
Original file line number Diff line number Diff line change
Expand Up @@ -357,7 +357,7 @@ func newWithResources(
// the web service has not been "started".
ftdcWorker = ftdc.New(ftdc.DefaultDirectory(config.ViamDotDir, partID), logger.Sublogger("ftdc"))
if statser, err := sys.NewSelfSysUsageStatser(); err == nil {
ftdcWorker.Add("viam-server", statser)
ftdcWorker.Add("proc.viam-server", statser)
}
}

Expand Down
Loading