feat: Add 'unpublished' flag to configs (#1446)
## Which problem is this PR solving?

- We have some config options that we'd like to avoid publishing because we aren't sure we'll need them, but we still want the ability to adjust them while tuning a configuration. Flagging a config item as unpublished keeps it out of the main documentation, so we can change or remove it later without a breaking change.

## Short description of the changes

- Add an `unpublished` flag to the metadata struct
- Add it to a couple of fields (see the sketch below)
- Regenerate the docs
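
As a rough sketch of how a field gets flagged (mirroring the `configMeta.yaml` entries further down in this diff; the field name and summary below are hypothetical):

```yaml
# config/metadata/configMeta.yaml (illustrative entry)
- name: ExampleTuningKnob   # hypothetical name; real entries sit under a group's fields
  type: int
  valuetype: nondefault
  default: 1000
  reload: false
  unpublished: true         # keeps this field out of the generated config.md
  summary: Example knob we may still rename or remove.
```
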
kentquirk authored Nov 22, 2024
1 parent ac659ab commit dc650e9
Showing 16 changed files with 353 additions and 35 deletions.
56 changes: 51 additions & 5 deletions config.md
@@ -1,7 +1,7 @@
# Honeycomb Refinery Configuration Documentation

This is the documentation for the configuration file for Honeycomb's Refinery.
It was automatically generated on 2024-10-11 at 16:33:01 UTC.
It was automatically generated on 2024-11-22 at 17:59:58 UTC.

## The Config file

@@ -339,6 +339,20 @@ Decreasing this will check the trace cache for timeouts more frequently.
- Type: `duration`
- Default: `100ms`

### `MaxExpiredTraces`

MaxExpiredTraces is the maximum number of expired traces to process.

This setting controls how many traces are processed when it is time to make a sampling decision.
Up to this many traces will be processed every `SendTicker` duration.
If this number is too small, Refinery spends less time calculating sampling decisions, and data arrives at Honeycomb more slowly.
If your `collector_collect_loop_duration_ms` is above 3 seconds, it is recommended to reduce this value and the `SendTicker` duration.
This means Refinery makes fewer sampling decision calculations on each `SendTicker` tick, but gets the chance to make decisions more often.

- Eligible for live reload.
- Type: `int`
- Default: `5000`
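
For example, following the guidance above, a cluster with a long collect loop might reduce both values together; the numbers below are illustrative, not recommendations:

```yaml
Traces:
  # Check for expired traces twice as often as the default...
  SendTicker: 50ms
  # ...but cap how many traces are processed on each tick.
  MaxExpiredTraces: 2500
```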

## Debugging

`Debugging` contains configuration values used when setting up and debugging Refinery.
@@ -388,6 +402,7 @@ This is useful for evaluating sampling rules.
When DryRun is enabled, traces are decorated with `meta.refinery.dryrun.kept`, which is set to `true` or `false` based on whether the trace would be kept or dropped.
In addition, `SampleRate` will be set to the incoming rate for all traces, and the field `meta.refinery.dryrun.sample_rate` will be set to the sample rate that would have been used.
NOTE: This setting is not compatible with `DisableTraceLocality=true`, because drop trace decisions shared among peers do not contain all the relevant information needed to send traces to Honeycomb.

- Eligible for live reload.
- Type: `bool`
@@ -721,6 +736,16 @@ Since each incoming span generates multiple outgoing spans, a minimum sample rat
- Type: `int`
- Default: `100`

### `Insecure`

Insecure controls whether to send Refinery's own OpenTelemetry traces via http instead of https.

When true, Refinery will export its internal traces over http instead of https.
This is useful if you plan on sending your traces to a different Refinery instance for tail sampling.

- Not eligible for live reload.
- Type: `bool`
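
A minimal sketch of enabling this when exporting to another Refinery over plain HTTP (placement in the `OTelTracing` group follows `config_complete.yaml` below; the value is illustrative):

```yaml
OTelTracing:
  # Send Refinery's own traces over http rather than https, e.g. when
  # the destination is another Refinery instance doing tail sampling.
  Insecure: true
```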

## Peer Management

`PeerManagement` controls how the Refinery cluster communicates between peers.
@@ -885,6 +910,8 @@ The collection cache is used to collect all active spans into traces.
It is organized as a circular buffer.
When the buffer wraps around, Refinery will try a few times to find an empty slot; if it fails, it starts ejecting traces from the cache earlier than would otherwise be necessary.
Ideally, the size of the cache should be many multiples (100x to 1000x) of the total number of concurrently active traces (average trace throughput * average trace duration).
NOTE: This setting is now deprecated and no longer controls the cache size.
Instead, the maximum memory usage is controlled by `MaxMemoryPercentage` and `MaxAlloc`.

- Eligible for live reload.
- Type: `int`
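
Since memory limits now govern the cache, tuning shifts to those settings instead; a hedged sketch, assuming `MaxMemoryPercentage` sits in the `Collection` group and using an illustrative value:

```yaml
Collection:
  # Memory use is now governed by memory limits rather than the
  # deprecated cache-size setting; 75 is an illustrative percentage.
  MaxMemoryPercentage: 75
```
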
@@ -970,6 +997,18 @@ By disabling this behavior, it can help to prevent disruptive bursts of network
- Eligible for live reload.
- Type: `bool`

### `RedistributionDelay`

RedistributionDelay controls the amount of time Refinery waits after each cluster scaling event before redistributing in-memory traces.

This value should be longer than the amount of time between individual pod changes in a bulk scaling operation (changing the cluster size by more than one pod).
Each redistribution generates additional traffic between peers.
If this value is too short, multiple consecutive redistributions will occur and the resulting traffic may overwhelm the cluster.

- Not eligible for live reload.
- Type: `duration`
- Default: `30s`
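
For instance, if a bulk scaling operation typically takes around 40 seconds to roll out, a slightly longer delay (illustrative value) lets it settle before redistribution:

```yaml
Collection:
  # Give a multi-pod scaling operation time to finish before
  # redistributing in-memory traces.
  RedistributionDelay: 45s
```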

### `ShutdownDelay`

ShutdownDelay controls the maximum time Refinery can use while draining traces at shutdown.
@@ -982,13 +1021,20 @@ This value should be set to a bit less than the normal timeout period for shutti
- Type: `duration`
- Default: `15s`

### `EnableTraceLocality`
### `DisableTraceLocality`

EnableTraceLocality controls whether all spans that belongs to the same trace are sent to a single Refinery for processing.
DisableTraceLocality controls whether all spans that belong to the same trace are sent to a single Refinery for processing.

If `true`, Refinery's will route all spans that belongs to the same trace to a single peer.
When `false`, Refinery will route all spans that belong to the same trace to a single peer.
This is the default behavior ("Trace Locality") and the way Refinery has worked in the past.
When `true`, Refinery will instead keep spans on the node where they were received, and forward proxy spans that contain only the key information needed to make a trace decision.
This can reduce the amount of traffic between peers in most cases, and can help avoid a situation where a single large trace can cause a memory overrun on a single node.
If `true`, the amount of traffic between peers will be reduced, but the amount of traffic between Refinery and Redis will significantly increase, because Refinery uses Redis to distribute the trace decisions to all nodes in the cluster.
It is important to adjust the size of the Redis cluster in this case.
NOTE: This setting is not compatible with `DryRun` when set to true.
See `DryRun` for more information.

- Eligible for live reload.
- Not eligible for live reload.
- Type: `bool`
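
A brief sketch of opting out of Trace Locality (illustrative; note the Redis sizing caveat above and the incompatibility with `DryRun`):

```yaml
Collection:
  # Keep spans on the node that received them and share only trace
  # decisions via Redis instead of forwarding spans to a single peer.
  DisableTraceLocality: true
```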

### `HealthCheckTimeout`
1 change: 1 addition & 0 deletions config/metadata.go
@@ -45,6 +45,7 @@ type Field struct {
Pattern string `yaml:"pattern,omitempty"`
Envvar string `yaml:"envvar,omitempty"`
CommandLine string `yaml:"commandLine,omitempty"`
Unpublished bool `yaml:"unpublished,omitempty"`
}

type Group struct {
6 changes: 5 additions & 1 deletion config/metadata/configMeta.yaml
@@ -409,7 +409,7 @@ groups:
validations:
- type: minimum
arg: 1000
summary: Max number of expired traces to process.
summary: is the maximum number of expired traces to process.
description: >
This setting controls how many traces are processed when it is time to
make a sampling decision. Up to this many traces will be processed every
@@ -1375,6 +1375,7 @@ groups:
type: int
valuetype: nondefault
firstversion: v2.9
unpublished: true
default: 1000
reload: false
summary: Maximum size for batching drop decisions.
@@ -1385,6 +1386,7 @@ groups:
type: duration
valuetype: nondefault
firstversion: v2.9
unpublished: true
default: 1s
reload: true
summary: Interval for sending drop decisions in batches.
@@ -1395,6 +1397,7 @@ groups:
type: int
valuetype: nondefault
firstversion: v2.9
unpublished: true
default: 1000
reload: false
summary: Maximum size for batching kept decisions.
@@ -1405,6 +1408,7 @@ groups:
type: duration
valuetype: nondefault
firstversion: v2.9
unpublished: true
default: 1s
reload: true
summary: Interval for sending kept decisions in batches.
72 changes: 66 additions & 6 deletions config_complete.yaml
@@ -2,7 +2,7 @@
## Honeycomb Refinery Configuration ##
######################################
#
# created on 2024-10-11 at 16:33:00 UTC from ../../config.yaml using a template generated on 2024-10-11 at 16:32:50 UTC
# created on 2024-11-22 at 17:59:57 UTC from ../../config.yaml using a template generated on 2024-11-22 at 17:59:50 UTC

# This file contains a configuration for the Honeycomb Refinery. It is in YAML
# format, organized into named groups, each of which contains a set of
@@ -355,6 +355,22 @@ Traces:
## Eligible for live reload.
# SendTicker: 100ms

## MaxExpiredTraces is the maximum number of expired traces to process.
##
## This setting controls how many traces are processed when it is time to
## make a sampling decision. Up to this many traces will be processed
## every `SendTicker` duration. If this number is too small, it will mean
## Refinery is spending less time calculating sampling decisions,
## resulting in data arriving at Honeycomb slower.
## If your `collector_collect_loop_duration_ms` is above 3 seconds, it is
## recommended to reduce this value and the `SendTicker` duration. This
## will mean Refinery makes fewer sampling decision calculations each
## `SendTicker` tick, but gets the chance to make decisions more often.
##
## default: 5000
## Eligible for live reload.
# MaxExpiredTraces: 5_000

###############
## Debugging ##
###############
@@ -411,6 +427,9 @@ Debugging:
## to the incoming rate for all traces, and the field
## `meta.refinery.dryrun.sample_rate` will be set to the sample rate that
## would have been used.
## NOTE: This setting is not compatible with `DisableTraceLocality=true`,
## because drop trace decisions shared among peers do not contain all the
## relevant information needed to send traces to Honeycomb.
##
## Eligible for live reload.
# DryRun: true
@@ -752,6 +771,16 @@ OTelTracing:
## Eligible for live reload.
# SampleRate: 100

## Insecure controls whether to send Refinery's own OpenTelemetry traces
## via http instead of https.
##
## When true, Refinery will export its internal traces over http instead
## of https. This is useful if you plan on sending your traces to a
## different Refinery instance for tail sampling.
##
## Not eligible for live reload.
# Insecure: false

#####################
## Peer Management ##
#####################
@@ -935,6 +964,9 @@ Collection:
## necessary. Ideally, the size of the cache should be many multiples
## (100x to 1000x) of the total number of concurrently active traces
## (average trace throughput * average trace duration).
## NOTE: This setting is now deprecated and no longer controls the cache
## size. Instead, the maximum memory usage is controlled by
## `MaxMemoryPercentage` and `MaxAlloc`.
##
## default: 10000
## Eligible for live reload.
@@ -1020,6 +1052,21 @@ Collection:
## Eligible for live reload.
# DisableRedistribution: false

## RedistributionDelay controls the amount of time Refinery waits after
## each cluster scaling event before redistributing in-memory traces.
##
## This value should be longer than the amount of time between individual
## pod changes in a bulk scaling operation (changing the cluster size by
## more than one pod). Each redistribution generates additional traffic
## between peers. If this value is too short, multiple consecutive
## redistributions will occur and the resulting traffic may overwhelm the
## cluster.
##
## Accepts a duration string with units, like "30s".
## default: 30s
## Not eligible for live reload.
# RedistributionDelay: 30s

## ShutdownDelay controls the maximum time Refinery can use while
## draining traces at shutdown.
##
@@ -1036,14 +1083,27 @@ Collection:
## Eligible for live reload.
# ShutdownDelay: 15s

## EnableTraceLocality controls whether all spans that belongs to the
## DisableTraceLocality controls whether all spans that belong to the
## same trace are sent to a single Refinery for processing.
##
## If `true`, Refinery's will route all spans that belongs to the same
## trace to a single peer.
## When `false`, Refinery will route all spans that belong to the same
## trace to a single peer. This is the default behavior ("Trace
## Locality") and the way Refinery has worked in the past. When `true`,
## Refinery will instead keep spans on the node where they were received,
## and forward proxy spans that contain only the key information needed
## to make a trace decision. This can reduce the amount of traffic
## between peers in most cases, and can help avoid a situation where a
## single large trace can cause a memory overrun on a single node.
## If `true`, the amount of traffic between peers will be reduced, but
## the amount of traffic between Refinery and Redis will significantly
## increase, because Refinery uses Redis to distribute the trace
## decisions to all nodes in the cluster. It is important to adjust the
## size of the Redis cluster in this case.
## NOTE: This setting is not compatible with `DryRun` when set to true.
## See `DryRun` for more information.
##
## Eligible for live reload.
# EnableTraceLocality: false
## Not eligible for live reload.
# DisableTraceLocality: false

## HealthCheckTimeout controls the maximum duration allowed for
## collection health checks to complete.
16 changes: 15 additions & 1 deletion metrics.md
@@ -1,7 +1,7 @@
# Honeycomb Refinery Metrics Documentation

This document contains the description of various metrics used in Refinery.
It was automatically generated on 2024-10-11 at 16:33:00 UTC.
It was automatically generated on 2024-11-22 at 17:59:56 UTC.

Note: This document does not include metrics defined in the dynsampler-go dependency, as those metrics are generated dynamically at runtime. As a result, certain metrics may be missing or incomplete in this document, but they will still be available during execution with their full names.

@@ -13,6 +13,10 @@ This table includes metrics with fully defined names.
| collect_cache_buffer_overrun | Counter | Dimensionless | The number of times the trace overwritten in the circular buffer has not yet been sent |
| collect_cache_capacity | Gauge | Dimensionless | The number of traces that can be stored in the cache |
| collect_cache_entries | Histogram | Dimensionless | The number of traces currently stored in the cache |
| trace_cache_set_dur_ms | Histogram | Dimensionless | duration to set a trace in the cache |
| trace_cache_take_expired_traces_dur_ms | Histogram | Dimensionless | duration to take expired traces from the cache |
| trace_cache_remove_traces_dur_ms | Histogram | Dimensionless | duration to remove traces from the cache |
| trace_cache_get_all_dur_ms | Histogram | Dimensionless | duration to get all traces from the cache |
| cuckoo_current_capacity | Gauge | Dimensionless | current capacity of the cuckoo filter |
| cuckoo_future_load_factor | Gauge | Percent | the fraction of slots occupied in the future cuckoo filter |
| cuckoo_current_load_factor | Gauge | Percent | the fraction of slots occupied in the current cuckoo filter |
@@ -60,6 +64,16 @@ This table includes metrics with fully defined names.
| trace_aggregate_sample_rate | Histogram | Dimensionless | aggregate sample rate of both kept and dropped traces |
| collector_redistribute_traces_duration_ms | Histogram | Milliseconds | duration of redistributing traces to peers |
| collector_collect_loop_duration_ms | Histogram | Milliseconds | duration of the collect loop, the primary event processing goroutine |
| collector_outgoing_queue | Histogram | Dimensionless | number of traces waiting to be sent upstream |
| collector_drop_decision_batch_count | Histogram | Dimensionless | number of drop decisions sent in a batch |
| collector_expired_traces_missing_decisions | Gauge | Dimensionless | number of decision spans forwarded for expired traces missing trace decision |
| collector_expired_traces_orphans | Gauge | Dimensionless | number of expired traces missing trace decision when they are sent |
| drop_decision_batches_received | Counter | Dimensionless | number of drop decision batches received |
| kept_decision_batches_received | Counter | Dimensionless | number of kept decision batches received |
| drop_decisions_received | Counter | Dimensionless | total number of drop decisions received |
| kept_decisions_received | Counter | Dimensionless | total number of kept decisions received |
| collector_kept_decisions_queue_full | Counter | Dimensionless | number of times kept trace decision queue is full |
| collector_drop_decisions_queue_full | Counter | Dimensionless | number of times drop trace decision queue is full |
| cluster_stress_level | Gauge | Dimensionless | The overall stress level of the cluster |
| individual_stress_level | Gauge | Dimensionless | The stress level of the individual node |
| stress_level | Gauge | Dimensionless | The stress level that's being used to determine whether to activate stress relief |