feat: Add 'unpublished' flag to configs (#1446)
## Which problem is this PR solving?

- We have some config options that we'd like to avoid publishing because we aren't sure we'll need them, but we still want the ability to adjust them while tuning a configuration. Flagging a config item as unpublished keeps it out of the main documentation, so we can change or remove it later without a breaking change.

## Short description of the changes

- Add an `unpublished` flag to the metadata struct
- Add it to a couple of fields (see the sketch below)
- Regenerate the docs
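
As a rough sketch of how a field gets flagged (mirroring the `configMeta.yaml` entries further down in this diff; the field name and summary below are hypothetical):

```yaml
# config/metadata/configMeta.yaml (illustrative entry)
- name: ExampleTuningKnob   # hypothetical name; real entries sit under a group's fields
  type: int
  valuetype: nondefault
  default: 1000
  reload: false
  unpublished: true         # keeps this field out of the generated config.md
  summary: Example knob we may still rename or remove.
```
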
kentquirk authored Nov 22, 2024
1 parent ac659ab commit dc650e9
Showing 16 changed files with 353 additions and 35 deletions.
56 changes: 51 additions & 5 deletions config.md
@@ -1,7 +1,7 @@
# Honeycomb Refinery Configuration Documentation

This is the documentation for the configuration file for Honeycomb's Refinery.
It was automatically generated on 2024-10-11 at 16:33:01 UTC.
It was automatically generated on 2024-11-22 at 17:59:58 UTC.

## The Config file

@@ -339,6 +339,20 @@ Decreasing this will check the trace cache for timeouts more frequently.
- Type: `duration`
- Default: `100ms`

### `MaxExpiredTraces`

MaxExpiredTraces is the maximum number of expired traces to process.

This setting controls how many traces are processed when it is time to make a sampling decision.
Up to this many traces will be processed every `SendTicker` duration.
If this number is too small, Refinery spends less time calculating sampling decisions, and data arrives at Honeycomb more slowly.
If your `collector_collect_loop_duration_ms` is above 3 seconds, it is recommended to reduce this value and the `SendTicker` duration.
This means Refinery makes fewer sampling decision calculations on each `SendTicker` tick, but gets the chance to make decisions more often.

- Eligible for live reload.
- Type: `int`
- Default: `5000`
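
For example, following the guidance above, a cluster with a long collect loop might reduce both values together; the numbers below are illustrative, not recommendations:

```yaml
Traces:
  # Check for expired traces twice as often as the default...
  SendTicker: 50ms
  # ...but cap how many traces are processed on each tick.
  MaxExpiredTraces: 2500
```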

## Debugging

`Debugging` contains configuration values used when setting up and debugging Refinery.
@@ -388,6 +402,7 @@ This is useful for evaluating sampling rules.
When DryRun is enabled, traces are decorated with `meta.refinery.dryrun.kept`, which is set to `true` or `false` based on whether the trace would be kept or dropped.
In addition, `SampleRate` will be set to the incoming rate for all traces, and the field `meta.refinery.dryrun.sample_rate` will be set to the sample rate that would have been used.
NOTE: This setting is not compatible with `DisableTraceLocality=true`, because drop trace decisions shared among peers do not contain all the relevant information needed to send traces to Honeycomb.

- Eligible for live reload.
- Type: `bool`
@@ -721,6 +736,16 @@ Since each incoming span generates multiple outgoing spans, a minimum sample rat
- Type: `int`
- Default: `100`

### `Insecure`

Insecure controls whether to send Refinery's own OpenTelemetry traces via http instead of https.

When true, Refinery will export its internal traces over http instead of https.
This is useful if you plan on sending your traces to a different Refinery instance for tail sampling.

- Not eligible for live reload.
- Type: `bool`
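
A minimal sketch of enabling this when exporting to another Refinery over plain HTTP (placement in the `OTelTracing` group follows `config_complete.yaml` below; the value is illustrative):

```yaml
OTelTracing:
  # Send Refinery's own traces over http rather than https, e.g. when
  # the destination is another Refinery instance doing tail sampling.
  Insecure: true
```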

## Peer Management

`PeerManagement` controls how the Refinery cluster communicates between peers.
@@ -885,6 +910,8 @@ The collection cache is used to collect all active spans into traces.
It is organized as a circular buffer.
When the buffer wraps around, Refinery will try a few times to find an empty slot; if it fails, it starts ejecting traces from the cache earlier than would otherwise be necessary.
Ideally, the size of the cache should be many multiples (100x to 1000x) of the total number of concurrently active traces (average trace throughput * average trace duration).
NOTE: This setting is now deprecated and no longer controls the cache size.
Instead, the maximum memory usage is controlled by `MaxMemoryPercentage` and `MaxAlloc`.

- Eligible for live reload.
- Type: `int`
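
Since memory limits now govern the cache, tuning shifts to those settings instead; a hedged sketch, assuming `MaxMemoryPercentage` sits in the `Collection` group and using an illustrative value:

```yaml
Collection:
  # Memory use is now governed by memory limits rather than the
  # deprecated cache-size setting; 75 is an illustrative percentage.
  MaxMemoryPercentage: 75
```
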
@@ -970,6 +997,18 @@ By disabling this behavior, it can help to prevent disruptive bursts of network
- Eligible for live reload.
- Type: `bool`

### `RedistributionDelay`

RedistributionDelay controls the amount of time Refinery waits after each cluster scaling event before redistributing in-memory traces.

This value should be longer than the amount of time between individual pod changes in a bulk scaling operation (changing the cluster size by more than one pod).
Each redistribution generates additional traffic between peers.
If this value is too short, multiple consecutive redistributions will occur and the resulting traffic may overwhelm the cluster.

- Not eligible for live reload.
- Type: `duration`
- Default: `30s`
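
For instance, if a bulk scaling operation typically takes around 40 seconds to roll out, a slightly longer delay (illustrative value) lets it settle before redistribution:

```yaml
Collection:
  # Give a multi-pod scaling operation time to finish before
  # redistributing in-memory traces.
  RedistributionDelay: 45s
```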

### `ShutdownDelay`

ShutdownDelay controls the maximum time Refinery can use while draining traces at shutdown.
@@ -982,13 +1021,20 @@ This value should be set to a bit less than the normal timeout period for shutti
- Type: `duration`
- Default: `15s`

### `EnableTraceLocality`
### `DisableTraceLocality`

EnableTraceLocality controls whether all spans that belongs to the same trace are sent to a single Refinery for processing.
DisableTraceLocality controls whether all spans that belong to the same trace are sent to a single Refinery for processing.

If `true`, Refinery's will route all spans that belongs to the same trace to a single peer.
When `false`, Refinery will route all spans that belong to the same trace to a single peer.
This is the default behavior ("Trace Locality") and the way Refinery has worked in the past.
When `true`, Refinery will instead keep spans on the node where they were received, and forward proxy spans that contain only the key information needed to make a trace decision.
This can reduce the amount of traffic between peers in most cases, and can help avoid a situation where a single large trace can cause a memory overrun on a single node.
If `true`, the amount of traffic between peers will be reduced, but the amount of traffic between Refinery and Redis will significantly increase, because Refinery uses Redis to distribute the trace decisions to all nodes in the cluster.
It is important to adjust the size of the Redis cluster in this case.
NOTE: This setting is not compatible with `DryRun` when set to true.
See `DryRun` for more information.

- Eligible for live reload.
- Not eligible for live reload.
- Type: `bool`
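
A brief sketch of opting out of Trace Locality (illustrative; note the Redis sizing caveat above and the incompatibility with `DryRun`):

```yaml
Collection:
  # Keep spans on the node that received them and share only trace
  # decisions via Redis instead of forwarding spans to a single peer.
  DisableTraceLocality: true
```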

### `HealthCheckTimeout`
1 change: 1 addition & 0 deletions config/metadata.go
@@ -45,6 +45,7 @@ type Field struct {
Pattern string `yaml:"pattern,omitempty"`
Envvar string `yaml:"envvar,omitempty"`
CommandLine string `yaml:"commandLine,omitempty"`
Unpublished bool `yaml:"unpublished,omitempty"`
}

type Group struct {
6 changes: 5 additions & 1 deletion config/metadata/configMeta.yaml
@@ -409,7 +409,7 @@ groups:
validations:
- type: minimum
arg: 1000
summary: Max number of expired traces to process.
summary: is the maximum number of expired traces to process.
description: >
This setting controls how many traces are processed when it is time to
make a sampling decision. Up to this many traces will be processed every
@@ -1375,6 +1375,7 @@ groups:
type: int
valuetype: nondefault
firstversion: v2.9
unpublished: true
default: 1000
reload: false
summary: Maximum size for batching drop decisions.
@@ -1385,6 +1386,7 @@ groups:
type: duration
valuetype: nondefault
firstversion: v2.9
unpublished: true
default: 1s
reload: true
summary: Interval for sending drop decisions in batches.
@@ -1395,6 +1397,7 @@ groups:
type: int
valuetype: nondefault
firstversion: v2.9
unpublished: true
default: 1000
reload: false
summary: Maximum size for batching kept decisions.
@@ -1405,6 +1408,7 @@ groups:
type: duration
valuetype: nondefault
firstversion: v2.9
unpublished: true
default: 1s
reload: true
summary: Interval for sending kept decisions in batches.
72 changes: 66 additions & 6 deletions config_complete.yaml
@@ -2,7 +2,7 @@
## Honeycomb Refinery Configuration ##
######################################
#
# created on 2024-10-11 at 16:33:00 UTC from ../../config.yaml using a template generated on 2024-10-11 at 16:32:50 UTC
# created on 2024-11-22 at 17:59:57 UTC from ../../config.yaml using a template generated on 2024-11-22 at 17:59:50 UTC

# This file contains a configuration for the Honeycomb Refinery. It is in YAML
# format, organized into named groups, each of which contains a set of
@@ -355,6 +355,22 @@ Traces:
## Eligible for live reload.
# SendTicker: 100ms

## MaxExpiredTraces is the maximum number of expired traces to process.
##
## This setting controls how many traces are processed when it is time to
## make a sampling decision. Up to this many traces will be processed
## every `SendTicker` duration. If this number is too small, it will mean
## Refinery is spending less time calculating sampling decisions,
## resulting in data arriving at Honeycomb slower.
## If your `collector_collect_loop_duration_ms` is above 3 seconds, it is
## recommended to reduce this value and the `SendTicker` duration. This
## will mean Refinery makes fewer sampling decision calculations each
## `SendTicker` tick, but gets the chance to make decisions more often.
##
## default: 5000
## Eligible for live reload.
# MaxExpiredTraces: 5_000

###############
## Debugging ##
###############
@@ -411,6 +427,9 @@ Debugging:
## to the incoming rate for all traces, and the field
## `meta.refinery.dryrun.sample_rate` will be set to the sample rate that
## would have been used.
## NOTE: This setting is not compatible with `DisableTraceLocality=true`,
## because drop trace decisions shared among peers do not contain all the
## relevant information needed to send traces to Honeycomb.
##
## Eligible for live reload.
# DryRun: true
@@ -752,6 +771,16 @@ OTelTracing:
## Eligible for live reload.
# SampleRate: 100

## Insecure controls whether to send Refinery's own OpenTelemetry traces
## via http instead of https.
##
## When true, Refinery will export its internal traces over http instead
## of https. This is useful if you plan on sending your traces to a
## different Refinery instance for tail sampling.
##
## Not eligible for live reload.
# Insecure: false

#####################
## Peer Management ##
#####################
@@ -935,6 +964,9 @@ Collection:
## necessary. Ideally, the size of the cache should be many multiples
## (100x to 1000x) of the total number of concurrently active traces
## (average trace throughput * average trace duration).
## NOTE: This setting is now deprecated and no longer controls the cache
## size. Instead, the maximum memory usage is controlled by
## `MaxMemoryPercentage` and `MaxAlloc`.
##
## default: 10000
## Eligible for live reload.
@@ -1020,6 +1052,21 @@ Collection:
## Eligible for live reload.
# DisableRedistribution: false

## RedistributionDelay controls the amount of time Refinery waits after
## each cluster scaling event before redistributing in-memory traces.
##
## This value should be longer than the amount of time between individual
## pod changes in a bulk scaling operation (changing the cluster size by
## more than one pod). Each redistribution generates additional traffic
## between peers. If this value is too short, multiple consecutive
## redistributions will occur and the resulting traffic may overwhelm the
## cluster.
##
## Accepts a duration string with units, like "30s".
## default: 30s
## Not eligible for live reload.
# RedistributionDelay: 30s

## ShutdownDelay controls the maximum time Refinery can use while
## draining traces at shutdown.
##
@@ -1036,14 +1083,27 @@ Collection:
## Eligible for live reload.
# ShutdownDelay: 15s

## EnableTraceLocality controls whether all spans that belongs to the
## DisableTraceLocality controls whether all spans that belong to the
## same trace are sent to a single Refinery for processing.
##
## If `true`, Refinery's will route all spans that belongs to the same
## trace to a single peer.
## When `false`, Refinery will route all spans that belong to the same
## trace to a single peer. This is the default behavior ("Trace
## Locality") and the way Refinery has worked in the past. When `true`,
## Refinery will instead keep spans on the node where they were received,
## and forward proxy spans that contain only the key information needed
## to make a trace decision. This can reduce the amount of traffic
## between peers in most cases, and can help avoid a situation where a
## single large trace can cause a memory overrun on a single node.
## If `true`, the amount of traffic between peers will be reduced, but
## the amount of traffic between Refinery and Redis will significantly
## increase, because Refinery uses Redis to distribute the trace
## decisions to all nodes in the cluster. It is important to adjust the
## size of the Redis cluster in this case.
## NOTE: This setting is not compatible with `DryRun` when set to true.
## See `DryRun` for more information.
##
## Eligible for live reload.
# EnableTraceLocality: false
## Not eligible for live reload.
# DisableTraceLocality: false

## HealthCheckTimeout controls the maximum duration allowed for
## collection health checks to complete.
16 changes: 15 additions & 1 deletion metrics.md
@@ -1,7 +1,7 @@
# Honeycomb Refinery Metrics Documentation

This document contains the description of various metrics used in Refinery.
It was automatically generated on 2024-10-11 at 16:33:00 UTC.
It was automatically generated on 2024-11-22 at 17:59:56 UTC.

Note: This document does not include metrics defined in the dynsampler-go dependency, as those metrics are generated dynamically at runtime. As a result, certain metrics may be missing or incomplete in this document, but they will still be available during execution with their full names.

@@ -13,6 +13,10 @@ This table includes metrics with fully defined names.
| collect_cache_buffer_overrun | Counter | Dimensionless | The number of times the trace overwritten in the circular buffer has not yet been sent |
| collect_cache_capacity | Gauge | Dimensionless | The number of traces that can be stored in the cache |
| collect_cache_entries | Histogram | Dimensionless | The number of traces currently stored in the cache |
| trace_cache_set_dur_ms | Histogram | Dimensionless | duration to set a trace in the cache |
| trace_cache_take_expired_traces_dur_ms | Histogram | Dimensionless | duration to take expired traces from the cache |
| trace_cache_remove_traces_dur_ms | Histogram | Dimensionless | duration to remove traces from the cache |
| trace_cache_get_all_dur_ms | Histogram | Dimensionless | duration to get all traces from the cache |
| cuckoo_current_capacity | Gauge | Dimensionless | current capacity of the cuckoo filter |
| cuckoo_future_load_factor | Gauge | Percent | the fraction of slots occupied in the future cuckoo filter |
| cuckoo_current_load_factor | Gauge | Percent | the fraction of slots occupied in the current cuckoo filter |
@@ -60,6 +64,16 @@ This table includes metrics with fully defined names.
| trace_aggregate_sample_rate | Histogram | Dimensionless | aggregate sample rate of both kept and dropped traces |
| collector_redistribute_traces_duration_ms | Histogram | Milliseconds | duration of redistributing traces to peers |
| collector_collect_loop_duration_ms | Histogram | Milliseconds | duration of the collect loop, the primary event processing goroutine |
| collector_outgoing_queue | Histogram | Dimensionless | number of traces waiting to be sent upstream |
| collector_drop_decision_batch_count | Histogram | Dimensionless | number of drop decisions sent in a batch |
| collector_expired_traces_missing_decisions | Gauge | Dimensionless | number of decision spans forwarded for expired traces missing trace decision |
| collector_expired_traces_orphans | Gauge | Dimensionless | number of expired traces missing trace decision when they are sent |
| drop_decision_batches_received | Counter | Dimensionless | number of drop decision batches received |
| kept_decision_batches_received | Counter | Dimensionless | number of kept decision batches received |
| drop_decisions_received | Counter | Dimensionless | total number of drop decisions received |
| kept_decisions_received | Counter | Dimensionless | total number of kept decisions received |
| collector_kept_decisions_queue_full | Counter | Dimensionless | number of times kept trace decision queue is full |
| collector_drop_decisions_queue_full | Counter | Dimensionless | number of times drop trace decision queue is full |
| cluster_stress_level | Gauge | Dimensionless | The overall stress level of the cluster |
| individual_stress_level | Gauge | Dimensionless | The stress level of the individual node |
| stress_level | Gauge | Dimensionless | The stress level that's being used to determine whether to activate stress relief |