Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make tracing a task public so self-tracing is possible #6972

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

arielb1
Copy link
Contributor

@arielb1 arielb1 commented Nov 13, 2024

There is some desire to make it possible for tasks to trace themselves to discover slow wakeups, something similar to the test I added.

This PR makes some functions public (but unstable) to make that easier.

I currently shared it with someone I'm working with to see whether it helps them debug their "slow task" problems. I'll make this PR less WIP after I get feedback from them.

WIP: actually add docs.

@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Nov 13, 2024
@arielb1 arielb1 force-pushed the self-trace branch 3 times, most recently from 35d68a7 to 1716248 Compare November 13, 2024 17:02
@arielb1
Copy link
Contributor Author

arielb1 commented Nov 13, 2024

cc @carllerche

also for seeing how to make this pass CI without removing the test

@arielb1 arielb1 force-pushed the self-trace branch 5 times, most recently from d93b7cf to 5375c24 Compare November 14, 2024 19:16
@arielb1
Copy link
Contributor Author

arielb1 commented Nov 18, 2024

[PR status: I'm having some people check out this PR to see if it helps them debug some "slow task" problems. After I see the lessons learned I'll make it less WIP]

@Darksonn Darksonn marked this pull request as draft November 18, 2024 11:58
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-taskdump --cfg tokio_taskdump labels Nov 18, 2024
@jlizen
Copy link

jlizen commented Nov 21, 2024

Chiming in, @arielb1 asked me to poke at this implementation in a somewhat realistic scenario where it might be useful.

One such case that I've encountered is while operating a service that uses a Redis cluster as a backend for distributed precision/adaptive throttling, with various multi-step state operations and varying key cardinality. Redis's single threaded event loop and key space partitioning makes it prone to bottlenecks if code is poorly optimized. Debugging performance issues can be complex and require a decent amount of context on Redis's underlying behaviors. Since, Redis-perspective per-command logs typically don't include time spent while waiting in the event loop queue, metrics tend to be heavily aggregated, and performance issues are difficult to directly trigger without large-scale load testing (and with realistic traffic shapes).

A bit of magic to systematically trace the slower futures without a lot of manual instrumentation by end users, and without introducing new bottlenecks due to blanket log/metric outputs, would be quite handy. This change would open up that door, and then other libraries could wrap their futures with self-tracing logic.

I threw together a crude simulation where I ran a redis cluster locally (8 nodes, key space evenly distributed). I then simulated a big (and fairly hot) key alongside a bunch of normal keys. My wrapper future takes a trace + dumps that trace as well as a human-readable key name, in case the duration of the total lookup is > 500ms.

This specific implementation is pretty heavy-handed for the use case above compared to a plain timer + simple event output. But, I can imagine end user scenarios where there are more complex futures that contain multiple i/o calls, different function contexts, etc, where the task trace might be handy. Ariel has another open PR that will also probably make access to the backtrace more useful.

Probably the larger benefit is for libraries rather than usage directly by end users. It might be nice for @arielb1 to look at how this API feels in e.g. a tower layer, which seems like a good use case for this.

My code and output are below. You'll see a subset of my bigkey calls tripping the threshold, as well as certain regular keys that are routed to the same node and stuck behind the bigkeys in line.

Here is my code:

use std::{future::Future, pin::Pin, task::{Context, Poll}, time::{Duration, Instant}};

use itertools::Itertools;
use rand::distributions::{Alphanumeric, DistString};
use redis::{cluster::ClusterClient, cluster_async::ClusterConnection, AsyncCommands};
use tokio::{runtime::dump::{Root, Trace}, task::JoinSet};
use tracing::{error, info_span, warn, Instrument};
use tracing_subscriber;

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init(); 
    let client = ClusterClient::new(vec!["redis://127.0.0.1:6379"]).unwrap();
    let connection = client.get_async_connection().await.unwrap();

    let elements: Vec<(&'static str, &'static str)> = get_good_keys().into_iter().interleave(get_bad_keys().into_iter()).collect();

    let mut set = JoinSet::new();
    for (key, value) in elements.iter() {
        set.spawn(maybe_traced_redis_set(key, value, connection.clone()));
    }

    set.join_all().await;
}


async fn maybe_traced_redis_set(key: &'static str, value: &'static str, mut connection: ClusterConnection) {
    let span = info_span!("redis_set", key = key);
    let future = connection.set::<&str, &str, String>(key, value);
    let maybe_traced_future = SelfTracingFuture::new(future).instrument(span);

    if let Err(e) = maybe_traced_future.await {
        error!("error setting {key}: {e}");
    }
}


fn get_good_keys() -> Vec<(&'static str, &'static str)> {
    (0..200).map(|el| {
        let leaked: &'static str = el.to_string().leak();
        (leaked, leaked)
    }).collect()
}

fn get_bad_keys() -> Vec<(&'static str, &'static str)> {
    let big_key: &'static str = Alphanumeric.sample_string(&mut rand::thread_rng(), 5_000_000).leak();

    vec![("big_key", big_key); 50]
}


pin_project_lite::pin_project! { pub struct SelfTracingFuture<F: Future> {
        #[pin]
        f: Root<F>,
        t_last: State,
    }
}

enum State {
    NotStarted,
    Running { since: Instant },
    Alerted,
}


impl<F: Future> SelfTracingFuture<F> {
    fn new(f: F) -> Self {
        Self {
            f: Trace::root(f),
            t_last: State::NotStarted,
        }
    }
}


impl<F: Future> Future for SelfTracingFuture<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        let mut this = self.project();
        let now = Instant::now();
        let t_last = match this.t_last {
            State::Running { since } => Some(*since),
            State::NotStarted => {
                *this.t_last = State::Running { since: now };
                None
            }
            State::Alerted => {
                // don't double-alert for the same future
                None
            }
        };
        if t_last.is_some_and(|t_last| now.duration_since(t_last) > Duration::from_millis(500)) {
            let (res, trace) = tokio::runtime::dump::Trace::capture(|| this.f.as_mut().poll(cx));
            *this.t_last = State::Alerted;
            warn!("future ran long: {trace}");
            
            return res;
        }
        this.f.poll(cx)
    }
}

And here is my output:

     Running `build/private/cargo-target/debug/test_trace`
2024-11-21T02:20:21.795154Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.985727Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.985690Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.985997Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986514Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986774Z  WARN redis_set{key="191"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986933Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986032Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986290Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986339Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.985800Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986253Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986078Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.987240Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986810Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.988358Z  WARN redis_set{key="70"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
<snip, maybe half the bigkey calls were >500 ms and a scattering of other calls>

And then for completeness, here you can see the overlap of the bigkey + regular key futures that were running long - all were routed to node 4:

bash-5.2# redis-cli -c
127.0.0.1:6379> cluster keyslot big_key
(integer) 11045
127.0.0.1:6379> cluster keyslot 191
(integer) 11242
127.0.0.1:6379> cluster keyslot 70
(integer) 10833
127.0.0.1:6379> cluster nodes
8fd9761e7124eb3572428f6955a0d666ad804751 240.10.3.4:7001@17001 master - 0 1732152536000 7 connected 4096-6143
91490ff14cf53f07b17044a480162072bf3f6a3e 240.10.3.4:6379@16379 myself,master - 0 1732152535000 1 connected 0-2047
aef5d6bebe5d2a05cb1932657f7352bfc3e913ba 240.10.3.4:7000@17000 master - 0 1732152536782 0 connected 2048-4095
5d55504cde8d478a3006b14ead7097676d768546 240.10.3.4:7004@17004 master - 0 1732152535000 4 connected 10240-12287
04fcec03a691deb0d6dc65d83e1b4e647f5a113f 240.10.3.4:7006@17006 master - 0 1732152535000 5 connected 14336-16383
014709135cc1968a32a0f97d729a3a709d333eb1 240.10.3.4:7002@17002 master - 0 1732152537833 3 connected 6144-8191
6da488511ca9434ddd0c88f026c37741b059648a 240.10.3.4:7003@17003 master - 0 1732152535000 6 connected 8192-10239
218d002c72b76b9255056d990f09e11d82756227 240.10.3.4:7005@17005 master - 0 1732152534074 2 connected 12288-14335

@jlizen
Copy link

jlizen commented Nov 21, 2024

Going to also throw in a test against a more realistic application that has more complicated usage of the tokio runtime to see if anything rattles loose.

@jlizen
Copy link

jlizen commented Dec 16, 2024

I did some more testing in a real application. It is an axum server that handles user flows across a series of endpoints that span vending javascript to the client, vending inputs for a hashcash proof of work challenge, then various crypto operations to validate the solution and vend an encrypted token.

I injected the self-tracing task functionality in an outer tower middleware layer. It dumps a trace in case a request takes longer than 500ms. I then injected a 5% chance of a sleep inside a place where we make a remote call to AWS DynamoDB. This mimics a real performance bottleneck we encountered, due to overlarge table size, that was annoying to debug due to lack of specific metrics wrapping that remote call at the time.

I then simulated user realistic user flows that hit all endpoints in the browser, acquired tokens, etc. I didn't test at load but did send enough traffic to hit a bit of concurrency. I didn't see any signs of strange behavior with regard to the executor or otherwise. Seemed to behave as expected.

The stack trace was useful and pointed directly to the line of code that had the sleep added. See abbreviated output below:

warn 2024-12-16T18:45:51.813062959Z long running task: ╼ 
<snip> 
 └╼ test_app[5d607e4097ef04f4]::routing::apis::<test_app_backends[6d9dd0500a3162d0]::cp_datastore::backends::api_gateway_proxy::ApiGatewayProxy>::{closure#4} at /TestApp/test_app/src/routing.rs:269:22
    └╼ test_app[5d607e4097ef04f4]::handlers::verify::verify_handler::<test_app_backends[6d9dd0500a3162d0]::cp_datastore::backends::api_gateway_proxy::ApiGatewayProxy> at /TestApp/test_app/src/handlers/verify.rs:305:1
       └╼ test_app[5d607e4097ef04f4]::handlers::verify::verify_handler::<test_app_backends[6d9dd0500a3162d0]::cp_datastore::backends::api_gateway_proxy::ApiGatewayProxy>::{closure#0} at /TestApp/test_app/src/handlers/verify.rs:365:6
          └╼ test_app[5d607e4097ef04f4]::handlers::verify::verify_handler_inner::<test_app_backends[6d9dd0500a3162d0]::cp_datastore::backends::api_gateway_proxy::ApiGatewayProxy> at /TestApp/test_app/src/handlers/verify.rs:652:85
             └╼ test_app[5d607e4097ef04f4]::handlers::verify::check_and_insert_with_maybe_sleep at /TestApp/test_app/src/handlers/verify.rs:727:67
                └╼ <tokio[b5f4bbd0e64ee820]::time::sleep::Sleep as core[23aac720c248f827]::future::future::Future> at /TestApp/src/tokio/tokio/src/time/sleep.rs:448:22
                   └╼ <tokio[b5f4bbd0e64ee820]::time::sleep::Sleep> at /TestApp/src/tokio/tokio/src/time/sleep.rs:404:16

@arielb1 arielb1 force-pushed the self-trace branch 4 times, most recently from 24f51f0 to 048978f Compare January 9, 2025 15:47
@arielb1 arielb1 changed the title [WIP] make tracing a task public so self-tracing is possible make tracing a task public so self-tracing is possible Jan 9, 2025
@arielb1 arielb1 marked this pull request as ready for review January 9, 2025 15:47
@arielb1
Copy link
Contributor Author

arielb1 commented Jan 9, 2025

Now ready for review.

Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable enough.

Comment on lines 230 to 248
/// let mut trace = None;
///
/// // some future
/// let mut test_future = Box::pin(async move { tokio::task::yield_now().await; 0 });
///
/// // trace it once
/// Trace::root(std::future::poll_fn(|cx| {
/// if trace.is_none() {
/// // make sure we only trace once
/// let (res, captured) = Trace::capture(|| test_future.as_mut().poll(cx));
/// trace = Some(captured);
/// res
/// } else {
/// test_future.as_mut().poll(cx)
/// }
/// })).await;@@
///
/// // check that there are backtraces
/// assert!(!trace.unwrap().resolve_backtraces().is_empty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example isn't so nice. I would probably do something along these lines:

// some future
let mut test_future = std::pin!(async move { tokio::task::yield_now().await; 0 });

// trace it once
let (trace, res) = Trace::root(std::future::poll_fn(|cx| {
    let (res, trace) = Trace::capture(|| test_future.as_mut().poll(cx));
    Poll::Ready((trace, res))
})).await;

let output = match res {
    Poll::Ready(output) => output,
    Poll::Pending => test_future.await,
};

println!("{trace}");

Perhaps we should have a function along these lines to make it easier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will switch to your example. My intended use is closer to the task_trace_self test. I think a helper function would better fit in a separate PR.

Comment on lines 19 to 25
pin_project_lite::pin_project! { pub struct PrettyFuture<F: Future> {
#[pin]
f: Root<F>,
t_last: State,
logs: Arc<Mutex<Vec<Trace>>>,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatting here could be improved.

Comment on lines 9 to 15
use std::{
future::Future,
pin::Pin,
sync::{Arc, Mutex},
task::{Context, Poll},
time::{Duration, Instant},
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use this syntax for imports. Generally, we have one use std:: per module.

@Darksonn Darksonn requested a review from jswrenn January 9, 2025 16:07
@Darksonn
Copy link
Contributor

Seems like the doc-tests fail to compile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-taskdump --cfg tokio_taskdump R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants