notes on profiling Trio programs #943
I haven't done an extensive comparison, but last time I used vmprof it was definitely "coroutine aware" in the sense you mean. (Really, it just queries the interpreter for the current stack whenever the sampling interrupt fires, and the interpreter stack does not include suspended tasks.)
Regarding your categories: I think of "deterministic profilers" as more like a kind of coverage tool. They can be handy in specific circumstances if you're curious which functions are called, and how many times, but if I want to know about actual time (CPU or wall-clock), then I don't use them. Too many times I've had cProfile tell me that a function was slow because cProfile was making it slow.

Anyway, I wrote a little test program that I thought might misattribute time in the way you describe. Are you sure that your code isn't, like... accidentally calling sleep somewhere?

Test program:

```python
import trio
import trio.testing

async def main():
    async with trio.open_nursery() as nursery:
        async def sleepy_child():
            await trio.sleep(5)
            nursery.cancel_scope.cancel()

        s, r = trio.open_memory_channel(1)

        async def noisy_sender():
            while True:
                await s.send(None)

        async def noisy_receiver():
            while True:
                await r.receive()

        nursery.start_soon(sleepy_child)
        await trio.testing.wait_all_tasks_blocked()
        nursery.start_soon(noisy_sender)
        nursery.start_soon(noisy_receiver)

trio.run(main)
```
Thank you-- I think you're right and I'm revisiting vmprof. A few things I'd like to make sense of:
Yes, if we're to believe vmprof, sleep and sleep_until are taking (8% + 7%) of my app time. That's likely due to many loops which sleep for a short time, wake up, and do negligible work. It may reflect things not being event-driven enough: casual programmers tend to coordinate via periodic loops which share state, rather than deal with the intricacies of building and composing events. generic_function detail: …
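As an illustration of the two coordination styles mentioned above, here is a minimal, hypothetical sketch (the names shared_state, data_ready, and the timings are invented, not from the application being discussed): a consumer that polls shared state on a short sleep loop versus one that blocks on a trio.Event.

```python
# Illustrative only -- contrasts a periodic polling loop with an
# event-driven equivalent; not code from the application discussed here.
import trio

shared_state = {"value": None}
data_ready = trio.Event()

async def polling_consumer():
    # "casual" style: sleep briefly, wake up, check shared state.
    # Most wakeups do negligible work, which shows up as time in trio.sleep.
    while shared_state["value"] is None:
        await trio.sleep(0.01)
    print("polling consumer got", shared_state["value"])

async def event_driven_consumer():
    # event-driven style: block until the producer signals readiness,
    # with no periodic wakeups at all.
    await data_ready.wait()
    print("event consumer got", shared_state["value"])

async def producer():
    await trio.sleep(1)
    shared_state["value"] = 42
    data_ready.set()

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(polling_consumer)
        nursery.start_soon(event_driven_consumer)
        nursery.start_soon(producer)

trio.run(main)
```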
Huh. I have no idea, actually. If it was counting time sleeping ("no CPU use"), then I would expect that to show up inside …

The top part (above …) …

For the …, right now …

Oh, and I'm totally baffled by the …
Two cases:
With a bare-bones implementation of HTTP GET, I was able to reduce the CPU attributed to these requests from 11.3% to 3.0% of my app total (per vmprof). As mentioned, the original version used the asks library. Requests are made at a rate of 30 Hz, and the response size is 220K; the connection is not SSL. My replacement uses Trio networking directly along with httptools for parsing.
While implementing that I noticed that CPU use becomes high when the receive_some chunk size is under some threshold. E.g. at a 1K chunk size my performance was worse than asks, and at 10K it became significantly better than asks (with diminishing returns beyond that). For asks, there is a receive_some size of 10,000 which appears sufficient-- the issue there is more about h11 overhead (theelous3/asks#108). So at 1K size, clearly the 200+ Trio scheduling events per response are involved. It's an interesting and tangible case to look into as far as Trio library profiling goes. I'd like to try it against the cancellation refactoring PR #910 as well.

I have vmprof runs comparing 16K and 1K chunk sizes. Screenshots don't work well for comparison since the vmprof UI is not great (segment lengths don't correspond to time). To summarize: in the 16K case, receive_some -> wrapper -> _nonblocking_helper is mostly balanced between aenter, aexit, wait_readable, and Python coroutine internals. In the 1K case, aenter and aexit dominate, due to checkpoint_if_cancelled and cancel_shielded_checkpoint respectively.
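For concreteness, here is a rough sketch of the kind of bare-bones request loop being described, with the receive_some chunk size as the knob to vary. The host, port, path, and RECEIVE_SIZE are placeholders, and real HTTP parsing (httptools in the original) is omitted -- this is not the actual replacement code.

```python
# Sketch only: read an HTTP response in fixed-size chunks over a raw Trio
# stream. Each receive_some() call is a checkpoint, so a small RECEIVE_SIZE
# on a ~220K response means hundreds of scheduling events per request.
import trio

RECEIVE_SIZE = 16 * 1024   # compare e.g. 1024 vs. 16384

async def fetch(host, path):
    stream = await trio.open_tcp_stream(host, 80)
    async with stream:
        request = (
            f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        ).encode("ascii")
        await stream.send_all(request)

        chunks = []
        while True:
            chunk = await stream.receive_some(RECEIVE_SIZE)
            if not chunk:        # peer closed the connection
                break
            chunks.append(chunk)
        return b"".join(chunks)  # headers + body; parsing not shown

# trio.run(fetch, "example.com", "/")
```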
Like …, these functions make it hard to see which areas of my code are actually problematic. E.g. some child of the root may show 10% vs. a sibling at 3%, but the first is mostly … It may be an artifact or bug in this particular PyCharm call-tree display of the vmprof data. Things look fairly different in the vmprof UI, where I don't see these unaccounted times.
I tried PR #910; no impact on my app.
I'll try to describe a major weakness I've learned about sampling profilers.

Sampling profilers are good for looking at relative CPU use between parts of a program within one session. For example, assuming stack-based sampling, a profiler will reveal that all stacks at function A and below account for roughly 10% of runtime. However, that number is highly dependent on how many times function A is called relative to other parts of the program. That will not be deterministic for any real program employing concurrency and receiving nondeterministic input, so the measurement will be highly variable.

Sampling profilers are not good for comparing the effects of minor code changes between sessions. By "minor" I mean not substantially changing the call graph, and not making a gross performance change in some hot code path. Unfortunately many of the important optimizations we make will be "minor" under this definition (e.g. speeding up some often-used function by 25%, or detecting a 15% CPU regression in a 3rd-party websocket package). The reason is that the output of a sampling profiler cannot reasonably be normalized to "how much CPU does a call to this function take", because it lacks information on call counts.

You may conclude: use a traditional timing approach for measuring minor optimizations. However, that's currently hard to do for async functions-- which was one of the motivations for exploring profilers. So I hope the introduction of task-friendly timing primitives will help the situation (e.g. task_perf_counter).
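A toy calculation of the normalization problem (all numbers invented for illustration): two sessions can both report "10% in f", yet imply very different per-call costs, and the profiler alone doesn't give the call count needed to tell them apart.

```python
# Invented numbers, purely to illustrate why "% of samples" can't be
# turned into "cost per call" without a call count from somewhere else.
session_a = {"total_cpu_s": 100.0, "fraction_in_f": 0.10, "calls_to_f": 1000}
session_b = {"total_cpu_s": 100.0, "fraction_in_f": 0.10, "calls_to_f": 2000}

for name, s in [("A", session_a), ("B", session_b)]:
    per_call_ms = 1000 * s["total_cpu_s"] * s["fraction_in_f"] / s["calls_to_f"]
    print(f"session {name}: f() appears to cost {per_call_ms:.1f} ms/call")

# Both sessions show the same 10% for f, but the per-call cost differs by 2x;
# a sampling profiler doesn't report calls_to_f, so it can't distinguish them.
```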
It's definitely true that sampling profilers can't be used on their own to measure speedups, because they don't measure speed. I feel like the most classic boring approach to optimization is:
There are a lot of other approaches too of course – e.g. sampling profilers usually sample a random subset of CPU cycles, so they're most useful for discovering where CPU cycles are going. Another popular approach for production distributed systems is to sample a random subset of requests and take detailed traces of them, so that later on you can ask questions like "what % of our requests hit this cache?" or "for requests in the 99th latency percentile, which backend service are they getting stuck waiting on?" I don't think I have enough information to understand why it's difficult to set up a reproducible benchmark for your system. But I agree that it's hard to do performance work without one!
Writing benchmarks is a tall order for many scenarios-- application-level code under rapid development, not-very-engineered domain-specific components, 3rd-party code, ... Even for well-defined pieces like a networking library, it's hard to write a benchmark which represents reality, or which has the proper tooling for record and replay of real input. I've only had …

case 1: the unneeded lock

While reviewing our websocket integration, I realized that we have a lock on send which is redundant, because trio-websocket is locking internally. Original method:

```python
async def send(self, cmd):
    send_metric.inc()
    send_bytes_metric.inc(len(cmd))
    async with self._send_lock:
        await self._ws.send_message(cmd)
```

So I'm curious-- what was the CPU cost of that lock? I use TrioPerfTimer, which is a context manager using …
and run the full application, as-is. On exit:

…

Run again with the lock removed:

…

So the answer is ~30 usec per call.

case 2: the not-so-regression

The sampling profiler appeared to highlight a regression in receive-side code of trio-websocket and/or wsproto. I ran a number of sessions, and it seemed real. It's 3rd-party code, and neither library has a benchmark suite. So I pulled the trio-websocket source into my project, added TrioPerfTimer around the suspected code (the body of the reader_task loop), and ran the actual application with the old and new versions of trio-websocket and wsproto.

Old:

…

New:

…

After several runs yielding similar results, the answer: no regression. Being able to ask targeted performance questions of a real running application this easily is indispensable.
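TrioPerfTimer's implementation isn't shown in this thread; as a rough stand-in, a plain context manager around time.perf_counter() that accumulates total time and call count gives the flavor of the technique. Unlike a task-aware timer, this naive version also counts time the task spends suspended at an await inside the block, so it's only a sketch of the idea, not the tool used above.

```python
# Naive stand-in for a perf-timer context manager: accumulate elapsed
# wall-clock time and call count, report the average on program exit.
# Names and output format are invented; this is not TrioPerfTimer.
import atexit
import time

class PerfTimer:
    def __init__(self, name):
        self.name = name
        self.total = 0.0
        self.count = 0
        atexit.register(self._report)

    def __enter__(self):
        self._t0 = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.total += time.perf_counter() - self._t0
        self.count += 1

    def _report(self):
        if self.count:
            avg_us = 1e6 * self.total / self.count
            print(f"{self.name}: {self.count} calls, avg {avg_us:.1f} us/call")

# usage, mirroring the send() method above:
#     _send_timer = PerfTimer("send")
#     ...
#     with _send_timer:
#         async with self._send_lock:
#             await self._ws.send_message(cmd)
```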
For "case 2" above, I went back to look at how the sampling profiler output may have been misleading. Perhaps it's counting some I/O blocking cases as using CPU. Specifically, I'm suspecting the "empty time" of Trio _epoll_wait() and/or wait_task_resheduled() (i.e. the periods where these calls are exactly at the bottom of the call stack). I don't understand what mysterious thing is consuming about 50% of _epoll_wait() execution time. (I should take longer captures in the future-- |
|
Profiler is vmprof 0.4.15 on x86_64 GNU/Linux using default settings. I might try instrumenting _epoll_wait() with TrioPerfTimer to get some opinion on actual timing. |
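One way to get independent numbers for that "empty time" question, rather than trusting the profiler, would be a Trio Instrument that times the before_io_wait / after_io_wait hooks (these hooks are part of trio.abc.Instrument). The sketch below is hypothetical and uses a trivial trio.sleep as a stand-in for the real application workload.

```python
# Measure how much wall-clock time the run loop spends blocked waiting for
# I/O, using Trio's instrumentation hooks instead of a sampling profiler.
import time
import trio

class IOWaitTimer(trio.abc.Instrument):
    def __init__(self):
        self.io_wait_total = 0.0

    def before_io_wait(self, timeout):
        self._t0 = time.perf_counter()

    def after_io_wait(self, timeout):
        self.io_wait_total += time.perf_counter() - self._t0

async def main():
    await trio.sleep(1)   # stand-in for the real application workload

timer = IOWaitTimer()
start = time.perf_counter()
trio.run(main, instruments=[timer])
wall = time.perf_counter() - start
print(f"blocked in I/O wait: {timer.io_wait_total:.3f}s of {wall:.3f}s total")
```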
Timing results from …:

…

The BusyResourceError case was never hit.
Hey everyone, I am the author of yappi and have just seen this thread by chance. For the current status, please see this: sumerc/yappi#21
profiler categories: deterministic (cProfile, yappi), sampling (vmprof, pyflame)
deterministic:
… perf_counter(), but only 1 usec resolution for process_time().

sampling:
…

the bad news: none of these profilers appear to be coroutine aware
Neither type of profiler will correctly report "total time" in the context of cooperative multitasking. E.g. function awaits on trio.sleep, which yields to scheduler, which resumes another task, and so on, all incorrectly attributed as children of the original function.
The profilers should somehow use yield or else details of the particular event loops (asyncio, Trio, etc.) to stop attributing time to the parent function. They apparently already do this for select, etc.
It seems like a ripe area to improve tooling for async / await / Trio. I can't find any prior work for Python addressing this.
the good news: "own time" and function counts are still useful in the context of coroutines
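A quick way to see (or check) the misattribution described above for the deterministic case: profile a tiny Trio program where one task sleeps while another keeps the scheduler busy, then inspect whether the sleeping coroutine's cumulative time includes its suspended period. The program below is a hypothetical harness, not taken from the original report.

```python
# Harness for inspecting how cProfile attributes time across Trio's
# scheduler: sleeper() uses no CPU for ~1 s while spinner() keeps yielding.
import cProfile
import pstats
import trio

async def sleeper():
    await trio.sleep(1)          # suspended, no CPU used

async def spinner():
    with trio.move_on_after(1):
        while True:
            await trio.sleep(0)  # checkpoint; keeps the scheduler active

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(sleeper)
        nursery.start_soon(spinner)

cProfile.run("trio.run(main)", "trio_profile.out")
pstats.Stats("trio_profile.out").sort_stats("cumulative").print_stats(20)
```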