Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove hot uses of pathlib #14110

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

bonzini
Copy link
Collaborator

@bonzini bonzini commented Jan 10, 2025

pathlib's implementation of Path iteration is expensive; "foo.parents" has quadratic complexity when building it:

        return self._path._from_parsed_parts(self._drv, self._root,
                                             self._tail[:-idx - 1])

and comparisons performed by "path in file.parents" potentially have the same issue. Introduce a new function that checks whether a file is under a path in the same way, removing usage of Path from the biggest hotspot.

@bonzini bonzini requested a review from jpakkane as a code owner January 10, 2025 09:06
@bonzini bonzini force-pushed the speedup-pathlib branch 2 times, most recently from 0b2ef7f to bf03ed2 Compare January 10, 2025 12:11
@bonzini bonzini added this to the 1.8 milestone Jan 10, 2025
@eli-schwartz
Copy link
Member

Introduce a new function that checks whether a file is under a path in the same way, removing usage of Path from the biggest hotspot.

Isn't this just os.path.commonpath? This is a well-known problem with pathlib -- it loves to reimplement os.path via complicated and fiddly data structures with questionable correctness characteristics. And apparently questionable performance characteristics too...

@bonzini
Copy link
Collaborator Author

bonzini commented Jan 10, 2025

It's similar to commonpath indeed, but 1) it returns a boolean 2) it only operates on two paths 3) it is not symmetrical (one argument is the path and the other is a file) 4) it is faster.

And apparently questionable performance characteristics too...

Yeah, I like pathlib's ergonomics but I draw the line at quadratic implementations...

@eli-schwartz
Copy link
Member

It's similar to commonpath indeed, but 1) it returns a boolean 2) it only operates on two paths 3) it is not symmetrical (one argument is the path and the other is a file) 4) it is faster.

Faster, maybe. I haven't timed it. But with regard to returning a boolean etc. the point of commonpath isn't "this function returns whether a path is an ancestor" but rather that it's the trivial building block for such a check.

It's code we already use in several places:

if os.path.commonpath(x, y) == x:
    do_thing_with(y)

@bonzini
Copy link
Collaborator Author

bonzini commented Jan 10, 2025

Yes, it is faster: :)

>>> timeit.timeit('path_starts_with("/usr/bin/", "/usr/bin/foo/bar/baz")',  
     globals={'path_starts_with':path_starts_with, 'is_windows': 'windows', 'os': os})
5.237490472005447

>>> timeit.timeit('path_starts_with("/usr/bin/foo/bar/baz", "/usr/bin/")',  
     globals={'path_starts_with':path_starts_with, 'is_windows': 'windows', 'os': os})
6.680338967334726

>>> timeit.timeit('os.path.commonpath(["/usr/bin/", "/usr/bin/foo/bar/baz"]) == "/usr/bin"',
     globals={'path_starts_with':path_starts_with, 'is_windows': 'windows', 'os': os})
13.45890446798876

Also it doesn't crash and burn if you have different drives on Windows. :) Should I adjust other uses of commonpath?

@eli-schwartz
Copy link
Member

eli-schwartz commented Jan 10, 2025

Also it doesn't crash and burn if you have different drives on Windows. :)

Yup, we catch ValueError for those. So I suppose it may anyways be worth extracting into a dedicated function. :P

Should I adjust other uses of commonpath?

Yes, please. If we need this function for performance in a hot path anyway, then we might as well use it everywhere the same code pattern is used.

@bonzini bonzini force-pushed the speedup-pathlib branch 2 times, most recently from d1eaebd to 7d781c8 Compare January 10, 2025 15:37
mesonbuild/utils/universal.py Outdated Show resolved Hide resolved
mesonbuild/interpreter/interpreter.py Outdated Show resolved Hide resolved
mesonbuild/interpreter/interpreter.py Outdated Show resolved Hide resolved
mesonbuild/interpreter/interpreter.py Outdated Show resolved Hide resolved
@bonzini bonzini added the pathlib upgrading to/downgrading from pathlib.Path--including temporary workarounds for older Python label Jan 28, 2025
@bonzini bonzini force-pushed the speedup-pathlib branch 2 times, most recently from aacb03c to f3786b3 Compare January 28, 2025 14:41
Introduce an alternative to os.path.commonpath(name, path) == path,
which is a common idiom throughout Meson.  Call it is_parent_path
just like the existing static method in Generator.
It is a bit faster and handles drives on Windows without the need
for an exception handler.

Signed-off-by: Paolo Bonzini <[email protected]>
This saves about 1% of execution time; path manipulation is expensive.
About 1% more could be saved by avoiding usage of "foo in bar.parents",
which is an expensive way to say "bar.startswith(foo)".

Signed-off-by: Paolo Bonzini <[email protected]>
pathlib's implementation of Path iteration is expensive;
"foo.parents" has quadratic complexity when building it:

        return self._path._from_parsed_parts(self._drv, self._root,
                                             self._tail[:-idx - 1])

and comparisons performed by "path in file.parents" potentially
have the same issue.  Use the newly introduced "is_parent_path"
function to check whether a file is under a path in the same way.
This removes Path from its hottest use.

Signed-off-by: Paolo Bonzini <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pathlib upgrading to/downgrading from pathlib.Path--including temporary workarounds for older Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants