Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Performance problems of multi process implementation of a block of code #28

Open
wabomo opened this issue May 27, 2021 · 3 comments
Open

Comments

@wabomo
Copy link

wabomo commented May 27, 2021

In the "inside_polygon" function of "cytopy/data/geometry.py", the performance of multi process implementation is very poor. A 10000 line of data takes nearly 30 seconds, but it only takes less than 0.04 seconds to change to normal programming.For the time being, I simply handle it like this, adding a row count judgment.Maybe you have a better way to deal with it.

if len(df) < 100000:  # row count judgment.
    # Single thread implementation
    xy = df[[x, y]].values
    (min_x, min_y, max_x, max_y) = poly.bounds
    mask = []
    for p in xy:
        bol = min_x <= p[0] <= max_x and min_y <= p[1] <= max_y and point_in_poly(p, poly) is True
        mask.append(bol)
else:
    # Multi process implementation
    if njobs < 0:
        njobs = cpu_count()
    xy = df[[x, y]].values
    f = partial(point_in_poly, poly=poly)
    with Pool(njobs) as pool:
        mask = list(pool.map(f, xy))
return df.iloc[mask]

My configuration:
python: 3.8.7
Memory: 32g
CPU: i7-1165g7 (4 cores and 8 threads)

@burtonrj
Copy link
Owner

Hi, thanks for the feedback, is much appreciated. Sorry about the poor performance. To be honest, I haven't touched this particular functionality in a long time!

I don't think I ever did exhaustive performance checks (which is embarrassing) as it worked okay on the validation data I was running.

I'm currently working on a new release, albeit slowly (I'm the only dev on this project). If you're happy I would like to test your solution and include it in the new release with attribution ofc. Or you can make a PR on the v2.1 branch and make these changes.

Feel free to contact me at [email protected] for further discussion.

Thanks,
Ross

@whitews
Copy link

whitews commented Jul 1, 2022

Hey Ross!

Not sure if you've already addressed this, but wanted to chime in. I recently moved my implementation for finding points in a polygon from FlowKit to FlowUtils. Basically, all C extension based functions will go in FlowUtils from now on, and I have binary distributions on PyPI for FlowUtils for all the major platforms & recent Python versions. This should go a long way in helping those users without C compilers get the dependencies set up.

The points in ellipsoid (supporting true n-dimensional data) function was moved to FlowUtils as well. Both of these are compatible with the GatingML definitions, i.e. how to handle points on the boundary, etc.

On a side note, have you considered using FlowKit for CytoPy? CytoPy is a much more ambitious project in many ways. FlowKit aims to be an API for the more traditional flow concepts, and will never be or have a GUI in it's code base. I've toyed with the idea of an Electron based front-end but it would be a separate project and I need to finish the FlowJo workspace support in FlowKit first. Would be happy to discuss via email if you are interested.

Regards,
-Scott

@burtonrj
Copy link
Owner

Hey Scott!

Thanks for reaching out, I think if I remember rightly I have resolved this issue in the unpublished dev branch (v3.0) of CytoPy, but this is good to know. My implementation is still in Python so your C extension will probably be quicker.

I haven't considered FlowKit, mostly because CytoPy is aimed more at high-dimensional cytometry data and biomarker discovery. Its designed around the assumption of the following workflow:

  1. clean with autogates
  2. batch correct if needed
    1. cluster/classify
    1. visualise and extract results.

I do really admire what you have done with FlowKit however and I think it's major advantage is the GatingML compatibility which is a huge gap in CytoPy currently.

The future for CytoPy, once I've finished writing my PhD thesis and have time to implement it, is a decoupled ecosystem.
v3.0 of CytoPy is going to decouple many of the tools so that it's not such a behemoth. Tbh, CytoPy was a bit too ambitious to start with and this created a lot of problems in the original code base.

So in the future there will be CytoPy v3.0, which will depend on some smaller packages that can be used in isolation for those that don't need automated gating and a complex MongoDB database to track meta-data. There is probably a lot of overlap here in our work - I use your FlowUtils and FlowIO package extensively! Great job btw. If you ever need help maintaining it, please let me know, I owe you one for creating such great packages!

As for future CytoPy work, I've manage to make a start by pulling out some of functionality into smaller packages:
CytoTools [https://github.com/burtonrj/CytoTools] - reading files, transformations, dimension reduction, peak alignment and more
CytoPlots [https://github.com/burtonrj/CytoPlots] - classic facs plots in Python and some more general tools for stuff like scatterplots to be used with UMAP, tSNE etc
CytoCluster [https://github.com/burtonrj/CytoCluster] - as long as you have your results in a Pandas DataFrame you can apply all your popular clustering algorithms plus a novel ensemble clustering algorithm currently under peer-review (pre-print here: https://doi.org/10.1101/2022.06.30.496829)

Perhaps there is some areas we can collaborate here. I'm of the same mind and have thought about an Electron frontend or perhaps Django and React (a bit like your ReFlow work) but again, it always comes down to time.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants