Make scancode parallelism configurable #610

Closed
RomanIakovlev wants to merge 118 commits into prod from roman/scancode_parallelism

RomanIakovlev
Contributor

Fixes #609

yashkohli88 and others added 30 commits January 30, 2024 13:42
document._metadata.links.self.href is used to construct the file path or blob name when storing the harvested data. It should reflect the _schemaVersion of PodExtract. Added a test to verify this.
1. In AbstractProcessor, _schemaVersion combines the schemaVersion or toolVersion values declared along the class hierarchy.
2. Most component-related processors, e.g. mavenExtract or npmExtract, which are subclasses of AbstractClearlyDefinedProcessor, override toolVersion(); see the comments at AbstractProcessor.toolVersion().
This convention was introduced in the commit "isolate toolVersion from schemaVersion".

The exception is PodExtract. This commit aligns PodExtract with the rest of the component-related processors.
This follows the recent fix that excludes .git directory content (#525).
Bump the version to allow pod components to be reharvested.
The recent fix to exclude content in the .git directory (#525) from pod packages will cause the file count to be different from the previous version. Update the toolVersion for PodExtract to 2.0.0 to reflect this.
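The versioning convention described in these commit messages can be illustrated with a minimal Python sketch. It is not the crawler's actual code: the class and attribute names, the component-wise summing rule, and every version number except PodExtract's 2.0.0 are assumptions made for illustration. It only shows why bumping one subclass's toolVersion changes the combined version, and therefore the storage path derived from it.

```python
class AbstractProcessor:
    """Hypothetical base class; each class in the hierarchy declares a version."""

    tool_version = "1.0.0"  # illustrative value, not taken from the source

    @property
    def schema_version(self) -> str:
        # Assumption: versions declared along the class hierarchy are
        # combined by summing major, minor and patch components.
        totals = [0, 0, 0]
        for cls in type(self).__mro__:
            declared = cls.__dict__.get("tool_version")
            if declared is None:
                continue  # this class does not declare its own version
            for index, part in enumerate(declared.split(".")):
                totals[index] += int(part)
        return ".".join(str(value) for value in totals)


class AbstractClearlyDefinedProcessor(AbstractProcessor):
    tool_version = "0.0.0"  # illustrative value


class PodExtract(AbstractClearlyDefinedProcessor):
    tool_version = "2.0.0"  # the bump described in the commit message


print(PodExtract().schema_version)
```

Under these assumptions, changing only `PodExtract.tool_version` yields a new combined version, so previously stored results are not treated as current and the component can be reharvested.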
qtomlinson and others added 27 commits August 12, 2024 13:32
Fix fetching latest version for some pod components
The "always" traversal policy behaves as follows:
- If the tool result (e.g. licensee) for a specific component exists, the component is refetched and the tool is rerun.
- If the tool result for a specific component is missing, the "always" policy leads to an "Unreachable for reprocessing" status and the tool is skipped.

The "always" traversal policy is essentially a rerun of all previously run tools. This makes it cumbersome for retriggering a harvest, especially in integration tests.

The proposed new policy makes reharvesting simpler:
- When the tool result for a component is available, the tool is rerun and the tool result is updated, as with the "always" policy.
- When the tool result for a component is not available, the component is fetched and the tool is run.
In summary, the "reharvestAlways" policy reruns the harvest tools if results exist and runs them if results are missing.
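The difference between the two policies can be summarized as a small decision table. The following Python sketch is illustrative only: the function name `plan_harvest` and the returned action strings are hypothetical, and the crawler's real implementation may be structured differently.

```python
def plan_harvest(policy: str, has_result: bool) -> str:
    """Return the action taken for one tool on one component.

    Hypothetical helper: the policy names come from the commit message,
    the action strings are made up for this sketch.
    """
    if policy == "always":
        if has_result:
            return "refetch and rerun"
        # Missing result: the tool is skipped with this status.
        return "skip: Unreachable for reprocessing"
    if policy == "reharvestAlways":
        if has_result:
            return "refetch and rerun"
        # Missing result: fetch the component and run the tool anyway.
        return "fetch and run"
    raise ValueError(f"unknown policy: {policy}")


for policy in ("always", "reharvestAlways"):
    for has_result in (True, False):
        print(policy, has_result, "->", plan_harvest(policy, has_result))
```

The two policies differ only in the missing-result case, which is why "reharvestAlways" is the more convenient choice when a harvest must run regardless of prior state.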
Derive license from info.license over classifiers in pypi registry data
Deploy dev crawler via GitHub action
Introduce a new traversal policy
APP_VERSION replaces it
Add sha and version to '/' endpoint
@RomanIakovlev RomanIakovlev changed the base branch from master to prod October 23, 2024 11:17
@RomanIakovlev RomanIakovlev deleted the roman/scancode_parallelism branch October 23, 2024 11:17
Development

Successfully merging this pull request may close these issues.

Scancode parallelism is hardcoded to 2 processes
7 participants