Draft: First pass at a BenchmarkV2 that can hold multiple splits. #273

Draft · wants to merge 1 commit into base: main
Conversation

@cwognum (Collaborator) commented Feb 27, 2025

This is currently a draft. Do not merge!

@cwognum cwognum requested a review from Copilot February 27, 2025 21:03
@cwognum cwognum marked this pull request as draft February 27, 2025 21:03
Copilot AI left a comment

PR Overview

This PR implements a draft of BenchmarkV2 that supports multiple splits and updates several related modules such as predictions, splitting, dataset access, evaluation results, and error handling.

  • Introduces BenchmarkV2Prediction for uploadable benchmark evaluations.
  • Refactors splitting functionality by replacing a single split with multiple splits (using SplitV2) and updates associated validators and error messages.
  • Updates dataset, evaluation, and artifact modules to align with the new benchmark and split design.
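
The move from a single split to multiple labeled splits might look roughly like the sketch below. This is a simplified stand-in using stdlib dataclasses: the real SplitV2 and BenchmarkV2Specification in this PR are Pydantic models, and the exact field names and container types here are assumptions, not the actual API.

```python
from dataclasses import dataclass, field


@dataclass
class SplitV2:
    """Simplified stand-in for the SplitV2 in polaris/benchmark/_split_v2.py.

    The PR renames the fields to train/test (over training/test).
    """

    train: list[int] = field(default_factory=list)
    test: list[int] = field(default_factory=list)

    @property
    def max_index(self) -> int:
        """Largest dataset index referenced by this split (-1 if empty)."""
        return max([*self.train, *self.test], default=-1)


@dataclass
class BenchmarkV2Specification:
    """Simplified stand-in: a benchmark holding several labeled splits."""

    splits: dict[str, SplitV2] = field(default_factory=dict)


spec = BenchmarkV2Specification(
    splits={
        "random": SplitV2(train=[0, 1, 2], test=[3, 4]),
        "scaffold": SplitV2(train=[0, 3], test=[1, 2, 4]),
    }
)
# A BenchmarkV2Prediction would then reference a split by its label,
# e.g. split_label="scaffold".
print(max(s.max_index for s in spec.splits.values()))  # 4
```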

Reviewed Changes

  • polaris/predictions/__init__.py: Adds the new BenchmarkV2Prediction model (minor spelling improvements still needed in field descriptions).
  • polaris/evaluate/_predictions.py: Introduces the prepare_predictions function and updates type aliases.
  • polaris/dataset/_subset.py: Refactors __iter__ and adjusts the filter_targets and extend_inputs implementations.
  • polaris/benchmark/_split.py: Renames and refines the split mixin with updated logging messages.
  • polaris/benchmark/__init__.py: Exposes the new SplitV2 in the public API.
  • polaris/_artifact.py: Adds __repr__, _repr_html_, and __str__ methods for better artifact display.
  • polaris/benchmark/_split_v2.py: Renames fields (using train/test over training/test) and revises validators and error messages.
  • polaris/utils/errors.py: Introduces a new InvalidSplitError for more granular error handling.
  • polaris/evaluate/_results.py: Removes the V2 evaluation result classes to consolidate evaluation around V1.
  • polaris/utils/types.py: Narrows the ListOrArrayType alias to include numeric types.
  • polaris/evaluate/__init__.py: Adjusts exports to reflect the removal of the V2 results classes.
  • polaris/benchmark/_benchmark_v2.py: Refactors BenchmarkV2Specification to support multiple splits via a generator return signature.
  • polaris/benchmark/_base.py: Refactors BenchmarkSpecification and removes redundant __str__ implementations.
  • polaris/evaluate/_metric.py: Removes a duplicate prepare_predictions implementation.
  • polaris/competition/__init__.py: Removes the __str__ method in favor of __repr__ for consistency.
  • polaris/evaluate/_metadata.py: Removes the ResultsMetadataV2 definitions in favor of a single metadata format.
  • polaris/dataset/_base.py: Removes an unnecessary __str__ override.

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

description="The benchmark specification that these predictions were generated for."
)
split_label: str = Field(
description="The label of the split specified in the benchmark that these prediction where generated with."

Copilot AI commented Feb 27, 2025

[nitpick] The description for split_label contains a typo ('these prediction where generated with'). Consider updating it to 'the label of the split specified in the benchmark that these predictions were generated with'.

Suggested change
description="The label of the split specified in the benchmark that these prediction where generated with."
description="The label of the split specified in the benchmark that these predictions were generated with."

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

return self
if max(split.max for split in self.splits) >= dataset_length:

Copilot AI commented Feb 27, 2025

The code uses 'split.max' to determine the maximum index but SplitV2 defines a 'max_index' property. Update the reference to 'split.max_index' to correctly validate the split indices against the dataset length.

Suggested change
if max(split.max for split in self.splits) >= dataset_length:
if max(split.max_index for split in self.splits) >= dataset_length:
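
Applied in context, the corrected bound check might look like the sketch below. The SplitV2 stand-in and the validate_splits helper are illustrative assumptions for this comment; the real validator belongs to BenchmarkV2Specification and SplitV2 is a Pydantic model in polaris/benchmark/_split_v2.py.

```python
from dataclasses import dataclass


@dataclass
class SplitV2:
    """Illustrative stand-in exposing the max_index property the comment refers to."""

    train: list[int]
    test: list[int]

    @property
    def max_index(self) -> int:
        """Largest dataset index referenced by this split."""
        return max(self.train + self.test)


def validate_splits(splits: list[SplitV2], dataset_length: int) -> None:
    # Use max_index (not a nonexistent .max attribute) to check that no
    # split references an index beyond the end of the dataset.
    if max(split.max_index for split in splits) >= dataset_length:
        raise ValueError("A split references an index beyond the dataset length")


splits = [SplitV2(train=[0, 1], test=[2]), SplitV2(train=[3], test=[4])]
validate_splits(splits, dataset_length=5)  # passes: largest index is 4
```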
