Replace TransportChunk with re_sorbet::ChunkBatch #8945

Merged (40 commits) on Feb 11, 2025

Member

@emilk emilk commented Feb 6, 2025

Related

What

An Arrow record batch needs to follow a specific schema to be compatible with Rerun,
and that schema is defined in ChunkSchema. If a record batch matches the schema, it can be converted to a `ChunkBatch`.

The schema has:

  • One RowId column
  • N index (time) columns
  • N data (component) columns, all ListArrays

ChunkBatch::try_from(RecordBatch) will automatically wrap data columns in ListArray if needed.

impl AsRef<ArrowRecordBatch> for ChunkBatch { … }
impl From<ChunkBatch> for ArrowRecordBatch { … }
impl TryFrom<ArrowRecordBatch> for ChunkBatch { … }
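
To illustrate how these three conversions fit together, here is a minimal sketch using toy types. `Column`, `RecordBatch`, `ChunkBatch`, and `SchemaError` below are hypothetical stand-ins written for this example, not the real arrow-rs or re_sorbet types, and the single "exactly one RowId column" check is an assumed simplification of what a schema check might involve.

```rust
use std::convert::TryFrom;

// Toy stand-ins for illustration only; NOT the arrow-rs or re_sorbet API.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    RowId(Vec<u64>),
    Index(Vec<i64>),
    /// A data column that is already list-typed (one list per row).
    List(Vec<Vec<f32>>),
    /// A data column holding one plain value per row.
    Plain(Vec<f32>),
}

#[derive(Debug, Clone, PartialEq)]
struct RecordBatch {
    columns: Vec<Column>,
}

/// A record batch that is known to follow the (toy) chunk schema.
#[derive(Debug, Clone, PartialEq)]
struct ChunkBatch {
    batch: RecordBatch,
}

#[derive(Debug, PartialEq)]
enum SchemaError {
    /// The batch did not contain exactly one RowId column.
    WrongRowIdCount(usize),
}

impl TryFrom<RecordBatch> for ChunkBatch {
    type Error = SchemaError;

    fn try_from(batch: RecordBatch) -> Result<Self, Self::Error> {
        // Assumed check: the schema requires exactly one RowId column.
        let row_id_count = batch
            .columns
            .iter()
            .filter(|c| matches!(c, Column::RowId(_)))
            .count();
        if row_id_count != 1 {
            return Err(SchemaError::WrongRowIdCount(row_id_count));
        }

        // Wrap plain data columns in single-element lists; leave the rest untouched.
        let columns = batch
            .columns
            .into_iter()
            .map(|c| match c {
                Column::Plain(values) => {
                    Column::List(values.into_iter().map(|v| vec![v]).collect())
                }
                other => other,
            })
            .collect();

        Ok(Self { batch: RecordBatch { columns } })
    }
}

impl AsRef<RecordBatch> for ChunkBatch {
    fn as_ref(&self) -> &RecordBatch {
        &self.batch
    }
}

impl From<ChunkBatch> for RecordBatch {
    fn from(chunk: ChunkBatch) -> Self {
        chunk.batch
    }
}
```

In this sketch the fallible direction (`TryFrom`) is the only place validation and wrapping happen, so the infallible directions (`AsRef`, `From`) can hand back the inner batch without further checks.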

TODO

  • Fix all TODOs
  • Fix rust tests
  • Fix python roundtrip tests
  • Run full test suite

For future PRs

  • Put the magic string constants into constants
  • Introduce DataframeBatch
  • Rename ComponentColumnDescriptor to DataColumnSchema, etc
  • ComponentColumnDescriptor is both used for dataframes and chunks. Resolve somehow

@emilk emilk added 🏹 arrow concerning arrow exclude from changelog PRs with this won't show up in CHANGELOG.md labels Feb 6, 2025

github-actions bot commented Feb 6, 2025

Web viewer built successfully. If applicable, you should also test it:

  • I have tested the web viewer
Result Commit Link Manifest
8282fc5 https://rerun.io/viewer/pr/8945 +nightly +main

Note: This comment is updated whenever you push a commit.

@emilk emilk force-pushed the emilk/chunk-schema branch 2 times, most recently from 7e171a3 to 3e5ad9c Compare February 9, 2025 09:44
@emilk emilk force-pushed the emilk/chunk-schema branch from 3e5ad9c to f01efda Compare February 10, 2025 14:56
@emilk emilk changed the title Add re_sorbet::ChunkSchema Replace TransportChunk with re_sorbet::ChunkBatch Feb 10, 2025
Member Author

emilk commented Feb 11, 2025

@rerun-bot full-check


github-actions bot commented Feb 11, 2025

Started a full build: https://github.com/rerun-io/rerun/actions/runs/13259070067

@@ -61,9 +61,9 @@ def __init__(self, metadata: dict[bytes, bytes], col: pa.Array):
    def component_descriptor(self) -> ComponentDescriptor:
        kwargs = {}
        if SORBET_ARCHETYPE_NAME in self.metadata:
            kwargs["archetype_name"] = "rerun.archetypes" + self.metadata[SORBET_ARCHETYPE_NAME].decode("utf-8")
Member Author

Note: this used to be buggy (missing period after archetypes)
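
A missing separator like this is easy to pin down with a tiny test. The sketch below restates the idea in Rust rather than Python; `qualified_archetype_name` is a hypothetical helper, and it assumes the metadata stores the short archetype name (for example `Points3D`).

```rust
/// Hypothetical helper: builds a fully qualified archetype name from a short name.
fn qualified_archetype_name(short_name: &str) -> String {
    // The trailing "." belongs to the prefix. Without it the result
    // would be "rerun.archetypesPoints3D".
    format!("rerun.archetypes.{}", short_name)
}
```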

@@ -333,84 +327,32 @@ async fn stream_catalog_async(

    re_log::info!("Starting to read...");
    while let Some(result) = resp.next().await {
        let input = TransportChunk::from(result.map_err(TonicStatusError)?);

        // Catalog received from the ReDap server isn't suitable for direct conversion to a Rerun Chunk:
Contributor

nice to see this gone!

Contributor

@zehiko zehiko left a comment

nice work! this was quite easy to follow and makes sense to me.

crates/store/re_sorbet/src/chunk_batch.rs: two outdated review threads (resolved)
@@ -35,6 +35,16 @@ pub enum ColumnDescriptor {
}

impl ColumnDescriptor {
/// Debug-only sanity check.
Member

@teh-cmc teh-cmc Feb 11, 2025

There's nothing in this method that guarantees debug-only behavior though? (same elsewhere)

Member Author

No guarantees other than the docstring contract
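
For comparison, one way to have the compiler enforce the debug-only behavior, instead of relying on the docstring alone, is to gate the body on `cfg!(debug_assertions)` (or to use `debug_assert!`). This is a minimal sketch with a hypothetical `ToyColumn` type and assumed checks, not the code from this PR.

```rust
/// Toy stand-in for a column descriptor; not the real `ColumnDescriptor`.
#[derive(Debug, Clone)]
struct ToyColumn {
    name: String,
    is_list: bool,
}

impl ToyColumn {
    /// Debug-only sanity check: the checks are skipped when
    /// `debug_assertions` is disabled (the default for release builds).
    fn sanity_check(&self) {
        if cfg!(debug_assertions) {
            assert!(!self.name.is_empty(), "column name must not be empty");
            assert!(self.is_list, "data columns are expected to be lists");
        }
    }
}
```

The trade-off is that callers can no longer opt in to the check in release builds, which may or may not be what the docstring contract intends.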

Member Author

emilk commented Feb 11, 2025

@teh-cmc as described in #8744 there are two types of record batches we care about:

  • chunks (single-entity)
  • dataframes (multi-entity)

This PR implements the first, with the second one coming as a second PR

Member

teh-cmc commented Feb 11, 2025

> @teh-cmc as described in #8744 there are two types of record batches we care about:
>
> * chunks (single-entity)
> * dataframes (multi-entity)
>
> This PR implements the first, with the second one coming as a second PR

I guess most of my confusion comes from the mention of sorbet: all of this very much looks like plain-old Rerun (although I guess plain-old Rerun can do multi-entity records now, which is the new thing).

(That ticket even says: "We will stop using sorbet name until we have cycles to make this a more universal spec.")

Member Author

emilk commented Feb 11, 2025

I see "sorbet" as the spec for how we encode Rerun data on-the-wire. As such, it covers both chunks and dataframes.

We don't use sorbet as a name in any of the meta tags (yet).

@emilk emilk merged commit d39fcfe into main Feb 11, 2025
36 checks passed
@emilk emilk deleted the emilk/chunk-schema branch February 11, 2025 14:11
emilk added a commit that referenced this pull request Feb 12, 2025