Mutating joins relationship documentation issues #7622

Open
bounlu opened this issue Jan 6, 2025 · 1 comment

bounlu commented Jan 6, 2025

Mutate-joins (dplyr) documentation says:

relationship

Handling of the expected relationship between the keys of x and y. If the expectations chosen from the list below are invalidated, an error is thrown.

NULL, the default, doesn't expect there to be any relationship between x and y. However, for equality joins it will check for a many-to-many relationship (which is typically unexpected) and will warn if one occurs, encouraging you to either take a closer look at your inputs or make this relationship explicit by specifying "many-to-many".

See the Many-to-many relationships section for more details.

"one-to-one" expects:
Each row in x matches at most 1 row in y.
Each row in y matches at most 1 row in x.

"one-to-many" expects:
Each row in y matches at most 1 row in x.

"many-to-one" expects:
Each row in x matches at most 1 row in y.

"many-to-many" doesn't perform any relationship checks, but is provided to allow you to be explicit about this relationship if you know it exists.

relationship doesn't handle cases where there are zero matches. For that, see unmatched.

I see there are 2 issues:

  1. The one-to-many and many-to-one descriptions look awkward and reversed to me. Logically, they should read from left to right, x -> y. So one-to-many should mean "Rows in x may match multiple rows in y". Similarly, many-to-one should mean "Multiple rows in x may match the same row in y".

  2. Specifying relationship explicitly as one-to-many or many-to-one does not generate any warning or error if the data contains no such matches, i.e. if only one-to-one matches exist. I would expect an error to be thrown if the specified relationship does not exist in the data, as the documentation says; otherwise I don't see the point of specifying the relationship explicitly.

I have read this but I believe the above issues still remain to be resolved.
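
To make the second point concrete, here is a minimal sketch with made-up data (the tibbles and the column names x_val and y_val are illustrative, not taken from a real dataset). Every key appears exactly once in each table, so the match is one-to-one, yet the join should complete without an error or warning because "one-to-many" only restricts how rows in y match rows in x:

```r
library(dplyr)

# Hypothetical one-to-one data: every key appears once in each table
x <- tibble(a = c(1, 2), x_val = c("p", "q"))
y <- tibble(a = c(1, 2), y_val = c("r", "s"))

# "one-to-many" only requires that each row in y matches at most 1 row in x,
# which one-to-one data satisfies, so no error is expected here
out <- left_join(x, y, by = "a", relationship = "one-to-many")
nrow(out)
```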

@bounlu bounlu marked this as a duplicate of #7623 Jan 6, 2025
DavisVaughan (Member) commented

The problem with:

one-to-many should mean "Rows in x may match multiple rows in y"

is that on its own it doesn't tell the full story, i.e. the important point is actually the restriction on y, which isn't mentioned there. Really, I guess the full documentation would look like:

"one-to-one" expects:
Each row in x matches at most 1 row in y.
Each row in y matches at most 1 row in x.

"one-to-many" expects:
Each row in x matches any number of rows in y.
Each row in y matches at most 1 row in x.

"many-to-one" expects:
Each row in x matches at most 1 row in y.
Each row in y matches any number of rows in x.

We could probably do that; I do think it makes the documentation clearer here. I think previously I was just trying to list the restrictions it placed on x or y, so I left out "Each row in x matches any number of rows in y." because that's not restricting anything.
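
As a rough illustration of the expanded wording, the sketch below uses hypothetical data in which one row of x matches two rows of y. Under the descriptions above, "one-to-many" should accept this and "many-to-one" should reject it. The tibbles, the y_val column, and the tryCatch() wrapper are assumptions added for the example:

```r
library(dplyr)

# Hypothetical data: key 1 appears once in x but twice in y
x <- tibble(a = c(1, 2))
y <- tibble(a = c(1, 1, 2), y_val = c("r", "s", "t"))

# Allowed: each row in y still matches at most 1 row in x
ok <- left_join(x, y, by = "a", relationship = "one-to-many")

# Rejected: row 1 of x matches 2 rows in y, which "many-to-one" forbids.
# tryCatch() records whether the join raised an error.
failed <- tryCatch(
  {
    left_join(x, y, by = "a", relationship = "many-to-one")
    FALSE
  },
  error = function(e) TRUE
)
```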


one-to-many is by definition a superset of one-to-one, so it's totally valid to write relationship = "one-to-many" on a one-to-one dataset. Here is a case where one-to-many is useful, because it catches a duplicated key in x:

library(dplyr)

x <- tibble(
  a = c(1, 1, 2)
)
y <- tibble(
  a = c(1, 2)
)

left_join(x, y, relationship = "one-to-many")
#> Joining with `by = join_by(a)`
#> Error in `left_join()`:
#> ! Each row in `y` must match at most 1 row in `x`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
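
For comparison, a sketch of how the same inputs should behave under the other declarations, assuming the checks work as described in the documentation quoted earlier. The result names res_many_to_one and res_swapped are made up for the example:

```r
library(dplyr)

x <- tibble(a = c(1, 1, 2))
y <- tibble(a = c(1, 2))

# Each row in x matches at most 1 row in y, so "many-to-one" is satisfied
res_many_to_one <- left_join(x, y, by = "a", relationship = "many-to-one")

# With the tables swapped, the unique keys are on the left,
# so "one-to-many" is satisfied as well
res_swapped <- left_join(y, x, by = "a", relationship = "one-to-many")
```

Both joins keep all three matched rows; only the declared direction of the relationship differs.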
