Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect keying of by= results involving functions of keys #6708

Merged
merged 7 commits into from
Jan 17, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -131,6 +131,8 @@ rowwiseDT(

18. `as.data.table()` method for `data.frame`s (especially those with extended classes) is more consistent with `as.data.frame()` with respect to rention of attributes, [#5699](https://github.com/Rdatatable/data.table/issues/5699). Thanks @jangorecki for the report and fix.

19. Grouped queries on keyed tables no longer return an incorrectly keyed result if the _ad hoc_ `by=` list has some function call (in particular, a function which happens to return a strictly decreasing function of the keys), e.g. `by=.(a = rev(a))`, [#5583](https://github.com/Rdatatable/data.table/issues/5583). Thanks @AbrJA for the report and @MichaelChirico for the fix.

## NOTES

1. There is a new vignette on joins! See `vignette("datatable-joins")`. Thanks to Angel Feliz for authoring it! Feedback welcome. This vignette has been highly requested since 2017: [#2181](https://github.com/Rdatatable/data.table/issues/2181).
Expand Down
19 changes: 17 additions & 2 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -2014,8 +2014,8 @@ replace_dot_alias = function(e) {
if (verbose) {last.started.at=proc.time();catf("setkey() afterwards for keyby=.EACHI ... ");flush.console()}
setkeyv(ans,names(ans)[seq_along(byval)])
if (verbose) {cat(timetaken(last.started.at),"\n"); flush.console()}
} else if (keyby || (haskey(x) && bysameorder && (byjoin || (length(allbyvars) && identical(allbyvars,head(key(x),length(allbyvars))))))) {
setattr(ans,"sorted",names(ans)[seq_along(grpcols)])
} else if (.by_result_is_keyable(x, keyby, bysameorder, byjoin, allbyvars, bysub)) {
setattr(ans, "sorted", names(ans)[seq_along(grpcols)])
}
setalloccol(ans) # TODO: overallocate in dogroups in the first place and remove this line
}
Expand Down Expand Up @@ -3051,6 +3051,21 @@ rleidv = function(x, cols=seq_along(x), prefix=NULL) {
ids
}

.by_result_is_keyable = function(x, keyby, bysameorder, byjoin, byvars, bysub) {
if (keyby) return(TRUE)
k = key(x)
if (is.null(k)) return(FALSE) # haskey(x) but saving 'k' for below
if (!bysameorder) return(FALSE)
if (byjoin) return(TRUE)
if (!length(byvars)) return(FALSE)
if (!identical(byvars, head(k, length(byvars)))) return(FALSE) # match key exactly, in order
# For #5583, we also ensure there are no function calls in by (which might break sortedness)
if (is.name(bysub)) return(TRUE)
if (identical(bysub[[1L]], quote(list))) bysub = bysub[-1L]
if (length(all.names(bysub)) > length(byvars)) return(FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice refactor and helper function name!

TRUE
}

.is_withFALSE_range = function(e, x, root=root_name(e), vars=all.vars(e)) {
if (root != ":") return(FALSE)
if (!length(vars)) return(TRUE) # e.g. 1:10
Expand Down
7 changes: 7 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -20757,3 +20757,10 @@ as.data.frame.tbl = function(x) {
x
}
test(2302, attr(as.data.table(y), "t1"), attr(as.data.frame(y), "t1"))

# by=foo(KEY) does not retain key (no way to guarantee monotonic transformation), #5583
DT = data.table(a=1:2, key='a')
test(2303.1, DT[, .N, by=.(b=rev(a))], data.table(b=2:1, N=1L))
test(2303.2, DT[, .(N=1L), by=.(b=rev(a))], data.table(b=2:1, N=1L)) # ensure no interaction with GForce
DT = data.table(a=2:3, b=1:0, key=c('a', 'b'))
test(2303.3, DT[, .N, by=.(ab=a^b, d=c(1L, 1L))], data.table(ab=c(2, 1), d=1L, N=1L))
Loading