feat: improve merge.data.table error messages for missing keys (#6556) #6713
base: master
Hi @aitap,
It looks like the atime tests will only work with branches created inside the Rdatatable/data.table repository, not the outside forks. Unless you are a member of a team with write access to create new branches, there may be nothing you can fix. Apologies for the poke, @Anirban166, could you help with this? If this is not by design (e.g. performance tests are relatively expensive and therefore must be run only from local branches), could the action create a local branch from
Hi @MichaelChirico,
You can ignore the atime issue
@MichaelChirico thanks for the clarification
error = 'must be valid column names in x and y')
test(1962.021, {
  if (!"z" %in% colnames(DT1) || !"z" %in% colnames(DT2)) {
    stop("The columns listed in `by` are missing from either x or y: z")
please remove backticks, which are for markdown, not error messages
Hi Toby, sorry, I disagree.
The backticks serve to highlight that this is a code object, and not a plain English word. Without them, a reader can easily be confused into thinking there's some grammatical mistake "in by", or otherwise struggle to parse the message they're given.
Of course, we could choose some other convention (single/double quotes, e.g.), and we should try and pick one and stick to it throughout the codebase... but that's a separate issue.
Personally, these days I am using `arg=` for function arguments to highlight that (1) it's code, with the backticks, and (2) it's a keyword argument, with `=`.
 test(1601.4, merge(DT0, DT0, by="a"),
      warning="Neither of the input data.tables to join have columns.",
-     error="Elements listed in `by`")
+     error="The following columns are missing:\n - From `x`: a\n - From `y`: a")
please remove newlines in error messages
Remove, or add? I find the current output hard to read:
The following columns are missing: - From x: a
I would find this much more readable (possibly indenting the second line):
The following columns are missing:
- From x: a
At a higher level, I wonder if translation would be easier if we instead structured the message like so:
The following columns are missing from x: ...
The following columns are missing from y: ...
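A minimal sketch of that per-table structure, in base R only. The helper name `missing_cols_msg` and the sample column names are assumptions made for illustration; they are not part of data.table or of this PR.

```r
# Hypothetical helper (not part of data.table): one complete sentence per
# input table, built from a single template with placeholders.
missing_cols_msg = function(table_name, cols) {
  sprintf("The following columns are missing from %s: %s", table_name, toString(cols))
}

# Illustrative inputs: the requested join columns and each table's names.
by = c("a", "z")
nm_x = c("a", "b")
nm_y = c("b", "c")

missing_in_x = setdiff(by, nm_x)
missing_in_y = setdiff(by, nm_y)

# `if` without `else` yields NULL, which c() drops, so a table with nothing
# missing contributes no line.
msg_x = if (length(missing_in_x)) missing_cols_msg("x", missing_in_x)
msg_y = if (length(missing_in_y)) missing_cols_msg("y", missing_in_y)
msgs = c(msg_x, msg_y)
cat(msgs, sep = "\n")
```

Each line is a full sentence, so a translator never has to work with a fragment such as " - From x: ".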
R/merge.R
Outdated
if (!all(by.x %chin% nm_x)) {
  missing_in_x <- setdiff(by.x, nm_x)
  stopf("The following columns listed in `by.x` are missing from `x`: %s",
        toString(missing_in_x))
brackify instead of toString?
The brackify function adds brackets around column names in error messages, which may not match the format expected by the test cases. Should I change the expected format in the test cases?
Yes, use brackify. It provides nice formatting and also a simple truncation mechanism in case missing_in_x happens to have tens or dozens of elements.
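To illustrate the kind of formatting and truncation being described, here is a hedged base-R stand-in. `brackify_sketch` and its `cutoff` parameter are hypothetical names; this is not data.table's internal brackify implementation, only a sketch of the behaviour discussed above.

```r
# Hypothetical stand-in, NOT data.table's internal brackify():
# wrap the values in brackets and truncate long vectors with "...".
brackify_sketch = function(x, cutoff = 10L) {
  if (length(x) > cutoff) x = c(x[seq_len(cutoff)], "...")
  sprintf("[%s]", toString(x))
}

short = brackify_sketch(c("a", "z"))
long = brackify_sketch(paste0("V", 1:25), cutoff = 3L)
print(short)
print(long)
```

With many missing columns, the message stays on one readable line instead of listing every name.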
R/merge.R
Outdated
if (length(missing_in_x) > 0 || length(missing_in_y) > 0) {
  stopf("The following columns are missing:\n%s%s",
        if (length(missing_in_x) > 0) sprintf(" - From `x`: %s\n", toString(missing_in_x)) else "",
        if (length(missing_in_y) > 0) sprintf(" - From `y`: %s\n", toString(missing_in_y)) else "")
please refactor to remove repetition
and also please remove line breaks inside function calls: use stopf(some, code)
instead of
stopf(some,
code)
@tdhock I've implemented the changes you suggested. Please review them and let me know if there's anything else that needs to be addressed or improved.
I added documentation which explains that atime failure is normal in forks, https://github.com/Rdatatable/data.table/wiki/Performance-testing#can-not-be-run-from-forks
I believe the problem is not the "relatively expensive" part but rather that the action requires permission to upload artifacts to the Rdatatable/data.table repo.
inst/tests/tests.Rraw
Outdated
if (!"z" %in% colnames(DT1)) {
  stop("Elements listed in `by.x` are missing from x: z")
}
This is a creative solution to the problem of the failing test (would you mind letting us know how you came up with it?), but, unfortunately, not the right one.
The idea here and below is to test the error raised by the following merge() call, not to manually raise the error expected by the previous test code. Instead of calling stop() in the test expression, set the error argument to make sure it matches.
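The difference can be sketched in base R without the data.table test harness. Everything below is an assumption made for illustration: `merge_sketch` stands in for the function under test, `expect_error_matching` stands in for a test helper that takes an expected error pattern, and the message text is only an example.

```r
# Hypothetical stand-in for the function under test; the message is illustrative.
merge_sketch = function(x, y, by) {
  missing_in_x = setdiff(by, names(x))
  if (length(missing_in_x)) {
    msg = sprintf("The following columns listed in `by` are missing from x: [%s]", toString(missing_in_x))
    stop(msg, call. = FALSE)
  }
  merge(x, y, by = by)
}

# Hypothetical checker: evaluate the call, capture the error it raises itself,
# and compare the message with the expected text (fixed-string match here).
expect_error_matching = function(expr, pattern) {
  msg = tryCatch({ expr; NA_character_ }, error = conditionMessage)
  !is.na(msg) && grepl(pattern, msg, fixed = TRUE)
}

DT1 = data.frame(a = 1)
DT2 = data.frame(a = 1)

# The test expression only calls the function; it never calls stop() itself.
raised = expect_error_matching(merge_sketch(DT1, DT2, by = "z"), "missing from x: [z]")
not_raised = expect_error_matching(merge_sketch(DT1, DT2, by = "a"), "missing from x")
print(c(raised, not_raised))
```

If the test expression raised the error on its own, the check would pass even when the function under test never produced that message.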
Hi @aitap,
Thank you for the feedback. I apologize for the confusion. You are absolutely right that the goal is to test the error raised by the merge() function call, not to manually trigger the error with a stop() call in the test expression.
I had initially added the stop() call as a quick fix to handle the missing-column case manually. I intended to replace it but missed it during my initial implementation.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6713   +/-   ##
=======================================
  Coverage   98.62%   98.62%
=======================================
  Files          79       79
  Lines       14642    14652   +10
=======================================
+ Hits        14441    14451   +10
  Misses        201      201
R/merge.R
Outdated
@@ -17,9 +17,9 @@ merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FAL
   if (x0 && y0)
     warningf("Neither of the input data.tables to join have columns.")
   else if (x0)
-    warningf("Input data.table '%s' has no columns.", "x")
+    warningf("Input data.table x has no columns.")
Why this change? This code is structured intentionally so that there are not two nearly-but-not-quite identical messages for translators.
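As a rough base-R illustration of that design: a single template with a placeholder serves both tables, so translators see one string rather than two near-duplicates. `warn_no_columns` is a hypothetical wrapper, and base R's `gettextf()` is used here in place of data.table's own `warningf()` helper.

```r
# Hypothetical wrapper: one translatable template, table name filled in at run time.
warn_no_columns = function(which_table) {
  warning(gettextf("Input data.table '%s' has no columns.", which_table), call. = FALSE)
}

# Capture the warning text for each table to show both come from the same template.
msg_x = tryCatch(warn_no_columns("x"), warning = conditionMessage)
msg_y = tryCatch(warn_no_columns("y"), warning = conditionMessage)
print(c(msg_x, msg_y))
```

Hard-coding the table name into the string would create a second, slightly different message that would need its own translation.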
R/merge.R
Outdated
missing_in_x <- setdiff(by, nm_x)
missing_in_y <- setdiff(by, nm_y)
if (length(missing_in_x) > 0 || length(missing_in_y) > 0) {
  stopf("The following columns are missing:%s%s",
This looks to be missing newlines.
R/merge.R
Outdated
missing_in_y <- setdiff(by, nm_y)
if (length(missing_in_x) > 0 || length(missing_in_y) > 0) {
  stopf("The following columns are missing:%s%s",
        if (length(missing_in_x) > 0) sprintf(" - From x: %s", toString(missing_in_x)) else "",
Use gettextf() to enable translation.
@@ -8564,16 +8564,24 @@ test(1600.2, names(DT1[DT2, .(id1=id1, val=val, bla=sum(z1, na.rm=TRUE)), on="id
 # warn when merge empty data.table #597
 DT0 = data.table(NULL)
 DT1 = data.table(a=1)

# Test 1601.1: Merge DT1 with itself on column 'a'
You shouldn't need comments like this unless the test is really obscure. You definitely don't need to write 'Test 1601.1' when that is very obvious from the next line.
Such comments are very easy to fall out of sync with the actual code as it changes over time. See for example https://swimm.io/learn/code-collaboration/comments-in-code-best-practices-and-mistakes-to-avoid.
If the test case's purpose is not obvious from the written code, often that's a sign that the test is poorly designed -- typically we should strive for the purpose of the test to be immediately apparent, only rarely needing small clarifying comments.
Hi @MichaelChirico and @tdhock,
inst/tests/tests.Rraw
Outdated
if (!"z" %in% colnames(DT1) || !"z" %in% colnames(DT2)) {
  stop("The columns listed in `by` are missing from either x or y: [z]")
}
One last occurrence of a test raising the error and then testing for it instead of letting merge() do that.
Sorry, I missed that one by mistake.
I have now made the change.
@venom1204 would you mind editing the PR title to be more self-contained?
Ideally we can glance now and far into the future at the title of a PR and have a good idea what part of the codebase it's about. See for example https://blog.montrealanalytics.com/4-tips-for-effective-pull-request-naming-f60793998f04. Especially as
@MichaelChirico I changed the title of the PR.
R/merge.R
Outdated
-  check_duplicate_names(x)
-  check_duplicate_names(y)
+check_duplicate_names(x)
+check_duplicate_names(y)

-  nm_x = names(x)
-  nm_y = names(y)
+nm_x = names(x)
+nm_y = names(y)
Indentation is important. While R cares very little about the spaces at the beginning of the lines (or about spaces between the keywords in general), the people reading the code find it easier when the lines are aligned according to the function / if / while blocks they belong to. In data.table we increase the indentation by two spaces every time a new { block begins and decrease it again when it ends. There are other styles out there, but this one is what we're using.
Removing those spaces above makes it look as if the line nm_x = names(x) is outside the enclosing function block, which is misleading. Commits that change the spaces (or, in general, the looks) in the code without changing the substance (i.e. what actually gets run) are also detrimental for a less obvious reason. We store the change history in Git, more than 5000 changes dating back to 2008. This lets us quickly find the sources of problems using the power of bisect. If a change is small and only touches the code, it's easy to understand what's broken. When the change also moves around the text without changing its meaning, the challenge becomes harder.
While some whitespace changes can be good because they make the code more readable, we should definitely not break the indentation.
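As a small illustration of the two-space convention described above, here is a hypothetical function (not code from R/merge.R) in which each new { block adds one indentation level and each closing } removes it:

```r
# Hypothetical example of the indentation style: two spaces per block level.
count_missing = function(by, x) {
  nm_x = names(x)                        # level 1: inside the function block
  if (!all(by %in% nm_x)) {
    missing_in_x = setdiff(by, nm_x)     # level 2: inside the if block
    return(length(missing_in_x))
  }
  0L                                     # back to level 1 after the if block closes
}

n_missing = count_missing(c("a", "z"), list(a = 1))
print(n_missing)
```

Aligned this way, it is visible at a glance that nm_x = names(x) belongs to the function body.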
In this PR, I have made the error message more informative by reporting two things:
Which column(s) are missing.
Which data.table is missing the column(s).
Closes #6556