feat: improve merge.data.table error messages for missing keys (#6556) #6713

venom1204 · 2025-01-08T14:30:34Z

In this PR I have made the error message more informative by reporting two things:

Which column(s) are missing.
Which data.table is missing the column(s).
Closes #6556

venom1204 · 2025-01-08T15:33:30Z

Hi @aitap,
I hope you’re doing well! I’m currently facing some challenges with the atime performance test, and I could use your guidance.
Sorry for the disturbance, but if you have a moment, could you please help me figure out how to resolve this? I’d really appreciate your insights.
Thank you so much for your support!

aitap · 2025-01-08T16:08:08Z

It looks like the atime tests only work with branches created inside the Rdatatable/data.table repository, not with branches from outside forks. Unless you are a member of a team with write access to create new branches, there may be nothing you can fix.

Apologies for the poke, @Anirban166, could you help with this? If this is not by design (e.g. performance tests are relatively expensive and therefore must be run only from local branches), could the action create a local branch from refs/pull/<ID>/head instead of relying on ${GITHUB_HEAD_REF} (which seems to contain the branch name from the remote repository)?

venom1204 · 2025-01-09T22:09:07Z

Hi @MichaelChirico,
Apologies for the interruption, but could you kindly take a look at the issue with the atime performance tests? I’m running into an error, and it seems to be related to how the tests are triggered for pull requests from external forks.
If you could provide any guidance on how to resolve this, it would be greatly appreciated. Your help would mean a lot as I navigate this issue!

MichaelChirico · 2025-01-10T02:17:42Z

You can ignore the atime issue

venom1204 · 2025-01-10T08:45:10Z

@MichaelChirico thanks for the clarification.
Could you please review the changes in the PR?

tdhock · 2025-01-13T02:21:17Z

inst/tests/tests.Rraw

-     error = 'must be valid column names in x and y')
+test(1962.021, {
+  if (!"z" %in% colnames(DT1) || !"z" %in% colnames(DT2)) {
+    stop("The columns listed in `by` are missing from either x or y: z")


please remove backticks, which are for markdown, not error messages

Hi Toby, sorry, I disagree.

The backticks serve to highlight that this is a code object, and not a plain English word. Without them, a reader can easily be confused into thinking there's some grammatical mistake "in by", or otherwise struggle to parse the message they're given.

Of course, we could choose some other convention (single/double quotes, e.g.), and we should try and pick one and stick to it throughout the codebase... but that's a separate issue.

Personally, these days I am using `arg=` for function arguments to highlight that (1) it's code with the backticks and (2) it's a keyword argument with =.

tdhock · 2025-01-13T02:21:50Z

inst/tests/tests.Rraw

 test(1601.4, merge(DT0, DT0, by="a"),
     warning="Neither of the input data.tables to join have columns.",
-     error="Elements listed in `by`")
+     error="The following columns are missing:\n - From `x`: a\n - From `y`: a")


please remove newlines in error messages

Remove, or add? I find the current output hard to read:

The following columns are missing: - From x: a

I would find this much more readable (possibly indenting the second line):

The following columns are missing:
  - From x: a

At a higher level, I wonder if translation would be easier if we instead structured the message like so:

The following columns are missing from x: ...
The following columns are missing from y: ...

tdhock · 2025-01-13T02:23:00Z

R/merge.R

+    if (!all(by.x %chin% nm_x)) {
+      missing_in_x <- setdiff(by.x, nm_x)
+      stopf("The following columns listed in `by.x` are missing from `x`: %s",
+            toString(missing_in_x))


brackify instead of toString?

The brackify function adds brackets around column names in error messages, which may not match the format expected by the test cases. Should I change the expected format in the test cases?

Yes, use brackify. It provides nice formatting and also some simple truncation mechanism in case missing_in_x happens to have 10s or dozens of elements.
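To make the difference from toString() concrete, here is a minimal sketch of a bracketing helper with truncation. The name `brackify_sketch` and its `cutoff` default are assumptions for illustration; data.table's internal brackify() may differ in its details.

```r
# Hypothetical stand-in for illustration only; data.table's internal
# brackify() may use a different cutoff or quoting behaviour.
brackify_sketch <- function(x, cutoff = 10L) {
  # Keep the first `cutoff` names and mark the truncation with "..."
  if (length(x) > cutoff) x <- c(x[seq_len(cutoff)], "...")
  sprintf("[%s]", toString(x))
}

by.x <- c("a", "z")
nm_x <- c("a", "b")
missing_in_x <- setdiff(by.x, nm_x)

# Build the message text with the bracketed column list
msg <- sprintf(
  "The following columns listed in `by.x` are missing from `x`: %s",
  brackify_sketch(missing_in_x)
)
```

With a long vector of missing names, the helper keeps the message short instead of listing every element.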

tdhock · 2025-01-13T02:27:35Z

R/merge.R

+    if (length(missing_in_x) > 0 || length(missing_in_y) > 0) {
+      stopf("The following columns are missing:\n%s%s",
+            if (length(missing_in_x) > 0) sprintf(" - From `x`: %s\n", toString(missing_in_x)) else "",
+            if (length(missing_in_y) > 0) sprintf(" - From `y`: %s\n", toString(missing_in_y)) else "")


please refactor to remove repetition

and also please remove line breaks inside function calls: use stopf(some, code) instead of

stopf(
  some,
  code
)

@tdhock I've implemented the changes you suggested. Please review them and let me know if there's anything else that needs to be addressed or improved.
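One way the repeated `if (length(...) > 0) sprintf(...) else ""` branches could be collapsed is to collect the missing columns per table and format them with a single template. This is only a sketch: `missing_msg` is a hypothetical helper, not the code that ended up in R/merge.R, and it uses base sprintf()/paste() rather than data.table's stopf().

```r
# Hypothetical helper; a sketch of removing the duplicated per-table branch.
missing_msg <- function(by, nm_x, nm_y) {
  # One entry per input table, holding the `by` columns it lacks
  missing <- list(x = setdiff(by, nm_x), y = setdiff(by, nm_y))
  missing <- missing[lengths(missing) > 0L]
  if (length(missing) == 0L) return(NULL)
  # A single template formats every table that has missing columns
  lines <- sprintf(
    " - From %s: %s",
    names(missing),
    vapply(missing, toString, character(1))
  )
  paste(c("The following columns are missing:", lines), collapse = "\n")
}

msg_x_only <- missing_msg("a", character(), "a")
msg_both <- missing_msg("a", character(), character())
```

The caller would raise the error only when the result is not NULL.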

tdhock · 2025-01-13T02:34:56Z

I added documentation which explains that atime failure is normal in forks, https://github.com/Rdatatable/data.table/wiki/Performance-testing#can-not-be-run-from-forks

If this is not by design (e.g. performance tests are relatively expensive and therefore must be run only from local branches),

I believe the problem is not the "relatively expensive part" but rather that the action requires permission to upload artifacts to rdatatable/data.table repo.

aitap · 2025-01-13T11:22:25Z

inst/tests/tests.Rraw

+  if (!"z" %in% colnames(DT1)) {
+    stop("Elements listed in `by.x` are missing from x: z")
+  }


This is a creative solution to the problem of the failing test (would you mind letting us know how you came up with it?), but, unfortunately, not the right one.

The idea here and below is to test the error raised by the following merge() call, not to manually raise the error expected by the previous test code. Instead of calling stop() in the test expression, set the error argument to make sure it matches.

Hi @aitap,
Thank you for the feedback. I apologize for the confusion. You are absolutely right that the goal is to test the error raised by the merge() function call, not to manually trigger the error with a stop() call in the test expression.
I had initially added the stop() call as a quick fix to handle the missing column case manually. I meant to change this but missed it during my initial implementation.
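The point of the review comment can be sketched in base R: the test expression should call the function under test and let it raise the error, and the check then matches the captured message. Here `merge_sketch` is a hypothetical stand-in for the function under test, and tryCatch() plays the role that the `error=` argument plays in the test() calls shown in the diff hunks.

```r
# Hypothetical stand-in for the function under test.
merge_sketch <- function(x, y, by) {
  missing_in_x <- setdiff(by, names(x))
  if (length(missing_in_x) > 0L) {
    stop("Columns listed in `by` are missing from x: ", toString(missing_in_x))
  }
  merge(x, y, by = by)
}

x <- data.frame(a = 1)
y <- data.frame(a = 1, z = 2)

# The call itself raises the error; the test only captures and inspects it.
err <- tryCatch(
  merge_sketch(x, y, by = "z"),
  error = function(e) conditionMessage(e)
)
matched <- grepl("missing from x: z", err, fixed = TRUE)
```

A test that calls stop() itself would pass even if the function under test never raised the error, which is why it checks nothing useful.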

codecov · 2025-01-14T00:35:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.62%. Comparing base (6641ca0) to head (912d0cd).
Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6713   +/-   ##
=======================================
  Coverage   98.62%   98.62%           
=======================================
  Files          79       79           
  Lines       14642    14652   +10     
=======================================
+ Hits        14441    14451   +10     
  Misses        201      201


MichaelChirico · 2025-01-16T20:41:42Z

R/merge.R

@@ -17,9 +17,9 @@ merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FAL
    if (x0 && y0)
      warningf("Neither of the input data.tables to join have columns.")
    else if (x0)
-      warningf("Input data.table '%s' has no columns.", "x")
+      warningf("Input data.table x has no columns.")


Why this change? This code is structured intentionally so that there are not two nearly-but-not-quite identical messages for translators.
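The design being defended can be sketched as follows: one parameterised template serves both the x and y cases, so translators see a single string. This sketch assumes base R's gettextf() as a stand-in for data.table's own warningf() wrapper, and `no_columns_msg` is a hypothetical helper name.

```r
# Hypothetical helper: one translatable template shared by both inputs.
no_columns_msg <- function(which) {
  gettextf("Input data.table '%s' has no columns.", which)
}

msg_x <- no_columns_msg("x")
msg_y <- no_columns_msg("y")
```

Hard-coding "x" and "y" into two separate strings would instead produce two nearly identical entries in the translation catalogue.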

MichaelChirico · 2025-01-16T20:47:19Z

R/merge.R

+    missing_in_x <- setdiff(by, nm_x)
+    missing_in_y <- setdiff(by, nm_y)
+    if (length(missing_in_x) > 0 || length(missing_in_y) > 0) {
+      stopf("The following columns are missing:%s%s",


This looks to be missing newlines.

MichaelChirico · 2025-01-16T20:47:34Z

R/merge.R

+    missing_in_y <- setdiff(by, nm_y)
+    if (length(missing_in_x) > 0 || length(missing_in_y) > 0) {
+      stopf("The following columns are missing:%s%s",
+            if (length(missing_in_x) > 0) sprintf(" - From x: %s", toString(missing_in_x)) else "",


Use gettextf() to enable translation

MichaelChirico · 2025-01-16T20:52:40Z

inst/tests/tests.Rraw

@@ -8564,16 +8564,24 @@ test(1600.2, names(DT1[DT2, .(id1=id1, val=val, bla=sum(z1, na.rm=TRUE)), on="id
 # warn when merge empty data.table #597
 DT0 = data.table(NULL)
 DT1 = data.table(a=1)
+
+# Test 1601.1: Merge DT1 with itself on column 'a'


You shouldn't need comments like this unless the test is really obscure. You definitely don't need to write 'Test 1601.1' when that is very obvious from the next line.

Such comments are very easy to fall out of sync with the actual code as it changes over time. See for example https://swimm.io/learn/code-collaboration/comments-in-code-best-practices-and-mistakes-to-avoid.

If the test case's purpose is not obvious from the written code, often that's a sign that the test is poorly designed -- typically we should strive for the purpose of the test to be immediately apparent, only rarely needing small clarifying comments.

venom1204 · 2025-01-17T07:52:24Z

Hi @MichaelChirico and @tdhock,
I have implemented all the changes you suggested. Could you please review the updates and let me know if there's anything else I can improve?
Thank you!

aitap · 2025-01-17T09:48:28Z

inst/tests/tests.Rraw

+  if (!"z" %in% colnames(DT1) || !"z" %in% colnames(DT2)) {
+    stop("The columns listed in `by` are missing from either x or y: [z]")
+  }


One last occurrence of a test raising the error and then testing for it instead of letting merge() do that.

Sorry, I missed that one by mistake.
I have fixed it now.

MichaelChirico · 2025-01-17T21:06:14Z

@venom1204 would you mind editing the PR title to be more self-contained?

made the error more informative for #6556

Ideally we can glance now and far into the future at the title of a PR and have a good idea what part of the codebase it's about. See for example https://blog.montrealanalytics.com/4-tips-for-effective-pull-request-naming-f60793998f04.

Especially as #6556 does not get auto-linked to the issue, it is not very useful to include in the title (a bit unfortunately, I do wish GitHub would support links there).

venom1204 · 2025-01-18T06:17:32Z

@MichaelChirico I have changed the title of the PR.

aitap · 2025-01-18T20:50:25Z

R/merge.R

-  check_duplicate_names(x)
-  check_duplicate_names(y)
+check_duplicate_names(x)
+check_duplicate_names(y)

-  nm_x = names(x)
-  nm_y = names(y)
+nm_x = names(x)
+nm_y = names(y)


Indentation is important. While R cares very little about the spaces at the beginning of the lines (or about spaces between the keywords in general), the people reading the code find it easier when the lines are aligned according to the function / if / while blocks they belong to. In data.table we increase the indentation by two spaces every time a new { block begins and decrease it again when it ends. There are other styles out there, but this one is what we're using.

Removing those spaces above makes it look as if the line nm_x = names(x) is outside the enclosing function block, which is misleading. Commits that change the spaces (or, in general, the looks) in the code without changing the substance (i.e. what actually gets run) are also detrimental for a less obvious reason. We store the change history in Git, more than 5000 changes dating back to 2008. This lets us quickly find the sources of problems using the power of bisect. If a change is small and only touches the code, it's easy to understand what's broken. When the change also moves around the text without changing its meaning, the challenge becomes harder.

While some whitespace changes can be good because they make the code more readable, we should definitely not break the indentation.

fixed the error

da77d10

venom1204 requested a review from MichaelChirico as a code owner January 8, 2025 14:30

venom1204 marked this pull request as draft January 8, 2025 14:32

corrected code

ac9d594

venom1204 marked this pull request as ready for review January 10, 2025 08:45

tdhock reviewed Jan 13, 2025

tdhock requested changes Jan 13, 2025

aitap reviewed Jan 13, 2025

introduced teh changes

214f93a

venom1204 and others added 2 commits January 14, 2025 06:07

Merge branch 'master' into issue6556

785a2af

Merge branch 'master' into issue6556

2a1c392

MichaelChirico reviewed Jan 16, 2025

MichaelChirico reviewed Jan 16, 2025

venom1204 added 3 commits January 17, 2025 12:31

introduced teh latest changes

771fbc0

lint-r corrected

ad66677

lintr

ed5bdb1

aitap reviewed Jan 17, 2025

corrected test case

186cbd5

venom1204 changed the title ~~made the error more informative for #6556~~ feat: improve merge.data.table error messages for missing keys (#6556) Jan 18, 2025

venom1204 closed this Jan 18, 2025

venom1204 deleted the issue6556 branch January 18, 2025 06:12

venom1204 restored the issue6556 branch January 18, 2025 06:13

venom1204 reopened this Jan 18, 2025

Merge branch 'master' into issue6556

e849fe6

aitap reviewed Jan 18, 2025

indentation correction

912d0cd
