Log rejected data table results #296
@@ -367,6 +367,21 @@ class VerticaDistributedFilesystemWritePipe(val config: DistributedFilesystemWri
logger.info(s"Dropping Vertica rejects table now: " + dropRejectsTableStatement)
jdbcLayer.execute(dropRejectsTableStatement)
} else {
// Log the first few rejected rows
val rejectsDataQuery = "SELECT file_name, row_number, rejected_data, rejected_reason FROM " + rejectsTable + " LIMIT 10"
If we only log 10 rows, should we try to print rows with different rejected_reason values, so that we cover more kinds of failures?
We could do a GROUP BY on rejected_reason, but then we wouldn't be able to provide the other columns (at least row_number, and possibly file_name and rejected_data as well).
But if we feel the rejected_reason and the count of each distinct reason is more important, we can display that instead. Having the row_number and rejected_data helps narrow where the rejected row came from, but having a distinct list of rejected_reason and the counts provides a better overview of the rejected rows.
Done.
This now provides a good summary of the rejected rows, but also provides example data for each reason (the following example only had a single reason, but the query works and will summarize for up to 10 reasons):
21/12/15 22:39:56 ERROR VerticaDistributedFilesystemWritePipe: Found 8 rejected rows, displaying up to 10 of the most common reasons:
21/12/15 22:39:56 ERROR VerticaDistributedFilesystemWritePipe: count | example_data | rejected_reason
21/12/15 22:39:56 ERROR VerticaDistributedFilesystemWritePipe: 8 | NULL | In column 1: Cannot set NULL value in NOT NULL column
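For readers following the thread, here is a rough sketch of how a grouped summary like the one logged above could be produced. The query actually used in the PR is not shown in this conversation, so everything here is an assumption: the object name `RejectsSummary`, the alias `reason_count`, and the use of `MIN(rejected_data)` to pick one example value per reason (which may not suit the column's actual type) are all hypothetical.

```scala
// Hypothetical sketch; names and query shape are assumptions, not the PR's code.
object RejectsSummary {
  // Maximum number of distinct reasons to report.
  val MaxReasons = 10

  // Build a query that counts rejected rows per reason and keeps one
  // example value of rejected_data for each reason, most common first.
  def summaryQuery(rejectsTable: String, limit: Int = MaxReasons): String =
    "SELECT COUNT(*) AS reason_count, MIN(rejected_data) AS example_data, rejected_reason " +
      "FROM " + rejectsTable + " " +
      "GROUP BY rejected_reason " +
      "ORDER BY reason_count DESC " +
      "LIMIT " + limit

  // One summarized row: how many rejects share a reason, plus an example.
  final case class ReasonSummary(count: Long, exampleData: String, reason: String)

  // Format the summary as log lines, stating the limit so the output
  // is not mistaken for a complete list.
  def logLines(totalRejected: Long, rows: Seq[ReasonSummary], limit: Int = MaxReasons): Seq[String] = {
    val header = s"Found $totalRejected rejected rows, displaying up to $limit of the most common reasons:"
    val columns = "count | example_data | rejected_reason"
    val body = rows.map(r => s"${r.count} | ${r.exampleData} | ${r.reason}")
    header +: columns +: body
  }
}
```

Each returned line would then be passed to the logger. Keeping the formatting separate from the JDBC call makes the output easy to check without a database.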
@@ -367,6 +367,21 @@ class VerticaDistributedFilesystemWritePipe(val config: DistributedFilesystemWri
logger.info(s"Dropping Vertica rejects table now: " + dropRejectsTableStatement)
jdbcLayer.execute(dropRejectsTableStatement)
} else {
// Log the first few rejected rows
val rejectsDataQuery = "SELECT file_name, row_number, rejected_data, rejected_reason FROM " + rejectsTable + " LIMIT 10"
logger.info(s"Getting rejected rows via statement: " + rejectsDataQuery)
I suggest we mention in the logs as well that we are printing only up to 10 rejected rows, so people won't think it is a complete list when reading log files.
Done, see below.
Summary
Log rejected data table results
Description
The rejected data table is never persisted in Vertica (it is a temporary table), so the user cannot see the exact reason for any rejected rows. This change logs the first few rows (up to 10) from the rejected data table. Only the columns that are most helpful in narrowing down the error are printed (file_name, row_number, rejected_data, and rejected_reason).
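As a minimal sketch of the row-level logging described above, the snippet below builds the same `SELECT ... LIMIT` query shown in the diff and formats the returned rows with a notice that the list may be truncated. The `RejectedRow` case class, the `RejectedRowLog` object, and the exact wording of the notice are hypothetical and only illustrate the idea. Fetching the rows over JDBC is left out.

```scala
// Hypothetical sketch; the types and message wording are assumptions.
final case class RejectedRow(fileName: String, rowNumber: Long, rejectedData: String, rejectedReason: String)

object RejectedRowLog {
  // Same query shape as in the diff, with the limit made a parameter.
  def firstRowsQuery(rejectsTable: String, limit: Int): String =
    "SELECT file_name, row_number, rejected_data, rejected_reason FROM " + rejectsTable + " LIMIT " + limit

  // Format at most `limit` rows, preceded by a notice and a column header.
  def format(rows: Seq[RejectedRow], limit: Int): Seq[String] = {
    val notice = s"Displaying up to $limit rejected rows (this may not be the complete list):"
    val header = "file_name | row_number | rejected_data | rejected_reason"
    val body = rows.take(limit).map { r =>
      s"${r.fileName} | ${r.rowNumber} | ${r.rejectedData} | ${r.rejectedReason}"
    }
    notice +: header +: body
  }
}
```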
Persisting the table was difficult due to our commit logic, but it can be addressed as part of #293.
This is a sample of what the rejected data table might contain (not persisted):
And this is a sample of what will now be printed in the logs if there is at least one rejected row (this example has 1 rejected row):
Related Issue
#275
Additional Reviewers
@alexr-bq
@alexey-temnikov
@jonathanl-bq
@ravjotbrar