Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log rejected data table results #296

Merged
merged 4 commits into from
Dec 16, 2021
Merged

Log rejected data table results #296

merged 4 commits into from
Dec 16, 2021

Conversation

jeremyprime
Copy link
Collaborator

Summary

Log rejected data table results

Description

The rejected data table is never persisted in the Vertica (it is a temporary table), so the user cannot see the exact reason for any rejected rows. This change logs the first few rows (10) from the rejected data table. Only the columns that are most helpful in narrowing down the error are printed (file_name, row_number, rejected_data, and rejected_data_reason).

Trying to persist the table was difficult due to our commit logic, but it can be address as part of #293.

This is a sample of what the rejected data table might contain (not persisted):

     node_name     |                                     file_name                                     |         session_id          |  transaction_id   | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length |                    rejected_reason
-------------------+-----------------------------------------------------------------------------------+-----------------------------+-------------------+--------------+--------------+------------+---------------+---------------------------+-------------------------------------------------------
 v_docker_node0001 | webhdfs://hdfs:50070/data/1a4212fb_b2c8_4998_9c1f_09d14b1f9e4a/0-0.snappy.parquet | v_docker_node0001-99:0x57cd | 45035996273707541 |           27 |            0 |          1 | NULL          |                         4 | In column 1: Cannot set NULL value in NOT NULL column

And this is a sample of what will now be printed in the logs if there is at least one rejected row (this example has 1 rejected row):

21/12/14 18:58:13 INFO VerticaDistributedFilesystemWritePipe: Checking number of rejected rows via statement: SELECT COUNT(*) as count FROM "dftest_a3b77254_d039_46cf_98ff_e1cbf81a9c57_COMMITS"
21/12/14 18:58:13 INFO VerticaDistributedFilesystemWritePipe: Verifying rows saved to Vertica is within user tolerance...
21/12/14 18:58:13 INFO VerticaDistributedFilesystemWritePipe: Number of rows_rejected=1. rows_copied=3. failedRowsPercent=0.25. user's failed_rows_percent_tolerance=0.5. passedFaultToleranceTest=true...PASSED.  OK to commit to database.
21/12/14 18:58:13 INFO VerticaDistributedFilesystemWritePipe: Getting rejected rows via statement: SELECT file_name, row_number, rejected_data, rejected_reason FROM "dftest_a3b77254_d039_46cf_98ff_e1cbf81a9c57_COMMITS" LIMIT 10
21/12/14 18:58:13 ERROR VerticaDistributedFilesystemWritePipe: file_name | row_number | rejected_data | rejected_reason
21/12/14 18:58:13 ERROR VerticaDistributedFilesystemWritePipe: webhdfs://hdfs:50070/data/a3b77254_d039_46cf_98ff_e1cbf81a9c57/0-0.snappy.parquet | 1 | NULL | In column 1: Cannot set NULL value in NOT NULL column

Related Issue

#275

Additional Reviewers

@alexr-bq
@alexey-temnikov
@jonathanl-bq
@ravjotbrar

@jeremyprime jeremyprime linked an issue Dec 14, 2021 that may be closed by this pull request
@@ -367,6 +367,21 @@ class VerticaDistributedFilesystemWritePipe(val config: DistributedFilesystemWri
logger.info(s"Dropping Vertica rejects table now: " + dropRejectsTableStatement)
jdbcLayer.execute(dropRejectsTableStatement)
} else {
// Log the first few rejected rows
val rejectsDataQuery = "SELECT file_name, row_number, rejected_data, rejected_reason FROM " + rejectsTable + " LIMIT 10"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are using the approach to log 10, shall we try to print different rejected_reason, so we can cover more variations of failures?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do a GROUP BY on rejected_reason, but then we wouldn't be able to provide the other rows (at least row_number, and possibly file_name and rejected_data as well).

But if we feel the rejected_reason and the count of each distinct reason is more important, we can display that instead. Having the row_number and rejected_data helps narrow where the rejected row came from, but having a distinct list of rejected_reason and the counts provides a better overview of the rejected rows.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

This now provides a good summary of the rejected rows, but also provides example data for each reason (the following example only had a single reason, but the query works and will summarize for up to 10 reasons):

21/12/15 22:39:56 ERROR VerticaDistributedFilesystemWritePipe: Found 8 rejected rows, displaying up to 10 of the most common reasons:
21/12/15 22:39:56 ERROR VerticaDistributedFilesystemWritePipe: count | example_data | rejected_reason
21/12/15 22:39:56 ERROR VerticaDistributedFilesystemWritePipe: 8 | NULL | In column 1: Cannot set NULL value in NOT NULL column

@@ -367,6 +367,21 @@ class VerticaDistributedFilesystemWritePipe(val config: DistributedFilesystemWri
logger.info(s"Dropping Vertica rejects table now: " + dropRejectsTableStatement)
jdbcLayer.execute(dropRejectsTableStatement)
} else {
// Log the first few rejected rows
val rejectsDataQuery = "SELECT file_name, row_number, rejected_data, rejected_reason FROM " + rejectsTable + " LIMIT 10"
logger.info(s"Getting rejected rows via statement: " + rejectsDataQuery)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we mention in the logs as well that we are printing only up to 10 rejected rows, so people won't think it is a complete list when reading log files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, see below.

@jeremyprime jeremyprime merged commit 1dadb87 into main Dec 16, 2021
@jeremyprime jeremyprime deleted the log-rejected-rows branch December 16, 2021 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Look into how rejected rows are handled in v2 connector
3 participants