Look into how rejected rows are handled in v2 connector #275

Closed
ravjotbrar opened this issue Nov 18, 2021 · 4 comments · Fixed by #296

Comments

@ravjotbrar
Collaborator

ravjotbrar commented Nov 18, 2021

Modify an example so that it produces rejected rows, and check whether those rows end up in the rejects table. We expect rows with issues to be in this table.

@jeremyprime
Collaborator

Note that rows are only rejected during the copying/committing of data from HDFS to Vertica. This means that the data must be valid in HDFS, but for whatever reason it does not satisfy the constraints in Vertica.

A good way to test this is to create the schema ahead of time and use the Append save mode. This way the schema in Vertica and the schema in your DF can be different, causing rejected rows. There are examples of this in the functional-tests (see tests where FaultToleranceTestFail is true under EndToEndTests.scala).

@jeremyprime
Collaborator

Also note that when the schema in the DF and the schema in Vertica differ, the error is the same as in ticket #284: the log indicates that 0 rows were copied and also that 0 rows were rejected (no data was copied to Vertica, neither to the target table nor to the rejects table).

@jeremyprime
Collaborator

jeremyprime commented Dec 7, 2021

When testing a DF containing nullable values against a Vertica table whose column is NOT NULL, the error is slightly different. The logs do report the number of copied rows and rejected rows:

21/12/07 22:28:58 WARN SchemaTools: S2V: Column i is NOT NULL in target table "dftest" but it's nullable in the DataFrame. Rows with NULL values in column i will be rejected.
21/12/07 22:28:58 INFO SchemaTools: Load by name. Column list: ("i")
21/12/07 22:28:58 INFO VerticaDistributedFilesystemWritePipe: The copy statement is:
COPY "dftest" ("i") FROM 'webhdfs://hdfs:50070/data/02ac98f0_f957_43a4_8a25_3f975eca117e/*.parquet' ON ANY NODE parquet REJECTED DATA AS TABLE "dftest_02ac98f0_f957_43a4_8a25_3f975eca117e_COMMITS" NO COMMIT
21/12/07 22:28:58 INFO VerticaDistributedFilesystemWritePipe: Performing copy from file store to Vertica
21/12/07 22:28:58 INFO VerticaDistributedFilesystemWritePipe: Checking number of rejected rows via statement: SELECT COUNT(*) as count FROM "dftest_02ac98f0_f957_43a4_8a25_3f975eca117e_COMMITS"
21/12/07 22:28:58 INFO VerticaDistributedFilesystemWritePipe: Verifying rows saved to Vertica is within user tolerance...
21/12/07 22:28:58 INFO VerticaDistributedFilesystemWritePipe: Number of rows_rejected=1. rows_copied=2. failedRowsPercent=0.3333333333333333. user's failed_rows_percent_tolerance=0.5. passedFaultToleranceTest=true...PASSED.  OK to commit to database.

With a sufficiently high failed_rows_percent_tolerance, the good values are written to Vertica. However, as expected, the rejected rows are not committed to Vertica (REJECTED DATA AS TABLE "<table_name>_<id>_COMMITS" NO COMMIT); only the rejects count and status are written to the logs and the status table.
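
To make the arithmetic in the last log line concrete, here is a minimal Python sketch of the tolerance check. It is not the connector's code: the function names, the sample data, the use of `<=` for the comparison, and the handling of an empty load are all assumptions. Only the numbers (1 rejected, 2 copied, tolerance 0.5) come from the log above.

```python
from typing import List, Optional, Tuple


def split_not_null(values: List[Optional[int]]) -> Tuple[List[int], int]:
    """Model a NOT NULL target column: NULL values are rejected, the rest copied.

    Hypothetical helper; it only imitates the outcome reported in the log.
    """
    copied = [v for v in values if v is not None]
    rejected_count = len(values) - len(copied)
    return copied, rejected_count


def check_fault_tolerance(
    rows_copied: int, rows_rejected: int, tolerance: float
) -> Tuple[float, bool]:
    """Return (failed fraction, passed) for a load.

    Assumption: the failed fraction is rejected / (copied + rejected), which
    matches 1 / (2 + 1) = 0.333... in the log. Whether the connector compares
    with <= or <, and what it does for an empty load, is not shown in the log.
    """
    total = rows_copied + rows_rejected
    failed_fraction = rows_rejected / total if total > 0 else 0.0
    return failed_fraction, failed_fraction <= tolerance


# Hypothetical DataFrame column "i" with one NULL value.
copied, rejected = split_not_null([1, None, 3])
failed_percent, ok = check_fault_tolerance(len(copied), rejected, tolerance=0.5)
print(f"rows_rejected={rejected} rows_copied={len(copied)} "
      f"failedRowsPercent={failed_percent} passed={ok}")
```

With a lower tolerance, for example 0.1, the same load would fail the check and nothing would be committed.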

If we want to save the rejected data we will need to change the current behaviour of the connector. For example, add a persist_rejected_data option (and rejected_data_table, otherwise default to <table>_<job_name>_rejected_data?), where the rejected data will be saved to a table if it is enabled and there was at least 1 rejected row.
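
A rough Python sketch of how the proposed option could be resolved. Nothing here exists in the connector: `rejected_data_target` is a hypothetical helper, and the option names and default table name are taken directly from the proposal above, so they may change.

```python
from typing import Mapping, Optional


def rejected_data_target(
    options: Mapping[str, str],
    table: str,
    job_name: str,
    rows_rejected: int,
) -> Optional[str]:
    """Return the table to persist rejected data into, or None to skip.

    Hypothetical logic for the proposal: persist only when
    persist_rejected_data is enabled and at least one row was rejected.
    """
    enabled = str(options.get("persist_rejected_data", "false")).lower() == "true"
    if not enabled or rows_rejected < 1:
        return None
    # Fall back to the suggested default name when no table is given.
    return options.get("rejected_data_table") or f"{table}_{job_name}_rejected_data"


# Usage with made-up option values and job name.
opts = {"persist_rejected_data": "true"}
print(rejected_data_target(opts, "dftest", "job1", rows_rejected=1))
```

Returning None when the option is disabled or no rows were rejected keeps the current behaviour as the default, so existing jobs would not start creating extra tables.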

For more information on how Vertica handles rejected rows, see here.

@jeremyprime
Collaborator

jeremyprime commented Dec 8, 2021

This ticket only deals with improving the logging around the rejected data table. See #293 to persist the actual rejected data table.

@jeremyprime jeremyprime linked a pull request Dec 8, 2021 that will close this issue
@jeremyprime jeremyprime linked a pull request Dec 14, 2021 that will close this issue
jeremyprime added a commit that referenced this issue Dec 16, 2021
* Print out the first few rejected data rows (#275)

* Generated log messages without hard-coding indexes (#275)

* Add log on number of rows printed (#275)

* Update logging to provide an aggregate summary of rejected data (#275)