Improve handling of mis-formatted CSVs #67

pdobacz · 2022-10-26T11:02:30Z

If you put an invalid entry in a row beyond the scope of our type-detection mechanism (>10000 IIRC), it will not behave nicely.

The example comes from taxi-one-day.csv, which has a "\N" cell for a Real typed column. It gives me:

12:56:12.732 › postgres: 2022-10-26 12:56:12.732 CEST [74911] FEHLER:  ungültige Eingabesyntax für Typ real: »\N«
12:56:12.733 › postgres: 2022-10-26 12:56:12.732 CEST [74911] ZUSAMMENHANG:  COPY taxi-one-day, Zeile 15631, Spalte dropoff_longitude: »\N«
12:56:12.733 › postgres: 2022-10-26 12:56:12.732 CEST [74911] ANWEISUNG:  COPY "taxi-one-day" FROM STDIN (DELIMITER ',', FORMAT CSV, HEADER true)

in the console, but nothing useful in the GUI (Import failed and an import stuck halfway).

The text was updated successfully, but these errors were encountered:

yoid2000 · 2022-10-26T11:05:27Z

A little more info. Using psql in trusted mode yields this:

diffix=> \d
                         List of relations
 Schema |              Name               |   Type   |    Owner
--------+---------------------------------+----------+--------------
 public | Census Big                      | table    | diffix_admin
 public | Census Big_DiffixRowIndex_seq   | sequence | diffix_admin
 public | Taxi                            | table    | diffix_admin
 public | census_small                    | table    | diffix_admin
 public | census_small_DiffixRowIndex_seq | sequence | diffix_admin
(5 rows)

diffix=> select count(*) from "Taxi";
ERROR:  [PG_DIFFIX] Tables without an anonymization label can't be accessed in anonymized mode.

and in admin mode:

diffix=# select count(*) from "Taxi";
 count
-------
     0
(1 row)

pdobacz · 2022-10-26T11:09:24Z

The reason for this might be that the table import didn't reach marking the table personal / public and granting SELECT to diffix_trusted, so it's only visible to admin.

As suggested, we should remove a partially imported table on import failure as part of this issue.

yoid2000 · 2022-10-26T11:41:11Z

By the way, I deleted the corrupted table Taxi from Dashboard, but it did not delete from Metabase.

Ah, I'm wrong. It is basically deleted from Metabase, except it is still listed on the home page under the heading "Pick up where you left off", which apparently is a cache of recent work. Other than that, it appears to be gone.

It did get removed from Postgres!

pdobacz · 2022-10-26T14:43:43Z

OK, rolling back a partial import is easy and implemented now.

I'm scratching my head on how to handle the columns which have misformatted entries beyond the rows looked at by our detection code (I think we've discussed this before for desktop, can't find it now). Options coming to mind:

On any kind of error from PostgreSQL COPY, redo detection without the row limit, keeping fingers crossed it will work this time
Try to interpret the error and provide the user with the exact offending cell (might be difficult, as PostgreSQL provides us with a locale-specific prose, without any details)
Not do COPY at all, but process all the rows ourselves (ugh)
...I'm short on ideas, anything simple I'm missing?

cristianberneanu · 2022-10-27T12:58:21Z

Either we run the detection on the entire file from the start, guaranteeing that all fields have valid values in them, or we process all the rows ourselves and convert invalid values to NULLs / return an exact error with the offending row.

pdobacz · 2022-10-31T10:52:24Z

OK, so I'll merge the rollback part without resolving this issue. Rolling back is the low-hanging fruit, as it prevents the app to go into a split-brain state.

Since there are more reasons popping up to consider processing all rows ourselves (like the decimal point problem), let's postpone this issue until we make that decision.

pdobacz added the bug Something isn't working label Oct 26, 2022

pdobacz self-assigned this Oct 26, 2022

pdobacz mentioned this issue Oct 27, 2022

Rollback failed import #81

Merged

pdobacz removed their assignment Oct 31, 2022

pdobacz removed the bug Something isn't working label Oct 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of mis-formatted CSVs #67

Improve handling of mis-formatted CSVs #67

pdobacz commented Oct 26, 2022

yoid2000 commented Oct 26, 2022

pdobacz commented Oct 26, 2022 •

edited

Loading

yoid2000 commented Oct 26, 2022 •

edited

Loading

pdobacz commented Oct 26, 2022

cristianberneanu commented Oct 27, 2022

pdobacz commented Oct 31, 2022

Improve handling of mis-formatted CSVs #67

Improve handling of mis-formatted CSVs #67

Comments

pdobacz commented Oct 26, 2022

yoid2000 commented Oct 26, 2022

pdobacz commented Oct 26, 2022 • edited Loading

yoid2000 commented Oct 26, 2022 • edited Loading

pdobacz commented Oct 26, 2022

cristianberneanu commented Oct 27, 2022

pdobacz commented Oct 31, 2022

pdobacz commented Oct 26, 2022 •

edited

Loading

yoid2000 commented Oct 26, 2022 •

edited

Loading