Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of mis-formatted CSVs #67

Open
pdobacz opened this issue Oct 26, 2022 · 6 comments
Open

Improve handling of mis-formatted CSVs #67

pdobacz opened this issue Oct 26, 2022 · 6 comments

Comments

@pdobacz
Copy link
Collaborator

pdobacz commented Oct 26, 2022

If you put an invalid entry in a row beyond the scope of our type-detection mechanism (>10000 IIRC), it will not behave nicely.

The example comes from taxi-one-day.csv, which has a "\N" cell for a Real typed column. It gives me:

12:56:12.732 › postgres: 2022-10-26 12:56:12.732 CEST [74911] FEHLER:  ungültige Eingabesyntax für Typ real: »\N«
12:56:12.733 › postgres: 2022-10-26 12:56:12.732 CEST [74911] ZUSAMMENHANG:  COPY taxi-one-day, Zeile 15631, Spalte dropoff_longitude: »\N«
12:56:12.733 › postgres: 2022-10-26 12:56:12.732 CEST [74911] ANWEISUNG:  COPY "taxi-one-day" FROM STDIN (DELIMITER ',', FORMAT CSV, HEADER true)

in the console, but nothing useful in the GUI (Import failed and an import stuck halfway).

@pdobacz pdobacz added the bug Something isn't working label Oct 26, 2022
@yoid2000
Copy link
Contributor

A little more info. Using psql in trusted mode yields this:

diffix=> \d
                         List of relations
 Schema |              Name               |   Type   |    Owner
--------+---------------------------------+----------+--------------
 public | Census Big                      | table    | diffix_admin
 public | Census Big_DiffixRowIndex_seq   | sequence | diffix_admin
 public | Taxi                            | table    | diffix_admin
 public | census_small                    | table    | diffix_admin
 public | census_small_DiffixRowIndex_seq | sequence | diffix_admin
(5 rows)

diffix=> select count(*) from "Taxi";
ERROR:  [PG_DIFFIX] Tables without an anonymization label can't be accessed in anonymized mode.

and in admin mode:

diffix=# select count(*) from "Taxi";
 count
-------
     0
(1 row)

@pdobacz
Copy link
Collaborator Author

pdobacz commented Oct 26, 2022

The reason for this might be that the table import didn't reach marking the table personal / public and granting SELECT to diffix_trusted, so it's only visible to admin.

As suggested, we should remove a partially imported table on import failure as part of this issue.

@yoid2000
Copy link
Contributor

yoid2000 commented Oct 26, 2022

By the way, I deleted the corrupted table Taxi from Dashboard, but it did not delete from Metabase.

Ah, I'm wrong. It is basically deleted from Metabase, except it is still listed on the home page under the heading "Pick up where you left off", which apparently is a cache of recent work. Other than that, it appears to be gone.

It did get removed from Postgres!

@pdobacz pdobacz self-assigned this Oct 26, 2022
@pdobacz
Copy link
Collaborator Author

pdobacz commented Oct 26, 2022

OK, rolling back a partial import is easy and implemented now.

I'm scratching my head on how to handle the columns which have misformatted entries beyond the rows looked at by our detection code (I think we've discussed this before for desktop, can't find it now). Options coming to mind:

  1. On any kind of error from PostgreSQL COPY, redo detection without the row limit, keeping fingers crossed it will work this time
  2. Try to interpret the error and provide the user with the exact offending cell (might be difficult, as PostgreSQL provides us with a locale-specific prose, without any details)
  3. Not do COPY at all, but process all the rows ourselves (ugh)
  4. ...I'm short on ideas, anything simple I'm missing?

@cristianberneanu
Copy link
Contributor

Either we run the detection on the entire file from the start, guaranteeing that all fields have valid values in them, or we process all the rows ourselves and convert invalid values to NULLs / return an exact error with the offending row.

@pdobacz
Copy link
Collaborator Author

pdobacz commented Oct 31, 2022

OK, so I'll merge the rollback part without resolving this issue. Rolling back is the low-hanging fruit, as it prevents the app to go into a split-brain state.

Since there are more reasons popping up to consider processing all rows ourselves (like the decimal point problem), let's postpone this issue until we make that decision.

@pdobacz pdobacz removed their assignment Oct 31, 2022
@pdobacz pdobacz removed the bug Something isn't working label Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants