Mysterious failures in the CSV Newspaper toolchain #519

Open
bondjimbond opened this issue Jul 19, 2022 · 8 comments

Comments

@bondjimbond
Collaborator

I'm hitting errors that I can't figure out with some objects I'm processing.

Five of the Newspaper Issues in this set error out: record numbers 34, 37, 71, 72, and 76.

Their metadata is no different from the pages that process without issue:
[Screenshot: Screen Shot 2022-07-19 at 11 17 34 AM]

No unusual characters that aren't present in other records.

And the file names all seem to be correct and appropriate:
[Screenshot: Screen Shot 2022-07-19 at 11 20 56 AM]

For the issues with errors, the entire issue fails to appear.

The log doesn't tell me much. Some of the "problem records" don't seem to actually show errors in the log at all (e.g. records 34 and 37). This bit for record 71 shows problems, but it's hard to understand. I do see some strange things here:

"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB

I checked, and we do have hidden files called ._2002-11-01-003.tif etc. This could be part of the problem, except that such files do not appear in the other directories. The other directories seem to be OK.

Is there anything obvious I'm missing?

[2022-07-19 15:07:50] ErrorException.ERROR: ErrorException {"message":"mkdir(): File exists","code":{"metadata":"<?xml version=\"1.0\"?>\n<mods xmlns=\"http://www.loc.gov/mods/v3\" xmlns:mods=\"http://www.loc.gov/mods/v3\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n  <titleInfo>\n      <title>\n\tLE BASTION DE NANAIMO, 2002\n  </title>\n    \n  </titleInfo>\n  <typeOfResource>text</typeOfResource>\n  <language>\n    <languageTerm type=\"text\">French</languageTerm>\n  </language>\n  <physicalDescription>\n\t\t\t<extent>nov.-d&#xE9;c. ; pp. 12</extent>\n\t\t\t\t</physicalDescription>\n  <location>\n\t\t\t</location>\n  <originInfo>\n    \t<dateIssued keyDate=\"yes\">2002-11-01</dateIssued>\n    \t<publisher>Association des Francophones de Nanaimo</publisher>\n    \t</originInfo>\n  <genre authority=\"marcgt\">newspaper</genre>\n  <subject>\n  <hierarchicalGeographic>\n                        </hierarchicalGeographic>\n</subject>\n  <part>\n  </part>\n</mods>\n\n","pages":["/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-004.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-005.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-006.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB 
Newspapers/shfcb_37/2002-11-01/2002-11-01-007.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-008.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-009.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-010.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-011.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-012.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-013.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-014.tif"],"record_id":"71","no_datastreams_setting_flag":true,"file_name_field":"Directory","record":"[object] (stdClass: {\"key\":\"71\",\"Directory\":\"2002-11-01\",\"IssueTitle\":\"LE BASTION DE NANAIMO, 2002\",\"Type\":\"text\",\"Genre\":\"newspaper\",\"Date_Issued\":\"2002-11-01\",\"Language\":\"French\",\"localIdentifier\":\"\",\"Publisher_Place\":\"\",\"Publisher\":\"Association des Francophones de Nanaimo\",\"PhysicalLocation\":\"\",\"Extent\":\"nov.-déc. ; pp. 
12\",\"note\":\"\",\"rightsstatement\":\"\"})","issue_level_output_dir":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71","issue_level_input_dir":"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01","MODS_expected":false,"metadata_file_path":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71/MODS.xml","page_path":"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-001.tif","pathinfo":{"dirname":"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01","basename":"._2002-11-01-001.tif","extension":"tif","filename":"._2002-11-01-001"},"filename_segments":["._2002","11","01","001"],"page_number":"1","page_level_output_dir":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71/1","OBJ_expected":false,"extension":"tif","page_output_path":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71/1/OBJ.tif","OCR_expected":false,"ocr_input_path":"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-001.txt","ocr_output_path":"/Volumes/BCHDP INGE/Output/shfcb/shfcb_37/71/1/OCR.txt"},"severity":2,"file":"/Users/brandon/sfuvault/mik/src/writers/CsvNewspapers.php","line":134} []
[2022-07-19 15:07:50] ErrorException.ERROR: ErrorException {"message":"problem writing package","record_key":"71","details":"[object] (mik\\exceptions\\MikErrorException(code: 0):  at /Users/brandon/sfuvault/mik/mik:105)"} []
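One possible reading of the entry above, offered as a guess rather than a confirmed diagnosis: the hidden file ._2002-11-01-001.tif produces the same page number ("1") as the real 2002-11-01-001.tif, so both point at the same page-level output directory (.../71/1), and creating it a second time fails with "mkdir(): File exists". The sketch below reproduces that collision. The function name pageNumberFromPath is hypothetical, and the hyphen-splitting logic is inferred from the filename_segments and page_number values in the log, not copied from CsvNewspapers.php.

```php
<?php
// Hypothetical helper: derive a page number from a page file path by splitting
// the filename on hyphens, as the logged "filename_segments" values suggest.
function pageNumberFromPath(string $path): string
{
    $filename = pathinfo($path, PATHINFO_FILENAME);  // e.g. "._2002-11-01-001"
    $segments = explode('-', $filename);             // e.g. ["._2002", "11", "01", "001"]
    $last = end($segments);

    // Strip leading zeros so "001" becomes "1", matching the logged page_number.
    $trimmed = ltrim($last, '0');
    return $trimmed === '' ? '0' : $trimmed;
}

$real   = pageNumberFromPath('2002-11-01/2002-11-01-001.tif');
$hidden = pageNumberFromPath('2002-11-01/._2002-11-01-001.tif');

// Both files resolve to the same page number, and so to the same output directory.
var_dump($real === $hidden); // prints bool(true)
```

If that reading is right, any stray file whose name ends in the same page segment as a real page would trigger the same error.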

Am I missing anything obvious that might account for the problem?

problem_records.log
mik.log
shfcb_37.csv
bchdp_news.txt

@bondjimbond
Collaborator Author

I have traced these failures to several different factors:

  • The existence of a Thumbs.db file, automatically created by Windows, which might be throwing off MIK
  • The existence of those random hidden files (like ._2002-11-03-003.tif)
  • The use of an uppercase file extension (.TIF instead of .tif) also throws a failure

We should of course be getting clean data from digitizers, but that clearly can't be relied upon. Can the toolchain be modified to avoid these things? For example, it shouldn't get tripped up by uppercase vs lowercase file extensions... we could account for that. And the hidden files should be avoidable, because it's only looking for filenames that match the directory name, right?
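As a rough illustration of the "filenames that match the directory name" idea, here is a minimal sketch that keeps only files whose name begins with the name of the directory containing them. The function pagesMatchingDirectory is hypothetical and not part of MIK, and it assumes page files are named like 2002-11-01-001.tif inside a directory named 2002-11-01, as in the log above.

```php
<?php
// Hypothetical filter, not MIK code: keep only files whose name begins with the
// name of the issue directory that contains them, followed by a hyphen.
function pagesMatchingDirectory(array $paths): array
{
    return array_values(array_filter($paths, function ($path) {
        $directoryName = basename(dirname($path));  // e.g. "2002-11-01"
        $prefix = $directoryName . '-';
        // strncmp() returns 0 when the first strlen($prefix) bytes are identical.
        return strncmp(basename($path), $prefix, strlen($prefix)) === 0;
    }));
}

$pages = pagesMatchingDirectory([
    'shfcb_37/2002-11-01/2002-11-01-001.tif',
    'shfcb_37/2002-11-01/._2002-11-01-001.tif',
    'shfcb_37/2002-11-01/Thumbs.db',
    'shfcb_37/2002-11-01/2002-11-01-002.TIF',
]);

// Only the two real page files remain; the "._" file and Thumbs.db are dropped.
print_r($pages);
```

A prefix check like this sidesteps hidden files and Thumbs.db in one rule, but it says nothing about extension case, which would still need its own handling.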

@mjordan
Collaborator

mjordan commented Jul 21, 2022

Thumbs.db is skipped in some toolchains:

mark@user-ThinkPad-X1-Carbon-6th:/tmp/mik$ grep -ri thumbs *
extras/scripts/check_files.php:    // belong (I'm looking at you thumbs.db). Get a list of all files in the
extras/scripts/remove_files.php:    'Thumbs.db',
extras/scripts/shutdownhooks/create_structure_files.php:  $exclude_array = array('..', '.DS_Store', 'Thumbs.db', '.');
extras/scripts/shutdownhooks/create_structure_files.php:  $exclude_array = array('..', '.DS_Store', 'Thumbs.db', '.');
src/filegetters/CsvCompound.php:            if (preg_match('/thumbs\.db/i', $pathinfo['basename'])) {
src/inputvalidators/CsvBooks.php:            'Thumbs.db',
src/inputvalidators/CsvBooks.php:            '.Thumbs.db',
src/inputvalidators/CsvBooks.php:            // Book directory cannot contain Thumbs.db, etc.
src/inputvalidators/CsvCompound.php:            'Thumbs.db',
src/inputvalidators/CsvCompound.php:            '.Thumbs.db',
src/inputvalidators/CsvCompound.php:            // Compound directory cannot contain Thumbs.db, etc. The CsvCompound
src/inputvalidators/CsvCompound.php:            // because of the presence of Thumbs.db, etc.

Maybe try to run the remove_files.php script to get rid of them? I think the case of the extension doesn't matter, but I haven't checked the code to confirm that. And yes, we can modify the toolchain to skip them.
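For anyone who wants to see what would be affected before deleting anything, here is a small dry-run sketch that only lists candidate files. It is not the remove_files.php script mentioned above, and it makes no assumptions about that script's options; findUnwantedFiles is a hypothetical name, and the rule it applies (dot-files plus Thumbs.db in any case) is just one reasonable guess at what should go.

```php
<?php
// Hypothetical dry-run helper (not MIK's remove_files.php): list files under
// $dir that are dot-files or Thumbs.db, without deleting anything.
function findUnwantedFiles(string $dir): array
{
    $found = [];
    // SKIP_DOTS omits only the "." and ".." entries; hidden files are still visited.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue;
        }
        $name = $file->getFilename();
        if ($name[0] === '.' || strcasecmp($name, 'Thumbs.db') === 0) {
            $found[] = $file->getPathname();
        }
    }
    sort($found);
    return $found;
}

// Usage: pass the input directory as the first argument and review the output.
foreach (findUnwantedFiles($argv[1] ?? '.') as $path) {
    echo $path, PHP_EOL;
}
```

Printing the list first makes it easy to confirm that nothing legitimate starts with a dot before removing anything.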

@bondjimbond
Collaborator Author

I've been going through my massive batch of files to process. After I removed all those hidden files, MIK no longer shows any actual errors in the log, but there is still a set that doesn't process and that shows up in the problem_records file.

The only difference between these files and the others is the uppercase .TIF extension.

After changing the extension from uppercase to lowercase, it processes correctly.

@mjordan
Collaborator

mjordan commented Jul 21, 2022

👍 then let's make the file extension case irrelevant!

@bondjimbond
Collaborator Author

Do you know offhand which file needs to be edited for this? I can give it a shot.

@MarcusBarnes
Owner

@bondjimbond Here's where the tiff file extensions for newspapers are defined:

public $allowed_file_extensions_for_OBJ = array('tiff', 'tif');

Probably the best approach is to apply strtolower to the file extension wherever $allowed_file_extensions_for_OBJ is used in a comparison. That would also handle other inconsistently cased extensions such as .Tif, .tiFF, etc.
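A minimal sketch of that comparison, assuming the check can be isolated in one method. The class name NewspaperPageCheck and the method isAllowedObjFile are hypothetical; only the $allowed_file_extensions_for_OBJ property comes from the code quoted above, and the real comparison in MIK may be structured differently.

```php
<?php
// Sketch of a case-insensitive extension check. The class and method names are
// hypothetical; only $allowed_file_extensions_for_OBJ is taken from the thread.
class NewspaperPageCheck
{
    public $allowed_file_extensions_for_OBJ = array('tiff', 'tif');

    public function isAllowedObjFile($path)
    {
        // Lowercase the extension before comparing, so "TIF", "Tif" and "tiFF"
        // all match the lowercase entries in the allowed list.
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        return in_array($extension, $this->allowed_file_extensions_for_OBJ, true);
    }
}

$check = new NewspaperPageCheck();
foreach (array('page-001.tif', 'page-002.TIF', 'page-003.tiFF', 'page-004.jpg') as $name) {
    echo $name, ': ', $check->isAllowedObjFile($name) ? 'accepted' : 'rejected', PHP_EOL;
}
```

Normalizing at comparison time keeps the allowed list short and avoids enumerating every case variant.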

@MarcusBarnes
Owner

The quicker fix is just to add TIFF and TIF to $allowed_file_extensions_for_OBJ.

@bondjimbond
Collaborator Author

@MarcusBarnes Perfect, thanks!
