Mysterious failures in the CSV Newspaper toolchain #519
Comments
I have traced these failures to several different factors:
We should of course be getting clean data from digitizers, but that clearly can't be relied upon. Can the toolchain be modified to avoid these things? For example, it shouldn't get tripped up by uppercase vs lowercase file extensions... we could account for that. And the hidden files should be avoidable, because it's only looking for filenames that match the directory name, right?
Thumbs.db is skipped in some toolchains:
Maybe try to run the remove_files.php script to get rid of them? I think the case of the extension doesn't matter, but I haven't checked the code to confirm that. And yes, we can modify the toolchain to skip them.
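As a rough illustration of what "skipping them" could look like, here is a minimal sketch that filters hidden and OS-generated files out of a list of paths before they are treated as page files. The function name `filter_hidden_files` and the skip list are assumptions made for this example; this is not MIK's actual code.

```php
<?php
// Hypothetical helper, not part of MIK: drops hidden and OS-generated files
// from a list of paths before they are treated as page files.
function filter_hidden_files(array $paths): array
{
    // Assumed skip list; extend it to match what the digitizers deliver.
    $skip_names = ['thumbs.db', '.ds_store'];

    $kept = array_filter($paths, function (string $path) use ($skip_names): bool {
        $name = basename($path);
        // Dotfiles, including macOS "._" companions such as "._2002-11-01-001.tif".
        if (strpos($name, '.') === 0) {
            return false;
        }
        // Compare case-insensitively so "Thumbs.db" and "THUMBS.DB" both match.
        return !in_array(strtolower($name), $skip_names, true);
    });

    // Reindex so callers get a plain 0..n-1 list.
    return array_values($kept);
}
```

Filtering on the file name rather than on the directory keeps the check independent of how the issue directories are named.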
I've been going through my massive batch of files to process, and while MIK isn't showing any actual errors in the log anymore after I removed all those hidden files, there is still a set that isn't processing and does show up in the problem_records file. The only difference between these files and the others is the uppercase file extension. After changing the extension from uppercase to lowercase, it processes correctly.
👍 then let's make the file extension case irrelevant!
Do you know offhand which file needs to be edited for this? I can give it a shot.
@bondjimbond Here's where the tiff file extensions for newspapers are defined: mik/src/filegetters/CdmNewspapers.php Line 49 in dd26fa8
Probably the best approach is to use strtolower in the file extension comparison wherever else $allowed_file_extensions_for_OBJ is used. This would also handle other cases of inconsistent file extension naming, like .Tif, .tiFF, etc.
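A minimal sketch of the strtolower approach described above, assuming the allowed extensions are stored as a plain array of strings without the leading dot. The function `has_allowed_extension` and the array contents are hypothetical; the real list is the one defined in CdmNewspapers.php.

```php
<?php
// Hypothetical helper, not MIK's actual code: case-insensitive extension check.
function has_allowed_extension(string $path, array $allowed_extensions): bool
{
    // pathinfo() returns '' when the path has no extension.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // Lowercase the allowed list too, in case it was configured with mixed case.
    $allowed = array_map('strtolower', $allowed_extensions);
    return $ext !== '' && in_array($ext, $allowed, true);
}

// Assumed contents, for illustration only.
$allowed_file_extensions_for_OBJ = ['tiff', 'tif'];
```

With this check, `2002-11-01-001.TIF`, `page.Tif`, and `page.tiFF` would all be accepted without listing every casing variant in the array.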
The quicker fix is just to add TIFF and TIF to $allowed_file_extensions_for_OBJ.
@MarcusBarnes Perfect, thanks!
I'm hitting errors that I can't figure out with some objects I'm processing.
Five of the Newspaper Issues in this set error out: record numbers 34, 37, 71, 72, and 76.
Their metadata is no different from the pages that process without issue:
No unusual characters that aren't present in other records.
And the file names all seem to be correct and appropriate:
For the issues with errors, the entire issue fails to appear.
The log doesn't tell me much. Some of the "problem records" don't seem to actually show errors in the log at all (e.g. records 34 and 37). This bit for record 71 shows problems, but it's hard to understand. I do see some strange things here:
"/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-001.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-002.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB Newspapers/shfcb_37/2002-11-01/._2002-11-01-003.tif","/Volumes/BCHDP INGE/For Ingest/SHFCB/SHFCB
I checked, and we do have hidden files called
._2002-11-01-003.tif
etc. This could be part of the problem, except that such files do not appear in the other directories. The other directories seem to be OK.
Am I missing anything obvious that might account for the problem?
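To check which directories actually contain these `._` files (macOS typically creates them when writing to volumes that do not natively store its extended attributes), a small diagnostic along these lines could help. It is a sketch only: `find_appledouble_files` is a hypothetical name, and nothing here comes from MIK itself.

```php
<?php
// Hypothetical diagnostic, not part of MIK: lists "._" files under a directory.
function find_appledouble_files(string $root): array
{
    $found = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // Only regular files whose name starts with "._" are reported.
        if ($file->isFile() && strpos($file->getFilename(), '._') === 0) {
            $found[] = $file->getPathname();
        }
    }
    // Sort so the output is stable between runs.
    sort($found);
    return $found;
}
```

Running it against the top-level newspaper directory and comparing the results with the record numbers in problem_records.log would show whether the failing issues are the ones that contain these files.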
problem_records.log
mik.log
shfcb_37.csv
bchdp_news.txt