Every SCOAP3 record has PDF and XML files.
All fields in the JSON are parsed from the XML file. The XML file can be found in a .tar archive, for example, CERNQ000000010669A.tar
Downloading one tar archive takes at least 15 min, for example: workflows qa
The PDF files for every record parsed from the XML mentioned above can be found in a .zip archive, for example: vtex00577479_a-2b.zip
The ZIP archive is so big that downloading it takes ages; I cannot even tell how long, since it crashes at the end. We have the extracted files saved, but not the archive itself. Maybe it would be better to save the archive first and then extract it, rather than reading files from the zip/tar and saving them into the extracted dir? It might actually be faster.
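A minimal sketch of the "save the archive first, then extract" idea, using only the Python standard library. The function name `save_then_extract`, the `work_dir` layout and the 8 MiB chunk size are assumptions made for illustration, not what the workflow currently does. `stream` stands for any binary file-like object; whether the S3 or SFTP client used here exposes one is also an assumption.

```python
import shutil
import zipfile
from pathlib import Path
from typing import BinaryIO


def save_then_extract(stream: BinaryIO, work_dir: Path, archive_name: str) -> list[str]:
    """Copy the whole archive to local disk first, then extract from the local copy."""
    work_dir.mkdir(parents=True, exist_ok=True)
    archive_path = work_dir / archive_name

    # One sequential pass over the source stream, in 8 MiB chunks (hypothetical size).
    with open(archive_path, "wb") as local_file:
        shutil.copyfileobj(stream, local_file, length=8 * 1024 * 1024)

    # Extract next to the archive, e.g. vtex00577479_a-2b.zip -> vtex00577479_a-2b/
    extract_dir = work_dir / archive_path.stem
    with zipfile.ZipFile(archive_path) as archive:
        # testzip() returns the name of the first member with a bad CRC, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member in {archive_name}: {bad_member}")
        archive.extractall(extract_dir)
        # Return the extracted file names (directories skipped) for later upload.
        return [info.filename for info in archive.infolist() if not info.is_dir()]
```

With this shape the remote archive is read once, sequentially, and the CRC check runs on the local copy before anything is extracted. Whether it is actually faster would have to be measured on a real archive.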
Error while saving the file from ZIP:
[2023-10-30, 10:56:11 UTC] {connectionpool.py:471} WARNING - Failed to parse headers (url=https://s3.cern.ch:443/scoap3-qa-workflows-elsevier/extracted/vtex00564918_a-2b/vtex00564918_a-2b/05503213/v991sC/fp/issue_xml_xsl2881_fp.xml): [MissingHeaderBodySeparatorDefect()], unparsed data: 'HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\r\nBucket: scoap3-qa-workflows-elsevier\r\nContent-Length: 0\r\nDate: Mon, 30 Oct 2023 10:56:11 GMT\r\nEtag: "2dcc3f604a9e3b2c93d9e72373fbe493"\r\nX-Amz-Request-Id: tx000009cd859128c20a345-00653f8bca-3cf7f37b-default\r\n\r\n'
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 469, in _make_request
assert_header_parsing(httplib_response.msg)
File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/util/response.py", line 91, in assert_header_parsing
raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: 'HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\r\nBucket: scoap3-qa-workflows-elsevier\r\nContent-Length: 0\r\nDate: Mon, 30 Oct 2023 10:56:11 GMT\r\nEtag: "2dcc3f604a9e3b2c93d9e72373fbe493"\r\nX-Amz-Request-Id: tx000009cd859128c20a345-00653f8bca-3cf7f37b-default\r\n\r\n'
HOW TO MAKE IT FASTER?
HOW TO HAVE THE FILES FROM .ZIP DOWNLOADED CORRECTLY?