Elsevier: vtexXXXXX download #231

Open
ErnestaP opened this issue Oct 30, 2023 · 0 comments

ErnestaP commented Oct 30, 2023

Every record we have for SCOAP3 has PDF and XML files.

  • All fields in the JSON are parsed from the XML file. The XML file is found in a .tar archive, for example CERNQ000000010669A.tar
  • Downloading one tar archive takes at least 15 min, example: workflows qa
  • The PDF files for every record parsed from the XML mentioned above are found in a .zip archive, for example: vtex00577479_a-2b.zip
  • The ZIP archive is so big that downloading it takes ages. I cannot even tell how long, since it crashes at the end. We save the extracted files, but not the archive itself. Maybe it would be better to save the archive first and then extract it, rather than reading files from the zip/tar and saving them in the extracted dir? That might actually be faster
  • Error while saving a file from the ZIP:
[2023-10-30, 10:56:11 UTC] {connectionpool.py:471} WARNING - Failed to parse headers (url=https://s3.cern.ch:443/scoap3-qa-workflows-elsevier/extracted/vtex00564918_a-2b/vtex00564918_a-2b/05503213/v991sC/fp/issue_xml_xsl2881_fp.xml): [MissingHeaderBodySeparatorDefect()], unparsed data: 'HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\r\nBucket: scoap3-qa-workflows-elsevier\r\nContent-Length: 0\r\nDate: Mon, 30 Oct 2023 10:56:11 GMT\r\nEtag: "2dcc3f604a9e3b2c93d9e72373fbe493"\r\nX-Amz-Request-Id: tx000009cd859128c20a345-00653f8bca-3cf7f37b-default\r\n\r\n'
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 469, in _make_request
    assert_header_parsing(httplib_response.msg)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/util/response.py", line 91, in assert_header_parsing
    raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: 'HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\r\nBucket: scoap3-qa-workflows-elsevier\r\nContent-Length: 0\r\nDate: Mon, 30 Oct 2023 10:56:11 GMT\r\nEtag: "2dcc3f604a9e3b2c93d9e72373fbe493"\r\nX-Amz-Request-Id: tx000009cd859128c20a345-00653f8bca-3cf7f37b-default\r\n\r\n'

HOW TO MAKE IT FASTER?
HOW TO HAVE THE FILES FROM .ZIP DOWNLOADED CORRECTLY?
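
For the "save the archive first, then extract" idea from the list above, here is a minimal sketch using only the Python standard library. It is not the workflows code: `save_then_extract`, `archive_path`, `extract_dir` and `chunk_size` are hypothetical names, and `source` is assumed to be any readable binary file-like object (for example a streamed download). Whether this is actually faster than reading members straight from the remote archive would still need to be measured.

```python
import io
import shutil
import zipfile
from pathlib import Path


def save_then_extract(source, archive_path, extract_dir, chunk_size=1024 * 1024):
    """Copy a ZIP archive to local disk in chunks, then extract it locally.

    `source` is any binary file-like object (hypothetical: a streamed download).
    Returns the paths of the extracted files.
    """
    archive_path = Path(archive_path)
    extract_dir = Path(extract_dir)
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    extract_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: save the whole archive first, reading it in fixed-size chunks
    # so the archive never has to fit in memory.
    with open(archive_path, "wb") as target:
        shutil.copyfileobj(source, target, chunk_size)

    # Step 2: extract from the local copy instead of from the remote stream.
    with zipfile.ZipFile(archive_path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        corrupt_member = archive.testzip()
        if corrupt_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {corrupt_member}")
        archive.extractall(extract_dir)
        # Skip directory entries, which end with "/".
        return [
            extract_dir / name
            for name in archive.namelist()
            if not name.endswith("/")
        ]
```

Keeping the archive on disk also means a failed extraction can be retried without downloading again, and `testzip()` gives an early signal when the download itself was truncated.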

ErnestaP added a commit to cern-sis/workflows that referenced this issue Nov 22, 2023
ErnestaP added a commit to cern-sis/workflows that referenced this issue Nov 22, 2023
ErnestaP added a commit to cern-sis/workflows that referenced this issue Nov 23, 2023
ErnestaP added a commit to cern-sis/workflows that referenced this issue Nov 23, 2023