seqio_cache_tasks fails on DataflowRunner #109
Comments
In case anyone else stumbles upon this or lands here through search: kind people in the Apache Beam community have pointed out https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython. Following the approach described there resolved it for me. I'll be happy to submit a doc patch with instructions in case anyone points me to the right place to put it.
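A minimal `setup.py` along the lines of the Beam docs linked above might look like the sketch below. The package name and the dependency list are illustrative assumptions, not what the commenter actually used:

```python
# setup.py -- a minimal sketch for shipping job dependencies to Dataflow
# workers, per Beam's "non-Python dependencies" documentation.
# Package name and dependency pins below are illustrative assumptions.
import sys

from setuptools import find_packages, setup

SETUP_KWARGS = dict(
    name="seqio_cache_job",  # hypothetical package name
    version="0.0.1",
    packages=find_packages(),
    install_requires=[
        "seqio",
        "tensorflow-text",
    ],
)

# Only invoke setup() when a command such as `sdist` is actually given,
# so the module can also be imported/inspected without side effects.
if len(sys.argv) > 1:
    setup(**SETUP_KWARGS)
```

The job is then launched with `--setup_file=./setup.py` in the pipeline options, so Beam builds and stages this package for the workers.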
@bzz please do add these details to the README
Just to add another possible solution: after some time trying, what finally worked for me was a combination of a setup.py and a custom Docker image for the Dataflow workers (https://cloud.google.com/dataflow/docs/guides/using-custom-containers).
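A sketch of what combining those two fixes might look like when assembling the Dataflow pipeline options. The project, region, and container image path are placeholders; the real values depend on your GCP setup:

```python
# Hypothetical flags combining the two fixes described above: a setup.py
# for the job's own code/dependencies, and a custom SDK container image so
# workers already have packages (like tensorflow-text) pre-installed.
# Project, region, and image path are placeholders.
dataflow_flags = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",  # placeholder
    "--region=us-central1",      # placeholder
    "--setup_file=./setup.py",
    "--sdk_container_image=gcr.io/my-gcp-project/beam-seqio:latest",  # placeholder
]

# seqio_cache_tasks takes these via its --pipeline_options flag,
# comma-separated:
pipeline_options = ",".join(dataflow_flags)
print(pipeline_options)
```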
When trying to cache a dataset that does not fit in DirectRunner (e.g. google-research/text-to-text-transfer-transformer#323 (comment)) on Cloud Dataflow without any `requirements.txt`, it fails with `ModuleNotFoundError: No module named 'seqio'`. If `seqio` is added to a `requirements.txt`, it fails with a different error.
This seems to be caused by `seqio` depending on `tensorflow-text`, which does not have any source release artifacts. But the requirements cache in Apache Beam seems to be populated with `--no-binary :all:` before being made available to the workers. A try in a clean venv reproduces the same failure.
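For reference, the staging step that trips here can be sketched as the equivalent pip invocation (shown rather than executed, since the real call needs network access; the cache directory name is an assumption based on Beam's conventional default):

```python
# Sketch of the pip call Beam's stager performs for requirements.txt.
# --no-binary :all: forces source distributions, so a dependency that ships
# only wheels (tensorflow-text here) cannot be downloaded and staging fails.
import sys

cmd = [
    sys.executable, "-m", "pip", "download",
    "--dest", "/tmp/dataflow-requirements-cache",  # assumed default cache dir
    "-r", "requirements.txt",
    "--no-binary", ":all:",
]
print(" ".join(cmd))  # printed, not run: the real download needs network
```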
Am I doing something wrong, or how does everyone else work around this? Would appreciate a hand here.