Ensure ExperimentFn is executable before submitting #11
Comments
Currently the main source of failures is serialization, so I suggest limiting the validation to serialization first. To get E2E tests, we are also continuing the work on mocking the skein cluster, which could be another way to test the experiment function.
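For illustration, a serialization-only check could be as small as a round trip through the pickler before submitting. This is a minimal sketch, not tf-yarn code: `check_serializable` is a hypothetical helper, and the standard-library `pickle` stands in for whichever serializer is actually used (pass that serializer's `dumps`/`loads` instead).

```python
import pickle


def check_serializable(fn, dumps=pickle.dumps, loads=pickle.loads):
    """Round-trip `fn` through a serializer (hypothetical helper).

    Returns None when the round trip succeeds, otherwise the exception
    that was raised. The stdlib pickle functions are only a default
    stand-in; swap in the serializer used for the real submission.
    """
    try:
        restored = loads(dumps(fn))
    except Exception as exc:  # serializers raise many different error types
        return exc
    if not callable(restored):
        return TypeError("deserialized object is not callable")
    return None


if __name__ == "__main__":
    # A builtin is pickled by reference, so this round trip succeeds.
    print(check_serializable(len))
    # The stdlib pickler cannot handle a lambda, so an exception is returned.
    print(check_serializable(lambda: 1))
```

Returning the exception instead of raising it lets the caller decide whether to abort the submission or just warn.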
Could you elaborate on that please? Which failures are you referring to?
dill (or cloudpickle) serializes everything that the Python function needs. For non-trivial functions, this causes problems. Here are 2 examples we ran into:
In tf-yarn, I would like to tell the pickler which objects are already provided remotely on the executor (logger, configuration, Python packages, etc.) and limit serialization accordingly. There is another solution we are investigating for this problem. TensorFlow Python functions communicate over gRPC using protobuf schemas, so they avoid the serialization problem. The solution would then be to only instantiate TensorFlow training servers and execute the ExperimentFn function on the driver. That's the idea behind this new example: https://github.com/criteo/tf-yarn/blob/master/examples/distributed.py
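On the first idea (telling the pickler which objects already exist on the executor), the standard-library `pickle` module has a hook for this: `persistent_id` on the pickling side and `persistent_load` on the unpickling side. The sketch below uses plain `pickle`, not dill, and every name in it (`DriverPickler`, `ExecutorUnpickler`, the `"lock"` tag) is made up for illustration. A `threading.Lock` stands in for any object that cannot or should not be shipped.

```python
import io
import pickle
import threading


class DriverPickler(pickle.Pickler):
    """Pickler that writes a tag instead of the object for provided objects."""

    def __init__(self, file, provided):
        super().__init__(file)
        # Map object identity to the name under which the executor provides it.
        self._tags = {id(obj): name for name, obj in provided.items()}

    def persistent_id(self, obj):
        # Returning None tells pickle to serialize the object as usual.
        return self._tags.get(id(obj))


class ExecutorUnpickler(pickle.Unpickler):
    """Unpickler that resolves tags against objects available locally."""

    def __init__(self, file, provided):
        super().__init__(file)
        self._provided = provided

    def persistent_load(self, pid):
        try:
            return self._provided[pid]
        except KeyError:
            raise pickle.UnpicklingError(f"no local object for tag {pid!r}")


def dumps_with_remote(obj, provided):
    buf = io.BytesIO()
    DriverPickler(buf, provided).dump(obj)
    return buf.getvalue()


def loads_with_remote(data, provided):
    return ExecutorUnpickler(io.BytesIO(data), provided).load()


if __name__ == "__main__":
    driver_lock = threading.Lock()    # cannot be pickled directly
    executor_lock = threading.Lock()  # what the "executor" has locally
    data = dumps_with_remote({"step": 1, "lock": driver_lock},
                             {"lock": driver_lock})
    restored = loads_with_remote(data, {"lock": executor_lock})
    print(restored["lock"] is executor_lock)
```

Whether this hook can be wired into the serializer tf-yarn actually uses is not something this sketch verifies.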
Created a ticket about the serialization test: #32
This is a simple UX improvement which would allow for early error detection. The idea is to check if the experiment_fn passed to run_on_yarn can be executed in the configured environment by calling
$ path/to/env/bin/python -c "load_fn(...)()"
on the edge node prior to submitting.
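A rough sketch of what such a pre-submission check might look like is below. It is not the proposed implementation: `check_executable` is a hypothetical helper, the standard-library `pickle` stands in for the real serializer, and `sys.executable` stands in for the environment's interpreter (`path/to/env/bin/python` above). With plain `pickle`, only functions importable by the target interpreter can be checked this way.

```python
import operator
import os
import pickle
import platform
import subprocess
import sys
import tempfile

# Script run by the target interpreter: load the pickled function, call it.
_CHECK_SNIPPET = (
    "import pickle, sys\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    fn = pickle.load(f)\n"
    "fn()\n"
)


def check_executable(fn, python=sys.executable, timeout=60):
    """Call `fn` in a separate interpreter (hypothetical helper).

    Returns (ok, stderr). `python` should point at the interpreter of the
    environment that will be shipped; sys.executable is only a default.
    """
    with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
        pickle.dump(fn, f)
        path = f.name
    try:
        proc = subprocess.run(
            [python, "-c", _CHECK_SNIPPET, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    finally:
        os.unlink(path)  # always clean up the temporary pickle file
    return proc.returncode == 0, proc.stderr


if __name__ == "__main__":
    # An importable zero-argument function runs fine in the child process.
    print(check_executable(platform.python_version)[0])
    # Calling operator.add without arguments fails in the child process.
    print(check_executable(operator.add)[0])
```

Note that this actually calls the function, so for a real experiment function the check would need to stop short of starting training.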