Ensure ExperimentFn is executable before submitting #11

superbobry · 2018-10-11T19:45:32Z

This is a simple UX improvement which would allow for early error detection. The idea is to check if the experiment_fn passed to run_on_yarn can be executed in the configured environment by calling

$ path/to/env/bin/python -c "load_fn(...)()"

on the edge node prior to submitting.
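A minimal sketch of such a pre-submission check, assuming the environment's interpreter path and the code to run are both available as strings. `check_runs_in_env` is a hypothetical helper, not part of tf-yarn, and the `load_fn(...)` call is left to the caller because its arguments are not specified here.

```python
import subprocess
import sys


def check_runs_in_env(python_bin, code, timeout=60):
    """Run `code` with the interpreter at `python_bin` (hypothetical helper).

    Returns (ok, stderr), where ok is True when the interpreter exits with 0.
    """
    result = subprocess.run(
        [python_bin, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode stdout/stderr as text
        timeout=timeout,
    )
    return result.returncode == 0, result.stderr


if __name__ == "__main__":
    # The current interpreter stands in for path/to/env/bin/python.
    ok, err = check_runs_in_env(sys.executable, "import json")
    if not ok:
        raise SystemExit("experiment_fn is not runnable in the env:\n" + err)
```

Returning the captured stderr lets the caller show the remote-style traceback on the edge node instead of discovering it in the container logs after submission.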

jdlesage · 2019-01-10T08:39:54Z

Currently the main source of failures is serialization. I suggest limiting the validation to serialization first.

To get E2E tests, we are also continuing the work on mocking the skein cluster. That could be another way to test the experiment function.

superbobry · 2019-01-10T23:17:20Z

> Currently the main source of failures is serialization.

Could you elaborate on that, please? Which failures are you referring to?

jdlesage · 2019-01-11T14:02:59Z

dill (or cloudpickle) serializes everything the Python function needs. For non-trivial functions, this causes problems. Here are two examples we ran into:

logging: if a function uses a logger object, the pickler will try to serialize its configuration and then its handlers. Serializing handlers makes no sense; they should be instantiated remotely and injected into the logger. Functions that use a logger often cannot be deserialized.
six: if six has to be serialized, dill fails on packages that six references but that are not installed (on our cluster, the problem comes from tkinter). I don't think dill should be expected to detect this, because it would have to understand the six code to know that tkinter is optional.
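A serialization-only validation could look roughly like the sketch below. It uses the standard library's `pickle` as a stand-in, so it does not reproduce dill's or cloudpickle's behavior exactly; `serialization_error` is a hypothetical helper, and the `dumps`/`loads` parameters are there so that `dill.dumps`/`dill.loads` can be passed instead.

```python
import logging
import pickle


def serialization_error(obj, dumps=pickle.dumps, loads=pickle.loads):
    """Round-trip `obj` through the pickler (hypothetical helper).

    Returns None on success, or the exception raised while dumping or loading.
    """
    try:
        loads(dumps(obj))
    except Exception as exc:  # picklers raise several unrelated exception types
        return exc
    return None


if __name__ == "__main__":
    # Plain data survives the round trip.
    print(serialization_error({"learning_rate": 0.1}))
    # A logging handler holds a lock and an open stream, so it does not.
    print(serialization_error(logging.StreamHandler()))
```

Running the round trip on the edge node, including the load step, catches both halves of the problem: objects that cannot be dumped and payloads that cannot be restored.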

In tf-yarn, I would like to tell the pickler which objects are provided remotely on the executor (logger, configuration, Python packages, etc.) and limit serialization accordingly.
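One way to express "this object is provided remotely" is the persistent-ID mechanism of the standard `pickle` module: the pickler writes a name instead of the object, and the unpickler resolves that name on the other side. The sketch below is an assumption about how this could be wired, not tf-yarn code; the class names, the `provided` registry, and the `dumps`/`loads` helpers are all hypothetical.

```python
import io
import logging
import pickle


class ProvidedRemotelyPickler(pickle.Pickler):
    """Writes a name instead of the object for anything in `provided`."""

    def __init__(self, file, provided):
        super().__init__(file)
        self._provided = provided  # keeps the objects alive so ids stay valid
        self._names = {id(obj): name for name, obj in provided.items()}

    def persistent_id(self, obj):
        # Returning None tells pickle to serialize the object as usual.
        return self._names.get(id(obj))


class ProvidedRemotelyUnpickler(pickle.Unpickler):
    """Resolves names against the objects available on the executor."""

    def __init__(self, file, provided):
        super().__init__(file)
        self._provided = provided

    def persistent_load(self, pid):
        if pid not in self._provided:
            raise pickle.UnpicklingError("not provided remotely: %r" % (pid,))
        return self._provided[pid]


def dumps(obj, provided):
    buf = io.BytesIO()
    ProvidedRemotelyPickler(buf, provided).dump(obj)
    return buf.getvalue()


def loads(data, provided):
    return ProvidedRemotelyUnpickler(io.BytesIO(data), provided).load()


if __name__ == "__main__":
    driver_handler = logging.StreamHandler()      # not picklable on its own
    payload = {"learning_rate": 0.1, "handler": driver_handler}
    data = dumps(payload, {"log_handler": driver_handler})

    executor_handler = logging.StreamHandler()    # instantiated "remotely"
    restored = loads(data, {"log_handler": executor_handler})
    assert restored["handler"] is executor_handler
```

The handler never crosses the wire; only the name `log_handler` does, and the executor injects its own instance on load.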

There is another solution we are investigating. TensorFlow Python functions communicate over gRPC using protobuf schemas, which sidesteps the serialization problem. So the solution would be to only instantiate TensorFlow training servers and then execute the ExperimentFn function on the driver. That's the idea behind this new example: https://github.com/criteo/tf-yarn/blob/master/examples/distributed.py

jdlesage · 2019-01-14T14:31:23Z

Created a ticket about the serialization test: #32

fhoering closed this as completed Apr 3, 2020
