Using a distribution strategy for evaluation doesn't seem to work well.
We could split up the train_and_evaluate function: run distributed training with the cluster_spec, and do evaluation separately by calling estimator.evaluate.
Currently, train_and_evaluate runs evaluation in a thread that always reads the latest checkpoint:
https://github.com/tensorflow/estimator/blob/master/tensorflow_estimator/python/estimator/training.py#L798
Splitting evaluation out would let us spawn many evaluators with skein, give each a different checkpoint path, and evaluate different checkpoints in parallel.
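A minimal sketch of that split, for illustration only: the toy model_fn and input_fn stand in for the real ones, and the model_dir, step counts, checkpoint path, and MirroredStrategy choice are placeholders rather than anything prescribed by this issue.

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    """Toy model_fn, only here to make the sketch self-contained."""
    logits = tf.compat.v1.layers.dense(features["x"], 1)
    loss = tf.compat.v1.losses.mean_squared_error(labels, logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.compat.v1.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=tf.compat.v1.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

def input_fn():
    """Toy input pipeline standing in for the real train/eval input_fns."""
    ds = tf.data.Dataset.from_tensors(({"x": [[1.0]]}, [[2.0]]))
    return ds.repeat(1000)

# Training side: launched once per cluster role (driven by TF_CONFIG /
# the cluster_spec), without the in-process evaluation thread that
# train_and_evaluate would start.
config = tf.estimator.RunConfig(
    model_dir="/tmp/estimator_model",  # placeholder; an HDFS path on YARN
    train_distribute=tf.distribute.MirroredStrategy(),  # illustrative choice
)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, max_steps=100)

# Evaluation side: one skein container per checkpoint. An explicit
# checkpoint_path pins each evaluator to one checkpoint instead of
# whatever is latest, so several checkpoints can be scored in parallel.
metrics = estimator.evaluate(
    input_fn=input_fn,
    checkpoint_path="/tmp/estimator_model/model.ckpt-100",  # placeholder
)
print(metrics)
```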