[JS-3] Resource sharing across jobs #6

Merged
yunseong merged 31 commits into master from generic-jobserver-resource-sharing
Dec 22, 2017

@wynot12 commented Dec 22, 2017

Resolves #3.

This PR introduces resource-sharing across jobs in JobServer.
Specifically, there are two new components:

  • ResourcePool: a resource pool through which JobServer manages the resources of the whole cluster.
  • SchedulerImpl: a new scheduler implementation that launches incoming jobs immediately, simply using all executors.
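
The interplay of the two components can be pictured with a minimal sketch. It is not the code from this PR: the class names `Pool` and `ShareAllScheduler`, their methods, and the use of plain string ids for executors are all assumptions made for illustration. It only shows the policy described above, where every incoming job is launched at once on all executors in the pool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: these are not the classes added by this PR. */
final class ResourceSharingSketch {

  /** Hypothetical pool that tracks every executor in the cluster by id. */
  static final class Pool {
    private final List<String> executorIds = new ArrayList<>();

    Pool(final int numExecutors) {
      for (int i = 0; i < numExecutors; i++) {
        executorIds.add("executor-" + i);
      }
    }

    /** Returns a read-only view of all executors; nothing is reserved per job. */
    List<String> allExecutors() {
      return Collections.unmodifiableList(executorIds);
    }
  }

  /** Hypothetical scheduler: no queueing, every job gets the whole pool. */
  static final class ShareAllScheduler {
    private final Pool pool;
    private final Map<String, List<String>> runningJobs = new LinkedHashMap<>();

    ShareAllScheduler(final Pool pool) {
      this.pool = pool;
    }

    /** Launches the job immediately and returns the executors it runs on. */
    List<String> onJobArrival(final String jobId) {
      final List<String> assigned = pool.allExecutors();
      runningJobs.put(jobId, assigned);
      return assigned;
    }

    /** Forgets a finished job; its executors stay in the pool for other jobs. */
    void onJobFinish(final String jobId) {
      runningJobs.remove(jobId);
    }

    int numRunningJobs() {
      return runningJobs.size();
    }
  }

  public static void main(final String[] args) {
    final ShareAllScheduler scheduler = new ShareAllScheduler(new Pool(5));
    // Two jobs arrive back to back; both are assigned the same five executors.
    System.out.println(scheduler.onJobArrival("job-a"));
    System.out.println(scheduler.onJobArrival("job-b"));
  }
}
```

Because no executor is ever reserved, concurrent jobs in this sketch share every executor.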

Additional changes:

  • Dolphin on JobServer supports only PS-collocation mode.
  • Removes FIFOScheduler, an old scheduler implementation, since we will no longer use it.
  • The short names of named parameters are renamed, since they conflicted across packages.
  • The number of worker threads in Pregel and Dolphin can be configured through the command line; otherwise, the system uses Runtime.getRuntime().availableProcessors().
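
The thread-count fallback in the last item can be sketched as follows. The class and method names are hypothetical, and treating a missing or non-positive value as "not configured" is an assumption of this sketch, not something the PR states.

```java
/** Illustrative only: not the code from this PR. */
final class WorkerThreadsSketch {

  /**
   * Returns the configured thread count if one was given; otherwise falls back
   * to the number of processors available to the JVM.
   * Assumption: null or a non-positive value means "not configured".
   */
  static int resolveNumThreads(final Integer configured) {
    if (configured != null && configured > 0) {
      return configured;
    }
    return Runtime.getRuntime().availableProcessors();
  }

  public static void main(final String[] args) {
    // An explicit value (e.g. from -num_trainer_threads 2) is used as-is.
    System.out.println(resolveNumThreads(2));
    // Without a value, the result depends on the machine running the code.
    System.out.println(resolveNumThreads(null));
  }
}
```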

@@ -82,6 +82,11 @@
public final class OptimizationBenefitThreshold implements Name<Double> {
}

@NamedParameter(doc = "Whether the hyper-thread is enabled, which determines the proper number of trainer threads.",
short_name = "hyper_thread_enabled", default_value = "false")
public final class HyperThreadEnabled implements Name<Boolean> {

It has been moved from DolphinParameters, since Pregel will use it, too.

@@ -14,7 +14,7 @@
# limitations under the License.

# EXAMPLE USAGE
-# ./run_addinteger.sh -num_workers 3 -number_workers 3 -number_servers 2 -max_num_epochs 10 -num_mini_batches 15 -num_worker_blocks 15 -delta 4 -num_keys 100 -num_training_data 50 -num_test_data 5 -compute_time_ms 30 -max_num_eval_local 5 -input run_addinteger.sh -optimizer edu.snu.cay.dolphin.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -server_metric_flush_period_ms 1000 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
+# ./run_addinteger.sh -number_workers 3 -num_dolphin_workers 3 -num_servers 2 -max_num_epochs 10 -num_mini_batches 15 -num_worker_blocks 15 -delta 4 -num_keys 100 -num_training_data 50 -num_test_data 5 -compute_time_ms 30 -max_num_eval_local 5 -input run_addinteger.sh -optimizer edu.snu.cay.dolphin.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -server_metric_flush_period_ms 1000 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3

Note that number_workers is the short name of an example-specific parameter (ExampleParameters.NumWorkers).

@@ -15,7 +15,7 @@

# EXAMPLE USAGE
# Classification
-# ./run_gbt.sh -num_workers 2 -number_servers 1 -local true -input sample_gbt -max_num_eval_local 3 -test_data_path file://$(pwd)/sample_gbt_test -max_num_epochs 50 -num_mini_batches 10 -num_worker_blocks 10 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -timeout 200000 -num_trainer_threads 2 -optimizer edu.snu.cay.dolphin.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -metadata_path file://$(pwd)/sample_gbt.meta -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
+# ./run_gbt.sh -num_dolphin_workers 2 -num_servers 1 -local true -input sample_gbt -max_num_eval_local 3 -test_data_path file://$(pwd)/sample_gbt_test -max_num_epochs 50 -num_mini_batches 10 -num_worker_blocks 10 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -timeout 200000 -num_trainer_threads 2 -optimizer edu.snu.cay.dolphin.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -metadata_path file://$(pwd)/sample_gbt.meta -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3

num_workers -> num_dolphin_workers is the only change.

@@ -14,7 +14,7 @@
# limitations under the License.

# EXAMPLE USAGE
-# ./start_jobserver.sh -num_total_resources 10 -max_num_eval_local 10 -local true -timeout 300000 -scheduler edu.snu.cay.jobserver.driver.FIFOJobScheduler
+# ./start_jobserver.sh -max_num_eval_local 5 -local true -timeout 300000 -scheduler edu.snu.cay.jobserver.driver.SchedulerImpl -num_executors 5 -executor_mem_size 128 -executor_num_cores 1 -executor_num_tasklets 4 -handler_queue_size 1024 -sender_queue_size 1024 -handler_num_threads 2 -sender_num_threads 2

JobServer requires executor spec arguments, since it will spawn all executors upon start.
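
As a rough illustration of why the executor spec has to be known at startup, the sketch below reads `-name value` pairs like those in the usage line and derives how many executors and tasklet slots would exist before any job arrives. The flag names come from the example usage above; the parsing helpers and the class itself are hypothetical and stand in for whatever configuration mechanism JobServer actually uses.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a stand-in for JobServer's real argument handling. */
final class ExecutorSpecSketch {

  /** Parses "-name value" pairs into a map keyed by the name without the dash. */
  static Map<String, String> parseFlags(final String[] args) {
    if (args.length % 2 != 0) {
      throw new IllegalArgumentException("Flags must come in '-name value' pairs");
    }
    final Map<String, String> flags = new HashMap<>();
    for (int i = 0; i < args.length; i += 2) {
      if (!args[i].startsWith("-")) {
        throw new IllegalArgumentException("Expected a flag but got: " + args[i]);
      }
      flags.put(args[i].substring(1), args[i + 1]);
    }
    return flags;
  }

  /** Reads a mandatory integer flag; fails fast when it is missing. */
  static int requiredInt(final Map<String, String> flags, final String name) {
    final String value = flags.get(name);
    if (value == null) {
      throw new IllegalArgumentException("Missing required flag: -" + name);
    }
    return Integer.parseInt(value);
  }

  public static void main(final String[] args) {
    final Map<String, String> flags = parseFlags(new String[] {
        "-num_executors", "5", "-executor_mem_size", "128",
        "-executor_num_cores", "1", "-executor_num_tasklets", "4"});
    final int numExecutors = requiredInt(flags, "num_executors");
    final int taskletsPerExecutor = requiredInt(flags, "executor_num_tasklets");
    // All executors are requested up front, so total capacity is fixed at start.
    System.out.println("executors: " + numExecutors);
    System.out.println("tasklet slots: " + numExecutors * taskletsPerExecutor);
  }
}
```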

@@ -15,7 +15,7 @@

# EXAMPLE USAGE
# Classification
-# ./submit_gbt.sh -num_workers 2 -number_servers 1 -input file://$(pwd)/sample_gbt -test_data_path file://$(pwd)/sample_gbt_test -max_num_epochs 50 -num_mini_batches 10 -num_worker_blocks 10 -init_step_size 0.1 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -num_trainer_threads 2 -metadata_path file://$(pwd)/sample_gbt.meta -server_metric_flush_period_ms 1000
+# ./submit_gbt.sh -input file://$(pwd)/sample_gbt -test_data_path file://$(pwd)/sample_gbt_test -max_num_epochs 50 -num_mini_batches 10 -num_worker_blocks 10 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -num_trainer_threads 2 -metadata_path file://$(pwd)/sample_gbt.meta -server_metric_flush_period_ms 1000

Submit scripts no longer need executor-request parameters.
The scheduler determines which executors, and how many, a job uses.

@@ -168,13 +168,12 @@ public MasterSideMsgHandler getMsgHandler() {
public void start(final List<AllocatedExecutor> servers, final List<AllocatedExecutor> workers,
final AllocatedTable modelTable, final AllocatedTable trainingDataTable) {
try {
servers.forEach(server -> metricManager.startMetricCollection(server.getId(), getServerMetricConf()));
// TODO #5: tasklet-level metric collection

@wynot12 commented Dec 22, 2017

I've just removed metric collection for server tasklets, since we don't use metrics from servers.
However, worker tasklets from different jobs may still cause problems; I'll fix that in a PR for #5.

Actually, we don't even need server tasklets at all. But I'd like to minimize the code change, so let's tackle this issue when it becomes urgent.


return executorGroups;
}
// support collocation only

This is the only place that assumes PS-collocation.
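
To make the collocation assumption concrete, here is a minimal sketch of what "support collocation only" could look like when building executor groups. The method name, the group order (servers first, then workers), and the use of string ids are assumptions for illustration; the actual method in the PR is only partially visible in the hunk above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: not the method from this PR. */
final class CollocationSketch {

  /**
   * Builds two executor groups, servers first and workers second.
   * Under PS-collocation both groups contain exactly the same executors.
   */
  static List<List<String>> buildExecutorGroups(final List<String> executors) {
    final List<String> shared = Collections.unmodifiableList(new ArrayList<>(executors));
    // support collocation only: every executor hosts both a server and a worker
    return Arrays.asList(shared, shared);
  }

  public static void main(final String[] args) {
    final List<List<String>> groups =
        buildExecutorGroups(Arrays.asList("executor-0", "executor-1", "executor-2"));
    System.out.println("servers: " + groups.get(0));
    System.out.println("workers: " + groups.get(1));
  }
}
```

Supporting a non-collocated mode would mean returning two different lists here.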


@yunseong left a comment

Thanks for the hard work! The PR looks great.
I'll merge once tests pass (triggered by my minor fix).

@yunseong merged commit ec97ff1 into master Dec 22, 2017
@yunseong deleted the generic-jobserver-resource-sharing branch December 22, 2017 11:38