[JS-3] Resource sharing across jobs #6
* introduce combiner
* Minor fix (grammar, comments)
@@ -82,6 +82,11 @@
 public final class OptimizationBenefitThreshold implements Name<Double> {
 }

+@NamedParameter(doc = "Whether the hyper-thread is enabled, which determines the proper number of trainer threads.",
+short_name = "hyper_thread_enabled", default_value = "false")
+public final class HyperThreadEnabled implements Name<Boolean> {
It has been moved from DolphinParameters, since Pregel will use it, too.
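As a rough illustration of how such a flag could determine the number of trainer threads, here is a minimal sketch. It assumes that, with hyper-threading on, each physical core is reported as two logical processors, so the thread count is halved; the class and method names are hypothetical and not taken from this PR.

```java
// Hypothetical helper for illustration only; not part of this PR.
final class TrainerThreadSketch {
  private TrainerThreadSketch() {
  }

  /**
   * Picks a default trainer-thread count from the logical processor count.
   * Assumption: with hyper-threading enabled, two logical processors map to one physical core.
   */
  static int defaultNumTrainerThreads(final int logicalProcessors, final boolean hyperThreadEnabled) {
    final int usable = hyperThreadEnabled ? logicalProcessors / 2 : logicalProcessors;
    return Math.max(1, usable); // always keep at least one trainer thread
  }

  public static void main(final String[] args) {
    final int logical = Runtime.getRuntime().availableProcessors();
    System.out.println("hyper_thread_enabled=false: " + defaultNumTrainerThreads(logical, false));
    System.out.println("hyper_thread_enabled=true:  " + defaultNumTrainerThreads(logical, true));
  }
}
```

Passing the processor count in as an argument, rather than reading it inside the method, keeps the rule easy to unit-test.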
@@ -14,7 +14,7 @@
 # limitations under the License.

 # EXAMPLE USAGE
-# ./run_addinteger.sh -num_workers 3 -number_workers 3 -number_servers 2 -max_num_epochs 10 -num_mini_batches 15 -num_worker_blocks 15 -delta 4 -num_keys 100 -num_training_data 50 -num_test_data 5 -compute_time_ms 30 -max_num_eval_local 5 -input run_addinteger.sh -optimizer edu.snu.cay.dolphin.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -server_metric_flush_period_ms 1000 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
+# ./run_addinteger.sh -number_workers 3 -num_dolphin_workers 3 -num_servers 2 -max_num_epochs 10 -num_mini_batches 15 -num_worker_blocks 15 -delta 4 -num_keys 100 -num_training_data 50 -num_test_data 5 -compute_time_ms 30 -max_num_eval_local 5 -input run_addinteger.sh -optimizer edu.snu.cay.dolphin.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -server_metric_flush_period_ms 1000 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
Note that number_workers is an example parameter (ExampleParameters.NumWorkers).
@@ -15,7 +15,7 @@

 # EXAMPLE USAGE
 # Classification
-# ./run_gbt.sh -num_workers 2 -number_servers 1 -local true -input sample_gbt -max_num_eval_local 3 -test_data_path file://$(pwd)/sample_gbt_test -max_num_epochs 50 -num_mini_batches 10 -num_worker_blocks 10 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -timeout 200000 -num_trainer_threads 2 -optimizer edu.snu.cay.dolphin.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -metadata_path file://$(pwd)/sample_gbt.meta -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
+# ./run_gbt.sh -num_dolphin_workers 2 -num_servers 1 -local true -input sample_gbt -max_num_eval_local 3 -test_data_path file://$(pwd)/sample_gbt_test -max_num_epochs 50 -num_mini_batches 10 -num_worker_blocks 10 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -timeout 200000 -num_trainer_threads 2 -optimizer edu.snu.cay.dolphin.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -metadata_path file://$(pwd)/sample_gbt.meta -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
num_workers -> num_dolphin_workers is the only change.
@@ -14,7 +14,7 @@
 # limitations under the License.

 # EXAMPLE USAGE
-# ./start_jobserver.sh -num_total_resources 10 -max_num_eval_local 10 -local true -timeout 300000 -scheduler edu.snu.cay.jobserver.driver.FIFOJobScheduler
+# ./start_jobserver.sh -max_num_eval_local 5 -local true -timeout 300000 -scheduler edu.snu.cay.jobserver.driver.SchedulerImpl -num_executors 5 -executor_mem_size 128 -executor_num_cores 1 -executor_num_tasklets 4 -handler_queue_size 1024 -sender_queue_size 1024 -handler_num_threads 2 -sender_num_threads 2
JobServer requires executor spec arguments, since it will spawn all executors upon start.
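To make the executor spec concrete, here is a minimal sketch that collects the four executor flags from the example command line above into one object. The class name, the pairwise "-flag value" parsing, and treating every value as an integer are assumptions made for illustration; the actual JobServer presumably wires these flags through its own named-parameter configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the executor spec flags; not the PR's actual class.
final class ExecutorSpecSketch {
  final int numExecutors;
  final int memSize; // unit is not stated in the script, so none is assumed here
  final int numCores;
  final int numTasklets;

  private ExecutorSpecSketch(final int numExecutors, final int memSize,
                             final int numCores, final int numTasklets) {
    this.numExecutors = numExecutors;
    this.memSize = memSize;
    this.numCores = numCores;
    this.numTasklets = numTasklets;
  }

  /** Assumes every flag is followed by exactly one value, as in the example usage. */
  static ExecutorSpecSketch fromArgs(final String[] args) {
    final Map<String, String> flags = new HashMap<>();
    for (int i = 0; i + 1 < args.length; i += 2) {
      flags.put(args[i], args[i + 1]);
    }
    return new ExecutorSpecSketch(
        requiredInt(flags, "-num_executors"),
        requiredInt(flags, "-executor_mem_size"),
        requiredInt(flags, "-executor_num_cores"),
        requiredInt(flags, "-executor_num_tasklets"));
  }

  // Fails fast when a flag is missing, since all executors are spawned at startup.
  private static int requiredInt(final Map<String, String> flags, final String name) {
    final String value = flags.get(name);
    if (value == null) {
      throw new IllegalArgumentException("Missing required argument: " + name);
    }
    return Integer.parseInt(value);
  }
}
```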
@@ -15,7 +15,7 @@

 # EXAMPLE USAGE
 # Classification
-# ./submit_gbt.sh -num_workers 2 -number_servers 1 -input file://$(pwd)/sample_gbt -test_data_path file://$(pwd)/sample_gbt_test -max_num_epochs 50 -num_mini_batches 10 -num_worker_blocks 10 -init_step_size 0.1 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -num_trainer_threads 2 -metadata_path file://$(pwd)/sample_gbt.meta -server_metric_flush_period_ms 1000
+# ./submit_gbt.sh -input file://$(pwd)/sample_gbt -test_data_path file://$(pwd)/sample_gbt_test -max_num_epochs 50 -num_mini_batches 10 -num_worker_blocks 10 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -num_trainer_threads 2 -metadata_path file://$(pwd)/sample_gbt.meta -server_metric_flush_period_ms 1000
Submit scripts no longer need executor-request parameters.
The scheduler determines which executors, and how many, a job will use.
@@ -168,13 +168,12 @@ public MasterSideMsgHandler getMsgHandler() {
 public void start(final List<AllocatedExecutor> servers, final List<AllocatedExecutor> workers,
 final AllocatedTable modelTable, final AllocatedTable trainingDataTable) {
 try {
-servers.forEach(server -> metricManager.startMetricCollection(server.getId(), getServerMetricConf()));
+// TODO #5: tasklet-level metric collection
I've just removed metric collection for server tasklets, since we don't use metrics from servers.
However, multiple worker tasklets from different jobs may cause problems; I'll fix that in a PR for #5.
Actually, we don't even need server tasklets at all, but I'd like to minimize the code change, so let's tackle this issue when it becomes urgent.

return executorGroups;
}
// support collocation only
Here's the only assumption of PS-collocation.
Thanks for the hard work! The PR looks great.
I'll merge once tests pass (triggered by my minor fix).
Resolves #3.
This PR introduces resource sharing across jobs in JobServer.
Specifically, there are two new components:
* ResourcePool: a resource pool with which JobServer maintains the resources of the whole cluster.
* SchedulerImpl: a new scheduler implementation that immediately launches incoming jobs, simply using all executors.
Additional changes:
* FIFOScheduler, an old scheduler implementation, since we will not use it anymore.
* Runtime.getRuntime().availableProcessors().
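
For readers who want a picture of the two components described above, here is a minimal sketch of the idea: a pool that holds every executor in the cluster, and a scheduler that launches each incoming job right away on all of them. All class and method names, and the use of plain string IDs for executors and jobs, are assumptions made for illustration and do not mirror the PR's real classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a pool that tracks all executors of the cluster.
final class ResourcePoolSketch {
  private final List<String> executorIds = new ArrayList<>();

  /** Assumption: all executors are created up front, when the job server starts. */
  ResourcePoolSketch(final int numExecutors) {
    for (int i = 0; i < numExecutors; i++) {
      executorIds.add("executor-" + i);
    }
  }

  List<String> getExecutors() {
    return Collections.unmodifiableList(executorIds);
  }
}

// Hypothetical scheduler: no queueing, every job shares every executor.
final class ShareAllSchedulerSketch {
  private final ResourcePoolSketch pool;
  private final List<String> launchedJobs = new ArrayList<>();

  ShareAllSchedulerSketch(final ResourcePoolSketch pool) {
    this.pool = pool;
  }

  /** Launches the job immediately and returns the executors it runs on. */
  List<String> onJobArrival(final String jobId) {
    launchedJobs.add(jobId);
    return pool.getExecutors();
  }

  List<String> getLaunchedJobs() {
    return Collections.unmodifiableList(launchedJobs);
  }
}
```

With this shape, two jobs submitted back to back receive the same executor list, which is what sharing resources across jobs means here; a scheduler that assigns a subset per job would only need to change the body of onJobArrival.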