
Shared tensor mechanism for avoiding tensor serialization/ipc costs #58

Merged (5 commits) on Jul 16, 2019

Conversation

@jsjason (Contributor) commented Jul 13, 2019

Closes #57.

This PR implements the solution described in #57: instead of serializing intermediate tensors and sending them through queues (which are Linux pipes, internally), processes write to pre-allocated shared tensors and pass lightweight integer signals.

First, the main benchmark process examines the given pipeline and creates all shared tensors in advance. Each runner process is given a fixed number of output tensors to write its outputs to. The device of each output tensor is set to match that of its corresponding process, and the shape is assumed to be provided by the model implementation (RunnerModel.output_shape()). Note that all output tensors of step N's processes are handed to all processes of step N+1 as input tensors. The newly added class BenchmarkTensors acts as an abstraction for managing these shared tensors.
The number of output tensors per process can be configured via the new command line argument -t / --tensors_per_process, which has been added to the log directory format as well.
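The pre-allocation step can be sketched as follows. This is a minimal, hypothetical illustration, not the actual BenchmarkTensors code: NUM_PROCESSES, OUTPUT_SHAPE, and allocate_shared_outputs are invented names, and a flat multiprocessing.Array stands in for a shared PyTorch tensor (which would normally be created with torch.Tensor.share_memory_() or allocated on the process's GPU):

```python
import multiprocessing as mp
import ctypes

NUM_PROCESSES = 2        # runner processes in this step (illustrative)
TENSORS_PER_PROCESS = 3  # the new -t / --tensors_per_process argument
OUTPUT_SHAPE = (4, 8)    # assumed to come from RunnerModel.output_shape()

def allocate_shared_outputs():
    """Pre-allocate shared_buffers[process_idx][tensor_idx] up front,
    before any runner process starts."""
    numel = 1
    for dim in OUTPUT_SHAPE:
        numel *= dim
    return [
        [mp.Array(ctypes.c_float, numel, lock=False)
         for _ in range(TENSORS_PER_PROCESS)]
        for _ in range(NUM_PROCESSES)
    ]

buffers = allocate_shared_outputs()
# All tensors_per_process buffers of every step-N process are handed to
# every step-(N+1) process as input buffers.
```

Allocating everything in the main process before forking is what makes the buffers visible to both the producing and consuming steps without any per-item copying.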

[Screenshot: Screen Shot 2019-07-14 at 2.37.09 AM]

Now, the only information that needs to be passed through queues is an index indicating which shared tensor is ready. I've implemented the index as a tuple of two integers: one denoting the process index (there can be multiple processes per step), and one denoting the tensor index (out of tensors_per_process tensors). Each process writes its output to one of its tensors_per_process output tensors and then sends the corresponding index. Once all tensors_per_process tensors have been used, it wraps around to the first tensor and repeats the procedure, like a ring buffer. To ensure that a process does not overwrite an output tensor that has not yet been consumed by the next step, I've added a multiprocessing.Event object to every shared tensor so that consumers can notify waiting producers. See the comments in the new TensorEvent class and its usages in runner.py for more details.
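The signaling protocol above can be sketched roughly as below. This is a simplified, hypothetical illustration of the idea (the real logic lives in TensorEvent and runner.py); the shared-tensor reads/writes are elided, and threads stand in for runner processes in the demo:

```python
import multiprocessing as mp
import threading

TENSORS_PER_PROCESS = 2

def producer(queue, events, process_idx, num_items):
    """Writes num_items outputs, cycling over its slots like a ring buffer."""
    for i in range(num_items):
        slot = i % TENSORS_PER_PROCESS
        if i >= TENSORS_PER_PROCESS:
            events[slot].wait()    # block until the consumer has read this slot
            events[slot].clear()
        # ... write the output into shared tensor `slot` here ...
        queue.put((process_idx, slot))   # signal: "this slot is ready"

def consumer(queue, events, num_items, results):
    """Reads ready slots in signal order and releases them for reuse."""
    for _ in range(num_items):
        process_idx, slot = queue.get()
        # ... read the shared tensor `slot` here ...
        results.append((process_idx, slot))
        events[slot].set()   # tell the producer the slot may be overwritten

# Demo with threads standing in for runner processes; in the real
# benchmark these would be separate processes sharing actual tensors.
queue = mp.Queue()
events = [mp.Event() for _ in range(TENSORS_PER_PROCESS)]
results = []
c = threading.Thread(target=consumer, args=(queue, events, 4, results))
p = threading.Thread(target=producer, args=(queue, events, 0, 4))
c.start(); p.start(); p.join(); c.join()
print(results)   # slot indices wrap around: [(0, 0), (0, 1), (0, 0), (0, 1)]
```

With one Event per slot, the producer only blocks when it laps the consumer, i.e., when all tensors_per_process slots are in flight at once; otherwise producer and consumer run fully in parallel.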

Test:

```
$ cat config/r2p1d-whole.json
...
    {
      "model": "models.r2p1d.model.R2P1DRunner",
      "gpus": [0,1,2,3,4,5,6,7],
...
$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python benchmark.py -c config/r2p1d-whole.json -v 2000 -mi 25
```

Before this PR, the average time between inference0_finish and runner1_start (the IPC portion) was 13.010609 ms. With this PR, the same section takes only 2.661710 ms.

@jsjason requested a review from yunseong on July 13, 2019 at 17:38
@yunseong (Contributor) left a comment:
Thank you so much for the PR with an amazing speedup!
The description is very thorough and the code is quite clean.

I've left a few minor comments. Please check them out!

control.py (review thread resolved)

runner_model.py:
```diff
@@ -17,6 +17,11 @@ def input_shape(self):
     """Returns the expected shape of the input tensor to this model."""
     raise NotImplementedError

+  @staticmethod
+  def output_shape():
+    """Returns the expected shape of the output tensor of this model."""
```
yunseong (Contributor):
As we discussed previously, could we add a description for the case of None in the input/output shape (i.e., not a Tensor at that dimension)?

yunseong (Contributor):
The change should be applied to input_shape() as well.

jsjason (Author):
Will do.

Further review threads on runner.py and control.py (resolved).
@jsjason (Author) left a comment:

@yunseong I've addressed your comments, thank you very much! Could you take another look? I've added quite a few changes in both code and comments, regarding the tensor/non-tensor generalization.

Review threads on control.py and runner.py marked resolved.
@jsjason (Author) commented Jul 15, 2019:

Sorry for the extra commits; I left out a few details. I'm ready for another review now.
https://github.com/snuspl/rnb/pull/58/files/21a479db230597fbd0549e02718c3028bb140ddf..c600b2b3d3c94908a8d22c0a562cff9a866cf0e1

runner.py (review thread resolved)
@yunseong (Contributor) left a comment:
Looks great to me. Thank you so much for fixing this issue.
I'm merging this.

Linked issue: Pass primitive signals instead of PTH tensors to avoid serialization/ipc costs