[onert] Share memory for Reshape, ExapndDims and Squeeze #14057

Closed
mbencer wants to merge 27 commits from mbencer/ReshapeAvoidCopy

@mbencer mbencer commented Sep 23, 2024

This commit extends the current tensor memory management infrastructure to allow tensors to share memory where possible.

ONE-DCO-1.0-Signed-off-by: Mateusz Bencer [email protected]

Issue: #12836

@mbencer mbencer force-pushed the mbencer/ReshapeAvoidCopy branch from c5ddcc5 to c8d8a75 Compare October 2, 2024 13:28

mbencer commented Oct 2, 2024

@glistening @hseok-oh Thank you for reviewing the previous version. In the current version I've completely changed the approach: memory sharing is now handled during tensor allocation.

@mbencer mbencer requested a review from glistening October 2, 2024 13:29
@mbencer mbencer changed the title [onert] Optimize Reshape, ExpandDims and Squeeze [onert] Share memory for Reshape, ExapndDims and Squeeze Oct 2, 2024
@glistening

@mbencer Is it possible to split this PR into 3 PRs, one for each operator (i.e. Reshape, ExpandDims and Squeeze)? If possible, could you please create smaller PRs? It is a bit too big for me to review as a whole.

cc @hseok-oh, @ragmani

@ragmani ragmani left a comment

@mbencer
This seems to be my last review on this work because I will be away from the office for a long time. I'm sorry, I won't be able to give feedback anymore.

}
}
reassign_indexes_to_single_sources(data.shared_memory_operand_map);
}
Contributor

In-place execution depends on specific operations, and kernel implementations may vary between backends. Also, kernels for these operations currently exist only in the cpu backend, so it would be better to move this map creation into the cpu backend.
I think a better place to create and append to the map is KernelGenerator. However, the cpu backend currently registers tensors before KernelGenerator runs, which makes it hard to simply move the map into KernelGenerator. You may need to unify BackendContext::genTensors() and BackendContext::genKernels(), as the train backend does.

Contributor Author

I see your point. My intention was to make this mechanism more global, but I see that it may not be applicable to other backends.
As you noticed, moving it to KernelGenerator is very problematic: we need to pass this information to the TensorBuilder ctor (a setter does not seem to be a good approach) and to initConsts (where the only possibility seems to be the local backend context).
My proposal (already implemented) is to call it in runtime/onert/backend/cpu/Backend.h
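
The body of reassign_indexes_to_single_sources, called in the snippet under review, is not shown in this thread. Judging only by its name and by the reshape-reshape-reshape chain test in this PR, one plausible reading is that it rewrites the map so that every memory-sharing operand points directly at the operand that finally owns the buffer. The following is a minimal sketch under that assumption; OperandIndex, SharedMemoryMap, findRootSource and collapseToRootSources are illustrative names, not onert APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for onert's ir::OperandIndex.
using OperandIndex = std::uint32_t;

// Key: operand that reuses memory. Value: operand whose memory it reuses.
using SharedMemoryMap = std::unordered_map<OperandIndex, OperandIndex>;

// Walk source -> source-of-source until reaching an operand that is not a key,
// i.e. an operand that owns its buffer.
inline OperandIndex findRootSource(const SharedMemoryMap &map, OperandIndex ind)
{
  std::size_t steps = 0;
  for (auto it = map.find(ind); it != map.end(); it = map.find(ind))
  {
    // An acyclic chain can never be longer than the number of keys.
    if (++steps > map.size())
      throw std::logic_error("cycle in shared memory map");
    ind = it->second;
  }
  return ind;
}

// Rewrite every entry so that it points directly at its root source.
inline void collapseToRootSources(SharedMemoryMap &map)
{
  SharedMemoryMap collapsed;
  for (const auto &entry : map)
    collapsed[entry.first] = findRootSource(map, entry.second);
  assert(collapsed.size() == map.size());
  map = std::move(collapsed);
}
```

With a chain such as 3 -> 2 -> 1 -> 0, all three sharing operands end up mapped to operand 0, so a later allocation step only has to look up one level.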

Comment on lines 20 to 21
#include "GenModelTest.h"
#include "CircleGen.h"
Contributor

Suggested change
-#include "GenModelTest.h"
-#include "CircleGen.h"
+#include "CircleGen.h"
+#include "GenModelTest.h"

#include "GenModelTest.h"
#include "CircleGen.h"

TEST_F(GenModelTest, optimized_reshape_inference)
Contributor

@ragmani ragmani Oct 11, 2024

I think this test is a reshape test, not a test of the reshape optimization (probably in-place). It would be better to rename this test and add tests that verify the in-place implementation. If it's difficult to add such tests to the nnfw_api tests, please add gtests in the directory of the implementation instead.

SUCCEED();
}

TEST_F(GenModelTest, optimized_expand_dims_inference)
Contributor

ditto

SUCCEED();
}

TEST_F(GenModelTest, optimized_squeeze_inference)
Contributor

ditto

SUCCEED();
}

TEST_F(GenModelTest, optimized_reshape_reshape_reshape_chain_inference)
Contributor

ditto

SUCCEED();
}

TEST_F(GenModelTest, reshape_input_model_input_inference)
Contributor

ditto

SUCCEED();
}

TEST_F(GenModelTest, reshape_input_model_output_inference)
Contributor

ditto

SUCCEED();
}

TEST_F(GenModelTest, reshape_output_model_output_inference)
Contributor

ditto

@@ -217,6 +239,33 @@ createBackendContexts(compiler::ILoweredGraph &lgraph, bool linear_executor,

// Create contexts
auto whole_op_order = lgraph.graph().topolSortOperations();
const std::unordered_set<std::string> memory_sharing_supported_backends = {"cpu", "builtin"};
Contributor

Please remove the builtin backend from this set. The kernels in the builtin backend handle data transfer between other backends, so there is no need to apply in-place here. The in-place optimization needed there has already been applied in another way.

Contributor Author

I see - I rewrote the implementation to be local to the cpu backend.

mbencer commented Oct 11, 2024

> @mbencer Is it possible to split this PR into 3 PRs, one for each operator (i.e. Reshape, ExpandDims and Squeeze)? If possible, could you please create smaller PRs? It is a bit too big for me to review as a whole.
>
> cc @hseok-oh, @ragmani

@glistening Thank you for the response. Sure, I'll try to split the PR, but let me first address the review comments from @ragmani.
I am not sure that splitting by operator (Reshape, ExpandDims) makes sense here, but splitting into backend and core parts (taking #14057 (comment) into account) should make the review easier ;)

mbencer commented Oct 11, 2024

> @mbencer This seems to be my last review on this work because I will be away from the office for a long time. I'm sorry, I won't be able to give feedback anymore.

I see. Anyway, thank you for the very useful feedback! ;)

@mbencer mbencer requested a review from ragmani October 16, 2024 10:17

mbencer commented Oct 16, 2024

> @mbencer Is it possible to split this PR into 3 PRs, one for each operator (i.e. Reshape, ExpandDims and Squeeze)? If possible, could you please create smaller PRs? It is a bit too big for me to review as a whole.
>
> cc @hseok-oh, @ragmani

I've split part of the implementation into smaller PRs:

| PR link | description |
|---|---|
| #14227 | [onert] Introduce tests for Reshape, Squeeze and ExpandDims |
| #14228 | [onert] Introduce capabilities to find operands which can share memory |
| #14229 | [onert/cpu] [Reshape, ExpandDims] Avoid copying memory if possible |
| #14230 | [onert] Propagate shared memory operand indexes to cpu backend |

The rest of the changes are tightly interdependent, so I'll push them later.

mbencer commented Oct 17, 2024

Some timing results (50 repetitions) from my dev machine. Note: do NOT treat these as official results:

From current branch

  • mobilenet v2:
MODEL_LOAD   takes 4.141 ms
PREPARE      takes 11.126 ms
EXECUTE      takes 6.264 ms
- MEAN     :  6.264 ms
- MAX      :  7.880 ms
- MIN      :  6.060 ms
- GEOMEAN  :  6.260 ms
  • mnist
MODEL_LOAD   takes 0.186 ms
PREPARE      takes 1.314 ms
EXECUTE      takes 0.220 ms
- MEAN     :  0.220 ms
- MAX      :  1.688 ms
- MIN      :  0.153 ms
- GEOMEAN  :  0.195 ms

From master

  • mobilenet v2:
MODEL_LOAD   takes 4.051 ms
PREPARE      takes 11.240 ms
EXECUTE      takes 6.298 ms
- MEAN     :  6.298 ms
- MAX      :  8.260 ms
- MIN      :  6.031 ms
- GEOMEAN  :  6.292 ms
  • mnist
MODEL_LOAD   takes 0.210 ms
PREPARE      takes 1.353 ms
EXECUTE      takes 0.233 ms
- MEAN     :  0.233 ms
- MAX      :  1.711 ms
- MIN      :  0.151 ms
- GEOMEAN  :  0.204 ms

Conclusion: in these runs, preparation time on this branch is about 1% lower than on master for mobilenet v2 and almost 3% lower for mnist. Mean execution time is about 0.54% lower for mobilenet v2 and about 5.6% lower for mnist.
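
The percentages can be re-derived from the timings listed above with a plain relative-change formula. The helper below is only an illustrative sketch (relativeChangePercent is not part of any benchmark tool mentioned here); a negative result means the branch measured lower than master.

```cpp
#include <cassert>

// Relative change of the branch timing with respect to master, in percent.
// Negative values mean the branch measured lower (faster) than master.
inline double relativeChangePercent(double master_ms, double branch_ms)
{
  assert(master_ms > 0.0);
  return (branch_ms - master_ms) / master_ms * 100.0;
}
```

For example, the mean EXECUTE time of mobilenet v2 gives relativeChangePercent(6.298, 6.264), roughly -0.54, and mnist gives relativeChangePercent(0.233, 0.220), roughly -5.6. Given the spread between MIN and MAX in the same tables, differences of this size are close to run-to-run noise.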

mbencer commented Oct 21, 2024

@hseok-oh, @ragmani, @zetwhite If you find a moment, please take a look at the PRs listed above ;)

@zetwhite

> If you find a moment please take a look for PRs to review ;)

Thanks for the notice. I'll take a look :)

@zetwhite

> If you find a moment please take a look for PRs to review ;)
>
> Thanks for the notice. I'll take a look :)

I read the draft and understood the overall direction, so I can review your PRs.
But I'm afraid some runtime members (@Samsung/one_onert) are out of the office until mid-November, so it might be hard to get reviews from others.

Comment on lines 186 to 202
std::vector<ir::OperandIndex> registered_source_ind;
for (const auto &[_, source_ind] : tensor_builder->getSharedMemoryOperandIndexes())
{
  if (ctx.external_operands().contains(source_ind))
    continue;
  if (tensor_builder->isRegistered(source_ind)) // some tensors can have the same source
    continue;
  tensor_builder->registerTensorInfo(source_ind, graph.operands().at(source_ind).info());
  registered_source_ind.emplace_back(source_ind);
}

graph.operands().iterate([&](const ir::OperandIndex &ind, const ir::Operand &obj) {
  if (ctx.external_operands().contains(ind))
    return;
  if (std::find(std::begin(registered_source_ind), std::end(registered_source_ind), ind) !=
      std::end(registered_source_ind)) // skip tensors already registered
    return;
Contributor

@zetwhite zetwhite Oct 30, 2024

While reviewing #14228, I re-read this PR.

I'm a bit confused about this part.
In genTensors(), is it sufficient just to register the source_ind first?

I thought this draft tried to allocate only source_operand and avoid allocating shared_operand.
(source_operand: the operand matching source_ind; shared_operand: the operand matching share_ind)
But I failed to understand how this code removes the allocation of the shared operand.

@mbencer I guess there is something I missed. Could you help me understand?

Contributor Author

Note that registerTensorInfo is only responsible for buildTensor (for both non-const and const tensors).

The second step is calling allocateNonconsts, where we call tensor->setBuffer. Only at this point do we pass the same buffer to the source tensor and the shared tensor. This of course requires special handling of the buffer's lifetime. The lifetime is controlled by StaticTensorManager::claimPlan, called at the first use of the memory buffer, and StaticTensorManager::releasePlan, called at its last use (the graph has to be topologically sorted).

Conclusion: we create tensors for both source_operand and shared_operand; they just share the memory buffer passed via the setBuffer method.

Registering the source tensors first is needed to properly handle cases where a source tensor is constant: in that case the shared tensor also has to be constant (of ExternalTensor type).
Without this additional code we have no guarantee that source operands are processed first.
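
The lifetime rule described above (claim the buffer at its first use, release it only after its last use, with source and shared tensors pointing at the same memory) can be illustrated with a small reference-counting toy. This is only a sketch of the idea and not the StaticTensorManager implementation: ToyTensor, ToySharedPlanner and their members are hypothetical names, and the shared-to-source map is assumed to already point each sharing operand directly at the operand that owns the buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

using OperandIndex = std::uint32_t; // hypothetical stand-in for ir::OperandIndex

// Toy tensor: it does not own memory, it only points at a buffer.
struct ToyTensor
{
  std::uint8_t *buffer = nullptr;
  void setBuffer(std::uint8_t *b) { buffer = b; }
};

// Toy lifetime tracker for buffers used by more than one operand.
class ToySharedPlanner
{
public:
  explicit ToySharedPlanner(std::unordered_map<OperandIndex, OperandIndex> shared_to_source)
    : _shared_to_source(std::move(shared_to_source))
  {
  }

  // Operand that owns the buffer used by `ind` (`ind` itself if it shares nothing).
  OperandIndex sourceOf(OperandIndex ind) const
  {
    auto it = _shared_to_source.find(ind);
    return it == _shared_to_source.end() ? ind : it->second;
  }

  // Call at the first use of `ind`; returns true when the buffer must be claimed now.
  bool claim(OperandIndex ind) { return _live_users[sourceOf(ind)]++ == 0; }

  // Call after the last use of `ind`; returns true when the buffer can be released.
  bool release(OperandIndex ind)
  {
    auto it = _live_users.find(sourceOf(ind));
    assert(it != _live_users.end() && it->second > 0);
    return --it->second == 0;
  }

private:
  std::unordered_map<OperandIndex, OperandIndex> _shared_to_source;
  std::unordered_map<OperandIndex, int> _live_users; // live users per owning operand
};
```

In this toy, if operand 1 shares the memory of operand 0, claiming 0 reserves the buffer, claiming 1 reserves nothing new, and the buffer becomes free only once both have been released. Walking operations in topological order is what makes "first use" and "last use" well defined.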

Contributor

@zetwhite zetwhite Oct 31, 2024

Aha, thank you very much for your kind explanation 👍
I missed the changes in StaticTensorManager.cc. Now I understand it clearly :)

@mbencer mbencer requested a review from zetwhite October 30, 2024 12:28
Comment on lines 60 to 67
if (graph.operands().at(op.getInputs().at(0)).info().isDynamic())
{
  return false;
}
if (graph.operands().at(op.getOutputs().at(0)).info().isDynamic())
{
  return false;
}
Contributor

@hseok-oh hseok-oh Nov 20, 2024

@mbencer Is there any reason not to allow this for dynamic shapes?

Contributor Author

@hseok-oh In general I believe it's possible to handle, but my plan was to implement it separately to limit the scope of this feature. Note that dynamic tensors have a separate build path, DynamicTensorManager::buildTensor. Handling dynamic shapes requires an additional branch there for the case where the source memory tensor is a constant (of ExternalTensor type). The case where the source memory tensor has a static shape also needs investigation: in that case DynamicMemoryManager shouldn't be passed to the tensor ctor. The rest should be even simpler, because dynamic tensors don't reuse common memory (controlled by [static]MemoryManager).

To sum up:
I can either handle it as part of this feature or create a separate issue ;)
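
The static-only restriction discussed in this thread can be summarized as a small eligibility predicate. The sketch below is a simplified model with hypothetical types (ToyOpKind, ToyOperandInfo, ToyOp, canShareMemory); the real check in the PR works on onert's IR and may include further conditions that are not visible in this conversation.

```cpp
// Hypothetical, simplified model of an operation; not onert's IR types.
enum class ToyOpKind
{
  Reshape,
  ExpandDims,
  Squeeze,
  Other
};

struct ToyOperandInfo
{
  bool is_dynamic; // mirrors info().isDynamic() in the reviewed snippet
};

struct ToyOp
{
  ToyOpKind kind;
  ToyOperandInfo input;
  ToyOperandInfo output;
};

// Only operations that change the shape without touching the data qualify.
constexpr bool changesShapeOnly(ToyOpKind kind)
{
  return kind == ToyOpKind::Reshape || kind == ToyOpKind::ExpandDims ||
         kind == ToyOpKind::Squeeze;
}

// Memory sharing is considered only when both input and output are static.
constexpr bool canShareMemory(const ToyOp &op)
{
  return changesShapeOnly(op.kind) && !op.input.is_dynamic && !op.output.is_dynamic;
}
```

Keeping the dynamic case out of the predicate matches the stated plan of limiting the scope of the feature: a dynamic input or output simply falls back to the existing copy path.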

mbencer commented Dec 17, 2024

All parts that handle memory sharing for static tensors are already merged.

@mbencer mbencer closed this Dec 17, 2024