Adds Ascend platform adaptation code #1881
base: devel
ZhengdQin commented Aug 30, 2022
- Add a transfer-to-ascend module. The command "dp transfer-to-ascend mix_precision -i water.pb -o Ascend_transfer.pb" transfers a model to the mixed-precision model Ascend_transfer.pb, which can execute on the Ascend platform.
- Modify the dp test module for the Ascend platform.
- Modify LAMMPS + DeePMD for the Ascend platform.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## devel #1881 +/- ##
==========================================
- Coverage 78.00% 77.47% -0.54%
==========================================
Files 118 116 -2
Lines 9853 10139 +286
==========================================
+ Hits 7686 7855 +169
- Misses 2167 2284 +117
The title should be more precise.
deepmd/utils/network.py
Outdated
b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
if final_layer and GLOBAL_ASCEND_OUT_PRECISION:
@denghuilu It looks similar to `mixed_prec`. However, the weight for `mixed_prec` is cast, but the weight for `GLOBAL_ASCEND_OUT_PRECISION` is not. What do you think about it?
It seems like they have already set the precisions before running the networks:
deepmd-kit/deepmd/utils/transfer_to_ascend.py
Lines 77 to 78 in 58367be
jdata["model"]["descriptor"]["precision"] = "float16"
jdata["model"]["fitting_net"]["precision"] = "float16"
Thanks, we modify the original precision directly. We tried different mixed-precision models on the Ascend platform and found that setting GLOBAL_ASCEND_OUT_PRECISION to float32 (so that only the last biasadd is float32) is important for the accuracy of the Ascend-transferred model, so we cast every weight except the one in the last biasadd.
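The scheme described in this reply can be sketched with NumPy instead of the PR's TensorFlow code: every matrix multiply and hidden-layer bias add runs in float16, and only the final bias add is promoted to float32. The function name `mixed_precision_forward`, the `tanh` activation, and the layer sizes are assumptions made for illustration, not names from the PR.

```python
import numpy as np

def mixed_precision_forward(x, weights, biases, out_precision=np.float32):
    """Hypothetical helper: float16 everywhere except the last bias add."""
    h = x.astype(np.float16)
    last = len(weights) - 1
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w.astype(np.float16)  # matmul stays in float16
        if i == last:
            # Final layer: promote to out_precision before adding the bias.
            h = h.astype(out_precision) + b.astype(out_precision)
        else:
            # Hidden layer: bias add and activation stay in float16.
            h = np.tanh(h + b.astype(np.float16))
    return h

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 1))]
biases = [rng.normal(size=8), rng.normal(size=1)]
out = mixed_precision_forward(rng.normal(size=(3, 4)), weights, biases)
print(out.dtype, out.shape)
```

The output dtype is decided only by the last step, which is why the reply treats the final biasadd separately from the other weights.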
It has the same aim as `mixed_prec`, so it's better to merge these two variables.
Thanks for the comment! Our mixed-precision model is different from the model defined by mixed_prec: only the last biasadd is float32. Using mixed_prec would therefore require changing the code logic and would make it harder to understand.
Could you please add documentation on how to use deepmd-kit on Ascend?
deepmd/train/trainer.py
Outdated
if not self.is_compress:
if self.is_ascend_transfer:
self._init_from_frz_model()
Do we need a new case for ascend transfer? Is it the same as training with the --init-model option?
Thanks for the comment. The Ascend transfer is very similar to training with init-model. We cannot use training with init-model for the following reasons:
- We may add new functions to the Ascend transfer module in the future, and a separate module is easier to extend.
- We cannot use training with init-model directly, since we only build a model without training it. At the same time, we can modify the input.json automatically. This way, we can finish build, freeze, and transfer in one command.
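The automatic input.json modification mentioned in this reply could look roughly like the sketch below. Only the two `precision` assignments come from the PR (the lines of `transfer_to_ascend.py` quoted earlier in the thread); the function name `to_ascend_config` and the sample config are hypothetical.

```python
import copy
import json

def to_ascend_config(jdata):
    """Hypothetical helper: return a copy of the training config with the
    descriptor and fitting-net precision switched to float16."""
    new = copy.deepcopy(jdata)  # leave the caller's config untouched
    new["model"]["descriptor"]["precision"] = "float16"
    new["model"]["fitting_net"]["precision"] = "float16"
    return new

# Illustrative input; a real input.json has many more keys.
original = {"model": {"descriptor": {"precision": "float64"},
                      "fitting_net": {"precision": "float64"}}}
converted = to_ascend_config(original)
print(json.dumps(converted, indent=2))
```

Working on a copy keeps the user's original settings available, which matters if build, freeze, and transfer all run from one command.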
We cannot use training with init-model for the following reasons:
- We may add new functions to the Ascend transfer module in the future, and a separate module is easier to extend.
- We cannot use dp train with init-model directly, since we only build a model without training it. At the same time, we can modify the input.json automatically. This way, we can finish build, freeze, and transfer in one command.
@@ -458,6 +458,7 @@ def model_args ():
doc_sw_rmin = 'The lower boundary of the interpolation between short-range tabulated interaction and DP. It is only required when `use_srtab` is provided.'
doc_sw_rmax = 'The upper boundary of the interpolation between short-range tabulated interaction and DP. It is only required when `use_srtab` is provided.'
doc_compress_config = 'Model compression configurations'
doc_ascend_transfer = 'Model transfer to ascend mix-precision model'
This argument would not be needed if ascend training is the same as `--init-model` training.
Thanks. Ascend transfer is different from init-model training. Please see the detailed explanation in the reply above.
deepmd/utils/network.py
Outdated
b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
if final_layer and GLOBAL_ASCEND_OUT_PRECISION is not None:
I would suggest adding an option `out_precision` to the interface (default=`GLOBAL_TF_FLOAT_PRECISION`), so that not only Ascend is supported.
Note: changing the behavior of the function by a global variable is dangerous.
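As a rough illustration of this suggestion, the sketch below passes the output precision as a keyword argument with a default instead of reading a module-level global. It uses NumPy, not the PR's TensorFlow code; `final_bias_add` is a hypothetical name, and `DEFAULT_PRECISION` is an assumed stand-in for `GLOBAL_TF_FLOAT_PRECISION`.

```python
import numpy as np

# Assumed stand-in for GLOBAL_TF_FLOAT_PRECISION; the real value is
# configured elsewhere in deepmd-kit.
DEFAULT_PRECISION = np.float64

def final_bias_add(h, b, out_precision=DEFAULT_PRECISION):
    """Hypothetical helper: add the last-layer bias in `out_precision`."""
    return h.astype(out_precision) + b.astype(out_precision)

h = np.ones((2, 1), dtype=np.float16)
b = np.zeros(1, dtype=np.float16)
default_out = final_bias_add(h, b)                            # default behavior
ascend_out = final_bias_add(h, b, out_precision=np.float32)   # explicit choice
print(default_out.dtype, ascend_out.dtype)
```

With the argument, each call site states the precision it wants, and callers that pass nothing keep the default behavior.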
Good idea! We have removed the global variable and added the option out_precision to the interface.
deepmd/utils/network.py
Outdated
GLOBAL_ASCEND_OUT_PRECISION,
b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
Move `variable_summaries` out of the if-else?
Thanks, we have fixed it.
source/api_cc/include/DeepPot.h
Outdated
* @brief Initialize the DP.
* @param[in] model The name of the frozen model file.
* @param[in] gpu_rank The GPU rank. Default is 0.
Please update the docstring.
Thanks, we have added the reference.
source/api_cc/include/DeepPot.h
Outdated
* @param[in] model The name of the frozen model file.
* @param[in] gpu_rank The GPU rank. Default is 0.
**/
void init (const std::string & model, const int & nloc, const int & gpu_rank = 0);
Do you require nlocal to be a constant? This is a very strict restriction, as the number of atoms in a local region may change during MD simulations.
Yes, nlocal is a constant on the Ascend platform, because we pad the number of atoms of each type. Since nlocal changes during inference, we increase the value to 1.1 times the original nlocal.
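The padding described in this reply can be sketched as follows. The 1.1 factor and the per-type padding come from the reply; the function names, the round-up behavior, and the example counts are assumptions. Integer arithmetic is used so that the result does not depend on floating-point rounding.

```python
def pad_type_counts(type_count, num=11, den=10):
    """Hypothetical helper: pad each per-type atom count by num/den
    (1.1 by default), rounding up with integer arithmetic."""
    return [(n * num + den - 1) // den for n in type_count]

def fits(type_count, padded):
    """True if every per-type count still fits in its padded slot."""
    return all(n <= p for n, p in zip(type_count, padded))

# Illustrative counts for two atom types.
padded = pad_type_counts([4096, 8192])
print(padded, sum(padded))
print(fits([4200, 8300], padded))  # fluctuation within the padding
print(fits([4600, 8192], padded))  # first type exceeds its padded size
```

A fixed padded size keeps the tensor shapes constant, at the cost of failing once any per-type count grows beyond its padded slot.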
source/lmp/pair_deepmd.cpp
Outdated
type_count[type[ii]-1] ++;
}
deep_pot.init_graph (arg[0], type_count, get_file_content(arg[0]));
deep_pot.init (arg[0], nlocal, get_node_rank());
`nlocal` changes if the number of subregions > 1.
Thanks, we pad the nlocal value, so it is OK as long as the fluctuation stays within the padded value.
Please provide a proper title for this PR.
Is it possible to provide unit tests for the contributed code?
It seems that on non-data-center GPU cards, the transferred model has an impressive speedup. I have tested the new model in a local 1080ti environment and achieved a speedup by a factor of 7.5 (water benchmark system, 12288 atoms): double precision original model
@denghuilu Are they the same model? The output looks different.
Please resolve conflicts
…d models; 4. add lammps code for ascend models;
…ProdEnvMatAMash op and modified some details.
deepmd/entrypoints/transfer.py
Outdated
----------
new_graph : tf.Graph
orginal new graph
Returns :
Returns :
Returns
Thanks, we have fixed it.
----------
feed_dict : dict of tensor
Session original feed_dict includes coord, type, box, mesh, natoms.
t_out : list of tensor
The type and the order are different from those in line 408. Please check which is right.
Thanks, we have fixed it.
No, they are not the same models,
15541f9 to e4e4e91