Submission for #97 #141

JACKHAHA363 · 2019-01-06T15:53:26Z

Submission for #97

JACKHAHA363 · 2019-01-30T00:07:27Z

Hi, where can I see the review?

koustuvsinha · 2019-02-01T00:58:01Z

Hi @JACKHAHA363, reviewer(s) has been just assigned to all the projects, the reviews will be posted by our bot @reproducibility-org in the respective Pull Requests, and you will have the opportunity to correct your submission and respond to the reviewers.

reproducibility-org · 2019-02-21T15:59:07Z

Hi, please find below a review submitted by one of the reviewers:

Score: 5 6 *
Reviewer 3 comment : The report aim to reproduce the policy gradient based benchmarks from the "Countering Language Drift via Grounding" paper. The report analyzes some of the claims and verify(or refute in some cases) them independently. The code has been made available publically.

General questions:

The report-authors mentioned that they discussed the sequence of length on open-review with the paper-authors. I must have missed it while reading the discussion on open review. Could the authors please provide a link to that discussion.

Some questions related to implementation:

Why use vanilla policy gradients only?
The hyper-params tweaked are: lr of agent A and \alpha_ent? Or are there more hyperparams that were optimised for?
Could you highlight some challenges in reproducing the paper and what could have been done to improvement reproducibility.
The pretraining results have not been reproduced to a large extent - do you thoughts about why that might be the case?
Confidence : 3

* Edit: Reviewer updated score based on author feedback

reproducibility-org · 2019-02-22T11:27:54Z

Hi, please find below a review submitted by one of the reviewers:

Score: 7
Reviewer 1 comment : Problem statement
The authors do a good job summarizing the problem addressed in the original paper (the problem of language drift in collaborative translation) and the approach to solving it (reducing language drift by introducing an appropriate grounding task). They identify a likely source of difficulty in reproducing the work (the policy gradient method used to train the collaborative translation) and focus their efforts on showing that this component of the work is reproducible.

In light of this, the authors restrict their analysis to the policy gradient baseline used in the original paper. This seems like a sensible scope for a reproduction study, as it focuses on one aspect of the problem that is frequently difficult to reproduce.

However, this choice also limits the potential usefulness of the reproduction: the main result of the original paper is not that policy gradients produce the phenomenon of language drift (which was anticipated based on results elsewhere in the literature on language learning using self-play), but that the proposed grounding method reliably solves this. Results of this nature would most likely strengthen the impact of the reproduction.

That being said, the authors very clearly specify the scope of the reproduction they are attempting, so I don’t feel this unduly limits the usefulness of this reproduction.

Code
The code included with this submission is reasonably well-organized and readable. The code appears to reimplement the project from scratch.

Communication with original authors
I did not see any evidence of communication with the original authors.

Hyperparameter search
The authors performed a grid search over appropriate policy gradient hyperparameters (the entropy weight and the learning rate). This resulted in improved performance on the policy gradient baseline over the original submission.

Ablation Study
As far as I could tell, no ablations were performed.

Discussion of results
The results are presented thoughtfully, with tables and plots that allow direct and easy comparison between the original and reproduced results. The results appear to be consistent with the original paper. Although somewhat better policy gradient baseline results were obtained, these results did not surpass the performance of the improved method presented in Lee et al 2019. This is consistent with the original paper’s claim (although as noted above, the claim of reproducibility is weakened as the improved method itself is not reproduced in this work).

Recommendations for reproducibility
The authors point out several implementation details that they found important or relevant to getting good results with the model or affected the results of the model, but which were not discussed in the original paper (e.g. length normalization, sharing GRU weights). These details may be very helpful for researchers building on this work in the future.

Overall organization and clarity
The report, results, and code are well-organized and interpretable. I believe this reproduction will be of use to other researchers attempting to build on and understand the work of Lee et al 2019.
Confidence : 3

reproducibility-org · 2019-02-23T11:59:58Z

Hi, please find below a review submitted by one of the reviewers:

Score: 7
Reviewer 2 comment : The authors successfully implemented the approach proposed in the original paper to solving the language drift when using policy gradient methods. With the implementation, they achieved comparable results. Furthermore, the authors found that length normalisation for reward computation is essential to the results, which is not explicitly mentioned in the original paper. Overall, this is a good report, not only reproducing the main results but giving some additional insights as well.
Confidence : 4

JACKHAHA363 · 2019-02-25T23:34:09Z

Repsonse reviewer 3

The report-authors mentioned that they discussed the sequence of length on open-review with the paper-authors. I must have missed it while reading the discussion on open review. Could the authors please provide a link to that discussion.

I rechecked on Openreview and realized that the original comment was set to private. It is public now, and the reviewer should be able to see the exchange.

Why use vanilla policy gradients only?

Using a more advanced policy gradient methods like PPO and TRPO is definitely worth trying, but we think it might be beyond the scope of this report because we want to stay close to the original paper. In addition, this policy gradient method (REINFORCE with learnt value baseline) is also widely employed in current self-play/RL in NLP community[1, 2], so we think confirming the language drift of this method should be representative. That being said, we are aware of this and we also implemented the PPO here https://github.com/JACKHAHA363/language_drift/blob/master/ld_research/training/finetune_ppo.py

The hyper-params tweaked are: lr of agent A and \alpha_ent? Or are there more hyperparams that were optimised for?

Yes. We also try to use a linearly decaying learning rate, but that turns out to converge to suboptimal. The hyperparameters on the Agent B does not make much difference, but maybe we did not perform a thorough enough search on that. The motivation for our focus on learning rate and \alpha_ent is that 1) policy gradient is known to be very sensitive to the learning rate, which is the motivation of methods like TRPO, and 2) the effect of \alpha_ent to the finetune results is discussed by one of the reviewer and the authors. We update the paper to include more details on hyperparameter optimization.

The pretraining results have not been reproduced to a large extent - do you thoughts about why that might be the case?

We think it could be caused by learning rate schedule. In the original paper, the authors does not articulate these, even if they seem to perform , and we use the default one from OpenNMT. We think it could be worthwhile to reproduce the pretrained results, but the main focus of this report would be to confirm the language drift from pre-training to policy gradient fine-tuning on a new corpus.

Response Reviewer 1

However, this choice also limits the potential usefulness of the reproduction: the main result of the original paper is not that policy gradients produce the phenomenon of language drift (which was anticipated based on results elsewhere in the literature on language learning using self-play), but that the proposed grounding method reliably solves this.

Yes we agree that we did not reproduce the main claim of the authors, but in our experiments we have some ongoing results of finetuning using language model, which is implemented here https://github.com/JACKHAHA363/language_drift/blob/master/ld_research/training/finetune.py#L531. We choose to restrict ourselves, since we would like to have a more thorough discussion, which is also kindly noticed in your review.

Reference:
[1] Gao, Jianfeng, Michel Galley, and Lihong Li. "Neural approaches to conversational AI." The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018.

[2] Bahdanau, Dzmitry, et al. "An actor-critic algorithm for sequence prediction." arXiv preprint arXiv:1607.07086 (2016).

JACKHAHA363 · 2019-02-25T23:52:00Z

To all reviewers

Thank you for your time! we just update our paper 7d522be accordingly to include a section highlighting the potential pitfalls and challenge during our attempt to reproduce. @reproducibility-org

JACKHAHA363 · 2019-02-28T23:31:58Z

@koustuvsinha Will the bot update my response?

JACKHAHA363 · 2019-03-12T15:32:46Z

@reproducibility-org Any updates?

JACKHAHA363 · 2019-03-25T21:28:08Z

Updates?

JACKHAHA363 changed the title ~~Submission for BkMn9jAcYQ~~ Submission for #97 Jan 6, 2019

reproducibility-org added the checks-complete Submission criteria checks complete label Jan 7, 2019

JACKHAHA363 force-pushed the master branch 12 times, most recently from 90fabd9 to 5f7590a Compare January 14, 2019 08:15

submit

5b1e741

JACKHAHA363 force-pushed the master branch from e24cf05 to 5b1e741 Compare January 14, 2019 14:56

reproducibility-org added the reviewer-assigned Reviewer has been assigned label Jan 25, 2019

koustuvsinha added reviewer-assigned Reviewer has been assigned and removed reviewer-assigned Reviewer has been assigned labels Feb 1, 2019

reproducibility-org added the review-complete Review is done by all reviewers label Feb 23, 2019

Update to have more repro discussion

7d522be

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Submission for #97 #141

Submission for #97 #141

JACKHAHA363 commented Jan 6, 2019

JACKHAHA363 commented Jan 30, 2019

koustuvsinha commented Feb 1, 2019

reproducibility-org commented Feb 21, 2019 •

edited

Loading

reproducibility-org commented Feb 22, 2019

reproducibility-org commented Feb 23, 2019

JACKHAHA363 commented Feb 25, 2019 •

edited

Loading

JACKHAHA363 commented Feb 25, 2019 •

edited

Loading

JACKHAHA363 commented Feb 28, 2019

JACKHAHA363 commented Mar 12, 2019

JACKHAHA363 commented Mar 25, 2019

Submission for #97 #141

Are you sure you want to change the base?

Submission for #97 #141

Conversation

JACKHAHA363 commented Jan 6, 2019

JACKHAHA363 commented Jan 30, 2019

koustuvsinha commented Feb 1, 2019

reproducibility-org commented Feb 21, 2019 • edited Loading

reproducibility-org commented Feb 22, 2019

reproducibility-org commented Feb 23, 2019

JACKHAHA363 commented Feb 25, 2019 • edited Loading

Repsonse reviewer 3

Response Reviewer 1

JACKHAHA363 commented Feb 25, 2019 • edited Loading

To all reviewers

JACKHAHA363 commented Feb 28, 2019

JACKHAHA363 commented Mar 12, 2019

JACKHAHA363 commented Mar 25, 2019

reproducibility-org commented Feb 21, 2019 •

edited

Loading

JACKHAHA363 commented Feb 25, 2019 •

edited

Loading

JACKHAHA363 commented Feb 25, 2019 •

edited

Loading