Random Walk Perturbation using BERT #275
base: main
Conversation
Pytest currently fails, but I am not sure why. I can generate the JSON file using the provided code.
old_sentences = copy.deepcopy(sentences)
assert len(sentences) == k**steps
return sentences
I think that this could be refactored a bit for clarity.
import torch
import torch.nn.functional as F
import copy
import re
import numpy as np
import random
from typing import List


def _mask_word(sentence, word_to_mask, tokenizer):
    """Helper function: replace a word in a sentence with the mask token,
    as preparation for the BERT tokenizer."""
    start_index = sentence.find(word_to_mask)
    return sentence[0:start_index] + tokenizer.mask_token + sentence[
        start_index + len(word_to_mask):]


def get_k_replacement_words(tokenized_text, model, tokenizer, k=5):
    """Return the k most similar words from the model for the masked word in a sentence.
    Args:
        tokenized_text (str): sentence with a word masked out
        model: masked language model
        tokenizer: tokenizer
        k (int, optional): how many similar words to find for the masked word. Defaults to 5.
    Returns:
        [list]: list of top k replacement tokens
    """
    inputs = tokenizer.encode_plus(tokenized_text, return_tensors='pt')
    index_to_mask = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)
    outputs = model(**inputs)
    softmax = F.softmax(outputs.logits, dim=-1)
    mask_word = softmax[0, index_to_mask, :]
    return torch.topk(mask_word, k)[1][0]


def single_sentence_random_step(sentence, tokenizer, model, k=5):
    """For a given sentence, choose a random word to mask and replace it with
    each of the top-k most similar words from the BERT model.
    Return k sentences, each with a different replacement word for the mask.
    Args:
        sentence (str): sentence to perform the random walk on
        tokenizer: tokenizer
        model: masked language model
        k (int, optional): how many replacement words to try. Defaults to 5.
    Returns:
        [list]: k sentences with the masked word replaced by the top-k most similar words
    """
    text_split = re.split('[ ?.,!;"]', sentence)
    # pick a random, non-empty word to mask
    word_to_mask = random.choice(text_split)
    while len(word_to_mask) == 0:
        word_to_mask = random.choice(text_split)
    # mask the chosen word
    new_text = _mask_word(sentence, word_to_mask, tokenizer)
    # get k replacement words for the masked position
    top_k = get_k_replacement_words(new_text, model, tokenizer, k=k)
    # replace the mask token with each of the top-k replacement words
    return [
        new_text.replace(tokenizer.mask_token, tokenizer.decode([token]))
        for token in top_k
    ]


def single_round(sentences: List[str], tokenizer, model, k=5) -> List[str]:
    """For a given list of sentences, do one random-walk step on each sentence.
    Args:
        sentences (List[str]): list of sentences to perform the random walk on
        tokenizer: tokenizer
        model: masked language model
        k (int, optional): how many replacement words to try per sentence. Defaults to 5.
    Returns:
        [List]: list of random-walked sentences
    """
    new_sentences = []
    for sentence in sentences:
        new_sentences.extend(
            single_sentence_random_step(sentence, tokenizer, model, k=k))
    return new_sentences


def random_walk(original_text: str, steps: int, k: int, tokenizer,
                model) -> List[str]:
    """Do `steps` rounds of the random-walk procedure; each round expands
    every sentence into k variants, so the result has k**steps sentences."""
    sentences = [original_text]
    for _ in range(steps):
        sentences = single_round(sentences, tokenizer, model, k=k)
    assert len(sentences) == k**steps
    return sentences
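For reference, a minimal way to exercise this function might look like the following sketch (the checkpoint name, example sentence, and parameters are assumptions for illustration, not part of the PR):

from transformers import BertForMaskedLM, BertTokenizer

# assumption: any masked language model would do; bert-base-uncased is used here for illustration
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

perturbed = random_walk("The movie was surprisingly good.", steps=2, k=3,
                        tokenizer=tokenizer, model=model)
print(len(perturbed))  # expected: k**steps == 9 sentences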
I refactored the code as suggested.
@kaustubhdhole can you explain how the pytest was run by GitHub?
""" | ||
inputs = tokenizer.encode_plus(tokenized_text, return_tensors='pt', truncation=True, max_length = 512) | ||
index_to_mask = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id) | ||
if index_to_mask[0].numel() == 0: # Since we are truncating the input to be 512 tokens (BERT's max), we need to make sure the mask is in these first 512. |
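The excerpt cuts off at that check; one way the branch could be handled is sketched below (this completion is an assumption, not the PR's actual code):

    if index_to_mask[0].numel() == 0:
        # hypothetical fallback: if truncation removed the masked word,
        # return no candidates and let the caller keep the sentence unchanged
        return []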
nitpick: don't let text go over 80 characters in width (break comments into several lines, and run yapf or black).
Returns:
    [list]: k-sentences with masked word replaced with top-k most similar words
"""
#text_split = re.split('[ ?.,!;"]', sentence)
nit: remove comment.
    TaskType.TEXT_CLASSIFICATION
]
languages = ["en"]
don't forget to include keywords: https://github.com/GEM-benchmark/NL-Augmenter/blob/main/docs/keywords.md
like:
keywords = [ "model-based", "lexical", "possible-meaning-alteration", "high-coverage", "high-generations" ]
Hi @sajantanand: We're running only the test cases for light transformations in the GitHub Actions. Since yours is a heavy transformation, you need to run the test locally and check if it passes. Edit: I think you haven't set the heavy flag to True in your transformation, so it is considered light.
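For reference, the flag is a class attribute on the transformation; a minimal sketch, with an illustrative class name rather than the PR's actual one:

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType

class RandomWalkPerturbation(SentenceOperation):
    tasks = [TaskType.TEXT_CLASSIFICATION]
    languages = ["en"]
    heavy = True  # mark as heavy so the light-only GitHub Actions run skips it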
Hi @sajantanand,
I was assigned to review your transformation. I think this should be accepted, but have a few comments and suggestions below.
Correctness: The code passed all the tests.
Interface: Looks good.
Applicable Tasks and Keywords: Languages: good. Tasks: I'm not sure TEXT_TO_TEXT_GENERATION would really work correctly; see my comments below. Keywords: please add!
Specificity: Not specific to a particular task. This could be used for data augmentation; however, it should be noted it is low-precision, and generations are sometimes unnatural.
Novelty: This is not already implemented in NL-Augmenter.
Adding new libraries: sentence-transformers and transformers specified in requirements.txt (with versions).
Description: The README looks good. (But if there are any publications using this kind of transformation, please add them)
Data and code source: You could rename the Extras section of the README to "Data and code provenance", and add license information for the models.
Paraphrasers and data augmenters: while this is low-precision, I still think it is an interesting transformation for data augmentation.
Test cases: 5 cases, good.
Evaluating robustness: not present.
Languages: English only. This could be expanded to other languages in the future.
Documentation: most functions have docstrings, and the ones that don't are short and easy to read.
One suggestion: depending on how this is to be used, it might be helpful to have the option to not change named entities. Actually, as this currently stands, wouldn't there be possible issues when using this for some text to text task such as summarization? (e.g., couldn't it modify some entity in the source but not in the reference, or modify them in different ways?)
I have addressed Roy's comments and added keywords. Now I'll address @juand-r's helpful comments.
As to your suggestion, I don't know of any way to exclude named entities. That being said, I am not very experienced in NLP work, so I'm open to any implementation suggestions.
Thanks for adding the references on random walks for sentence similarity. I thought the list of tasks specifies the kind of tasks this transformation could be used for in practice? So for my comment regarding Task.TEXT_TO_TEXT_GENERATION -- say you have a summarization task, with a dataset of text pairs (source text, reference summary). Then if the transformation changes named entities which appear in both the source and the reference summary, but in an inconsistent way, this would introduce noise if used for data augmentation (and it would not be a reliable evaluation dataset either). This would probably not be as much of an issue with TEXT_CLASSIFICATION (although I suppose it could be sometimes; e.g., say the task is sentiment classification and you accidentally swap the sentiment by modifying words like "good" or "great" to "bad"). You can find a list of named entities (and their character or token offsets) with an off-the-shelf package like spacy. Then you would only need to keep track of which words to exclude when looking for replacements.
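A minimal sketch of that idea, assuming spaCy's small English model (the helper name and its integration point are illustrative, not part of the PR):

import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def pick_word_to_mask(sentence):
    """Pick a random word to mask, skipping any token inside a named entity."""
    doc = nlp(sentence)
    entity_words = {token.text for ent in doc.ents for token in ent}
    candidates = [token.text for token in doc
                  if token.is_alpha and token.text not in entity_words]
    return random.choice(candidates) if candidates else None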
@juand-r Sorry for the long delay in response. I added an option to exclude named entities using spaCy.
@kaustubhdhole @vgtomahawk any chance this PR can get a second review? I added rudimentary evaluation results from Colab.
@kaustubhdhole @mille-s One last bump to try and get a second review. Thanks for doing so much work organizing this!
@kaustubhdhole @mille-s This transformation has one accepting review, and as far as we can tell, the only reason it wasn't merged was that the second reviewer never reviewed it. We think this is an interesting transformation. We understand it's late in the project, but for the sake of closure, could you either give this a second review or definitively reject it? Thanks!
@james-simon @sajantanand Sorry, very busy times! It looks like this transformation is part of a small group that escaped our radar, really sorry about that. At this point we have carried out the analysis and plan to finalise the paper soon, so it is a bit difficult to add new things, I'm afraid. I think that at this point the best option is to merge it later (we're actually running a GEM hackathon these days to add a lot of transfos/filters within the next weeks); since you have other accepted perturbations and are thus co-authoring the paper, I hope this is not too much of an inconvenience. @kaustubhdhole does this sound good?
@mille-s Sounds good to me! Let me know if there are any changes that need to be made before this is eventually merged.
We use masked token prediction to perform a random walk on a sentence; i.e., we choose a word and block it out, relying on a language model like BERT to determine what word should be placed in that blank. By repeating this several times, we can reach a new sentence that is perturbed from the original one.
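A single step of this procedure can be illustrated with the Hugging Face fill-mask pipeline (the checkpoint and example sentence below are illustrative; the PR's own implementation is the random_walk code above):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # assumed checkpoint

sentence = "The quick brown fox jumps over the lazy dog."
masked = sentence.replace("quick", fill_mask.tokenizer.mask_token)  # mask one chosen word
for candidate in fill_mask(masked, top_k=5):
    print(candidate["sequence"])  # five single-step neighbours of the original sentence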