Random Walk Perturbation using BERT #275
base: main
Conversation
Pytest currently fails, but I am not sure why. I can generate the JSON file using the provided code.
old_sentences = copy.deepcopy(sentences)
assert len(sentences) == k**steps
return sentences
I think that this could be refactored a bit for clarity.
import torch
import torch.nn.functional as F
import copy
import re
import numpy as np
import random
from typing import List


def _mask_word(sentence, word_to_mask, tokenizer):
    """Helper function: replace a word in a sentence with the mask token,
    as preparation for the BERT tokenizer."""
    start_index = sentence.find(word_to_mask)
    return sentence[0:start_index] + tokenizer.mask_token + sentence[
        start_index + len(word_to_mask):]


def get_k_replacement_words(tokenized_text, model, tokenizer, k=5):
    """Return the k most similar words from the model for the masked word in a sentence.
    Args:
        tokenized_text (str): sentence with a word masked out
        model: masked language model
        tokenizer: tokenizer
        k (int, optional): how many similar words to find for the masked word. Defaults to 5.
    Returns:
        [list]: list of top k replacement tokens
    """
    inputs = tokenizer.encode_plus(tokenized_text, return_tensors='pt')
    index_to_mask = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)
    outputs = model(**inputs)
    softmax = F.softmax(outputs.logits, dim=-1)
    mask_word = softmax[0, index_to_mask, :]
    return torch.topk(mask_word, k)[1][0]


def single_sentence_random_step(sentence, tokenizer, model, k=5):
    """For a given sentence, choose a random word to mask and replace it with
    each of the top-k most similar words from the BERT model.
    Return k sentences, each with a different replacement word for the mask.
    Args:
        sentence (str): sentence to perform the random walk on
        tokenizer: tokenizer
        model: masked language model
        k (int, optional): how many replacement words to try. Defaults to 5.
    Returns:
        [list]: k sentences with the masked word replaced by the top-k most similar words
    """
    text_split = re.split('[ ?.,!;"]', sentence)
    # pick a random, non-empty word to mask
    word_to_mask = random.choice(text_split)
    while len(word_to_mask) == 0:
        word_to_mask = random.choice(text_split)
    # mask the chosen word
    new_text = _mask_word(sentence, word_to_mask, tokenizer)
    # get k replacement words for the masked position
    top_k = get_k_replacement_words(new_text, model, tokenizer, k=k)
    # replace the mask token with each of the top-k replacement words
    return [
        new_text.replace(tokenizer.mask_token, tokenizer.decode([token]))
        for token in top_k
    ]


def single_round(sentences: List[str], tokenizer, model, k=5) -> List[str]:
    """For a given list of sentences, do one random-walk step on each sentence.
    Args:
        sentences (List[str]): list of sentences to perform the random walk on
        tokenizer: tokenizer
        model: masked language model
        k (int, optional): how many replacement words to try per sentence. Defaults to 5.
    Returns:
        [List]: list of random-walked sentences
    """
    new_sentences = []
    for sentence in sentences:
        new_sentences.extend(
            single_sentence_random_step(sentence, tokenizer, model, k=k))
    return new_sentences


def random_walk(original_text: str, steps: int, k: int, tokenizer,
                model) -> List[str]:
    """Do `steps` rounds of the random-walk procedure; each round expands
    every sentence into k variants, so the result has k**steps sentences."""
    sentences = [original_text]
    for _ in range(steps):
        sentences = single_round(sentences, tokenizer, model, k=k)
    assert len(sentences) == k**steps
    return sentences
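For reference, a minimal way to exercise this function might look like the following sketch (the checkpoint name, example sentence, and parameters are assumptions for illustration, not part of the PR):

from transformers import BertForMaskedLM, BertTokenizer

# assumption: any masked language model would do; bert-base-uncased is used here for illustration
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

perturbed = random_walk("The movie was surprisingly good.", steps=2, k=3,
                        tokenizer=tokenizer, model=model)
print(len(perturbed))  # expected: k**steps == 9 sentences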
I refactored the code as suggested.
@kaustubhdhole can you explain how the pytest was run by GitHub?
""" | ||
inputs = tokenizer.encode_plus(tokenized_text, return_tensors='pt', truncation=True, max_length = 512) | ||
index_to_mask = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id) | ||
if index_to_mask[0].numel() == 0: # Since we are truncating the input to be 512 tokens (BERT's max), we need to make sure the mask is in these first 512. |
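The excerpt cuts off at that check; one way the branch could be handled is sketched below (this completion is an assumption, not the PR's actual code):

    if index_to_mask[0].numel() == 0:
        # hypothetical fallback: if truncation removed the masked word,
        # return no candidates and let the caller keep the sentence unchanged
        return []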
nitpick: don't let text go over 80 characters in width (break comments into several lines, and run yapf or black).
Returns:
    [list]: k-sentences with masked word replaced with top-k most similar words
"""
#text_split = re.split('[ ?.,!;"]', sentence)
nit: remove comment.
    TaskType.TEXT_CLASSIFICATION
]
languages = ["en"]
don't forget to include keywords: https://github.com/GEM-benchmark/NL-Augmenter/blob/main/docs/keywords.md
like:
keywords = [ "model-based", "lexical", "possible-meaning-alteration", "high-coverage", "high-generations" ]
Hi @sajantanand: We're running only the test cases for light transformations in the GitHub Actions. Since yours is a heavy transformation, you need to run the test locally and check if it passes. Edit: I think you haven't set the heavy flag to True in your transformation, so it is considered light.
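For reference, the flag is a class attribute on the transformation; a minimal sketch, with an illustrative class name rather than the PR's actual one:

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType

class RandomWalkPerturbation(SentenceOperation):
    tasks = [TaskType.TEXT_CLASSIFICATION]
    languages = ["en"]
    heavy = True  # mark as heavy so the light-only GitHub Actions run skips it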
Hi @sajantanand,
I was assigned to review your transformation. I think this should be accepted, but have a few comments and suggestions below.
Correctness: The code passed all the tests.
Interface: Looks good.
Applicable Tasks and Keywords: Languages: good. Tasks: I'm not sure TEXT_TO_TEXT_GENERATION would really work correctly; see my comments below. Keywords: please add!
Specificity: Not specific to a particular task. This could be used for data augmentation; however, it should be noted it is low-precision, and generations are sometimes unnatural.
Novelty: This is not already implemented in NL-Augmenter.
Adding new libraries: sentence-transformers and transformers specified in requirements.txt (with versions).
Description: The README looks good. (But if there are any publications using this kind of transformation, please add them)
Data and code source: You could rename the Extras section of the README to "Data and code provenance", and add license information for the models.
Paraphrasers and data augmenters: while this is low-precision, I still think it is an interesting transformation for data augmentation.
Test cases: 5 cases, good.
Evaluating robustness: not present.
Languages: English only. This could be expanded to other languages in the future.
Documentation: most functions have docstrings, and the ones that don't are short and easy to read.
One suggestion: depending on how this is to be used, it might be helpful to have the option to not change named entities. Actually, as this currently stands, wouldn't there be possible issues when using this for some text to text task such as summarization? (e.g., couldn't it modify some entity in the source but not in the reference, or modify them in different ways?)
I have addressed Roy's comments and added keywords. Now I'll address @juand-r's helpful comments.
As to your suggestion, I don't know of any way to exclude named entities. That being said, I am not very experienced in NLP work, so I'm open to any implementation suggestions.
Thanks for adding the references on random walks for sentence similarity. I thought the list of tasks specifies the kind of tasks this transformation could be used for in practice? So for my comment regarding Task.TEXT_TO_TEXT_GENERATION -- say you have a summarization task, with a dataset of text pairs (source text, reference summary). Then if the transformation changes named entities which appear in both the source and the reference summary, but in an inconsistent way, this would introduce noise if used for data augmentation (and it would not be a reliable evaluation dataset either). This would probably not be as much of an issue with TEXT_CLASSIFICATION (although I suppose it could be sometimes; e.g., say the task is sentiment classification and you accidentally swap the sentiment by modifying words like "good" or "great" to "bad"). You can find a list of named entities (and their character or token offsets) with an off-the-shelf package like spacy. Then you would only need to keep track of which words to exclude when looking for replacements.
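A minimal sketch of that idea, assuming spaCy's small English model (the helper name and its integration point are illustrative, not part of the PR):

import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def pick_word_to_mask(sentence):
    """Pick a random word to mask, skipping any token inside a named entity."""
    doc = nlp(sentence)
    entity_words = {token.text for ent in doc.ents for token in ent}
    candidates = [token.text for token in doc
                  if token.is_alpha and token.text not in entity_words]
    return random.choice(candidates) if candidates else None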
@juand-r Sorry for the long delay in response. I added an option to exclude named entities using spaCy.
@kaustubhdhole @vgtomahawk any chance this PR can get a second review? I added rudimentary evaluation results from Colab.
@kaustubhdhole @mille-s One last bump to try and get a second review. Thanks for doing so much work organizing this!
@kaustubhdhole @mille-s This transformation has one accepting review, and as far as we can tell, the only reason it wasn't merged was that the second reviewer never reviewed it. We think this is an interesting transformation. We understand it's late in the project, but for the sake of closure, could you either give this a second review or definitively reject it? Thanks!
@james-simon @sajantanand Sorry, very busy times! It looks like this transformation is part of a small group that escaped our radar, really sorry about that. At this point we have carried out the analysis and plan to finalise the paper soon, so it is a bit difficult to add new things, I'm afraid. I think that at this point the best option is to merge it later (we're actually running a GEM hackathon these days to add a lot of transfos/filters within the next weeks); since you have other accepted perturbations and are thus co-authoring the paper, I hope this is not too much of an inconvenience. @kaustubhdhole does this sound good?
@mille-s Sounds good to me! Let me know if there are any changes that need to be made before this is eventually merged.
We use masked token prediction to perform a random walk on a sentence; i.e., we choose a word and block it out, relying on a language model like BERT to determine what word should be placed in that blank. By repeating this several times, we can reach a new sentence that is perturbed from the original one.
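A single step of this procedure can be illustrated with the Hugging Face fill-mask pipeline (the checkpoint and example sentence below are illustrative; the PR's own implementation is the random_walk code above):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # assumed checkpoint

sentence = "The quick brown fox jumps over the lazy dog."
masked = sentence.replace("quick", fill_mask.tokenizer.mask_token)  # mask one chosen word
for candidate in fill_mask(masked, top_k=5):
    print(candidate["sequence"])  # five single-step neighbours of the original sentence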