Adding ByT5 notebook #13

Open
mapmeld opened this issue Jun 28, 2021 · 2 comments

mapmeld commented Jun 28, 2021

Hi! I used your notebook as a starting point for fine-tuning a T5-based model (ByT5) with the latest versions of PyTorch Lightning, Transformers, etc. I also load the data with the Datasets library instead of downloading it from Stanford, so it's a little more adaptable. Feel free to update it, or let me know if it can be added as a new example notebook.

https://colab.research.google.com/drive/1syXmhEQ5s7C59zU8RtHVru0wAvMXTSQ8

janyfe commented Jul 29, 2021

Hi @mapmeld. I've run your notebook. The fine-tuned byt5-small model always generates the 'negative' target, which gives a test accuracy of 0.5. When I switch to t5-base, the fine-tuned model's behaviour and metrics become reasonable (test accuracy is around 0.8). Do you have any idea what is wrong with the ByT5 fine-tuning?

By the way, I have one suggestion: instead of slicing the decoded outputs, you can use tokenizer.decode(ids, skip_special_tokens=True).
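
To illustrate why skipping special tokens is more robust than slicing the decoded string, here is a minimal pure-Python sketch. It is a toy stand-in, not the Transformers implementation: encode and decode are hypothetical helpers, and the sketch assumes the ByT5 convention that ids 0, 1 and 2 are the pad, eos and unk tokens and that each UTF-8 byte b maps to id b + 3.

```python
# Toy stand-in for the ByT5 vocabulary (assumption: ids 0/1/2 are
# pad/eos/unk and every UTF-8 byte b maps to id b + 3).
SPECIAL_TOKENS = {0: "<pad>", 1: "</s>", 2: "<unk>"}
BYTE_OFFSET = len(SPECIAL_TOKENS)


def encode(text):
    """Hypothetical helper: text -> byte ids followed by the eos id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [1]


def decode(ids, skip_special_tokens=False):
    """Hypothetical helper mimicking the effect of skip_special_tokens."""
    pieces = []
    buf = bytearray()
    for i in ids:
        if i in SPECIAL_TOKENS:
            # Flush the bytes collected so far, then handle the special token.
            pieces.append(buf.decode("utf-8", errors="ignore"))
            buf = bytearray()
            if not skip_special_tokens:
                pieces.append(SPECIAL_TOKENS[i])
        elif BYTE_OFFSET <= i < BYTE_OFFSET + 256:
            buf.append(i - BYTE_OFFSET)
        # ids outside the byte range (e.g. sentinel tokens) are ignored here
    pieces.append(buf.decode("utf-8", errors="ignore"))
    return "".join(pieces)


# A generated sequence typically starts with the pad id (decoder start token)
# and may be right-padded after eos when decoded as part of a batch.
generated = [0] + encode("negative") + [0, 0]

raw = decode(generated)
clean = decode(generated, skip_special_tokens=True)
print(repr(raw))
print(repr(clean))
```

With the raw string, the number of characters to slice off depends on how much padding the batch produced, whereas the filtered version does not. With the real tokenizer, the equivalent call is the one suggested above: tokenizer.decode(ids, skip_special_tokens=True).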

jijo7 commented Apr 4, 2023

Hi @janyfe, I would appreciate it if you could let me know how I can use this code for my IMDB dataset, which is in the following format:

import pandas as pd

# train data: columns are index, sentiment, review (the review may itself contain commas)
with open("train.csv", "r") as f:
    lines = [line.strip().split(",") for line in f]
# re-join everything after the second comma into a single review field
lines = [[line[0], line[1], ",".join(line[2:])] for line in lines]
train = pd.DataFrame(lines[1:])  # skip the first (header) line
train = train.drop(train.columns[0], axis=1)  # drop the index column
print("\ntrain set size:", train.shape)
print("\nNumber of positives: ", train[1].astype(int).sum())
train = train.rename(columns={1: "sentiment", 2: "review"})
imdb_reviews = train["review"]
sentiments = train["sentiment"].astype(int).tolist()  # labels as a list of ints

# test data: columns are id, review (no sentiment column)
with open("test.csv", "r") as f:
    lines = [line.strip().split(",") for line in f]
# re-join everything after the first comma into a single review field
lines = [[line[0], ",".join(line[1:])] for line in lines]
test = pd.DataFrame(lines[1:])  # skip the first (header) line
id_test = test[0]
print("\ntest set:", test.shape)
test = pd.DataFrame(test[1])
print("Number of test sentences: {:,}\n".format(test.shape[0]))
test = test.rename(columns={1: "review"})

The sentiments are 0 or 1. Also, my test set does not include the associated sentiments, i.e., it has no labels.

Best,
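
As a rough sketch of how data in this shape could be turned into the (input text, target text) pairs a T5-style model is fine-tuned on, the example below uses only Python's csv module. Several things are assumptions rather than part of the notebook: the helper names read_train and read_test, the "positive" target string (only 'negative' is mentioned in this thread), the header names, and the sample rows, which are made up and read from in-memory strings instead of train.csv and test.csv.

```python
import csv
import io

# Assumed mapping from the 0/1 labels to target strings for the model.
LABEL_TO_TEXT = {0: "negative", 1: "positive"}


def read_train(handle):
    """Hypothetical helper: return (review, target_text) pairs.

    Expects rows of the form index, sentiment, review. Unquoted commas
    inside the review are re-joined, as in the loading code above.
    """
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    pairs = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        review = ",".join(row[2:])
        pairs.append((review, LABEL_TO_TEXT[int(row[1])]))
    return pairs


def read_test(handle):
    """Hypothetical helper: return (ids, reviews) for the unlabeled test set."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    ids, reviews = [], []
    for row in reader:
        if not row:
            continue
        ids.append(row[0])
        reviews.append(",".join(row[1:]))
    return ids, reviews


# Made-up sample data standing in for train.csv and test.csv.
train_csv = (
    "idx,sentiment,review\n"
    '0,1,"Great film, loved it"\n'
    "1,0,Dull, slow and too long\n"
)
test_csv = "id,review\n7,Not bad, actually\n"

train_pairs = read_train(io.StringIO(train_csv))
test_ids, test_reviews = read_test(io.StringIO(test_csv))
print(train_pairs)
print(test_ids, test_reviews)
```

With real files, the handles would come from open("train.csv", newline="") and open("test.csv", newline="") instead of io.StringIO. Because the test set has no labels, the fine-tuned model can only produce predictions for test_reviews (keyed by test_ids); accuracy would have to be measured on a validation split held out from the training pairs.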
