Adding ByT5 notebook #13

mapmeld · 2021-06-28T04:52:10Z

Hi! I used your notebook as a starting point for fine-tuning a T5-based model (ByT5) with the latest versions of PyTorch Lightning, Transformers, etc. I also load the data with the Datasets library instead of downloading it from Stanford, so it's a little more adaptable. Feel free to update it, or let me know if it can be added as a new example notebook.

https://colab.research.google.com/drive/1syXmhEQ5s7C59zU8RtHVru0wAvMXTSQ8

janyfe · 2021-07-29T15:21:42Z

Hi @mapmeld. I've run your notebook. The fine-tuned byt5-small model always generates the 'negative' target, which leads to a test accuracy of 0.5. When I switch to t5-base, the fine-tuned model's behaviour and metrics become reasonable (test accuracy is around 0.8). Do you have any idea what is going wrong with the ByT5 fine-tuning?

By the way, I have one suggestion: instead of slicing the decoded outputs, you can use tokenizer.decode(ids, skip_special_tokens=True).
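A minimal sketch of that suggestion, for illustration only. It assumes the Transformers `ByT5Tokenizer` class, which is byte-based and can be created without downloading a vocabulary. The `generated_ids` list is a hand-built stand-in for one row of `model.generate()` output, not output from the notebook itself.

```python
from transformers import ByT5Tokenizer

# ByT5 operates on raw UTF-8 bytes, so the tokenizer needs no vocabulary file.
tokenizer = ByT5Tokenizer()

# Stand-in for one generated sequence (assumption): T5-style decoders start
# with the pad token, and the sequence ends with the </s> token.
label_ids = tokenizer("negative").input_ids  # byte ids followed by </s>
generated_ids = [tokenizer.pad_token_id] + label_ids

# Default decoding keeps the special tokens in the string.
raw_text = tokenizer.decode(generated_ids)

# skip_special_tokens=True drops them, so no manual slicing is needed.
clean_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(repr(raw_text), repr(clean_text))
```

With this flag, the decoded prediction can be compared to the target string directly, without assuming a fixed number of leading or trailing special tokens.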

jijo7 · 2023-04-04T21:12:29Z

Hi @janyfe, I would appreciate it if you could let me know how I can use this code with my IMDB dataset, which has the following format:

import pandas as pd

# train data: the first column is dropped, then come sentiment and review;
# the review text may itself contain commas, so everything after the
# second comma is joined back together
with open("train.csv", "r") as f:
    lines = [line.strip().split(",") for line in f]
lines = [[line[0], line[1], ",".join(line[2:])] for line in lines]
train = pd.DataFrame(lines[1:])  # skip the header row
train = train.drop(train.columns[0], axis=1)  # drop first column
print("\ntrain set size:", train.shape)
print("\nNumber of positives: ", train[1].astype(int).sum())
train = train.rename(columns={1: 'sentiment', 2: 'review'})
imdb_reviews = train["review"]
sentiments = train["sentiment"].astype(int).tolist()  # labels as a list of 0/1 ints

# test data: id and review only, no labels
with open("test.csv", "r") as f:
    lines = [line.strip().split(",") for line in f]
lines = [[line[0], ",".join(line[1:])] for line in lines]
test = pd.DataFrame(lines[1:])  # skip the header row
id_test = test[0]
print("\ntest set:", test.shape)
test = pd.DataFrame(test[1])
print("Number of test sentences: {:,}\n".format(test.shape[0]))
test = test.rename(columns={1: 'review'})

The sentiments are 0 or 1. Also, my test set does not include the associated sentiments, i.e., it has no labels.

Best,
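One possible way to bridge the format above to a text-to-text setup is sketched below. It assumes the model is trained to emit the strings 'negative' and 'positive' (the thread only confirms that 'negative' is a target), and that 0 maps to 'negative' and 1 to 'positive'. The names `LABEL_TO_TEXT` and `to_text_pairs` are hypothetical and do not come from the notebook.

```python
# Assumed mapping from the 0/1 labels to target strings (not confirmed by the thread).
LABEL_TO_TEXT = {0: "negative", 1: "positive"}


def to_text_pairs(reviews, sentiments=None):
    """Return (source, target) pairs; target is None when no labels are given."""
    if sentiments is None:
        # Unlabelled test set: keep only the inputs for generation.
        return [(review, None) for review in reviews]
    if len(reviews) != len(sentiments):
        raise ValueError("reviews and sentiments must have the same length")
    return [
        (review, LABEL_TO_TEXT[int(label)])
        for review, label in zip(reviews, sentiments)
    ]


# Toy rows standing in for imdb_reviews / sentiments / test["review"].
train_pairs = to_text_pairs(["great movie, loved it", "boring"], [1, 0])
test_pairs = to_text_pairs(["not sure about this one"])
print(train_pairs)
print(test_pairs)
```

For the unlabelled test set, only the source strings would be fed to the model; the generated text can then be mapped back to 0 or 1 with the inverse of the same dictionary.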
