This project focuses on fine-tuning BERT models using TensorFlow Hub and Hugging Face Transformers on the Quora Insincere Questions dataset. The dataset contains the question text and a binary target label for each question, and the goal is to classify each question as either sincere or insincere. The dataset, available on Kaggle, comprises over 1 million samples, but due to limited GPU resources only a fraction of the data (approximately 10,000 samples for training and 1,000 samples for validation) was used for fine-tuning.
Data preparation was done in Google Colaboratory: the data was read with Pandas from an archive hosted on Archive.org and then split into training and validation sets using train_test_split, as sketched below.
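A minimal sketch of this loading-and-splitting step is shown below. The Archive.org URL is a placeholder (the exact mirror used is not stated here), the column names follow the Quora Insincere Questions schema ("question_text", "target"), and the split sizes match the sample counts described above; stratification and the random seed are assumptions for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder URL -- substitute the actual Archive.org mirror of the Kaggle CSV.
DATA_URL = "https://archive.org/download/<dataset-id>/train.csv"

df = pd.read_csv(DATA_URL)

# Use a small slice of the ~1.3M rows: ~10,000 for training and ~1,000 for validation.
train_df, valid_df = train_test_split(
    df[["question_text", "target"]],
    train_size=10_000,
    test_size=1_000,
    stratify=df["target"],  # keep the sincere/insincere class ratio in both splits
    random_state=42,
)
```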
The tf.data.Dataset input pipeline was used to convert the question text into the features BERT expects, such as input_ids and attention_mask. The maximum input length was reduced to 128 tokens, rather than BERT's maximum of 512, to speed up processing on the limited GPU. The pipeline also used prefetching to overlap preprocessing with model execution for improved efficiency.
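The sketch below illustrates this tokenization and input pipeline, assuming the train_df split from the previous sketch. The batch size and shuffle buffer are assumptions, since they are not specified above; the 128-token maximum length and prefetching follow the description.

```python
import tensorflow as tf
from transformers import BertTokenizerFast

MAX_LEN = 128
BATCH_SIZE = 32  # assumed batch size

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def encode(texts, labels):
    # Tokenize to fixed-length input_ids / attention_mask tensors.
    enc = tokenizer(
        list(texts),
        max_length=MAX_LEN,
        padding="max_length",
        truncation=True,
        return_token_type_ids=False,
        return_tensors="tf",
    )
    return dict(enc), tf.constant(labels)

features, labels = encode(train_df["question_text"], train_df["target"].values)

train_ds = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(10_000)                  # assumed shuffle buffer size
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)       # overlap preprocessing with model execution
)
```

The validation split is encoded the same way (without shuffling) to produce a valid_ds dataset.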
For the model architecture, the Hugging Face Transformers library was used to import the pre-trained bert-base-uncased model. A classification head was added on top of BERT, consisting of a dense layer with ReLU activation followed by a dropout layer to reduce overfitting, and finally a dense layer with sigmoid activation as the output layer for the binary sincere/insincere prediction.
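A sketch of this architecture using the Keras functional API is shown below. The hidden size, dropout rate, learning rate, and optimizer are assumptions (not stated above); the overall structure of BERT plus a ReLU dense layer, dropout, and a sigmoid output follows the description.

```python
import tensorflow as tf
from transformers import TFBertModel

def build_model(max_len=128):
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

    bert = TFBertModel.from_pretrained("bert-base-uncased")
    # pooler_output: the [CLS] token representation passed through BERT's pooling layer.
    pooled = bert(input_ids, attention_mask=attention_mask).pooler_output

    x = tf.keras.layers.Dense(64, activation="relu")(pooled)  # hidden size is an assumption
    x = tf.keras.layers.Dropout(0.3)(x)                       # dropout rate is an assumption
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed fine-tuning LR
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
```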
The fine-tuned BERT model was evaluated and predictions were generated on the validation set.
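The following sketch shows this evaluation and prediction step, assuming the model, train_ds, and valid_ds objects from the sketches above; the number of epochs and the 0.5 decision threshold are assumptions.

```python
# Fine-tune, then evaluate and predict on the validation split.
model.fit(train_ds, validation_data=valid_ds, epochs=3)  # epoch count is an assumption

loss, accuracy = model.evaluate(valid_ds)
print(f"validation loss={loss:.4f}, accuracy={accuracy:.4f}")

# Sigmoid probabilities -> binary sincere/insincere predictions.
probs = model.predict(valid_ds)
preds = (probs.squeeze() >= 0.5).astype(int)
```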
This project demonstrates that a fine-tuned BERT model can effectively classify questions as sincere or insincere. Using TensorFlow Hub and Hugging Face Transformers, we handled data preparation, built an efficient input pipeline, and constructed a robust model architecture, all while working within the limited GPU resources available in Google Colaboratory.