Commit

Update README.md
satyaaman97 authored Oct 26, 2020
1 parent fb3be0c commit 6902b7e
Showing 1 changed file with 1 addition and 9 deletions.
10 changes: 1 addition & 9 deletions README.md
@@ -1,10 +1,2 @@
# Data-prediction-and-classification-using-BERT-model
Resources Used in the Project:
Our genre classifier ran on Google Colab with a TPU runtime and the free RAM extension that Colab offers, which we needed to store all of the BERT embeddings.
Our authorship and NSP classifier ran on a Google Cloud N1 Ultramem machine with 40 CPUs, 961 GB of RAM, and no GPU, in a TensorFlow 1.15 environment.
Running the Code
The processing.ipynb file contains the code used to process the fantasy datasets. We used the same code base for the realistic datasets, so further realistic datasets could be generated by following the steps outlined below. We have included a zip file that contains all of the .csv files that we used in our experiments, as well as the .txt files that we used for processing.
Creating a new dataset: given a text file, take one of the notebook sections that processes an existing dataset and change the file names to match the .txt file to be read and the target .csv file to be created. Run the notebook to create the new .csv file.
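The .txt to .csv step described above can be sketched roughly as follows. This is a minimal illustration, not the code in processing.ipynb: the function names `split_sentences` and `text_to_csv`, the `text` and `label` column names, and the naive punctuation-based sentence splitting are all assumptions made for the example.

```python
import csv
import re
from pathlib import Path


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    normalized = " ".join(text.split())  # collapse line breaks and repeated spaces
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [p for p in parts if p]


def text_to_csv(txt_path, csv_path, label):
    """Write one row per sentence of txt_path to csv_path, tagged with a fixed label."""
    text = Path(txt_path).read_text(encoding="utf-8")
    sentences = split_sentences(text)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])  # hypothetical column names
        for sentence in sentences:
            writer.writerow([sentence, label])
    return len(sentences)
```

Calling `text_to_csv("my_book.txt", "my_book.csv", "fantasy")` with your own file names would then play the same role as renaming the inputs and outputs in one of the notebook's sections.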
The Full Bert Type Classifier and NSPandAuthorship files can be run directly on our datasets to reproduce our results, provided there is enough RAM to store the BERT embeddings. The comments in these files explain what each part of the code does and should make the experiments easy to modify.
In the NSPandAuthorship.ipynb file, you can switch to the individual dataset test described in the writeup by replacing the df_list input to getFeatures() with a custom list that contains only that particular dataframe. After replacing the list, update the used_labels list; the number of examples can then be increased to use the full dataset. We used a maximum of 10000 examples to get the full datasets when we ran the code.
*** Our figures were generated by collecting data in individual runs and placing it manually in data structures. This is because BERT embeddings use so much RAM that we could not store all of the tests in a single runtime. If you would like to recreate our figures, run each of the tests described in the writeup manually and then plot the data separately.***
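One way to avoid copying numbers by hand between runs is to write each run's results to disk and merge them in a separate plotting session. The sketch below is an assumption about how that could be done, not part of the project's notebooks; `save_run`, `load_runs`, the results directory, and the shape of the metrics dictionary are hypothetical.

```python
import json
from pathlib import Path


def save_run(results_dir, run_name, metrics):
    """Write one run's metrics (a JSON-serializable dict) to its own file."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{run_name}.json"
    path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
    return path


def load_runs(results_dir):
    """Collect every saved run into {run_name: metrics}, ordered by file name."""
    runs = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        runs[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return runs
```

Each test would call `save_run` at the end of its runtime, and a lightweight session that never loads the embeddings would call `load_runs` and pass the merged dictionary to the plotting code.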
Analyzed how BERT embeddings interact with fantasy data by examining these interactions alongside data drawn from a realistic context. The analysis is done through these experiments: genre classifiers and Next Sentence Prediction outputs.
