\documentclass{article}
% if you need to pass options to natbib, use, e.g.:
% \PassOptionsToPackage{numbers, compress}{natbib}
% before loading neurips_2023
% ready for submission
% \usepackage{neurips_2023}
% to compile a preprint version, e.g., for submission to arXiv, add the
% [preprint] option:
% \usepackage[preprint]{neurips_2023}
% to compile a camera-ready version, add the [final] option, e.g.:
% \usepackage[final]{neurips_2023}
% to avoid loading the natbib package, add option nonatbib:
\usepackage[nonatbib]{neurips_2023}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{xcolor} % colors
\usepackage{listings}
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{framed}
\usepackage{geometry}
\usepackage{subfig} % Required for minipages.
\usepackage{float}
\usepackage[square,numbers]{natbib}
\bibliographystyle{abbrvnat}
\title{DD2424 Project - Group 12 \\ Text Generation}
% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.
\author{%
% David S.~Hippocampus\thanks{Use footnote for providing further information
% about author (webpage, alternative address)---\emph{not} for acknowledging
% funding agencies.} \\
% Department of Computer Science\\
% Cranberry-Lemon University\\
% Pittsburgh, PA 15213 \\
% \texttt{[email protected]} \\
% examples of more authors
% \And
Tristan PERROT \\
% Affiliation \\
% Address \\
\texttt{[email protected]} \\
\And
Adrien JOUANNY \\
% Affiliation \\
% Address \\
\texttt{[email protected]} \\
\And
Paul MAUDUIT \\
% Affiliation \\
% Address \\
\texttt{[email protected]} \\
\And
Arthur DEPRET \\
% Affiliation \\
% Address \\
\texttt{[email protected]} \\
}
\begin{document}
\maketitle
\begin{abstract}
% The abstract paragraph should be indented \nicefrac{1}{2}~inch (3~picas) on
% both the left- and right-hand margins. Use 10~point type, with a vertical
% spacing (leading) of 11~points. The word \textbf{Abstract} must be centered,
% bold, and in point size 12. Two line spaces precede the abstract. The abstract
% must be limited to one paragraph.
This project concerns the generation of Lewis Carroll-like sentences using different architectures of neural networks (RNN, LSTM, Transformers). Additionally, we evaluate the results both quantitatively and qualitatively, employing various methods to improve these results. The methods used involve optimizing the hyper-parameters of the networks as well as modifying the input data (such as data augmentation and embedding techniques).
\end{abstract}
\section{Introduction}
The problem we aim to address is natural language generation, commonly used for text prediction/completion, content summarization, and chatbots. A significant challenge in this field is evaluating the generated data, as there are numerous valid outputs and various qualitative and quantitative methods to assess a model and its outputs.
The main objective of this project is to compare different architectures and their effectiveness in generating coherent text. Additionally, we evaluate overall performance metrics, such as inference time and training time. We optimize the results of the architectures through hyperparameter tuning and employ data augmentation to achieve better generalization.
\section{Related Work}
Research in natural language generation has advanced with various neural network architectures \cite{nlpprogress}. Recurrent Neural Networks (RNNs) have been foundational but suffer from limitations like vanishing gradients. Long Short-Term Memory (LSTM) networks address these issues with memory cells and gating mechanisms, handling long-term dependencies better. More recently, Transformer architectures enable parallel processing of text, leading to superior performance and efficiency in generating coherent sentences \cite{grulstm}.
\section{Data}
\subsection{Original Data}
The data used in this project is the entire text of \textbf{Alice's Adventures in Wonderland} by Lewis Carroll, obtained from the Gutenberg Project. This book contains approximately 26,500 words. For our experiments, the text was divided into training, validation, and test sets: the validation set is composed of the second-to-last chapter, the test set is composed of the last chapter, and the training set is the rest of the book.
The preprocessing steps were minimal, focusing primarily on splitting the raw corpus into individual characters, as our initial experiments were conducted at the character level. We created a dictionary to map each character to a unique integer, facilitating the conversion of text into a numerical format suitable for model training. This approach allowed us to capture fine-grained details of the text generation process, though it required careful handling of character sequences to maintain coherence in the generated text.
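As a minimal sketch of this preprocessing step (the file name and variable names are illustrative, not taken from our code base), the character-to-integer mapping can be built as follows:
\begin{lstlisting}[language=Python, breaklines]
# Build a character-level vocabulary and encode the corpus as integers.
with open("alice.txt", encoding="utf-8") as f:   # illustrative file name
    text = f.read()

chars = sorted(set(text))                        # unique characters in the corpus
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

encoded = [char_to_idx[c] for c in text]         # text as a sequence of integers
decoded = "".join(idx_to_char[i] for i in encoded)
assert decoded == text                           # the mapping is lossless
\end{lstlisting}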
\subsection{Data augmentation}
As suggested in the guidance, we implemented data augmentation, using the articles \cite{kaggle} \cite{dataaug} as guidance. There are multiple types of data augmentation, such as random insertion, random deletion, character replacement, back translation, and noise injection. In our situation we chose synonym replacement. This method is relatively basic: it consists of randomly replacing some words in a sentence with a synonym. The appendix on data augmentation (Experiment 8) shows some examples of these replacements. We chose to augment the text twice; concretely, this increases the number of unique words in the vocabulary from 3181 to 5628, a 77\% increase. We could have increased the number of reformulations further, but it takes considerably more time for results that are not better. As we will see later, increasing the number of reformulations makes us lose the author's tone. Moreover, increasing the number of words also increases the time needed for training.
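The sketch below illustrates the kind of WordNet-based synonym replacement described above; it is a simplified illustration (the function name and replacement probability are our own assumptions), not our exact augmentation script:
\begin{lstlisting}[language=Python, breaklines]
import random
from nltk.corpus import wordnet  # requires nltk.download('wordnet')

def synonym_replace(sentence, p=0.2):
    """Replace each word by a random WordNet synonym with probability p."""
    out = []
    for word in sentence.split():
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(word)
                    for lemma in synset.lemmas()}
        synonyms.discard(word)
        if synonyms and random.random() < p:
            out.append(random.choice(sorted(synonyms)))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_replace("for the hot day made her feel very sleepy and stupid"))
\end{lstlisting}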
\section{Libraries used}
We base our work on the \textbf{PyTorch} library to efficiently train our models on GPUs, saving time. For text processing, we used the \textbf{NLTK} library to retrieve and process the data, a crucial step in training models for text generation.
\section{Methods}
\subsection{Training the networks}
Initially, we aimed to begin with the simplest foundation to facilitate comparisons with other architectures. For nearly all the networks, we utilized the Adam optimizer \cite{kingma2017adam} and an embedding layer. The Adam optimizer aids in determining the optimal learning rate throughout the training process, while the embedding layer provides a more effective representation of the inputs in a fixed dimension.
For training, we also used dropout and layer normalization to obtain better results and improved generalization.
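A minimal PyTorch sketch of this kind of model and training step is shown below; the class name and hyperparameter values are illustrative (the vocabulary size in particular is a placeholder), not our exact implementation:
\begin{lstlisting}[language=Python, breaklines]
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Embedding -> multi-layer LSTM -> LayerNorm -> Linear output."""
    def __init__(self, vocab_size, emb_dim=100, hidden=512, layers=2, dropout=0.4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.fc = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(self.emb(x), state)
        return self.fc(self.norm(out)), state

vocab_size = 80                                   # placeholder character vocabulary
model = CharLSTM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randint(0, vocab_size, (32, 200))       # batch of 200-character contexts
y = torch.randint(0, vocab_size, (32, 200))       # next-character targets
optimizer.zero_grad()
logits, _ = model(x)
loss = criterion(logits.reshape(-1, vocab_size), y.reshape(-1))
loss.backward()
optimizer.step()
\end{lstlisting}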
\subsection{Generating sequences}
Several methods are available to generate sequences using Recurrent Neural Networks (RNNs) \cite{schmidt2019recurrent}. We experimented with three of them:
\textbf{Deterministic sampling} involves picking the word with the highest probability at each step. This approach produces the most predictable and coherent text, which is especially useful for tasks requiring high reliability. Additionally, it is straightforward to implement and understand. However, it often results in repetitive and monotonous text, as it tends to overuse common phrases and words, leading to less creative outputs.
\textbf{Temperature sampling} scales the probabilities with a temperature factor before sampling from the new distribution. A temperature value below 1 leads to more predictable generations, while a value above 1 results in less predictable, more diverse, but potentially less grammatically correct outputs. This method allows fine-tuning of the randomness in text generation, enabling a balance between coherence and diversity. Its versatility makes it suitable for different types of content, from highly structured to more creative texts. On the downside, finding the optimal temperature can be challenging and may require extensive experimentation. Additionally, high temperatures can lead to nonsensical or grammatically incorrect sentences.
\textbf{Nucleus sampling} (also known as top-p sampling) defines a threshold \(p\) and only keeps the smallest set of words whose cumulative probability is above this threshold. Words are then sampled from this truncated distribution. This method strikes a balance between determinism and randomness, often resulting in coherent yet diverse text. By eliminating unlikely and potentially irrelevant words, it improves the overall quality of the text. However, it is more complex to implement compared to deterministic sampling, and the choice of threshold \(p\) can significantly impact the output, requiring careful fine-tuning.
Each method has its own advantages and trade-offs, making them suitable for different types of text generation tasks. Deterministic sampling is ideal for scenarios requiring high reliability and consistency, temperature sampling offers flexibility in creativity, and nucleus sampling provides a good balance between the two.
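The three strategies can be summarized by the following sketch; the helper functions are illustrative and operate on a vector of next-token logits:
\begin{lstlisting}[language=Python, breaklines]
import torch

def deterministic_sample(logits):
    """Greedy decoding: always pick the most probable token."""
    return int(torch.argmax(logits))

def temperature_sample(logits, temperature=0.8):
    """Rescale the logits by a temperature before sampling."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

def nucleus_sample(logits, p=0.9):
    """Top-p: sample from the smallest set of tokens whose mass reaches p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens that only appear after the cumulative mass has reached p.
    sorted_probs[cumulative - sorted_probs >= p] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
\end{lstlisting}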
\subsection{Evaluation Metrics}
To evaluate the performance of our text generation models, we used several metrics:
\textbf{Loss:} We primarily monitored cross-entropy loss during training, which measures the difference between the predicted and actual word distributions. Lower loss indicates better model predictions.
\textbf{Type-Token Ratio (TTR):} This metric measures lexical diversity by calculating the ratio of unique words (types) to total words (tokens) in the generated text. A higher TTR indicates more diverse vocabulary.
\textbf{Percentage of Correctly Spelled Words:} This metric evaluates the spelling accuracy of the generated text by comparing each word to a standard dictionary, providing a percentage of correctly spelled words.
\textbf{Perplexity:} Perplexity assesses how well the model predicts a word sequence, with lower values indicating better performance. It is the exponentiated average negative log-likelihood of the predicted word sequence.
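For a sequence \(w_1, \dots, w_N\), this corresponds to
\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right),
\]
i.e.\ the exponential of the average cross-entropy loss over the sequence.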
\textbf{BLEU Score:} The BLEU score \cite{10.3115/1073083.1073135} evaluates the quality of generated text by comparing it to reference texts, measuring the precision of matching n-grams. Higher BLEU scores indicate better alignment with human-written text.
Using these metrics, we ensured a comprehensive evaluation of both the training process and the quality of the generated text.
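As an illustration, TTR, spelling accuracy, and perplexity can be computed with a few lines of Python; the helper names are our own and the BLEU computation relies on NLTK (this is a sketch, not our exact evaluation code):
\begin{lstlisting}[language=Python, breaklines]
import math
from nltk.corpus import words as nltk_words        # requires nltk.download('words')
from nltk.translate.bleu_score import sentence_bleu

def ttr(tokens):
    """Type-Token Ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def spelling_accuracy(tokens):
    """Fraction of tokens found in the NLTK English word list."""
    dictionary = {w.lower() for w in nltk_words.words()}
    return sum(t.lower() in dictionary for t in tokens) / len(tokens)

def perplexity(avg_nll):
    """Perplexity is the exponentiated average negative log-likelihood."""
    return math.exp(avg_nll)

generated = "alice was beginning to get very tired".split()
reference = "alice was beginning to get very tired of sitting by her sister".split()
print(ttr(generated), spelling_accuracy(generated))
print(sentence_bleu([reference], generated))
\end{lstlisting}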
\section{Experiments}
In our experiments, we systematically evaluated various aspects of our text generation models. The primary focus was on comparing different architectures, hyperparameters, and regularization techniques to optimize performance. The experiments were structured as follows:
\subsection{Generation Process}\label{Generation study}
As described earlier, there are several methods to generate text, and we compared them.
This study was performed on our best model, consisting of a 2-layer LSTM with a 100-dimensional embedding, a context of 200 characters, 512 hidden nodes per layer, a dropout of 0.4 and layer normalization applied between the LSTM and fully connected layer.
Figure \ref{fig:text_generation} shows the different metrics, compared to a reference text.
We see that the deterministic method has the lowest TTR because it often falls into a repetition loop. Temperature and nucleus sampling clearly outperform the deterministic method, and we also see a slight improvement of nucleus sampling over temperature sampling, with perplexity going down slightly and approaching that of the reference text.
This is why we use nucleus sampling in the following experiments to assess the performance of text generation.
\subsection{Model Architecture Comparison}
We compared the performance of RNN and LSTM networks with one and two layers to evaluate the impact of model complexity and depth on text generation.
LSTM models consistently achieve lower training and validation losses compared to RNNs, as can be seen in Figure \ref{fig:loss_rnn_lstm}, indicating better learning and generalization. The two-layer LSTM converges faster and reaches a lower loss.
In terms of text metrics (Figure \ref{fig:text_rnn_lstm}), RNNs show a higher TTR, suggesting more diverse vocabulary usage. However, LSTMs have higher perplexity, indicating less confident predictions. Despite this, LSTMs generate text with higher spelling accuracy and achieve higher BLEU scores, reflecting closer similarity to the reference text. The two-layer LSTM performs the best overall.
In conclusion, while LSTMs exhibit higher perplexity, they outperform RNNs in terms of loss reduction, text coherence, and accuracy. The two-layer LSTM, in particular, demonstrates superior performance, making it the preferred model for text generation tasks in this comparison.
\subsection{Hidden Nodes Analysis}
We studied the impact of the number of hidden nodes in the network to balance model complexity and computational efficiency.
Figure \ref{fig:loss_hidden} shows that increasing the number of hidden nodes generally improves validation loss but also increases the risk of overfitting, as indicated by the rise in validation loss after several epochs. To mitigate this, regularization techniques are necessary. Additionally, training larger models is computationally intensive, necessitating careful consideration of resource constraints.
In terms of text metrics, we see in Figure \ref{fig:text_hidden} that the TTR score increases with more hidden nodes, indicating more diverse vocabulary usage. Perplexity decreases with more hidden nodes, showing improved model confidence and coherence. Spelling accuracy is highest for the 128-node model, suggesting better vocabulary learning. The BLEU score is relatively high across all models, with smaller models achieving the highest scores.
In conclusion, while increasing hidden nodes can enhance performance, it requires regularization to prevent overfitting and involves higher computational costs. An optimal range of hidden nodes, particularly around 256, appears to balance these factors well.
\subsection{Hyperparameter Tuning}
We explored how batch size affects the learning process. Batch size determines the number of samples processed before updating the model parameters.
As shown in Figure \ref{fig:loss_batch}, increasing the batch size speeds up the training process (in terms of update steps); however, the validation loss tends to increase with larger batch sizes. Training with a lower batch size results in slower training but generally leads to better validation performance.
In terms of text generation metrics, we observe in Figure \ref{fig:text_batch} that a batch size of 256 achieved the highest TTR score, indicating more diverse vocabulary usage. Despite having the highest perplexity, this batch size also attained the highest BLEU score, suggesting that it generated text most similar to the reference. The BLEU score and spelling accuracy did not vary significantly with batch size.
In conclusion, while larger batch sizes accelerate training, they may lead to higher validation loss. A batch size of 256 strikes a balance, achieving high diversity and similarity to the reference text.
\subsection{Regularization Technique}
To address overfitting, we experimented with dropout, a regularization technique that randomly sets a fraction of input units to zero at each update during training. This helps prevent the network from becoming too specialized to the training data.
As shown in Figure \ref{fig:loss_dropout}, applying dropout increases the training loss slightly but significantly reduces the validation loss and prevents it from rising too quickly.
In terms of text generation metrics, higher dropout values lead to lower TTR scores according to Figure \ref{fig:text_dropout}, indicating reduced diversity in the generated text. However, dropout improves the percentage of correctly spelled words and slightly increases the BLEU score.
Thus, it is important to find a balance in dropout. For our best model in Experiment 1 (\ref{Generation study}), we selected a dropout value of 0.3 to balance these factors effectively.
\subsection{Number of layers}
We investigated whether increasing network depth improves performance by training 2-, 3-, and 4-layer networks.
In terms of validation loss (Figure \ref{fig:loss_nb_layers}), deeper networks resulted in higher validation losses, indicating increased overfitting with more layers.
However, the text generation metrics present a mixed picture, as can be seen in Figure \ref{fig:text_nb_layers}. Deeper networks achieved higher BLEU scores and slightly better spelling accuracy, but had lower TTR scores, indicating reduced vocabulary diversity.
While adding regularization techniques might improve the performance of deeper networks, this is beyond the scope of this study.
% \subsection{Regularization Techniques}
% Additionally, layer normalization was implemented to normalize the inputs of each layer, stabilizing the learning process and improving model generalization.
\subsection{Transformer Network}
We developed a transformer-based architecture \cite{vaswani2023attention} comprising an encoder and decoder, utilizing multihead attention mechanisms. The resulting model consists of 2.5 million trainable parameters. In this implementation, we used words as tokens for the input rather than individual characters. This approach proved to be one of our most effective architectures for text generation.
Even though the generated text reads very well to a human, we also report the quantitative metrics in Figure \ref{fig:text_transformer}, comparing the transformer with our best LSTM.
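A minimal word-level encoder-decoder sketch in the spirit of this architecture is given below; the dimensions are illustrative and do not correspond to our exact 2.5-million-parameter configuration:
\begin{lstlisting}[language=Python, breaklines]
import torch
import torch.nn as nn

class WordTransformer(nn.Module):
    """Word-level encoder-decoder transformer (illustrative sizes)."""
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, max_len=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)      # learned positional encoding
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.fc = nn.Linear(d_model, vocab_size)

    def embed(self, x):
        positions = torch.arange(x.size(1), device=x.device)
        return self.emb(x) + self.pos(positions)

    def forward(self, src, tgt):
        # Causal mask so each target position only attends to previous words.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=tgt_mask)
        return self.fc(out)

vocab_size = 3181                                      # word-level vocabulary size
model = WordTransformer(vocab_size)
src = torch.randint(0, vocab_size, (8, 32))            # batch of context word indices
tgt = torch.randint(0, vocab_size, (8, 32))            # shifted target word indices
logits = model(src, tgt)                               # shape (8, 32, vocab_size)
\end{lstlisting}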
\subsection{Data Augmentation}
Finally, we explored data augmentation methods to increase the diversity of our training data. This involved generating additional training samples by applying various transformations to the original data, which aimed to improve the model's robustness and its ability to generalize to unseen data.
% We tried to use data augmentation on different model from a simple RNN model, we can understand that it increase the loss from 5.66 to 5.87. But when we analyse the synthesized text qualitatively, a few things have changed:
% \begin{itemize}
% \item the model is able to create correct sentences
% \item the model is less repetitive
% \item the model is less representative of the book
% \end{itemize}
% We will also next implement it on all our best models.
% We can understand that in every case the loss is higher, as we augment the dataset we are using words that have the same meaning (roughly) but that would not be used by the author so it makes the result more accurate in general but less accurate when we think about writing as the author of the dataset.
We trained a network with the same parameters as our best model on an augmented dataset.
While the training loss was slightly higher, we observed benefits in text generation. As shown in Figure \ref{fig:text_augmentation}, the Type-Token Ratio (TTR) score decreased, indicating reduced diversity. However, we achieved higher spelling accuracy and an improved BLEU score.
The higher loss can be attributed to the use of augmented data, where words with similar meanings are used but may not align with the author's original style. This leads to more general accuracy but less fidelity to the author's writing style.
To further improve results with data augmentation, implementing contextualized word embeddings could be beneficial \cite{wordrep}.
\section{Results}
The first models were really not interesting, and we obtained almost random generations (although some words were recognizable, even though it was a simple character-level RNN), such as:
\begin{lstlisting}[breaklines]
Alice you, yot e, and the n all, S rs. Houcl, int you, ife of thilled them!ns lillillis
mbis lyou, 'Why en and ang,' and Alich yealligi, th af
mers the med Alichem I tou Hic you armnerie youglaidow yriericemealongNlm every wille, you, and the hrick, you, allown and youadveen ilillembell fong th you, ifutilleem ed solllow!' mear ther shng Mer and QWr ace,' th?lllairie
'it,it,' Ilar ter her wou, inl,' h
mem doingn it thise you, thing I rig
er ma, as wast ulilled nchilled headved Iled Alring!' A
\end{lstlisting}
With our best LSTM model we obtained:
\begin{lstlisting}[breaklines]
Alice was a little house, and said 'That's very curious.'
'I won't interrupt at this morning, I suppose?' said Alice.
'It doesn't matter which way you go,' said the Mock Turtle a little three of the jurymen.
'It isn't a fine doesn't matter much,' thought Alice, 'to be a serpents to go by this before, never!'
The Mock Turtle sighed deeply, and the long grass rustled at her arm, and walked off.
'They lived on saying "Come up, and make your face, and the poor little thing!' said the Queen.
'It isn't mad.'
'It's the jury-box,' said Alice.
\end{lstlisting}
Finally, the transformer-based model produced the following output:
\begin{lstlisting}[breaklines]
Alice was a good deal worse off than before, as the March Hare had just upset the milk-jug into his plate. Alice did not wish to offend the Dormouse again, so she began very cautiously: 'But I don't understand. Where did they draw the treacle from?' 'You can draw water out of a water-well,' said the Hatter; 'so I should think you could draw treacle out of a treacle-well--eh, stupid?' 'But they were IN the well,' Alice said to the Dormouse, not choosing to notice this last remark. 'Of course they were', said the Dormouse; '--well in.' This answer so confused poor Alice,
\end{lstlisting}
\section{Conclusion}
In this paper, we implemented and compared multiple architectures. We evaluated their performance by analyzing loss, training time, and various metrics designed to assess the quality and accuracy of the generated text. Additionally, we applied data augmentation techniques to enhance the generalizability of our results. Ultimately, our findings indicate that both a well-optimized LSTM and a transformer model yield exceptional outcomes, closely resembling human-generated text.
\newpage
\bibliography{ref}
\newpage
\section*{Appendix: Experiment 1 Text generation methods}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{figures/text_generation.png}
\caption{Comparison of text generation methods}
\label{fig:text_generation}
\end{figure}
\section*{Appendix: Experiment 2 RNN vs LSTM}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{figures/RNN_LSTM_comparison.png}
\caption{Losses of RNN and LSTM with 1 or 2 hidden layers}
\label{fig:loss_rnn_lstm}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{figures/text_rnn_vs_lstm.png}
\caption{RNN vs LSTM with 1 or 2 hidden layers}
\label{fig:text_rnn_lstm}
\end{figure}
\section*{Appendix: Experiment 3 Number of hidden nodes}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{figures/LSTM_hidden_sizes.png}
\caption{Losses of LSTM when varying hidden size}
\label{fig:loss_hidden}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{figures/text_hidden_size.png}
\caption{Influence of hidden size on text generation}
\label{fig:text_hidden}
\end{figure}
\section*{Appendix: Experiment 4 Batch Size influence}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{figures/LSTM_batch_size.png}
\caption{Losses of LSTM when varying batch size}
\label{fig:loss_batch}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{figures/text_batch_size.png}
\caption{Influence of batch size on text generation}
\label{fig:text_batch}
\end{figure}
\section*{Appendix: Experiment 5 Regularization with Dropout}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{figures/LSTM_dropout.png}
\caption{Losses of LSTM when varying the dropout rate}
\label{fig:loss_dropout}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{figures/text_dropout.png}
\caption{Influence of the dropout rate on text generation}
\label{fig:text_dropout}
\end{figure}
\section*{Appendix: Experiment 6 Number of Layers}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{figures/LSTM_nb_layers.png}
\caption{Losses of LSTM when varying the number of hidden layers}
\label{fig:loss_nb_layers}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{figures/text_nb_layers.png}
\caption{Influence of the number of hidden layers on text generation}
\label{fig:text_nb_layers}
\end{figure}
\section*{Appendix: Experiment 7 Transformer Network}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{figures/text_transformers.png}
\caption{Comparison of text generation: LSTM vs Transformer}
\label{fig:text_transformer}
\end{figure}
\section*{Appendix: Experiment 8 Data Augmentation}
% \section{Appendix A: Data augmentation example and graphs}
Here are two short examples of the substitutions that were made to augment the dataset.
Example 1 of substitution:
\begin{itemize}
\item Original: for the hot day made her feel very sleepy and stupid
\item Reformulated once: for the hot the made her feel not sleepy years stupid
\item Reformulated twice: for when top day made her feel very sleepy and dumb
\end{itemize}
Example 2 of substitution:
\begin{itemize}
\item Original: and looked at it
\item Reformulated once: and looked at being
\item Reformulated twice: more looked at it
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{figures/text_augmented.png}
\caption{Comparison of text generation: LSTM vs LSTM with Data Augmentation}
\label{fig:text_augmentation}
\end{figure}
And here is what a generated text looks like:
\begin{lstlisting}[breaklines]
alice was all three sides of the house of her and the three gardeners who was the caterpillar took the cook to the thing before she had never doesnt think to alice some called with cat in the way to the baby with a sigh his took nearer in the distance and the party were off if something came with one paw to the mock turtle in the distance but it made me to fancy what the gryphon said was shouted out that she stood well one of the end of its shoes of mouse in silence was a rabbit was over in more little shriek and stupidly until she could not think of an office of the fire and the first same side of the trees had a sort of all the caterpillar and the rabbit was so she did so she began in the air - - what time it to be tried at the confusion that she set off at the mouse began went on of what dark said the pigeon but she sat down again so in the sea so she went on looking at the baby with one eyes and the things growing and said to herself but alice tried to stay down here which he was so suddenly
\end{lstlisting}
\end{document}