This LLM-based Document Summarizer application demonstrates how to download, process, and summarize long documents using the Hugging Face API. It is designed to handle large documents by splitting them into manageable chunks and summarizing each chunk individually. The final result is a cohesive summary of the entire document.
- Downloads and processes HTML content from a given URL (e.g., Project Gutenberg or arXiv).
- Extracts plain text from HTML format for text summarization.
- Splits long documents into smaller chunks for efficient processing by the Hugging Face API.
- Summarizes each chunk using the Hugging Face API (e.g., using the google/pegasus-xsum model).
- Combines all the summarized chunks into a final, cohesive summary.
- MATLAB (version 2021 or later is recommended).
- Hugging Face API key (for using Hugging Face models).
- A text file containing your Hugging Face API key (API_KEY.txt).
- Visit the Hugging Face API Token page and generate your API key.
- Save your API key in a text file named API_KEY.txt in the same directory as the MATLAB script.
- Clone or download the repository.
- Open MATLAB and navigate to the folder containing the script.
- Ensure that the API_KEY.txt file is in the same directory as the script.
- Set the URL of the document you want to summarize.
In the script, set the document URL and the maximum chunk size for text processing. The Hugging Face API key is loaded from API_KEY.txt.
% Set the document URL (replace with your desired document)
url = "https://www.gutenberg.org/files/11/11-h/11-h.htm";
% Set the maximum chunk size (number of characters per chunk)
chunkSize = 3000; % Adjust as needed
% Load your Hugging Face API key
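fileID = fopen('API_KEY.txt', 'r');
if fileID == -1
    error('API_KEY.txt file not found'); % fopen returns -1 when the file cannot be opened
end
apiKey = strtrim(fgets(fileID));
fclose(fileID);
disp('API key successfully loaded.'); % Avoid printing the key itself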
Once you have set up the script and defined the document URL and chunk size, run the MATLAB script. The application will:
- Download the document from the URL.
- Extract plain text from the HTML content.
- Split the document into manageable chunks.
- Summarize each chunk using the Hugging Face API.
- Combine the individual summaries into a final cohesive summary.
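The text-extraction step can be approximated with regular expressions. The snippet below is a minimal sketch, not the script's actual method: the sample HTML string and the variable names are assumptions made for illustration, and in a real run the HTML would come from a call such as webread(url).

```matlab
% Hypothetical sample input; a real run would use html = webread(url).
html = ['<html><head><style>p {color: red;}</style></head>' ...
        '<body><h1>Title</h1><p>First &amp; second.</p></body></html>'];

% Remove script and style blocks together with their contents.
text = regexprep(html, '<script[^>]*>.*?</script>', ' ', 'ignorecase');
text = regexprep(text, '<style[^>]*>.*?</style>', ' ', 'ignorecase');

% Replace every remaining tag with a space so adjacent words stay separated.
text = regexprep(text, '<[^>]+>', ' ');

% Decode two common HTML entities (a full decoder would handle many more).
text = strrep(text, '&nbsp;', ' ');
text = strrep(text, '&amp;', '&');

% Collapse runs of whitespace and trim the ends.
text = strtrim(regexprep(text, '\s+', ' '));
disp(text);
```

A regex approach like this is fragile on malformed markup; it is shown only because it needs no extra toolbox.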
The default model used for summarization is google/pegasus-xsum, but you can modify the summarizeTextHuggingFace function to use a different Hugging Face model if needed. Please refer to the models folder for more details.
url = "https://api-inference.huggingface.co/models/google/pegasus-xsum"; % Default model URL
You can replace google/pegasus-xsum with another available model of your choice.
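To show where the model URL fits into a request, here is a hedged sketch of how a call could be assembled with webwrite and weboptions. The names modelUrl, apiKey and summarize are placeholders introduced for this example, the key shown is a dummy value, and the summary_text response field is an assumption about the API's reply format; the actual summarizeTextHuggingFace function may be written differently.

```matlab
% Placeholder values; swap the model name in modelUrl to try another model.
modelUrl = 'https://api-inference.huggingface.co/models/google/pegasus-xsum';
apiKey = 'hf_your_key_here';   % dummy key, replace with the value from API_KEY.txt

% Send JSON and pass the key as a bearer token in the Authorization header.
options = weboptions( ...
    'MediaType', 'application/json', ...
    'HeaderFields', {'Authorization', ['Bearer ' apiKey]}, ...
    'Timeout', 60);

% Build a reusable handle; nothing is sent until the handle is called.
summarize = @(txt) webwrite(modelUrl, struct('inputs', txt), options);

% Example call (requires a valid key and network access):
% result = summarize('Long passage of text to shorten ...');
% summaryText = result(1).summary_text;   % assumed response field
```

Keeping the URL in a single variable means that changing models only touches one line.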
To reduce the chance of rate-limit or service-unavailable responses (e.g., HTTP 503 errors) from the Hugging Face API, the script pauses for 5 seconds between requests. You can adjust this pause time by modifying the pauseBetweenRequests variable.
pauseBetweenRequests = 5; % Adjust the wait time between requests as needed
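As an illustration of how the pause might sit inside the summarization loop, consider the sketch below. The chunks cell array and the summarizeFn handle are stand-ins invented for this example (the handle just upper-cases its input instead of calling the API), and the pause is shortened so the snippet runs quickly.

```matlab
pauseBetweenRequests = 0.1;   % seconds; shortened here, the script's default is 5
chunks = {'first chunk', 'second chunk', 'third chunk'};   % hypothetical input
summarizeFn = @(txt) upper(txt);   % stand-in for the real API call

summaries = cell(size(chunks));
for i = 1:numel(chunks)
    summaries{i} = summarizeFn(chunks{i});
    % Wait before the next request, but not after the last one.
    if i < numel(chunks)
        pause(pauseBetweenRequests);
    end
end
disp(summaries);
```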
The summarizeTextHuggingFace function uses the Hugging Face API to summarize the input text. It sends the text to the specified Hugging Face model and returns the summary.
The chunking function splits the long text into smaller chunks, each with a specified maximum character count. It tokenizes the input text into sentences and groups them into manageable chunks.
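The sentence-grouping idea can be sketched as follows. This is an illustrative approximation rather than the repository's implementation: the sample text, the small chunkSize, and the punctuation-based sentence split are all assumptions chosen to keep the example short.

```matlab
text = 'Alice was tired. She saw a rabbit! Where did it go? Down the hole it went.';
chunkSize = 40;   % small value for illustration; the script uses 3000

% Split wherever whitespace follows sentence-ending punctuation.
sentences = regexp(strtrim(text), '(?<=[.!?])\s+', 'split');

chunks = {};
current = '';
for i = 1:numel(sentences)
    sentence = sentences{i};
    if isempty(current)
        candidate = sentence;
    else
        candidate = [current ' ' sentence];
    end
    if numel(candidate) <= chunkSize || isempty(current)
        current = candidate;          % sentence still fits (or starts a chunk)
    else
        chunks{end+1} = current;      % close the full chunk
        current = sentence;           % start a new chunk with this sentence
    end
end
if ~isempty(current)
    chunks{end+1} = current;          % flush the last chunk
end
disp(chunks');
```

With these inputs the four sentences are grouped into two chunks of two sentences each. A single sentence longer than chunkSize is kept whole as its own oversized chunk, which is a design choice of this sketch.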
- Download and extract the text of a long document (e.g., a research paper or book) from a URL.
- Split the document into chunks of manageable size.
- Summarize each chunk using the Hugging Face API.
- Combine the summaries to generate a final, concise summary of the entire document.
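The final combining step could look like the sketch below, assuming the per-chunk summaries are held in a cell array of character vectors. The sample summaries and variable names are hypothetical.

```matlab
% Hypothetical per-chunk summaries, including stray spaces and an empty result.
summaries = {' Alice follows a rabbit. ', '', 'She meets the Hatter.'};

% Trim whitespace and drop chunks that produced no summary.
summaries = strtrim(summaries);
summaries = summaries(~cellfun(@isempty, summaries));

% Join the remaining pieces into one summary string.
finalSummary = strjoin(summaries, ' ');
disp(finalSummary);
```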
After running the script, the final summary will be displayed in the MATLAB command window.
{'Alice was beginning to get very tired of sitting by her sister’s bank, and had n'}
{'o pictures or conversations in it, or of Alice’s Adventures in Wonderland. Alice'}
{' cried so hard that her eyes began to water . Alice looked at the Dodo with a pu'}
{'zzled look on her face . Alice spread out her hand and made a snatch in the air '}
{'. The Caterpillar put the hookah into its mouth and began smoking . The Duchess '}
{'of Cambridge was nursing her baby boy in the kitchen , when the cook came into t'}
{'he room , and asked the Duchess if she knew the time it took for the earth to tu'}
{'rn round . Alice looked at the Hatter with a smile on her face. It was a fine da'}
{'y , and Alice was walking by the rose-tree , when she saw Three and Two , and Tw'}
{'o , and Two , and Two , and Two , and Two , and Two , and Two , and Two , and Tw'}
{'o , Alice looked up at the Duchess of Cambridge , and asked what she would say t'}
{'o her if she had the choice . The whiting , the whiting, the whiting, the whitin'}
{'g, the whiting, the whiting, the whiting, the whiting, the whiting, the whiting,'}
{' the whiting, the whiting, the whiting, the whiting, the whiting, the “ I’m a ha'}
{'tter , your Majesty , ” said the Hatter . There was a big smile on the King’s fa'}
{'ce as he saw the tarts on the table .'}
You can view a demonstration of how the MATLAB Document Summarizer works by playing the following GIF:
This GIF shows the process of downloading, chunking, summarizing, and combining the final summary of a document using the Hugging Face API.
- "API_KEY.txt file not found" error: Ensure that the API_KEY.txt file is in the same directory as the MATLAB script, and that it contains a valid Hugging Face API key.
- 503 error or rate limit exceeded: Increase the pause time between requests or check Hugging Face's rate limits.
- Empty summary: If a chunk cannot be summarized, the function will attempt to retry up to three times before returning an error message.
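The retry behavior described for empty summaries might be structured like this sketch. It is an assumption-laden illustration: requestFn and simulatedReplies are invented stand-ins that return an empty result on the first two attempts, and the error text is a placeholder rather than the script's actual message.

```matlab
maxAttempts = 3;
retryPause = 0.1;   % seconds between attempts; shortened for the example

% Stand-in for the API call: empty replies twice, then a usable summary.
simulatedReplies = {'', '', 'Alice meets the Caterpillar.'};
requestFn = @(attempt) simulatedReplies{attempt};

summary = '';
for attempt = 1:maxAttempts
    try
        summary = requestFn(attempt);
    catch err
        fprintf('Attempt %d failed: %s\n', attempt, err.message);
        summary = '';
    end
    if ~isempty(summary)
        break;                    % got a summary, stop retrying
    end
    if attempt < maxAttempts
        pause(retryPause);        % wait before trying again
    end
end
if isempty(summary)
    summary = 'Error: no summary returned after 3 attempts.';
end
disp(summary);
```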
This project is licensed under the Apache-2.0 License - see the LICENSE.md file for details.
- Hugging Face for providing powerful pre-trained models for text summarization.
- MATLAB for offering a versatile environment for text processing and API integration.