WebScrape is a simple yet powerful Python-based web scraping tool that allows users to extract and store website data, including titles, anchor tags, images, headings, and paragraphs. The scraped data is stored in a JSON file and can be viewed in a tabular format using the BeautifulTable library.
- Scrape web pages to extract:
  - Title
  - Anchor tags (<a>)
  - Images (<img>)
  - Headings (<h1>, <h2>, <h3>)
  - Paragraphs (<p>)
- Stores scraped data in a JSON file for future use.
- Displays existing scraped websites in a user-friendly table format.
- Allows alias names for websites to manage and store scraped data.
- Handles multiple websites and retains a history of scrapes.
- Available as a web version and as a CLI version (with Ollama integration).
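The extraction step listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the project lists requests, beautifulsoup4, and lxml as dependencies, whereas this sketch swaps in Python's built-in html.parser so that it runs without installing anything and without network access. The names PageExtractor, extract, and SAMPLE are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the title, link targets, image sources, headings, and paragraphs."""

    TEXT_TAGS = {"title", "h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.data = {"title": "", "anchors": [], "images": [],
                     "h1": [], "h2": [], "h3": [], "p": []}
        self._current = None  # text-bearing tag we are currently inside, if any
        self._buffer = []     # text fragments collected for that tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.data["anchors"].append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.data["images"].append(attrs["src"])
        elif tag in self.TEXT_TAGS:
            self._current, self._buffer = tag, []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Join the fragments and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.data["title"] = text
        else:
            self.data[tag].append(text)
        self._current = None


def extract(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return parser.data


# Inline sample page (an assumption, used instead of a live HTTP request).
SAMPLE = """<html><head><title>Example Website</title></head><body>
<h1>Welcome</h1><p>First <a href="/about">about us</a> paragraph.</p>
<img src="/logo.png" alt="logo"><h2>News</h2></body></html>"""

result = extract(SAMPLE)
print(result["title"], result["anchors"], result["images"])
```

In a real run the HTML string would come from an HTTP response rather than an inline sample, and a BeautifulSoup-based version would likely be shorter and more tolerant of malformed markup.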
You can use WebScrape on Google Colab for free. The project uses a Llama model via Hugging Face for data parsing. If you don't have a powerful GPU of your own, you can borrow one (Tesla K80, T4, P4, or P100) on Google's servers for free for a maximum of 12 hours per session. Please use the free resource fairly: do not create sessions back-to-back or run workloads 24/7, as this might get you banned. You can get Colab Pro/Pro+ if you'd like to use better GPUs and get longer runtimes. Usage instructions are embedded in the Colab notebook. Check out the wiki page.
- Python
- requests - For making HTTP requests.
- beautifulsoup4 - For parsing the HTML.
- lxml - A fast XML and HTML parser.
- beautifultable - For displaying scraped data in a table.
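As a rough illustration of how beautifultable can display stored scrapes, the sketch below builds one row per website and renders it. It assumes beautifultable 1.0 or newer (the `rows.append` and `columns.header` interface) and falls back to plain text when the package is not installed. The function names summary_rows and render, the column titles, and the sample record are hypothetical and not taken from the project.

```python
def summary_rows(store):
    """Return one [alias, url, title, scraped_at] row per stored website."""
    return [
        [alias, rec["url"], rec["title"], rec["scraped_at"]]
        for alias, rec in store["scraped_data"].items()
    ]


def render(rows, header=("Alias", "URL", "Title", "Scraped at")):
    """Render rows with beautifultable if available, else as plain text."""
    try:
        from beautifultable import BeautifulTable  # assumed version >= 1.0
    except ImportError:
        # Fallback so the sketch still runs without the dependency.
        lines = [" | ".join(header)]
        lines += [" | ".join(row) for row in rows]
        return "\n".join(lines)
    table = BeautifulTable()
    for row in rows:
        table.rows.append(row)
    table.columns.header = list(header)
    return str(table)


# Sample data in the same shape as the project's JSON file (values are made up).
store = {
    "scraped_data": {
        "example": {
            "url": "http://example.com",
            "title": "Example Website",
            "scraped_at": "31/01/2024 09:05:00",
        }
    }
}

rows = summary_rows(store)
output = render(rows)
print(output)
```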
Clone the repository, install the dependencies, and run the scraper using the following commands:
git clone https://github.com/aa-sikkkk/WebScrape.git
cd WebScrape
pip install -r requirements.txt
python scrap.py
Data Storage
{
  "scraped_data": {
    "alias_name": {
      "url": "http://example.com",
      "title": "Example Website",
      "all_anchor_href": [...],
      "all_anchors": [...],
      "all_images_data": [...],
      "all_images_source_data": [...],
      "all_h1_data": [...],
      "all_h2_data": [...],
      "all_h3_data": [...],
      "all_p_data": [...],
      "scraped_at": "dd/mm/yyyy hh:mm:ss",
      "status": true,
      "domain": "example.com"
    }
  }
}
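A minimal sketch of how a record in this shape could be written under an alias, assuming one JSON file holds all scrapes. The function name save_scrape, the file name data.json, and the sample values are hypothetical; the list fields (all_anchors, all_images_data, and so on) are left out for brevity, and the project's own implementation may differ.

```python
import json
import os
import tempfile
from datetime import datetime
from urllib.parse import urlparse


def save_scrape(path, alias, url, title, now=None):
    """Add or overwrite the record stored under `alias` in the JSON file."""
    store = {"scraped_data": {}}
    if os.path.exists(path):
        # Load earlier scrapes so the history is retained.
        with open(path, encoding="utf-8") as fh:
            store = json.load(fh)

    now = now or datetime.now()
    store["scraped_data"][alias] = {
        "url": url,
        "title": title,
        "scraped_at": now.strftime("%d/%m/%Y %H:%M:%S"),  # dd/mm/yyyy hh:mm:ss
        "status": True,
        "domain": urlparse(url).netloc,
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(store, fh, indent=2)
    return store


# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "data.json")  # hypothetical file name
    save_scrape(db_path, "example", "http://example.com", "Example Website",
                now=datetime(2024, 1, 31, 9, 5, 0))
    store = save_scrape(db_path, "docs", "http://docs.example.com/start", "Docs",
                        now=datetime(2024, 2, 1, 10, 0, 0))

print(sorted(store["scraped_data"]))
```

Keying records by alias means that scraping the same alias again overwrites its earlier entry, while different aliases accumulate side by side in the same file.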
The web version of the project is powered by Streamlit.
This project is licensed under the MIT License. See the LICENSE file for more details.
Feel free to fork the project and submit pull requests! If you encounter any issues, you can open an issue on the repository.