GitHub - aa-sikkkk/WebScrape: Web + Command Line Webscraper Tool!

WebScrape is a simple yet powerful Python-based web scraping tool that extracts and stores website data, including titles, anchor tags, images, headings, and paragraphs. The scraped data is stored in a JSON file and can be viewed in tabular form using the BeautifulTable library.

Features

Scrape web pages to extract:
- Title
- Anchor tags (<a>)
- Images (<img>)
- Headings (<h1>, <h2>, <h3>)
- Paragraphs (<p>)
Stores scraped data in a JSON file for future use.
Displays existing scraped websites in a user-friendly table format.
Allows alias names for websites to manage and store scraped data.
Handles multiple websites and retains a history of scrapes.
Available as both a web version and a CLI version (with Ollama integration).
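
The extraction step listed above can be sketched roughly as follows. This is a minimal illustration and not the code in scrap.py: the function name `extract_page_data`, the sample HTML, and the dictionary keys are assumptions chosen for readability. It uses BeautifulSoup with Python's built-in `html.parser` so that lxml is not required to try it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page used only for this illustration.
SAMPLE_HTML = """
<html><head><title>Example Website</title></head>
<body>
  <h1>Welcome</h1>
  <h2>About</h2>
  <p>First paragraph.</p>
  <a href="/docs">Docs</a>
  <img src="/logo.png" alt="Logo">
</body></html>
"""


def extract_page_data(html: str) -> dict:
    """Collect the title, links, images, headings, and paragraphs from HTML."""
    # "html.parser" ships with Python; "lxml" could be passed instead if installed.
    soup = BeautifulSoup(html, "html.parser")

    def texts(tag_name: str) -> list:
        # Visible text of every <tag_name> element, with whitespace trimmed.
        return [tag.get_text(strip=True) for tag in soup.find_all(tag_name)]

    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        # Keep only anchors and images that actually carry a target.
        "links": [a["href"] for a in soup.find_all("a") if a.get("href")],
        "images": [img["src"] for img in soup.find_all("img") if img.get("src")],
        "h1": texts("h1"),
        "h2": texts("h2"),
        "h3": texts("h3"),
        "paragraphs": texts("p"),
    }


if __name__ == "__main__":
    print(extract_page_data(SAMPLE_HTML))
```

For a live page, the HTML string would typically come from `requests.get(url, timeout=10).text` before being passed to the function.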

📔 Google Colab

You can use WebScrape on Google Colab for free. The project uses a Llama model from Hugging Face for data parsing. If you don't have a powerful GPU of your own, you can borrow one (Tesla K80, T4, P4, or P100) on Google's servers for free for a maximum of 12 hours per session. Please use the free resource fairly: do not create sessions back-to-back or run jobs 24/7, as this might get you banned. You can get Colab Pro/Pro+ if you'd like to use better GPUs and get longer runtimes. Usage instructions are embedded in the Colab notebook. Check out the wiki page.

Requirements

Python
requests - For making HTTP requests.
beautifulsoup4 - For parsing the HTML.
lxml - A fast XML and HTML parser.
beautifultable - For displaying scraped data in a table.

Clone the repository, install the dependencies, and run the scraper with the following commands:

git clone https://github.com/aa-sikkkk/WebScrape.git
cd WebScrape

pip install -r requirements.txt

python scrap.py

Data Storage

Scraped data is stored in a JSON file, keyed by alias name, with the following structure:

{
    "scraped_data": {
        "alias_name": {
            "url": "http://example.com",
            "title": "Example Website",
            "all_anchor_href": [...],
            "all_anchors": [...],
            "all_images_data": [...],
            "all_images_source_data": [...],
            "all_h1_data": [...],
            "all_h2_data": [...],
            "all_h3_data": [...],
            "all_p_data": [...],
            "scraped_at": "dd/mm/yyyy hh:mm:ss",
            "status": true,
            "domain": "example.com"
        }
    }
}
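
One way to write an entry into a store of this shape is sketched below. It is an illustration under stated assumptions, not the project's actual implementation: the helper name `save_scrape` and its parameters are hypothetical, `status` is assumed to mean the scrape succeeded, and `domain` is assumed to be the network location of the URL.

```python
import json
from datetime import datetime
from pathlib import Path
from urllib.parse import urlparse


def save_scrape(path, alias, url, fields):
    """Add or overwrite one alias entry in the JSON store and return it.

    `fields` is a dict of already-extracted values (title, headings, ...).
    """
    store_file = Path(path)
    if store_file.exists():
        # Load earlier scrapes so the history is retained.
        store = json.loads(store_file.read_text(encoding="utf-8"))
    else:
        store = {"scraped_data": {}}

    entry = {
        "url": url,
        **fields,
        # Timestamp in the dd/mm/yyyy hh:mm:ss layout shown above.
        "scraped_at": datetime.now().strftime("%d/%m/%Y %H:%M:%S"),
        "status": True,  # assumption: marks a successful scrape
        "domain": urlparse(url).netloc,  # assumption: host part of the URL
    }
    store.setdefault("scraped_data", {})[alias] = entry
    store_file.write_text(json.dumps(store, indent=4), encoding="utf-8")
    return entry


if __name__ == "__main__":
    save_scrape("demo_store.json", "alias_name", "http://example.com",
                {"title": "Example Website"})
```

Keying entries by alias means that scraping the same alias again replaces its entry, while entries for other aliases stay in the file.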

Web Version of the Project.

The web version of the project is powered by Streamlit.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contributing

Feel free to fork the project and submit pull requests! If you encounter any issues, you can open an issue on the repository.

