Aas1kk edited this page Sep 21, 2024 · 2 revisions



Follow these steps to use the CLI tool

Requirements

  • Python
  • requests - For making HTTP requests.
  • beautifulsoup4 - For parsing the HTML.
  • lxml - A fast XML and HTML parser.
  • beautifultable - For displaying scraped data in a table.

Ollama is integrated with this tool so that data parsing can be done according to your needs!

Clone the repository, install the dependencies, and start the tool using the following commands:

git clone https://github.com/aa-sikkkk/WebScrape.git
cd WebScrape
pip install -r requirements.txt
python scrap.py
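
If the tool fails to start, it may help to confirm that the packages listed under Requirements were actually installed. The following is a minimal sketch, not part of the WebScrape repository; the helper name `missing_packages` and the mapping `REQUIRED_MODULES` are hypothetical and introduced only for illustration. It relies on the fact that beautifulsoup4 is imported as `bs4`.

```python
from importlib.util import find_spec

# pip package name -> module name used in "import ..." statements.
# Note: beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "beautifultable": "beautifultable",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Any package reported as missing can be installed by running `pip install -r requirements.txt` again from inside the WebScrape directory.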

Data Storage

{
    "scraped_data": {
        "alias_name": {
            "url": "http://example.com",
            "title": "Example Website",
            "all_anchor_href": [...],
            "all_anchors": [...],
            "all_images_data": [...],
            "all_images_source_data": [...],
            "all_h1_data": [...],
            "all_h2_data": [...],
            "all_h3_data": [...],
            "all_p_data": [...],
            "scraped_at": "dd/mm/yyyy hh:mm:ss",
            "status": true,
            "domain": "example.com"
        }
    }
}
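
The structure above can be read back with the standard `json` module. The sketch below is only an illustration of how the fields fit together, under several assumptions: the sample values are made up, the list fields are assumed to hold strings, and the function name `summarize` is hypothetical and not part of the tool. It parses `scraped_at` using the `dd/mm/yyyy hh:mm:ss` layout shown above.

```python
import json
from datetime import datetime

# Made-up sample in the documented shape; real files will contain other values.
SAMPLE = """
{
    "scraped_data": {
        "alias_name": {
            "url": "http://example.com",
            "title": "Example Website",
            "all_anchor_href": ["/about", "/contact"],
            "all_anchors": ["About", "Contact"],
            "all_images_data": ["logo"],
            "all_images_source_data": ["/static/logo.png"],
            "all_h1_data": ["Example"],
            "all_h2_data": [],
            "all_h3_data": [],
            "all_p_data": ["Hello."],
            "scraped_at": "21/09/2024 14:30:00",
            "status": true,
            "domain": "example.com"
        }
    }
}
"""


def summarize(store):
    """Build one summary row per alias stored under "scraped_data"."""
    rows = []
    for alias, record in store["scraped_data"].items():
        # "scraped_at" uses the dd/mm/yyyy hh:mm:ss layout from the schema.
        scraped_at = datetime.strptime(record["scraped_at"], "%d/%m/%Y %H:%M:%S")
        rows.append({
            "alias": alias,
            "domain": record["domain"],
            "links": len(record["all_anchor_href"]),
            "images": len(record["all_images_source_data"]),
            "scraped_at": scraped_at.isoformat(),
            "ok": record["status"],
        })
    return rows


if __name__ == "__main__":
    for row in summarize(json.loads(SAMPLE)):
        print(row)
```

Keying each record by its alias means a previously scraped site can be looked up directly, for example `store["scraped_data"]["alias_name"]["title"]`.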
