My first attempt at creating a web crawler in Python. It traverses an online book shop and grabs book titles and URLs.
I followed the Real Python tutorial at https://realpython.com/web-scraping-with-scrapy-and-mongodb/
While working on this project, I learned how to:
- Set up and configure a project using Scrapy
- Build a working web scraper using Scrapy
- Extract data from websites using CSS selectors
- Store scraped data in a MongoDB database
In addition, the code accounts for the same book being scraped more than once: duplicate items are ignored.
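The duplicate handling can be sketched without a database. The snippet below is a minimal illustration, not the project's actual pipeline. It assumes that a book's URL identifies it uniquely, hashes the URL into an ID, and uses an in-memory store in place of the MongoDB collection. The names compute_item_id and InMemoryBookStore, and the example.com URLs, are hypothetical.

```python
import hashlib


def compute_item_id(url):
    """Derive a stable identifier from a book's URL (assumed to be unique per book)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


class InMemoryBookStore:
    """Hypothetical stand-in for the MongoDB collection, keyed by item ID."""

    def __init__(self):
        self._books = {}

    def add(self, item):
        """Store the item and return True, or return False if it is a duplicate."""
        item_id = compute_item_id(item["url"])
        if item_id in self._books:
            return False  # same URL seen before: ignore it
        self._books[item_id] = item
        return True

    def __len__(self):
        return len(self._books)


store = InMemoryBookStore()
scraped = [
    {"title": "Book A", "url": "https://example.com/catalogue/book-a/index.html"},
    {"title": "Book B", "url": "https://example.com/catalogue/book-b/index.html"},
    {"title": "Book A", "url": "https://example.com/catalogue/book-a/index.html"},
]
results = [store.add(item) for item in scraped]
print(results)     # [True, True, False]
print(len(store))  # 2
```

Hashing the URL gives every book a fixed-length key, so the duplicate check is a single lookup no matter how many books have been stored.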
Python: download it from https://www.python.org/downloads/ or install it from the Microsoft Store.
venv:
pip install virtualenv
Scrapy:
python -m pip install scrapy
MongoDB: download the installer for your platform from https://www.mongodb.com/docs/manual/installation/#mongodb-community-edition-installation-tutorials. You will also need the PyMongo driver, which you can install by running the command below in cmd:
python -m pip install pymongo
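To deploy this project, run the following commands in cmd:
- venv (create the environment first if it does not exist yet, for example with python -m virtualenv venv, then activate it)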
venv\Scripts\activate.bat
or add it to your PATH
- Scrapy
scrapy startproject books
- MongoDB (run these in mongosh, the MongoDB shell, not in cmd)
test> use books_db
switched to db books_db
books_db> db.createCollection("books")
{ ok: 1 }
books_db> show collections
books
books_db>
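Once the database and collection exist, the Scrapy project needs to know where to find them. The fragment below is a sketch of what the relevant part of the project's settings.py might look like, under stated assumptions: MONGO_URI and MONGO_DATABASE are custom setting names (not built into Scrapy), the server is assumed to run locally on MongoDB's default port, and the pipeline class path books.pipelines.MongoPipeline is hypothetical. ITEM_PIPELINES itself is a standard Scrapy setting.

```python
# settings.py (sketch): the values below are assumptions, adjust them to your setup.

# Connection string for a MongoDB server running locally on the default port.
MONGO_URI = "mongodb://localhost:27017"

# Database created in the shell session above.
MONGO_DATABASE = "books_db"

# Hypothetical pipeline class path. Scrapy runs enabled pipelines in ascending
# order of the integer value (conventionally in the 0-1000 range).
ITEM_PIPELINES = {
    "books.pipelines.MongoPipeline": 300,
}
```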
- Scrape dynamically generated content