Releases: kevinszuchet/data-mining
Checkpoint #3
In this checkpoint you will enrich the data you scraped by accessing a public API of your choice, querying it, and storing the response in the database.
- Find a free public API that has some relevance to your data. You can start here for a list of public APIs.
- Add to your scraper the ability to query this API and store the responses in your database (see the sketch after this checkpoint's list). You are not allowed to use an API wrapper module.
- There should be some link between the data you scrape from the website and the new data you retrieve from the API: it can be new attributes and properties of the data you already have, or new data records that do not exist on the website you scrape. For example, if you scrape data about hashtags on Instagram, you can access the Twitter API and retrieve data about the same hashtags there.
- Make sure you have sufficient logging (to a file) in your project.
- Use this checkpoint to complete unfinished features and fix bugs. The next checkpoints will not involve writing additional code.
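
As an illustration of what this enrichment step can look like, here is a minimal sketch that queries a public API directly with requests (no wrapper module), logs to a file, and stores part of the response in MySQL. The endpoint URL, table and column names, and DB credentials are all hypothetical placeholders; adapt them to your own data source and schema.

```python
import logging

import pymysql  # any MySQL driver works; pymysql is only an example
import requests

# Log to a file, as the checkpoint requires.
logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

API_URL = "https://api.example.com/hashtags"  # hypothetical endpoint


def enrich_hashtag(conn, hashtag):
    """Query the API for a hashtag we already scraped and store the response."""
    logging.info("Querying API for hashtag %s", hashtag)
    response = requests.get(API_URL, params={"q": hashtag}, timeout=10)
    response.raise_for_status()
    payload = response.json()

    with conn.cursor() as cur:
        # Hypothetical table linking the API data to the scraped hashtag.
        cur.execute(
            "INSERT INTO hashtag_api_data (hashtag, tweet_count) VALUES (%s, %s)",
            (hashtag, payload.get("tweet_count")),
        )
    conn.commit()
    logging.info("Stored API data for hashtag %s", hashtag)


if __name__ == "__main__":
    connection = pymysql.connect(
        host="localhost", user="scraper", password="secret", database="data_mining"
    )
    enrich_hashtag(connection, "python")
    connection.close()
```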
Checkpoint #2
Command line interface
- Wrap your web scraper so it can be called with different arguments from the terminal.
- Examples of arguments you can use: different data types to scrape (e.g. hashtags on Instagram, product categories on Amazon), a timespan/date range to scrape (only data that was created in a certain timespan, etc.), and different technical parameters (DB params, number of iterations to scrape, etc.).
- Use the click or argparse package (see the sketch after this list).
- Add documentation for the different CLI arguments to the README.md file, including their default values.
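
For instance, a minimal argparse-based CLI might look like the sketch below. The specific arguments (data type, date filter, iteration count, DB host) are illustrative only, chosen to mirror the examples above; define whatever fits your scraper.

```python
import argparse


def parse_args():
    """Define the scraper's command-line interface (illustrative arguments only)."""
    parser = argparse.ArgumentParser(description="Scrape data from the chosen website.")
    parser.add_argument("data_type", choices=["hashtags", "categories"],
                        help="Which kind of data to scrape.")
    parser.add_argument("--since",
                        help="Only scrape data created after this date (YYYY-MM-DD).")
    parser.add_argument("--iterations", type=int, default=10,
                        help="Number of pages/iterations to scrape (default: 10).")
    parser.add_argument("--db-host", default="localhost",
                        help="MySQL host to store results in (default: localhost).")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Scraping {args.data_type} for {args.iterations} iterations into {args.db_host}")
```

Whatever arguments you choose, remember to document each one and its default value in README.md.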
Database implementation
- Design an ERD for your data. Think about which fields should be primary and foreign keys, and how you distinguish new entries from already existing ones.
- Take notice of primary and foreign keys.
- Write a script that creates your database structure (Python or SQL). It should be separate from the main scraper code, but should be part of the project and submitted as well.
- Add to your scraper the ability to store the data it scrapes in the database you designed. It should store only new data and avoid duplicates (see the sketch after this list).
- Work with a MySQL database.
- If you'd like, you can use ORM tools such as SQLAlchemy.
- Add a DB documentation section to the README.md file, including an ERD diagram and explanations about each table and its columns.
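
A separate schema-creation script and a duplicate-safe insert could look roughly like the sketch below. The table and column names are hypothetical; the point is that a unique key on the natural identifier is what lets INSERT IGNORE skip rows that already exist.

```python
import pymysql

# Hypothetical schema: one table for scraped hashtags.
CREATE_TABLES = """
CREATE TABLE IF NOT EXISTS hashtags (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    post_count INT,
    scraped_at DATETIME
)
"""


def create_database(conn):
    """Create the database structure (kept separate from the scraper itself)."""
    with conn.cursor() as cur:
        cur.execute(CREATE_TABLES)
    conn.commit()


def insert_hashtag(conn, name, post_count, scraped_at):
    """Store a scraped row; INSERT IGNORE skips rows whose unique key already exists."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT IGNORE INTO hashtags (name, post_count, scraped_at) "
            "VALUES (%s, %s, %s)",
            (name, post_count, scraped_at),
        )
    conn.commit()
```

INSERT ... ON DUPLICATE KEY UPDATE is an alternative if you want to refresh existing rows instead of skipping them.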
Checkpoint #1
- Choose a website that contains a lot of data which is not publicly available via an API. The data should be updated frequently, and every data point (a news article, a weather prediction, a movie review) should contain as many details as possible. Be creative!
- Confirm your data source with one of the tech mentors before going to the next step.
- Read through this tutorial, or watch this video (or both), to learn about Python's requests package, which allows you to send requests to and get responses from web servers.
- Read through this tutorial, or watch this video, to learn about Python's beautifulsoup package, which allows you to parse the response.
- Code your own web scraper. You're required to adhere to conventions and write clean, well-structured, quality code. The web scraper should be able to query your data source and print the collected data to the screen, and you should be able to change its settings according to your needs (see the sketch at the end of this list).
- You can only use the requests and bs4 packages, plus selenium if needed.
- Create a public github repository and maintain the code of your project there.
- Add a README.md file that explains which website you are using, how you went about solving the problem, how to run your code, etc. It should give all the information necessary for users who have just gotten your code and want to use it to scrape. (makeareadme.com or readme.so can help.)
- Add a requirements.txt file listing all the packages that must be installed with pip, on top of a bare Python installation, for your code to run.
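
To make the flow concrete, here is a minimal requests + bs4 sketch. The URL and CSS selectors are placeholders for whatever site you choose, and the settings are collected at the top so they are easy to change.

```python
import requests
from bs4 import BeautifulSoup

# Scraper settings, kept in one place so they are easy to change.
BASE_URL = "https://example.com/articles"  # placeholder for your chosen website
HEADERS = {"User-Agent": "data-mining-checkpoint-1"}


def scrape_articles():
    """Fetch the page, parse it, and print the collected data to the screen."""
    response = requests.get(BASE_URL, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The selectors are placeholders; adapt them to the real markup of your site.
    for article in soup.select("div.article"):
        title = article.select_one("h2")
        date = article.select_one("time")
        print(title.get_text(strip=True) if title else "<no title>",
              "|", date.get_text(strip=True) if date else "<no date>")


if __name__ == "__main__":
    scrape_articles()
```

A matching requirements.txt for this sketch would contain just requests and beautifulsoup4 (plus selenium only if you end up needing it).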