dsci524_group29_webscraping

A Python package for simplified web scraping functionality for data scientists new to web scraping.

Installation

$ pip install dsci524_group29_webscraping

Functions

  • fetch_html(url): Retrieves the raw HTML content from the specified URL, handling HTTP requests and potential errors.
  • parse_content(html, selector, selector_type): Parses the provided HTML content using CSS selectors or XPath to extract specified data.
  • save_data(data, format, destination): Saves the extracted data into the desired format (e.g., TXT, CSV, JSON) at the specified destination path.
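The fetch, parse, and save flow described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the package's implementation: the names `sketch_fetch_html`, `sketch_parse_tag_text`, and `sketch_save_data` are hypothetical stand-ins, the parser matches a bare tag name rather than a full CSS selector or XPath expression, and the error handling and one-value-per-row CSV layout are guesses at reasonable behaviour.

```python
import csv
import json
import urllib.error
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


def sketch_fetch_html(url, timeout=10):
    """Hypothetical stand-in: return the page body as text, or raise RuntimeError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        raise RuntimeError(f"HTTP error {exc.code} for {url}") from exc
    except urllib.error.URLError as exc:
        # The request never completed (DNS failure, refused connection, ...).
        raise RuntimeError(f"Could not reach {url}: {exc.reason}") from exc


class _TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data


def sketch_parse_tag_text(html, tag):
    """Hypothetical stand-in: match a bare tag name only, not CSS or XPath."""
    collector = _TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.results]


def sketch_save_data(data, format, destination):
    """Hypothetical stand-in: write a list of strings as TXT, CSV, or JSON."""
    path = Path(destination)
    fmt = format.lower()
    if fmt == "json":
        path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    elif fmt == "csv":
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            for item in data:
                writer.writerow([item])  # one value per row (assumed layout)
    elif fmt == "txt":
        path.write_text("\n".join(data), encoding="utf-8")
    else:
        raise ValueError(f"Unsupported format: {format!r}")
    return path
```

With these stand-ins, `sketch_save_data(sketch_parse_tag_text(html, "li"), "json", "items.json")` would write the text of every `li` element in `html` to a JSON file; the package's own functions take a selector and a selector type instead of a bare tag name.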

Python Ecosystem

While libraries like BeautifulSoup and Scrapy offer comprehensive web scraping capabilities, dsci524_group29_webscraping aims to provide a more streamlined and beginner-friendly approach. By focusing on three core functions, it abstracts the complexities involved in web scraping, making it accessible for quick tasks and educational purposes.

Similar Packages:

  • webscraping: Provides web scraping functions, but its feature set is broader than a beginner needs.
  • webscraping_tools: Offers similar functionality plus many additional features, which in our opinion places it at the intermediate level.

dsci524_group29_webscraping differentiates itself by offering a small set of functions that cover simple, beginner-level scraping needs.

Contributors

  • Lixuan Lin
  • Hui Tang
  • Sienko Ikhabi

Contributing

Interested in contributing? Check out the contributing guidelines.

Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by the specified terms.

License

dsci524_group29_webscraping was created by Lixuan Lin, Hui Tang, and Sienko Ikhabi for the Master of Data Science program at the University of British Columbia. It is licensed under the terms of the MIT license.

Credits

This project was created with cookiecutter from the py-pkgs-cookiecutter template.