Docs #5

Merged
merged 35 commits into from
Aug 8, 2024
4ba954f
sort imports with ruff
lucaromagnoli Aug 6, 2024
ea2d66c
test_httpx_client.py imports
lucaromagnoli Aug 6, 2024
2173ac2
comment
lucaromagnoli Aug 6, 2024
34a6f1d
add makefiles
lucaromagnoli Aug 6, 2024
61a71cf
more test_models.py
lucaromagnoli Aug 6, 2024
78eaf78
docs initial commit
lucaromagnoli Aug 6, 2024
84986de
update README.rst
lucaromagnoli Aug 6, 2024
e5bd2b2
reference readme in index.rst
lucaromagnoli Aug 6, 2024
580bb8e
rename soup property to html and add both text and data to response
lucaromagnoli Aug 7, 2024
2b04631
add utils module
lucaromagnoli Aug 7, 2024
30579f6
add examples to docs
lucaromagnoli Aug 7, 2024
3df7ea1
change before to before_sleep
lucaromagnoli Aug 7, 2024
c55665d
remove pipeline for now. will come in future releases
lucaromagnoli Aug 7, 2024
09416b1
remove AttrDict not needed
lucaromagnoli Aug 7, 2024
29b2053
fix data
lucaromagnoli Aug 7, 2024
cede261
no longer support dataclasses only BaseDataItem
lucaromagnoli Aug 7, 2024
4bdc2f0
add F401 to ruff
lucaromagnoli Aug 7, 2024
b7a4829
remove data classes from service as well
lucaromagnoli Aug 7, 2024
0512674
remove logger
lucaromagnoli Aug 7, 2024
c94628e
unused import
lucaromagnoli Aug 7, 2024
24ffd17
test f401
lucaromagnoli Aug 7, 2024
993bf04
use fields
lucaromagnoli Aug 7, 2024
9f52e27
use pydata_sphinx_theme
lucaromagnoli Aug 7, 2024
c80e2f8
chnages to examples
lucaromagnoli Aug 7, 2024
1ad64d6
changes to docs
lucaromagnoli Aug 7, 2024
a6203ee
implement get_func_name
lucaromagnoli Aug 7, 2024
62fe78d
improvements
lucaromagnoli Aug 7, 2024
fb7ae26
use python 3.12
lucaromagnoli Aug 8, 2024
8ab077c
use typing_extensions
lucaromagnoli Aug 8, 2024
fe6fe69
import from dataservice.data
lucaromagnoli Aug 8, 2024
d1e74c9
reformat
lucaromagnoli Aug 8, 2024
5acabe8
add log remove utils
lucaromagnoli Aug 8, 2024
512e9f3
add I to ruff rules
lucaromagnoli Aug 8, 2024
8720d27
rename to logs
lucaromagnoli Aug 8, 2024
d5c683b
add read the docs setup
lucaromagnoli Aug 8, 2024
2 changes: 1 addition & 1 deletion .github/workflows/python-app.yml
@@ -20,7 +20,7 @@ jobs:
- name: Install Python
uses: actions/setup-python@v4
with:
- python-version: "3.11"
+ python-version: "3.12"
# see details (matrix, python-version, python-version-file, etc.)
# https://github.com/actions/setup-python
- name: Install poetry
1 change: 1 addition & 0 deletions .gitignore
@@ -139,3 +139,4 @@ dmypy.json
cython_debug/

.idea/
/temp/
16 changes: 8 additions & 8 deletions .pre-commit-config.yaml
@@ -1,16 +1,16 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.5.6
hooks:
- # Run the linter.
+ # Run the linter and sort imports.
  - id: ruff
- args: [ --fix ]
+ args: [--fix]
# Run the formatter.
- id: ruff-format
22 changes: 22 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,22 @@
version: 2

build:
os: ubuntu-22.04
tools:
python: "3.12"


sphinx:
configuration: source/conf.py

# Optionally build your docs in additional formats such as PDF and ePub
# formats:
# - pdf
# - epub

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
# python:
# install:
# - requirements: docs/requirements.txt
20 changes: 20 additions & 0 deletions Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
44 changes: 0 additions & 44 deletions README.md

This file was deleted.

49 changes: 49 additions & 0 deletions README.rst
@@ -0,0 +1,49 @@
DataService
===========

Lightweight - async - data gathering for Python.
____________________________________________________________________________________
DataService is a lightweight data gathering library for Python.

Designed for simplicity, it uses common web scraping and data gathering patterns.

No complex API to learn, just standard Python idioms.

Asynchronous implementation, synchronous interface.

How to use DataService
----------------------

To start, create a ``DataService`` instance with an ``Iterable`` of ``Request`` objects. This gives you an ``Iterator`` of data objects that you can iterate over or convert to a ``list``, a ``tuple``, a ``pd.DataFrame``, or any other data structure of your choice.

.. code-block:: python

start_requests = [Request(url="https://books.toscrape.com/index.html", callback=parse_books_page, client=HttpXClient())]
data_service = DataService(start_requests)
data = tuple(data_service)

A ``Request`` is a ``Pydantic`` model that includes the URL to fetch, a reference to the ``client`` callable, and a ``callback`` function for parsing the ``Response`` object.

The client can be any Python callable that accepts a ``Request`` object and returns a ``Response`` object. ``DataService`` provides an ``HttpXClient`` class, which is based on the ``httpx`` library, but you are free to use your own custom async client.

The callback function processes a ``Response`` object and returns either ``data`` or additional ``Request`` objects.
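The client contract described here is easy to satisfy with plain Python. The sketch below illustrates it using simplified, hypothetical stand-ins for the library's ``Request`` and ``Response`` models, and a fake client that fabricates a response instead of making a real HTTP call; only the shape of the contract (a callable taking a ``Request`` and returning a ``Response``) is taken from this README.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Optional


# Simplified stand-ins for dataservice's Pydantic models (illustration only).
@dataclass
class Request:
    url: str
    callback: Optional[Callable] = None
    client: Optional[Callable] = None


@dataclass
class Response:
    request: Request
    text: str


class StaticClient:
    """Hypothetical client: a callable that takes a Request and returns a Response."""

    def __call__(self, request: Request):
        # Calling the client instance simply delegates to the coroutine.
        return self.make_request(request)

    async def make_request(self, request: Request) -> Response:
        # A real client would perform an async HTTP call here (e.g. with httpx).
        return Response(request=request, text=f"<html>fetched {request.url}</html>")
```

Because ``make_request`` is a coroutine, the service (or ``asyncio.run`` in a standalone script) awaits the result.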

In this trivial example, we request the `Books to Scrape <https://books.toscrape.com/index.html>`_ homepage and parse the number of books on the page.

Example ``parse_books_page`` function:

.. code-block:: python

def parse_books_page(response: Response):
articles = response.soup.find_all("article", {"class": "product_pod"})
return {
"url": response.request.url,
"title": response.soup.title.get_text(strip=True),
"articles": len(articles),
}

This function takes a ``Response`` object, which has a ``soup`` attribute (a ``BeautifulSoup`` object of the HTML content). The function parses the HTML content and returns data.

The callback function can ``return`` or ``yield`` either ``data`` (dict or dataclass) or more ``Request`` objects.

If you have used Scrapy before, you will find this pattern familiar.
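To make the return-or-yield contract concrete, here is a minimal sketch of a paginating callback. ``Request`` and ``PageResponse`` are simplified hypothetical stand-ins (the real models are Pydantic, and extraction would go through the ``soup`` attribute); only the yield-data-or-``Request`` pattern is taken from this README.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional, Union


@dataclass
class Request:  # simplified stand-in for dataservice's Pydantic model
    url: str
    callback: Callable


@dataclass
class PageResponse:  # hypothetical stand-in; real parsing would use BeautifulSoup
    url: str
    titles: list
    next_page_url: Optional[str] = None


def parse_books_page(response: PageResponse) -> Iterator[Union[dict, Request]]:
    # Yield one data dict per book found on the current page.
    for title in response.titles:
        yield {"url": response.url, "title": title}
    # Then queue the next page, if any, as a further Request.
    if response.next_page_url is not None:
        yield Request(url=response.next_page_url, callback=parse_books_page)
```

The service exhausts the generator, collecting the dicts as data and scheduling any yielded ``Request`` objects for fetching.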
9 changes: 7 additions & 2 deletions dataservice/__init__.py
@@ -1,17 +1,22 @@
from dataservice.clients import HttpXClient
from dataservice.config import ServiceConfig
from dataservice.data import BaseDataItem, DataWrapper
from dataservice.exceptions import RequestException, RetryableRequestException
from dataservice.logs import setup_logging
from dataservice.models import Request, Response
from dataservice.pipeline import Pipeline
from dataservice.service import DataService

__all__ = [
"BaseDataItem",
"DataService",
"DataWrapper",
"HttpXClient",
"Pipeline",
"Request",
"Response",
"RequestException",
"RetryableRequestException",
"ServiceConfig",
"setup_logging",
]

__version__ = "0.0.1"
23 changes: 18 additions & 5 deletions dataservice/clients.py
@@ -19,10 +19,17 @@ def __init__(self):
self.async_client = httpx.AsyncClient

def __call__(self, *args, **kwargs):
"""Make a request using the client."""
return self.make_request(*args, **kwargs)

async def make_request(self, request: Request) -> Response | NoReturn:
- """Make a request and handle exceptions."""
+ """Make a request and handle exceptions.
+
+ :param request: The request object containing the details of the HTTP request.
+ :return: A Response object if the request is successful.
+ :raises RequestException: If a non-retryable HTTP error occurs.
+ :raises RetryableRequestException: If a retryable HTTP error occurs.
+ """
try:
return await self._make_request(request)
except httpx.HTTPStatusError as e:
@@ -46,9 +53,15 @@ async def make_request(self, request: Request) -> Response | NoReturn:
raise RequestException(str(e))

async def _make_request(self, request: Request) -> Response:
- """Make a request using HTTPX."""
+ """Make a request using HTTPX. Private method for internal use.
+
+ :param request: The request object containing the details of the HTTP request.
+ :return: A Response object containing the response data.
+ """
logger.info(f"Requesting {request.url}")
- async with self.async_client(headers=request.headers) as client:
+ async with self.async_client(
+     headers=request.headers, proxy=request.proxy
+ ) as client:
match request.method:
case "GET":
response = await client.get(request.url, params=request.params)
@@ -62,8 +75,8 @@ async def _make_request(self, request: Request) -> Response:
response.raise_for_status()
match request.content_type:
case "text":
- data = response.text
+ data = None
case "json":
data = response.json()
logger.info(f"Returning response for {request.url}")
- return Response(request=request, data=data)
+ return Response(request=request, text=response.text, data=data)
26 changes: 20 additions & 6 deletions dataservice/config.py
@@ -1,7 +1,7 @@
- from typing import NewType, Annotated
+ from typing import Annotated, NewType

from annotated_types import Ge
- from pydantic import BaseModel
+ from pydantic import BaseModel, Field

PositiveInt = Annotated[int, Ge(0)]
Milliseconds = NewType("Milliseconds", PositiveInt)
@@ -19,7 +19,21 @@ class RetryConfig(BaseModel):
class ServiceConfig(BaseModel):
"""Global configuration for the service."""

- deduplication: bool = True
- max_concurrency: PositiveInt = 10
- random_delay: Milliseconds = Milliseconds(0)
- retry: RetryConfig = RetryConfig()
+ retry: RetryConfig = Field(
+     default_factory=RetryConfig, description="The retry configuration."
+ )
+ deduplication: bool = Field(
+     default=True, description="Whether to deduplicate requests."
+ )
+ max_concurrency: PositiveInt = Field(
+     default=10, description="The maximum number of concurrent requests."
+ )
+ random_delay: Milliseconds = Field(
+     default=Milliseconds(0),
+     description="The maximum random delay between requests.",
+ )
+
+ cache: bool = Field(default=False, description="Whether to cache requests.")
+ cache_name: str = Field(
+     default="cache", description="A name to use for the cache. Defaults to 'cache'."
+ )
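The defaults introduced in this diff can be summarized with a plain-Python sketch. The real ``ServiceConfig`` is a Pydantic ``BaseModel``; the dataclasses below (``ServiceConfigSketch``, ``RetryConfigSketch``) are hypothetical stand-ins that only mirror the field names and defaults visible above, and the real ``RetryConfig`` fields are not shown in this diff.

```python
from dataclasses import dataclass, field


@dataclass
class RetryConfigSketch:
    """Placeholder: the real RetryConfig fields are not visible in this diff."""


@dataclass
class ServiceConfigSketch:
    retry: RetryConfigSketch = field(default_factory=RetryConfigSketch)
    deduplication: bool = True   # whether to deduplicate requests
    max_concurrency: int = 10    # maximum number of concurrent requests
    random_delay: int = 0        # maximum random delay between requests, in ms
    cache: bool = False          # whether to cache requests
    cache_name: str = "cache"    # name to use for the cache
```

Using ``default_factory`` for ``retry`` (as the diff does with Pydantic's ``Field``) avoids sharing one mutable ``RetryConfig`` instance across all configs.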