Commit

Pw imp dev (#65)
* Py313 (#56)

* use Run CI workflow name

* run 3.13 tests

* update poetry.lock

* add python version icons

* poetry version minor

* Async dataservice (#60)

* AsyncDataService

* minor version

* remove DataItem from __all__

* Playwright optional (#61)

* make playwright optional

* update README.rst

* minor version

* Boilerplate (#62)

* add cli tool

* add py to filename if not present

* update cli and template.py.j2

* remove check playwright in init (#63)

* refactor Playwright,add PW config and devices.py

* init page to None

* fix playwright tests

* change log.info to debug

* add debug log messages

* improve logs

* change log level

* add test_init_browser

* add intercept_content_type

* update publish.yml

* update publish.yml for dev

* update ci

* fix pyproject.toml

* update deps
lucaromagnoli authored Nov 7, 2024
1 parent c7a28fd commit 7efce08
Showing 21 changed files with 2,060 additions and 603 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/ci.yml
@@ -2,9 +2,9 @@ name: Run CI

 on:
   push:
-    branches: [ "main", "dev" ]
+    branches: [ "dev" ]
   pull_request:
-    branches: [ "main"]
+    branches: [ "dev"]

 permissions:
   contents: read
@@ -16,7 +16,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.11", "3.12"]
+        python-version: ["3.11", "3.12", "3.13"]
     steps:
     - uses: actions/checkout@v4

@@ -42,7 +42,7 @@ jobs:
     - name: Install project dependencies
       run: |
         poetry config virtualenvs.create false
-        poetry install --with dev --no-root
+        poetry install --with dev --no-root -E playwright
     - name: Lint ruff
       run: |
41 changes: 27 additions & 14 deletions .github/workflows/publish.yml
@@ -2,15 +2,13 @@ name: Publish Python Package

 on:
   workflow_run:
-    workflows: [ "ci (3.11)", "ci (3.12)" ]
+    workflows: ["CI"]
     types:
       - completed
-  push:
-    branches:
-      - main

 jobs:
   release:
+    if: github.event.workflow_run.conclusion == 'success' && github.ref == 'refs/heads/dev'
     runs-on: ubuntu-latest

     steps:
@@ -29,38 +27,53 @@ jobs:
       - name: Install dependencies
         run: |
-          poetry install --no-root
+          poetry install --no-root -E playwright
+      - name: Check if version is already published
+        id: check_version
+        run: |
+          VERSION=$(poetry version -s)
+          if python -m pip search dataservice==$VERSION; then
+            echo "Version $VERSION is already published. Skipping."
+            exit 0
+          else
+            echo "Version $VERSION is not published yet. Proceeding."
+          fi
       - name: Build package
+        if: ${{ steps.check_version.outcome == 'success' }}
         run: |
           poetry build
-      - name: Publish to PyPI
+      - name: Publish to Test PyPI
         env:
-          POETRY_PYPI_TOKEN_PYPI: ${{ secrets.DS_PYPI_API_TOKEN }}
+          POETRY_TEST_PYPI_TOKEN_PYPI: ${{ secrets.DS_TEST_PYPI_API_TOKEN }}
         run: |
-          poetry config http-basic.pypi "__token__" "${POETRY_PYPI_TOKEN_PYPI}"
-          poetry publish
+          poetry config repositories.test-pypi https://test.pypi.org/legacy/
+          poetry config pypi-token.test-pypi "${POETRY_TEST_PYPI_TOKEN_PYPI}"
+          poetry publish -r test-pypi
       - name: Get the version from pyproject.toml
         id: get_version
         run: |
           echo "::set-output name=version::$(poetry version -s)"
       - name: Create Git tag
+        if: ${{ steps.check_version.outcome == 'success' }}
         env:
           GITHUB_TOKEN: ${{ secrets.CI_TOKEN }}
         run: |
           git config user.name "github-actions"
           git config user.email "[email protected]"
-          git tag -a v${{ steps.get_version.outputs.version }} -m "Release version ${{ steps.get_version.outputs.version }}"
-          git push https://x-access-token:${GITHUB_TOKEN}@github.com/lucaromagnoli/dataservice.git v${{ steps.get_version.outputs.version }}
+          git tag -a v${{ steps.get_version.outputs.version }} -m "Release version ${{ steps.get_version.outputs.version }}-dev"
+          git push https://x-access-token:${GITHUB_TOKEN}@github.com/lucaromagnoli/dataservice.git v${{ steps.get_version.outputs.version }}-dev
       - name: Create GitHub Release
+        if: ${{ steps.check_version.outcome == 'success' }}
         uses: ncipollo/release-action@v1
         with:
-          tag: v${{ steps.get_version.outputs.version }}
-          name: "Release ${{ steps.get_version.outputs.version }}"
-          body: "New release version ${{ steps.get_version.outputs.version }} is published"
+          tag: v${{ steps.get_version.outputs.version }}-dev
+          name: "Release ${{ steps.get_version.outputs.version }}-dev"
+          body: "New release version ${{ steps.get_version.outputs.version }}-dev is published"
           draft: false
           prerelease: false
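
A note on the `check_version` step above: as far as I know, `pip search` no longer works against PyPI, and both branches of the shell `if` finish with exit status 0, so the later `steps.check_version.outcome == 'success'` guards would not skip anything. The following is a minimal sketch of a different check, assuming the index exposes the PyPI JSON API at `/pypi/<package>/<version>/json`. `release_url` and `is_published` are hypothetical helper names that are not part of this commit, and the package name `python-dataservice` is taken from the README rather than from the workflow.

```python
import urllib.error
import urllib.request


def release_url(package: str, version: str, index: str = "https://pypi.org") -> str:
    """Build the JSON API URL for one specific release of a package."""
    return f"{index}/pypi/{package}/{version}/json"


def is_published(package: str, version: str, index: str = "https://pypi.org") -> bool:
    """Return True if the index already serves this exact version.

    A 404 is treated as "not published"; any other HTTP error is re-raised
    so that an outage is not mistaken for a missing release.
    """
    url = release_url(package, version, index)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return False
        raise
```

A workflow step could call `is_published("python-dataservice", version, index="https://test.pypi.org")` and write the result to a step output, which later steps can test explicitly instead of relying on the step outcome.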
67 changes: 0 additions & 67 deletions .github/workflows/publish_dev.yml

This file was deleted.

2 changes: 1 addition & 1 deletion Dockerfile
@@ -22,7 +22,7 @@ COPY noxfile.py README.rst /app/

 # Install project dependencies
 RUN poetry config virtualenvs.create false
-RUN poetry install --with dev --no-root
+RUN poetry install --with dev -E playwright --no-root

 # Install Nox
 RUN pip install --no-cache-dir nox nox-poetry
22 changes: 16 additions & 6 deletions README.rst
@@ -1,3 +1,6 @@
+.. image:: https://img.shields.io/pypi/pyversions/python-dataservice.svg
+   :alt: Python Versions
+
 DataService
 ===========

@@ -9,20 +12,26 @@ Designed for simplicity, it's built upon common web scraping and data gathering

 No complex API to learn, just standard Python idioms.

-Async implementation, sync interface.
+Dual synchronous and asynchronous support.

 Installation
 ------------
+Please note that DataService requires Python 3.11 or higher.
+
 You can install DataService via pip:

 .. code-block:: bash
     pip install python-dataservice
-Please note that DataService requires Python 3.11 or higher.
-If you want to use `PlaywrightClient`, you will also need to install the `playwright` package:
+You can also install the optional ``playwright`` dependency to use the ``PlaywrightClient``:

 .. code-block:: bash
+    pip install python-dataservice[playwright]
+To install Playwright, run:
+
+.. code-block:: bash
@@ -47,7 +56,8 @@ To start, create a ``DataService`` instance with an ``Iterable`` of ``Request``
 A ``Request`` is a ``Pydantic`` model that includes the URL to fetch, a reference to the ``client`` callable, and a ``callback`` function for parsing the ``Response`` object.

-The client can be any Python callable that accepts a ``Request`` object and returns a ``Response`` object. ``DataService`` provides an ``HttpXClient`` class, which is based on the ``httpx`` library, but you are free to use your own custom async client.
+The client can be any async Python callable that accepts a ``Request`` object and returns a ``Response`` object.
+``DataService`` provides an ``HttpXClient`` class by default, which is based on the ``httpx`` library, but you are free to use your own custom async client.

 The callback function processes a ``Response`` object and returns either ``data`` or additional ``Request`` objects.

@@ -69,8 +79,8 @@ This function takes a ``Response`` object, which has a ``html`` attribute (a ``B

 The callback function can ``return`` or ``yield`` either ``data`` (``dict`` or ``pydantic.BaseModel``) or more ``Request`` objects.

-If you have used Scrapy before, you will find this pattern familiar.
+If you have used ``Scrapy`` before, you will find this pattern familiar.

 For more examples and advanced usage, check out the `examples <https://dataservice.readthedocs.io/en/latest/examples.html>`_ section.

-For a detailed API reference, check out the `modules <https://dataservice.readthedocs.io/en/latest/modules.html>`_ section.
+For a detailed API reference, check out the `API <https://dataservice.readthedocs.io/en/latest/modules.html>`_ section.
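
The README change above describes the client as "any async Python callable that accepts a ``Request`` object and returns a ``Response`` object". A minimal sketch of that contract follows. To keep it runnable without the library installed, `Request` and `Response` here are simplified stand-in dataclasses (an assumption; the real models are Pydantic classes with more fields), and `echo_client` and `parse` are hypothetical names.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Request:
    """Stand-in for dataservice.Request (assumption: only the URL is modelled)."""
    url: str


@dataclass
class Response:
    """Stand-in for dataservice.Response (assumption: raw text only)."""
    url: str
    text: str


async def echo_client(request: Request) -> Response:
    """Toy async client: takes a Request, returns a Response."""
    await asyncio.sleep(0)  # placeholder for real network I/O
    return Response(url=request.url, text=f"<html>{request.url}</html>")


def parse(response: Response) -> dict:
    """Callback: turn a Response into a data item (a plain dict here)."""
    return {"url": response.url, "length": len(response.text)}


async def main() -> dict:
    response = await echo_client(Request(url="https://example.com"))
    return parse(response)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With the real library, the same shape would apply: the callable is passed as the request's ``client`` and the parsing function as its ``callback``.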
11 changes: 9 additions & 2 deletions dataservice/__init__.py
@@ -1,6 +1,11 @@
-from dataservice.clients import HttpXClient, PlaywrightClient, PlaywrightPage
+from dataservice.clients import (
+    HttpXClient,
+    PlaywrightClient,
+    PlaywrightPage,
+)
 from dataservice.config import (
     CacheConfig,
+    ProxyConfig,
     RateLimiterConfig,
     RetryConfig,
     ServiceConfig,
@@ -9,9 +14,10 @@
 from dataservice.exceptions import DataServiceException, RetryableException
 from dataservice.logs import setup_logging
 from dataservice.models import FailedRequest, Request, Response
-from dataservice.service import DataService
+from dataservice.service import AsyncDataService, DataService

 __all__ = [
+    "AsyncDataService",
     "BaseDataItem",
     "CacheConfig",
     "DataService",
@@ -21,6 +27,7 @@
     "HttpXClient",
     "PlaywrightClient",
     "PlaywrightPage",
+    "ProxyConfig",
     "RateLimiterConfig",
     "Request",
     "Response",
75 changes: 75 additions & 0 deletions dataservice/cli.py
@@ -0,0 +1,75 @@
+import argparse
+import os
+
+from jinja2 import Environment, PackageLoader, select_autoescape
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Generate a Python file with boilerplate code for a data service."
+    )
+    parser.add_argument(
+        "filename", type=str, help="The name of the Python file to create."
+    )
+    parser.add_argument(
+        "--data-item",
+        action="store_true",
+        help="Include BaseDataItem in the generated code.",
+    )
+    parser.add_argument(
+        "--service-config",
+        action="store_true",
+        help="Include ServiceConfig in the generated code.",
+    )
+    parser.add_argument(
+        "--proxy-config", action="store_true", help="Import ProxyConfig."
+    )
+    parser.add_argument(
+        "--async-service",
+        action="store_true",
+        help="Use AsyncDataService and make the main function async.",
+    )
+    parser.add_argument(
+        "--client",
+        help="The name of the client to use. Default is HttpXClient.",
+        choices=["httpx", "playwright"],
+        default="httpx",
+    )
+
+    args = parser.parse_args()
+
+    filename = args.filename
+    if not filename.endswith(".py"):
+        filename = f"{filename}.py"
+    script_name = filename.split(".")[0]
+    use_httpx_client = args.client == "httpx"
+    use_playwright_client = args.client == "playwright"
+    use_async_data_service = args.async_service
+    use_data_service = not use_async_data_service
+
+    env = Environment(
+        loader=PackageLoader("dataservice", "templates"),
+        autoescape=select_autoescape(["html", "xml"]),
+    )
+
+    template = env.get_template("template.py.j2")
+
+    content = template.render(
+        script_name=script_name,
+        use_base_data_item=args.data_item,
+        use_service_config=args.service_config,
+        use_httpx_client=use_httpx_client,
+        use_playwright_client=use_playwright_client,
+        use_data_service=use_data_service,
+        use_async_data_service=use_async_data_service,
+    )
+
+    filepath = os.path.join(os.getcwd(), filename)
+    with open(filepath, "w") as f:
+        f.write(content)
+
+    print(f"File '{filename}' created with boilerplate code.")
+
+
+if __name__ == "__main__":
+    main()
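
One detail in `cli.py` worth illustrating: `filename.split(".")[0]` keeps only the text before the first dot, so an argument such as `my.scraper` yields the script name `my`, and a relative path such as `./out/scraper.py` yields an empty string. A minimal sketch of an alternative using `pathlib` is shown below. `normalize` is a hypothetical helper, not part of this commit, and whether dotted names or paths should be accepted at all is a design decision for the author.

```python
from pathlib import Path


def normalize(filename: str) -> tuple[str, str]:
    """Return (filename with a .py suffix, script name without it)."""
    if not filename.endswith(".py"):
        filename = f"{filename}.py"
    # Path.stem drops the directory part and only the final suffix.
    return filename, Path(filename).stem


if __name__ == "__main__":
    for arg in ("scraper", "my.scraper", "./out/scraper.py"):
        print(arg, "->", normalize(arg))
```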