
lets collaborate #1

Open
milahu opened this issue Jan 8, 2024 · 16 comments

@milahu
Owner

milahu commented Jan 8, 2024

@kaliiiiiiiiii this project is largely based on your Selenium-Driverless
would you be interested in collaboration?
(spoiler: i will use the MIT license)

i have not yet found an actual "headful web scraper"
where i can simply remote-control an actual chromium browser
to allow "semi-automatic web scraping" (solving captchas, debugging error states)
so i created my own : )

so far, my code is unreleased
i'm using it in my opensubtitles-scraper to bypass cloudflare

so far, my code (fetch-subs.py) is really messy
and it will need some serious refactoring
from 8000 lines in one file, to modules and classes

my goal is to make chromium usable just like any other http client in python
as a drop-in replacement for aiohttp
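
for illustration, a minimal sketch of the intended drop-in usage (assuming the session API mirrors aiohttp, as in the middleware example further below):

import asyncio

import aiohttp_chromium as aiohttp

async def main():
    # a ClientSession drives a real chromium browser behind the scenes,
    # but exposes the familiar aiohttp client interface
    async with aiohttp.ClientSession() as session:
        async with session.get("http://httpbin.org/get") as response:
            print(response.status)
            print(await response.text())

asyncio.run(main())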

i have a working prototype for handling file downloads (and html error pages)
but i guess that will be too complex / out of scope for Selenium-Driverless
see also Selenium-Driverless#140

@kaliiiiiiiiii
Contributor

@milahu

would you be interested in collaboration?
(spoiler: i will use the MIT license)

Surely, why not:)

i have a working prototype for handling file downloads (and html error pages)
but i guess that will be too complex / out of scope for Selenium-Driverless
see also Selenium-Driverless#140

kaliiiiiiiiii/Selenium-Driverless#140 will be resolved at some point for sure - and file downloading, html error pages etc. wouldn't be an issue to implement in driverless.
However, I see the point of wanting a lightweight, stable, fast aiohttp-like browser. And ofc being completely open-source hehe.
Already at this point, I see an issue about knowing when a page has fully loaded, as pages in edge cases can have forever-loading iframes, iframes loaded after content-load, etc. Waiting for elements etc. would then already go towards more complex automation (=> driverless, not aiohttp-like)

so far, my code (fetch-subs.py) is really messy
and it will need some serious refactoring
from 8000 lines in one file, to modules and classes

I assume you're talking about opensubtitles-scraper/blob/main/fetch-subs.py.
So the plan is to use Pyppeteer? Tbh I really wouldn't recommend using it. Puppeteer (& Pyppeteer) just isn't made to be undetectable. I'd recommend relying on bare CDP, possibly using CDP-Socket (no worries, I plan to make it GNU//MIT anyways:) )

What I have to note here tho:

  1. Currently I'm quite maxed out & therefore won't find a lot of time in the near future. Feel free to lmk if you need anything tho.
  2. Let's keep this professional - I don't judge or discuss any political or personal stuff on here
  3. I haven't worked with auto-generated documentation yet - might need some time getting into that//you might provide some structure on that to start on.

@kaliiiiiiiiii
Contributor

Or did you mean that you'd like to use driverless as a base for this project?

@milahu
Owner Author

milahu commented Jan 8, 2024

so far, my code is unreleased

if you want to see my current mess: [email protected]

file downloading, html error pages etc. wouldn't be an issue to implement in driverless.

maybe...

I see an issue about knowing when a page has fully loaded

more complex automation (=> driverless, not aiohttp-like)

true, this would be more than a stupid http client

it's a challenge to reduce this complexity into a few lines of code

the http client would need some model of the http server
to predict possible responses

the goal is to autosolve complex challenges like

  • "please solve this captcha"
  • "please log in"
  • "do you want cookies?"
  • "do you want to donate?"
  • "please turn off your ad blocker"
  • "do you want to chat with our support?"
  • "click here to load more comments"
  • "you have reached your daily rate limit, please wait until tomorrow"

also automate pagination (or infinite scroll)

i'm pretty sure that something like this exists somewhere...
for example apify has a similar goal, to translate html responses to json responses
or "web archive" services will have such challenge-solvers, to "click through" to the content

@kaliiiiiiiiii
Contributor

if you want to see my current mess: [email protected]

I'll send you an E-Mail.

translate html responses to json responses

Huh, how'd you want to do that? Run some model on it? Maintain patterns for current frameworks//antibots?

I suppose that some basic wait-for-content-load with bare CDP is a start at some point. Considerations might be (see the sketch after this list):

  1. wait for a regex match on the html
  2. wait for a regex match on the url (redirect-url support)
  3. wait for iframes matching any of the above conditions (optionally)
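
A rough sketch of such a wait condition (the get_url/get_html coroutines are hypothetical stand-ins for the actual CDP calls):

import asyncio
import re

async def wait_for_load(get_url, get_html,
                        html_pattern=None, url_pattern=None,
                        timeout=30, poll=0.5):
    # poll until the page url or html matches one of the given regex patterns
    async def check():
        while True:
            if url_pattern and re.search(url_pattern, await get_url()):
                return
            if html_pattern and re.search(html_pattern, await get_html()):
                return
            await asyncio.sleep(poll)
    await asyncio.wait_for(check(), timeout)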

@milahu
Owner Author

milahu commented Jan 8, 2024

the http client would need some model of the http server

the user would have to provide that model of the http server
with all the if/then/else/match/retry/... logic
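
for illustration, such a model could be a plain list of rules, here as a hypothetical sketch (the handler names are made up):

import re

async def solve_captcha(response):
    ...  # ask the user to solve the captcha, then retry the request

async def accept_response(response):
    return response  # no challenge, the response is the content

# the user-provided model of the http server:
# map expected response patterns to handler coroutines
server_model = [
    (r"please solve this captcha", solve_captcha),
    (r".", accept_response),  # default rule: accept anything else
]

async def handle_response(response):
    text = await response.text()
    for pattern, handler in server_model:
        if re.search(pattern, text):
            return await handler(response)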

@milahu
Owner Author

milahu commented Jan 14, 2024

i'm pretty sure that something like this exists somewhere...

yepp, i have reinvented botasaurus

@kaliiiiiiiiii
Contributor

i'm pretty sure that something like this exists somewhere...

yepp, i have reinvented botasaurus

botasaurus uses selenium internally.
Also, JavaScript execution such as https://github.com/omkarcloud/botasaurus/blob/dba618c26da74263cc4af33a13faf41cb7a30ae3/botasaurus/anti_detect_driver.py#L216 for sure is detectable.

@kaliiiiiiiiii
Contributor

I took a short look at the code you've got so far. Stuff I noticed here:

  1. The uBlock extension to my knowledge is pretty invasive (js execution, network blocking, etc.) and I'm pretty sure it's detectable. Therefore I'd not recommend adding it by default (if that is the case?)
  2. I like the arguments & preferences you've added. Might have a closer look at them for my usages as well.
  3. You might consider support for contexts (incognito). While doing multiple requests in the same context, all cookies will be shared. Also, passing extra headers might be a nice feature
  4. For supporting streaming, you might consider using Network.takeResponseBodyForInterceptionAsStream

@milahu
Owner Author

milahu commented Jan 16, 2024

i have reinvented botasaurus

actually no, aiohttp_chromium is more low-level than botasaurus
aiohttp_chromium is really just a drop-in replacement for aiohttp
and selenium features are hidden under response._driver

also botasaurus fails to run on my nixos machine, see omkarcloud/botasaurus#40
meanwhile, aiohttp_chromium just works
also because selenium_driverless is a pure-python library
so no webdriver, and no node process to eval javascript

passing extra headers might be a nice feature

the goal is to autosolve complex challenges

middlewares is the term i was looking for
to intercept and modify requests and responses

also scrapy has support for middlewares
but scrapy is too high-level for my taste, similar to botasaurus

the aiohttp client only has support for passive tracing
but since it's just a dumb http client
where one request gives only one response (plus redirects)
such request/response interception is not needed
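
for comparison, aiohttp's passive tracing looks like this (this is the real aiohttp API; a TraceConfig can observe requests, but not modify or cancel them):

import aiohttp

trace_config = aiohttp.TraceConfig()

async def on_request_start(session, ctx, params):
    # observe only: params carries url, method, headers
    print("request start:", params.url)

trace_config.on_request_start.append(on_request_start)

async def fetch(url):
    async with aiohttp.ClientSession(trace_configs=[trace_config]) as session:
        async with session.get(url) as response:
            return await response.text()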

the aiohttp server has support for middlewares

A middleware is a coroutine that can modify either the request or response

Every middleware should accept two parameters, a request instance and a handler, and return the response or raise an exception

in aiohttp_chromium this could look like

import asyncio

import aiohttp_chromium as aiohttp

async def main():

    async with aiohttp.ClientSession() as session:

        async def middleware_1(request, handler):
            print("middleware_1")
            request.headers["test"] = "hello"
            request.cookies["some_key"] = "some value"
            # send request, get response
            response = await handler(request)
            response.text = response.text + ' wink'
            return response

        args = dict(
            _middlewares=[
                middleware_1,
            ]
        )

        url = "http://httpbin.org/get"

        async with session.get(url, **args) as response:
            print(response.status)
            print(await response.text())

asyncio.run(main())

this is also useful to block requests
example: don't load images / styles / scripts / ads / ...
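
a blocking middleware could look like this (a sketch, assuming the hypothetical _middlewares API from above, and assuming that returning None instead of calling the handler cancels the request):

async def block_images(request, handler):
    # skip requests for images, pass everything else through
    if str(request.url).endswith((".png", ".jpg", ".gif", ".webp")):
        return None
    return await handler(request)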

The uBlock extension to my knowledge is pretty invasive

the ads on many websites are "pretty invasive" too
many normal browsers have ublock, so that's no sign of a bot

I like the arguments & preferences you've added

i'm surprised that i have to add --enable-features=WebContentsForceDark
to actually enable dark mode for websites
otherwise only the chromium UI is dark, and websites are light
i would call this a chromium bug, but probably it's "default off" for better performance

my self._chromium_config has some "reasonable defaults"
_chromium_config will be exposed in the session constructor

    args = dict(
        _chromium_config = {
            "bookmark_bar": {
                # disable bookmarks bar
                "show_on_all_tabs": False,
            },
        },
    )
    async with aiohttp.ClientSession(**args) as session:

generally, all chromium options will be exposed
because different users have different needs

You might consider support for contexts

with aiohttp i would create multiple sessions
with different cookie_jar, different request headers, ...

i guess that creating an incognito window
is not more efficient than starting a new chromium process

currently the start is slow, because i wait 20 seconds for ublock update
but the start time can be reduced by using persistent user-data-dir for chromium
one user-data-dir for every session

Network.takeResponseBodyForInterceptionAsStream

probably i will use this by default instead of Network.getResponseBody
because ahead of time, i don't know
whether a response is a document or an infinite stream (long poll)
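
a sketch of the stream-based flow over raw CDP (assuming a generic cdp_send(method, params) coroutine that wraps the websocket):

import base64

async def read_intercepted_body(cdp_send, interception_id):
    # get an IO stream handle for the intercepted response body
    result = await cdp_send(
        "Network.takeResponseBodyForInterceptionAsStream",
        {"interceptionId": interception_id},
    )
    stream = result["stream"]
    body = b""
    while True:
        # read the stream chunk by chunk via the IO domain
        chunk = await cdp_send("IO.read", {"handle": stream})
        data = chunk["data"]
        body += base64.b64decode(data) if chunk.get("base64Encoded") else data.encode()
        if chunk["eof"]:
            break
    await cdp_send("IO.close", {"handle": stream})
    return body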

@kaliiiiiiiiii
Contributor

kaliiiiiiiiii commented Jan 16, 2024

currently the start is slow, because i wait 20 seconds for ublock update

Uhh you mean fetch the extension? Each time?
pretty sure versioning should be possible to implement.

example: don't load images / styles / scripts / ads / ...

good point

Network.takeResponseBodyForInterceptionAsStream

probably i will use this by default instead of Network.getResponseBody

Yep, I'd propose that as well. Additionally, there's a maximum message size for python websockets (technically overridable)
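
With the python websockets library, that limit is the max_size argument of connect (default 1 MiB, None disables it):

import websockets

async def connect_cdp(cdp_url):
    # max_size=None lifts the 1 MiB message limit,
    # so large CDP payloads don't close the connection
    return await websockets.connect(cdp_url, max_size=None)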

@kaliiiiiiiiii
Contributor

@milahu Also, you might consider using threading at:

with zipfile.ZipFile(self.cache_filepath, "r") as z:
    z.extractall(self.path)
return

just to be safe for asyncio.
This applies as well to shutil and reading//writing files.
aiofiles might be worth considering here. It's a dependency for driverless anyways.
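
For example, the extraction could run in a worker thread (a sketch using asyncio.to_thread, available since Python 3.9):

import asyncio
import zipfile

async def extract_extension(cache_filepath, path):
    def extract():
        # the blocking disk I/O runs in a worker thread,
        # so it does not stall the event loop
        with zipfile.ZipFile(cache_filepath, "r") as z:
            z.extractall(path)
    await asyncio.to_thread(extract)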

@milahu
Owner Author

milahu commented Jan 16, 2024

Uhh you mean fetch the extension? Each time?
pretty sure versioning should be possible to implement.

no. the extension zip is already cached to self._extensions_cache_path
which by default is $HOME/.cache/aiohttp_chromium/extensions/

what takes so long is the update of ublock, visible as the orange ublock icon
when i send requests too early, ublock is not yet working
on update, ublock downloads the filter lists listed in uBlock/assets/assets.json

i have added caching of the extension state in 686b19f
now ublock starts in about 5 seconds (versus a 30 second cold start)

ublock options

Storage used: 23.7 MB

115,558 network filters + 44,343 cosmetic filters

this data is stored in levelDB databases in
{user_data_dir}/Default/Local Extension Settings/{ext_id}/

Suspend network activity until all filter lists are loaded

aka suspendUntilListsAreLoaded with default setting in js/background.js

you might consider using threading

i don't see how the unzip code could break anything
this runs sequentially to unpack extensions to user-data-dir

to be safe for asyncio

this could be more relevant when reading downloaded files in response.content etc
currently, this is just a "quickfix" solution
which is also not compatible with aiohttp's await response.content.read()
because currently response.content.read is a sync method

@property
def content(self):
    # TODO make this compatible with aiohttp.streams.StreamReader
    # based on io.BytesIO for in-memory data
    # or based on io.BufferedReader for files
    # to provide methods like
    # reader.iter_chunked
    # reader.iter_any
    logger.debug(f"ClientResponse.content")
    if self._content is None:
        if self._filepath:
            # don't use self._body
            # open -> io.BufferedReader
            # FIXME handle missing file
            # FIXME handle incomplete file
            self._content = open(self._filepath, "rb")
    return self._content

@kaliiiiiiiiii
Contributor

i dont see how the unzip code could break anything
this runs sequentially to unpack extensions to user-data-dir

For cases where the disk is slow and multiple Chrome instances are started, I suppose this could lead to long-blocking coroutines for asyncio.

@milahu
Owner Author

milahu commented Jan 17, 2024

sounds like low priority stuff
having write-locks for saving extensions state would be more important
or having atomic writes when moving downloaded files from tmpfs to disk

meanwhile, scraper goes brrr ; )

but opensubtitles.org is easy to scrape...
currently i'm handling 2K requests per day

[Screenshot: Screenshot_20240118_004624]

@ZakariaMQ

@kaliiiiiiiiii @milahu
just a little addition from me
the new headless mode of puppeteer has nearly the same fingerprint as a real Chrome browser
I was able to bypass many cloudflare-protected websites with it
and "Antoine Vastel, PhD, Head of Research at DataDome" admitted it in his last interview

adding some custom patches + some mouse movements to puppeteer can be a game changer

also, there is a new protocol besides CDP: WebDriver BiDi
introduced by Google, more info here: https://developer.chrome.com/blog/webdriver-bidi/

@milahu
Owner Author

milahu commented Apr 16, 2024

WebDriver BiDi

thanks for sharing the good news

WebDriver BiDi promises bi-directional communication, making it fast by default, and it comes packed with low-level control.

i hope they will finally implement hooking into http streams
so we can use chromium as a full http client
see also kaliiiiiiiiii/Selenium-Driverless#123 (comment)

see also

the new headless mode of puppeteer

i still prefer a headful chromium browser, running on my desktop machine
which allows (in theory) semi-automatic scraping, asking the user to solve captchas

i tried to run chromium in an xvnc server, to only show it when needed
but chromium in xvnc fails to bypass cloudflare
somehow the rendering in xvnc is slower than on the main desktop
i guess cloudflare wants to block exactly this use case
