lets collaborate #1
Sure, why not :)
kaliiiiiiiiii/Selenium-Driverless#140 will be resolved at some point for sure - and file downloading, html error pages etc. wouldn't be an issue to implement in driverless.
I assume you're talking about opensubtitles-scraper/blob/main/fetch-subs.py. What I have to note here tho:
Or did you mean that you'd like to use driverless as a base for this project?
if you want to see my current mess: [email protected]
maybe...
true, this would be more than a stupid http client. its a challenge to reduce this complexity into a few lines of code. the http client would need some model of the http server. the goal is to autosolve complex challenges like
also automate pagination (or infinite scroll). im pretty sure that something like this exists somewhere...
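Auto-pagination could be sketched as a small loop over a user-provided "next page" extractor. Everything below (`paginate`, `fetch_page`, `next_url_of`) is a hypothetical sketch, not an existing API:

```python
import asyncio

async def paginate(fetch_page, next_url_of, start_url, max_pages=100):
    """Hypothetical auto-pagination: follow "next" links until exhausted.

    fetch_page(url) -> page object; next_url_of(page) -> url or None.
    Both callables are assumptions, not part of any real library.
    """
    url = start_url
    pages = []
    while url and len(pages) < max_pages:
        page = await fetch_page(url)
        pages.append(page)
        url = next_url_of(page)
    return pages

# toy demo with an in-memory "site" instead of a real browser
async def demo():
    site = {"/p1": "/p2", "/p2": "/p3", "/p3": None}
    async def fetch_page(url):
        return {"url": url, "next": site[url]}
    pages = await paginate(fetch_page, lambda p: p["next"], "/p1")
    return [p["url"] for p in pages]

print(asyncio.run(demo()))  # ['/p1', '/p2', '/p3']
```

Infinite scroll would follow the same loop shape, with "next" replaced by a scroll-and-wait step that reports whether new content appeared.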
I'll send you an E-Mail.
I suppose that some basic wait for content load with bare CDP is a start at some point. Considerations might be:
the user would have to provide that model of the http server
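To make the idea of a user-provided server model concrete, here is one hypothetical shape it could take (all names invented for illustration, not part of aiohttp_chromium or driverless):

```python
from dataclasses import dataclass

@dataclass
class ServerModel:
    """Hypothetical declarative model of one http server / site."""
    base_url: str
    # CSS selector that signals the real content has loaded
    content_ready_selector: str = "body"
    # selectors that indicate a challenge / error page
    challenge_selectors: tuple = ("#challenge-form", ".cf-error-details")
    # how long to wait for content before declaring failure
    load_timeout: float = 30.0

    def looks_like_challenge(self, selectors_present: set) -> bool:
        # the client would probe the DOM and pass the selectors it found
        return any(s in selectors_present for s in self.challenge_selectors)

model = ServerModel(base_url="https://example.com")
print(model.looks_like_challenge({"#challenge-form"}))  # True
print(model.looks_like_challenge({"#content"}))         # False
```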
yepp, i have reinvented botasaurus
botasaurus uses selenium internally.
took a quick look at the code you've got so far. Stuff I notice here:
actually no, also
also scrapy has support for middlewares

```python
import asyncio

import aiohttp_chromium as aiohttp


async def main():
    async with aiohttp.ClientSession() as session:

        async def middleware_1(request, handler):
            print("middleware_1")
            request.headers["test"] = "hello"
            request.cookies["some_key"] = "some value"
            # send request, get response
            response = await handler(request)
            response.text = response.text + ' wink'
            return response

        args = dict(
            _middlewares=[
                middleware_1,
            ]
        )
        url = "http://httpbin.org/get"
        async with session.get(url, **args) as response:
            print(response.status)
            print(await response.text())


asyncio.run(main())
```

this is also useful to block requests
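Since each request passes through `(request, handler)` middlewares, a blocking middleware could short-circuit before the browser is ever involved. The sketch below uses self-contained stand-ins (`Request`, `Response`, `run_middlewares`) to show the pattern; these are not aiohttp_chromium's real internals:

```python
import asyncio

# Minimal stand-in request/response types, invented for this sketch.
class Request:
    def __init__(self, url):
        self.url = url
        self.headers = {}

class Response:
    def __init__(self, status, text=""):
        self.status = status
        self.text = text

BLOCKLIST = ("doubleclick.net", "googlesyndication.com")

async def block_ads(request, handler):
    # short-circuit: never forward blocked requests to the browser
    if any(host in request.url for host in BLOCKLIST):
        return Response(403, "blocked by middleware")
    return await handler(request)

async def run_middlewares(middlewares, request, final_handler):
    # fold the middleware list into one nested handler chain
    handler = final_handler
    for mw in reversed(middlewares):
        handler = (lambda m, nxt: lambda req: m(req, nxt))(mw, handler)
    return await handler(request)

async def demo():
    async def fetch(request):  # pretend this drives chromium
        return Response(200, f"content of {request.url}")
    ok = await run_middlewares([block_ads], Request("http://httpbin.org/get"), fetch)
    ad = await run_middlewares([block_ads], Request("http://ads.doubleclick.net/x"), fetch)
    return ok.status, ad.status

print(asyncio.run(demo()))  # (200, 403)
```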
the ads on many websites are "pretty invasive" too
im surprised that i have to add my

```python
args = dict(
    _chromium_config = {
        "bookmark_bar": {
            # disable bookmarks bar
            "show_on_all_tabs": False,
        },
    },
)
async with aiohttp.ClientSession(**args) as session:
```

generally, all chromium options will be exposed
with
i guess that creating an incognito window
currently the start is slow, because i wait 20 seconds for the ublock update
probably i will use this by default instead of
Uhh you mean fetch the extension? Each time?
good point
Yep, I'd propose that as well. Additionally, there's a maximum message size for python websockets (technically overridable)
@milahu Also, you might consider using threading at aiohttp_chromium/src/aiohttp_chromium/extensions.py (lines 121 to 123 in fc15ea6)
just to be safe for asyncio. This applies to shutil and reading/writing files as well. aiofiles might be worth considering here; it's a dependency for driverless anyways.
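A stdlib alternative to raw threading here is `asyncio.to_thread` (Python 3.9+), which runs a blocking call in a worker thread so it cannot stall the event loop. A sketch for the unzip step (the file names and layout are invented for the demo):

```python
import asyncio
import shutil
import tempfile
import zipfile
from pathlib import Path

async def extract_zip(zip_path: str, dest: str) -> None:
    # zipfile.ZipFile.extractall is blocking; run it in a worker thread
    # so other coroutines keep running while the disk works
    def _extract():
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
    await asyncio.to_thread(_extract)

async def demo():
    workdir = Path(tempfile.mkdtemp())
    zip_path = workdir / "ext.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("manifest.json", "{}")
    dest = workdir / "unpacked"
    await extract_zip(str(zip_path), str(dest))
    extracted = sorted(p.name for p in dest.iterdir())
    # shutil calls could be offloaded the same way
    shutil.rmtree(workdir)
    return extracted

print(asyncio.run(demo()))  # ['manifest.json']
```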
no. the extension zip is already cached to
what takes so long is the update of ublock, visible by the orange ublock icon
i have added caching of extensions state in 686b19f
ublock options
this data is stored in LevelDB databases in
aka
i dont see how the unzip code could break anything
this could be more relevant when reading downloaded files in aiohttp_chromium/src/aiohttp_chromium/client.py (lines 411 to 429 in fc15ea6)
For cases where the disk is slow and multiple Chrome instances are started, I suppose this could lead to long-blocking coroutines for asyncio.
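One way to avoid a single long-blocking read of a large download: read it in chunks and hand each chunk to a worker thread, so the loop can schedule other coroutines in between. This is a sketch; `aiofiles` offers the same effect as a dependency:

```python
import asyncio
import os
import tempfile

CHUNK_SIZE = 1 << 20  # 1 MiB

async def read_file_chunked(path: str) -> bytes:
    """Read a (possibly large) downloaded file without one long blocking call.

    Each chunk is read in a worker thread via asyncio.to_thread, so the
    event loop can run other coroutines between chunks.
    """
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = await asyncio.to_thread(f.read, CHUNK_SIZE)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)

async def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:
        f.write(b"x" * (3 * CHUNK_SIZE + 123))
    data = await read_file_chunked(path)
    os.remove(path)
    return len(data)

print(asyncio.run(demo()))  # 3145851
```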
@kaliiiiiiiiii @milahu adding some custom patches + some mouse movements to puppeteer can be a game changer. also, there is a new protocol besides CDP: WebDriver BiDi
thanks for sharing the good news
i hope they will finally implement hooking into http streams. see also
i still prefer a headful chromium browser, running on my desktop machine. i tried to run chromium in an xvnc server, to only show it when needed
@kaliiiiiiiiii this project is largely based on your Selenium-Driverless
would you be interested in collaboration?
(spoiler: i will use MIT license)
i have not yet found an actual "headful web scraper"
where i can simply remote-control an actual chromium browser
to allow "semi-automatic web scraping" (solving captchas, debugging error states)
so i created my own : )
so far, my code is unreleased
im using it in my opensubtitles-scraper to bypass cloudflare
so far, my code (fetch-subs.py) is really messy
and it will need some serious refactoring
from 8000 lines in one file, to modules and classes
my goal is to make chromium usable just like any other http client in python
as a drop-in replacement for aiohttp
i have a working prototype for handling file downloads (and html error pages)
but i guess that will be too complex / out of scope for Selenium-Driverless
see also Selenium-Driverless#140