ResourceBusy errors with curio #118
So I smashed some prints and stuff in haphazardly, like so:

```python
from urllib.parse import urljoin, urlparse

import asks
import curio
from bs4 import BeautifulSoup

ARCHIVES_URL = 'https://mail.python.org/pipermail/python-ideas/'


def make_soup(data):
    return BeautifulSoup(data, features="html.parser")
    # return BeautifulSoup(data, features="html5lib")


async def get_archive_index(session):
    resp = await session.post(ARCHIVES_URL)
    resp.raise_for_status()
    return resp.text


def parse_threads(archive_index):
    soup = make_soup(archive_index)
    return [
        urljoin(ARCHIVES_URL, th.attrs['href'])
        for th in soup.find_all('a', string='[ Thread ]')
    ]


def parse_emails(thread_index, month_url):
    soup = make_soup(thread_index)
    return [
        urljoin(month_url, em.attrs['href'])
        for em in soup.find_all('a', attrs={'name': True, 'href': True})
    ]


processed_email = False


async def process_email(email_url, session, sync):
    print('starting to process email')
    async with sync:
        print('email acquired sema')
        resp = await session.get(email_url)
        print('email got response')
        print("Email resp", resp.status, resp.content)
        global processed_email
        if not processed_email:
            print("process_email -> resp.text", resp.text[:10])
            processed_email = True
        resp.raise_for_status()
        print(resp.text)


processed_month = False


async def process_month(month_url, session, sync):
    async with sync:
        resp = await session.get(month_url)
        resp.raise_for_status()
        thread_index = resp.text
        global processed_month
        if not processed_month:
            print("process_month -> resp.text", resp.text[:10])
            processed_month = True
    emails = await curio.run_in_process(parse_emails, thread_index, month_url)
    if emails:
        print(emails)
    async with curio.TaskGroup() as tg:
        for email_url in emails:
            await tg.spawn(process_email, email_url, session, sync)


async def main():
    s = asks.Session(connections=100, persist_cookies=True)
    archive_data = await get_archive_index(s)
    print("main -> archive_data", archive_data[:10])
    archive_index = await curio.run_in_process(parse_threads, archive_data)
    sync = curio.Semaphore(value=1)
    async with curio.TaskGroup() as tg, s:
        for month_url in archive_index:
            await tg.spawn(process_month, month_url, s, sync)


if __name__ == '__main__':
    curio.run(main)
```

and as output we get:
No errors. Note, the email parsing doesn't seem to be working. |
Off the bat, try just using a regular old virtualenv and Python rather than Anaconda. I constantly see people with totally random errors using Anaconda plus otherwise normally functioning things. |
I am getting some errors, in addition to the prints, in my current environment:
I'll try to set up a reproducible environment and see if this fixes itself. |
FWIW I get the same error on an Ubuntu Docker image, with the system Python and the pip default versions of the packages. I also tried the git versions of both curio and asks.
|
Add a script to parse the emails, and find mentions of validphys reports, associating each report id with an email URL and title. Because there is no way to get an email URL from the email as received, we scan the HTML of the archives by crawling over each message in each month. The script tries to remove links that appear in quoted sections, but that only works if these have already been parsed as a `backquote` HTML element in the email archives. We use this information to create a link to the email in the index page, by adding an email emoji link to each email. It could also be used for other things, such as displaying the email in the template. One annoying aspect is that this is an embarrassingly parallel task (we could be processing emails while waiting for other emails to download), but I am hitting some bug I don't understand when trying to do this with curio and asks (theelous3/asks#118), so it will stay sequential for the moment. Because it is slow, we add a cache to remember already-seen emails. At the moment index-emails needs to be run independently from index-reports (I run it once a day), but that may not be optimal.
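The cache for already-seen emails mentioned above could be as simple as keying each email URL to a file on disk. A minimal sketch, assuming a local cache directory and a pluggable `fetch` callable (both hypothetical, not the actual validphys implementation):

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("email_cache")  # hypothetical cache location


def fetch_cached(url, fetch=lambda u: urllib.request.urlopen(u).read()):
    """Return the body for url, reusing an on-disk copy when present."""
    CACHE_DIR.mkdir(exist_ok=True)
    # Hash the URL so it is safe to use as a file name.
    path = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if path.exists():
        return path.read_bytes()
    data = fetch(url)
    path.write_bytes(data)
    return data
```

On a re-run, already-seen emails are read from disk instead of re-fetched, which makes a sequential crawl tolerable.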
Does this happen with other backends as well? Or only curio? |
As far as I can tell, the following should work (sorry, it requires BeautifulSoup); I want to crawl over each email in a mailing list and do something with it:
I am getting the following error, which I do not understand:
Playing with the value of the semaphore or the number of connections doesn't seem to change the result for me.
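To check whether the semaphore-around-the-request pattern is itself sound, here is the same shape in stdlib asyncio, with no curio or asks involved (`asyncio.sleep` stands in for the HTTP GET, so this sketches the concurrency pattern only):

```python
import asyncio


async def fetch_one(url, sem, results):
    # The semaphore serialises the "request", like curio.Semaphore above.
    async with sem:
        await asyncio.sleep(0)  # placeholder for session.get(url)
        results.append(url)


async def crawl(urls, limit=1):
    sem = asyncio.Semaphore(limit)
    results = []
    # Spawn one task per URL, like TaskGroup.spawn in the curio version.
    await asyncio.gather(*(fetch_one(u, sem, results) for u in urls))
    return results
```

If this pattern runs cleanly, that would suggest the ResourceBusy error comes from the interaction between the asks connection pool and curio rather than from the semaphore pattern itself.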