I encountered an error (see below) on the same line. In my case, I think the issue is due to a race condition somewhere in the Java code. Try running your Celery worker with --concurrency=1 and see if it works. I don't have a solution for this.
...
extractor = Extractor(extractor='ArticleExtractor', html=html)
File "/usr/local/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 62, in __init__
self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
Exception: <jpype._jclass.java.lang.NoClassDefFoundError object at 0x126ebb250>
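If you would rather pin this in configuration than pass the flag each time, a minimal sketch follows. It assumes the settings live in the harvest/celeryconfig.py module that the issue loads via config_from_object, and that you are on Celery 4 or newer, where the setting is called worker_concurrency (on Celery 3.x the equivalent name is CELERYD_CONCURRENCY). This only serializes the tasks as a diagnostic; it does not fix the underlying problem.

```python
# harvest/celeryconfig.py
# Assumption: this is the module loaded by
# app.config_from_object('harvest.celeryconfig') in the issue.

# Run a single worker process so tasks cannot execute concurrently.
# Celery 4+ setting name; on Celery 3.x use CELERYD_CONCURRENCY = 1 instead.
worker_concurrency = 1
```

If the timeouts disappear with a single worker process, that would point towards a concurrency problem rather than a problem with the URL itself.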
Hi,
I have a rather urgent problem that I hope you can help me with.
I'm trying to parse URLs/HTML via boilerpipe and Celery. It is straightforward stuff: giving a task to a Celery worker. However, some links work and some don't.
If I call call_txt_extr with the URL 'http://t.co/XIDUuUIjPi', it does not work: the task never returns and ends in a "soft" followed by a "hard" timeout in Celery.
If I do the same thing with the URL 'http://www.rezmanagement.nl', it works perfectly.
code:
from celery import Celery
from boilerpipe.extract import Extractor
from harvest.celery import app
app.config_from_object('harvest.celeryconfig')
def call_txt_extr():
@app.task
def Extract_Text():
I've tried everything short of editing the Java code and found the following:
The task (boilerpipe) stops at around line 70 of the Extractor (__init__.py), at:
"self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()"
It simply never returns the parsed text, and then the task times out.
Please note that it works perfectly with some URLs within Celery, while others time out.
If I remove the Celery decorator (so the task is no longer executed by Celery), it works perfectly, so the URL is fine (the Extractor can deal with the HTML, etc.).
If I define a Celery task class, configure the task to inherit from that class, and run the Extractor call from the class, it works in Celery.
However, this is not the way to call the Extractor. Furthermore, since the Extractor needs input, I would be polling for the same URL at every function call, which is highly undesirable and not how it is supposed to work.
So this works, but I think it is not good code and highly undesirable:
class taskclass(celery.Task):
def call_txt_extr():
@app.task (base=taskclass)
def Extract_Text():
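One more thing that could be tried, to check whether the hang is tied to the worker process itself, is to run each extraction in a fresh Python child process with its own timeout. This is only a sketch, not something I have verified with boilerpipe: it assumes Python 3.7+, run_in_child and CHILD_CODE are hypothetical names, and the Extractor(extractor=..., url=...) / getText() calls follow the python-boilerpipe README.

```python
import subprocess
import sys

# Hypothetical child script: does the extraction in a brand-new interpreter,
# so the JVM is started fresh and never shared with the Celery worker.
CHILD_CODE = (
    "import sys\n"
    "from boilerpipe.extract import Extractor\n"
    "extractor = Extractor(extractor='ArticleExtractor', url=sys.argv[1])\n"
    "print(extractor.getText())\n"
)


def run_in_child(code, arg, timeout=30.0):
    """Run `code` in a separate Python process, passing `arg` as sys.argv[1].

    Returns the child's stdout, or None if it did not finish within
    `timeout` seconds (the child is killed in that case).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code, arg],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        # Surface the child's traceback instead of silently returning nothing.
        raise RuntimeError(proc.stderr)
    return proc.stdout
```

Inside the task body, this would be called as run_in_child(CHILD_CODE, url). If the URLs that time out under Celery succeed this way, the problem would appear to lie in how the JVM behaves inside the worker process rather than in the HTML. Starting a new interpreter per call is slow, so this is meant as a diagnostic rather than a final design.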
I hope you can help me,
Kindest regards,
Roland Zoet