some urls will not work with celery #28

Open · Zman67 opened this issue Nov 24, 2014 · 2 comments

Comments


Zman67 commented Nov 24, 2014

Hi,

I have a rather urgent problem that I hope you can help me with.
I'm trying to parse URLs/HTML via boilerpipe and celery: straightforward stuff, handing a task to a celery worker. However, some links work and some don't.
If I call call_txt_extr with the URL 'http://t.co/XIDUuUIjPi', it never returns and runs into a "soft" followed by a "hard" timeout in celery.
If I do the same thing with the URL 'http://www.rezmanagement.nl', it works perfectly.

code:

from celery import Celery

from boilerpipe.extract import Extractor
from harvest.celery import app

app.config_from_object('harvest.celeryconfig')

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return
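A variant of call_txt_extr that passes the limits as execution options to apply_async, instead of setting attributes on the task function, would be (a sketch only; I have not verified whether it changes the outcome):

def call_txt_extr():
    # Pass the time limits per call as standard celery execution
    # options instead of mutating the task object after definition.
    Extract_Text.apply_async(soft_time_limit=10, time_limit=15)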

I've tried everything short of editing the Java code and found the following:

  1. The task / boilerpipe stops working at around line 70 in the Extractor (__init__.py):
     "self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()"
     It simply doesn't give back the parsed text, and then the task times out.

  2. Please understand: it works perfectly with some URLs within celery, while others time out.
     If I remove the celery decorator (thus no longer having the task executed by celery), it works perfectly, so the URL is fine (Extractor can deal with the HTML etc.).

  3. If I define a celery task class, configure the task to inherit from that class, and run the Extractor call from the class body, this works in celery.
     However, this is not the way to call the Extractor. Furthermore, since the Extractor needs input, I would be fetching the same URL at every function call, which is highly unwanted and not how it is supposed to work.

So: the following works, but it is not good code and highly unwanted, I think:

import celery

class taskclass(celery.Task):
    # This runs at class-definition time (when the worker imports the
    # module), not once per task invocation.
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl'
    extractorType = "DefaultExtractor"
    print Extractor(extractor=extractorType, url=URL).getText()

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task(base=taskclass)
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return
Besides that, I have:

  1. updated JPype1,
  2. updated nekohtml,
  3. been unable to find any other instance of this problem on the internet.
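One more idea (a sketch only, based on my assumption that a JVM started in the parent process does not survive the fork into celery's prefork worker children) would be to delay importing boilerpipe, and thus starting the JVM, until each worker child is up:

from celery.signals import worker_process_init

Extractor = None

@worker_process_init.connect
def load_boilerpipe(**kwargs):
    # Import boilerpipe inside the forked worker child, so each child
    # starts its own JVM instead of inheriting one from the parent.
    global Extractor
    from boilerpipe.extract import Extractor as _Extractor
    Extractor = _Extractor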

I hope you can help me,

Kindest regards,

Roland Zoet


andreip commented Nov 13, 2015

I encountered an error (see below) on the same line; in my case I think the issue is due to a race condition somewhere in the Java code. Try running your celery worker with --concurrency=1 and see if it works. I don't have a solution for this.

...
    extractor = Extractor(extractor='ArticleExtractor', html=html)
  File "/usr/local/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 62, in __init__
    self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
Exception: <jpype._jclass.java.lang.NoClassDefFoundError object at 0x126ebb250>
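
For reference, by that I mean starting the worker with a single process (assuming your app module is harvest, as in your snippet):

celery -A harvest worker --concurrency=1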

@korycins

@Zman67 Did you find any solutions?
