some urls will not work with celery #28

Open · Zman67 opened this issue Nov 24, 2014 · 2 comments

Comments


Zman67 commented Nov 24, 2014

Hi,

I have a rather urgent problem that I hope you can help me with.
I'm trying to parse URLs/HTML via boilerpipe and celery: straightforward stuff, handing a task to a celery worker. However, some links work and some don't.
If I call call_txt_extr with the URL 'http://t.co/XIDUuUIjPi', it never returns and runs into a "soft" followed by a "hard" timeout in celery.
If I do the same thing with the URL 'http://www.rezmanagement.nl', it works perfectly.

code:

from celery import Celery

from boilerpipe.extract import Extractor
from harvest.celery import app

app.config_from_object('harvest.celeryconfig')

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return
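A variant of call_txt_extr that passes the limits as execution options to apply_async, instead of setting attributes on the task function, would be (a sketch only; I have not verified whether it changes the outcome):

def call_txt_extr():
    # Pass the time limits per call as standard celery execution
    # options instead of mutating the task object after definition.
    Extract_Text.apply_async(soft_time_limit=10, time_limit=15)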

I've tried everything short of editing the Java code and found the following:

  1. The task / boilerpipe stops working at around line 70 in the Extractor (__init__.py):
     "self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()"
     It simply doesn't give back the parsed text, and then the task times out.

  2. Please understand: it works perfectly with some URLs within celery, while others time out.
     If I remove the celery decorator (thus no longer having the task executed by celery), it works perfectly, so the URL is fine (Extractor can deal with the HTML etc.).

  3. If I define a celery task class, configure the task to inherit from that class, and run the Extractor call from the class body, this works in celery.
     However, this is not the way to call the Extractor. Furthermore, since the Extractor needs input, I would be fetching the same URL at every function call, which is highly unwanted and not how it is supposed to work.

So: the following works, but it is not good code and highly unwanted, I think:

import celery

class taskclass(celery.Task):
    # This runs at class-definition time (when the worker imports the
    # module), not once per task invocation.
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl'
    extractorType = "DefaultExtractor"
    print Extractor(extractor=extractorType, url=URL).getText()

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task(base=taskclass)
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return
Besides that, I have:

  1. updated JPype1,
  2. updated nekohtml,
  3. been unable to find any other instance of this problem on the internet.
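One more idea (a sketch only, based on my assumption that a JVM started in the parent process does not survive the fork into celery's prefork worker children) would be to delay importing boilerpipe, and thus starting the JVM, until each worker child is up:

from celery.signals import worker_process_init

Extractor = None

@worker_process_init.connect
def load_boilerpipe(**kwargs):
    # Import boilerpipe inside the forked worker child, so each child
    # starts its own JVM instead of inheriting one from the parent.
    global Extractor
    from boilerpipe.extract import Extractor as _Extractor
    Extractor = _Extractor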

I hope you can help me,

Kindest regards,

Roland Zoet


andreip commented Nov 13, 2015

I encountered an error (see below) on the same line; in my case I think the issue is due to a race condition somewhere in the Java code. Try running your celery worker with --concurrency=1 and see if it works. I don't have a solution for this.

...
    extractor = Extractor(extractor='ArticleExtractor', html=html)
  File "/usr/local/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 62, in __init__
    self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
Exception: <jpype._jclass.java.lang.NoClassDefFoundError object at 0x126ebb250>
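
For reference, by that I mean starting the worker with a single process (assuming your app module is harvest, as in your snippet):

celery -A harvest worker --concurrency=1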

@korycins

@Zman67 Did you find any solutions?
