Encoding Issues - UnicodeDecodeError: 'utf8' codec can't decode byte #19

Open
jimishjoban opened this issue Feb 12, 2014 · 1 comment

@jimishjoban

Hey guys,

First of all, thanks for python-boilerpipe.

I'm trying to use Boilerpipe, but it fails to extract some documents:

from boilerpipe.extract import Extractor
extractorType = "DefaultExtractor"
sourceUrl = 'http://www.indiatimes.com/news/india/arvind-kejriwal-to-seek-political-sanyas-127620.html'
extractor = Extractor(extractor=extractorType, url=sourceUrl)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/boilerpipe/extract/__init__.py", line 41, in __init__
    self.data = unicode(self.data, encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 53647: invalid start byte

The page is being decoded as UTF-8 but seems to contain some non-UTF-8 bytes (0x91 here), so decoding fails. Is there a workaround for this?

@Caimany

Caimany commented Jun 1, 2015

I solved the UnicodeDecodeError. You can see what I modified in __init__.py:
https://github.com/Caimany/python-boilerpipe/blob/master/src/boilerpipe/extract/__init__.py
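
For anyone who cannot patch the package, here is a minimal sketch of one possible workaround. It is not necessarily what the linked fork does. The idea is to fetch the page yourself and decode the bytes leniently before handing the text to Boilerpipe. The helper name decode_html and the choice of cp1252 as a fallback are assumptions for illustration. Byte 0x91 is a curly opening quote in Windows-1252, which is a common cause of this error on pages that claim to be UTF-8. The sketch is written for Python 3 and would need adjusting for the Python 2.7 setup in the traceback.

```python
def decode_html(raw, declared_encoding="utf-8", fallbacks=("cp1252",)):
    """Decode raw page bytes, tolerating a wrong declared charset.

    Hypothetical helper, not part of python-boilerpipe.
    """
    # Try the declared encoding first, then each fallback in order.
    for encoding in (declared_encoding,) + tuple(fallbacks):
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Bytes 0x91 and 0x92 are curly single quotes in Windows-1252.
sample = b"He said \x91hello\x92"
text = decode_html(sample)  # "He said \u2018hello\u2019"

# The decoded text could then be passed to Boilerpipe directly,
# assuming the Extractor accepts an html= keyword as the project
# README describes:
#   extractor = Extractor(extractor="DefaultExtractor", html=text)
```

Decoding up front means the library never has to guess the charset, at the cost of possibly mis-rendering a few characters when the fallback is also wrong.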
