Improve feed parser robustness #13

Open
sebastian-nagel opened this issue Nov 23, 2016 · 4 comments

@sebastian-nagel
Collaborator

As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors that should not break a robust feed parser and mostly do not affect the extraction of links:

  • (35 feeds) unknown entities such as &lsquo; or &uacute;
2016-11-22 16:21:14.949 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://rakurs.rovno.ua/news.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 282:
  The entity "lsquo" was referenced, but not declared.
2016-11-22 16:18:18.177 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.diariolaestrella.com/150/index.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 17:
  The entity "uacute" was referenced, but not declared.
2016-11-22 16:19:35.721 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.iltalehti.fi/rss/rss.xml: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 66:
  The entity "euro" was referenced, but not declared.
  • (20 feeds) single ampersands
2016-11-22 16:18:07.643 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.northerniowan.com/feed/atom/: com.rometools.rome.io.ParsingFeedException: Invalid XML:
  Error on line 84: The entity name must immediately follow the '&' in the entity reference.
  • RSS extensions
2016-11-22 18:20:14.535 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.amurpravda.ru/rss/news.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 20:
  The prefix "yandex" for element "yandex:full-text" is not bound.
  • leading newlines / white space / BOMs
2016-11-22 16:20:12.279 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.pixelmonsters.de/feed/gamenews.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 2:
  The processing instruction target matching "[xX][mM][lL]" is not allowed.
  • NPEs (!)
2016-11-22 16:18:53.325 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://chestertontribune.com/rss.xml: java.lang.NullPointerException
2016-11-22 16:18:21.004 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://atv.at/atom.xml: java.lang.NullPointerException
2016-11-22 16:27:35.163 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml: java.lang.NullPointerException
  • encoding issues
2016-11-22 16:20:33.593 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://newamericamedia.org/atom.xml: com.rometools.rome.io.ParsingFeedException:
  Invalid XML: Error on line 534: Invalid byte 2 of 3-byte UTF-8 sequence.
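
The undeclared-entity and single-ampersand failures listed above could be handled by a text-level pass before the feed reaches the XML parser. The following is only a sketch under stated assumptions: `FeedEntitySanitizer`, `sanitize` and the entity table are hypothetical names for illustration (not part of Rome or StormCrawler), the table holds just a handful of HTML entities, and the feed is assumed to be already decoded into a `String`.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Rome or StormCrawler.
final class FeedEntitySanitizer {

    // The five entities predefined by XML; they are left untouched.
    private static final Set<String> XML_PREDEFINED =
            Set.of("amp", "lt", "gt", "quot", "apos");

    // Deliberately tiny sample of HTML entities (assumption: a real table
    // would cover the complete HTML entity list).
    private static final Map<String, Integer> HTML_ENTITIES = Map.of(
            "lsquo", 0x2018, "rsquo", 0x2019,
            "uacute", 0x00FA, "euro", 0x20AC, "nbsp", 0x00A0);

    // An '&' optionally followed by a numeric or named reference body.
    private static final Pattern REFERENCE = Pattern.compile(
            "&(#[0-9]+;|#[xX][0-9A-Fa-f]+;|[A-Za-z][A-Za-z0-9]*;)?");

    // Limitation: CDATA sections are not recognised, so text inside them
    // is rewritten as well (the result stays well-formed).
    static String sanitize(String xml) {
        Matcher m = REFERENCE.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1); // null when the '&' stands alone
            String replacement;
            if (body == null) {
                replacement = "&amp;"; // bare ampersand: escape it
            } else if (body.startsWith("#")) {
                replacement = m.group(); // numeric reference: keep
            } else {
                String name = body.substring(0, body.length() - 1);
                if (XML_PREDEFINED.contains(name)) {
                    replacement = m.group(); // known to every XML parser
                } else if (HTML_ENTITIES.containsKey(name)) {
                    // Known HTML entity: rewrite as a numeric reference.
                    replacement = "&#x" + Integer.toHexString(
                            HTML_ENTITIES.get(name)).toUpperCase() + ";";
                } else {
                    // Unknown name: escape the '&' so it becomes plain text.
                    replacement = "&amp;" + body;
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<title>&lsquo;A & B&rsquo; 5 &euro;</title>"));
    }
}
```

Rewriting named entities as numeric references avoids the need for a DTD, and escaping unknown names keeps the text visible instead of failing the whole feed.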

This issue serves as an umbrella to track existing feed parser problems and address them step by step.
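
For the leading white space / BOM and invalid UTF-8 cases from the list above, a byte-level cleanup before parsing might be enough. This is a sketch, not the project's code: `FeedBytesCleaner` and its methods are hypothetical names, and it assumes the feed is meant to be UTF-8. A real implementation would first honour the charset from the HTTP headers or the XML declaration, since an "invalid byte" error can also mean the feed is in a different encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Rome or StormCrawler.
final class FeedBytesCleaner {

    // Decodes as UTF-8 (assumption). This String constructor replaces
    // malformed byte sequences with U+FFFD instead of throwing.
    static String decodeLeniently(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    // Drops BOM characters and white space in front of the document so
    // that the XML declaration, if present, is at the very start.
    static String stripLeadingJunk(String xml) {
        int start = 0;
        while (start < xml.length()) {
            char c = xml.charAt(start);
            if (c == '\uFEFF' || Character.isWhitespace(c)) {
                start++;
            } else {
                break;
            }
        }
        return xml.substring(start);
    }

    static String clean(byte[] content) {
        return stripLeadingJunk(decodeLeniently(content));
    }

    public static void main(String[] args) {
        byte[] raw = "\n  <?xml version=\"1.0\"?><rss/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(clean(raw));
    }
}
```

The cleaned string can then be passed to the parser through a `Reader`, in which case the encoding named in the XML declaration no longer matters.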

@jnioche jnioche self-assigned this Nov 23, 2016
@jnioche
Contributor

jnioche commented Nov 23, 2016

Thanks @sebastian-nagel this is very useful.

@jnioche
Contributor

jnioche commented Nov 23, 2016

NPE

http://chestertontribune.com/rss.xml contains an item without a link, which causes the NPE:

<item>
<title>
http://chestertontribune.com/Sports/state_park_little_league_registr.htm
</title>
<pubDate>Tue, 19 Feb 2013 20:42:24 GMT</pubDate>
</item> 

https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml is a bit different in that it uses guid instead of link. I'll modify the code so that we take the guid in the absence of a link.
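
A rough sketch of that fallback, using only the JDK's DOM parser rather than the Rome classes the bolt actually works with: `ItemUrlResolver`, `resolveUrl` and `parseItem` are hypothetical names, and this is not the code from the commit referenced below. An item with neither element yields an empty `Optional`, so the caller can skip it instead of hitting an NPE. The sketch also ignores the `isPermaLink` attribute, which RSS 2.0 uses to mark a guid that is not a URL.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not the actual StormCrawler fix.
final class ItemUrlResolver {

    // Prefer <link>; fall back to <guid>; empty if neither has text.
    static Optional<String> resolveUrl(Element item) {
        return childText(item, "link").or(() -> childText(item, "guid"));
    }

    private static Optional<String> childText(Element item, String tag) {
        NodeList nodes = item.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return Optional.empty();
        }
        String text = nodes.item(0).getTextContent().trim();
        return text.isEmpty() ? Optional.empty() : Optional.of(text);
    }

    // Parses a single <item> fragment; only used to feed the example.
    static Element parseItem(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed item", e);
        }
    }

    public static void main(String[] args) {
        Element item = parseItem(
                "<item><guid>http://example.com/story</guid></item>");
        System.out.println(resolveUrl(item).orElse("(no URL, skip item)"));
    }
}
```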

Fixed in apache/incubator-stormcrawler@cafaf3a

@jnioche
Contributor

jnioche commented Nov 28, 2016

Note: I just upgraded Rome-Tools to 1.7.0 in apache/incubator-stormcrawler@4832c98

@sebastian-nagel
Collaborator Author

Alternatively, I am thinking about using the sitemap parser (based on crawler-commons) to parse the feeds. It also exposes the important parts (URL and publication date). I'll try to evaluate both parsers on a larger test set.
