Improve feed parser robustness #13

sebastian-nagel · 2016-11-23T10:43:43Z

As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors which should not break a robust feed parser and which mostly do not affect the extraction of links:

(35 feeds) unknown entities &lsquo; or &uacute; etc.

2016-11-22 16:21:14.949 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://rakurs.rovno.ua/news.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 282:
  The entity "lsquo" was referenced, but not declared.
2016-11-22 16:18:18.177 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.diariolaestrella.com/150/index.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 17:
  The entity "uacute" was referenced, but not declared.
2016-11-22 16:19:35.721 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.iltalehti.fi/rss/rss.xml: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 66:
  The entity "euro" was referenced, but not declared.
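One possible workaround, sketched here under assumptions, is to rewrite HTML named entities as numeric character references before the content reaches the XML parser. The class name `HtmlEntityFixer` and the tiny entity table are hypothetical and only cover the entities seen in the logs above; a real table would need the full HTML entity set. The sketch also rewrites matches inside CDATA sections, where the text is literal, so it is not safe for every feed.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlEntityFixer {
    // Illustrative subset only: entity name -> Unicode code point.
    private static final Map<String, Integer> HTML_ENTITIES = Map.of(
            "lsquo", 0x2018,
            "rsquo", 0x2019,
            "uacute", 0xFA,
            "euro", 0x20AC,
            "nbsp", 0xA0);

    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known HTML entities with numeric references; leaves all others untouched. */
    static String replaceHtmlEntities(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            // Unknown names (including the predefined XML ones such as amp) stay as they are.
            String replacement = codePoint == null ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Numeric references are used instead of the literal characters so that the result does not depend on the encoding the feed is later serialized in.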

(20 feeds) single ampersands

2016-11-22 16:18:07.643 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.northerniowan.com/feed/atom/: com.rometools.rome.io.ParsingFeedException: Invalid XML:
  Error on line 84: The entity name must immediately follow the '&' in the entity reference.
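A bare ampersand could be repaired before parsing by escaping every `&` that does not start something shaped like an entity or character reference. This is a minimal sketch, not what Rome or StormCrawler does; the class name `BareAmpersandFixer` is hypothetical, and the regex also touches ampersands inside CDATA sections, where they are legal.

```java
import java.util.regex.Pattern;

final class BareAmpersandFixer {
    // An '&' NOT followed by "name;", "#123;" or "#x1F;" is treated as a bare ampersand.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    /** Escapes bare ampersands as &amp; and leaves existing references alone. */
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }
}
```

For example, `escapeBareAmpersands("Arts & Culture")` yields `Arts &amp; Culture`, while an already escaped `&amp;` is left unchanged.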

RSS extensions

2016-11-22 18:20:14.535 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.amurpravda.ru/rss/news.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 20:
  The prefix "yandex" for element "yandex:full-text" is not bound.
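The unbound-prefix error only arises when the parser is namespace-aware. As a rough illustration, assuming that only the links are needed, the JDK's DOM parser with namespace awareness switched off accepts such a feed. The class `NamespaceTolerantLinks` is hypothetical and bypasses Rome entirely, so it shows the parser behaviour rather than a fix for Rome.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespaceTolerantLinks {
    /** Returns the text of all <link> elements, ignoring namespace declarations. */
    static List<String> extractLinks(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without namespace processing, "yandex:full-text" is just an element name.
            factory.setNamespaceAware(false);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList links = doc.getElementsByTagName("link");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                result.add(links.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Feed could not be parsed", e);
        }
    }
}
```

The trade-off is that prefixed elements from properly declared extensions are then matched by their raw qualified name instead of by namespace URI.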

leading newlines / white space / BOMs

2016-11-22 16:20:12.279 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.pixelmonsters.de/feed/gamenews.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 2:
  The processing instruction target matching "[xX][mM][lL]" is not allowed.
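This error appears when anything, even a newline, precedes the `<?xml ...?>` declaration. A minimal sketch of a pre-processing step, assuming the content has already been decoded to a string, is to drop a leading BOM character and leading white space. The class name `LeadingJunkStripper` is hypothetical; a BOM at the byte level would have to be handled before decoding.

```java
final class LeadingJunkStripper {
    /** Removes leading U+FEFF (BOM) characters and white space from the content. */
    static String stripLeadingJunk(String content) {
        int start = 0;
        while (start < content.length()) {
            char c = content.charAt(start);
            if (c == '\uFEFF' || Character.isWhitespace(c)) {
                start++;
            } else {
                break;
            }
        }
        return content.substring(start);
    }
}
```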

NPEs (!)

2016-11-22 16:18:53.325 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://chestertontribune.com/rss.xml: java.lang.NullPointerException
2016-11-22 16:18:21.004 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://atv.at/atom.xml: java.lang.NullPointerException
2016-11-22 16:27:35.163 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml: java.lang.NullPointerException

encoding issues

2016-11-22 16:20:33.593 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://newamericamedia.org/atom.xml: com.rometools.rome.io.ParsingFeedException:
  Invalid XML: Error on line 534: Invalid byte 2 of 3-byte UTF-8 sequence.
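Invalid byte sequences could be tolerated by decoding the bytes leniently before parsing, so that a malformed sequence becomes the replacement character U+FFFD instead of aborting the parse. The sketch below assumes the feed is meant to be UTF-8; the class name `LenientUtf8` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientUtf8 {
    /** Decodes bytes as UTF-8, replacing malformed sequences with U+FFFD. */
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected, because both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }
}
```

The decoded string can then be handed to the parser through a `Reader`, which avoids the parser's own strict byte decoding. The damaged characters are lost, but link extraction is usually unaffected.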

This issue serves as an umbrella to track existing feed parser problems and address them step by step:

reproduce problems in isolation, e.g., add unit tests to SC's FeedParserBoltTest
upgrade Rome library and test again
open issues for Rome or SC


jnioche · 2016-11-23T12:38:14Z

Thanks @sebastian-nagel this is very useful.

jnioche · 2016-11-23T21:19:05Z

NPE

http://chestertontribune.com/rss.xml contains an item without link, which causes the NPE

<item>
<title>
http://chestertontribune.com/Sports/state_park_little_league_registr.htm
</title>
<pubDate>Tue, 19 Feb 2013 20:42:24 GMT</pubDate>
</item>

https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml is a bit different in that it uses guid instead of link. I'll modify the code so that we take the guid in the absence of a link.

Fixed in apache/incubator-stormcrawler@cafaf3a
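The fallback described above can be sketched roughly as follows. This is not the code from the commit; `ItemUrlResolver` and `resolveUrl` are hypothetical names, and the link and guid values are assumed to have been read from the item already. Note that a guid is not always a URL (RSS allows `isPermaLink="false"`), so the result may still need validation.

```java
import java.util.Optional;

final class ItemUrlResolver {
    /** Prefers the item's link; falls back to the guid; empty if neither is usable. */
    static Optional<String> resolveUrl(String link, String guid) {
        if (link != null && !link.trim().isEmpty()) {
            return Optional.of(link.trim());
        }
        if (guid != null && !guid.trim().isEmpty()) {
            return Optional.of(guid.trim());
        }
        // Items with neither link nor guid are skipped by the caller instead of causing an NPE.
        return Optional.empty();
    }
}
```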

jnioche · 2016-11-28T12:37:26Z

Note: just upgraded Rome-Tools to 1.7.0 in apache/incubator-stormcrawler@4832c98

sebastian-nagel · 2018-03-28T19:25:47Z

Alternatively, I'm thinking about using the sitemap parser (based on crawler-commons) to parse the feeds. It also exposes the important parts (URL and publication date). I'll try to evaluate both parsers on a larger test set.

jnioche self-assigned this Nov 23, 2016
