Improve feed parser robustness #13
Comments
Thanks @sebastian-nagel, this is very useful.
NPE: http://chestertontribune.com/rss.xml contains an item without a link, which causes the NPE.
https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml is a bit different in that it uses guid instead of link. I'll modify the code so that we take the guid in the absence of a link.
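A minimal sketch of the link-then-guid fallback described above, written against plain strings rather than the Rome API so that it stays self-contained. The class and method names (`FeedItemUrl`, `resolve`) are hypothetical and are not taken from the StormCrawler code base. The check that the guid looks like an absolute http(s) URL is an added assumption: an RSS guid is not guaranteed to be a URL, so using it blindly could produce invalid outlinks.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical helper, not part of StormCrawler or Rome.
final class FeedItemUrl {

    private FeedItemUrl() {
    }

    /**
     * Returns the item's link if it is non-blank; otherwise returns the guid,
     * but only when the guid parses as an absolute http(s) URL.
     * Returns an empty Optional when neither value is usable, so callers
     * can skip the item instead of hitting a NullPointerException.
     */
    static Optional<String> resolve(String link, String guid) {
        if (link != null && !link.trim().isEmpty()) {
            return Optional.of(link.trim());
        }
        if (isHttpUrl(guid)) {
            return Optional.of(guid.trim());
        }
        return Optional.empty();
    }

    // Assumption: only http and https guids are treated as crawlable URLs.
    private static boolean isHttpUrl(String value) {
        if (value == null || value.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(value.trim());
            String scheme = uri.getScheme();
            boolean http = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            return http && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

With this shape, an item that has neither a link nor a URL-like guid yields an empty result and can simply be skipped, rather than aborting the parse of the whole feed.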
Note: just upgraded Rome-Tools to 1.7.0 in apache/incubator-stormcrawler@4832c98.
Alternatively, I'm thinking about using the sitemap parser (based on crawler-commons) to parse the feeds. The important parts (URL and publication date) are also made available by the sitemap parser. I'll try to evaluate both parsers on a larger test set.
As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors which should not break a robust feed parser and mostly do not affect the extraction of links:
‘ or ú, etc.

This issue is used as an umbrella to track existing feed parser problems and address them step by step: