Extract publishing date #18

Open
fhamborg opened this issue May 29, 2017 · 2 comments

Comments

@fhamborg

It would be great if you could also extract the date on which an article was published. Currently, getting that information requires parsing the web page with tools such as newspaper3k. However, at least some sites already expose it during the crawling process, e.g. as the timestamp within the RSS feed
<pubDate>Thu, 25 Dec 2014 02:10:00 +0900</pubDate>
or within the sitemap
<news:publication_date>2016-12-09T16:18:48Z</news:publication_date>
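
To illustrate that both timestamp formats are straightforward to normalize, here is a minimal Python sketch (not the crawler's own code). The function names `parse_pub_date` and `parse_publication_date` are hypothetical; the sketch assumes feed dates follow the RFC 822 style and sitemap dates follow the W3C/ISO 8601 style, as in the two examples above.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_pub_date(value: str) -> datetime:
    """Parse an RSS <pubDate> value (RFC 822 style)."""
    return parsedate_to_datetime(value.strip())


def parse_publication_date(value: str) -> datetime:
    """Parse a sitemap <news:publication_date> value (W3C/ISO 8601 style)."""
    text = value.strip()
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


rss = parse_pub_date("Thu, 25 Dec 2014 02:10:00 +0900")
sitemap = parse_publication_date("2016-12-09T16:18:48Z")
print(rss.isoformat())      # 2014-12-25T02:10:00+09:00
print(sitemap.isoformat())  # 2016-12-09T16:18:48+00:00
```

Both results are timezone-aware `datetime` objects, so they can be compared directly regardless of which source they came from.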

@sebastian-nagel
Collaborator

Status update:

  • <pubDate> (feeds) and <lastmod> (sitemaps) are now used to reject news articles older than 30 days.
  • TODO:
    • add support for <news:publication_date>
    • pass this info from feed/sitemap forward and add it to the WARC record
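
The age check described in the first bullet could look roughly like the following Python sketch. It is an illustration only, not the project's implementation: `is_too_old` and `MAX_AGE` are hypothetical names, and treating dates without a timezone as UTC is an assumption made here.

```python
from datetime import datetime, timedelta, timezone

# Assumed to mirror the 30-day limit mentioned above.
MAX_AGE = timedelta(days=30)


def is_too_old(published: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the article was published more than max_age before now."""
    # Assumption: a date without timezone information is interpreted as UTC.
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)
    return now - published > max_age


now = datetime(2017, 6, 1, tzinfo=timezone.utc)
recent = datetime(2017, 5, 29, 12, 0, tzinfo=timezone.utc)
old = datetime(2016, 12, 9, 16, 18, 48, tzinfo=timezone.utc)

print(is_too_old(recent, now))  # False
print(is_too_old(old, now))     # True
```

Passing `now` explicitly keeps the check deterministic and easy to test.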

@sebastian-nagel
Collaborator

The project now uses crawler-commons 1.0, which brings full support for all sitemap extensions, including news sitemaps. The <news:publication_date> is now used to skip older news articles (with the current configuration, those older than 30 days).
Next steps to implement would be:

  • make FeedParserBolt and NewsSiteMapParserBolt store the information in the metadata of the found links. This should be configurable so that also other details
  • write the information from the metadata into a WARC metadata record.
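
As a language-neutral illustration of these two steps, here is a minimal Python sketch; it does not use the StormCrawler or WARC-writer APIs. The names `build_link_metadata` and `to_warc_fields`, the field names `publication-date` and `source`, and the choice of "name: value" lines terminated by CRLF (the layout of `application/warc-fields` payloads) are all assumptions made for this example.

```python
def build_link_metadata(details: dict, enabled_keys: set) -> dict:
    """Keep only the configured, non-empty details for a discovered link."""
    return {
        key: value
        for key, value in details.items()
        if key in enabled_keys and value is not None
    }


def to_warc_fields(metadata: dict) -> str:
    """Render metadata as 'name: value' lines, each ending in CRLF."""
    return "".join(f"{name}: {value}\r\n" for name, value in sorted(metadata.items()))


# Hypothetical details found while parsing a news sitemap entry.
details = {
    "publication-date": "2016-12-09T16:18:48Z",
    "source": "sitemap",
    "title": None,
}
# Hypothetical configuration: which details are passed on with the link.
enabled = {"publication-date", "source"}

metadata = build_link_metadata(details, enabled)
payload = to_warc_fields(metadata)
print(payload)
```

Keeping the set of forwarded keys in configuration means further details could be passed along later without changing the parsing code.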
