
Releases: microlinkhq/metascraper

v5.5.1

20 Jun 16:43

5.5.1 (2019-06-20)

Note: Version bump only for package metascraper

v5.5.0

20 Jun 16:23

5.5.0 (2019-06-20)

Features

v5.4.7

20 Jun 09:42

5.4.7 (2019-06-20)

Note: Version bump only for package metascraper

v5.4.6

19 Jun 21:54

5.4.6 (2019-06-19)

Note: Version bump only for package metascraper

v5.0.0

17 Mar 15:46
2b453b2

Breaking Changes

Rules Bundles processed in parallel

Until now, rules bundles were processed in series, which made it possible to pass meta between rules:

({ htmlDom: $, meta, url: baseUrl }) => wrap($ => $('meta[property="og:logo"]').attr('content')),

Now, rules bundles are processed in parallel, so information can no longer be shared between rules and meta is no longer passed.

The only official rules bundle affected by this change is metascraper-lang-detector.
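
The following is a minimal sketch of what parallel processing implies for rule authors. It is not metascraper's internal code: the two bundles, the `runBundles` helper, and the use of a raw `html` string instead of a cheerio `htmlDom` are assumptions made for illustration. Each rule receives only the input and never sees the output of another rule.

```javascript
// Hypothetical rule bundles: each maps a field name to an async rule.
// Rules receive only the input, never a shared `meta` object.
const titleBundle = {
  title: async ({ html }) => {
    const match = /<title>([^<]*)<\/title>/i.exec(html)
    return match ? match[1].trim() : undefined
  }
}

const urlBundle = {
  url: async ({ url }) => url
}

// Hypothetical runner: start every rule of every bundle at once,
// wait for all of them, then merge the [field, value] pairs.
const runBundles = async (bundles, input) => {
  const tasks = bundles.flatMap(bundle =>
    Object.entries(bundle).map(async ([field, rule]) => [field, await rule(input)])
  )
  return Object.fromEntries(await Promise.all(tasks))
}

runBundles([titleBundle, urlBundle], {
  html: '<html><head><title>Hello</title></head></html>',
  url: 'https://example.com'
}).then(metadata => console.log(metadata))
```

Because all rules start together, a rule that needs a value produced by another rule (as metascraper-lang-detector did with meta) has to compute that value itself.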

Improvements

Add metascraper-readability

metascraper-readability (http://npm.im/metascraper-readability) is based on https://github.com/mozilla/readability.

v4.9.0

10 Jan 18:32
ca32573

Remove sanitize-html

The dependency was introducing a bug related to malformed URLs: apostrophecms/sanitize-html#274

In fact, I found it is no longer necessary, since htmlparser2 is already used by the cheerio load method.

Result: Smaller bundle, less parsing time.

Setup Case-Insensitive CSS Rules

One of the things sanitize-html did was normalize some common inconsistencies in the HTML markup.

Because that dependency has been removed, and after discovering that CSS selectors can be case-insensitive, I enabled case-insensitive matching wherever possible.

Result: Better data detection, less initial parsing time.

Improve Date Rules

Building on the case-insensitive CSS rules improvement, I re-checked the rules bundle for metascraper-date.

I found some interesting opportunities: some rules can be merged into one, and others can be made more generic, which improves data accuracy.

I also prioritized update over create, so the output reflects the last modification date rather than the creation date.

Result: Better date accuracy, more values detected.
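
As a rough illustration of the "update over create" priority, here is a small sketch. The `pickDate` helper and the `modified` and `published` field names are hypothetical and are not taken from metascraper-date; the sketch only shows the ordering idea, with a fallback to the creation date when no valid modification date exists.

```javascript
// Hypothetical helper: prefer the modification date, fall back to the
// creation date, and skip values that do not parse as a date.
const pickDate = ({ modified, published }) => {
  for (const candidate of [modified, published]) {
    if (!candidate) continue
    const date = new Date(candidate)
    if (!Number.isNaN(date.getTime())) return date.toISOString()
  }
  return undefined
}

console.log(pickDate({
  modified: '2019-01-10T18:32:00Z',
  published: '2019-01-01T00:00:00Z'
}))
```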

Improve URL detection

URL detection has been improved so that more kinds of URLs can be detected.

A URL is a subtype of URI. The goal is to detect as much data as possible.

Now the URL-related helpers in metascraper-helpers can also detect URIs, such as base64-encoded data image URIs or magnet URIs.

The challenge is doing that while still supporting the original functionality. I added a lot of tests to ensure that.

Result: Better URL detection, with URI support.
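
To make the URL versus URI distinction concrete, here is a minimal sketch that accepts http(s) URLs as well as data and magnet URIs. It relies on the standard `URL` class available in Node.js and is not the metascraper-helpers implementation: the `isUri` name and the allowed protocol list are assumptions for illustration.

```javascript
// Hypothetical helper, not the metascraper-helpers API.
// The protocol allowlist is an assumption chosen for this example.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'data:', 'magnet:'])

const isUri = value => {
  if (typeof value !== 'string') return false
  try {
    // `new URL` throws a TypeError when the value is not an absolute URL or URI.
    return ALLOWED_PROTOCOLS.has(new URL(value).protocol)
  } catch (err) {
    return false
  }
}

console.log(isUri('https://example.com/post'))
console.log(isUri('data:image/png;base64,iVBORw0KGgo='))
console.log(isUri('magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a'))
console.log(isUri('not a url'))
```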

v4.6.0

26 Oct 18:02
0ef7ad5

Features

v4.0.0

24 Aug 08:25
7901fb6

Breaking Changes

The autoload feature has been removed.

Now rules bundles need to be loaded explicitly:

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit-logo')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

Migration guide

If you are using metascraper.load
Just rename it to metascraper. The .load method is now the main exported function.

If you are using metascraper autoload
Replace it with the snippet above. It loads the default rules bundles present in v3.

v3.2.0

19 Dec 14:13

2.0.0

11 Dec 13:37
3a81306

Breaking Changes

From now on, metascraper is the main method, and you need to pass html and url to extract metadata.

const metascraper = require('metascraper')
const got = require('got')

const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

;(async () => {
  const {body: html, url} = await got(targetUrl)
  const metadata = await metascraper({html, url})
  console.log(metadata)
})()

We moved the HTTP layer out of the library to avoid connection-related problems.

Also, in this new interface, rules are not exposed directly.

Features

logo data field

We added a new logo field that identifies the publisher brand behind a link. As a fallback, it uses the highest-resolution favicon available.

Improvements

Codebase simplification

We rewrote the code to make it easy to support plugins in the future.

Testing environment

We updated the integration tests to cover at least the top 50 popular internet sites. They are also automated, so adding a new test is easy.