Releases: microlinkhq/metascraper
v5.5.1
v5.5.0
v5.4.7
v5.4.6
v5.0.0
Breaking Changes
Rules Bundles processed in parallel
Until now, rules bundles were processed one after another, which made it possible to pass meta between rules:
({ htmlDom: $, meta, url: baseUrl }) => wrap($ => $('meta[property="og:logo"]').attr('content')),
Now rules bundles are processed in parallel, so rules can no longer share information with each other and meta is no longer passed.
The only official rules bundle affected by this is metascraper-lang-detector.
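As a minimal sketch of what parallel processing implies (not metascraper's actual implementation), the example below runs every rule concurrently with Promise.all. The names titleBundle, langBundle, and runBundles are hypothetical, and the regular expressions stand in for real DOM queries. Each rule receives only the input, so no rule can read a value produced by another rule.

```javascript
// Hypothetical rule bundles: each maps a property name to an async rule.
const titleBundle = {
  title: async ({ html }) => (html.match(/<title>(.*?)<\/title>/i) || [])[1]
}
const langBundle = {
  lang: async ({ html }) => (html.match(/<html[^>]*\blang="([^"]+)"/i) || [])[1]
}

// Run all rules concurrently. Every rule gets the same input object and
// nothing else, so partial results (the old `meta`) are never shared.
const runBundles = async (bundles, input) => {
  const entries = bundles.flatMap(bundle => Object.entries(bundle))
  const values = await Promise.all(entries.map(([, rule]) => rule(input)))
  return Object.fromEntries(entries.map(([name], i) => [name, values[i]]))
}
```

A rule that used to depend on meta has to compute what it needs from the HTML and URL alone.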
Improvements
Add metascraper-readability
metascraper-readability (http://npm.im/metascraper-readability) is based on https://github.com/mozilla/readability.
v4.9.0
Remove sanitize-html
The dependency was introducing a bug related to malformed URLs: apostrophecms/sanitize-html#274
In fact, I found it is no longer necessary, since htmlparser2 is already present as part of the cheerio load method.
Result: Smaller bundle, less parsing time.
Setup CSS Insensitive Rules
One of the things sanitize-html did was normalize some common things in the HTML markup.
Because it is no longer a dependency, and after discovering that CSS rules can be case-insensitive, I enabled that wherever possible.
Result: Better data detection, less initial parsing time.
Improve Date Rules
Building on the case-insensitive CSS rules improvement, I re-checked the rule set in metascraper-date.
I found some opportunities for improvement: some rules could be merged into one, and others could be made more generic, improving data accuracy.
I also tried to prioritize update over create, so the output reflects the last modification date rather than the creation date.
Result: More accurate dates, more values detected.
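The "update over create" priority can be illustrated with a short sketch. This is an assumption about the general idea, not the rule set shipped in metascraper-date: pickDate and its modified and published inputs are hypothetical names for values that might come from tags such as article:modified_time and article:published_time.

```javascript
// Hypothetical helper: prefer the modification date, fall back to the
// publication date, and skip any candidate that is missing or unparsable.
const pickDate = ({ modified, published }) => {
  const candidates = [modified, published] // order encodes the priority
  for (const value of candidates) {
    const date = new Date(value)
    if (value && !Number.isNaN(date.getTime())) return date.toISOString()
  }
  return undefined
}
```

With both values present the modification date wins; with only a publication date, that one is returned.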
Improve URL detection
URL detection has been improved so that more kinds of URLs are detected.
A URL is a subtype of URI, and the goal is to detect as much data as possible.
The URL-related metascraper-helpers can now detect URIs as well, such as base64-encoded data image URIs or magnet URIs.
The challenge is doing that while still supporting the original functionality. I added a lot of tests to ensure that.
Result: Better URL detection, with URI support.
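To make the URL versus URI distinction concrete, here is a small sketch that relies on the WHATWG URL class built into Node.js. It is not how metascraper-helpers implements detection; isUri and getScheme are hypothetical helpers used only for illustration.

```javascript
// Hypothetical helper: true for any absolute URI the WHATWG parser accepts,
// which includes non-HTTP schemes such as data: and magnet:.
const isUri = value => {
  try {
    new URL(value)
    return true
  } catch (_) {
    return false
  }
}

// Hypothetical helper: returns the scheme including the trailing colon.
const getScheme = value => new URL(value).protocol
```

For example, isUri accepts 'https://example.com/logo.png', 'data:image/png;base64,iVBORw0KGgo=' and 'magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a', while a relative path such as '/logo.png' is rejected because it has no scheme.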
v4.6.0
v4.0.0
Breaking Changes
The autoload feature has been removed.
Now rules bundles need to be loaded explicitly:
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit-logo')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])
Migration guide
If you are using metascraper.load
Just rename it to metascraper. The .load method is now the main exported function.
If you are using metascraper autoload
Replace it with the code snippet above. It loads the default rules bundles present in v3.
v3.2.0
- Add amazon metascraper.
- Simplify rules interface.
- Improve documentation.
2.0.0
Breaking Changes
From now on, metascraper is the main method, and you need to pass html and url to extract metadata.
const metascraper = require('metascraper')
const got = require('got')
const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'
;(async () => {
  const {body: html, url} = await got(targetUrl)
  const metadata = await metascraper({html, url})
  console.log(metadata)
})()
We moved the HTTP layer out of the library to avoid connection-related problems.
Also, rules are not exposed directly in this new interface.
Features
logo data field
We added a new logo field for identifying the publisher brand behind a link. As a fallback, it uses the highest-resolution favicon available.
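The favicon fallback can be pictured with the sketch below. It is an assumption about the general approach, not the library's code: pickLargestIcon is a hypothetical helper, and the input objects mimic the href and sizes attributes of link rel="icon" tags.

```javascript
// Hypothetical helper: choose the favicon with the largest declared area.
// Icons without a parsable `sizes` value count as area 0.
const pickLargestIcon = icons => {
  const area = ({ sizes = '' }) => {
    const [width, height] = sizes.toLowerCase().split('x').map(Number)
    return width * height || 0
  }
  return icons.reduce(
    (best, icon) => (area(icon) > area(best) ? icon : best),
    icons[0]
  )
}
```

Given candidates sized 16x16 and 192x192 plus one with no sizes attribute, the 192x192 entry is selected.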
Improvements
Codebase simplification
We rewrote the code to make it easy to support plugins in the future.
Testing environment
We updated the integration tests to cover at least the top 50 popular internet sites. They are also automated, so adding a new test is easy.