Web scraping (aka web mining) is the process of collecting information from a network resource. This document is a (work in progress) introduction to the different aspects and techniques involved in a web scraping system.
Most of this text will be language-agnostic, but code snippets will probably break that rule.
You can crudely break down any web scraping process into 3 steps:
- acquiring or requesting a network resource
- parsing the data and extracting primitives
- transformation and storage
There is also a 4th step that's concerned with finding what to retrieve, but we'll come to that later.
Systematic resources (a resource is anything you download from the web: HTML pages, for example) are generally easier to retrieve than non-systematic resources. What are systematic resources, you ask? They're network resources that have a pattern to them. One good example of a systematic resource is an infinitely scrolling webpage: you know that as long as you keep scrolling down you'll keep getting more content loaded, so it has a repeatable behaviour that you can rely on.
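Under the hood, an infinitely scrolling page usually just asks the server for the next chunk of a paginated feed. Here's a minimal sketch of exploiting that pattern; the endpoint and the page parameter are purely hypothetical, the kind of thing you'd discover in your browser's network inspector:
import requests

# hypothetical endpoint; most infinite-scroll pages call something like this
# from JavaScript whenever you approach the bottom of the page
FEED_URL = 'https://example.com/api/feed'

page = 1
while True:
    response = requests.get(FEED_URL, params={'page': page})
    items = response.json()
    if not items:          # an empty page means we've run out of content
        break
    for item in items:
        print(item)
    page += 1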
Let's now walk through the web scraping process, step by step.
The first step, acquiring the resource, should be self-explanatory: if you want to scrape data from a news website, you begin by loading the contents of that webpage.
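As a small sketch (the URL is just a placeholder), fetching a page with the requests library might look like this; the timeout and status check are the kind of defensive touches the later examples leave out for brevity:
import requests

response = requests.get('https://example.com/news', timeout=10)
response.raise_for_status()    # raise an exception on 4xx/5xx responses
html = response.text           # the raw payload we'll parse in the next step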
Once you have loaded a resource, the next logical step is to parse its data. Most of the time you'll be extracting data from a structured interchange language (XML, HTML, JSON, et al.). These representations (or formats) make it exceedingly easy to extract and query data. Scraping data from unstructured text requires some arcane spells that are beyond the scope of this text. Most of the time you'll load the interchange language into a data structure, and then extract primitive values. What are primitives, you ask? They're simply values of interest. The reason they're called primitives is that they may undergo transformation down the line, and may end up being modified or consumed in the process.
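For example, here's a small sketch of loading two common formats and pulling out primitives; the JSON payload, HTML snippet, and field names are made up for illustration:
import json
import lxml.etree

# JSON: load it into plain Python data structures and index into them
payload = '{"post": {"title": "Show HN: my project", "points": 42}}'
data = json.loads(payload)
title = data['post']['title']      # a primitive: a value we care about
points = data['post']['points']    # another primitive

# HTML/XML: load it into a document tree and query it
html = '<html><body><h1 class="headline">Hello</h1></body></html>'
document = lxml.etree.HTML(html)
headline = document.cssselect('h1.headline')[0].text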
The transformation and storage stage involves making changes to the extracted data so that it fits your use case. It can range from simple pipelines that save the extracted data as is, to all sorts of sequential or concurrent transformations.
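As a small, hypothetical illustration (the field names and filename are made up), a transformation step might normalise a couple of values and then persist them:
import csv
import datetime

# pretend these primitives came out of the parsing step
item = {'title': '  Show HN: my project  ', 'posted': '2016-01-02T10:00:00'}

# transform: clean up the title and turn the timestamp into a date object
item['title'] = item['title'].strip()
item['posted'] = datetime.datetime.strptime(item['posted'], '%Y-%m-%dT%H:%M:%S').date()

# store: append the transformed row to a CSV file
with open('items.csv', 'a', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow([item['title'], item['posted']])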
Putting it all together, have some pseudocode:
resource = request('/resource/url')
data = parse(resource).extract('data')
save(data)
# file: examples/python/hn_simple.py
# ----------------------------------
# -*- coding: utf-8 -*-
'''
fetch the titles of the latest posts on page 1 of Hacker News,
because who doesn't like Hacker News?
'''
import requests
import lxml.etree
URL = 'http://news.ycombinator.com'
# fetch the page
response = requests.get(URL)
# parse the page
document = lxml.etree.HTML(response.text)
# on inspecting the HTML of the document, you'll see that every HN
# post has an anchor tag with the class 'storylink' inside a table row
# with the class 'athing'. We use the cssselect library to select
# all the elements that match this description. The result is a
# list of lxml.etree.Element objects.
#
# you can also use xpath to select elements, for example:
# document.xpath("//tr[@class='athing']//a[@class='storylink']")
# selects the same elements as the css expression 'tr.athing a.storylink'
for title in document.cssselect('tr.athing a.storylink'):
    print(title.text)
The aforementioned example depicts a simple scraper. Notice that this is just basic scraping, not crawling. What's the difference? A scraper's only job is to extract information from a payload, while a crawler finds the links to follow and then actually fetches them. Notice how our example only fetches the headings from the first page of Hacker News. We're going to extend it to introduce crawling, so that our program scrapes data from more than one page.
# file: examples/python/hn_crawl.py
# ---------------------------------
# -*- coding: utf-8 -*-
import urllib.parse
import requests
import lxml.etree
next_url = 'http://news.ycombinator.com'
total_posts = 0
while next_url is not None:
    response = requests.get(next_url)
    document = lxml.etree.HTML(response.text)

    for title in document.cssselect('tr.athing a.storylink'):
        print(title.text)
        total_posts += 1

    # ascertain the base url i.e. scheme + domain
    urlinfo = urllib.parse.urlparse(next_url)
    base_url = urlinfo.scheme + '://' + urlinfo.netloc

    # the 'More' link points to the next page; when it's missing, stop
    try:
        href = document.cssselect('a.morelink')[0].get('href')
    except IndexError:
        next_url = None
        continue
    next_url = urllib.parse.urljoin(base_url, href)

print('#' * 20)
print('total posts: {}'.format(total_posts))
For brevity, I've kept it simple. But a real crawler has a lot more jobs to do, like keeping track of the pages it has visited, checking for (and avoiding) circular references, retrying any URLs that fail, and so on. Okay, let's clean up our code and introduce some nice abstractions.
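Before we do, here's a rough sketch of what that extra bookkeeping might look like; the visited set, retry counts, and MAX_RETRIES constant below are illustrative and not part of the example files:
import requests

MAX_RETRIES = 3
visited = set()                          # avoid circular references
queue = ['http://news.ycombinator.com']
retries = {}                             # url -> number of failed attempts

while queue:
    url = queue.pop()
    if url in visited:
        continue
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        # re-queue failed urls a limited number of times
        retries[url] = retries.get(url, 0) + 1
        if retries[url] < MAX_RETRIES:
            queue.append(url)
        continue
    visited.add(url)
    # ... parse the response and append any newly discovered urls to `queue`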
# file: examples/python/hn_refactored.py
# --------------------------------------
# -*- coding: utf-8 -*-
import sys
import types
import urllib.parse

import requests
import lxml.etree


class Crawler:

    def __init__(self, spider, **kwargs):
        self._spider = spider()
        self._pipeline = []
        self._next_url = [spider.start_url]

    def add_pipeline_component(self, component):
        self._pipeline.append(component())

    def schedule_request(self, url):
        self._next_url.append(url)

    def _invoke_spider(self, payload):
        # tbd = to be determined
        tbd = self._spider.parse(payload, self)
        if isinstance(tbd, types.GeneratorType):
            for chunk in tbd:
                self._pipe(chunk)
        else:
            self._pipe(tbd)

    def _pipe(self, payload):
        for component in self._pipeline:
            payload = component.process(payload, self)

    def run(self):
        while self._next_url:
            next_url = self._next_url.pop()
            response = requests.get(next_url)
            if 200 <= response.status_code < 300:
                self._invoke_spider(response)
            else:
                errstr = 'Crawler Error: {} {} ({})'.format(
                    response.status_code,
                    response.reason,
                    response.url
                )
                print(errstr, file=sys.stderr)
        else:
            print('Crawler has finished running')


class HackerNewsSpider:
    start_url = 'http://news.ycombinator.com'

    def parse(self, response, crawler):
        document = lxml.etree.HTML(response.text)
        for title in document.cssselect('tr.athing a.storylink'):
            yield title.text

        urlinfo = urllib.parse.urlparse(response.url)
        base_url = urlinfo.scheme + '://' + urlinfo.netloc
        try:
            href = document.cssselect('a.morelink')[0].get('href')
        except IndexError:
            # no 'More' link means we've reached the last page
            return
        next_url = urllib.parse.urljoin(base_url, href)
        crawler.schedule_request(next_url)


class PrintComponent:
    def process(self, payload, crawler):
        print(payload)
        return payload


if __name__ == '__main__':
    crawler = Crawler(HackerNewsSpider)
    crawler.add_pipeline_component(PrintComponent)
    crawler.run()
Ah, there. All neat and tidy. Let's break it down:
- The Crawler is in charge of fetching data
- The Spiders extract the data you want
- The Pipeline does stuff with said data.
The pipeline components are called in the order in which they were registered, and the output of one component is piped into the input of the next. The Spider and the pipeline components both get a reference to the crawler so that they can talk to it (for example, to tell it to fetch another URL). In this example, the only crawler functionality we're invoking is the schedule_request method, which adds a new URL to the request queue, but it could have been anything. We will now further improve this example to fetch more data and print it as a structured record.
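To make that contract concrete first, here's a tiny, purely hypothetical pair of components built on the classes from hn_refactored.py above (they are not part of the example files). Because components run in registration order, the second one always sees whatever the first one returned:
class StripWhitespaceComponent:
    def process(self, payload, crawler):
        # normalise the raw title text before anything downstream sees it
        return payload.strip() if payload else payload


class EchoComponent:
    def process(self, payload, crawler):
        # receives whatever the previous component returned
        print(repr(payload))
        return payload


crawler = Crawler(HackerNewsSpider)
crawler.add_pipeline_component(StripWhitespaceComponent)  # registered first, runs first
crawler.add_pipeline_component(EchoComponent)             # sees the stripped titles
crawler.run()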
Before we begin, let's decide on our goals for the scraper. We want to scrape:
- The name of the post
- Its URL
- Link to comments
- The user who posted it
- The time when it was scraped (collected)
Now that we have decided on our objectives, let's begin.
# file: examples/python/hn.py
# ---------------------------
# -*- coding: utf-8 -*-
import sys
import time
import types
import pprint
import urllib.parse
import requests
import lxml.etree
class Crawler:

    def __init__(self, spider, **kwargs):
        self._spider = spider()
        self._pipeline = []
        self._next_url = [spider.start_url]

    def add_pipeline_component(self, component):
        self._pipeline.append(component())

    def schedule_request(self, url):
        self._next_url.append(url)

    def _invoke_spider(self, payload):
        # tbd = to be determined
        tbd = self._spider.parse(payload, self)
        if isinstance(tbd, types.GeneratorType):
            for chunk in tbd:
                self._pipe(chunk)
        else:
            self._pipe(tbd)

    def _pipe(self, payload):
        for component in self._pipeline:
            payload = component.process(payload, self)

    def run(self):
        while self._next_url:
            next_url = self._next_url.pop()
            response = requests.get(next_url)
            if 200 <= response.status_code < 300:
                self._invoke_spider(response)
            else:
                errstr = 'Crawler Error: {} {} ({})'.format(
                    response.status_code,
                    response.reason,
                    response.url
                )
                print(errstr, file=sys.stderr)
        else:
            print('Crawler has finished running')


class HackerNewsSpider:
    start_url = 'http://news.ycombinator.com'

    def parse(self, response, crawler):
        urlinfo = urllib.parse.urlparse(response.url)
        base_url = urlinfo.scheme + '://' + urlinfo.netloc
        document = lxml.etree.HTML(response.text)

        # hacker news uses table rows for its interface.
        # since there's no clear isolation of data elements,
        # we're going to have to create our own.
        headings = document.cssselect('table.itemlist td.title a.storylink')
        subtext = document.cssselect('table.itemlist td.subtext')
        for head, opt in zip(headings, subtext):
            item = {}
            item['name'] = head.text
            item['url'] = head.get('href')
            link_to_comments = opt.cssselect('span.age a')[0].get('href')
            link_to_comments = urllib.parse.urljoin(base_url, link_to_comments)
            item['comments'] = link_to_comments
            # sometimes posts without a user creep in
            try:
                item['user'] = opt.cssselect('a.hnuser')[0].text
            except IndexError:
                item['user'] = '(N/A)'
            yield item

        try:
            href = document.cssselect('a.morelink')[0].get('href')
        except IndexError:
            # no 'More' link means we've reached the last page
            return
        next_url = urllib.parse.urljoin(base_url, href)
        crawler.schedule_request(next_url)


class PrintComponent:
    def process(self, payload, crawler):
        pprint.pprint(payload)
        return payload


class CrawlTimeComponent:
    def process(self, payload, crawler):
        payload['collected_at'] = time.ctime()
        return payload


if __name__ == '__main__':
    crawler = Crawler(HackerNewsSpider)
    crawler.add_pipeline_component(CrawlTimeComponent)
    crawler.add_pipeline_component(PrintComponent)
    crawler.run()
As you can see, this (final) crawler builds upon the previous, simpler versions. We're now extracting some useful information and sending it down the pipeline to be processed sequentially. Listings and paginated pages make for excellent, easy scraping targets.
We've come a long way from just fetching text from the first page of Hacker News to crawling all of the pages along with some useful information. At this point, you should hopefully have a nice intuition for how the process of web scraping works. You can further extend the previous example by writing a component that saves the returned primitives to a database (a sketch follows below), or by adding more primitives to the scraper.
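For example, a database component might look something like this sketch; it uses sqlite3 from the standard library, a made-up filename and table layout, and assumes it's registered after CrawlTimeComponent so collected_at is already present:
import sqlite3

class SQLiteComponent:
    def __init__(self):
        self._conn = sqlite3.connect('hn.db')
        self._conn.execute(
            'CREATE TABLE IF NOT EXISTS posts '
            '(name TEXT, url TEXT, comments TEXT, user TEXT, collected_at TEXT)'
        )

    def process(self, payload, crawler):
        # persist the item, then pass it along unchanged
        self._conn.execute(
            'INSERT INTO posts VALUES (?, ?, ?, ?, ?)',
            (payload['name'], payload['url'], payload['comments'],
             payload['user'], payload['collected_at'])
        )
        self._conn.commit()
        return payload

You'd register it like any other component: crawler.add_pipeline_component(SQLiteComponent).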
And with that, you should be well equipped to start your journey in web scraping. Go out and scrape the hell out of that web ;)
Web scraping is a powerful tool; you can extract all sorts of information and use it in your own applications, or expose it as APIs for others to use. But as Uncle Ben says, with great power comes great responsibility.
The general rule of thumb for responsible scraping is:
if the owners of the content won't like it being scraped, then don't scrape it.
Just because you can scrape something doesn't necessarily mean that you should. Some websites and webapps specifically cover collection/scraping under their terms of service. If they specifically ask you not to scrape their data (by any means), then be a responsible citizen of the interweb and leave them be.
Another aspect of responsible crawling is making sure that you don't put too much load on the origin server. It's entirely possible to write a web scraper that hits multiple sections of a server concurrently, at high speed and with tiny intervals between requests. That drives up infrastructure costs for the owner of the site, which is something you want to avoid. Please be considerate: I recommend a sequential scraping pipeline with a delay between adjacent requests.
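One low-tech way to do that is to wrap your fetches in a helper that sleeps after every request; the helper name and the two-second delay below are just illustrative:
import time
import requests

def polite_get(url, delay=2.0):
    '''fetch a url, then pause so we don't hammer the origin server'''
    response = requests.get(url, timeout=10)
    time.sleep(delay)   # arbitrary pause between adjacent requests
    return response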
Respect a site's robots.txt file. They're being nice enough to let you take their data; respect the few places that they don't want scraped.
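The standard library can even do the checking for you; here's a small sketch using urllib.robotparser (the user agent string is a placeholder):
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser()
robots.set_url('http://news.ycombinator.com/robots.txt')
robots.read()

if robots.can_fetch('my-scraper', 'http://news.ycombinator.com/newest'):
    print('allowed to fetch')
else:
    print('disallowed; skip this url')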
- When scraping websites, meta tags can be unbelievably helpful. For example, if you're trying to build a site map for a website, you can use its meta tags to extract a few primitives like the title, description, etc. Most modern websites also carry social media tags, like the ones for Twitter and Open Graph (Facebook). The benefit of using meta tags is that most of the time they are consistent across websites and rarely change; see the sketch below.
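Here's a hedged sketch of pulling a few such primitives out of a page with lxml; which tags are actually present varies from site to site, and the helper function is just for illustration:
import requests
import lxml.etree

response = requests.get('https://example.com')
document = lxml.etree.HTML(response.text)

def meta(document, selector):
    '''return the content attribute of the first matching meta tag, if any'''
    tags = document.cssselect(selector)
    return tags[0].get('content') if tags else None

description = meta(document, 'meta[name="description"]')
og_title = meta(document, 'meta[property="og:title"]')       # open graph (facebook)
twitter_card = meta(document, 'meta[name="twitter:card"]')   # twitter
print(description, og_title, twitter_card)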
I recommend looking into a web scraping framework that fits your current toolset. There are some excellent options available. One that deserves special mention is Scrapy; it's an incredibly well thought out framework that is simple to use, yet ridiculously powerful. But there's no reason not to try something else. Each framework or library may take a different approach to web scraping, at least architecture-wise, so observe what those differences are and try to learn from them.
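As a taste, a minimal Scrapy spider covering roughly the same ground as our hand-rolled crawler might look like this sketch (same selector classes as the examples above; depending on your Scrapy version you may need extract_first() instead of get()):
import scrapy

class HNSpider(scrapy.Spider):
    name = 'hackernews'
    start_urls = ['http://news.ycombinator.com']

    def parse(self, response):
        # yield one item per post on the page
        for row in response.css('tr.athing'):
            yield {
                'name': row.css('a.storylink::text').get(),
                'url': row.css('a.storylink::attr(href)').get(),
            }
        # follow the 'More' link, if present
        next_href = response.css('a.morelink::attr(href)').get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

You'd run it with something like `scrapy runspider hn_spider.py -o posts.json`, and Scrapy handles scheduling, retries, and politeness settings for you.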
You can consult these lists to find something that might work for you.