Stanford Litigation DB Scraper

Scrapy project for scraping the Stanford Securities Class Action Clearinghouse.

Use

git clone https://github.com/gaulinmp/lit_scrape.git
cd lit_scrape
scrapy crawl -o lit_data.json -t json --logfile lit_log.log seclit

This creates two files: lit_data.json, a JSON dump of the litigation database, and lit_log.log, a log file you can ignore. Load the dump into a pandas DataFrame like so:

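import json
import re
import pandas as pd
# Matches MM/DD/YYYY dates embedded in the status text (raw string avoids invalid-escape warnings)
re_date = re.compile(r'[01]\d/[0123]\d/[12]\d\d\d')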
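clean_dat = []
with open('lit_data.json') as fh:
    for row in json.load(fh):
        tmp = {}
        for k, v in row.items():
            # Skip the free-text case description
            if k == 'description':
                continue
            # The URL is stored as scraped
            if k == 'url':
                tmp[k] = v
                continue
            if k == 'status':
                # Drop blank fragments; the first remaining fragment is the short status
                parts = [s.strip() for s in v if s.strip()]
                tmp[k] = parts[0] if parts else None
                tmp[k+'_long'] = '|'.join(parts)
                # Pull the first MM/DD/YYYY date out of the full status text
                match = re_date.search(tmp[k+'_long'])
                tmp[k+'_date'] = match.group(0) if match else None
                continue
            # Remaining fields are treated as lists: keep the first entry, or None if blank
            first = v[0].strip() if v else ''
            tmp[k] = first or None
            if k == 'company' and tmp[k]:
                tmp[k] = tmp[k].replace('Defendant: ', '')
        clean_dat.append(tmp)
df_lit = pd.DataFrame(clean_dat)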
for c in 'class_start class_end date_filed status_date'.split():
    df_lit[c] = pd.to_datetime(df_lit[c])
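
To see what the cleaning loop above does to a single record, the following sketch wraps the same logic in a helper and runs it on one invented record. The field names come from the script above. The helper name clean_row, the sample values, and the use of datetime.strptime for the date check are assumptions made for illustration, not output from the real database.

```python
import re
from datetime import datetime

# Same MM/DD/YYYY pattern used by the loading script
RE_DATE = re.compile(r'[01]\d/[0123]\d/[12]\d\d\d')


def clean_row(row):
    """Flatten one scraped record into a dict of scalars (hypothetical helper)."""
    out = {}
    for key, value in row.items():
        if key == 'description':
            continue  # free text is dropped
        if key == 'url':
            out[key] = value  # stored as scraped
            continue
        if key == 'status':
            parts = [s.strip() for s in value if s.strip()]
            out[key] = parts[0] if parts else None
            out[key + '_long'] = '|'.join(parts)
            match = RE_DATE.search(out[key + '_long'])
            out[key + '_date'] = match.group(0) if match else None
            continue
        # Other fields: first list entry, stripped, or None if blank
        first = value[0].strip() if value else ''
        out[key] = first or None
        if key == 'company' and out[key]:
            out[key] = out[key].replace('Defendant: ', '')
    return out


# Invented record for illustration only; real scraped values will differ
sample = {
    'company': ['Defendant: Example Corp. '],
    'status': ['  ', 'Case Status: SETTLED ', 'On or around 01/15/2015'],
    'url': 'https://example.com/case/1',
    'description': ['Long free-text summary.'],
    'class_start': ['03/01/2012'],
    'class_end': [' '],
    'date_filed': ['06/30/2013'],
}

cleaned = clean_row(sample)
# Date strings in this sample follow MM/DD/YYYY, so strptime can parse them
filed = datetime.strptime(cleaned['date_filed'], '%m/%d/%Y')
print(cleaned)
print(filed.date())
```

A list of such cleaned dicts is what gets passed to pd.DataFrame in the script above. Blank fields come back as None, which pd.to_datetime turns into NaT.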

License

See the LICENSE file for license rights and limitations (MIT).
