Common Crawler

Common Crawl (https://commoncrawl.org/) is an open repository of web crawl data spanning many years of internet webpages. The raw data can be difficult to work with when all that is desired is plaintext.

We provide a lightweight, simple Python utility for collecting and batching plaintext data from Common Crawl using multiprocessing. The data can optionally be cleaned before being returned.

Data from Common Crawl comes in .WARC, .WET, and .WAT formats, as described here (https://commoncrawl.org/the-data/get-started/). The .WARC files store the crawl itself (responses, request information, etc.), the .WET files store the extracted plaintext, and the .WAT files store metadata about the .WARC files. Note that crawls from before 2018 do NOT include language annotations, so their language must be detected manually.
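
For reference, here is a minimal sketch (not this repository's API) of pulling plaintext out of a single .WET file with the third-party `warcio` package; the file name below is a placeholder:

```python
# Minimal sketch, assuming a WET file has already been downloaded locally.
# Real WET paths are listed in each crawl's wet.paths.gz listing on
# https://data.commoncrawl.org/. Requires the third-party `warcio` package.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wet.gz", "rb") as stream:    # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":          # WET plaintext records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(url, len(text))
            # For pre-2018 crawls there is no language annotation, so a
            # language detector (e.g. the `langdetect` package) would be
            # needed here to filter by language.
```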

This repository specifically focuses on converting Common Crawl plaintext into a usable dataset format.
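
The general pattern the utility automates looks roughly like the sketch below (a sketch under assumptions, not this repository's API): fan a list of WET file URLs out over worker processes, read the plaintext records from each, and hand the resulting batches to whatever cleaning or saving step you need.

```python
# Rough sketch of the parallel fetch-and-batch pattern (assumed names, not
# this repository's API). Requires the third-party `requests` and `warcio`
# packages.
from multiprocessing import Pool

import requests
from warcio.archiveiterator import ArchiveIterator


def fetch_wet_texts(wet_url):
    """Download one WET file and return its plaintext records."""
    resp = requests.get(wet_url, stream=True)
    resp.raise_for_status()
    texts = []
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "conversion":
            texts.append(record.content_stream().read().decode("utf-8", errors="replace"))
    return texts


if __name__ == "__main__":
    # Placeholder list; real URLs come from a crawl's wet.paths.gz listing.
    wet_urls = []
    with Pool(processes=4) as pool:
        for batch in pool.imap_unordered(fetch_wet_texts, wet_urls):
            pass  # clean / filter / save each batch here
```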

**Please note that this repository is developed as a side project; development is slow.**

Planned Improvements

  • Clean up the parameters, allowing easy access to all lower-level parameters from the top level.
  • Add randomized index selection and filtering by language.
  • Add automatic filtering of headers and other extraneous markup captured by the crawler.
  • Improve the README / docs.
  • Improve the crawl_example notebook.
  • Add grammar-based filtering to keep only sensible outputs, which is useful for LLM training.
  • Add text saving.

Install

Usage
