Common Crawler

Common Crawl (https://commoncrawl.org/) is an open repository of web crawl data collected over many years. The raw data can be difficult to work with when all you want is plaintext.

We provide a lightweight, simple Python utility for collecting and batching plaintext data from Common Crawl using multiprocessing. The data can optionally be cleaned before being returned.
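
As a rough illustration of the multiprocessing pattern (not this repository's actual API; `fetch_and_clean` and the segment paths below are hypothetical stand-ins):

```python
# Minimal sketch of fetching WET segments in parallel. fetch_and_clean and the
# segment paths are hypothetical placeholders, not part of this repository.
from multiprocessing import Pool

def fetch_and_clean(segment_path):
    """Download one Common Crawl WET segment and return its cleaned plaintext records."""
    # ... fetch the .wet.gz file, extract the plaintext, strip boilerplate ...
    return []

segments = [
    "crawl-data/CC-MAIN-2023-50/segments/.../example.warc.wet.gz",  # placeholder path
]

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        for records in pool.imap_unordered(fetch_and_clean, segments):
            pass  # batch, clean further, or save the returned records
```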

Data from Common Crawl comes in .WARC, .WET, and .WAT formats, as described at https://commoncrawl.org/the-data/get-started/. The .WARC files store the crawl itself (responses, request information, etc.), the .WET files store the extracted plaintext, and the .WAT files store metadata about the .WARC files. Note that crawls from before 2018 do NOT include language information, so language must be detected manually for that data.
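
For reference, this is roughly what pulling plaintext out of a single WET file looks like with the third-party warcio and langdetect packages (neither is part of this repository, and the filename is a placeholder):

```python
# Sketch of reading plaintext records from one WET file and detecting language,
# which is needed for pre-2018 crawls that lack language metadata.
from warcio.archiveiterator import ArchiveIterator
from langdetect import detect

with open("example.warc.wet.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):  # warcio handles the gzip compression
        if record.rec_type != "conversion":  # WET plaintext records use type 'conversion'
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        language = detect(text) if text.strip() else None
```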

This repository specifically focuses on converting Common Crawl plaintext into a usable dataset format.

**Please note that this repository is in active development as a side project, so development is slow.**

Planned Improvements

  • Clean up the parameters, allowing easy access to all lower-level parameters from the top level.
  • Randomized index selection and filtering by language.
  • Add automatic filtering of headers and other extraneous code captured by the crawler.
  • Improve the README / docs.
  • Improve the crawl_example notebook.
  • Add grammar-based filtering for 'sensical' outputs, which is useful for LLM training.
  • Add text saving.

Install

Usage