The goal of this repo is to allow users to easily and interactively view statistics about the Common Crawl data set.
Python environment (3.x)
Jupyter Notebook
Clone the cc-webgraph repository and run the java tools with mvn package
.
You may then clone this repository in the cc-webgraph
root, or choose your own location. If you choose your own location please keep in mind to update any variables related to pathing.
One this repository is cloned, we may now start working with the web graph.
To download the graph, we run
for f in cc-main-2020-feb-mar-may-domain-t.graph cc-main-2020-feb-mar-may-domain-t.properties \
cc-main-2020-feb-mar-may-domain.graph cc-main-2020-feb-mar-may-domain.properties \
cc-main-2020-feb-mar-may-domain.stats \
cc-main-2020-feb-mar-may-domain-edges.txt.gz; do
aws --no-sign-request s3 cp s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/$f .;
done
This will download a representation of the domain level graph (cc-main-2020-feb-mar-may-domain.graph
), a file that contains useful properties of said graph (cc-main-2020-feb-mar-may-domain.properties
) such as the WebGraph class the graph is, number of nodes, etc, a file that contains more statistics about the graph (cc-main-2020-feb-mar-may-domain.stats
) such as min indegree/outdegrees, number of dangling nodes, etc, and a file that represents the edges in the graph (cc-main-2020-feb-mar-may-domain-edges.txt.gz
). Transpose of the graph is also downloaded (which is any file name that ends with -t
). More information can be found on the blog.
Now, in order to more easily run our WebGraph java commands, we define the variable:
WG="cc-webgraph/src/script/webgraph_ranking/run_webgraph.sh"
Once this is defined, we may generate offset files for the graph and the transpose graph:
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain-t
Offset files are used to determine where each successor list is stored (successor lists are what makes up the graph file, each node is represented by a successor list). More information on the offset files section.
We can then generate the files for connected components:
$WG it.unimi.dsi.webgraph.algo.ConnectedComponents -m --renumber --sizes -t cc-main-2020-feb-mar-may-domain-t cc-main-2020-feb-mar-may-domain
$WG it.unimi.dsi.webgraph.algo.StronglyConnectedComponents --renumber --sizes cc-main-2020-feb-mar-may-domain
which will produce *.wcc
, *.wccsizes
which contains information about weakly connected components, and similarly the files *.scc
and *.sccsizes
contain information about strongly connected components. More info on the weakly connected components and the strongly connected components classes, and about the concept itself.
Now finally, we may generate some statistics with the Stats class of WebGraph by running:
$WG it.unimi.dsi.webgraph.Stats --save-degrees cc-main-2020-feb-mar-may-domain
The --save-degrees flag generates *.outdegrees
and *.indegrees
files which contain number of degrees per node (so the first line of the outdegrees file contains the number of outdegrees that node of id 0). *.outdegree
and *.indegree
files will be generated which contains frequency distribution of degrees (so line 0 of .indegree contains the number of nodes with 0 incoming edges). Also, if connected component files are present, Stats will genenerate some information about the connected components. More information about the Stats class.
You are now ready to run the notebook! Simply load it and take a look. Double check any variables with pathing.
For any existing code, remember to double check any variables related to pathing. Change them to adhere to your own directory's structure, if required.
The current existing notebook code will process and plot an indegree frequency scatter plot, using the .indegree
file which for each line i, the i'th line has contains the number of Pay level Domains with i indegrees.
If you would like to analyze the metrics of another graph characteristic (such as outdegrees), it is as simple as changing the file_name
variable so that it points to the correct file.
Another interesting tool that you may look into is pyWebGraph, a program developed by Massimo Santini which interfaces Jython with WebGraph. This tool makes the exploration of WebGraphs very easy. The following sections will outline how to get started using pyWebGraph.
Python (2.x)
In order to get pyWebGraph working we need to (rebuild the string maps)[http://law.di.unimi.it/tutorial.php]. Following the above steps to get the notebook started, you should have a vertices file (cc-main-2020-feb-mar-may-domain-vertices-domain.txt.gz) which contains a mapping from a node id to a reversed domain name. We however need to attain a mapping from reversed domain name to numbers. To achieve this we will use the (minimal perfect hashing algorithm)[http://sux4j.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/package-summary.html].
Running the following command will produce a mph
(minimal perfect hash) file that contains mapping from reversed domain name to number.
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.sux4j.mph.GOV4Function cc-main-2020-feb-mar-may-domain.mph
However, we would also like to have the functionality of determining if a string is actually present as a domain in the map. We will produce a String Map
to do so.
Running the following command will produce a smph
file (string map perfect hash) that has the outlined functionality.
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.util.ShiftAddXorSignedStringMap cc-main-2020-feb-mar-may-domain.mph cc-main-2020-feb-mar-may-domain.smph
We will also need a front coded list to map node numbers to URLs:
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.util.FrontCodedStringList -u -r 32 cc-main-2020-feb-mar-may-domain.fcl
Running these three commands, one should obtain an mph
file, smph
file and a fcl
file. We may now begin the installation process for pyWebGraph!
We will clone the pyWebGraph repo:
git clone https://github.com/mapio/py-web-graph.git
We will now operate in this repo:
cd py-web-graph
We require Jython to run pyWebGraph, so we will install it: (more info at https://www.jython.org/download)
wget https://repo1.maven.org/maven2/org/python/jython-standalone/2.7.2/jython-standalone-2.7.2.jar
We will copy the console to the root so pywebgraph
may be visible as a package.
cp pywebgraph/console.py .
Now we define shell variable DIR
to be the location of where WebGraph is installed. Assuming it is in the current working directory:
DIR=$PWD
Defining the Web Graph version (which is visible by the directory name of webgraph):
WEBGRAPH_VERSION="3.6.5"
We will now define the class path for the webgraph package by running:
WG_CP=$DIR/webgraph-$WEBGRAPH_VERSION.jar:$(ls $DIR/deps/*.jar | tr '\n' ':')
Note for Windows users, the seperator for class paths is ;
rather than :
. This means that this command will become:
WG_CP=$DIR/webgraph-$WEBGRAPH_VERSION.jar\;$(ls $DIR/deps/*.jar | tr '\n' ';')
Note that in the following command, we dedicate 12 gb of heap space with java -Xmx12g
so ensure your machine is able to do so, or modify this number. Further, the python path may need to be adapted depending on where it is installed. Finally we may launch pyWebGraph:
java -Xmx12g -Dpython.console=org.python.util.JLineConsole -Dpython.executable=/bin/python2.7 -cp $WG_CP:../jython-standalone-2.7.2.jar: org.python.util.jython console.py
Again, for windows users the command would become:
java -Xmx12g -Dpython.console=org.python.util.JLineConsole -Dpython.executable=/bin/python2.7 -cp $WG_CP\;../jython-standalone-2.7.2.jar\; org.python.util.jython console.py
Now a pyWebGraph console should appear. Assuming the path to the Common Crawl domain graph is in ./output_dir:
pyWebGraph console, Copyright (C) 2009 Massimo Santini
>> graph ./output_dir/cc-main-2020-feb-mar-may-domain
>> pwn
#0
>> namemaps ./output_dir/cc-main-2020-feb-mar-may-domain
>> cn "org.commoncrawl"
>> pwn
#76320850 org.commoncrawl
>> ls
0: #1143797 au.com.dejanseo
1: #1644111 au.com.spatialsource
2: #2474905 be.youtu
...
195: #87060356 uk.co.bbc
196: #89576687 us.lumeno
>> sl
0: #69452 ai.botxo
1: #74455 ai.kritikalvision
...
678: #89622268 us.pingpong
679: #89702797 us.zillman
680: #89834247 vn.avnuc
681: #90129490 wiki.sysadmin
682: #90203492 work.yokonoji
We can see that we are now able to interact with the graph in a very seamless manner!
Common Errors:
- For Windows users the
run_webgraph.sh
file fromcc-webgraph
may not work due to Windows/Linux syntatical differences. See the Windows-friendlyrun_webgraph.sh
script attached in this repo.