This repository has been archived by the owner on Aug 20, 2022. It is now read-only.

Commit 6c8e2bd (parent 1a723d9), authored and committed by Alex Xue on Aug 25, 2020.
Showing 7 changed files with 171,622 additions and 0 deletions.
Dmoz
====
[Dmoz](http://www.dmoz.org) is an open directory that lists and groups web pages into categories (directories). Its data is publicly available, but provided as an RDF file - a huge, idiosyncratic XML file.

Dmoz Parser
===========
This is a really simple Python implementation of a Dmoz RDF parser. It does not try to be smart and process the parsed XML for you; you provide a handler implementation where YOU decide what to do with the data (store it in a file or database, print it, etc.).

This parser assumes that the last entity in each Dmoz page is _topic_:
    <ExternalPage about="http://www.awn.com/">
      <d:Title>Animation World Network</d:Title>
      <d:Description>Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.</d:Description>
      <priority>1</priority>
      <topic>Top/Arts/Animation</topic>
    </ExternalPage>
This assumption is strictly checked, and processing will abort if it is violated.
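The check can be sketched with the standard library's streaming parser. This is a minimal illustration, not the project's actual code, and the sample omits the `d:` namespace prefixes of the real RDF for brevity:

```python
# Sketch: verify that <topic> is the last child of every <ExternalPage>
# while streaming, without loading the whole file into memory.
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<RDF>
  <ExternalPage about="http://www.awn.com/">
    <Title>Animation World Network</Title>
    <priority>1</priority>
    <topic>Top/Arts/Animation</topic>
  </ExternalPage>
</RDF>"""

def check_topic_is_last(stream):
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "ExternalPage":
            children = list(elem)
            if not children or children[-1].tag != "topic":
                raise ValueError("topic is not the last entity in %s"
                                 % elem.get("about"))
            elem.clear()  # free memory while streaming a multi-GB file
    return True

print(check_topic_is_last(io.BytesIO(SAMPLE)))  # → True
```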
The RDF file needs to be downloaded, but it can stay packed. You can [download the RDF](http://rdf.dmoz.org/rdf/content.rdf.u8.gz) from the Dmoz site.

The RDF is pretty large - over 2 GB unpacked - and parsing it takes some time, so there is a progress indicator.
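Streaming the packed file with a simple line-count progress report might look like this hypothetical helper (not the project's code; the actual parser has its own indicator):

```python
# Hypothetical helper: stream the gzipped RDF line by line and report
# progress periodically, so the ~2 GB file never has to be unpacked on disk.
import gzip

def iter_lines_with_progress(path, every=1_000_000):
    with gzip.open(path, "rb") as fh:
        for n, line in enumerate(fh, 1):
            if n % every == 0:
                print("processed %d lines" % n)
            yield line
```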
Warnings
--------

This parser does not check for links between topics in the hierarchy, nor does it do any sophisticated parsing of the hierarchy.

The same URL might appear in multiple locations in the hierarchy.
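Because the same URL can appear under several topics, downstream code may want to deduplicate. A simple first-wins pass over the JSON-lines output could look like this (sketch; the `url` field is the one the JSONWriter adds):

```python
# Sketch: keep only the first record seen for each URL.
import json

def dedupe_by_url(lines):
    seen = set()
    for line in lines:
        record = json.loads(line)
        if record["url"] not in seen:
            seen.add(record["url"])
            yield record

rows = [
    '{"url": "http://www.awn.com/", "topic": "Top/Arts/Animation"}',
    '{"url": "http://www.awn.com/", "topic": "Top/Arts"}',
]
print(len(list(dedupe_by_url(rows))))  # → 1
```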
Dependencies
------------
You need to install the dependencies from the requirements.txt file, for example with `pip install -r requirements.txt`.
Usage
-----
Instantiate the parser, provide a handler, and run:

    #!/usr/bin/env python

    from parser import DmozParser
    from handlers import JSONWriter

    parser = DmozParser()
    parser.add_handler(JSONWriter('output.json'))
    parser.run()

JSONWriter is the built-in handler which outputs the pages, one JSON object per line.
(Note: this is different from the entire file being one large JSON list.)
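Because each line is an independent JSON object, the output can be read back one record at a time without loading the whole file. A minimal sketch (the `url` field is the one the JSONWriter adds for the page address):

```python
# Sketch: iterate over JSON-lines output, one record per line.
import json

def iter_pages(lines):
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

sample = ['{"url": "http://www.awn.com/", "topic": "Top/Arts/Animation"}\n']
print(next(iter_pages(sample))["url"])  # → http://www.awn.com/
```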
Terminal Usage
--------------
`python parser.py <content.rdf.u8 file path> <output file path>`

For example: `python parser.py ./data/content.rdf.u8 ./data/parsed.json`
Built-in handlers
-----------------
There are two built-in handlers so far - _JSONWriter_ and _CSVWriter_.
_CSVWriter_ is buggy (see "handlers.py" to understand why), and we recommend _JSONWriter_.
Handlers
--------
A handler must implement two methods:

    def page(self, page, content)

This method is called every time a new page is extracted from the RDF. The argument _page_ contains the URL of the page, and _content_ contains a dictionary of the page content.

    def finish(self)

The finish method is called after parsing is done. You may want to clean up here - close files, etc.
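An illustrative custom handler following the two-method contract above (hypothetical, not part of the project): it counts how many pages fall under each top-level topic.

```python
# Hypothetical handler: tally pages per top-level topic (e.g. "Top/Arts").
from collections import Counter

class TopicCounter:
    def __init__(self):
        self.counts = Counter()

    def page(self, page, content):
        # content is a dict of page fields; "topic" holds e.g. "Top/Arts/Animation"
        topic = content.get("topic", "")
        top_level = "/".join(topic.split("/")[:2])  # e.g. "Top/Arts"
        self.counts[top_level] += 1

    def finish(self):
        for topic, n in self.counts.most_common():
            print("%s\t%d" % (topic, n))

handler = TopicCounter()
handler.page("http://www.awn.com/", {"topic": "Top/Arts/Animation"})
handler.finish()  # prints "Top/Arts" with a count of 1
```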
handlers.py
-----------
import copy
import json
import logging

from smart_open import smart_open

logger = logging.getLogger(__name__)

class JSONWriter:
    def __init__(self, name):
        self._file = smart_open(name, 'w')

    def page(self, page, content):
        if page is not None and page != "":
            newcontent = copy.copy(content)
            newcontent["url"] = page

            self._file.write(json.dumps(newcontent) + "\n")
        else:
            logger.info("Skipping page %s, page attribute is missing", page)

    def finish(self):
        self._file.close()

class CSVWriter:
    # Note: the CSVWriter has several bugs and assumptions, as documented below.
    def __init__(self, name):
        self._file = smart_open(name, 'w')

    def page(self, page, content):
        if page is not None and page != "":
            # Strip double quotes so they cannot break the quoted CSV field.
            page = page.replace('"', '')

            self._file.write('"%(page)s"' % {'page': page})
            # For CSV, read only these fields, in only this order.
            for field in ['d:Title', 'd:Description', 'priority', 'topic']:
                value = content.get(field, '').replace('"', '')
                # BUG: commas inside the value are not escaped and will trip
                # up a CSV parser; they should be converted or quoted properly.
                self._file.write(',"%s"' % value)

            self._file.write("\n")
        else:
            logger.info("Skipping page %s, page attribute is missing", page)

    def finish(self):
        self._file.close()