add dmoz
Alex Xue authored and committed on Aug 25, 2020
1 parent 1a723d9 commit 6c8e2bd
Showing 7 changed files with 171,622 additions and 0 deletions.
72 changes: 72 additions & 0 deletions dmozparser/README.md
Dmoz
====
[Dmoz](http://www.dmoz.org) is an open directory that lists and groups web pages into categories (directories). Its data is publicly available, but provided as an RDF file - a huge and somewhat peculiar XML file.

Dmoz Parser
========

This is a really simple Python implementation of a Dmoz RDF parser. It does not try to be smart and process the parsed XML for you; you have to provide a handler implementation where YOU decide what to do with the data (store it in a file or a database, print it, etc.).

This parser assumes that the last element in each Dmoz page entry is _topic_:

    <ExternalPage about="http://www.awn.com/">
      <d:Title>Animation World Network</d:Title>
      <d:Description>Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.</d:Description>
      <priority>1</priority>
      <topic>Top/Arts/Animation</topic>
    </ExternalPage>

This assumption is strictly checked, and processing will abort if it is violated.

The RDF file needs to be downloaded, but it can stay packed. You can [download the RDF](http://rdf.dmoz.org/rdf/content.rdf.u8.gz) from the Dmoz site.

The RDF is pretty large, over 2 GB unpacked, and parsing it takes some time, so a progress indicator is shown.

Warnings
--------

This parser does not check for links between topics in the hierarchy, nor does it do any sophisticated parsing of the hierarchy.

The same URL might appear in multiple locations in the hierarchy.
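
If you need one record per URL, a simple post-processing pass over the output works. This is a minimal sketch, assuming the one-JSON-object-per-line output of the JSONWriter handler described below; the file names are placeholders:

    import json

    # Keep only the first record seen for each URL.
    seen = set()
    with open('parsed.json') as src, open('deduped.json', 'w') as dst:
        for line in src:
            record = json.loads(line)
            if record['url'] not in seen:
                seen.add(record['url'])
                dst.write(line)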

Dependencies
------------
You need to install the dependencies from the requirements.txt file, for example with `pip install -r requirements.txt`.

Usage
-----
Instantiate the parser, provide the handler, and run:

    #!/usr/bin/env python

    from parser import DmozParser
    from handlers import JSONWriter

    parser = DmozParser()
    parser.add_handler(JSONWriter('output.json'))
    parser.run()

JSONWriter is the built-in handler which outputs the pages as one JSON object per line.
(Note: this is different from the entire file being one large JSON list.)
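
For the ExternalPage example above, a single output line would look roughly like this (the exact key names are an assumption based on the fields the built-in handlers read, and whether priority is a string or a number depends on the parser, which is not shown here):

    {"d:Title": "Animation World Network", "d:Description": "Provides information resources to the international animation community. ...", "priority": "1", "topic": "Top/Arts/Animation", "url": "http://www.awn.com/"}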

Terminal Usage
--------------
`python parser.py <content.rdf.u8 file path> <output file path>`
example: `python parser.py ./data/content.rdf.u8 ./data/parsed.json`

Built-in handlers
-----------------
There are two built-in handlers so far - _JSONWriter_ and _CSVWriter_.
_CSVWriter_ is buggy (see "handlers.py" to understand why), and we recommend _JSONWriter_.

Handlers
--------
A handler must implement two methods:

    def page(self, page, content)

This method will be called every time a new page is extracted from the RDF; the _page_ argument will contain the URL of the page, and _content_ will contain a dictionary of the page's content.

    def finish(self)

The finish method will be called after the parsing is done. You may want to clean up here: close files, etc.
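
As an illustration, here is a minimal sketch of a custom handler that counts pages per top-level category. The TopicCounter name is hypothetical, and the assumption that _content_ carries a "topic" key follows from the ExternalPage example above:

    class TopicCounter:
        """Count pages per top-level Dmoz category (illustrative sketch)."""

        def __init__(self):
            self._counts = {}

        def page(self, page, content):
            # "topic" holds paths like "Top/Arts/Animation"; take the segment
            # right after "Top" as the top-level category.
            parts = content.get("topic", "").split("/")
            top_level = parts[1] if len(parts) > 1 else "unknown"
            self._counts[top_level] = self._counts.get(top_level, 0) + 1

        def finish(self):
            # Called once after parsing; print the tally.
            for name, count in sorted(self._counts.items()):
                print(name, count)

Register it just like the built-in handlers: `parser.add_handler(TopicCounter())`.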
54 changes: 54 additions & 0 deletions dmozparser/handlers.py
import copy
import json
import logging

from smart_open import smart_open

logger = logging.getLogger(__name__)


class JSONWriter:
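    """Writes each page as one JSON object per line (JSON Lines)."""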
def __init__(self, name):
self._file = smart_open(name, 'w')

def page(self, page, content):
if page is not None and page != "":
newcontent = copy.copy(content)
newcontent["url"] = page

self._file.write(json.dumps(newcontent) + "\n")
else:
logger.info("Skipping page %s, page attribute is missing", page)

def finish(self):
self._file.close()


class CSVWriter:
# Note: The CSVWriter has several bugs and assumptions, as documented below.
def __init__(self, name):
self._file = smart_open(name, 'w')

    def page(self, page, content):
        if page is not None and page != "":
            # The file is opened in text mode, so write str directly;
            # encoding to bytes first would break the .replace() calls below.
            page = page.replace('"', '')
            page = page.replace('&quot;', '')

            self._file.write('"%(page)s"' % {'page': page})
            # For CSV, read only these fields, in only this order.
            # Assumes every field is present; a missing one raises KeyError.
            newcontent = {}
            for field in ['d:Title', 'd:Description', 'priority', 'topic']:
                newcontent[field] = content[field]
                newcontent[field] = newcontent[field].replace('"', '')
                newcontent[field] = newcontent[field].replace('&quot;', '')
                # BUG: Convert comma to something else? Otherwise, it will trip up the CSV parser.
                self._file.write(',"%s"' % newcontent[field])

            self._file.write("\n")
        else:
            logger.info("Skipping page %s, page attribute is missing", page)

def finish(self):
self._file.close()