This repository has been archived by the owner on Jun 17, 2022. It is now read-only.

A simple tool for NLP

The primary purpose of this tool is to take the stress out of managing data with Mahout and Hadoop. It wraps Mahout and Hadoop in simple command-line interfaces and also provides some utilities.

Requirements

Maven, JDK 1.8 (other JDK versions cause failures), hadoop-2.6.0-cdh5.4.4, mahout-0.9-cdh5.4.4
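Since only JDK 1.8 is reported to work, it can help to check the environment before building. The following is a minimal sketch, not part of this repository: the function names `is_jdk8`, `java_version`, and `check_tools` are hypothetical, and it assumes the tools are reachable on `PATH` under the command names you pass in.

```shell
#!/bin/sh
# Preflight sketch: check the required tools before building.
# All function names here are hypothetical and not part of this repository.

# Succeeds only if the given version string denotes JDK 1.8 (e.g. "1.8.0_292").
is_jdk8() {
    case "$1" in
        1.8|1.8.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Extracts the quoted version from the first line of `java -version`,
# which is printed on stderr, e.g.: java version "1.8.0_292"
java_version() {
    java -version 2>&1 | sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Prints each tool that is not on PATH; returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use, assuming the commands are named `mvn`, `java`, and `hadoop` on your machine, is `check_tools mvn java hadoop && is_jdk8 "$(java_version)" && mvn package`, which only starts the build when every check passes.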

Build

$ mvn package

Run

$ vi conf.json
$ vi run

Edit both files to configure your environment.

$ su {hadoop user}
$ ./run

The available commands are displayed when ./run is given no arguments.

Develop with Eclipse

$ mvn eclipse:eclipse

Note: you may see jdk.tools warnings on pom.xml if you convert the project to a Maven project.

License

MIT

TODO

  • DeleteJob

    • Deletes job results on HDFS
    • Hides more of HDFS from users
  • Result decorator for Hive queries

    • Allows users to promptly analyze data produced by Mahout
    • Needs VectorWritable parser for Hive
  • Better logging

  • Stop using the Maven directory layout

    • Moves target/ and Eclipse settings out of the tree to be Git-friendly
    • CMake?
  • Move to Spark

    • Potentially speeds up everything
    • But high memory pressure needs to be considered
    • Parameter Server?
  • Job history and statistics collection

    • e.g., Hadoop job configuration, task counters (.xml and .jhist files)
    • May be useful in the future
  • Add other data analytics

    • Machine learning, graph, etc.

Author

Takeshi Yoshimura (https://github.com/takeshi-yoshimura)