
# OpenKEonSpark
This is a distributed version of the framework OpenKE (https://github.com/thunlp/OpenKE) using the library TensorFlowOnSpark (https://github.com/yahoo/TensorFlowOnSpark).
Please refer to https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md for an overview of how distributed TensorFlow training works.


## Index

- [OpenKEonSpark](#openkeonspark)
  * [Index](#index)
  * [Overview](#overview)
  * [Installation](#installation)
    + [General installation](#general-installation)
    + [Install on Google Colab](#install-on-google-colab)
  * [Train mode](#train-mode)
  * [How to use the learned model?](#how-to-use-the-learned-model-)
    + [Link Prediction](#link-prediction)
    + [Triple classification](#triple-classification)
    + [Usage example](#usage-example)
    + [Predict head entity](#predict-head-entity)
    + [Predict tail entity](#predict-tail-entity)
    + [Predict relation](#predict-relation)
    + [Classify a triple](#classify-a-triple)
  * [Evaluation mode](#evaluation-mode)
      - [Link Prediction Evaluation](#link-prediction-evaluation)
      - [Triple Classification Evaluation](#triple-classification-evaluation)
  * [How to generate all these files?](#how-to-generate-all-these-files-)
  * [How to run the distributed application?](#how-to-run-the-distributed-application-)

<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>






## Overview

**OpenKE** is an efficient implementation based on TensorFlow for knowledge representation learning. C++ is used to implement some underlying operations such as data preprocessing and negative sampling. Knowledge graph embedding models (TransE, TransH, TransR, TransD) are implemented using TensorFlow with Python interfaces, providing a convenient platform to run models on GPUs and CPUs.
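As a reminder of what these translational models compute, the following minimal sketch shows the standard TransE scoring function (a generic formulation for illustration, not code taken from this repository):

```python
import numpy as np

def transe_score(h, r, t):
    """Standard TransE dissimilarity: lower values mean the triple (h, r, t) is more plausible."""
    # h, r, t are the embedding vectors of the head, relation and tail
    return np.linalg.norm(h + r - t, ord=1)  # L1 norm; the L2 norm is also commonly used
```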

**OpenKEonSpark** is the distributed version of OpenKE, built using the library **TensorflowOnSpark** (which allows distributing existing TensorFlow applications on Spark). With OpenKEonSpark, both distributed training and distributed evaluation of **Translational Distance Models** (TransE, TransH, TransR, TransD) can be performed. The motivations that have driven this project are the following:
1. Create a tool that efficiently trains and evaluates knowledge graph embeddings, distributing the job among a set of resources;
2. Make the model produced in the first point updatable as new batches of data arrive, as required in Big Data scenarios.

Among the set of all workers there is a special one, the **chief worker**, which has greater responsibilities: it initializes the model, restores the model variables and saves training checkpoints. The tool can be installed locally (Standalone cluster) or on a Hadoop cluster; only the **Standalone mode** has been tested.

The framework can be executed in two different modalities: (1) **Train mode**; (2) **Evaluation mode**.

## Installation
### General installation
```bash
$ cd OpenKEonSpark
$ bash make.sh
```

### Install on Google Colab

Please refer to the colab directory for the installation, running and evaluation pipelines.


## Train mode

The train mode aims to perform the training of knowledge graph embeddings, distributing the job across the cluster of machines. The tool implements data parallelism using between-graph replication and employs asynchronous gradient updates. The embeddings can be learned using the translational distance models (TransE, TransH, TransR and TransD) with a chosen optimization algorithm (either SGD or the Adam optimizer). The training phase is performed with respect to the following files:

The output of the training phase (the learned model) will be saved into a specified output directory.
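To make the distribution strategy concrete, the following is a minimal sketch of between-graph replication with asynchronous updates in TensorFlow 1.x, the pattern described above. It is illustrative only and is not the project's actual training code: the cluster addresses, checkpoint directory and stand-in loss are placeholder assumptions.

```python
import tensorflow as tf

# Placeholder cluster description: in OpenKEonSpark these roles are mapped onto Spark executors.
cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["worker0:2222", "worker1:2222"]})
job_name, task_index = "worker", 0   # each process is started with its own job name and task index
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers only host the shared variables
else:
    # Between-graph replication: every worker builds its own copy of the graph,
    # while the variables are automatically placed on the parameter servers.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        embeddings = tf.Variable(tf.random_normal([1000, 64]), name="ent_embeddings")
        loss = tf.reduce_sum(tf.square(embeddings))  # stand-in for the TransE/H/R/D loss
        global_step = tf.train.get_or_create_global_step()
        # Asynchronous updates: each worker applies its own gradients independently.
        train_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss, global_step=global_step)

    # The chief worker (task 0) initializes/restores variables and saves checkpoints.
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0),
                                           checkpoint_dir="/tmp/model",
                                           hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```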



## How to use the learned model?
There are two tasks already implemented in the repository, which can be used once the embeddings have been learned: **Link prediction** and **Triple classification**.

### Link Prediction
Link prediction aims to predict the missing head (using the method *predict_head_entity*), tail (*predict_tail_entity*) or relation (*predict_relation*) of a relation fact triple *(h,r,t)*. In this task, for each position of the missing component, the system is asked to rank a set of *k* (an additional parameter of the methods) candidates from the knowledge graph, instead of giving only one best result. Given a specific test triple *(h,r,t)* from which to predict a missing component (either the head, the tail or the relation), the component is replaced by every entity or relation in the knowledge graph, and these triples are ranked with respect to their scores (which depend on the specific model used). Prediction is then performed by calling the corresponding method to get the top predictions from the ranked list.
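A generic sketch of this ranking protocol (illustrative only; the actual scoring is performed by the chosen TransX model, and `score_fn` below is a placeholder):

```python
import numpy as np

def top_k_heads(score_fn, num_entities, r, t, k=10):
    """Rank every entity as a candidate head for (?, r, t) and return the k best ids."""
    scores = np.array([score_fn(h, r, t) for h in range(num_entities)])
    # A lower score is assumed to mean a more plausible triple, as in translational
    # distance models, so the best candidates come first after sorting.
    ranked = np.argsort(scores)
    return ranked[:k]
```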

### Triple classification
Triple classification aims to judge whether a given triple *(h,r,t)* is correct or not (a binary classification task). A threshold δ is used: if the score obtained by a given triple is below δ, the triple is classified as positive, otherwise as negative. The threshold can be passed as a parameter to the method *predict_triple*, or it can be optimized by maximizing the classification accuracy on the validation set.
```python
con.predict_triple(0, 1928, 1)
```
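As a rough illustration of how the decision threshold δ can be tuned on the validation set by maximizing accuracy (a generic sketch, not the project's internal code):

```python
import numpy as np

def tune_threshold(val_scores, val_labels):
    """val_scores: model scores of validation triples; val_labels: 1 for positive, 0 for negative."""
    best_delta, best_acc = None, -1.0
    for delta in np.unique(val_scores):           # candidate thresholds taken from the observed scores
        preds = (val_scores < delta).astype(int)  # score below delta => classified as positive
        acc = float((preds == val_labels).mean())
        if acc > best_acc:
            best_delta, best_acc = delta, acc
    return best_delta, best_acc
```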


## Evaluation mode

Using the **evaluation mode**, the link prediction evaluation task is distributed among the workers, since this is a very expensive task. Two additional files are required:

1. **type_constrain.txt**: the first line is the number of relations; the following lines are the type constraints for each relation. For example, the line “1200 4 3123 1034 58 5733” means that the relation with id 1200 has 4 types of head entities (if another line with relation id equal to 1200 is written, it refers to the tail entity constraints), which are 3123, 1034, 58 and 5733.
2. **ontology_constrain.txt**: the first line is the number of classes in the ontology; the following lines are the ontology constraints for each ontology class. For example, the line "100 3 10 20 30" means that the class with id 100 has three **super classes** (if another line with class id equal to 100 is written, it refers to the **sub-classes** ontology constraints), which are 10, 20 and 30.

The first file is used to incorporate additional information (i.e. the entity types) during the prediction phase. The usage of the second file is better explained in the next section. A small sketch of the first file's layout is shown below.
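A minimal sketch of *type_constrain.txt*, using the relation id from the example above; the ids on the third line (the tail constraints) are invented purely for illustration:

```
1
1200 4 3123 1034 58 5733
1200 2 901 902
```

*ontology_constrain.txt* follows the same shape, with class ids in place of relation ids.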

Moreover, since many of the published linked data are based on **ontologies**, the ontology constraints described above can also be exploited during this evaluation task.
The protocol used for triple classification evaluation is simpler than the previous one. Since this task needs negative labels, each golden test triple is corrupted to obtain exactly one negative triple (by corrupting the tail), so the resulting set contains twice the number of test triples. The same procedure is repeated for the validation set. As explained before, the classification task needs a threshold, which is learned using the validation set; in this case the validation set is used only to tune the threshold for classification purposes. Finally, the learned threshold is used to classify the whole set of test triples.
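A generic sketch of the corruption procedure just described (illustrative only; the entity sampling details of the actual implementation may differ):

```python
import random

def build_classification_set(test_triples, all_entities):
    """For each golden (h, r, t) add one negative triple obtained by corrupting the tail."""
    labelled = []
    for h, r, t in test_triples:
        labelled.append(((h, r, t), 1))                         # golden triple, positive label
        t_neg = random.choice([e for e in all_entities if e != t])
        labelled.append(((h, r, t_neg), 0))                     # corrupted triple, negative label
    return labelled
```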
The metrics reported for this task depend on the number of target relations present in the test set:
* If there is only one target relation, standard metrics are reported (accuracy, precision, recall and F-measure);
* If there is more than one target relation, micro-averaged metrics are reported.

This task has not been distributed because it is already very efficient. However, since it depends on a decision threshold, a better estimate of its performance can be obtained using ROC curves. The method *plot_roc* of the *Config* class can be used to plot the curve for a specific relation and to compute its area under the curve, using the test triples and the decision thresholds computed from the validation set. Please refer to the file *test.py* to perform this task.
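A hedged usage sketch; the exact signature of *plot_roc* is an assumption here, so check *test.py* in the repository for the authoritative call:

```python
# Hypothetical call: assumes plot_roc takes the index of the target relation and
# internally uses the test triples plus the thresholds tuned on the validation set.
con.plot_roc(0)
```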

## How to generate all these files?
In order to generate all the files mentioned here from a set of triples, the script *generate.py* can be used: it assigns a numerical identifier to each resource and splits the data into training, test and validation sets plus one or more batches. It accepts triples serialized in the N-Triples format. The test and validation sets are created by selecting a specified percentage of triples from the available data; these triples involve the set of one or more target relations that we are seeking to learn.
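For reference, the expected input is standard N-Triples, i.e. one triple per line terminated by a dot; for example (the URIs are made up):

```
<http://example.org/AlanTuring> <http://example.org/bornIn> <http://example.org/London> .
```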


## How to run the distributed application?
Before launching the main program (**main_spark.py**), common Spark parameters have to be specified, such as the cluster size, the resources to allocate to each node (CPUs, GPUs, RAM), the number of parameter servers and the number of workers. The script below reports the essential ones:
```
#CUDA
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export LIB_CUDA=/usr/local/cuda-10.0/lib64
#JAVA
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
#PYTHON
export PYSPARK_PYTHON=/usr/bin/python3
#SPARK
export SPARK_HOME=/path/to/spark/spark-2.1.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=3
export CORES_PER_WORKER=1
export MEMORY_PER_WORKER=2g
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
```
Subsequently, the master and the slaves have to be started using the following commands:
```
${SPARK_HOME}/sbin/start-master.sh
${SPARK_HOME}/sbin/start-slave.sh -c ${CORES_PER_WORKER} -m ${MEMORY_PER_WORKER} ${MASTER}
```
The distributed application can finally be launched using the *main_spark.py* script; it accepts the following parameters:
* --cluster_size: number of nodes in the cluster
* --num_ps: number of ps (parameter server) nodes
* --num_gpus: number of gpus to use
* --mode: whether to perform train or evaluation mode
* --test_head: perform link prediction evaluation on missing head, too (only if mode != 'train')

The following script reports an example of launching the distributed application in train mode.
```
${SPARK_HOME}/bin/spark-submit --master ${MASTER} \
--py-files /path/to/OpenKEonSpark/distribute_training.py,/path/to/OpenKEonSpark/Config.py,/path/to/OpenKEonSpark/Model.py,/path/to/OpenKEonSpark/TransE.py,/path/to/OpenKEonSpark/TransH.py,/path/to/OpenKEonSpark/TransR.py,/path/to/OpenKEonSpark/TransD.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.cores.max=${TOTAL_CORES} --conf spark.task.cpus=${CORES_PER_WORKER} --executor-memory ${MEMORY_PER_WORKER} --num-executors ${SPARK_WORKER_INSTANCES} \
/path/to/OpenKEonSpark/main_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} --num_ps 1 --num_gpus 0 \
--input_path /path/to/dataset/ --output_path /path/where/to/store/model/ --cpp_lib_path /path/to/OpenKEonSpark/release/Base.so \
--alpha 0.00001 --optimizer SGD --train_times 50 --ent_neg_rate 1 --embedding_dimension 64 --model TransE --mode train
```
When the program has finished, the master and the slaves can be stopped using the following commands:
```
${SPARK_HOME}/sbin/stop-slave.sh; ${SPARK_HOME}/sbin/stop-master.sh
```
