Commit
Jinho Choi committed May 9, 2018
1 parent 2dda92c commit 9126ece
Showing 6 changed files with 217 additions and 53 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
# DDR Conversion | ||
|
||
DDR conversion generates [deep dependency graphs](https://github.com/emorynlp/ddr) from Penn Treebank-style constituency trees. | ||
The conversion tool is written in Java and developed by [Emory NLP](http://nlp.mathcs.emory.edu) as part of the [ELIT](https://elit.cloud) project. | ||
|
||
## Installation | ||
|
||
Add the following dependency to your Maven project: | ||
|
||
``` | ||
<dependency> | ||
<groupId>cloud.elit</groupId> | ||
<artifactId>elit-ddr</artifactId> | ||
<version>0.0.4</version> | ||
</dependency> | ||
``` | ||
|
||
* Download the conversion script: [nlp4j-ddr.jar](http://nlp.mathcs.emory.edu/nlp4j/nlp4j-ddr.jar). | ||
* Make sure [Java 8 or above](http://www.oracle.com/technetwork/java/javase/downloads) is installed on your machine: | ||
|
||
``` | ||
$ java -version | ||
java version "1.8.x" | ||
Java(TM) SE Runtime Environment (build 1.8.x) | ||
... | ||
``` | ||
|
||
* Run the following command: | ||
|
||
``` | ||
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -i <filepath> [-r -n -pe <string> -oe <string>] | ||
``` | ||
|
||
* `-i`: the path to the parse file or a directory containing the parse files to convert. | ||
* `-r`: if set, recursively process all files with the parse extension in the subdirectories of the input directory. | ||
* `-n`: if set, normalize the parse trees before the conversion. | ||
* `-pe`: the extension of the parse files; required if the input path indicates a directory (default: `parse`). | ||
* `-oe`: the extension of the output files (default: `ddg`). | ||
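
The way `-i`, `-r`, and `-pe` interact can be illustrated with a minimal Java sketch. This is not the converter's actual implementation: the class `ParseFileFinder` and its `find` method are hypothetical names, and matching the extension as a `.`-prefixed filename suffix is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of the conversion tool) that mimics input selection. */
final class ParseFileFinder {
    /**
     * @param input     a parse file or a directory of parse files (the -i option)
     * @param recursive whether to descend into subdirectories (the -r option)
     * @param parseExt  the parse file extension without the dot (the -pe option)
     */
    static List<Path> find(Path input, boolean recursive, String parseExt) throws IOException {
        // A single file is taken as-is; the extension only matters for directories.
        if (Files.isRegularFile(input)) {
            return Collections.singletonList(input);
        }
        // Depth 1 visits only the direct children of the input directory.
        int depth = recursive ? Integer.MAX_VALUE : 1;
        // Assumption: the extension is matched as a ".ext" filename suffix.
        String suffix = "." + parseExt;
        try (Stream<Path> paths = Files.walk(input, depth)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(suffix))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```

For example, `ParseFileFinder.find(Paths.get("eng_web_tbk/data"), true, "tree")` would list the files that the English Web Treebank command is expected to pick up under these assumptions.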
|
||
## Corpora | ||
|
||
DDG conversion has been tested on the following corpora. Some of these corpora require you to be a member of the [Linguistic Data Consortium](https://www.ldc.upenn.edu) (LDC). Retrieve the corpora from LDC and run the following command for each corpus to generate DDG. | ||
|
||
* [OntoNotes Release 5.0](https://catalog.ldc.upenn.edu/LDC2013T19): | ||
|
||
``` | ||
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -r -i ontonotes-release-5.0/data/files/data/english/annotations | ||
``` | ||
|
||
* [English Web Treebank](https://catalog.ldc.upenn.edu/LDC2012T13): | ||
|
||
``` | ||
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -r -i eng_web_tbk/data -pe tree | ||
``` | ||
|
||
* [QuestionBank with Manually Revised Treebank Annotation 1.0](https://catalog.ldc.upenn.edu/LDC2012R121): | ||
|
||
``` | ||
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -i QB-revised.tree | ||
``` | ||
|
||
## Merge | ||
|
||
We have internally updated these corpora to reduce annotation errors and produce a richer representation. If you want to take advantage of our latest updates, merge the original annotation with our annotation. You still need to retrieve the original corpora from LDC. | ||
|
||
* Clone this repository: | ||
|
||
``` | ||
git clone https://github.com/emorynlp/ddr.git | ||
``` | ||
|
||
* Run the following command: | ||
|
||
``` | ||
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGMerge <source path> <target path> <parse ext> | ||
``` | ||
|
||
* `<source path>`: the path to the original corpus. | ||
* `<target path>`: the path to our annotation. | ||
* `<parse ext>`: the extension of the parse files. | ||
|
||
|
||
* [OntoNotes Release 5.0](https://catalog.ldc.upenn.edu/LDC2013T19): | ||
|
||
``` | ||
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGMerge ontonotes-release-5.0/data/files/data/english/annotations ddr/english/ontonotes parse | ||
``` | ||
|
||
* [English Web Treebank](https://catalog.ldc.upenn.edu/LDC2012T13): | ||
|
||
``` | ||
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGMerge eng_web_tbk/data ddr/english/google/ewt tree | ||
``` | ||
|
||
* [QuestionBank with Manually Revised Treebank Annotation 1.0](https://catalog.ldc.upenn.edu/LDC2012R121): | ||
|
||
``` | ||
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGMerge QB-revised.tree ddr/english/google/qb/QB-revised.tree.skel tree | ||
``` | ||
|
||
|
||
## Format | ||
|
||
DDG is represented in the tab-separated values (TSV) format, where each column represents a different field. The semantic roles are indicated in the `feats` column with the key `sem`. | ||
|
||
``` | ||
1 You you PRP _ 3 nsbj 7:nsbj O | ||
2 can can MD _ 3 modal _ O | ||
3 ascend ascend VB _ 0 root _ O | ||
4 Victoria victoria NNP _ 5 com _ B-LOC | ||
5 Peak peak NNP _ 3 obj _ L-LOC | ||
6 to to TO _ 7 aux _ O | ||
7 get get VB sem=prp 3 advcl _ O | ||
8 a a DT _ 10 det _ O | ||
9 panoramic panoramic JJ _ 10 attr _ O | ||
10 view view NN _ 7 obj _ O | ||
11 of of IN _ 16 case _ O | ||
12 Victoria victoria NNP _ 13 com _ B-LOC | ||
13 Harbor harbor NNP _ 16 poss _ I-LOC | ||
14 's 's POS _ 13 case _ L-LOC | ||
15 beautiful beautiful JJ _ 16 attr _ O | ||
16 scenery scenery NN _ 10 ppmod _ O | ||
17 . . . _ 3 p _ O | ||
``` | ||
|
||
* `id`: current token ID (starting at 1). | ||
* `form`: word form. | ||
* `lemma`: lemma. | ||
* `pos`: part-of-speech tag. | ||
* `feats`: extra features; different features are delimited by `|`, keys and values are delimited by `=` (`_` indicates no feature). | ||
* `headId`: head token ID. | ||
* `deprel`: dependency label. | ||
* `sheads`: secondary heads (`_` indicates no secondary head). | ||
* `nament`: named entity tags in the `BILOU` notation if the annotation is available. |
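
A minimal sketch of reading this format in Java, assuming every row carries all nine tab-separated columns as in the sample above. The class `DdgRow` and its methods are hypothetical and not part of `elit-ddr`; the `sheads` column is kept as a raw string because its delimiters are not described here, and the span output format `TYPE:first-last` is an illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for one DDG row; not part of elit-ddr. */
final class DdgRow {
    final int id;
    final String form, lemma, pos;
    final Map<String, String> feats;
    final int headId;
    final String deprel, sheads, nament;

    private DdgRow(String[] c) {
        id = Integer.parseInt(c[0]);
        form = c[1];
        lemma = c[2];
        pos = c[3];
        feats = parseFeats(c[4]);
        headId = Integer.parseInt(c[5]);
        deprel = c[6];
        sheads = c[7]; // kept raw; "_" means no secondary head
        nament = c[8];
    }

    /** Splits one line on tabs; assumes exactly nine columns. */
    static DdgRow parse(String line) {
        String[] c = line.split("\t", -1);
        if (c.length != 9) {
            throw new IllegalArgumentException("expected 9 columns, got " + c.length);
        }
        return new DdgRow(c);
    }

    /** Features are delimited by '|', keys and values by '='; "_" means none. */
    static Map<String, String> parseFeats(String column) {
        Map<String, String> feats = new LinkedHashMap<>();
        if (column.equals("_")) {
            return feats;
        }
        for (String pair : column.split("\\|")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("malformed feature: " + pair);
            }
            feats.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return feats;
    }

    /** Decodes well-formed BILOU tags into "TYPE:first-last" spans over token IDs. */
    static List<String> entitySpans(List<DdgRow> rows) {
        List<String> spans = new ArrayList<>();
        int start = -1;
        for (DdgRow r : rows) {
            String tag = r.nament;
            if (tag.startsWith("U-")) {            // unit: single-token entity
                spans.add(tag.substring(2) + ":" + r.id + "-" + r.id);
            } else if (tag.startsWith("B-")) {     // begin: remember where the span starts
                start = r.id;
            } else if (tag.startsWith("L-") && start > 0) { // last: close the open span
                spans.add(tag.substring(2) + ":" + start + "-" + r.id);
                start = -1;
            }                                      // I- (inside) and O (outside) need no action
        }
        return spans;
    }

    public static void main(String[] args) {
        DdgRow row = parse("7\tget\tget\tVB\tsem=prp\t3\tadvcl\t_\tO");
        System.out.println(row.form + " sem=" + row.feats.get("sem"));
    }
}
```

The semantic role of a token is then available as `row.feats.get("sem")`, which returns `null` when the `feats` column is `_` or has no `sem` key.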
34 changes: 34 additions & 0 deletions
elit-ddr/src/main/java/cloud/elit/ddr/bin/DDRConvertDemo.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
/* | ||
* Copyright 2018 Emory University | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package cloud.elit.ddr.bin; | ||
|
||
import cloud.elit.ddr.constituency.CTTree; | ||
import cloud.elit.ddr.conversion.C2DConverter; | ||
import cloud.elit.ddr.conversion.EnglishC2DConverter; | ||
import cloud.elit.ddr.util.Language; | ||
import cloud.elit.sdk.structure.Document; | ||
import cloud.elit.sdk.structure.Sentence; | ||
|
||
public class DDRConvertDemo { | ||
public static void main(String[] args) { | ||
final String parseFile = "/Users/jdchoi/workspace/elit-java/relcl.parse"; | ||
final String tsvFile = "/Users/jdchoi/workspace/elit-java/relcl.tsv"; | ||
C2DConverter converter = new EnglishC2DConverter(); | ||
DDRConvert ddr = new DDRConvert(); | ||
ddr.convert(converter, Language.ENGLISH, parseFile, tsvFile, false); | ||
} | ||
} |