Author: Alexander Panchenko
Last update: 20 Feb 2015
JoBimText version: jobimtext_pipeline_0.1.1
This page describes the stages of the JoBimText pipeline as run by the bash script that generateHadoopScript.py generates.
For the input file wikipedia_sample_1K, the output on HDFS may look like this:
164.9 K wikipedia_sample_1K
699.7 K wikipedia_sample_1K_bigram
110.5 K wikipedia_sample_1K_bigram__FeatureCount
609.3 K wikipedia_sample_1K_bigram__FreqSigLMI
367.7 K wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
120.2 K wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
20.9 M wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
1.2 M wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
356.4 K wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
110.0 K wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
10.7 M wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
1.1 M wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
53.4 K wikipedia_sample_1K_bigram__WordCount
381.4 K wikipedia_sample_1K_bigram__WordFeatureCount
Warning: the input here is 0.17M, while the resulting output is 35.13M. The temporary output files are thus roughly 200 times larger than the input corpus.
- Cleanup of the HDFS filesystem
Output directory:
wikipedia_sample_1K
Output format:
The following week Hardy saved his brother when CM Punk and The Hart Dynasty attacked both Jeff and John Morrison, turning into a fan favorite again.
Particularly in western societies, modern legal conventions which stipulate points in late adolescence or early adulthood (most commonly 16-21 when adolescents are generally no longer considered minors a
Major railways began running trains at 10–20 minute intervals, rather than the usual 3–5 minute intervals, operating some lines only at rush hour and completely shutting down others; notably, the Tōkaidō
- The Holing operation: generate a word-feature co-occurrence file
Output directory:
wikipedia_sample_1K_bigram
Output format:
In Bigram(@_2009) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 0:2 3:7;
2009 -Bigram(In_@) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 3:7 0:2;
2009 Bigram(@_,) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 3:7 7:8;
, -Bigram(2009_@) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 7:8 3:7;
, Bigram(@_a) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 7:8 9:10;
a -Bigram(,_@) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 9:10 7:8;
a Bigram(@_man) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 9:10 11:14;
man -Bigram(a_@) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 11:14 9:10;
man Bigram(@_from) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 11:14 15:19;
from -Bigram(man_@) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 15:19 11:14;
from Bigram(@_Billingham) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 15:19 20:30;
Billingham -Bigram(from_@) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 20:30 15:19;
Billingham Bigram(@_,) wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0 20:30 30:31;
Steps:
- copy the UIMA XML descriptor of the holing operation
- run the UIMA XML descriptor with de.tudarmstadt.ukp.dkpro.bigdata.hadoop.XMLDescriptorRunner
- clean up the results
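The bigram holing shown in the output above can be sketched in a few lines of Python. This is only an illustration of the output pattern, not the UIMA component the pipeline actually runs: the function name bigram_holing is hypothetical, and the sketch leaves out the document identifier and the character offsets that the real output carries.

```python
def bigram_holing(tokens):
    """Yield (word, feature) pairs for every pair of adjacent tokens.

    Hypothetical helper: it mimics the listing's output pattern only.
    """
    for left, right in zip(tokens, tokens[1:]):
        # The left token is described by its right neighbour.
        yield left, "Bigram(@_{})".format(right)
        # The right token is described by its left neighbour;
        # the listing marks this direction with a leading '-'.
        yield right, "-Bigram({}_@)".format(left)


pairs = list(bigram_holing(["In", "2009", ",", "a", "man"]))
for word, feature in pairs:
    print(word, feature)
```

Each adjacent token pair produces two rows, one per direction, which matches the alternating Bigram(@_x) and -Bigram(x_@) rows in the listing.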
- Calculate word (Jo) frequency dictionary
Output directory:
wikipedia_sample_1K_bigram__WordCount
Output format:
adventures 1
along 12
average 1
be 76
bed 1
bedrooms 1
began 15
bill 3
blunt 1
boxing 1
co-ordinate 1
Steps:
- org.jobimtext.hadoop.mapreducer.UniqMapper of words
- Calculate feature (Bim) frequency dictionary
Output directory:
wikipedia_sample_1K_bigram__FeatureCount
Output format:
-Bigram(spinning_@) 1
-Bigram(states_@) 6
-Bigram(stored-program_@) 1
-Bigram(subsequently_@) 6
-Bigram(switchers_@) 1
-Bigram(think_@) 1
-Bigram(thus_@) 2
-Bigram(timing_@) 1
-Bigram(tracked_@) 1
-Bigram(tree_@) 2
Steps:
- org.jobimtext.hadoop.mapreducer.UniqMapper of features
- Calculate word-feature (JoBim) frequency dictionary
Output directory:
wikipedia_sample_1K_bigram__WordFeatureCount
Output format:
Washington -Bigram(to_@) 1
Would -Bigram(:_@) 1
a -Bigram("_@) 2
a -Bigram(from_@) 8
a Bigram(@_5-match) 1
a Bigram(@_96) 1
a Bigram(@_multi-million-dollar) 1
a Bigram(@_path) 1
a Bigram(@_recording) 1
a Bigram(@_result) 4
a Bigram(@_stakes) 1
above Bigram(@_walls) 1
accepted Bigram(@_a) 1
admitted Bigram(@_under) 1
after -Bigram(disappointment_@) 1
against Bigram(@_an) 1
Steps:
- org.jobimtext.hadoop.mapreducer.UniqMapper
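The three frequency dictionaries (word, feature, and word-feature) can be pictured with the following in-memory sketch. It assumes that every dictionary is counted over the word-feature pairs emitted by the holing step; the function name count_dictionaries is hypothetical, and the pipeline itself does this with Hadoop jobs rather than a single Python process.

```python
from collections import Counter


def count_dictionaries(pairs):
    """Count Jo, Bim and JoBim frequencies from (word, feature) pairs."""
    word_count = Counter()          # Jo frequencies
    feature_count = Counter()       # Bim frequencies
    word_feature_count = Counter()  # JoBim frequencies
    for word, feature in pairs:
        word_count[word] += 1
        feature_count[feature] += 1
        word_feature_count[(word, feature)] += 1
    return word_count, feature_count, word_feature_count


sample_pairs = [
    ("a", "-Bigram(from_@)"),
    ("a", "-Bigram(from_@)"),
    ("a", "Bigram(@_result)"),
    ("accepted", "Bigram(@_a)"),
]
wc, fc, wfc = count_dictionaries(sample_pairs)
print(wc["a"], fc["-Bigram(from_@)"], wfc[("a", "Bigram(@_result)")])
```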
- Calculate LMI word-feature matrix
Output directory:
wikipedia_sample_1K_bigram__FreqSigLMI
Output format (word feature lmi #word-feature #features:word #words:feature):
said -Bigram(Chen_@) 12.719721574502557 1 9.0 1.0
Company -Bigram(Fire_@) 13.889646552738117 1 2.0 2.0
" -Bigram(Fire_@) 6.242188185025285 1 401.0 2.0
owned -Bigram(Navy_@) 12.304684005603457 1 4.0 3.0
' -Bigram(Navy_@) 9.397793427400003 1 30.0 3.0
was -Bigram(Navy_@) 6.254835440923465 1 265.0 3.0
also -Bigram(Tech_@) 9.982755928121158 1 60.0 1.0
of -Bigram(feed_@) 10.786584240845354 2 963.0 3.0
on -Bigram(feed_@) 6.853472900298028 1 175.0 3.0
for -Bigram(love_@) 8.031665534403793 1 232.0 1.0
cheaper -Bigram(thus_@) 14.889646552738117 1 1.0 2.0
backwards -Bigram(thus_@) 14.889646552738117 1 1.0 2.0
of -Bigram(tree_@) 11.956509154990396 2 963.0 2.0
( Bigram(@_"haft) 8.101743936780228 1 221.0 1.0
with Bigram(@_2,168) 8.082291677094014 1 224.0 1.0
of Bigram(@_Anita) 4.978254577495198 1 963.0 2.0
Santa Bigram(@_Anita) 12.567718457850754 1 5.0 2.0
" Bigram(@_Iliad) 7.242188185025285 1 401.0 1.0
in Bigram(@_Miami) 5.486634475134901 1 677.0 2.0
the Bigram(@_Miami) 4.120635296910089 1 1745.0 2.0
near Bigram(@_Ongar) 13.082291677094014 1 7.0 1.0
in Bigram(@_Santa) 4.164706410933029 1 677.0 5.0
of Bigram(@_Santa) 3.6563264389591996 1 963.0 5.0
de Bigram(@_Santa) 9.319790900894509 1 19.0 5.0
real Bigram(@_Santa) 11.98275592812116 1 3.0 5.0
at Bigram(@_Santa) 6.338899700272855 1 150.0 5.0
Steps:
- pig/FreqSigLMI.pig
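The scores in the sample rows are consistent with the usual Lexicographer's Mutual Information formula, LMI = f(w,f) * log2(n * f(w,f) / (f(w) * f(f))), where n is the total number of word-feature pairs. The sketch below is a reading of the sample output, not a transcription of pig/FreqSigLMI.pig. The value used for n (about 60,710) is an assumption inferred from the rows, not a number stated in the pipeline.

```python
import math


def lmi(n_pairs, wf_count, w_count, f_count):
    """Lexicographer's Mutual Information for one word-feature pair."""
    return wf_count * math.log2(n_pairs * wf_count / (w_count * f_count))


# Assumption: total pair count inferred from the sample rows above.
N_PAIRS = 60710

# Row "of -Bigram(feed_@) 10.786584240845354 2 963.0 3.0"
score = lmi(N_PAIRS, wf_count=2, w_count=963, f_count=3)
print(score)
```

With these inputs the result agrees with the listed score to within rounding of the inferred n.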
- Prune LMI word-feature matrix: keep the features that occur with at most 1000 words, etc.
Output directory:
wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
Output format:
software -Bigram(commercial_@) 12.304684005603457 1 4 3 3
software Bigram(@_development) 11.304684005603457 1 4 6 5
Americana Bigram(@_)) 7.050442716052036 1 2 229 92
Americana Bigram(@_() 7.1017439367802275 1 2 221 103
Initially Bigram(@_,) 4.1754010713325425 1 2 1680 661
Initially Bigram(@_sold) 13.304684005603457 1 2 3 3
Palestine -Bigram(in_@) 4.901672066470214 1 3 677 212
Palestine -Bigram(to_@) 5.080682389603534 1 3 598 220
Palestine Bigram(@_established) 10.604244296164897 1 3 13 6
Palestine Bigram(@_League) 10.604244296164897 1 3 13 9
Palestine Bigram(@_to) 5.080682389603534 1 3 598 283
formation Bigram(@_.) 3.8658921954500394 1 4 1041 467
formation Bigram(@_in) 4.486634475134901 1 4 677 338
formation Bigram(@_of) 3.978254577495198 1 4 963 385
formation -Bigram(the_@) 8.241270593820177 2 4 1745 646
formation Bigram(@_;) 8.304684005603457 1 4 48 35
Steps:
- pig/PruneContext.pig
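A rough sketch of this pruning follows. The parameter meanings are guesses based on the directory name (wpfmin and wpfmax as bounds on the number of distinct words per feature, p as the number of top-scoring features kept per word, s as a minimum score). They are assumptions, not documented semantics of pig/PruneContext.pig, and the w, f and wf count thresholds are left out.

```python
from collections import defaultdict


def prune_context(rows, wpf_min=2, wpf_max=1000, top_p=1000, min_sig=0.0):
    """Prune (word, feature, score) rows; `rows` must be a list.

    All parameter semantics are assumptions derived from the directory name.
    """
    # Number of distinct words each feature occurs with.
    words_per_feature = defaultdict(set)
    for word, feature, _ in rows:
        words_per_feature[feature].add(word)

    # Keep rows whose feature is neither too rare nor too promiscuous.
    per_word = defaultdict(list)
    for word, feature, score in rows:
        n_words = len(words_per_feature[feature])
        if score >= min_sig and wpf_min <= n_words <= wpf_max:
            per_word[word].append((feature, score))

    # Keep only the top_p highest-scoring features of each word.
    pruned = []
    for word, features in per_word.items():
        features.sort(key=lambda fs: fs[1], reverse=True)
        pruned.extend((word, f, s) for f, s in features[:top_p])
    return pruned


rows = [("a", "F1", 5.0), ("b", "F1", 4.0), ("a", "F2", 9.0)]
kept = prune_context(rows)
print(kept)
```

In this toy input the feature F2 occurs with a single word, so it falls below wpf_min and is dropped.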
- Create an inverted index of features
Output directory:
wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
Output format:
Bigram(@_rise) The students
Bigram(@_throughout) stations tour ceremonies homeroom ,
Bigram(@_tries) 's )
Bigram(@_visa) a the
Bigram(@_were) which roads these kids Troops Act women vibrations artists studios York team records venues population deserts people % there Line supports Titans two They occupied attempts sets men bridge Bowl parts league 2005 rear measurements pieces Obama buildings Most rights they ) There who probably
-Bigram(-_@) Dr. The to June
-Bigram(12_@) m months and pieces , located to . cm May
-Bigram(1989_@) the edition
-Bigram(Newlyn_@) School .
-Bigram(bird_@) species 12
-Bigram(co-located_@) at with
-Bigram(direct_@) support it
-Bigram(especially_@) common with when during in since under around those " roads
-Bigram(except_@) in for three
Steps:
- org.jobimtext.hadoop.mapreducer.AggrPerFt
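The inverted index shown above maps each feature to the words it occurs with. A minimal in-memory sketch, with the hypothetical function name aggregate_per_feature standing in for the AggrPerFt Hadoop job:

```python
from collections import defaultdict


def aggregate_per_feature(pairs):
    """Group words by feature: feature -> list of words."""
    index = defaultdict(list)
    for word, feature in pairs:
        index[feature].append(word)
    return dict(index)


index = aggregate_per_feature([
    ("a", "Bigram(@_visa)"),
    ("the", "Bigram(@_visa)"),
    ("School", "-Bigram(Newlyn_@)"),
])
for feature, words in index.items():
    print(feature, " ".join(words))
```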
- Calculate word similarities
Output directory:
wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
Output format:
woman year 3.0
woman visit 3.0
woman video 3.0
Convention years 3.0
Convention Year 3.0
Convention work 3.0
a a 378.0
for for 208.0
was was 204.0
and , 204.0
, and 204.0
Steps:
- org.jobimtext.hadoop.mapreducer.SimCount
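The similarity scores above look like counts of shared features: a word paired with itself gets the number of its own features, and each pair appears in both directions. The sketch below reproduces that pattern from the inverted index. It assumes that sc_one means every shared feature contributes a weight of one and that the words listed for a feature are distinct; both are readings of the sample output, not confirmed behaviour of SimCount.

```python
from collections import Counter
from itertools import product


def sim_count(index):
    """Count shared features for every ordered word pair.

    `index` maps feature -> list of distinct words (the inverted index).
    """
    sims = Counter()
    for words in index.values():
        # Every ordered pair, including (w, w), shares this feature once.
        for w1, w2 in product(words, repeat=2):
            sims[(w1, w2)] += 1.0
    return sims


sims = sim_count({
    "F1": ["and", ","],
    "F2": ["and", ","],
    "F3": ["and"],
})
print(sims[("and", ",")], sims[(",", "and")], sims[("and", "and")])
```

The number of pairs grows quadratically with the number of words per feature, which helps explain why this stage produces by far the largest intermediate output.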
- Prune similarity graph by KNN thresholding (e.g. 200 most similar words), and by a global threshold (e.g. sim >= 3)
Output directory:
wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
Output format:
prison 1967 3.0
prison 1947 3.0
prison 1939 3.0
printed printed 3.0
principle town 3.0
principle station 3.0
principle principle 3.0
principle music 3.0
Prince work 3.0
Prince wake 3.0
Steps:
- pig/SimSort.pig
- for instance, 2,038,768 before pruning and 69,192 after pruning
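The two pruning criteria (a per-word limit and a global similarity threshold) can be sketched as follows. The function name sim_sort and the inclusive comparison are assumptions. The listing's lowest score is 3.0 under minsim_2, so the threshold in pig/SimSort.pig may be exclusive.

```python
from collections import defaultdict


def sim_sort(sims, limit=200, min_sim=2.0):
    """Keep, per word, the `limit` most similar words with score >= min_sim.

    `sims` maps (word1, word2) -> similarity score.
    """
    per_word = defaultdict(list)
    for (w1, w2), score in sims.items():
        if score >= min_sim:  # global threshold (inclusive here, an assumption)
            per_word[w1].append((w2, score))

    pruned = {}
    for word, neighbours in per_word.items():
        # KNN thresholding: highest similarity first, then cut at `limit`.
        neighbours.sort(key=lambda ns: ns[1], reverse=True)
        pruned[word] = neighbours[:limit]
    return pruned


example = {
    ("prison", "1967"): 3.0,
    ("prison", "jail"): 5.0,
    ("prison", "x"): 1.0,
}
top = sim_sort(example, limit=1, min_sim=2.0)
print(top)
```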
- Prune LMI word-feature matrix: keep the words that occur with at most 1000 features, etc.
Output directory:
wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
Output format:
Bigram(Peace_@) " 5.2421881850252845 1 401 4 229
-Bigram(Since_@) its 9.680193169704102 1 37 2 41
-Bigram(Since_@) 1997 12.567718457850754 1 5 2 9
-Bigram(basis_@) . 4.280929741912922 1 1041 3 472
-Bigram(basis_@) for 14.89340615148975 2 232 3 221
-Bigram(birds_@) and 4.539812448589677 1 870 3 693
-Bigram(birds_@) could 10.845252473991478 1 11 3 14
-Bigram(plans_@) to 13.331289672230339 2 598 2 503
Bigram(@_Living) the 4.120635296910089 1 1745 2 898
Bigram(@_Living) Country 13.889646552738117 1 2 2 3
Bigram(@_faster) run 12.304684005603457 1 4 3 7
Bigram(@_things) other 20.67011544853446 2 47 2 54
-Bigram(Ragged_@) Ass 42.914052016810366 3 3 3 2
-Bigram(states_@) . 3.2809297419129217 1 1041 6 472
-Bigram(states_@) ; 7.719721574502556 1 48 6 60
-Bigram(states_@) obtained 11.719721574502556 1 3 6 4
-Bigram(states_@) that 13.859289250767814 2 166 6 160
Bigram(@_Florida) driving 12.719721574502557 1 3 3 5
Bigram(@_Israeli) an 6.625203942539863 1 123 5 103
Bigram(@_Israeli) real 11.98275592812116 1 3 5 5
Steps:
- pig/PruneContext.pig
- Build feature inverted index
Output directory:
wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
Output format:
1943 -Bigram((_@) Bigram(@_–) Bigram(@_on) -Bigram(September_@)
Americana Bigram(@_() -Bigram(Universidad_@) Bigram(@_))
But Bigram(@_he) Bigram(@_she) Bigram(@_after) Bigram(@_,) Bigram(@_with)
CR -Bigram(,_@) Bigram(@_21) Bigram(@_7) -Bigram(with_@) -Bigram(from_@)
Cardiff Bigram(@_,) -Bigram(in_@)
Ministry -Bigram(,_@) Bigram(@_nominated) Bigram(@_of)
Natural -Bigram(the_@) Bigram(@_Gas) -Bigram(&_@) -Bigram(and_@)
September -Bigram(11_@) Bigram(@_24) Bigram(@_11) Bigram(@_2012) -Bigram(the_@) -Bigram(On_@) Bigram(@_2) Bigram(@_1943) -Bigram(in_@) Bigram(@_1) -Bigram(–_@) -Bigram(on_@) Bigram(@_29) Bigram(@_14) -Bigram(to_@)
West -Bigram(to_@) Bigram(@_Berlin) Bigram(@_Virginia) -Bigram(,_@) -Bigram(Mountain_@) Bigram(@_&) -Bigram(violent_@) Bigram(@_with) -Bigram(by_@) Bigram(@_Point) -Bigram(on_@) -Bigram(Old_@) -Bigram(include_@) Bigram(@_,)
accepted Bigram(@_a) -Bigram("_@) Bigram(@_on)
began Bigram(@_on) -Bigram(and_@) Bigram(@_running) Bigram(@_their) Bigram(@_as) Bigram(@_,) -Bigram(company_@) -Bigram(girls_@) -Bigram("_@) -Bigram(store_@) Bigram(@_in) Bigram(@_to) -Bigram(U2_@) -Bigram(population_@) Bigram(@_setting)
bill -Bigram(The_@) Bigram(@_would) -Bigram(a_@) Bigram(@_passed)
character -Bigram(main_@) Bigram(@_with) Bigram(@_is) -Bigram(the_@) -Bigram(central_@) -Bigram(allow_@)
Steps:
- org.jobimtext.hadoop.mapreducer.AggrPerFt
- Calculate feature similarities, then prune them by KNN thresholding (e.g. top 200 features) and global thresholding (e.g. sim of at least 2)
Output directory:
wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
Output format:
-Bigram(of_@) Bigram(@_as) 15.0
-Bigram(of_@) -Bigram(as_@) 15.0
Bigram(@_of) -Bigram("_@) 15.0
-Bigram(of_@) Bigram(@_)) 15.0
-Bigram(new_@) -Bigram(the_@) 15.0
-Bigram(new_@) Bigram(@_.) 15.0
Bigram(@_most) Bigram(@_most) 15.0
Bigram(@_many) Bigram(@_many) 15.0
Bigram(@_it) Bigram(@_he) 15.0
-Bigram(it_@) -Bigram(and_@) 15.0
Bigram(@_it) Bigram(@_a) 15.0
-Bigram(is_@) -Bigram(were_@) 15.0
Bigram(@_is) -Bigram(and_@) 15.0
Bigram(@_is) -Bigram(,_@) 15.0
Bigram(@_in) Bigram(@_that) 15.0
Steps:
- org.jobimtext.hadoop.mapreducer.SimCount
- pig/SimSort.pig