JoBimText formats and stages

Author: Alexander Panchenko
Last update: 20 Feb 2015
JoBimText version: jobimtext_pipeline_0.1.1

This document describes the stages of the JoBimText pipeline as run by the generated bash script.

For the input file wikipedia_sample_1K, the output on HDFS may look like this:

164.9 K  wikipedia_sample_1K
699.7 K  wikipedia_sample_1K_bigram
110.5 K  wikipedia_sample_1K_bigram__FeatureCount
609.3 K  wikipedia_sample_1K_bigram__FreqSigLMI
367.7 K  wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
120.2 K  wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
20.9 M   wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
1.2 M    wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
356.4 K  wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
110.0 K  wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
10.7 M   wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
1.1 M    wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
53.4 K   wikipedia_sample_1K_bigram__WordCount
381.4 K  wikipedia_sample_1K_bigram__WordFeatureCount

Warning: the input here is 0.17 M, while the resulting output is 35.13 M. The temporary output files are thus about 200 times larger than the input corpus.

Calculate word-feature matrix (JoBim text)

  1. Cleanup of the HDFS filesystem
     Output directory: wikipedia_sample_1K
     Output format:
The following week Hardy saved his brother when CM Punk and The Hart Dynasty attacked both Jeff and John Morrison, turning into a fan favorite again.
Particularly in western societies, modern legal conventions which stipulate points in late adolescence or early adulthood (most commonly 16-21 when adolescents are generally no longer considered minors a
Major railways began running trains at 10–20 minute intervals, rather than the usual 3–5 minute intervals, operating some lines only at rush hour and completely shutting down others; notably, the Tōkaidō
  1. The holing operation: generate a word-feature co-occurrence file
     Output directory: wikipedia_sample_1K_bigram
     Output format:
In                     Bigram(@_2009)                    wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  0:2        3:7;
2009                   -Bigram(In_@)                     wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  3:7        0:2;
2009                   Bigram(@_,)                       wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  3:7        7:8;
,                      -Bigram(2009_@)                   wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  7:8        3:7;
,                      Bigram(@_a)                       wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  7:8        9:10;
a                      -Bigram(,_@)                      wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  9:10       7:8;
a                      Bigram(@_man)                     wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  9:10       11:14;
man                    -Bigram(a_@)                      wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  11:14      9:10;
man                    Bigram(@_from)                    wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  11:14      15:19;
from                   -Bigram(man_@)                    wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  15:19      11:14;
from                   Bigram(@_Billingham)              wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  15:19      20:30;
Billingham             -Bigram(from_@)                   wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  20:30      15:19;
Billingham             Bigram(@_,)                       wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  20:30      30:31;


  • Copy the UIMA XML descriptor of the holing operation
  • Run the UIMA XML descriptor with de.tudarmstadt.ukp.dkpro.bigdata.hadoop.XMLDescriptorRunner
  • Clean up the results
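
The sample output above suggests how the bigram holing works: each pair of adjacent tokens yields two rows, one for each token, with `@` marking the position of the token itself. The following is a minimal Python sketch of that idea, not the UIMA implementation. The function name `bigram_holing` is hypothetical, the input is assumed to be already tokenized, and the document ID and character offsets from the real output are omitted.

```python
from typing import Iterator, List, Tuple


def bigram_holing(tokens: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (word, feature) rows for every pair of adjacent tokens.

    Hypothetical helper: mirrors the row order of the sample output,
    but leaves out the document ID and character offsets.
    """
    for left, right in zip(tokens, tokens[1:]):
        # The left token is described by its right neighbour.
        yield left, f"Bigram(@_{right})"
        # The right token is described by its left neighbour ("-" prefix).
        yield right, f"-Bigram({left}_@)"


# Example: print tab-separated rows for a short token sequence.
for word, feature in bigram_holing(["In", "2009", ",", "a", "man"]):
    print(f"{word}\t{feature}")
```

Tokenization is deliberately left to the caller, since the pipeline delegates it to the UIMA descriptor.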
  1. Calculate word (Jo) frequency dictionary
     Output directory: wikipedia_sample_1K_bigram__WordCount
     Output format:
adventures             1
along                  12
average                1
be                     76
bed                    1
bedrooms               1
began                  15
bill                   3
blunt                  1
boxing                 1
co-ordinate            1


  • org.jobimtext.hadoop.mapreducer.UniqMapper of words
  1. Calculate feature (Bim) frequency dictionary
     Output directory: wikipedia_sample_1K_bigram__FeatureCount
     Output format:
-Bigram(spinning_@)               1
-Bigram(states_@)                 6
-Bigram(stored-program_@)         1
-Bigram(subsequently_@)           6
-Bigram(switchers_@)              1
-Bigram(think_@)                  1
-Bigram(thus_@)                   2
-Bigram(timing_@)                 1
-Bigram(tracked_@)                1
-Bigram(tree_@)                   2


  • org.jobimtext.hadoop.mapreducer.UniqMapper of features
  1. Calculate word-feature (JoBim) frequency dictionary
     Output directory: wikipedia_sample_1K_bigram__WordFeatureCount
     Output format:
Washington             -Bigram(to_@)                     1
Would                  -Bigram(:_@)                      1
a                      -Bigram("_@)                      2
a                      -Bigram(from_@)                   8
a                      Bigram(@_5-match)                 1
a                      Bigram(@_96)                      1
a                      Bigram(@_multi-million-dollar)    1
a                      Bigram(@_path)                    1
a                      Bigram(@_recording)               1
a                      Bigram(@_result)                  4
a                      Bigram(@_stakes)                  1
above                  Bigram(@_walls)                   1
accepted               Bigram(@_a)                       1
admitted               Bigram(@_under)                   1
after                  -Bigram(disappointment_@)         1
against                Bigram(@_an)                      1


  • org.jobimtext.hadoop.mapreducer.UniqMapper
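
The three frequency dictionaries above (words, features, word-feature pairs) can be pictured as three counters over the holing output. The sketch below is an in-memory Python illustration, not the Hadoop `UniqMapper` job. It assumes that each count is simply the number of holing rows in which the item occurs, and the function name `count_tables` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_tables(pairs: Iterable[Tuple[str, str]]):
    """Count words, features, and word-feature pairs in one pass.

    Assumption: every (word, feature) row of the holing output
    contributes 1 to each of the three counts.
    """
    word_count: Counter = Counter()
    feature_count: Counter = Counter()
    pair_count: Counter = Counter()
    for word, feature in pairs:
        word_count[word] += 1
        feature_count[feature] += 1
        pair_count[(word, feature)] += 1
    return word_count, feature_count, pair_count


# Toy rows in the same shape as the holing output.
rows = [
    ("a", "-Bigram(from_@)"),
    ("a", "Bigram(@_result)"),
    ("a", "-Bigram(from_@)"),
    ("accepted", "Bigram(@_a)"),
]
words, features, word_features = count_tables(rows)
print(words["a"], features["-Bigram(from_@)"], word_features[("a", "-Bigram(from_@)")])
```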
  1. Calculate LMI word-feature matrix
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI
     Output format (word feature lmi #word-feature #features:word #words:feature):
said                   -Bigram(Chen_@)                   12.719721574502557      1    9.0     1.0
Company                -Bigram(Fire_@)                   13.889646552738117      1    2.0     2.0
"                      -Bigram(Fire_@)                   6.242188185025285       1    401.0   2.0
owned                  -Bigram(Navy_@)                   12.304684005603457      1    4.0     3.0
'                      -Bigram(Navy_@)                   9.397793427400003       1    30.0    3.0
was                    -Bigram(Navy_@)                   6.254835440923465       1    265.0   3.0
also                   -Bigram(Tech_@)                   9.982755928121158       1    60.0    1.0
of                     -Bigram(feed_@)                   10.786584240845354      2    963.0   3.0
on                     -Bigram(feed_@)                   6.853472900298028       1    175.0   3.0
for                    -Bigram(love_@)                   8.031665534403793       1    232.0   1.0
cheaper                -Bigram(thus_@)                   14.889646552738117      1    1.0     2.0
backwards              -Bigram(thus_@)                   14.889646552738117      1    1.0     2.0
of                     -Bigram(tree_@)                   11.956509154990396      2    963.0   2.0
(                      Bigram(@_"haft)                   8.101743936780228       1    221.0   1.0
with                   Bigram(@_2,168)                   8.082291677094014       1    224.0   1.0
of                     Bigram(@_Anita)                   4.978254577495198       1    963.0   2.0
Santa                  Bigram(@_Anita)                   12.567718457850754      1    5.0     2.0
"                      Bigram(@_Iliad)                   7.242188185025285       1    401.0   1.0
in                     Bigram(@_Miami)                   5.486634475134901       1    677.0   2.0
the                    Bigram(@_Miami)                   4.120635296910089       1    1745.0  2.0
near                   Bigram(@_Ongar)                   13.082291677094014      1    7.0     1.0
in                     Bigram(@_Santa)                   4.164706410933029       1    677.0   5.0
of                     Bigram(@_Santa)                   3.6563264389591996      1    963.0   5.0
de                     Bigram(@_Santa)                   9.319790900894509       1    19.0    5.0
real                   Bigram(@_Santa)                   11.98275592812116       1    3.0     5.0
at                     Bigram(@_Santa)                   6.338899700272855       1    150.0   5.0


  • pig/FreqSigLMI.pig:
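
As an illustration of the LMI column, the sketch below uses the standard Lexicographer's Mutual Information formula, LMI(w, f) = n(w,f) * log2(N * n(w,f) / (n(w) * n(f))). It is a Python sketch and not the content of the Pig script. In the sample rows, the fifth column appears to act as n(w) and the sixth as n(f). The total N is not stated in this document, so the code infers it from one sample row; that inference, and the names `lmi` and `N_TOTAL`, are assumptions.

```python
import math


def lmi(n_wf: float, n_w: float, n_f: float, n_total: float) -> float:
    """Lexicographer's Mutual Information: joint count times PMI (base 2)."""
    return n_wf * math.log2(n_total * n_wf / (n_w * n_f))


# Assumption: infer the total N from the sample row
#   cheaper  -Bigram(thus_@)  14.889646552738117  1  1.0  2.0
# by solving lmi = 1 * log2(N * 1 / (1.0 * 2.0)) for N.
N_TOTAL = (2 ** 14.889646552738117) * (1.0 * 2.0) / 1

# Recompute another sample row: of  -Bigram(feed_@)  ...  2  963.0  3.0
print(lmi(2, 963.0, 3.0, N_TOTAL))
```

With N inferred this way, the recomputed values closely track the LMI column of the other sample rows, which supports the reading of the columns given above.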

Calculate word similarity matrix (Jo DT)

  1. Prune LMI word-feature matrix: keep the features that occur with at most 1000 words, etc.
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
     Output format:
software         -Bigram(commercial_@)      12.304684005603457    1    4     3     3
software         Bigram(@_development)      11.304684005603457    1    4     6     5
Americana        Bigram(@_))                7.050442716052036     1    2     229   92
Americana        Bigram(@_()                7.1017439367802275    1    2     221   103
Initially        Bigram(@_,)                4.1754010713325425    1    2     1680  661
Initially        Bigram(@_sold)             13.304684005603457    1    2     3     3
Palestine        -Bigram(in_@)              4.901672066470214     1    3     677   212
Palestine        -Bigram(to_@)              5.080682389603534     1    3     598   220
Palestine        Bigram(@_established)      10.604244296164897    1    3     13    6
Palestine        Bigram(@_League)           10.604244296164897    1    3     13    9
Palestine        Bigram(@_to)               5.080682389603534     1    3     598   283
formation        Bigram(@_.)                3.8658921954500394    1    4     1041  467
formation        Bigram(@_in)               4.486634475134901     1    4     677   338
formation        Bigram(@_of)               3.978254577495198     1    4     963   385
formation        -Bigram(the_@)             8.241270593820177     2    4     1745  646
formation        Bigram(@_;)                8.304684005603457     1    4     48    35


  • pig/PruneContext.pig
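
The pruning rule named above (keep features that occur with at most 1000 words) can be sketched as a filter on the number of distinct words per feature. This Python sketch covers only that one rule and is not the Pig script. Reading `wpfmax_1000` and `wpfmin_2` in the directory name as upper and lower bounds on words per feature is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Row = Tuple[str, str, float]  # (word, feature, lmi)


def prune_by_words_per_feature(
    rows: Iterable[Row], wpf_min: int = 2, wpf_max: int = 1000
) -> List[Row]:
    """Keep rows whose feature occurs with wpf_min..wpf_max distinct words.

    Assumption: wpf_min / wpf_max correspond to the wpfmin / wpfmax
    values in the output directory name.
    """
    rows = list(rows)
    words_per_feature: Dict[str, Set[str]] = defaultdict(set)
    for word, feature, _ in rows:
        words_per_feature[feature].add(word)
    return [
        row for row in rows
        if wpf_min <= len(words_per_feature[row[1]]) <= wpf_max
    ]


# Toy example: "F1" occurs with a single word and is dropped.
sample = [("a", "F1", 1.0), ("a", "F2", 2.0), ("b", "F2", 3.0)]
print(prune_by_words_per_feature(sample))
```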
  1. Create an inverted index of features
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
     Output format:
Bigram(@_rise)             The             students
Bigram(@_throughout)       stations        tour            ceremonies       homeroom        ,
Bigram(@_tries)            's              )
Bigram(@_visa)             a               the
Bigram(@_were)             which           roads           these            kids            Troops         Act              women           vibrations      artists         studios      York           team           records        venues         population    deserts         people         %               there       Line        supports      Titans       two          They            occupied        attempts        sets         men         bridge       Bowl         parts        league         2005          rear         measurements  pieces         Obama       buildings    Most             rights         they           )            There           who            probably
-Bigram(-_@)               Dr.             The             to               June
-Bigram(12_@)              m               months          and              pieces          ,              located          to              .               cm              May
-Bigram(1989_@)            the             edition
-Bigram(Newlyn_@)          School          .
-Bigram(bird_@)            species         12
-Bigram(co-located_@)      at              with
-Bigram(direct_@)          support         it
-Bigram(especially_@)      common          with            when             during          in             since            under           around          those           "            roads
-Bigram(except_@)          in              for             three


  • org.jobimtext.hadoop.mapreducer.AggrPerFt
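
The inverted index shown above maps each feature to the list of words it occurs with. A minimal in-memory Python sketch of that grouping follows. It is not the `AggrPerFt` MapReduce job, the function name is hypothetical, and the deduplication of repeated words is an assumption.

```python
from typing import Dict, Iterable, List, Tuple


def aggregate_per_feature(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group words by feature, keeping first-seen order without duplicates."""
    index: Dict[str, List[str]] = {}
    for word, feature in pairs:
        words = index.setdefault(feature, [])
        if word not in words:  # assumption: each word is listed once per feature
            words.append(word)
    return index


pairs = [
    ("The", "Bigram(@_rise)"),
    ("students", "Bigram(@_rise)"),
    ("a", "Bigram(@_visa)"),
    ("the", "Bigram(@_visa)"),
]
for feature, words in aggregate_per_feature(pairs).items():
    print(feature, *words, sep="\t")
```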
  1. Calculate word similarities
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
     Output format:
woman   year    3.0
woman   visit   3.0
woman   video   3.0
Convention  years   3.0
Convention  Year    3.0
Convention  work    3.0
a   a   378.0
for for 208.0
was was 204.0
and ,   204.0
,   and 204.0

  • org.jobimtext.hadoop.mapreducer.SimCount
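
One way to read the similarity scores above is as the number of features two words share. The Python sketch below counts this from the inverted index and is not the `SimCount` job itself. Interpreting `sc_one` as "each shared feature contributes 1" is an assumption. Ordered pairs and self-pairs are emitted because the sample output contains both `and ,` and `, and`, as well as rows such as `a a`.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple


def sim_count(index: Dict[str, Iterable[str]]) -> Counter:
    """Count shared features for every ordered pair of words.

    Assumption: each feature shared by two words adds 1 to their score.
    """
    sims: Counter = Counter()
    for words in index.values():
        unique_words = sorted(set(words))
        # All ordered pairs, including (w, w), as in the sample output.
        for a, b in product(unique_words, repeat=2):
            sims[(a, b)] += 1
    return sims


index = {"F1": ["and", ","], "F2": ["and", ",", "or"]}
for (a, b), score in sorted(sim_count(index).items()):
    print(f"{a}\t{b}\t{float(score)}")
```

Since every feature with n words produces n * n pairs, this step accounts for most of the size blow-up of the temporary files.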

  1. Prune similarity graph by KNN thresholding (e.g. 200 most similar words), and by a global threshold (e.g. sim >= 3)
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
     Output format:
prison  1967    3.0
prison  1947    3.0
prison  1939    3.0
printed printed 3.0
principle   town    3.0
principle   station 3.0
principle   principle   3.0
principle   music   3.0
Prince  work    3.0
Prince  wake    3.0


  • pig/SimSort.pig
  • For instance: 2,038,768 before pruning, 69,192 after pruning
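
The two pruning criteria of this step, a global similarity threshold and a per-word KNN limit, can be sketched in Python as follows. This is an illustration rather than the Pig script. The parameter names `limit` and `min_sim` are hypothetical stand-ins for the `SimSortlimit_200` and `minsim_2` parts of the directory name, and the tie-breaking by word is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Tuple


def sim_sort(
    sims: Mapping[Tuple[str, str], float], limit: int = 200, min_sim: float = 2
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep, per word, the `limit` most similar words with score >= min_sim."""
    by_word: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for (word, other), score in sims.items():
        if score >= min_sim:  # global threshold
            by_word[word].append((other, score))
    return {
        # KNN threshold: highest scores first, ties broken alphabetically.
        word: sorted(neighbours, key=lambda n: (-n[1], n[0]))[:limit]
        for word, neighbours in by_word.items()
    }


sims = {("prison", "1967"): 3.0, ("prison", "cell"): 5.0, ("prison", "x"): 1.0}
print(sim_sort(sims, limit=2, min_sim=2))
```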

Calculate feature similarity matrix (Bim DT)

  1. Prune LMI word-feature matrix: keep the words that occur with at most 1000 features, etc.
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
     Output format:
-Bigram(Peace_@)            "                5.2421881850252845    1    401   4     229
-Bigram(Since_@)            its              9.680193169704102     1    37    2     41
-Bigram(Since_@)            1997             12.567718457850754    1    5     2     9
-Bigram(basis_@)            .                4.280929741912922     1    1041  3     472
-Bigram(basis_@)            for              14.89340615148975     2    232   3     221
-Bigram(birds_@)            and              4.539812448589677     1    870   3     693
-Bigram(birds_@)            could            10.845252473991478    1    11    3     14
-Bigram(plans_@)            to               13.331289672230339    2    598   2     503
Bigram(@_Living)            the              4.120635296910089     1    1745  2     898
Bigram(@_Living)            Country          13.889646552738117    1    2     2     3
Bigram(@_faster)            run              12.304684005603457    1    4     3     7
Bigram(@_things)            other            20.67011544853446     2    47    2     54
-Bigram(Ragged_@)           Ass              42.914052016810366    3    3     3     2
-Bigram(states_@)           .                3.2809297419129217    1    1041  6     472
-Bigram(states_@)           ;                7.719721574502556     1    48    6     60
-Bigram(states_@)           obtained         11.719721574502556    1    3     6     4
-Bigram(states_@)           that             13.859289250767814    2    166   6     160
Bigram(@_Florida)           driving          12.719721574502557    1    3     3     5
Bigram(@_Israeli)           an               6.625203942539863     1    123   5     103
Bigram(@_Israeli)           real             11.98275592812116     1    3     5     5


  • pig/PruneContext.pig
  1. Build feature inverted index
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
     Output format:
1943                -Bigram((_@)               Bigram(@_–)               Bigram(@_on)               -Bigram(September_@)
Americana           Bigram(@_()                -Bigram(Universidad_@)    Bigram(@_))
But                 Bigram(@_he)               Bigram(@_she)             Bigram(@_after)            Bigram(@_,)                Bigram(@_with)
CR                  -Bigram(,_@)               Bigram(@_21)              Bigram(@_7)                -Bigram(with_@)            -Bigram(from_@)
Cardiff             Bigram(@_,)                -Bigram(in_@)
Ministry            -Bigram(,_@)               Bigram(@_nominated)       Bigram(@_of)
Natural             -Bigram(the_@)             Bigram(@_Gas)             -Bigram(&_@)               -Bigram(and_@)
September           -Bigram(11_@)              Bigram(@_24)              Bigram(@_11)               Bigram(@_2012)             -Bigram(the_@)            -Bigram(On_@)              Bigram(@_2)               Bigram(@_1943)            -Bigram(in_@)              Bigram(@_1)                -Bigram(–_@)              -Bigram(on_@)              Bigram(@_29)             Bigram(@_14)              -Bigram(to_@)
West                -Bigram(to_@)              Bigram(@_Berlin)          Bigram(@_Virginia)         -Bigram(,_@)               -Bigram(Mountain_@)       Bigram(@_&)                -Bigram(violent_@)        Bigram(@_with)            -Bigram(by_@)              Bigram(@_Point)            -Bigram(on_@)             -Bigram(Old_@)             -Bigram(include_@)       Bigram(@_,)
accepted            Bigram(@_a)                -Bigram("_@)              Bigram(@_on)
began               Bigram(@_on)               -Bigram(and_@)            Bigram(@_running)          Bigram(@_their)            Bigram(@_as)              Bigram(@_,)                -Bigram(company_@)        -Bigram(girls_@)          -Bigram("_@)               -Bigram(store_@)           Bigram(@_in)              Bigram(@_to)               -Bigram(U2_@)            -Bigram(population_@)     Bigram(@_setting)
bill                -Bigram(The_@)             Bigram(@_would)           -Bigram(a_@)               Bigram(@_passed)
character           -Bigram(main_@)            Bigram(@_with)            Bigram(@_is)               -Bigram(the_@)             -Bigram(central_@)        -Bigram(allow_@)


  • org.jobimtext.hadoop.mapreducer.AggrPerFt
  1. Calculate feature similarities, pruning them by KNN thresholding (e.g. top 200 features) and global thresholding (e.g. sim of at least 2)
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
     Output format:
-Bigram(of_@)   Bigram(@_as)    15.0
-Bigram(of_@)   -Bigram(as_@)   15.0
Bigram(@_of)    -Bigram("_@)    15.0
-Bigram(of_@)   Bigram(@_)) 15.0
-Bigram(new_@)  -Bigram(the_@)  15.0
-Bigram(new_@)  Bigram(@_.) 15.0
Bigram(@_most)  Bigram(@_most)  15.0
Bigram(@_many)  Bigram(@_many)  15.0
Bigram(@_it)    Bigram(@_he)    15.0
-Bigram(it_@)   -Bigram(and_@)  15.0
Bigram(@_it)    Bigram(@_a) 15.0
-Bigram(is_@)   -Bigram(were_@) 15.0
Bigram(@_is)    -Bigram(and_@)  15.0
Bigram(@_is)    -Bigram(,_@)    15.0
Bigram(@_in)    Bigram(@_that)  15.0


  • org.jobimtext.hadoop.mapreducer.SimCount
  • pig/SimSort.pig
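
The feature similarity matrix (Bim DT) follows the same stages as the word similarity matrix, with the roles of words and features swapped. The compact Python sketch below makes that symmetry explicit. It is an in-memory illustration under the same assumptions as before (similarity taken as the number of shared contexts), it skips the LMI pruning stage, and the function names `thesaurus` and `feature_thesaurus` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product
from typing import Dict, Iterable, List, Tuple


def thesaurus(
    pairs: Iterable[Tuple[str, str]], limit: int = 200, min_sim: int = 2
) -> Dict[str, List[Tuple[str, int]]]:
    """Build item -> [(similar item, shared contexts)] from (item, context) rows."""
    # Inverted index: context -> items.
    index = defaultdict(set)
    for item, context in pairs:
        index[context].add(item)
    # Similarity counting: one point per shared context (assumption).
    sims: Counter = Counter()
    for items in index.values():
        for a, b in product(sorted(items), repeat=2):
            sims[(a, b)] += 1
    # Global threshold, then KNN limit per item.
    result = defaultdict(list)
    for (a, b), score in sims.items():
        if score >= min_sim:
            result[a].append((b, score))
    return {
        a: sorted(n, key=lambda x: (-x[1], x[0]))[:limit]
        for a, n in result.items()
    }


def feature_thesaurus(word_feature_pairs, **kwargs):
    """Bim DT: swap the columns so that features become the items."""
    return thesaurus([(f, w) for w, f in word_feature_pairs], **kwargs)


rows = [("it", "Bigram(@_is)"), ("it", "Bigram(@_was)"),
        ("he", "Bigram(@_is)"), ("he", "Bigram(@_was)")]
print(thesaurus(rows)["he"])
print(feature_thesaurus(rows)["Bigram(@_is)"])
```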