JoBimText formats and stages

Author: Alexander Panchenko. Last update: 20 Feb 2015. JoBimText version: jobimtext_pipeline_0.1.1.

This page describes the stages of the JoBimText pipeline, i.e. of the bash script generated by generateHadoopScript.py.

The output on the HDFS may look like this for the input file wikipedia_sample_1K:

  1 164.9 K  wikipedia_sample_1K
  2 699.7 K  wikipedia_sample_1K_bigram
  3 110.5 K  wikipedia_sample_1K_bigram__FeatureCount
  4 609.3 K  wikipedia_sample_1K_bigram__FreqSigLMI
  5 367.7 K  wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
  6 120.2 K  wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
  7 20.9 M   wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
  8 1.2 M    wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
  9 356.4 K  wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
 10 110.0 K  wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
 11 10.7 M   wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
 12 1.1 M    wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
 13 53.4 K   wikipedia_sample_1K_bigram__WordCount
 14 381.4 K  wikipedia_sample_1K_bigram__WordFeatureCount

Warning: the input here is 0.17 MB, while the combined output is 35.13 MB; the temporary output files are thus roughly 200 times bigger than the input corpus.

Calculate word-feature matrix (JoBim text)

  1. Cleanup of the HDFS filesystem.
     Output directory: wikipedia_sample_1K
     Output format:
The following week Hardy saved his brother when CM Punk and The Hart Dynasty attacked both Jeff and John Morrison, turning into a fan favorite again.
Particularly in western societies, modern legal conventions which stipulate points in late adolescence or early adulthood (most commonly 16-21 when adolescents are generally no longer considered minors a
Major railways began running trains at 10–20 minute intervals, rather than the usual 3–5 minute intervals, operating some lines only at rush hour and completely shutting down others; notably, the Tōkaidō
  2. The holing operation: generate a word-feature co-occurrence file (a sketch follows the steps below).
     Output directory: wikipedia_sample_1K_bigram
     Output format:
In                     Bigram(@_2009)                    wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  0:2        3:7;
2009                   -Bigram(In_@)                     wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  3:7        0:2;
2009                   Bigram(@_,)                       wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  3:7        7:8;
,                      -Bigram(2009_@)                   wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  7:8        3:7;
,                      Bigram(@_a)                       wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  7:8        9:10;
a                      -Bigram(,_@)                      wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  9:10       7:8;
a                      Bigram(@_man)                     wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  9:10       11:14;
man                    -Bigram(a_@)                      wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  11:14      9:10;
man                    Bigram(@_from)                    wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  11:14      15:19;
from                   -Bigram(man_@)                    wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  15:19      11:14;
from                   Bigram(@_Billingham)              wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  15:19      20:30;
Billingham             -Bigram(from_@)                   wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  20:30      15:19;
Billingham             Bigram(@_,)                       wikipedia_sample_1K-attempt_1423840143031_0005_m_000000_0  20:30      30:31;

Steps:

  • copy the UIMA holing operation XML description
  • run the UIMA XML descriptor with de.tudarmstadt.ukp.dkpro.bigdata.hadoop.XMLDescriptorRunner
  • clean up the results
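
The holing operation itself runs as a UIMA annotator on Hadoop; the following is only a minimal local sketch of what bigram holing computes, assuming whitespace tokenization (the real pipeline's tokenizer and hence its character offsets will differ). Each adjacent token pair (w1, w2) produces two JoBim pairs: w1 with the right-context feature Bigram(@_w2), and w2 with the left-context feature -Bigram(w1_@), each with character spans as in the output above.

```python
def hole_bigrams(sentence):
    """Yield (jo, bim, jo_span, bim_span) for every adjacent token pair."""
    tokens, spans, pos = [], [], 0
    for tok in sentence.split():               # assumption: whitespace tokens
        start = sentence.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        spans.append("%d:%d" % (start, end))
        pos = end
    for i in range(len(tokens) - 1):
        w1, w2 = tokens[i], tokens[i + 1]
        yield w1, "Bigram(@_%s)" % w2, spans[i], spans[i + 1]   # right context
        yield w2, "-Bigram(%s_@)" % w1, spans[i + 1], spans[i]  # left context

for jo, bim, js, bs in hole_bigrams("In 2009 , a man from Billingham"):
    print("\t".join((jo, bim, js, bs)))
```
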
  3. Calculate the word (Jo) frequency dictionary.
     Output directory: wikipedia_sample_1K_bigram__WordCount
     Output format:
adventures             1
along                  12
average                1
be                     76
bed                    1
bedrooms               1
began                  15
bill                   3
blunt                  1
boxing                 1
co-ordinate            1

Steps:

  • org.jobimtext.hadoop.mapreducer.UniqMapper of words
  4. Calculate the feature (Bim) frequency dictionary.
     Output directory: wikipedia_sample_1K_bigram__FeatureCount
     Output format:
-Bigram(spinning_@)               1
-Bigram(states_@)                 6
-Bigram(stored-program_@)         1
-Bigram(subsequently_@)           6
-Bigram(switchers_@)              1
-Bigram(think_@)                  1
-Bigram(thus_@)                   2
-Bigram(timing_@)                 1
-Bigram(tracked_@)                1
-Bigram(tree_@)                   2

Steps:

  • org.jobimtext.hadoop.mapreducer.UniqMapper of features
  5. Calculate the word-feature (JoBim) frequency dictionary (a sketch covering the three counting stages follows the steps below).
     Output directory: wikipedia_sample_1K_bigram__WordFeatureCount
     Output format:
Washington             -Bigram(to_@)                     1
Would                  -Bigram(:_@)                      1
a                      -Bigram("_@)                      2
a                      -Bigram(from_@)                   8
a                      Bigram(@_5-match)                 1
a                      Bigram(@_96)                      1
a                      Bigram(@_multi-million-dollar)    1
a                      Bigram(@_path)                    1
a                      Bigram(@_recording)               1
a                      Bigram(@_result)                  4
a                      Bigram(@_stakes)                  1
above                  Bigram(@_walls)                   1
accepted               Bigram(@_a)                       1
admitted               Bigram(@_under)                   1
after                  -Bigram(disappointment_@)         1
against                Bigram(@_an)                      1

Steps:

  • org.jobimtext.hadoop.mapreducer.UniqMapper
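
The three counting stages above all aggregate the same holing output, only by different keys. A minimal local sketch of what they compute (the real jobs run as Hadoop MapReduce around UniqMapper; the file name below is a local stand-in for the HDFS output):

```python
from collections import Counter

word_count, feature_count, word_feature_count = Counter(), Counter(), Counter()
with open("wikipedia_sample_1K_bigram") as holing_output:  # local stand-in
    for line in holing_output:
        jo, bim = line.rstrip("\n").split("\t")[:2]  # word and feature columns
        word_count[jo] += 1                          # *_WordCount
        feature_count[bim] += 1                      # *_FeatureCount
        word_feature_count[(jo, bim)] += 1           # *_WordFeatureCount
```
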
  6. Calculate the LMI word-feature matrix (a sketch of the score follows the steps below).
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI
     Output format (word, feature, lmi, #word-feature, #features:word, #words:feature):
said                   -Bigram(Chen_@)                   12.719721574502557      1    9.0     1.0
Company                -Bigram(Fire_@)                   13.889646552738117      1    2.0     2.0
"                      -Bigram(Fire_@)                   6.242188185025285       1    401.0   2.0
owned                  -Bigram(Navy_@)                   12.304684005603457      1    4.0     3.0
'                      -Bigram(Navy_@)                   9.397793427400003       1    30.0    3.0
was                    -Bigram(Navy_@)                   6.254835440923465       1    265.0   3.0
also                   -Bigram(Tech_@)                   9.982755928121158       1    60.0    1.0
of                     -Bigram(feed_@)                   10.786584240845354      2    963.0   3.0
on                     -Bigram(feed_@)                   6.853472900298028       1    175.0   3.0
for                    -Bigram(love_@)                   8.031665534403793       1    232.0   1.0
cheaper                -Bigram(thus_@)                   14.889646552738117      1    1.0     2.0
backwards              -Bigram(thus_@)                   14.889646552738117      1    1.0     2.0
of                     -Bigram(tree_@)                   11.956509154990396      2    963.0   2.0
(                      Bigram(@_"haft)                   8.101743936780228       1    221.0   1.0
with                   Bigram(@_2,168)                   8.082291677094014       1    224.0   1.0
of                     Bigram(@_Anita)                   4.978254577495198       1    963.0   2.0
Santa                  Bigram(@_Anita)                   12.567718457850754      1    5.0     2.0
"                      Bigram(@_Iliad)                   7.242188185025285       1    401.0   1.0
in                     Bigram(@_Miami)                   5.486634475134901       1    677.0   2.0
the                    Bigram(@_Miami)                   4.120635296910089       1    1745.0  2.0
near                   Bigram(@_Ongar)                   13.082291677094014      1    7.0     1.0
in                     Bigram(@_Santa)                   4.164706410933029       1    677.0   5.0
of                     Bigram(@_Santa)                   3.6563264389591996      1    963.0   5.0
de                     Bigram(@_Santa)                   9.319790900894509       1    19.0    5.0
real                   Bigram(@_Santa)                   11.98275592812116       1    3.0     5.0
at                     Bigram(@_Santa)                   6.338899700272855       1    150.0   5.0

Steps:

  • pig/FreqSigLMI.pig
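
pig/FreqSigLMI.pig scores each word-feature pair with LMI (lexicographer's mutual information). Assuming the standard definition LMI(w,f) = n(w,f) * log2(n(w,f) * N / (n(w) * n(f))), with N the total number of observed word-feature pairs, a minimal sketch:

```python
from math import log2

def lmi(n_wf, n_w, n_f, n_total):
    """Lexicographer's mutual information: n(w,f) times the PMI of w and f."""
    return n_wf * log2(n_wf * n_total / (n_w * n_f))

# Illustrative numbers only: a pair seen once, a word seen 9 times and a
# feature seen once, in a corpus of 50000 observed pairs.
print(lmi(1, 9.0, 1.0, 50000))  # ~12.44
```
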

Calculate word similarity matrix (Jo DT)

  1. Prune the LMI word-feature matrix: keep only features that occur with at most 1000 words, etc. (a sketch follows the steps below).
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
     Output format:
software         -Bigram(commercial_@)      12.304684005603457    1    4     3     3
software         Bigram(@_development)      11.304684005603457    1    4     6     5
Americana        Bigram(@_))                7.050442716052036     1    2     229   92
Americana        Bigram(@_()                7.1017439367802275    1    2     221   103
Initially        Bigram(@_,)                4.1754010713325425    1    2     1680  661
Initially        Bigram(@_sold)             13.304684005603457    1    2     3     3
Palestine        -Bigram(in_@)              4.901672066470214     1    3     677   212
Palestine        -Bigram(to_@)              5.080682389603534     1    3     598   220
Palestine        Bigram(@_established)      10.604244296164897    1    3     13    6
Palestine        Bigram(@_League)           10.604244296164897    1    3     13    9
Palestine        Bigram(@_to)               5.080682389603534     1    3     598   283
formation        Bigram(@_.)                3.8658921954500394    1    4     1041  467
formation        Bigram(@_in)               4.486634475134901     1    4     677   338
formation        Bigram(@_of)               3.978254577495198     1    4     963   385
formation        -Bigram(the_@)             8.241270593820177     2    4     1745  646
formation        Bigram(@_;)                8.304684005603457     1    4     48    35

Steps:

  • pig/PruneContext.pig
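
A minimal sketch of the filtering pig/PruneContext.pig performs, assuming the directory suffix encodes its parameters (s = minimum significance, w/f/wf = minimum word, feature and word-feature frequencies, wpfmax/wpfmin = maximum/minimum number of words per feature, p = maximum features kept per word); the row layout is a hypothetical reading of the columns shown above:

```python
def prune(rows, s=0.0, w=2, f=2, wf=0, wpfmax=1000, wpfmin=2, p=1000):
    """rows: (word, feature, sig, n_wf, n_w, n_f, words_per_feature) tuples."""
    kept = [r for r in rows
            if r[2] >= s and r[3] >= wf and r[4] >= w and r[5] >= f
            and wpfmin <= r[6] <= wpfmax]
    per_word = {}                 # keep the p most significant features per word
    for r in sorted(kept, key=lambda r: -r[2]):
        per_word.setdefault(r[0], [])
        if len(per_word[r[0]]) < p:
            per_word[r[0]].append(r)
    return [r for feats in per_word.values() for r in feats]
```
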
  2. Create an inverted index of features (a sketch follows the steps below).
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
     Output format:
Bigram(@_rise)             The             students
Bigram(@_throughout)       stations        tour            ceremonies       homeroom        ,
Bigram(@_tries)            's              )
Bigram(@_visa)             a               the
Bigram(@_were)             which           roads           these            kids            Troops         Act              women           vibrations      artists         studios      York           team           records        venues         population    deserts         people         %               there       Line        supports      Titans       two          They            occupied        attempts        sets         men         bridge       Bowl         parts        league         2005          rear         measurements  pieces         Obama       buildings    Most             rights         they           )            There           who            probably
-Bigram(-_@)               Dr.             The             to               June
-Bigram(12_@)              m               months          and              pieces          ,              located          to              .               cm              May
-Bigram(1989_@)            the             edition
-Bigram(Newlyn_@)          School          .
-Bigram(bird_@)            species         12
-Bigram(co-located_@)      at              with
-Bigram(direct_@)          support         it
-Bigram(especially_@)      common          with            when             during          in             since            under           around          those           "            roads
-Bigram(except_@)          in              for             three

Steps:

  • org.jobimtext.hadoop.mapreducer.AggrPerFt
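
A minimal sketch of the aggregation: group the pruned word-feature pairs by feature, so each output line lists a feature together with all words observed with it.

```python
from collections import defaultdict

def aggregate_per_feature(pairs):
    """pairs: (word, feature) tuples from the pruned matrix."""
    index = defaultdict(list)
    for word, feature in pairs:
        index[feature].append(word)
    return index

index = aggregate_per_feature([("The", "Bigram(@_rise)"),
                               ("students", "Bigram(@_rise)")])
for feature, words in index.items():
    print(feature + "\t" + "\t".join(words))
```
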
  3. Calculate word similarities (a sketch follows the steps below).
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False
     Output format:
woman   year    3.0
woman   visit   3.0
woman   video   3.0
Convention  years   3.0
Convention  Year    3.0
Convention  work    3.0
a   a   378.0
for for 208.0
was was 204.0
and ,   204.0
,   and 204.0

Steps:

  • org.jobimtext.hadoop.mapreducer.SimCount
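
With sc_one, every shared feature contributes 1 to a pair's similarity, so sim(w1, w2) is the number of salient features the two words share. A minimal sketch over the inverted index from the previous step:

```python
from collections import Counter

def sim_count(index):
    """index: feature -> list of words, as built by AggrPerFt."""
    sims = Counter()
    for words in index.values():
        for w1 in words:
            for w2 in words:
                sims[(w1, w2)] += 1  # self-pairs included, cf. "a  a  378.0"
    return sims
```
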

  4. Prune the similarity graph by kNN thresholding (e.g. keep the 200 most similar words) and by a global threshold (e.g. sim >= 2); a sketch follows the steps below.
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
     Output format:
prison  1967    3.0
prison  1947    3.0
prison  1939    3.0
printed printed 3.0
principle   town    3.0
principle   station 3.0
principle   principle   3.0
principle   music   3.0
Prince  work    3.0
Prince  wake    3.0

Steps:

  • pig/SimSort.pig
  • for instance, 2,038,768 word pairs before pruning and 69,192 after pruning
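
A minimal sketch of what pig/SimSort.pig computes: drop pairs below the global threshold, then keep at most `limit` nearest neighbours per word.

```python
def sim_sort(sims, limit=200, minsim=2):
    """sims: dict mapping (w1, w2) to similarity; returns pruned triples."""
    per_word = {}
    for (w1, w2), s in sims.items():
        if s >= minsim:                          # global threshold
            per_word.setdefault(w1, []).append((w2, s))
    pruned = []
    for w1, neighbours in per_word.items():
        neighbours.sort(key=lambda n: -n[1])     # most similar first
        pruned.extend((w1, w2, s) for w2, s in neighbours[:limit])
    return pruned
```
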

Calculate feature similarity matrix (Bim DT)

  1. Prune the LMI word-feature matrix: keep only words that occur with at most 1000 features, etc.
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000
     Output format:
Bigram(Peace_@)            "                5.2421881850252845    1    401   4     229
-Bigram(Since_@)            its              9.680193169704102     1    37    2     41
-Bigram(Since_@)            1997             12.567718457850754    1    5     2     9
-Bigram(basis_@)            .                4.280929741912922     1    1041  3     472
-Bigram(basis_@)            for              14.89340615148975     2    232   3     221
-Bigram(birds_@)            and              4.539812448589677     1    870   3     693
-Bigram(birds_@)            could            10.845252473991478    1    11    3     14
-Bigram(plans_@)            to               13.331289672230339    2    598   2     503
Bigram(@_Living)            the              4.120635296910089     1    1745  2     898
Bigram(@_Living)            Country          13.889646552738117    1    2     2     3
Bigram(@_faster)            run              12.304684005603457    1    4     3     7
Bigram(@_things)            other            20.67011544853446     2    47    2     54
-Bigram(Ragged_@)           Ass              42.914052016810366    3    3     3     2
-Bigram(states_@)           .                3.2809297419129217    1    1041  6     472
-Bigram(states_@)           ;                7.719721574502556     1    48    6     60
-Bigram(states_@)           obtained         11.719721574502556    1    3     6     4
-Bigram(states_@)           that             13.859289250767814    2    166   6     160
Bigram(@_Florida)           driving          12.719721574502557    1    3     3     5
Bigram(@_Israeli)           an               6.625203942539863     1    123   5     103
Bigram(@_Israeli)           real             11.98275592812116     1    3     5     5

Steps:

  • pig/PruneContext.pig
  2. Build the feature inverted index.
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt
     Output format:
1943                -Bigram((_@)               Bigram(@_–)               Bigram(@_on)               -Bigram(September_@)
Americana           Bigram(@_()                -Bigram(Universidad_@)    Bigram(@_))
But                 Bigram(@_he)               Bigram(@_she)             Bigram(@_after)            Bigram(@_,)                Bigram(@_with)
CR                  -Bigram(,_@)               Bigram(@_21)              Bigram(@_7)                -Bigram(with_@)            -Bigram(from_@)
Cardiff             Bigram(@_,)                -Bigram(in_@)
Ministry            -Bigram(,_@)               Bigram(@_nominated)       Bigram(@_of)
Natural             -Bigram(the_@)             Bigram(@_Gas)             -Bigram(&_@)               -Bigram(and_@)
September           -Bigram(11_@)              Bigram(@_24)              Bigram(@_11)               Bigram(@_2012)             -Bigram(the_@)            -Bigram(On_@)              Bigram(@_2)               Bigram(@_1943)            -Bigram(in_@)              Bigram(@_1)                -Bigram(–_@)              -Bigram(on_@)              Bigram(@_29)             Bigram(@_14)              -Bigram(to_@)
West                -Bigram(to_@)              Bigram(@_Berlin)          Bigram(@_Virginia)         -Bigram(,_@)               -Bigram(Mountain_@)       Bigram(@_&)                -Bigram(violent_@)        Bigram(@_with)            -Bigram(by_@)              Bigram(@_Point)            -Bigram(on_@)             -Bigram(Old_@)             -Bigram(include_@)       Bigram(@_,)
accepted            Bigram(@_a)                -Bigram("_@)              Bigram(@_on)
began               Bigram(@_on)               -Bigram(and_@)            Bigram(@_running)          Bigram(@_their)            Bigram(@_as)              Bigram(@_,)                -Bigram(company_@)        -Bigram(girls_@)          -Bigram("_@)               -Bigram(store_@)           Bigram(@_in)              Bigram(@_to)               -Bigram(U2_@)            -Bigram(population_@)     Bigram(@_setting)
bill                -Bigram(The_@)             Bigram(@_would)           -Bigram(a_@)               Bigram(@_passed)
character           -Bigram(main_@)            Bigram(@_with)            Bigram(@_is)               -Bigram(the_@)             -Bigram(central_@)        -Bigram(allow_@)

Steps:

  • org.jobimtext.hadoop.mapreducer.AggrPerFt
  3. Calculate feature similarities, pruning them by kNN thresholding (e.g. keep the top 200 features) and by a global threshold (e.g. sim of at least 2); see the sketch at the end of this section.
     Output directory: wikipedia_sample_1K_bigram__FreqSigLMI__PruneContext_BIM_s_0.0_w_2_f_2_wf_0_wpfmax_1000_wpfmin_2_p_1000__AggrPerFt__SimCount_sc_one_ac_False__SimSortlimit_200_minsim_2
     Output format:
-Bigram(of_@)   Bigram(@_as)    15.0
-Bigram(of_@)   -Bigram(as_@)   15.0
Bigram(@_of)    -Bigram("_@)    15.0
-Bigram(of_@)   Bigram(@_)) 15.0
-Bigram(new_@)  -Bigram(the_@)  15.0
-Bigram(new_@)  Bigram(@_.) 15.0
Bigram(@_most)  Bigram(@_most)  15.0
Bigram(@_many)  Bigram(@_many)  15.0
Bigram(@_it)    Bigram(@_he)    15.0
-Bigram(it_@)   -Bigram(and_@)  15.0
Bigram(@_it)    Bigram(@_a) 15.0
-Bigram(is_@)   -Bigram(were_@) 15.0
Bigram(@_is)    -Bigram(and_@)  15.0
Bigram(@_is)    -Bigram(,_@)    15.0
Bigram(@_in)    Bigram(@_that)  15.0

Steps:

  • org.jobimtext.hadoop.mapreducer.SimCount
  • pig/SimSort.pig
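
Conceptually, the Bim DT reuses the Jo DT machinery with the roles of words and features swapped: the BIM-pruned matrix is aggregated per word, and features that share many words come out as similar. A minimal sketch, reusing the aggregate_per_feature, sim_count and sim_sort sketches above:

```python
def feature_similarities(pairs, limit=200, minsim=2):
    """pairs: (word, feature) tuples from the BIM-pruned matrix."""
    transposed = [(feature, word) for word, feature in pairs]
    index = aggregate_per_feature(transposed)  # here: word -> its features
    return sim_sort(sim_count(index), limit=limit, minsim=minsim)
```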