tokenizer from mosesdecoder
Kyunghyun Cho committed Jul 3, 2015
1 parent 7cdb933 commit 6b5e3bb
Showing 24 changed files with 5,913 additions and 0 deletions.
8 changes: 8 additions & 0 deletions data/nonbreaking_prefixes/README.txt
@@ -0,0 +1,8 @@
The language suffix can be found here:

http://www.loc.gov/standards/iso639-2/php/code_list.php

This code includes data from Daniel Naber's Language Tools (czech abbreviations).
This code includes data from czech wiktionary (also czech abbreviations).
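
The README says the file suffix is a language code, so the Catalan list added in this commit is `nonbreaking_prefix.ca`. A minimal sketch of resolving and reading such a file follows. The helper names `prefix_path`, `parse_prefixes`, and `load_prefixes` are hypothetical and not part of this commit, and treating lines that start with `#` as comments is an assumption about the file format.

```python
import os


def prefix_path(lang, directory="data/nonbreaking_prefixes"):
    """Build the path of the prefix file for a language suffix such as "ca"."""
    return os.path.join(directory, "nonbreaking_prefix." + lang)


def parse_prefixes(lines):
    """Collect one prefix per line, skipping blank lines.

    Assumption: lines starting with '#' are comments and are ignored.
    """
    prefixes = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            prefixes.add(entry)
    return prefixes


def load_prefixes(lang, directory="data/nonbreaking_prefixes"):
    """Read the prefix file for `lang` from disk as UTF-8 text."""
    with open(prefix_path(lang, directory), encoding="utf-8") as handle:
        return parse_prefixes(handle)
```

With the files from this commit checked out, `load_prefixes("ca")` would return the Catalan entries as a set of strings.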


75 changes: 75 additions & 0 deletions data/nonbreaking_prefixes/nonbreaking_prefix.ca
@@ -0,0 +1,75 @@
Dr
Dra
pàg
p
c
av
Sr
Sra
adm
esq
Prof
S.A
S.L
p.e
ptes
Sta
St
pl
màx
cast
dir
nre
fra
admdora
Emm
Excma
espf
dc
admdor
tel
angl
aprox
ca
dept
dj
dl
dt
ds
dg
dv
ed
entl
al
i.e
maj
smin
n
núm
pta
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
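
For context, a nonbreaking-prefix list tells a tokenizer which words keep their trailing period attached, for example abbreviations such as `Dr.` or single-letter initials. The sketch below illustrates that idea on whitespace-separated tokens. It is a simplified assumption about how such a list can be applied, not the logic of the tokenizer script added in this commit. `CATALAN_PREFIXES` is a hand-copied subset of the entries above, and `split_final_period` and `tokenize` are hypothetical names.

```python
# Subset of the Catalan entries listed above, copied by hand for illustration.
CATALAN_PREFIXES = {"Dr", "Dra", "Sr", "Sra", "Prof", "núm", "pàg", "p.e", "S.A"}


def split_final_period(token, prefixes):
    """Split a trailing period off a token unless the word is a known prefix."""
    if len(token) < 2 or not token.endswith("."):
        return [token]
    word = token[:-1]
    if word in prefixes:
        # Abbreviation: the period belongs to the word, so keep it attached.
        return [token]
    return [word, "."]


def tokenize(text, prefixes):
    """Whitespace-split the text, then detach sentence-final periods."""
    tokens = []
    for token in text.split():
        tokens.extend(split_final_period(token, prefixes))
    return tokens


print(tokenize("El Dr. Puig viu al núm. 5 del carrer.", CATALAN_PREFIXES))
```

Here `Dr.` and `núm.` stay whole because `Dr` and `núm` are in the list, while the period after `carrer` is split off as its own token.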