-
Notifications
You must be signed in to change notification settings - Fork 7
Implement optional expanding of compound words into separate tokens #5
base: master
Are you sure you want to change the base?
Implement optional expanding of compound words into separate tokens #5
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
I studied the code and tried to understand it and came up with some questions and comments. Some of the comments are mostly about my personal preference towards the style of the codebase, but then there are some real questions about the functionality as well.
If you can go through my comments and figure out answers as how the code really should work I can take care of any post-merge cleanups, but before merging I'd like to be sure to have an understanding. So especially the parsing questions towards the end of the review interest me.
| libraryPath | system dependent | path to directory containing libvoikko | | ||
| poolMaxSize | 10 | maximum amount of Voikko-instances to pool | | ||
| analysisCacheSize | 1024 | number of analysis results to cache | | ||
| expandCompounds | false | whether to produce separate tokens for parts of compound words | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The minimumSubwordSize
and maximumSubwordSize
are not documented here. Is that intentional or accidental? If you you consider them to be so rarely needed that they don't need to be documented, that's fine by me. I just want to make sure that they aren't forgotten by accident.
for (String compound : expandCompounds(results)) { | ||
baseForms.add(compound); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be just:
if (cfg.expandCompounds)
baseForms.addAll(expandCompounds(results));
} | ||
} | ||
if (!cfg.analyzeAll) { | ||
return new ArrayList<String>(new LinkedHashSet<String>(baseForms)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of notes:
First of all, this is Java 8, so you can leave out the explicit types and say:
return new ArrayList<>(new LinkedHashSet<>(baseForms));
Second of all, I'd probably add a method like CollectionUtils.withoutDuplicates
and write just return withoutDuplicates(baseForms)
here. There are two reasons for this: first one is communicating the intent more clearly, saying what the code is supposed to do instead of how it's implemented. And second, having a separate method would in fact allow nice little optimizations without complicating the calling code. For example:
public static <T> List<T> withoutDuplicates(List<T> xs) {
if (xs.size() < 2) return xs;
return ArrayList<>(new LinkedHashSet<>(xs));
}
Or even something like:
public static <T> List<T> withoutDuplicates(List<T> xs) {
if (xs.size() < 2) return xs;
// for small collections (our usual case) the naive algorithm should be a win
if (xs.size() < 16) {
ArrayList<T> result = new ArrayList<>(xs.size());
for (T x : xs)
if (!result.contains(x))
result.add(x);
return result;
}
return ArrayList<>(new LinkedHashSet<>(xs));
}
Not that I'd probably bother to optimize it, but having it's nice to write higher-level code in the calling side and know that I have a good place to perform optimization if I have to get my hands dirty. This code is called a lot and reducing extra allocations helps.
if (!analysis.containsKey("WORDBASES")) { | ||
continue; | ||
} | ||
String wordbases = analysis.get("WORDBASES"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This performs lookups to the map: first with containsKey
and then with get
. It's better to just say:
String wordbases = analysis.get("WORDBASES");
if (wordbases == null) continue;
Almost the only reason to call containsKey
is in the obscure case where null
is a valid value in the map and you need to distinguish between the cases where null
was returned because value was found and the case where null
was returned because null
was found. Even in that case it's best to call containsKey
after the call to get
only if necessary. So something like:
String value = map.get(key);
if (value != null || map.containsKey(key)) {
// map contained entry for 'key', value could be null
} else {
// map did not contain 'key'
}
That said, it's super-rare to contain maps where null
is a valid value.
} | ||
String wordbases = analysis.get("WORDBASES"); | ||
// Split by plus sign (unless right after an open parenthesis) | ||
String matches[] = wordbases.split("(?<!\\()\\+"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ends up compiling the same regex over and over again. It's better to perform the expensive compilation of the regex just once:
private static final Pattern WORDBASE_SPLIT = Pattern.compile(""(?<!\\()\\+"");
and then say:
String[] matches = WORDBASE_SPLIT.split(wordbases);
int minSubwordSize = cfg.minimumSubwordSize; | ||
if (wordLen > minSubwordSize) { | ||
if (wordLen > maxSubwordSize) { | ||
word = word.substring(0, maxSubwordSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is prudent to just cut words to maxSubwordSize
or should we just ignore the words that exceed the length? I could be persuaded that either behavior is correct, but it is somewhat confusing that the semantics of minSubwordSize
and maxSubwordSize
are so different even though the names are symmetrical.
// by the National Library of Finland. | ||
// | ||
// https://github.com/NatLibFi/SolrPlugins | ||
Set<String> compoundForms = new LinkedHashSet<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now duplicate removal is performed at two different levels: this method performs and the caller performs it again. It's probably best to just leave it to the caller and return a List
from here.
} | ||
} | ||
} | ||
return compoundForms; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is quite complex and it's hard simply look at it and be assured that it's correct. Furthermore, it's hard to write tests for this because it's tied to running Voikko.
Perhaps the code could be structured along these lines:
private List<String> expandCompounds(List<Analysis> analysisList) {
List<String> compoundForms = new ArrayList<String>();
for (Analysis analysis: analysisList) {
String wordBases = analysis.get("WORDBASES");
if (wordBases != null)
compoundForms.addAll(expandWordBases(wordBases));
}
return compoundForms:
}
private List<String> expandWordBases(String wordBases) {
return CompoundWordParser.expandWordBases(wordBases, charTermAttribute.length(), cfg.maximumSubwordSize, cfg.minimumSubwordSize);
}
This way the complex logic would be in a static method for which it would be trivial to write test cases:
assertWordBases("+köyde(köysi)+n+veto(veto)",
"köysi", "veto");
assertWordBases("+alkio(Alkio)+-+opisto(opisto)",
"Alkio", "opisto");
assertWordBases("+kansa(kansa)+llis(+llinen)+eepos(eepos)",
"kansa", "kansallis", "eepos");
I copy-pasted your code to a scratch file, hacked it a bit to make it callable from test, asked what it gives back to the examples inputs above and then wrote the expected results based on the results the code gave me. Do those look expected? I'm surprised about "kansallis" is the results. Wouldn't the correct value be "kansallinen"?
Furthermore I'm a bit confused if the logic can be right because the only reason why the input +köyde(köysi)+n+veto(veto)
does not produce nonsensical output köysi
, n
, veto
is that n
is rejected for being too short. I guess that "n" should be rejected because it does not have a specified root, not because it's too short. If this is fixed perhaps, the minimum and maximum lengths are not needed at all?
All that said, here's my attempt at this function:
private static final Pattern WORDBASE = Pattern.compile("\\+([^+(]+)(\\(([^)]*)\\))?");
static List<String> expandWordBases(String wordbases) {
Matcher m = WORDBASE.matcher(wordbases.replace("=", ""));
String lastBody = "";
List<String> result = new ArrayList<>();
while (m.find()) {
String body = m.group(1);
String root = m.group(3);
if (root != null) {
if (root.startsWith("+")) {
result.add(lastBody + root.substring(1));
} else {
result.add(root);
lastBody = body;
}
}
}
return result;
}
And some tests to go with it:
@Test
void voikkoExamples() {
assertWordBases("+köyde(köysi)+n+veto(veto)", "köysi", "veto");
assertWordBases("+alkio(Alkio)+-+opisto(opisto)", "Alkio", "opisto");
assertWordBases("+kansa(kansa)+llis(+llinen)+eepos(eepos)", "kansa", "kansallinen", "eepos");
}
private static void assertWordBases(String input, String... expected) {
assertEquals(asList(expected), Foo.expandWordBases(input));
}
lastWordBody = wordBody; | ||
lastPos = currentPos; | ||
currentPos += baseForm.length(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say that I understand the meaning of juggling all these variables, but just by looking at the code above, I can say that at least assignment to wordLen
can be moved outside of the branch since it's the same on both cases:
int wordLen = word.length();
|
||
int maxSubwordSize = cfg.maximumSubwordSize; | ||
int minSubwordSize = cfg.minimumSubwordSize; | ||
if (wordLen > minSubwordSize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I specify minimum size to be 2, I'd expect that words of length 2 will be included, but this excludes them. So perhaps this should be >=
instead of >
?
Thank you! This is an excellent review. I realize that some work needs to be done to make the implementation mature and easier to reason about. I will try to get back with a new proposal as soon as possible. |
Please note that this implementation contains code from
https://github.com/NatLibFi/SolrPlugins/tree/master/Voikko
which is a National Library of Finland project which was kindly relicensed by my request
to be compatible with this project.
See issue #4