This repository has been archived by the owner on Jan 22, 2021. It is now read-only.

Implement optional expanding of compound words into separate tokens #5

Open

tituomin wants to merge 1 commit into EvidentSolutions:master from City-of-Helsinki:feature/compound-words

tituomin commented Apr 18, 2019

Please note that this implementation contains code from
https://github.com/NatLibFi/SolrPlugins/tree/master/Voikko
which is a National Library of Finland project which was kindly relicensed by my request
to be compatible with this project.

See issue #4


          Implement optional expanding of compound words into separate tokens

9416e00

komu reviewed

View reviewed changes

Member

komu left a comment

Thanks a lot!

I studied the code and tried to understand it and came up with some questions and comments. Some of the comments are mostly about my personal preference towards the style of the codebase, but then there are some real questions about the functionality as well.

If you can go through my comments and figure out answers as how the code really should work I can take care of any post-merge cleanups, but before merging I'd like to be sure to have an understanding. So especially the parsing questions towards the end of the review interest me.

README.md

+              | libraryPath       | system dependent | path to directory containing libvoikko                         |
+              | poolMaxSize       | 10               | maximum amount of Voikko-instances to pool                     |
+              | analysisCacheSize | 1024             | number of analysis results to cache                            |
+              | expandCompounds   | false            | whether to produce separate tokens for parts of compound words |

Member

komu Apr 18, 2019

The minimumSubwordSize and maximumSubwordSize are not documented here. Is that intentional or accidental? If you you consider them to be so rarely needed that they don't need to be documented, that's fine by me. I just want to make sure that they aren't forgotten by accident.

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                          for (String compound : expandCompounds(results)) {
+                              baseForms.add(compound);
+                          }
+                      }

Member

komu Apr 18, 2019

This could be just:

if (cfg.expandCompounds)
    baseForms.addAll(expandCompounds(results));

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                          }
+                      }
+                      if (!cfg.analyzeAll) {
+                          return new ArrayList<String>(new LinkedHashSet<String>(baseForms));

Member

komu Apr 18, 2019

Couple of notes:

First of all, this is Java 8, so you can leave out the explicit types and say:

return new ArrayList<>(new LinkedHashSet<>(baseForms));

Second of all, I'd probably add a method like CollectionUtils.withoutDuplicates and write just return withoutDuplicates(baseForms) here. There are two reasons for this: first one is communicating the intent more clearly, saying what the code is supposed to do instead of how it's implemented. And second, having a separate method would in fact allow nice little optimizations without complicating the calling code. For example:

public static <T> List<T> withoutDuplicates(List<T> xs) {
    if (xs.size() < 2) return xs;
    return ArrayList<>(new LinkedHashSet<>(xs));
}

Or even something like:

public static <T> List<T> withoutDuplicates(List<T> xs) {
    if (xs.size() < 2) return xs;

    // for small collections (our usual case) the naive algorithm should be a win
    if (xs.size() < 16) {
        ArrayList<T> result = new ArrayList<>(xs.size());
        for (T x : xs)
            if (!result.contains(x))
                result.add(x);
        return result;
    }

    return ArrayList<>(new LinkedHashSet<>(xs));
}

Not that I'd probably bother to optimize it, but having it's nice to write higher-level code in the calling side and know that I have a good place to perform optimization if I have to get my hands dirty. This code is called a lot and reducing extra allocations helps.

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                          if (!analysis.containsKey("WORDBASES")) {
+                              continue;
+                          }
+                          String wordbases = analysis.get("WORDBASES");

Member

komu Apr 18, 2019

This performs lookups to the map: first with containsKey and then with get. It's better to just say:

String wordbases = analysis.get("WORDBASES");
if (wordbases == null) continue;

Almost the only reason to call containsKey is in the obscure case where null is a valid value in the map and you need to distinguish between the cases where null was returned because value was found and the case where null was returned because null was found. Even in that case it's best to call containsKey after the call to get only if necessary. So something like:

String value = map.get(key);
if (value != null || map.containsKey(key)) {
    // map contained entry for 'key', value could be null
} else {
    // map did not contain 'key'
}

That said, it's super-rare to contain maps where null is a valid value.

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                          }
+                          String wordbases = analysis.get("WORDBASES");
+                          // Split by plus sign (unless right after an open parenthesis)
+                          String matches[] = wordbases.split("(?<!\\()\\+");

Member

komu Apr 18, 2019

This ends up compiling the same regex over and over again. It's better to perform the expensive compilation of the regex just once:

private static final Pattern WORDBASE_SPLIT = Pattern.compile(""(?<!\\()\\+"");

and then say:

String[] matches = WORDBASE_SPLIT.split(wordbases);

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                              int minSubwordSize = cfg.minimumSubwordSize;
+                              if (wordLen > minSubwordSize) {
+                                  if (wordLen > maxSubwordSize) {
+                                      word = word.substring(0, maxSubwordSize);

Member

komu Apr 18, 2019

It is prudent to just cut words to maxSubwordSize or should we just ignore the words that exceed the length? I could be persuaded that either behavior is correct, but it is somewhat confusing that the semantics of minSubwordSize and maxSubwordSize are so different even though the names are symmetrical.

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                      // by the National Library of Finland.
+                      //
+                      // https://github.com/NatLibFi/SolrPlugins
+                      Set<String> compoundForms = new LinkedHashSet<String>();

Member

komu Apr 18, 2019

Now duplicate removal is performed at two different levels: this method performs and the caller performs it again. It's probably best to just leave it to the caller and return a List from here.

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                              }
+                          }
+                      }
+                      return compoundForms;

Member

komu Apr 18, 2019

This method is quite complex and it's hard simply look at it and be assured that it's correct. Furthermore, it's hard to write tests for this because it's tied to running Voikko.

Perhaps the code could be structured along these lines:

private List<String> expandCompounds(List<Analysis> analysisList) {
    List<String> compoundForms = new ArrayList<String>();

    for (Analysis analysis: analysisList) {
        String wordBases = analysis.get("WORDBASES");
            if (wordBases != null)
                compoundForms.addAll(expandWordBases(wordBases));
    }

    return compoundForms:
}

private List<String> expandWordBases(String wordBases) {
    return CompoundWordParser.expandWordBases(wordBases, charTermAttribute.length(), cfg.maximumSubwordSize, cfg.minimumSubwordSize);
}

This way the complex logic would be in a static method for which it would be trivial to write test cases:

assertWordBases("+köyde(köysi)+n+veto(veto)", 
    "köysi", "veto");
assertWordBases("+alkio(Alkio)+-+opisto(opisto)",
    "Alkio", "opisto");
assertWordBases("+kansa(kansa)+llis(+llinen)+eepos(eepos)",
    "kansa", "kansallis", "eepos");

I copy-pasted your code to a scratch file, hacked it a bit to make it callable from test, asked what it gives back to the examples inputs above and then wrote the expected results based on the results the code gave me. Do those look expected? I'm surprised about "kansallis" is the results. Wouldn't the correct value be "kansallinen"?

Furthermore I'm a bit confused if the logic can be right because the only reason why the input +köyde(köysi)+n+veto(veto) does not produce nonsensical output köysi, n, veto is that n is rejected for being too short. I guess that "n" should be rejected because it does not have a specified root, not because it's too short. If this is fixed perhaps, the minimum and maximum lengths are not needed at all?

All that said, here's my attempt at this function:

private static final Pattern WORDBASE = Pattern.compile("\\+([^+(]+)(\\(([^)]*)\\))?");

static List<String> expandWordBases(String wordbases) {
    Matcher m = WORDBASE.matcher(wordbases.replace("=", ""));

    String lastBody = "";
    List<String> result = new ArrayList<>();
    while (m.find()) {
        String body = m.group(1);
        String root = m.group(3);

        if (root != null) {
            if (root.startsWith("+")) {
                result.add(lastBody + root.substring(1));

            } else {
                result.add(root);
                lastBody = body;
            }
        }
    }
    return result;
}

And some tests to go with it:

@Test
void voikkoExamples() {
    assertWordBases("+köyde(köysi)+n+veto(veto)", "köysi", "veto");
    assertWordBases("+alkio(Alkio)+-+opisto(opisto)", "Alkio", "opisto");
    assertWordBases("+kansa(kansa)+llis(+llinen)+eepos(eepos)", "kansa", "kansallinen", "eepos");
}

private static void assertWordBases(String input, String... expected) {
    assertEquals(asList(expected), Foo.expandWordBases(input));
}

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                                  lastWordBody = wordBody;
+                                  lastPos = currentPos;
+                                  currentPos += baseForm.length();
+                              }

Member

komu Apr 18, 2019

I can't say that I understand the meaning of juggling all these variables, but just by looking at the code above, I can say that at least assignment to wordLen can be moved outside of the branch since it's the same on both cases:

int wordLen = word.length();

src/main/java/fi/evident/elasticsearch/voikko/analysis/VoikkoTokenFilter.java

+                              int maxSubwordSize = cfg.maximumSubwordSize;
+                              int minSubwordSize = cfg.minimumSubwordSize;
+                              if (wordLen > minSubwordSize) {

Member

komu Apr 18, 2019

If I specify minimum size to be 2, I'd expect that words of length 2 will be included, but this excludes them. So perhaps this should be >= instead of >?

Author

tituomin commented Apr 26, 2019

Thanks a lot!

I studied the code and tried to understand it and came up with some questions and comments. Some of the comments are mostly about my personal preference towards the style of the codebase, but then there are some real questions about the functionality as well.

If you can go through my comments and figure out answers as how the code really should work I can take care of any post-merge cleanups, but before merging I'd like to be sure to have an understanding. So especially the parsing questions towards the end of the review interest me.

Thank you! This is an excellent review. I realize that some work needs to be done to make the implementation mature and easier to reason about. I will try to get back with a new proposal as soon as possible.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet