Build Process

adang1345 edited this page Jul 14, 2020 · 24 revisions

This page provides additional detail on the tasks and artefacts involved with building the iKnow engine.

The Language Models

Context

The iKnow engine relies on Language Models (also known as Knowledge Bases or simply "KB") for its language-specific parsing of sentences. A KB's source is expressed as a set of CSV files, which do not contain outright code but capture linguistic tokens, rules and other metadata specific to a language, plus some comments and sample sentences. These files are maintained under /kb in a human-readable (and editable) source format, usually through simple text editors like Notepad++.

When accumulated language model edits to the files in /kb represent a comprehensive update, it's time to compile them ahead of a full iKnow engine build. Compiling the language models means transforming them from CSV format into a collection of artefacts the iKnow engine can use at runtime:

  • Data in lexrep.csv is compiled into a C++ state machine that ends up in .inl files in /modules/aho/inl/<language>/lexrep/, enabling high-performance matching of input text to the linguistic tokens on which iKnow bases its parsing
  • Data in the other csv files gets loaded as shared memory dumps to enable efficient runtime loading

Compilation steps

If you changed any of the source .csv files, the iKnowLanguageCompiler project takes care of transforming them into those runtime formats; if you did not, you can skip this step. Because the language compiler relies on common parts, you'll need to build the iKnowEngineTest program as described below before building iKnowLanguageCompiler. Building it should produce a new executable:

<repo_root>\kit\x64\(Debug|Release)\bin\iKnowLanguageCompiler(.exe)

Open a command window, change directory to <repo_root>\kit\x64\(Debug|Release)\bin\, and run the program with the requested language code (e.g. iKnowLanguageCompiler.exe en to build the English language model). If no language parameter is supplied, all language models are rebuilt. After the language models are compiled, you must rebuild the test program to pick up the new language models.
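When only a few language models changed, the compiler can be invoked once per language. The following is a minimal sketch for a Linux or Mac OS shell, assuming the executable is named iKnowLanguageCompiler (without .exe) on those platforms; compile_languages is a hypothetical helper, not part of the repository.

```shell
#!/bin/sh
# Sketch: run the language compiler once per language code.
# compile_languages is a hypothetical helper; the executable name without
# ".exe" is an assumption for non-Windows builds.
compile_languages() {
    bin_dir="$1"
    shift
    (
        # The compiler must be run from its bin directory.
        cd "$bin_dir" || exit 1
        for lang in "$@"; do
            # Stop at the first language that fails to compile.
            ./iKnowLanguageCompiler "$lang" || exit 1
        done
    )
}
```

For example, compile_languages /path/to/bin en nl would compile the English and Dutch models in turn; the bin path here is a placeholder for your own build output directory.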

Inputs and outputs

It is important to understand the inputs and outputs of this process. The input is a collection of CSV files representing the language model as assembled by a qualified linguist; the outputs are generated files that must not be edited by hand. The directories involved are:

  • <repo_root>\language_models\(cs|de|en|es|fr|ja|nl|pt|ru|sv|uk)\

    Each language directory contains 8 (or fewer) CSV files: "acro", "filter", "labels", "lexreps", "metadata", "prepro", "regex" and "rules". See here for a detailed description. These files are the input for the language model builder.

  • <repo_root>\modules\engine\language_data\

    This directory contains, per language, the binary representation of the linguistic data in the form of a header file (kb_<language>_data.h). This is output generated by the language compiler; do not edit it!

  • <repo_root>\modules\aho\inl\(cs|de|en|es|fr|ja|nl|pt|ru|sv|uk)\

    This is where the AHO state machine data is written, per language. This is also output of the language compilation process; do not edit it!

The language compiler must be run from its /bin directory and knows the input and output directories, so no configuration is needed. If you would like to change these directories, you'll have to edit the source code. After rebuilding language model data, a new build of the language module itself is needed, since this binary data is hard-coded for maximum speed.

The C++ Source Code

Once the Language Models are compiled into .inl files in /modules/aho/inl/, a regular C++ build (through Visual Studio .sln or a Makefile) turns the full codebase into the required .dll or .so files. Note that this code repository contains pre-compiled versions of the language models, so the above step is only required if you made changes to the .csv files.

See here for an introduction on the important sections of the iKnow source code. ICU (header files and libraries) is the only dependency for this project.

Building on Windows

Step 1: Setting up project dependencies

  1. Download the Win64 binaries for a recent release of the ICU library (e.g. version 65.1) and unzip to <repo_root>/thirdparty/icu (or a local folder of your choice).

  2. If you chose a different folder for your ICU libraries, update <repo_root>\modules\Dependencies.props to represent your local configuration. This is how it looks after download, which should be OK if you used the suggested directory paths:

  <PropertyGroup Label="UserMacros">
    <ICUDIR>$(SolutionDir)..\thirdparty\icu\</ICUDIR>
    <ICU_INCLUDE>$(ICUDIR)\include</ICU_INCLUDE>
    <ICU_LIB>$(ICUDIR)\lib64</ICU_LIB>
  </PropertyGroup>

Step 2: Building the modules

  1. Open the Solution file <repo_root>\modules\iKnowEngine.sln in Visual Studio. We used Visual Studio Community 2019.

  2. In the Solution Explorer, right-click "iKnowEngineTest" and choose "Set as Startup Project".

  3. In Solution Configurations, choose either "Debug|x64" or "Release|x64", depending on the kind of executable you prefer.

  4. Build the solution. This builds all 29 projects.

Step 3: Testing the indexer

Once building has succeeded, you can run the test program, depending on which build config you chose:

  • <repo_root>\kit\x64\Debug\bin\iKnowEngineTest.exe
  • <repo_root>\kit\x64\Release\bin\iKnowEngineTest.exe

⚠️ Note that you'll have to add the $(ICUDIR)/bin64 directory to your PATH or copy its .dll files to this test folder in order to run the test executable.

Alternatively, you can also start a debugging session in Visual Studio and walk through the code to inspect it.

The iKnow indexing demo program will index one sentence for each of the 11 languages, and write out the sentence boundaries. That's of course not very spectacular by itself, but future iterations of this demo program will expose more of the entity and context information iKnow detects.

On Linux / Unix

Step 1: Setting up project dependencies

  1. Download the proper binaries for a recent release of the ICU library (e.g. version 65.1) and untar to <repo_root>/thirdparty/icu (or a local folder of your choice).

  2. Save the path to which you untarred the archive in an ICUDIR environment variable. Note that your ICU download may have a relative path inside the tar archive, so you may need to use --strip-components=4 or manually reorganise to make sure that ${ICUDIR}/include leads where you'd expect it to lead.
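A quick way to confirm that the archive was unpacked at the right depth is to look for a well-known ICU header under ${ICUDIR}/include. The following is a minimal sketch; check_icudir is a hypothetical helper, and it only assumes that the standard ICU header unicode/utypes.h is present in the download.

```shell
#!/bin/sh
# Sketch: verify that a candidate ICUDIR exposes the ICU headers.
# check_icudir is a hypothetical helper, not part of the iKnow repository.
check_icudir() {
    dir="$1"
    # unicode/utypes.h is a core ICU header; its absence suggests the
    # archive was extracted with the wrong --strip-components value.
    if [ -f "$dir/include/unicode/utypes.h" ]; then
        echo "ok: ICU headers found under $dir/include"
        return 0
    fi
    echo "error: $dir/include/unicode/utypes.h not found" >&2
    return 1
}
```

After exporting ICUDIR, running check_icudir "$ICUDIR" should report success before you move on to the build.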

Step 2: Building the modules

  1. Set the IKNOWPLAT environment variable to the target platform of your choice: e.g. "lnxubuntux64", "lnxrhx64" or "macx64"

  2. In the <repo_root> folder, run

    make all
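Because the build depends on both ICUDIR and IKNOWPLAT, it can help to fail early when either is missing. The following is a minimal sketch; require_env and build_iknow are hypothetical helper names and are not part of the iKnow Makefile.

```shell
#!/bin/sh
# Sketch: fail early when a variable the build expects is missing.
# require_env and build_iknow are hypothetical helpers.
require_env() {
    name="$1"
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "error: $name is not set" >&2
        return 1
    fi
    echo "$name=$value"
}

# Run the documented build only when both variables are present.
build_iknow() {
    require_env ICUDIR && require_env IKNOWPLAT && make all
}
```

Calling build_iknow from <repo_root> then behaves like make all, except that it stops with a message if a variable was forgotten.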

Building in Docker

While primarily useful for build-testing convenience, we're also providing a Dockerfile that stuffs the code in a clean container with the required ICU libraries. If your Linux / Unix build doesn't seem to work, perhaps a quick look at this Dockerfile will help nail down where trouble starts.

Step 1: Building the container

  1. Optionally open the Dockerfile to change the ICU library version to use

  2. Use the docker build command to package things up:

docker build --tag iknow .

This will automatically download the ICU library of your choice and register its path for onward building.

Step 2: Building the iKnow modules

  1. Start and step into the container using docker run:

    docker run --rm -it iknow

    The --rm flag will make sure the container gets dropped after you're done exploring.

  2. Inside the container, use make all to kick off the build:

    cd /usr/src/iknow
    make all

Step 3: Testing iKnow

  1. make test will build and run the test program ("iknowenginetest"). You will find the test program in /usr/src/iknow/kit/lnxubuntux64/release/bin.

    cd /usr/src/iknow
    make test

The Python Interface

The iknowpy Python module brings the iKnow engine capabilities to Python ≥3.5 on Windows, Mac OS, and Linux. We use Travis CI to build iknowpy automatically and deploy it to PyPI (the Python Package Index). To trigger a new build and deployment, simply increment the version in <repo_root>/modules/iknowpy/iknowpy/version.py in the master branch and push the associated commit.

To create a development release of iknowpy for testing purposes, you can instead edit the number in version.py so that it ends in .devN, where N is the development build number. Examples are 0.0.12.dev0 and 1.2.3.dev456. When you push a change to version.py that contains a development build number, iknowpy is automatically built and deployed to TestPyPI, an index separate from PyPI. Once the deployment of the development build is complete, you can install it with pip.

pip install --index-url https://test.pypi.org/simple/ -U iknowpy

Note: If you are pushing multiple commits at once, the edit to version.py must occur in the final commit to trigger automatic deployment.
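The rule described above (a version ending in .devN goes to TestPyPI, anything else to PyPI) can be illustrated with a small check. This is a minimal sketch of that rule only, not the logic of the actual deployment pipeline; deploy_target is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: decide which index a version string would be deployed to,
# following the ".devN" convention described in the text.
# deploy_target is a hypothetical helper, not the real CI script.
deploy_target() {
    # A development release ends in ".dev" followed by one or more digits.
    if printf '%s\n' "$1" | grep -Eq '\.dev[0-9]+$'; then
        echo "TestPyPI"
    else
        echo "PyPI"
    fi
}
```

With this sketch, deploy_target 0.0.12.dev0 and deploy_target 1.2.3.dev456 both report TestPyPI, while deploy_target 1.2.3 reports PyPI.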

If you want to build the Python interface locally, read on. The following directions refer to the commands pip and python. On some platforms, these commands use Python 2 by default, in which case you should execute pip3 and python3 instead to ensure that you are using Python 3.

Step 1: Build the iKnow engine

Build the iKnow engine following the above directions. If you are on Windows, choose the "Release|x64" configuration.

Step 2: Setting up Python dependencies

  1. Install Python ≥3.5. Ensure that the installation includes Python header files.

  2. Install Cython, setuptools, and wheel. You can do this by having a Python distribution that already includes these modules or by running

    pip install -U cython setuptools wheel

  3. If you are on Mac OS, ensure that the otool and install_name_tool command-line tools are present. They should be available if you have Xcode installed with command-line developer tools. If you are on Linux, ensure that the patchelf tool is present and has version ≥0.9. You can install it using the package manager on your machine, or you can build it from source at https://github.com/NixOS/patchelf/releases.

Step 3: Building and installing the iknowpy module

Open a command shell in the directory <repo_root>/modules/iknowpy and execute the setup script. This builds iknowpy, creates a package containing iknowpy and its dependencies, and installs the package.

python setup.py install

Step 4: Testing the iknowpy module

The scripts at <repo_root>/modules/iknowpy/tests/ provide examples of how to use iknowpy. Run the scripts to call a few iKnow functions from Python and print their results. See Getting Started for more on the various entry points.

⚠️ If you are testing iknowpy via the Python interactive console, do not do so in the <repo_root>/modules/iknowpy working directory. Because of how Python resolves module names, importing iknowpy will cause Python to try importing the source package <repo_root>/modules/iknowpy/iknowpy instead of the installed package, resulting in an import error.
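To see which copy of a module Python would pick up from the current directory, you can ask the import system directly. The following is a minimal sketch; module_origin is a hypothetical helper name, and it assumes a python3 executable on the PATH (set the PYTHON variable to use a different interpreter).

```shell
#!/bin/sh
# Sketch: print the file Python would load for a module, resolved from the
# current working directory. module_origin is a hypothetical helper.
module_origin() {
    "${PYTHON:-python3}" -c '
import importlib.util, sys
# find_spec locates the module without importing it.
spec = importlib.util.find_spec(sys.argv[1])
print(spec.origin if spec else "not found")
' "$1"
}
```

Run from <repo_root>/modules/iknowpy, module_origin iknowpy would be expected to print a path inside the source tree; from any other directory it should point at the installed package.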

Step 5: (Optional) Building a Wheel

A wheel is a pre-built package that can be distributed to others and installed using pip. A single wheel is specific to the build platform and the minor version of Python (e.g. 3.7 or 3.8) used to build the wheel. Thus, a wheel must be built for every platform and minor Python version for which a simple installation using pip is desired.
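The platform and Python version a wheel targets are encoded in its file name, whose last three dash-separated fields are the Python tag, the ABI tag and the platform tag. The following is a minimal sketch that extracts them; wheel_tags is a hypothetical helper, and the file name used in the example is illustrative rather than an actual release artefact.

```shell
#!/bin/sh
# Sketch: extract the Python, ABI and platform tags from a wheel file name.
# wheel_tags is a hypothetical helper; it relies only on the standard wheel
# naming scheme, in which the last three dash-separated fields are the tags.
wheel_tags() {
    name="${1%.whl}"            # drop the extension
    platform="${name##*-}"      # last field: platform tag
    name="${name%-*}"
    abi="${name##*-}"           # next field: ABI tag
    name="${name%-*}"
    python="${name##*-}"        # next field: Python tag
    echo "python=$python abi=$abi platform=$platform"
}

wheel_tags "iknowpy-0.0.12-cp38-cp38-win_amd64.whl"
# prints: python=cp38 abi=cp38 platform=win_amd64
```

A wheel with these tags would only install on 64-bit Windows with CPython 3.8, which is why one wheel per platform and minor Python version is needed.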

On Windows

  1. Open a command shell in the directory <repo_root>/modules/iknowpy.

  2. Build the wheel.

    python setup.py bdist_wheel

On Mac OS

Decide the minimum version of Mac OS that the wheel will support. Ensure that the iKnow engine, ICU, and Python were built with support for this version. Python distributions from https://www.python.org/downloads/mac-osx are the best for this situation, as they tend to be the distributions that are maximally compatible with different Mac OS versions. The following directions assume a minimum target version of Mac OS X 10.9, but you can adapt them to suit your preferences.

  1. Open a command shell in the directory <repo_root>/modules/iknowpy.

  2. Build the wheel. If you do not specify MACOSX_DEPLOYMENT_TARGET or --plat-name, then the minimum supported Mac OS version defaults to that of the Python distribution used to build the wheel.

    export MACOSX_DEPLOYMENT_TARGET=10.9
    python setup.py bdist_wheel --plat-name=macosx-10.9-x86_64

On Linux

In general, binaries built on one Linux distribution are not directly runnable on another Linux distribution. If all you want is a wheel that is compatible with the build platform and its binary compatible platforms, simply execute the following in the directory <repo_root>/modules/iknowpy.

python setup.py bdist_wheel

To build a wheel that is compatible with the vast majority of modern Linux distributions, you can use the manylinux project, which is the standard way to distribute pre-built Python extension modules on Linux. If you want to build a manylinux wheel locally, see <repo_root>/travis/build_manylinux.sh for how our continuous deployment pipeline does it. In short, start up a manylinux Docker container for the desired CPU architecture. Inside the container, build ICU, the iKnow engine, and the Python interface.