Build Process
This page provides additional detail on the tasks and artefacts involved with building the iKnow engine.
The iKnow engine relies on Language Models (also known as Knowledge Bases or simply "KB") for its language-specific parsing of sentences. A KB's source is expressed as a set of CSV files, which contain no outright code but capture linguistic tokens, rules and other metadata specific to a language, plus some comments and sample sentences. These files are maintained under /kb in a human-readable (and editable) source format, usually through simple text editors like Notepad++.
When accumulated language model edits to the files in /kb represent a comprehensive update, it's time to compile them ahead of a full iKnow engine build. Compiling the language models means transforming them from CSV format into a collection of artefacts the iKnow engine can use at runtime:
- Data in lexrep.csv is compiled into a C++ state machine that ends up in .inl files in /modules/aho/inl/<language>/lexrep/, enabling high-performance matching of input text to the linguistic tokens on which iKnow bases its parsing
- Data in the other csv files gets loaded as shared memory dumps to enable efficient runtime loading
If you made any changes to the source .csv files, the iKnowLanguageCompiler project takes care of transforming them into those runtime formats. If you haven't, you can skip this step. Since the language compiler relies on common parts, you'll need to build the iKnowEngineTest program as described below before building iKnowLanguageCompiler.
This should result in a new executable:
<repo_root>\kit\x64\(Debug|Release)\bin\iKnowLanguageCompiler(.exe)
Open a command window, change directory to <repo_root>\kit\x64\(Debug|Release)\bin\, and run the program with the requested language code (e.g. iKnowLanguageCompiler.exe en for building the English language model). If no language parameter is supplied, all language models will be rebuilt. After the build process, you must rebuild the test program to pick up the new language models.
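If you prefer to script this step, the invocation can be wrapped along the following lines. This is a minimal sketch and not part of the iKnow tooling: the helper names `compiler_command` and `run_language_compiler` are hypothetical, and the sketch assumes the executable sits in the bin directory described above and takes the language code as its only argument.

```python
import subprocess
import sys
from pathlib import Path


def compiler_command(bin_dir, language=None):
    """Build the argument list for iKnowLanguageCompiler (hypothetical helper).

    bin_dir is <repo_root>/kit/x64/(Debug|Release)/bin; language is a code
    such as "en", or None to rebuild all language models.
    """
    exe_name = "iKnowLanguageCompiler"
    if sys.platform == "win32":
        exe_name += ".exe"
    command = [str(Path(bin_dir) / exe_name)]
    if language is not None:
        command.append(language)
    return command


def run_language_compiler(bin_dir, language=None):
    """Run the compiler with bin_dir as working directory; raise on failure."""
    subprocess.run(compiler_command(bin_dir, language), cwd=str(bin_dir), check=True)
```

Calling `run_language_compiler(bin_dir, "en")` would then correspond to the manual English rebuild described above, while omitting the language rebuilds all models.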
It is important to understand the input and output of this process. The input consists of a collection of CSV files, representing the language model as assembled by a qualified linguist:
- <repo_root>\language_models\(cs|de|en|es|fr|ja|nl|pt|ru|sv|uk)\
  Each language directory contains 8 (or fewer) CSV files: "acro", "filter", "labels", "lexreps", "metadata", "prepro", "regex" and "rules". See here for a detailed description. These files are the input for the language model builder.
- <repo_root>\modules\engine\language_data\
  This directory contains, per language, the binary representation of the linguistic data, in the form of a header file (kb_<language>_data.h). This is output generated by the language compiler; do not edit!
- <repo_root>\modules\aho\inl\(cs|de|en|es|fr|ja|nl|pt|ru|sv|uk)\
  This is where, per language, the AHO state machine data is written. This is also output of the language compilation process; do not edit!
The language compiler must be run from its /bin directory. It knows the input and output directories, so there is no need for any configuration. If you would like to change these, you'll have to edit the source code. After rebuilding language model data, a new build of the language module itself is needed, since this binary data is hard coded for maximum speed.
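Since a language directory may hold fewer than eight source files, it can be useful to see which ones are actually present before compiling. The following is a minimal sketch, assuming the files are named `<name>.csv` after the list above; `present_model_files` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path

# The eight source file names listed above; the ".csv" extension is an assumption.
MODEL_FILES = ("acro", "filter", "labels", "lexreps",
               "metadata", "prepro", "regex", "rules")


def present_model_files(language_dir):
    """Return the names from MODEL_FILES that exist as <name>.csv in language_dir."""
    directory = Path(language_dir)
    return [name for name in MODEL_FILES if (directory / (name + ".csv")).is_file()]
```

For example, `present_model_files(r"<repo_root>\language_models\en")` would list the source files available for English.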
Once the Language Models are compiled into .inl files in /modules/aho/inl/, a regular C++ build (through Visual Studio .sln or a Makefile) turns the full codebase into the required .dll or .so files. Note that this code repository contains pre-compiled versions of the language models, so the above step is only required if you made changes to the .csv files.
See here for an introduction on the important sections of the iKnow source code. ICU (header files and libraries) is the only dependency for this project.
- Download the Win64 binaries for a recent release of the ICU library (e.g. version 65.1) and unzip to <repo_root>/thirdparty/icu (or a local folder of your choice).
- If you chose a different folder for your ICU libraries, update <repo_root>\modules\Dependencies.props to represent your local configuration. This is how it looks after download, which should be OK if you used the suggested directory paths:
<PropertyGroup Label="UserMacros">
<ICUDIR>$(SolutionDir)..\thirdparty\icu\</ICUDIR>
<ICU_INCLUDE>$(ICUDIR)\include</ICU_INCLUDE>
<ICU_LIB>$(ICUDIR)\lib64</ICU_LIB>
</PropertyGroup>
- Open the Solution file <repo_root>\modules\iKnowEngine.sln in Visual Studio. We used Visual Studio Community 2019.
- In the Solution Explorer, right-click "iKnowEngineTest" and choose "Set as StartUp Project".
- In Solution Configurations, choose either "Debug|x64" or "Release|x64", depending on the kind of executable you prefer.
- Build the solution; this builds all 29 projects.
Once building has succeeded, you can run the test program, depending on which build config you chose:
<repo_root>\kit\x64\Debug\bin\iKnowEngineTest.exe
<repo_root>\kit\x64\Release\bin\iKnowEngineTest.exe
Add the $(ICUDIR)/bin64 directory to your PATH or copy its .dll files to this test folder in order to run the test executable.
Alternatively, you can also start a debugging session in Visual Studio and walk through the code to inspect it.
The iKnow indexing demo program will index one sentence for each of the 11 languages, and write out the sentence boundaries. That's of course not very spectacular by itself, but future iterations of this demo program will expose more of the entity and context information iKnow detects.
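The PATH requirement for the ICU .dll files can also be met for a single launch, without changing your system settings. The sketch below assumes the ICU layout described above (a bin64 folder under the ICU directory); `env_with_icu` and `run_engine_test` are hypothetical helper names.

```python
import os
import subprocess


def env_with_icu(icu_dir, base_env=None):
    """Return a copy of the environment with <icu_dir>/bin64 first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    icu_bin = os.path.join(icu_dir, "bin64")
    existing = env.get("PATH", "")
    env["PATH"] = icu_bin + os.pathsep + existing if existing else icu_bin
    return env


def run_engine_test(test_exe, icu_dir):
    """Launch the test executable with the ICU .dll directory on PATH."""
    subprocess.run([test_exe], env=env_with_icu(icu_dir), check=True)
```

The modified environment only applies to the launched process, so your own shell's PATH stays untouched.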
- Download the proper binaries for a recent release of the ICU library (e.g. version 65.1) and untar to <repo_root>/thirdparty/icu (or a local folder of your choice).
- Save the path you untarred the archive to in an ICUDIR environment variable. Note that your ICU download may have a relative path inside the tar archive, so you may need to use --strip-components=4 or manually reorganise to make sure that ${ICUDIR}/include leads where you'd expect it to lead.
- Set the IKNOWPLAT environment variable to the target platform of your choice: e.g. "lnxubuntux64", "lnxrhx64" or "macx64".
- In the <repo_root> folder, run make all
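Before running make, a quick sanity check of the two environment variables can save a failed build. This is a minimal sketch under one assumption not stated above: that the ICU public headers live in an include/unicode folder, as in standard ICU distributions. `build_env_problems` is a hypothetical helper name.

```python
import os


def build_env_problems(environ):
    """Return a list of problems with ICUDIR and IKNOWPLAT (empty if none found)."""
    problems = []
    icudir = environ.get("ICUDIR", "")
    if not icudir:
        problems.append("ICUDIR is not set")
    elif not os.path.isdir(os.path.join(icudir, "include", "unicode")):
        # Typical symptom of extra leading directories left over from the tar archive.
        problems.append("no include/unicode directory under ICUDIR")
    if not environ.get("IKNOWPLAT"):
        problems.append("IKNOWPLAT is not set")
    return problems
```

Calling `build_env_problems(os.environ)` in the shell you intend to build from returns an empty list when both variables look usable.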
While primarily useful for build-testing convenience, we're also providing a Dockerfile that stuffs the code in a clean container with the required ICU libraries. If your Linux / Unix build doesn't seem to work, perhaps a quick look at this Dockerfile will help nail down where trouble starts.
- Optionally open the Dockerfile to change the ICU library version to use.
- Use the docker build command to package things up:
docker build --tag iknow .
This will automatically download the ICU library of your choice and register its path for onward building.
- Start and step into the container using docker run:
docker run --rm -it iknow
The --rm flag will make sure the container gets dropped after you're done exploring.
- Inside the container, use make all to kick off the build.
cd /usr/src/iknow
make all
- make test will build and run the test program ("iknowenginetest"). You will find the test program in /usr/src/iknow/kit/lnxubuntux64/release/bin
cd /usr/src/iknow
make test
The iknowpy Python module brings the iKnow engine capabilities to Python ≥3.5 on Windows, Mac OS, and Linux. We use Travis CI to build iknowpy automatically and deploy it to PyPI (the Python Package Index). To trigger a new build and deployment, simply increment the version in <repo_root>/modules/iknowpy/iknowpy/version.py in the master branch and push the associated commit.
To create a development release of iknowpy for testing purposes, you can instead edit the number in version.py so that it ends in .devN, where N is the development build number. Examples are 0.0.12.dev0 and 1.2.3.dev456. When you push a change to version.py that contains a development build number, iknowpy is automatically built and deployed to TestPyPI, an index separate from PyPI. Once the deployment of the development build is complete, you can install it with pip.
pip install --index-url https://test.pypi.org/simple/ -U iknowpy
Note: If you are pushing multiple commits at once, the edit to version.py must occur in the final commit to trigger automatic deployment.
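To check whether a version string would count as a development build under the .devN convention described above, a small pattern match is enough. This is an illustrative sketch only and not necessarily the check the deployment pipeline performs; `is_development_version` is a hypothetical helper.

```python
import re

# Dotted numbers followed by .devN, e.g. 0.0.12.dev0 or 1.2.3.dev456.
DEV_VERSION = re.compile(r"\d+(?:\.\d+)*\.dev\d+")


def is_development_version(version):
    """Return True if version ends in .devN, where N is a build number."""
    return DEV_VERSION.fullmatch(version) is not None
```

With this pattern, `0.0.12.dev0` and `1.2.3.dev456` qualify as development versions, while a plain `1.2.3` does not.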
If you want to build the Python interface locally, read on. The following directions refer to the commands pip and python. On some platforms, these commands use Python 2 by default, in which case you should execute pip3 and python3 instead to ensure that you are using Python 3.
- Build the iKnow engine following the above directions. If you are on Windows, choose the "Release|x64" configuration.
- Install Python ≥3.5. Ensure that the installation includes Python header files.
- Install Cython, setuptools, and wheel. You can do this by having a Python distribution that already includes these modules or by running pip install -U cython setuptools wheel
- If you are on Mac OS, ensure that the otool and install_name_tool command-line tools are present. They should be available if you have Xcode installed with command-line developer tools. If you are on Linux, ensure that the patchelf tool is present and has version ≥0.9. You can install it using the package manager on your machine, or you can build it from source at https://github.com/NixOS/patchelf/releases.
Open a command shell in the directory <repo_root>/modules/iknowpy and execute the setup script. This builds iknowpy, creates a package containing iknowpy and its dependencies, and installs the package.
python setup.py install
The scripts at <repo_root>/modules/iknowpy/tests/ provide examples of how to use iknowpy. Run the scripts to call a few iKnow functions from Python and print their results. See Getting Started for more on the various entry points.
If you test iknowpy via the Python interactive console, do not do so in the <repo_root>/modules/iknowpy working directory. Because of how Python resolves module names, importing iknowpy there will cause Python to try importing the source package <repo_root>/modules/iknowpy/iknowpy instead of the installed package, resulting in an import error.
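If you are unsure which copy of a module Python would pick up, you can ask the import system without actually importing it. The sketch below uses the standard library's `importlib.util.find_spec`; the wrapper name `module_origin` is hypothetical.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

In an interactive session started inside <repo_root>/modules/iknowpy, `module_origin("iknowpy")` would be expected to point into the source tree, whereas from another directory it should point at the installed package.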
A wheel is a pre-built package that can be distributed to others and can be installed using pip. A single wheel is specific to the build platform and the minor version of Python (e.g. 3.7 or 3.8) used to build the wheel. Thus, a wheel must be built for every platform and minor Python version for which a simple installation using pip is desired.
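To see which Python version and platform a wheel built on your machine would be tied to, you can inspect the interpreter itself. This is a rough sketch: the tags that the wheel tooling actually writes into the file name can differ in detail (for instance for manylinux builds), and both helper names are hypothetical.

```python
import sys
import sysconfig


def python_tag(version_info=None):
    """Return the CPython tag for a (major, minor, ...) version, e.g. cp37."""
    major, minor = (version_info or sys.version_info)[:2]
    return "cp{}{}".format(major, minor)


def platform_tag():
    """Return the build platform with '-' and '.' replaced by '_' (approximate wheel tag)."""
    return sysconfig.get_platform().replace("-", "_").replace(".", "_")
```

A wheel built with Python 3.7 thus carries a `cp37` tag and is not picked up by pip running under Python 3.8, which is why one build per minor version is needed.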
- Open a command shell in the directory <repo_root>/modules/iknowpy.
- Build the wheel.
python setup.py bdist_wheel
Decide the minimum version of Mac OS that the wheel will support. Ensure that the iKnow engine, ICU, and Python were built with support for this version. Python distributions from https://www.python.org/downloads/mac-osx are the best for this situation, as they tend to be the distributions that are maximally compatible with different Mac OS versions. The following directions assume a minimum target version of Mac OS X 10.9, but you can adapt them to suit your preferences.
- Open a command shell in the directory <repo_root>/modules/iknowpy.
- Build the wheel. If you do not specify MACOSX_DEPLOYMENT_TARGET or --plat-name, then the minimum supported Mac OS version defaults to that of the Python distribution used to build the wheel.
export MACOSX_DEPLOYMENT_TARGET=10.9
python setup.py bdist_wheel --plat-name=macosx-10.9-x86_64
In general, binaries built on one Linux distribution are not directly runnable on another Linux distribution. If all you want is a wheel that is compatible with the build platform and its binary compatible platforms, simply execute the following in the directory <repo_root>/modules/iknowpy.
python setup.py bdist_wheel
To build a wheel that is compatible with the vast majority of modern Linux distributions, you can use the manylinux project, which is the standard way to distribute pre-built Python extension modules on Linux. If you want to build a manylinux wheel locally, look at <repo_root>/travis/build_manylinux.sh to see how our continuous deployment pipeline does it. In short, start up a manylinux Docker container for the desired CPU architecture. Inside the container, build ICU, the iKnow engine, and the Python interface.