---
description: This section outlines technical questions about the workings of the Aleph system.
---
With Aleph, we try to support all file formats commonly found in leaked evidence used in investigative reporting. Unlike other systems, Aleph does not use Apache Tika for content extraction. This allows us to extract structured information more precisely and to generate detailed online previews for a variety of formats. Supported formats include:
- Basic data formats like plain text, HTML and XML.
- Office formats including Word, PowerPoint, LibreOffice Text, LibreOffice Impress, WordPerfect, RTF, PDF, ClarisWorks, EPUB, DjVu, Lotus WordPro, StarOffice, AbiWord, PageMaker, MacWrite, etc.
- Tabular formats like Excel, Excel 2007, OpenDocument Spreadsheet, DBF, Comma-Separated Values, SQLite, Access.
- E-Mail formats including plain MIME email (RFC822), Outlook MSG, Outlook PST, Outlook Mac Backups (OLM), MBOX, VCard.
- Archive/package formats like ZIP, RAR, Tar, 7Zip, Gzip, BZip2.
- Media formats including JPEG, PNG, GIF, TIFF, SVG, and metadata from common video and audio files.
Aleph attempts to extract written text from any image submitted to the engine. This includes images embedded in PDFs or other office documents, such as scanned pages. When performing OCR, Aleph supports two backends: Tesseract 4 and the Google Vision API.
The output generated by Google Vision API is much higher quality than that generated by Tesseract, but requires submitting the source images to a remote service, while also incurring potentially significant costs.
Tesseract, on the other hand, benefits heavily from knowing the language of the documents from which it is attempting to extract content. If you are seeing extremely weak recognition results, make sure that the collection containing the documents has a collection language set.
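To illustrate why the language hint matters, here is a minimal sketch using the `pytesseract` wrapper around Tesseract. This is not part of Aleph itself; the image path and the `deu` language code are placeholders.

```python
# Minimal sketch: OCR a scanned page with and without a language hint.
# Assumes Tesseract and the relevant language packs are installed;
# "page.png" and the "deu" language code are placeholders.
from PIL import Image
import pytesseract

image = Image.open("page.png")

# Without a hint, Tesseract falls back to its English model.
default_text = pytesseract.image_to_string(image)

# With the correct language (here: German), recognition quality
# improves considerably for non-English documents.
hinted_text = pytesseract.image_to_string(image, lang="deu")

print(default_text[:200])
print(hinted_text[:200])
```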
Aleph performs named entity recognition (NER) immediately before indexing data to ElasticSearch. The terminology here can be confusing: although called "entity extraction", the process actually extracts names from entities (e.g. a PDF or an E-Mail).
Currently, text processing begins with language classification using fastText and then feeds into spaCy for NER. While names of people and companies are tagged directly, locations are checked against the GeoNames database; the results are used to tag individual documents with countries. Additionally, a number of regular expressions are used to perform rule-based extraction of phone numbers, email addresses, IBANs and IP addresses.
Once extracted, these tags are added as properties to the FollowTheMoney entity of the Document that they have been extracted from. They can be found in the following fields: `detectedLanguage`, `namesMentioned`, `country`, `ipMentioned`, `emailMentioned`, `phoneMentioned` and `ibanMentioned`.
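For a rough idea of what this stage produces, here is a minimal, self-contained sketch that combines spaCy-based name tagging with regex-based pattern extraction. It is not Aleph's actual pipeline: the model name, the sample text and the (deliberately simplified) patterns are illustrative assumptions.

```python
# Minimal sketch of NER plus rule-based pattern extraction, not Aleph's
# actual pipeline. The spaCy model and regex patterns are simplified examples;
# install the model first, e.g. with: python -m spacy download xx_ent_wiki_sm
import re
import spacy

# A small multilingual NER model; Aleph selects models per detected language.
nlp = spacy.load("xx_ent_wiki_sm")

text = "Contact Jane Doe at [email protected] or +44 20 7946 0000."

doc = nlp(text)
names = [ent.text for ent in doc.ents if ent.label_ in ("PER", "ORG")]

# Deliberately simplified patterns; the real ones are more robust.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
phones = re.findall(r"\+?\d[\d\s()-]{7,}\d", text)

print({"namesMentioned": names, "emailMentioned": emails, "phoneMentioned": phones})
```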
We're extremely happy to consider pull requests that add further types of linguistic and pattern-based extraction.
There are two aspects to adding support for a new language to Aleph: translating the user interface, and adapting the processing pipeline.
To add a new language for the Aleph user interface ("localisation"), register a user account on transifex.com and apply to become a member of the Aleph organisation. Start a new translation and translate all strings in the various Aleph components (`followthemoney`, `aleph-ui`, `aleph-api` and `react-ftm`).
If you wish to try out the translation on a local developer install of Aleph, please make sure you have the Transifex command-line client installed and configured. Then run the following sequence of commands:
```
make translate
cd ui/
npm run translate
cd ..
make translate
```
In terms of adapting the processing pipeline, go through the following items:
- Check that the hard-coded language list in FollowTheMoney includes the three-letter code for your language (module `followthemoney.types.language`). If the language you are adding has multiple language codes, you may want to add a synonym mapping to the `languagecodes` Python library.
- Check that the `ingest-file` service in its `Dockerfile` installs a Tesseract model for your language, if one is available in Ubuntu.
- Check if a spaCy model is available for named entity extraction and add it to the `Dockerfile` in `ingest-file`. Also make sure to adapt the `INGESTORS_NER_MODELS` environment variable in that file (see the sketch after this list).
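As a quick sanity check for the spaCy step, something like the following confirms that a candidate model loads and produces entities. The model name and sample sentence are placeholders, and this is not part of Aleph's own tooling.

```python
# Quick check that a candidate spaCy model loads and tags entities.
# "de_core_news_sm" and the sample sentence are placeholders; install the
# model first, e.g. with: python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Angela Merkel besuchte die Deutsche Bank in Frankfurt.")

for ent in doc.ents:
    print(ent.text, ent.label_)
```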
Can Britain leave the European Union? Yes, it's possible, but it's complicated and will probably not make your life better in the way that you're expecting.
Aleph's document ingest service requires a large number of command-line utilities and libraries to be installed within certain version ranges in order to operate correctly. While we'd love to be able to ship e.g. a Debian package in the long term, the work required for this is significant.
Here's a guide for running Aleph without Docker on Debian with systemd.
Aleph does not perform updates and database migrations automatically. Once you have the latest version, you can run the command below to upgrade the existing installation (i.e. apply changes to the database model or the search index format).
Before you upgrade, check the release notes to make sure you understand the latest release and know about new options and features that have been added.
The procedures for upgrading are different between production and development mode:
In development mode, make sure you've pulled the latest version from GitHub. We recommend you check out `develop` if you want to contribute code. Then, run:
```
make build
make upgrade
```
In production mode, make sure you perform a backup of the main database and the ElasticSearch index before running an upgrade.
Then, make sure you are using the latest `docker-compose.yml` file. You can do this by checking out the source repo, but really you just need that one file (and your config in `aleph.env`). Next, run:
```
docker-compose pull --parallel
# Terminate the existing install (enter downtime!):
docker-compose down
docker-compose up -d redis postgres elasticsearch
# Wait a minute or so while services boot up...
# Run upgrade:
docker-compose run --rm shell aleph upgrade
# Restart prod system:
docker-compose up -d
```
If running `aleph` commands gives you warnings about missing tables, you probably need to migrate your database to the latest schema. Try:

```
make upgrade
```
This often means you're not running an Aleph `worker` process, the component responsible for indexing documents, generating caches, cross-referencing and sending email alerts. When you operate in development mode (using the `make` commands), no worker is running by default.
To fix this issue in development mode, just run a worker:
```
make worker
```
If you're encountering this issue in production mode, try to check the worker log files to understand the issue.
The included `docker-compose` configuration for production mode has no understanding of how powerful your server is. It will run just a single instance of each of the services involved in data imports: `worker`, `ingest-file` and `convert-document`.
The easiest way to speed up processing is to scale up those services. Make a shell script to start docker-compose with a set of arguments like this:
```
docker-compose up --scale ingest-file=8 --scale convert-document=4 --scale worker=2
```
The number of `ingest-file` processes could be the number of CPUs in your machine, and `convert-document` needs to be scaled up for imports with many office documents, but never higher than `ingest-file`.
Most problems arise when the ElasticSearch container doesn't start up properly or in time. If `upgrade` fails with errors like `NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb11b6ab0d0>: Failed to establish a new connection: [Errno 111] Connection refused`, this is what happened.
You can find out specifically what went wrong with ES by consulting the logs for that container:
```
docker-compose -f docker-compose.dev.yml logs elasticsearch
```
You will almost certainly need to run the following before you build:
```
sysctl -w vm.max_map_count=262144
```
Or, to set this permanently, add the following to `/etc/sysctl.conf`:

```
vm.max_map_count=262144
```
If the error in your ES container contains:
```
elasticsearch_1 | [1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
```
Please see the relevant ElasticSearch documentation for this issue.
When the host machine disk is over 90% full, ElasticSearch can decide to stop writes to the index as an emergency measure. You would see errors like this:
```
AuthorizationException(403, 'cluster_block_exception', 'index [aleph-collection-v1] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')
```
To fix this, try the following:
- Run `docker system prune` on the host machine.
- Inside a `make shell`, run this curl command:

  ```
  curl -XPUT -H "Content-Type: application/json" http://elasticsearch:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
  ```
When you're running in development mode, run:
```
make stop
```
In production mode, the equivalent command is:
```
docker-compose down --remove-orphans
```
If all else fails, you may just need to wait a little longer for the ES service to initialize before you run upgrade.
- Shut down all Aleph components:

  ```
  make stop
  ```

- Re-build the development docker containers:

  ```
  make build
  ```

- Apply the latest data migrations:

  ```
  make upgrade
  ```

- If that succeeds, in a new terminal run `make web` to launch the UI and API, and `make worker` to start a worker service.
Talk to the community
If that does not help, come visit the Aleph Slack and talk to the community to get support.
```
# Delete Aleph's cached authorization and metadata keys from Redis:
redis-cli --scan --pattern aleph:authz:* | xargs redis-cli del
redis-cli --scan --pattern aleph:metadata:* | xargs redis-cli del
```
The "About" section in the Navbar is based on a micro-CMS that is a bit like Jekyll. You can see the templates in the Aleph GitHub repository at aleph/pages
. Pages can be customised by setting an environment variable, ALEPH_PAGES_PATH
to point to a directory with content pages.
All pages with the `menu: true` header set will be added to the Navbar; others will just be shown in the sidebar menu inside the "About" section.
How to add these pages to the running Aleph container is more of a Docker problem, so you might want to look into how to build a derived image for the `api` service, or just mount a path from the server as a volume inside the `api` container.
The options for managing users and groups in Aleph are very limited. This is because many installations delegate those tasks to a separate OAuth single sign-on service, such as Keycloak (an example configuration exists in `contrib/keycloak`).
That's why adding features like password resets, an admin UI for user creation, or group management is not on the roadmap of the OCCRP developer team. However, other developers are encouraged to implement them and contribute the code.
This depends on how you create users more generally: when you're using Aleph's login system, you can do this on the command line via the `aleph createuser` command, by adding the `--admin` option.
If you are using OAuth or have already created a user, then you can make an admin user directly via SQL in the database:
```
make shell
psql $ALEPH_DATABASE_URI
UPDATE role SET is_admin = true WHERE email = '[email protected]';
```
You may also need to run `aleph update` afterwards to refresh some cached information.
That's where it's most at home! We recommend you use the helm chart to deploy Aleph. It will allow you to override the key settings for your site, while providing a coherent deployment. We use auto-scaling both on the cluster and pod level, which helps to combine fast imports with limited operational cost.
If you want to manipulate the SQL database directly (e.g. to edit a user, create or delete a group), you can connect to the PostgreSQL database.
In development mode, the database is exposed on the host at `127.0.0.1:15432`. (User, password and database name are all `aleph`.) You can also connect from the shell container:
```
make shell
psql $ALEPH_DATABASE_URI
```
The same can be done if you run an instance of the `shell` container in production mode.
When looking at an Aleph URL, you may notice that every entity ID has two parts, separated by a dot (`.`), for example: `deadbeef.3cd336a9859bdf2be917f561430f2a83e5da292b`. The first part is the actual entity ID, while the second part is a signature (HMAC) assigned by the server when indexing the data.
The background for this is a security mitigation. There are various places in Aleph where a user can assign arbitrary IDs to new entities, including the collection `_bulk` API. In these cases, an attacker could attempt to inject an ID already used by another collection and thus overwrite its data.
To avoid this, each entity ID is assigned a namespace ID suffix for the collection it is submitted to. This way, multiple collections can have entities with the same ID without overwriting each other's data.
When using the Aleph API, you can submit either form to the `_bulk` API: a version of the entity with its signature, or without. The signature will be fixed up automatically.
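To illustrate the general idea, here is a conceptual sketch of per-collection ID signing. It is not Aleph's actual implementation: the secret handling, digest choice and formatting are simplified assumptions.

```python
# Conceptual sketch of namespaced entity IDs, not Aleph's actual code.
# The secrets, the digest choice and the formatting are simplified assumptions.
import hmac
import hashlib


def sign_entity_id(entity_id: str, collection_secret: bytes) -> str:
    """Return '<entity_id>.<signature>' scoped to one collection."""
    digest = hmac.new(collection_secret, entity_id.encode("utf-8"), hashlib.sha1)
    return f"{entity_id}.{digest.hexdigest()}"


# The same entity ID submitted to two collections gets different signatures,
# so one collection cannot overwrite another collection's data.
print(sign_entity_id("deadbeef", b"collection-1-secret"))
print(sign_entity_id("deadbeef", b"collection-2-secret"))
```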
The benefit of storing Aleph's data in a graph database would be the ability to run path queries and quick pattern matching ("Show me all the companies owned by people who have the same name as a politician").
The downsides are:
- User-controlled access to Aleph must always go through security checks, and we haven't really found a graph DB that would handle the security model of Aleph without generating incredibly complex (and slow) queries.
- While some graph databases have Lucene built in, that doesn't replace ElasticSearch. Simple search is a killer use case and needs to be really good and offer advanced features like facets, text normalisation and index sharding.
- There's a lot of data in some Aleph instances. It's not clear how many graph databases can respond to queries against billions of entities within an HTTP response cycle.
- Our data is usually not well integrated, so the graph is less dense than you might think, unless we fully pre-compute all possible entity duplicates as graph links.
All of this said, we'd really love to hear about any experiments regarding this. At OCCRP we sometimes materialise partial Aleph graphs into Neo4J and let analysts browse them via Linkurious. We're hoping to look into dgraph as a possible backend at some point.
{% hint style="info" %} We have well-defined graph semantics for FollowTheMoney data and you can export any data (including documents like emails) in Aleph into various graph formats (RDF, Neo4J, and GEXF for Gephi). {% endhint %}
You can find out specifically what went wrong with the document converter service by consulting the logs for that container:
```
docker-compose -f docker-compose.dev.yml logs convert-document
```
If LibreOffice keeps crashing on startup with `Fatal exception: Signal 11`, AppArmor can be one possible cause: AppArmor running on the host machine could be blocking LibreOffice from starting up. Try disabling the AppArmor profiles related to LibreOffice by following these instructions: https://askubuntu.com/a/1214363