
Configuring Entity Checking

Lixi edited this page Nov 10, 2020 · 2 revisions

GERBIL checks whether an entity represented by a URI exists. It also uses a cache so that an entity that has already been tested does not need to be tested again.

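There are two types of entity checker, the Index-based Entity Checker and the HTTP-based Entity Checker, and two types of cache system, the In-Memory Cache and the File-based Cache.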

Adding, removing and modifying an entity checker, as well as configuring the cache, is done in the entity_checking.properties file, which is located in the src/main/properties/ folder.

Configuring the Cache

To adjust the cache to your system you can edit entity_checking.properties.

If you want to use a persistent cache, add the following property:

org.aksw.gerbil.dataset.check.EntityCheckerManagerImpl.usePersistentCache=true

In-Memory Cache

The In-Memory Cache stores the entity checking results in memory. To restrict the memory used by the cache, you can set:

org.aksw.gerbil.dataset.check.InMemoryCachingEntityCheckerManager.cacheSize=1000000

This sets the maximum number of entities stored in the cache.

Furthermore, because URIs are added, modified and removed over time, you can specify how long an entity should be cached:

org.aksw.gerbil.dataset.check.InMemoryCachingEntityCheckerManager.cacheDuration=2592000000

The duration is given in milliseconds; 2592000000 ms corresponds to 30 days.
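How the two settings interact can be pictured with a minimal Python sketch. It is not GERBIL's implementation: the class name ExpiringCache, its methods, and the choice to evict the oldest entry when the cache is full are assumptions made for illustration. The page only states that the number of entries and the caching duration are limited.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


def _now_ms() -> float:
    """Current wall-clock time in milliseconds."""
    return time.time() * 1000.0


class ExpiringCache:
    """Hypothetical in-memory cache of entity check results (URI -> exists)."""

    def __init__(self, max_size: int, duration_ms: int,
                 clock_ms: Callable[[], float] = _now_ms) -> None:
        self.max_size = max_size        # plays the role of cacheSize
        self.duration_ms = duration_ms  # plays the role of cacheDuration
        self._clock_ms = clock_ms
        self._entries: OrderedDict = OrderedDict()

    def get(self, uri: str) -> Optional[bool]:
        """Return the cached result, or None if it is missing or expired."""
        entry = self._entries.get(uri)
        if entry is None:
            return None
        exists, stored_at = entry
        if self._clock_ms() - stored_at >= self.duration_ms:
            del self._entries[uri]  # too old, so the URI must be checked again
            return None
        return exists

    def put(self, uri: str, exists: bool) -> None:
        """Store a result; evict the oldest entry if the cache is full (assumed policy)."""
        if uri in self._entries:
            del self._entries[uri]
        elif len(self._entries) >= self.max_size:
            self._entries.popitem(last=False)
        self._entries[uri] = (exists, self._clock_ms())
```

A caller would first ask get(uri) and only run the actual entity check, followed by put(uri, result), when get returns None.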

File-based Cache

The file-based cache stores the entity checking results in a file. By default, this file is located at gerbil_data/cache/entityCheck.cache; the location can be changed using the following property:

org.aksw.gerbil.dataset.check.FileBasedCachingEntityCheckerManager.cacheFile=${org.aksw.gerbil.CachePath}/entityCheck.cache

Because URIs are added, modified and removed over time, you can specify how long an entity should be cached:

org.aksw.gerbil.dataset.check.FileBasedCachingEntityCheckerManager.cacheDuration=2592000000

The duration is given in milliseconds.
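Since the cacheDuration values are plain millisecond counts, it can help to compute them rather than type them by hand. The following is a small Python sketch; the helper name cache_duration_ms is hypothetical and not part of GERBIL.

```python
from datetime import timedelta


def cache_duration_ms(days: float) -> int:
    """Convert a duration in days into whole milliseconds."""
    return timedelta(days=days) // timedelta(milliseconds=1)


# The value used in the examples on this page is 30 days.
print(cache_duration_ms(30))  # 2592000000
```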

Index-based Entity Checker

The Index-based Entity Checker uses a pregenerated Lucene index, which the start.sh script will automatically download and use if specified. The index was created for DBpedia. You can create an index for a different domain using DBpediaEntityCheckIndexTool; be aware that, for now, this requires changing the code.

To use the Index-based Entity Checker, add the following to the entity_checking.properties file:

org.aksw.gerbil.dataset.check.IndexBasedEntityChecker.yourEntityCheckerName=${org.aksw.gerbil.DataPath}/indexes/YOUR_INDEX,http://example.org, http://fr.example.org

The first argument sets the directory in which the Lucene index is located. The following arguments are the domains the index contains.
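The structure of such a value can be illustrated with a short Python sketch that separates the index directory from the domains. How GERBIL itself parses the property is not described here, so the function name split_index_checker_value and the trimming of whitespace around each argument are assumptions. The ${...} placeholder is left unexpanded.

```python
from typing import List, Tuple


def split_index_checker_value(value: str) -> Tuple[str, List[str]]:
    """Split a property value into (index directory, list of domains)."""
    # Assumption: arguments are comma-separated and surrounding spaces are ignored.
    parts = [part.strip() for part in value.split(",") if part.strip()]
    if len(parts) < 2:
        raise ValueError("expected an index directory followed by at least one domain")
    return parts[0], parts[1:]


directory, domains = split_index_checker_value(
    "${org.aksw.gerbil.DataPath}/indexes/YOUR_INDEX,http://example.org, http://fr.example.org"
)
print(directory)  # ${org.aksw.gerbil.DataPath}/indexes/YOUR_INDEX
print(domains)    # ['http://example.org', 'http://fr.example.org']
```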

For example, to use the pregenerated index:

org.aksw.gerbil.dataset.check.IndexBasedEntityChecker.dbpedia=${org.aksw.gerbil.DataPath}/indexes/dbpedia_check,http://dbpedia.org,http://de.dbpedia.org,http://fr.dbpedia.org

The index was created using the English, French and German DBpedia and is located under gerbil_data/indexes/dbpedia_check.

HTTP-based Entity Checker

An alternative, which takes more time than the Index-based Entity Checker, is the HTTP-based Entity Checker. It checks whether an entity exists using HTTP and can be set up by adding the following:

org.aksw.gerbil.dataset.check.HttpBasedEntityChecker.namespace=http://example.org/res/
org.aksw.gerbil.dataset.check.HttpBasedEntityChecker.namespace=http://de.example.org/res/

For a URI that starts with one of these namespaces (e.g. http://example.org/res/), this entity checker tests via an HTTP request whether the URI exists.
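The idea can be sketched in Python as a namespace prefix test followed by an HTTP request. This is an illustration under assumptions, not GERBIL's code: the function names, the use of a HEAD request, treating only a successful response as "exists", and returning None for URIs outside the configured namespaces are all choices made for this sketch.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_head_exists(uri: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the server answers successfully."""
    try:
        request = urllib.request.Request(uri, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URIs.
        return False


def check_entity(uri: str, namespaces: Iterable[str],
                 probe: Callable[[str], bool] = http_head_exists) -> Optional[bool]:
    """Return True/False for URIs in a known namespace, None if no namespace matches."""
    if not any(uri.startswith(namespace) for namespace in namespaces):
        return None  # assumption: this checker is not responsible for the URI
    return probe(uri)
```

Passing the probe as a parameter keeps the namespace logic testable without network access; with the default probe, check_entity("http://example.org/res/Berlin", ["http://example.org/res/"]) would issue a real request.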