Skip to content

Commit

Permalink
Merge pull request #5 from singnet/senna-docs-1
Browse files Browse the repository at this point in the history
Migrate docs from das-query-engine
  • Loading branch information
andre-senna authored Mar 26, 2024
2 parents 1854a21 + 96a72ff commit ae25838
Show file tree
Hide file tree
Showing 50 changed files with 498 additions and 224 deletions.
268 changes: 268 additions & 0 deletions docs/DAS-Overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,268 @@
# Distributed Atomspace - Overview

Atomspace is the hypergraph OpenCog Hyperon uses to represent and store
knowledge, being the source of knowledge for AI agents and the container of any
computational result that might be created or achieved during their execution.

The __Distributed Atomspace (DAS)__ is an extension of OpenCog Hyperon's
Atomspace into a more independent component designed to support multiple
simultaneous connections with different AI algorithms, providing a flexible
query interface to distributed knowledge bases. It can be used as a component
(e.g. a Python library) or as a stand-alone server to store essentially
arbitrarily large knowledge bases and provide means for the agents to traverse
regions of the hypergraphs and perform global queries involving properties,
connectivity, subgraph topology, etc.

DAS can be understood as a persistence layer for knowledge bases used in
OpenCog Hyperon.

<p align="center">
<img src="assets/persistence_layer.png" width="400"/>
</p>


The data manipulation API provides a defined set of operations without exposing
database details such as data modeling and the DBMS (Database Management
System) being used. This is important because it allows us to evolve the data
model inside DAS and even change the DBMS without affecting the integration
with the AI agents.

But being an abstraction for the data model is not the only purpose of DAS.
While performing this connection between AI agents and the knowledge bases, DAS
provides a lot of other functionalities:

* Higher level indexes stored in the DBMS
* Query engine with pattern matching capabilities
* Traverse engine to keep track of hypergraph traversal
* Cache for query results
* Scalable connection manager to connect the DAS with multiple other DASs

This is why DAS is not just a Data Access Object or a database interface layer
but rather a more complex OpenCog Hyperon's component that abstracts not only
data modeling/access itself but also several other algorithms that are closely
related to the way AI agents manipulate information.

## Table of contents

- [DAS Components](#das-components)
- [Higher Level Indexing](#higher-level-indexing)
- [Pattern Matcher](#pattern-matcher)
- [Mapping knowledge bases to nodes and links](#mapping-knowledge-bases-to-nodes-and-links)
- [DAS Server Deployment and Architecture](#das-server-deployment-and-architecture)

## DAS Components

DAS is delivered as a Python library
[hyperon-das](https://pypi.org/project/hyperon-das/) which can be used in two
different ways:

1. To create a DAS server which is supposed to contain a knowledge base and
provide it to many remote clients (somehow like a DBMS).
2. To instantiate a DAS in a Python program which can store a smaller local
knowledge base and can, optionally, connect to one or more remote DAS servers,
exposing their contents to the local program. In this case, the local knowledge
base can store its contents in RAM or can use a DB backend to persist it.

<p align="center">
<img src="assets/components.png" width="900"/>
</p>

Components in the DAS architecture are designed to provide the same
[data manipulation API](https://singnet.github.io/das-query-engine/api/das/)
regardless of whether it's being used locally or remotely or, in the case of
a local DAS, whether DB persistence is being used or not.

Part of this API is delegated to __Traverse Engine__, which interacts with the
Query Engine and the Cache to provide means to the user to traverse the
Atomspace hypergraph. Operations like finding the links pointing from/to a
given atom or finding atoms in the surrounding neighborhood are performed by
this engine, which controls the pre-fetching of the surrounding atoms when a
remote DAS is being used, in such a way that following links can be done
quickly.

The __Query Engine__ is where global queries are processed. These are queries
for specific atoms or sets of atoms that satisfies some criteria, including
pattern matching. When making a query, the user can specify whether only local
atoms should be considered or whether atoms in remote DASs should be searched
as well. If that's the case, the Query Engine connects to the remote OpenFaaS
servers to make the queries in the remote DASs and return a answer which is a
proper combination of local and remote information. For instance, if there're
different versions of the same atom in local and one of the remote DASs, the
local version is returned.

Both engines use the __Cache__ in order to make queries involving a remote DAS
faster. The DAS' cache is not exactly like a traditional cache, where data is
stored basically in the same way in both, the cache and the primary data
repository, and queries are answered by searching the data in the former and
then in the latter. The DAS's cache implements this functionality but it also
sorts and partitions queries' results in such a way that the caller sees the
most relevant results first.

All the queries that return more than one atom, return an iterator to the
results instead of the results themselves. This way only a subset of the
results are returned in a remote query. When the caller iterates through this
iterator, other chunks of results are fetched on demand from the remote DAS
until all the results have been visited. Before splitting the results in
chunks, the resulting atoms are sorted by "relevance", which can be a measure
based in atoms' Short and Long Term Importance (STI and LTI), in a way that the
most relevant results are iterated first. This is important because most AI
agents make several queries and visit the results in a combinatorial fashion so
visiting every single possible combination of results are not practical. Having
results sorted by relevance allow the agents to constraint the search and
eventually avoid fetching too many chunks of results from the remote server.

The __AtomDB__ is somehow like a Data Access Object or a database interface
layer to abstract the calls to the database where atoms are actually stored.
Having this abstraction is important because it allows us to change or to
extend the actual data storage without affecting the query algorithms (such as
pattern matching) implemented in traverse and query engines. AtomDB can be
backended by in-RAM data structures or one or more DBMSs.

## Higher Level Indexing

DAS uses a DBMS to store atoms. By doing so it uses the indexing capabilities
of this DBMS to retrieve atoms faster. But in addition to this, DAS also
creates other custom indexes and stores these indexes in another DBMS. The most
relevant of such indexes is the Pattern Inverted Index.

An inverted index is a data structure which stores a map from contents (words,
sentences, numbers, etc) to where they can be found in a given data container
(database, file system etc).

This type of data structure is largely used in document retrieval systems to
implement efficient search engines. The idea is spending computational time
when documents are inserted in the document base to index and record the words
that appear in each document (and possibly the position they happen inside the
documents). Afterwards this index can be used by the search engine to
efficiently locate documents that contain a given set of keywords.

The entities in the Opencog Hyperon's context are different from the ones in
typical document retrieval systems but their roles and the general idea of the
algorithms are very similar. In OpenCog Hyperon's context, a knowledge base is
a set of toplevel links (which may point to nodes or to other links). When
the knowledge base is loaded, we can create an inverted index of patterns
present in each toplevel link and use such index later to perform pattern
matching.

For instance, given as toplevel link like this one:

```
Inherits
<Concept A>
<Concept B>
```

We could add entries like these ones in the Pattern Inverted Index (where `H1`
is the handle of the toplevel link above):

```
Inherits * <Concept B> ==> H1
Inherits <Concept A> * ==> H1
Inherits * * ==> H1
```

## Pattern Matcher

DAS' query engine can answer pattern matching queries. These are queries where
the caller specifies a _pattern_ i.e. a boolean expression of subgraphs with
nodes, links and wildcards and the engine finds every subgraph in the knowledge
base that satisfies the passed expression.

For instance, suppose we have the following knowledge base in DAS.

<p align="center">
<img src="assets/pmquery_1.png" width="500"/>
</p>

We could search for a pattern like:

```
AND
Similar(V1, V2)
NOT
AND
IS_A(V1, V3)
IS_A(V2, V3)
```

`V1`, `V2` and `V3` are wildcards or variables. In any candidate subgraph
answer, the atom replacing `V1`, for instance, should be the same in all the
links where `V1` appears. In other words, with this pattern we are searching
for two nodes `V1` and `V2` such that there exist a similarity link between
them but there's no pair of inheritance links pointing `V1` and `V2` to the
same node `V3`, no matter the value of `V3`.

In this example, `Chimp` and `Human` are not a suitable answer to replace `V1`
and `V2` because there's a possible value for `V3` that satisfies the `AND`
clause in the pattern, as shown below.

<p align="center">
<img src="assets/pmquery_2.png" width="500"/>
</p>

On the other hand, there are other pair of nodes which could be used to match
`V1` and `V2` whitout matching the `AND` clause, as shown below.

<p align="center">
<img src="assets/pmquery_3.png" width="500"/>
</p>

The answer for the query is all the subgraphs that satisfy the pattern. In our
example, the answer would be as follows.

<p align="center">
<img src="assets/pmquery_4.png" width="500"/>
</p>

## Mapping knowledge bases to nodes and links

Before loading a knowledge base into DAS, you need to define a proper mapping
to Atomspace nodes and links. DAS doesn't make any assumptions regarding nodes
or link types, arity etc. When adding nodes and links using DAS' API, one may
specify atom types freely and the semantic meaning of such atom types are
totally concerned with the application. DAS don't make any kind of processing
based in pre-defined types (actually, there are no internally pre-defined atom
types).

DAS also doesn't provide a way to read a text or SQL or whatever type of file in
order to load a knowledge base. There's no DAS-defined file syntax for this.
If one needs to import a knowledge base, it needs to provide a proper loader
application to parse the input file(s) and make the proper calls to DAS' API in
order to add nodes and links.

Surely one of the interesting topics for future/on-going work on DAS is to
provide loaders (and respective nodes/links mapping) for different types of
knowledge base formats like SQL, Atomese, etc. We already have such a
[loader for MeTTa files](https://github.com/singnet/das-metta-parser).

## DAS Server Deployment and Architecture

DAS server is deployed in a [Lambda Architecture](https://en.wikipedia.org/wiki/Lambda_architecture)
based either in [OpenFaaS](https://www.openfaas.com/) or [AWS Lambda](https://aws.amazon.com/pm/lambda/).
We made a comparative study of these two architectures (results are presented in
this [report](https://docs.google.com/document/d/1kQhM62T3TIb3ECoqBxmqPulMciJq7Vev3-f_Gnhes_s/))
and decided to prioritize OpenFaaS.
Although deployment in AWS Lambda is still possible, currently only OpenFaaS is
supported by [our automated deployment tool](https://github.com/singnet/das-toolbox).
This architecture is presented in the diagram below.


<p align="center">
<img src="assets/architecture.jpg" width="800"/>
</p>

When deploying in AWS Lambda, [Redis](https://redis.io/) and
[MongoDB](https://www.mongodb.com/) can be replaced by AWS'
[DocumentDB](https://aws.amazon.com/pt/documentdb/) and
[ElastiCache](https://aws.amazon.com/elasticache/redis/) but the overall
structure is basically the same.

Functions are deployed in servers in the cloud as
[Docker](https://www.docker.com/) containers, built in [our CI/CD pipeline](Deployment-Process.md)
by automated [GitHub Actions](https://docs.github.com/en/actions) scripts and
stored in a private Docker hub registry.

Clients can connect using HTTP, gRPC or an external lambda functions (OpenFaaS
functions can only connect to OpenFaaS and the same is true for AWS functions).

DAS is versioned and released as a [library in PyPI](https://pypi.org/project/hyperon-das/).
Loading

0 comments on commit ae25838

Please sign in to comment.