Commit

Added docs.
lasyard committed Oct 19, 2021
1 parent 87e3156 commit d8cdd63
Showing 19 changed files with 928 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/.gitignore
@@ -0,0 +1,2 @@
# Sphinx files
_*
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
102 changes: 102 additions & 0 deletions docs/architecture/cluster.md
@@ -0,0 +1,102 @@
# Cluster

A DingoDB cluster consists of two kinds of distributed components: Coordinators and Executors.

The cluster is managed using [Apache Helix](http://helix.apache.org/). Helix is a cluster management framework for managing replicated and partitioned resources in distributed systems. Helix uses Zookeeper to store cluster state and metadata.



## Coordinator

The coordinator is responsible for the following:

- Coordinators maintain the global metadata of the system (e.g. configs and schemas) with the help of Zookeeper, which serves as the persistent metadata store.
- Each coordinator is modeled as a Helix Controller and is responsible for managing executors.
- Coordinators maintain the mapping of which executors are responsible for which table partitions.

There can be multiple instances of DingoDB coordinator for redundancy.



## Executor

Executors are responsible for maintaining table partitions, and they provide the ability to read and write the partitions they are responsible for.

The DingoDB executor is modeled as a Helix Participant, and the tables it is responsible for are modeled as Helix resources. The partitions of a table are modeled as partitions of the corresponding Helix resource. Thus, a DingoDB executor is responsible for one or more Helix partitions of one or more Helix resources.
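
As a minimal sketch of what this modeling looks like in Helix terms (not DingoDB's actual code), an executor might register a Leader/Follower state-model factory and connect as a participant. The cluster name, instance name, Zookeeper address, state-model name, and all class names below are assumptions for illustration; the factory signature follows Helix 0.9.x.

```java
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelFactory;

// Illustrative only: handles the state transitions of one partition
// assigned to this executor.
class ExecutorStateModel extends StateModel {
    public void onBecomeFollowerFromOffline(Message msg, NotificationContext ctx) {
        // Open the local store for this partition and start syncing from the Leader.
    }

    public void onBecomeLeaderFromFollower(Message msg, NotificationContext ctx) {
        // Start accepting writes for this partition.
    }

    public void onBecomeOfflineFromFollower(Message msg, NotificationContext ctx) {
        // Stop serving this partition.
    }
}

class ExecutorStateModelFactory extends StateModelFactory<ExecutorStateModel> {
    @Override
    public ExecutorStateModel createNewStateModel(String resource, String partition) {
        return new ExecutorStateModel();
    }
}

public class ExecutorRegistration {
    public static void main(String[] args) throws Exception {
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
                "DingoCluster",       // cluster name (assumed)
                "executor-1",         // instance name (assumed)
                InstanceType.PARTICIPANT,
                "zk-host:2181");      // Zookeeper address (assumed)
        // Register the factory handling Leader/Follower transitions, then join.
        manager.getStateMachineEngine().registerStateModelFactory(
                "LeaderFollower", new ExecutorStateModelFactory());
        manager.connect();
    }
}
```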



## Table

Tables are collections of rows and columns representing related data. DingoDB divides each table into partitions, which are stored on local disks or in cloud storage.

In DingoDB, a table is modeled as a Helix resource, and each partition of the table is modeled as a Helix partition.

The resource state model uses the Leader/Follower model.

- Leader: supports both data writes and data queries
- Follower: synchronizes data from the Leader and supports queries, but not direct writes
- Offline: the node is offline and does not process data
- Dropped: the mapping between a data partition and an Executor is deleted

Each data partition in a cluster has exactly one Leader. The Leader is selected by a Coordinator, preferentially from among the Followers; the selected Follower is promoted to Leader. There can be multiple Followers, depending on the number of replicas.

Whether an Executor can be responsible for a Table is determined by tags. When a Table is created, a tag is set for the Table. During partition allocation, Executors that have the tag are selected, and Executors that do not have it are excluded.

```{mermaid}
stateDiagram
[*] --> OFFLINE
OFFLINE --> LEADER
OFFLINE --> FOLLOWER
OFFLINE --> DROPPED
LEADER --> FOLLOWER
LEADER --> OFFLINE
FOLLOWER --> LEADER
FOLLOWER --> OFFLINE
DROPPED --> [*]
```



## cli-tools

### Rebalance

Reset the mapping of which executors are responsible for which table partitions.

When this command is run, it resets the mapping of which executor is responsible for each partition of the given table. The coordinator then detects the change and reassigns the table partitions to executors, based on the policies defined by the Leader/Follower state model.

```shell
tools.sh rebalance --name TableName --replicas 3 # if --replicas is omitted or less than 1, the default replica count is used
```
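
Under the hood this plausibly delegates to Helix's rebalancing of the table's resource; a hedged sketch under that assumption (cluster name and Zookeeper address are made up):

```java
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;

public class RebalanceExample {
    public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("zk-host:2181"); // assumed ZK address
        // Recompute the partition-to-executor mapping for the table's
        // Helix resource with the requested number of replicas.
        admin.rebalance("DingoCluster", "TableName", 3);
    }
}
```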



### Sqlline

Command-line shell for issuing SQL to DingoDB.

SQLLine is a pure-Java console-based utility for connecting to DingoDB and executing SQL commands. You can use it the way you would use `sqlplus` for Oracle or `mysql` for MySQL. It is implemented with the open-source project [sqlline](https://github.com/julianhyde/sqlline).

```shell
tools.sh sqlline
```



### Tag

Set a table's tag, or add a tag to an executor.

Because whether an executor can be responsible for a table is determined by tags, you can use this command to add a tag to an executor.

In general, no special tag settings are required: a newly created table uses the default tag, so any executor can be responsible for the table's partitions.

If you want only specific executors to be responsible for a table's partitions, use this command to set a tag for the table and then add that tag to the desired executors.

```shell
tools.sh tag --name TableName --tag TagName # set table tag
tools.sh tag --name ExecutorName --tag TagName # add tag to executor
tools.sh tag --name ExecutorName --tag TagName --delete # delete tag for executor
```
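
These commands plausibly map to Helix instance tags and resource group tags; a hedged sketch under that assumption (cluster name, instance name, tag, and Zookeeper address are made up):

```java
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class TagExample {
    public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("zk-host:2181"); // assumed ZK address

        // Tag an executor instance so it becomes eligible for tagged tables.
        admin.addInstanceTag("DingoCluster", "executor-1", "hot-data");

        // Tag a table (Helix resource) so its partitions are placed only
        // on instances carrying the same tag.
        IdealState idealState =
                admin.getResourceIdealState("DingoCluster", "TableName");
        idealState.setInstanceGroupTag("hot-data");
        admin.setResourceIdealState("DingoCluster", "TableName", idealState);
    }
}
```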

101 changes: 101 additions & 0 deletions docs/architecture/expression.md
@@ -0,0 +1,101 @@
# Dingo Expression

Dingo Expression is the expression engine used by DingoDB. Its distinguishing feature is that the runtime codebase is
separated from the parsing and compiling codebase. The runtime classes are serializable, making them suitable for the
runtime of a distributed computing system such as [Apache Flink](https://flink.apache.org/).
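
A minimal, self-contained sketch of this design principle — not the engine's actual classes — where the runtime expression tree is `Serializable` and can be evaluated after shipping, with no parser on the classpath:

```java
import java.io.*;

// Sketch: the runtime expression tree is Serializable and independent
// of the parser/compiler side.
interface Expr extends Serializable {
    Object eval(Object[] env); // env holds variable values by index
}

class Var implements Expr {
    private final int index;
    Var(int index) { this.index = index; }
    public Object eval(Object[] env) { return env[index]; }
}

class Add implements Expr {
    private final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public Object eval(Object[] env) {
        return ((Number) left.eval(env)).doubleValue()
                + ((Number) right.eval(env)).doubleValue();
    }
}

public class ExprDemo {
    public static void main(String[] args) throws Exception {
        Expr expr = new Add(new Var(0), new Var(1)); // built by the compiler side

        // Round-trip through Java serialization, as a distributed runtime would.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(expr);
        oos.flush();
        Expr shipped = (Expr) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();

        System.out.println(shipped.eval(new Object[]{1.5, 2.5})); // prints 4.0
    }
}
```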

## Operators

| Category | Operator | Associativity |
| :------------- | :------------------------------------------- | :------------ |
| Parenthesis | `( )` | |
| Function Call | `( )` | Left to right |
| Name Index | `.` | Left to right |
| Array Index | `[ ]` | Left to right |
| Unary | `+` `-` | Right to left |
| Multiplicative | `*` `/` | Left to right |
| Additive | `+` `-` | Left to right |
| Relational | `<` `<=` `>` `>=` `==` `=` `!=` `<>` | Left to right |
| String | `startsWith` `endsWith` `contains` `matches` | Left to right |
| Logical NOT | `!` `not` | Left to right |
| Logical AND | `&&` `and` | Left to right |
| Logical OR | <code>&#x7c;&#x7c;</code> `or` | Left to right |

## Data Types

| Type Name | SQL type | JSON Schema Type | Hosting Java Type | Literal in Expression |
| :----------- | :-------- | :--------------- | :--------------------- | :-------------------- |
| Integer | INTEGER | | java.lang.Integer | |
| Long | LONG | integer | java.lang.Long | `0` `20` `-375` |
| Double | DOUBLE | number | java.lang.Double | `2.0` `-6.28` `3e-4` |
| Boolean | BOOLEAN | boolean | java.lang.Boolean | `true` `false` |
| String | CHAR | string | java.lang.String | `"hello"` `'world'` |
| String | VARCHAR | string | java.lang.String | `"hello"` `'world'` |
| Decimal      |           |                  | java.math.BigDecimal   |                       |
| Time         | TIMESTAMP |                  | java.util.Date         |                       |
| IntegerArray |           |                  | java.lang.Integer[]    |                       |
| LongArray    |           | array            | java.lang.Long[]       |                       |
| DoubleArray  |           | array            | java.lang.Double[]     |                       |
| BooleanArray |           | array            | java.lang.Boolean[]    |                       |
| StringArray  |           | array            | java.lang.String[]     |                       |
| DecimalArray |           |                  | java.math.BigDecimal[] |                       |
| ObjectArray  |           | array            | java.lang.Object[]     |                       |
| List         |           | array            | java.util.List         |                       |
| Map          |           | object           | java.util.Map          |                       |
| Object       |           | object           | java.lang.Object       |                       |

## Constants

| Name | Value |
| :--- | ----------------------: |
| TAU | 6.283185307179586476925 |
| E | 2.7182818284590452354 |

There is no "3.14159265…" constant; use `TAU` instead. See [The Tau Manifesto](https://tauday.com/tau-manifesto).

## Functions

### Mathematical

See [Math (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html).

| Function | Java function based on | Description |
| :-------- | :--------------------- | :---------- |
| `abs(x)` | `java.lang.Math.abs` | |
| `sin(x)` | `java.lang.Math.sin` | |
| `cos(x)` | `java.lang.Math.cos` | |
| `tan(x)` | `java.lang.Math.tan` | |
| `asin(x)` | `java.lang.Math.asin` | |
| `acos(x)` | `java.lang.Math.acos` | |
| `atan(x)` | `java.lang.Math.atan` | |
| `cosh(x)` | `java.lang.Math.cosh` | |
| `sinh(x)` | `java.lang.Math.sinh` | |
| `tanh(x)` | `java.lang.Math.tanh` | |
| `log(x)` | `java.lang.Math.log` | |
| `exp(x)` | `java.lang.Math.exp` | |

### Type conversion

| Function | Java function based on | Description |
| :--------------- | :--------------------- | :----------------------------------- |
| `int(x)` | | Convert `x` to Integer |
| `long(x)` | | Convert `x` to Long |
| `double(x)` | | Convert `x` to Double |
| `decimal(x)` | | Convert `x` to Decimal |
| `string(x)` | | Convert `x` to String |
| `string(x, fmt)` | | Convert `x` to String, `x` is a Time |
| `time(x)` | | Convert `x` to Time |
| `time(x, fmt)` | | Convert `x` to Time, `x` is a String |

### String

See [String (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html).

| Function | Java function based on | Description |
| :------------------- | :--------------------- | :---------- |
| `toLowerCase(x)` | `String::toLowerCase` | |
| `toUpperCase(x)` | `String::toUpperCase` | |
| `trim(x)` | `String::trim` | |
| `replace(x, a, b)` | `String::replace` | |
| `substring(x, s)` | `String::substring` | |
| `substring(x, s, e)` | `String::substring` | |
Binary file added docs/architecture/images/dingo-architecture.png
12 changes: 12 additions & 0 deletions docs/architecture/index.rst
@@ -0,0 +1,12 @@
Architecture
============

.. toctree::
   :maxdepth: 1
   :glob:

   overview
   cluster
   sql_execution
   network
   expression
30 changes: 30 additions & 0 deletions docs/architecture/network.md
@@ -0,0 +1,30 @@
# Network protocols

## Data messages and control messages

There are two types of messages transferred among the cluster nodes: data messages and control messages. Data messages
contain encoded tuples delivered by the "send operators", including the special "FIN" tuple. Control messages are used
to coordinate job execution. The following control message types are currently implemented:

1. Task delivery message: tasks are serialized and delivered from the coordinator to executors.
2. Receiving ready message: the "receive operators" send messages to the corresponding "send operators" to notify that
they are properly started and ready to receive data messages.

The interaction between two tasks that exchange data with each other is illustrated in the following figure.

```{mermaid}
sequenceDiagram
participant R as Task with receive operator
participant S as Task with send operator
R ->> S: RCV_READY
loop Data transfer
S ->> R: Send tuples
end
S ->> R: Send FIN
```

## Serialization and deserialization

The tuples are encoded in Avro format into data messages.
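
A hedged sketch of encoding one tuple with Avro's `GenericRecord` API. The schema below is assumed for illustration; the actual schema DingoDB derives from a table definition is not shown in this commit.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class TupleEncoding {
    public static void main(String[] args) throws Exception {
        // Assumed schema: a tuple with a long id and a string name.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Tuple\",\"fields\":["
                        + "{\"name\":\"id\",\"type\":\"long\"},"
                        + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("name", "Alice");

        // Binary-encode the record; the bytes would ride in a data message.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
    }
}
```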

The tasks are encoded in JSON format.
54 changes: 54 additions & 0 deletions docs/architecture/overview.md
@@ -0,0 +1,54 @@
# Architecture Overview

## Introduction

The main architecture of DingoDB is illustrated in the following figure.

![](images/dingo-architecture.png)

The metadata of the database is kept in [Zookeeper](http://zookeeper.apache.org/). When the client requests a DDL
operation (e.g. creating or dropping tables), the coordinator accepts the request and modifies the metadata stored in
Zookeeper, so the whole cluster knows about the modifications. When the client initiates a DML request (e.g. inserting,
updating, deleting data, or querying the database), the coordinator accepts the request, parses it and builds a job
graph for it, then distributes the tasks of the job to the executors. Each executor runs the tasks dispatched to it,
accessing data in the corresponding local storage, or processing data received from other tasks. Finally, the result is
collected and returned to the client.

## Terminology

### Coordinator

Coordinators are the computing nodes where an SQL query is initiated and started, and where its results are collected.
There may be several coordinators in a cluster, but only one coordinator is involved in the execution of a given SQL
statement.

### Executor

Executors are the computing nodes where the tasks of a job are running.

### Part

The data in a table are partitioned into multiple parts, stored at different locations.

### Location

Generally, a location is defined as a computing node plus a path where parts are stored. A computing node may contain
several locations.

### Job

A job is an execution plan corresponding to an SQL DML statement; it contains several tasks.

### Task

A task is defined as a directed acyclic graph (DAG) of operators together with a specified location.

### Operator

An operator describes a specific operation and belongs to a specific task. An operator can be a "**source operator**",
producing data for its task, or a "**sink operator**", consuming input data and producing no output; all other operators
consume data coming from upstream operators, perform calculations and transformations, and push the results to their
outputs, which are connected to downstream operators.

### Tuple

Data transferred between operators and tasks are generally defined as tuples. In Java, a tuple is represented as an `Object[]`.
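
For example (the values are hypothetical), a row of a table with columns `(id, name, score)` travels between operators as a plain positional array:

```java
// A tuple is a positional array of column values; the schema is implicit.
Object[] tuple = new Object[]{1L, "Alice", 95.5};
```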
37 changes: 37 additions & 0 deletions docs/architecture/sql_execution.md
@@ -0,0 +1,37 @@
# SQL execution

1. Parse an SQL string into a job graph.

```{mermaid}
flowchart TD
S[SQL string] -->|parsing| L[Logical Plan]
L -->|transforming & optimizing| P[Physical Plan]
P -->|building| J[Job Graph]
```

This is done on the coordinator.

2. Distribute the job.

After the job is generated and verified, its tasks are serialized and transferred according to their binding
locations. Executors receive and deserialize these tasks.

3. Run the job.

Each task is initiated and executed on its executor. A "**PUSH**" model is used in task execution: the source
operators produce tuples actively, and all operators except sink operators actively push tuples to their outputs.
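
A minimal sketch of what a push-model operator might look like, assuming a simple `Operator` interface; the names are illustrative and not DingoDB's actual classes:

```java
import java.util.function.Predicate;

// Illustrative push-model operator: upstream calls push(), the operator
// processes the tuple and actively pushes the result downstream.
interface Operator {
    void push(Object[] tuple);
}

class FilterOperator implements Operator {
    private final Predicate<Object[]> predicate;
    private final Operator output;

    FilterOperator(Predicate<Object[]> predicate, Operator output) {
        this.predicate = predicate;
        this.output = output;
    }

    @Override
    public void push(Object[] tuple) {
        if (predicate.test(tuple)) {
            output.push(tuple); // actively push matching tuples downstream
        }
    }
}
```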

4. Exchange data.

The data exchange is done by a pair of special operators: the "**send operator**" and the "**receive operator**". They
belong to two tasks at two different locations. The "send operator" is a sink operator in one task, which sends the
tuples pushed to it to the paired "receive operator" via a network link; the "receive operator" is a source operator in
another task, which receives tuples from the network link and pushes them to its outputs.

5. Finish the job.

If a source operator has exhausted its source (e.g., a part scan operator has iterated over all the data in a table
part), a special tuple "**FIN**" is pushed to its outputs; the operator then quits executing and is destroyed. Once all
the inputs of an operator have received "FIN", it likewise pushes "FIN" to its outputs, quits executing, and is
destroyed.

"FIN" can be transferred over net link, so it is can be propagated over all operators of all tasks.