Showing 19 changed files with 928 additions and 0 deletions.

@@ -0,0 +1,2 @@
# Sphinx files
_*

@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

@@ -0,0 +1,102 @@
# Cluster

A DingoDB cluster consists of two distributed components: Coordinator and Executor.

The cluster is managed using [Apache Helix](http://helix.apache.org/). Helix is a cluster management framework for managing partitioned and replicated resources in distributed systems. Helix uses Zookeeper to store cluster state and metadata.

## Coordinator

The coordinator is responsible for these things:

- Coordinators maintain the global metadata (e.g. configs and schemas) of the system, with the help of Zookeeper as the persistent metadata store.
- The coordinator is modeled as a Helix Controller and is responsible for managing executors.
- Coordinators maintain the mapping of which executors are responsible for which table partitions.

There can be multiple instances of the DingoDB coordinator for redundancy.

## Executor

Executors are responsible for maintaining table partitions and provide the ability to read and write the table partitions they are responsible for.

The DingoDB executor is modeled as a Helix Participant, and the tables it is responsible for are modeled as Helix Resources. The partitions of a table are modeled as partitions of the corresponding Helix resource. Thus, a DingoDB executor is responsible for one or more Helix partitions of one or more Helix resources.
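
As a rough illustration of what being modeled as a Helix Participant involves, the sketch below shows how an executor-like process could connect to the cluster through Helix's standard participant API. The Zookeeper address, cluster name and instance name are hypothetical placeholders, not DingoDB's actual bootstrap code.

```java
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;

public class ExecutorBootstrap {
    public static void main(String[] args) throws Exception {
        // Hypothetical names; the real values come from DingoDB configuration.
        String zkAddress = "localhost:2181";
        String clusterName = "DingoCluster";
        String instanceName = "executor-0";

        // Connect this process to the Helix cluster as a Participant.
        // Helix persists cluster state in the Zookeeper ensemble given here.
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
            clusterName, instanceName, InstanceType.PARTICIPANT, zkAddress);
        manager.connect();

        // ... register a state model factory and start serving partitions ...

        Runtime.getRuntime().addShutdownHook(new Thread(manager::disconnect));
    }
}
```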

## Table

Tables are collections of rows and columns representing related data. DingoDB divides each table into partitions that are stored on local disks or in cloud storage.

In DingoDB, a table is modeled as a Helix Resource and each partition of a table is modeled as a Helix Partition.

The resource state model uses the Leader/Follower model:

- Leader: supports both data write and data query operations.
- Follower: synchronizes data from the Leader and supports queries, but not direct writes.
- Offline: the node is offline and does not process data.
- Dropped: deletes the mapping between a data partition and an Executor.

Each data partition in a cluster has only one Leader. The Leader is selected by a Coordinator, preferentially from among the Followers; the selected Follower is promoted to Leader. There can be multiple Followers, depending on the number of replicas.

Whether an Executor can be responsible for a Table is determined by tags. When a Table is created, a tag is set for the Table. During partition allocation, Executors that carry the tag are selected and Executors that do not are excluded.

```{mermaid}
stateDiagram
[*] --> OFFLINE
OFFLINE --> LEADER
OFFLINE --> FOLLOWER
OFFLINE --> DROPPED
LEADER --> FOLLOWER
LEADER --> OFFLINE
FOLLOWER --> LEADER
FOLLOWER --> OFFLINE
DROPPED --> [*]
```
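
In Helix terms, the transitions above are realized by a participant-side state model. The following is a minimal sketch using Helix's standard `StateModel` annotations; the class name and method bodies are illustrative only, not DingoDB's actual code, and only a subset of the transitions is shown.

```java
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Each instance of this class would manage one table partition on one executor.
@StateModelInfo(initialState = "OFFLINE", states = {"LEADER", "FOLLOWER", "OFFLINE", "DROPPED"})
public class PartitionStateModel extends StateModel {
    @Transition(from = "OFFLINE", to = "FOLLOWER")
    public void onBecomeFollowerFromOffline(Message message, NotificationContext context) {
        // Open local storage for the partition and start syncing from the Leader.
    }

    @Transition(from = "FOLLOWER", to = "LEADER")
    public void onBecomeLeaderFromFollower(Message message, NotificationContext context) {
        // Begin accepting writes for this partition.
    }

    @Transition(from = "LEADER", to = "FOLLOWER")
    public void onBecomeFollowerFromLeader(Message message, NotificationContext context) {
        // Stop accepting writes; keep serving reads and syncing.
    }

    @Transition(from = "FOLLOWER", to = "OFFLINE")
    public void onBecomeOfflineFromFollower(Message message, NotificationContext context) {
        // Close resources held for the partition.
    }

    @Transition(from = "OFFLINE", to = "DROPPED")
    public void onBecomeDroppedFromOffline(Message message, NotificationContext context) {
        // Remove the partition-to-executor mapping and delete local state.
    }
}
```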

## cli-tools

### Rebalance

Resets the mapping of which executors are responsible for which table partitions.

When this command is run, it resets the executor mapping for the given table's partitions. The coordinator then detects the change and reassigns the table partitions to executors based on the policies defined by the Leader/Follower state model.

```shell
tools.sh rebalance --name TableName --replicas 3 # if --replicas is not set or is less than 1, the default number of replicas is used
```

### Sqlline

Command-line shell for issuing SQL to DingoDB.

SQLLine is a pure-Java console-based utility for connecting to DingoDB and executing SQL commands. You can use it like `sqlplus` for Oracle or `mysql` for MySQL. It is implemented with the open source project [sqlline](https://github.com/julianhyde/sqlline).

```shell
tools.sh sqlline
```

### Tag

Set a table tag, or add a table tag to an executor.

Because whether an executor can be responsible for a table is determined by tags, you can use this command to add a tag to an executor.

In general, no special tag settings are required: a newly created table uses the default tag, so any executor can be responsible for its table partitions.

If you want specific executors to be responsible for a table's partitions, use this command to set a tag for the table and then add that tag to the desired executors.

```shell
tools.sh tag --name TableName --tag TagName # set table tag
tools.sh tag --name ExecutorName --tag TagName # add tag to executor
tools.sh tag --name ExecutorName --tag TagName --delete # delete tag for executor
```
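
This tagging scheme maps naturally onto Helix's instance tags and a resource's instance group tag. The sketch below shows how such an operation could be expressed with Helix's public admin API, assuming DingoDB uses these Helix features; the cluster, table and executor names are placeholders, and the actual tool may work differently.

```java
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class TagExample {
    public static void main(String[] args) {
        // Hypothetical names for illustration.
        String zkAddress = "localhost:2181";
        String cluster = "DingoCluster";

        HelixAdmin admin = new ZKHelixAdmin(zkAddress);

        // Tag an executor instance so tagged resources may be placed on it.
        admin.addInstanceTag(cluster, "executor-0", "TagName");

        // Tag a table (a Helix resource) so its partitions are only assigned
        // to instances carrying the same tag.
        IdealState idealState = admin.getResourceIdealState(cluster, "TableName");
        idealState.setInstanceGroupTag("TagName");
        admin.setResourceIdealState(cluster, "TableName", idealState);

        admin.close();
    }
}
```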

@@ -0,0 +1,101 @@
# Dingo Expression

Dingo Expression is the expression engine used by DingoDB. It is special in that its runtime codebase is separated from the parsing and compiling codebase. The classes in the runtime are serializable, so they are suitable for the runtime of a distributed computing system like [Apache Flink](https://flink.apache.org/).

## Operators

| Category       | Operator                                     | Associativity |
| :------------- | :------------------------------------------- | :------------ |
| Parenthesis    | `( )`                                        |               |
| Function Call  | `( )`                                        | Left to right |
| Name Index     | `.`                                          | Left to right |
| Array Index    | `[ ]`                                        | Left to right |
| Unary          | `+` `-`                                      | Right to left |
| Multiplicative | `*` `/`                                      | Left to right |
| Additive       | `+` `-`                                      | Left to right |
| Relational     | `<` `<=` `>` `>=` `==` `=` `!=` `<>`         | Left to right |
| String         | `startsWith` `endsWith` `contains` `matches` | Left to right |
| Logical NOT    | `!` `not`                                    | Left to right |
| Logical AND    | `&&` `and`                                   | Left to right |
| Logical OR     | <code>||</code> `or`                         | Left to right |

## Data Types

| Type Name    | SQL type  | JSON Schema Type | Hosting Java Type      | Literal in Expression |
| :----------- | :-------- | :--------------- | :--------------------- | :-------------------- |
| Integer      | INTEGER   |                  | java.lang.Integer      |                       |
| Long         | LONG      | integer          | java.lang.Long         | `0` `20` `-375`       |
| Double       | DOUBLE    | number           | java.lang.Double       | `2.0` `-6.28` `3e-4`  |
| Boolean      | BOOLEAN   | boolean          | java.lang.Boolean      | `true` `false`        |
| String       | CHAR      | string           | java.lang.String       | `"hello"` `'world'`   |
| String       | VARCHAR   | string           | java.lang.String       | `"hello"` `'world'`   |
| Decimal      |           |                  | java.math.BigDecimal   |                       |
| Time         | TIMESTAMP |                  | java.util.Date         |                       |
| IntegerArray |           |                  | java.lang.Integer[]    |                       |
| LongArray    |           | array            | java.lang.Long[]       |                       |
| DoubleArray  |           | array            | java.lang.Double[]     |                       |
| BooleanArray |           | array            | java.lang.Boolean[]    |                       |
| StringArray  |           | array            | java.lang.String[]     |                       |
| DecimalArray |           |                  | java.math.BigDecimal[] |                       |
| ObjectArray  |           | array            | java.lang.Object[]     |                       |
| List         |           | array            | java.util.List         |                       |
| Map          |           | object           | java.util.Map          |                       |
| Object       |           | object           | java.lang.Object       |                       |

## Constants

| Name | Value                   |
| :--- | ----------------------: |
| TAU  | 6.283185307179586476925 |
| E    | 2.7182818284590452354   |

There is no pi constant ("3.14159265...") but `TAU`. See [The Tau Manifesto](https://tauday.com/tau-manifesto).

## Functions

### Mathematical

See [Math (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html).

| Function  | Java function based on | Description |
| :-------- | :--------------------- | :---------- |
| `abs(x)`  | `java.lang.Math.abs`   |             |
| `sin(x)`  | `java.lang.Math.sin`   |             |
| `cos(x)`  | `java.lang.Math.cos`   |             |
| `tan(x)`  | `java.lang.Math.tan`   |             |
| `asin(x)` | `java.lang.Math.asin`  |             |
| `acos(x)` | `java.lang.Math.acos`  |             |
| `atan(x)` | `java.lang.Math.atan`  |             |
| `cosh(x)` | `java.lang.Math.cosh`  |             |
| `sinh(x)` | `java.lang.Math.sinh`  |             |
| `tanh(x)` | `java.lang.Math.tanh`  |             |
| `log(x)`  | `java.lang.Math.log`   |             |
| `exp(x)`  | `java.lang.Math.exp`   |             |

### Type conversion

| Function         | Java function based on | Description                          |
| :--------------- | :--------------------- | :----------------------------------- |
| `int(x)`         |                        | Convert `x` to Integer               |
| `long(x)`        |                        | Convert `x` to Long                  |
| `double(x)`      |                        | Convert `x` to Double                |
| `decimal(x)`     |                        | Convert `x` to Decimal               |
| `string(x)`      |                        | Convert `x` to String                |
| `string(x, fmt)` |                        | Convert `x` to String, `x` is a Time |
| `time(x)`        |                        | Convert `x` to Time                  |
| `time(x, fmt)`   |                        | Convert `x` to Time, `x` is a String |
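
Since Time is hosted by `java.util.Date`, `time(x, fmt)` and `string(x, fmt)` amount to format-driven parsing and printing. The snippet below illustrates the intended semantics using the JDK's `SimpleDateFormat`; it is an analogy for the conversions, not the engine's actual implementation, and the format string is only an example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeConversionExample {
    public static void main(String[] args) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        // Analogous to time("2022-04-01 12:00:00", fmt): String -> Time.
        Date time = fmt.parse("2022-04-01 12:00:00");

        // Analogous to string(x, fmt): Time -> String.
        String text = fmt.format(time);

        System.out.println(time + " <-> " + text);
    }
}
```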

### String

See [String (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html).

| Function             | Java function based on | Description |
| :------------------- | :--------------------- | :---------- |
| `toLowerCase(x)`     | `String::toLowerCase`  |             |
| `toUpperCase(x)`     | `String::toUpperCase`  |             |
| `trim(x)`            | `String::trim`         |             |
| `replace(x, a, b)`   | `String::replace`      |             |
| `substring(x, s)`    | `String::substring`    |             |
| `substring(x, s, e)` | `String::substring`    |             |

(Binary file: not displayed in the diff.)

@@ -0,0 +1,12 @@
Architecture
============

.. toctree::
   :maxdepth: 1
   :glob:

   overview
   cluster
   sql_execution
   network
   expression

@@ -0,0 +1,30 @@
# Network protocols

## Data messages and control messages

There are two types of messages transferred among the cluster nodes: data messages and control messages. Data messages contain encoded tuples delivered by the "send operators", including "FIN". Control messages are used to coordinate the running of jobs. The following control message types are currently implemented:

1. Task delivery message: tasks are serialized and delivered from the coordinator to executors.
2. Receiving ready message: the "receive operators" send messages to the corresponding "send operators" to notify that they are properly started and ready to receive data messages.

The interaction between two tasks that exchange data with each other is illustrated in the following figure.

```{mermaid}
sequenceDiagram
participant R as Task with receive operator
participant S as Task with send operator
R ->> S: RCV_READY
loop Data transfer
S ->> R: Send tuples
end
S ->> R: Send FIN
```

## Serialization and deserialization

The tuples are encoded into data messages in Avro format.

The tasks are encoded in JSON format.
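
As a rough sketch of what encoding a tuple into a data message involves: the record schema below is illustrative (the real schema is derived from the table definition and is not shown here), but the Avro calls are the library's standard ones.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class TupleEncodingExample {
    // Illustrative schema for a two-column tuple.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Tuple\", \"fields\": ["
            + "{\"name\": \"id\", \"type\": \"long\"},"
            + "{\"name\": \"name\", \"type\": \"string\"}]}");

    public static byte[] encode(Object[] tuple) throws IOException {
        // Copy the tuple columns into an Avro record.
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("id", tuple[0]);
        record.put("name", tuple[1]);

        // Serialize the record into compact Avro binary form.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = encode(new Object[] {1L, "hello"});
        System.out.println(payload.length + " bytes");
    }
}
```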

@@ -0,0 +1,54 @@
# Architecture Overview

## Introduction

The main architecture of DingoDB is illustrated in the following figure.

![](images/architecture.png)

The metadata of the database is kept in [Zookeeper](http://zookeeper.apache.org/). When the client requests a DDL operation (e.g. creating or dropping tables), the coordinator accepts the request and modifies the metadata stored in Zookeeper, so the whole cluster will know about the modifications. When the client initiates a DML request (e.g. inserting, updating, deleting data or querying the database), the coordinator accepts the request, parses it and builds a job graph for it, then distributes the tasks of the job to many executors. Each executor runs the tasks dispatched to it, accessing data in the corresponding local storage or processing data received from other tasks. Finally, the result is collected and returned to the client.

## Terminology

### Coordinator

Coordinators are the computing nodes where an SQL query is initiated and started and where the results are collected. There may be several coordinators in a cluster, but for the execution of one SQL statement, only one coordinator is involved.

### Executor

Executors are the computing nodes where the tasks of a job run.

### Part

Data in a table are partitioned into multiple parts stored in different locations.

### Location

Generally, a location is defined as a computing node together with a path where parts are stored. A computing node may contain several locations.

### Job

A job is an execution plan corresponding to an SQL DML statement; it contains several tasks.

### Task

A task is defined as a directed acyclic graph (DAG) of operators together with a specified location.

### Operator

An operator describes a specified operation and belongs to a specified task. An operator can be a "**source operator**" producing data for its task, or a "**sink operator**" consuming input data with no data output; otherwise it consumes data coming from the upstream operators, does calculation and transformation, and pushes the results to its outputs, which are connected to downstream operators.

### Tuple

Data transferred between operators and tasks are generally defined as tuples. In Java, a tuple is defined as `Object[]`.

@@ -0,0 +1,37 @@
# SQL execution

1. Parsing an SQL string into a job graph.

   ```{mermaid}
   flowchart TD
       S[SQL string] -->|parsing| L[Logical Plan]
       L -->|transforming & optimizing| P[Physical Plan]
       P -->|building| J[Job Graph]
   ```

   This is done on the coordinator.

2. Distribute the job.

   After the job is generated and verified, its tasks are serialized and transferred according to their binding locations. Executors receive and deserialize these tasks.

3. Running the job.

   Each task is initiated and executed on its executor. A "**PUSH**" model is used in task execution: the source operators produce tuples actively, and all operators except sink operators push tuples to their outputs actively.

4. Data exchange.

   The data exchange is done by a pair of special operators: the "**send operator**" and the "**receive operator**". They belong to two tasks at two different locations. The "send operator" is a sink operator in one task, which sends the tuples pushed to it to the paired "receive operator" via a network link; the "receive operator" is a source operator in the other task, which receives tuples from the network link and pushes them to its outputs.

5. Finish the job.

   If a source operator has exhausted its source (e.g., a part-scan operator has iterated over all the data in a table part), a special tuple "**FIN**" is pushed to its outputs, and the operator quits executing and is destroyed. Once all the inputs of an operator have received "FIN", it also pushes "FIN" to its outputs, quits executing, and is destroyed (see the sketch after this list).

"FIN" can be transferred over the network link, so it can be propagated through all operators of all tasks.