Showing 19 changed files with 928 additions and 0 deletions.

@@ -0,0 +1,2 @@
# Sphinx files
_*

@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

@@ -0,0 +1,102 @@
# Cluster

A DingoDB cluster consists of two distributed components: Coordinator and Executor.

The cluster is managed using [Apache Helix](http://helix.apache.org/). Helix is a cluster management framework for managing partitioned and replicated resources in distributed systems. Helix uses Zookeeper to store cluster state and metadata.

## Coordinator

The coordinator is responsible for these things:

- Coordinators maintain the global metadata (e.g. configs and schemas) of the system, with the help of Zookeeper as the persistent metadata store.
- The coordinator is modeled as a Helix Controller and is responsible for managing executors.
- Coordinators maintain the mapping of which executors are responsible for which table partitions.

There can be multiple instances of the DingoDB coordinator for redundancy.

## Executor

Executors are responsible for maintaining table partitions and provide the ability to read and write the table partitions they are responsible for.

The DingoDB executor is modeled as a Helix Participant, and the tables it is responsible for are modeled as Helix Resources. The partitions of a table are modeled as partitions of the corresponding Helix resource. Thus, a DingoDB executor is responsible for one or more Helix partitions of one or more Helix resources.
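
As a rough illustration of what being modeled as a Helix Participant involves, the sketch below shows how an executor-like process could connect to the cluster through Helix's standard participant API. The Zookeeper address, cluster name and instance name are hypothetical placeholders, not DingoDB's actual bootstrap code.

```java
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;

public class ExecutorBootstrap {
    public static void main(String[] args) throws Exception {
        // Hypothetical names; the real values come from DingoDB configuration.
        String zkAddress = "localhost:2181";
        String clusterName = "DingoCluster";
        String instanceName = "executor-0";

        // Connect this process to the Helix cluster as a Participant.
        // Helix persists cluster state in the Zookeeper ensemble given here.
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
            clusterName, instanceName, InstanceType.PARTICIPANT, zkAddress);
        manager.connect();

        // ... register a state model factory and start serving partitions ...

        Runtime.getRuntime().addShutdownHook(new Thread(manager::disconnect));
    }
}
```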

## Table

Tables are collections of rows and columns representing related data. DingoDB divides each table into partitions that are stored on local disks or in cloud storage.

In DingoDB, a table is modeled as a Helix Resource and each partition of a table is modeled as a Helix Partition.

The resource state model uses the Leader/Follower model:

- Leader: supports both data write and data query operations.
- Follower: synchronizes data from the Leader and supports queries, but not direct writes.
- Offline: the node is offline and does not process data.
- Dropped: deletes the mapping between a data partition and an Executor.

Each data partition in a cluster has only one Leader. The Leader is selected by a Coordinator, preferentially from among the Followers; the selected Follower is promoted to Leader. There can be multiple Followers, depending on the number of replicas.

Whether an Executor can be responsible for a Table is determined by tags. When a Table is created, a tag is set for the Table. During partition allocation, Executors that carry the tag are selected and Executors that do not are excluded.

```{mermaid}
stateDiagram
[*] --> OFFLINE
OFFLINE --> LEADER
OFFLINE --> FOLLOWER
OFFLINE --> DROPPED
LEADER --> FOLLOWER
LEADER --> OFFLINE
FOLLOWER --> LEADER
FOLLOWER --> OFFLINE
DROPPED --> [*]
```
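
In Helix terms, the transitions above are realized by a participant-side state model. The following is a minimal sketch using Helix's standard `StateModel` annotations; the class name and method bodies are illustrative only, not DingoDB's actual code, and only a subset of the transitions is shown.

```java
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Each instance of this class would manage one table partition on one executor.
@StateModelInfo(initialState = "OFFLINE", states = {"LEADER", "FOLLOWER", "OFFLINE", "DROPPED"})
public class PartitionStateModel extends StateModel {
    @Transition(from = "OFFLINE", to = "FOLLOWER")
    public void onBecomeFollowerFromOffline(Message message, NotificationContext context) {
        // Open local storage for the partition and start syncing from the Leader.
    }

    @Transition(from = "FOLLOWER", to = "LEADER")
    public void onBecomeLeaderFromFollower(Message message, NotificationContext context) {
        // Begin accepting writes for this partition.
    }

    @Transition(from = "LEADER", to = "FOLLOWER")
    public void onBecomeFollowerFromLeader(Message message, NotificationContext context) {
        // Stop accepting writes; keep serving reads and syncing.
    }

    @Transition(from = "FOLLOWER", to = "OFFLINE")
    public void onBecomeOfflineFromFollower(Message message, NotificationContext context) {
        // Close resources held for the partition.
    }

    @Transition(from = "OFFLINE", to = "DROPPED")
    public void onBecomeDroppedFromOffline(Message message, NotificationContext context) {
        // Remove the partition-to-executor mapping and delete local state.
    }
}
```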

## cli-tools

### Rebalance

Resets the mapping of which executors are responsible for which table partitions.

When this command is run, it resets the executor mapping for the given table's partitions. The coordinator then detects the change and reassigns the table partitions to executors based on the policies defined by the Leader/Follower state model.

```shell
tools.sh rebalance --name TableName --replicas 3 # if --replicas is not set or is less than 1, the default number of replicas is used
```

### Sqlline

Command-line shell for issuing SQL to DingoDB.

SQLLine is a pure-Java console-based utility for connecting to DingoDB and executing SQL commands. You can use it like `sqlplus` for Oracle or `mysql` for MySQL. It is implemented with the open source project [sqlline](https://github.com/julianhyde/sqlline).

```shell
tools.sh sqlline
```

### Tag

Set a table tag, or add a table tag to an executor.

Because whether an executor can be responsible for a table is determined by tags, you can use this command to add a tag to an executor.

In general, no special tag settings are required: a newly created table uses the default tag, so any executor can be responsible for its table partitions.

If you want specific executors to be responsible for a table's partitions, use this command to set a tag for the table and then add that tag to the desired executors.

```shell
tools.sh tag --name TableName --tag TagName # set table tag
tools.sh tag --name ExecutorName --tag TagName # add tag to executor
tools.sh tag --name ExecutorName --tag TagName --delete # delete tag for executor
```
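
This tagging scheme maps naturally onto Helix's instance tags and a resource's instance group tag. The sketch below shows how such an operation could be expressed with Helix's public admin API, assuming DingoDB uses these Helix features; the cluster, table and executor names are placeholders, and the actual tool may work differently.

```java
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class TagExample {
    public static void main(String[] args) {
        // Hypothetical names for illustration.
        String zkAddress = "localhost:2181";
        String cluster = "DingoCluster";

        HelixAdmin admin = new ZKHelixAdmin(zkAddress);

        // Tag an executor instance so tagged resources may be placed on it.
        admin.addInstanceTag(cluster, "executor-0", "TagName");

        // Tag a table (a Helix resource) so its partitions are only assigned
        // to instances carrying the same tag.
        IdealState idealState = admin.getResourceIdealState(cluster, "TableName");
        idealState.setInstanceGroupTag("TagName");
        admin.setResourceIdealState(cluster, "TableName", idealState);

        admin.close();
    }
}
```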

@@ -0,0 +1,101 @@
# Dingo Expression

Dingo Expression is the expression engine used by DingoDB. It is special in that its runtime codebase is separated from the parsing and compiling codebase. The classes in the runtime are serializable, so they are suitable for the runtime of a distributed computing system like [Apache Flink](https://flink.apache.org/).

## Operators

| Category       | Operator                                     | Associativity |
| :------------- | :------------------------------------------- | :------------ |
| Parenthesis    | `( )`                                        |               |
| Function Call  | `( )`                                        | Left to right |
| Name Index     | `.`                                          | Left to right |
| Array Index    | `[ ]`                                        | Left to right |
| Unary          | `+` `-`                                      | Right to left |
| Multiplicative | `*` `/`                                      | Left to right |
| Additive       | `+` `-`                                      | Left to right |
| Relational     | `<` `<=` `>` `>=` `==` `=` `!=` `<>`         | Left to right |
| String         | `startsWith` `endsWith` `contains` `matches` | Left to right |
| Logical NOT    | `!` `not`                                    | Left to right |
| Logical AND    | `&&` `and`                                   | Left to right |
| Logical OR     | <code>||</code> `or`                         | Left to right |

## Data Types

| Type Name    | SQL type  | JSON Schema Type | Hosting Java Type      | Literal in Expression |
| :----------- | :-------- | :--------------- | :--------------------- | :-------------------- |
| Integer      | INTEGER   |                  | java.lang.Integer      |                       |
| Long         | LONG      | integer          | java.lang.Long         | `0` `20` `-375`       |
| Double       | DOUBLE    | number           | java.lang.Double       | `2.0` `-6.28` `3e-4`  |
| Boolean      | BOOLEAN   | boolean          | java.lang.Boolean      | `true` `false`        |
| String       | CHAR      | string           | java.lang.String       | `"hello"` `'world'`   |
| String       | VARCHAR   | string           | java.lang.String       | `"hello"` `'world'`   |
| Decimal      |           |                  | java.math.BigDecimal   |                       |
| Time         | TIMESTAMP |                  | java.util.Date         |                       |
| IntegerArray |           |                  | java.lang.Integer[]    |                       |
| LongArray    |           | array            | java.lang.Long[]       |                       |
| DoubleArray  |           | array            | java.lang.Double[]     |                       |
| BooleanArray |           | array            | java.lang.Boolean[]    |                       |
| StringArray  |           | array            | java.lang.String[]     |                       |
| DecimalArray |           |                  | java.math.BigDecimal[] |                       |
| ObjectArray  |           | array            | java.lang.Object[]     |                       |
| List         |           | array            | java.util.List         |                       |
| Map          |           | object           | java.util.Map          |                       |
| Object       |           | object           | java.lang.Object       |                       |

## Constants

| Name | Value                   |
| :--- | ----------------------: |
| TAU  | 6.283185307179586476925 |
| E    | 2.7182818284590452354   |

There is no pi constant ("3.14159265...") but `TAU`. See [The Tau Manifesto](https://tauday.com/tau-manifesto).

## Functions

### Mathematical

See [Math (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html).

| Function  | Java function based on | Description |
| :-------- | :--------------------- | :---------- |
| `abs(x)`  | `java.lang.Math.abs`   |             |
| `sin(x)`  | `java.lang.Math.sin`   |             |
| `cos(x)`  | `java.lang.Math.cos`   |             |
| `tan(x)`  | `java.lang.Math.tan`   |             |
| `asin(x)` | `java.lang.Math.asin`  |             |
| `acos(x)` | `java.lang.Math.acos`  |             |
| `atan(x)` | `java.lang.Math.atan`  |             |
| `cosh(x)` | `java.lang.Math.cosh`  |             |
| `sinh(x)` | `java.lang.Math.sinh`  |             |
| `tanh(x)` | `java.lang.Math.tanh`  |             |
| `log(x)`  | `java.lang.Math.log`   |             |
| `exp(x)`  | `java.lang.Math.exp`   |             |

### Type conversion

| Function         | Java function based on | Description                          |
| :--------------- | :--------------------- | :----------------------------------- |
| `int(x)`         |                        | Convert `x` to Integer               |
| `long(x)`        |                        | Convert `x` to Long                  |
| `double(x)`      |                        | Convert `x` to Double                |
| `decimal(x)`     |                        | Convert `x` to Decimal               |
| `string(x)`      |                        | Convert `x` to String                |
| `string(x, fmt)` |                        | Convert `x` to String, `x` is a Time |
| `time(x)`        |                        | Convert `x` to Time                  |
| `time(x, fmt)`   |                        | Convert `x` to Time, `x` is a String |
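
Since Time is hosted by `java.util.Date`, `time(x, fmt)` and `string(x, fmt)` amount to format-driven parsing and printing. The snippet below illustrates the intended semantics using the JDK's `SimpleDateFormat`; it is an analogy for the conversions, not the engine's actual implementation, and the format string is only an example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeConversionExample {
    public static void main(String[] args) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        // Analogous to time("2022-04-01 12:00:00", fmt): String -> Time.
        Date time = fmt.parse("2022-04-01 12:00:00");

        // Analogous to string(x, fmt): Time -> String.
        String text = fmt.format(time);

        System.out.println(time + " <-> " + text);
    }
}
```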

### String

See [String (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html).

| Function             | Java function based on | Description |
| :------------------- | :--------------------- | :---------- |
| `toLowerCase(x)`     | `String::toLowerCase`  |             |
| `toUpperCase(x)`     | `String::toUpperCase`  |             |
| `trim(x)`            | `String::trim`         |             |
| `replace(x, a, b)`   | `String::replace`      |             |
| `substring(x, s)`    | `String::substring`    |             |
| `substring(x, s, e)` | `String::substring`    |             |

(Binary file: not displayed in the diff.)

@@ -0,0 +1,12 @@
Architecture
============

.. toctree::
   :maxdepth: 1
   :glob:

   overview
   cluster
   sql_execution
   network
   expression

@@ -0,0 +1,30 @@
# Network protocols

## Data messages and control messages

There are two types of messages transferred among the cluster nodes: data messages and control messages. Data messages contain encoded tuples delivered by the "send operators", including "FIN". Control messages are used to coordinate the running of jobs. The following control message types are currently implemented:

1. Task delivery message: tasks are serialized and delivered from the coordinator to executors.
2. Receiving ready message: the "receive operators" send messages to the corresponding "send operators" to notify that they are properly started and ready to receive data messages.

The interaction between two tasks that exchange data with each other is illustrated in the following figure.

```{mermaid}
sequenceDiagram
participant R as Task with receive operator
participant S as Task with send operator
R ->> S: RCV_READY
loop Data transfer
S ->> R: Send tuples
end
S ->> R: Send FIN
```

## Serialization and deserialization

The tuples are encoded into data messages in Avro format.

The tasks are encoded in JSON format.
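
As a rough sketch of what encoding a tuple into a data message involves: the record schema below is illustrative (the real schema is derived from the table definition and is not shown here), but the Avro calls are the library's standard ones.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class TupleEncodingExample {
    // Illustrative schema for a two-column tuple.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Tuple\", \"fields\": ["
            + "{\"name\": \"id\", \"type\": \"long\"},"
            + "{\"name\": \"name\", \"type\": \"string\"}]}");

    public static byte[] encode(Object[] tuple) throws IOException {
        // Copy the tuple columns into an Avro record.
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("id", tuple[0]);
        record.put("name", tuple[1]);

        // Serialize the record into compact Avro binary form.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = encode(new Object[] {1L, "hello"});
        System.out.println(payload.length + " bytes");
    }
}
```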

@@ -0,0 +1,54 @@
# Architecture Overview

## Introduction

The main architecture of DingoDB is illustrated in the following figure.

![](images/architecture.png)

The metadata of the database is kept in [Zookeeper](http://zookeeper.apache.org/). When the client requests a DDL operation (e.g. creating or dropping tables), the coordinator accepts the request and modifies the metadata stored in Zookeeper, so the whole cluster will know about the modifications. When the client initiates a DML request (e.g. inserting, updating, deleting data or querying the database), the coordinator accepts the request, parses it and builds a job graph for it, then distributes the tasks of the job to many executors. Each executor runs the tasks dispatched to it, accessing data in the corresponding local storage or processing data received from other tasks. Finally, the result is collected and returned to the client.

## Terminology

### Coordinator

Coordinators are the computing nodes where an SQL query is initiated and started and where the results are collected. There may be several coordinators in a cluster, but for the execution of one SQL statement, only one coordinator is involved.

### Executor

Executors are the computing nodes where the tasks of a job run.

### Part

Data in a table are partitioned into multiple parts stored in different locations.

### Location

Generally, a location is defined as a computing node together with a path where parts are stored. A computing node may contain several locations.

### Job

A job is an execution plan corresponding to an SQL DML statement; it contains several tasks.

### Task

A task is defined as a directed acyclic graph (DAG) of operators together with a specified location.

### Operator

An operator describes a specified operation and belongs to a specified task. An operator can be a "**source operator**" producing data for its task, or a "**sink operator**" consuming input data with no data output; otherwise it consumes data coming from the upstream operators, does calculation and transformation, and pushes the results to its outputs, which are connected to downstream operators.

### Tuple

Data transferred between operators and tasks are generally defined as tuples. In Java, a tuple is defined as `Object[]`.

@@ -0,0 +1,37 @@
# SQL execution

1. Parsing an SQL string into a job graph.

   ```{mermaid}
   flowchart TD
       S[SQL string] -->|parsing| L[Logical Plan]
       L -->|transforming & optimizing| P[Physical Plan]
       P -->|building| J[Job Graph]
   ```

   This is done on the coordinator.

2. Distribute the job.

   After the job is generated and verified, its tasks are serialized and transferred according to their binding locations. Executors receive and deserialize these tasks.

3. Running the job.

   Each task is initiated and executed on its executor. A "**PUSH**" model is used in task execution: the source operators produce tuples actively, and all operators except sink operators push tuples to their outputs actively.

4. Data exchange.

   The data exchange is done by a pair of special operators: the "**send operator**" and the "**receive operator**". They belong to two tasks at two different locations. The "send operator" is a sink operator in one task, which sends the tuples pushed to it to the paired "receive operator" via a network link; the "receive operator" is a source operator in the other task, which receives tuples from the network link and pushes them to its outputs.

5. Finish the job.

   If a source operator has exhausted its source (e.g., a part-scan operator has iterated over all the data in a table part), a special tuple "**FIN**" is pushed to its outputs, and the operator quits executing and is destroyed. Once all the inputs of an operator have received "FIN", it also pushes "FIN" to its outputs, quits executing, and is destroyed (see the sketch after this list).

"FIN" can be transferred over the network link, so it can be propagated through all operators of all tasks.