docs: hello world readme

Docs only change.
triton-inference-server · Jan 18, 2025 · 2b3ec74 · 2b3ec74
1 parent b4ee99f
commit 2b3ec74
Show file tree

Hide file tree

Showing 2 changed files with 279 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -47,9 +47,9 @@ We provide 3 types of builds:
 2. `TENSORRTLLM` which includes our TRT-LLM backend
 3. `VLLM` which includes our VLLM backend
 
-For example, if you want to build a container for the `VLLM` backend you can run
+For example, if you want to build a container for the `STANDARD` backends you can run
 
-`./container/build.sh --framework VLLM`
+`./container/build.sh`
 
 Please see the instructions in the corresponding example for specific build instructions.
 
@@ -83,3 +83,23 @@ HF_TOKEN```) and mounts common directories such as ```/tmp:/tmp```,
 
 Please see the instructions in the corresponding example for specific
 deployment instructions.
+
+## Hello World
+
+[Hello World](./examples/hello_world)
+
+A basic example demonstrating the new interfaces and concepts of
+triton distributed. In the hello world example, you can deploy a set
+of simple workers to load balance requests from a local work queue.
+
+# Disclaimers
+
+> [!NOTE]
+> This project is currently in the alpha / experimental /
+> rapid-prototyping stage and we will be adding new features incrementally.
+
+1. The `TENSORRTLLM` and `VLLM` containers are WIP and not expected to
+   work out of the box.
+
+2. Testing has primarily been on single node systems with processes
+   launched within a single container.
diff --git a/examples/hello_world/README.md b/examples/hello_world/README.md
@@ -15,3 +15,260 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
 
+# Hello World
+
+A basic example demonstrating the new interfaces and concepts of
+triton distributed. In the hello world example, you can deploy a set
+of simple workers to load balance requests from a local work queue.
+
+The example demonstrates:
+
+1. How to incorporate an existing Triton Core Model into a triton distributed worker.
+2. How to incorporate a standalone python class into a triton distributed worker.
+3. How deploy a set of workers
+4. How to send requests to the triton distributed deployment
+5. Requests over the Request Plane and Data movement over the Data
+   Plane.
+
+## Building the Hello World Environment
+
+The hello world example is designed to be deployed in a containerized
+environment and to work with and without GPU support.
+
+To get started build the "STANDARD" triton distributed development
+environment.
+
+Note: "STANDARD" is the default framework
+
+```
+./container/build.sh
+```
+
+
+## Starting the Deployment
+
+```
+./container/run.sh -it -- python3 -m hello_world.deploy --initialize-request-plane
+```
+
+#### Expected Output
+
+
+```
+Starting Workers
+17:17:09 deployment.py:115[triton_distributed.worker.deployment] INFO:
+
+Starting Worker:
+
+	Config:
+	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
+             data_plane=<function UcpDataPlane at 0x7f477eb5d580>,
+             request_plane_args=([], {}),
+             data_plane_args=([], {}),
+             log_level=1,
+             operators=[OperatorConfig(name='encoder',
+                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
+                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
+                                       version=1,
+                                       max_inflight_requests=1,
+                                       parameters={'config': {'instance_group': [{'count': 1,
+                                                                                  'kind': 'KIND_CPU'}],
+                                                              'parameters': {'delay': {'string_value': '0'},
+                                                                             'input_copies': {'string_value': '1'}}}},
+                                       log_level=None)],
+             triton_log_path=None,
+             name='encoder.0',
+             log_dir='/workspace/examples/hello_world/logs',
+             metrics_port=50000)
+	<SpawnProcess name='encoder.0' parent=1 initial>
+
+17:17:09 deployment.py:115[triton_distributed.worker.deployment] INFO:
+
+Starting Worker:
+
+	Config:
+	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
+             data_plane=<function UcpDataPlane at 0x7f477eb5d580>,
+             request_plane_args=([], {}),
+             data_plane_args=([], {}),
+             log_level=1,
+             operators=[OperatorConfig(name='decoder',
+                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
+                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
+                                       version=1,
+                                       max_inflight_requests=1,
+                                       parameters={'config': {'instance_group': [{'count': 1,
+                                                                                  'kind': 'KIND_CPU'}],
+                                                              'parameters': {'delay': {'string_value': '0'},
+                                                                             'input_copies': {'string_value': '1'}}}},
+                                       log_level=None)],
+             triton_log_path=None,
+             name='decoder.0',
+             log_dir='/workspace/examples/hello_world/logs',
+             metrics_port=50001)
+	<SpawnProcess name='decoder.0' parent=1 initial>
+
+17:17:09 deployment.py:115[triton_distributed.worker.deployment] INFO:
+
+Starting Worker:
+
+	Config:
+	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
+             data_plane=<function UcpDataPlane at 0x7f477eb5d580>,
+             request_plane_args=([], {}),
+             data_plane_args=([], {}),
+             log_level=1,
+             operators=[OperatorConfig(name='encoder_decoder',
+                                       implementation='EncodeDecodeOperator',
+                                       repository='/workspace/examples/hello_world/operators',
+                                       version=1,
+                                       max_inflight_requests=1,
+                                       parameters={},
+                                       log_level=None)],
+             triton_log_path=None,
+             name='encoder_decoder.0',
+             log_dir='/workspace/examples/hello_world/logs',
+             metrics_port=50002)
+	<SpawnProcess name='encoder_decoder.0' parent=1 initial>
+
+Workers started ... press Ctrl-C to Exit
+```
+
+## Sending Requests
+
+From a separate terminal run the sample client.
+
+```
+./container/run.sh -it -- python3 -m hello_world.client
+```
+
+#### Expected Output
+
+```
+
+Client: 0 Received Response: 42 From: 39491f06-d4f7-11ef-be96-047bcba9020e Error: None:  43%|███████▋          | 43/100 [00:04<00:05,  9.83request/s]
+
+Throughput: 9.10294484748811 Total Time: 10.985455989837646
+Clients Stopped Exit Code 0
+
+
+```
+
+## Behind the Scenes
+
+The hello world example is designed to demonstrate and allow
+experimenting with different mixtures of compute and memory loads and
+different numbers of workers for different parts of the hello world
+workflow.
+
+### Hello World Workflow
+
+The hello world workflow is a simple two stage pipeline with an
+encoding stage and a decoding stage plus an encoder-decoder stage to
+orchestrate the overall workflow.
+
+```
+client <-> encoder_decoder <-> encoder
+                      |
+                      -----<-> decoder
+```
+
+
+#### Encoder
+
+The encoder follows the simple procedure:
+
+1. copy the input x times (x is configurable via parameter)
+2. invert the input
+3. delay * size of output
+
+#### Decoder
+
+The decoder follows the simple procedure:
+
+1. remove the extra copies
+2. invert the input
+3. delay * size of output
+
+#### Encoder - Decoder
+
+The encoder-decoder operator controls the overall workflow.
+
+It first sends a request for an encoder. Once it receives the response
+it sends the output from the encoder as an input to the decoder. Note
+in this step memory is transferred directly between the encoder and
+decoder workers - and does not pass through the encoder-decoder.
+
+### Operators
+
+Operators are responsible for actually doing work and responding to
+requests. Operators are supported in two main flavors and are hosted
+by a common Worker class.
+
+#### Triton Core Operator
+
+The triton core operator makes a triton model (following the [standard
+model
+repo](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)
+and backend structure of the tritonserver) available on the request
+plane. Both the encoder and decoder are implemented as triton python
+backend models.
+
+#### Custom Operator
+
+The encoder-decoder operator is a python class that implements the
+Operator interface. Internally it makes remote requests to other
+workers. Generally an operator can make use of other operators for its
+work but isn't required to.
+
+### Workers
+
+Workers host one or more operators and pull requests from the request
+plane and forward them to a local operator.
+
+### Request Plane
+
+The current triton distributed framework leverages a distributed work
+queue for its request plane implementation. The request plane ensures
+that requests for operators are forwarded and serviced by a single
+worker.
+
+### Data Plane
+
+The triton distributed framework leverages point to point data
+transfers using the UCX library to provide optimized primitives for
+device to device transfers.
+
+Data sent over the data plane is only pulled by the worker that needs
+to perform work on it. Requests themselves contain data descriptors
+and can be referenced and shared with other workers.
+
+Note: there is also a provision for sending data in the request
+contents when the message size is small enough that UCX transfer is
+not needed.
+
+### Components
+
+Any process which communicates with one or more of the request or data
+planes is considered a "component". While this example only uses
+"Workers" future examples will also include api servers, routers, and
+other types of components.
+
+### Deployment
+
+The final piece is a deployment. A deployment is a set of components
+deployed across a cluster. Components may be added and removed from
+deployments.
+
+
+## Limitations and Caveats
+
+The example is a rapidly evolving prototype and shouldn't be used in
+production. Limited testing has been done and it is meant to help
+flesh out the triton distributed concepts, architecture, and
+interfaces.
+
+1. No multi-node testing / support has been done
+
+2. No performance tuning / measurement has been done
+