Feature/cpp baby llama rework (#2903)
* Baby Llama - Ported run.c for integration and fixed clang type conversion errors.

Signed-off-by: Shrinath Suresh <[email protected]>

Custom preprocess implementation

Signed-off-by: Shrinath Suresh <[email protected]>

Free memory only after the inference is done

Signed-off-by: Shrinath Suresh <[email protected]>

Implement Postprocess

Signed-off-by: Shrinath Suresh <[email protected]>

Setting Fast compiler option

Signed-off-by: Shrinath Suresh <[email protected]>

Reading checkpoint path and tokenizer path from config file using folly

Signed-off-by: Shrinath Suresh <[email protected]>

Removing run.c from cmake

Signed-off-by: Shrinath Suresh <[email protected]>

Replace auto with appropriate data type

Signed-off-by: Shrinath Suresh <[email protected]>

Using smart pointers and initializing the vector with the appropriate size upfront

Signed-off-by: Shrinath Suresh <[email protected]>

Using smart pointers

Signed-off-by: Shrinath Suresh <[email protected]>

Directly converting the tensor values to prompt token ids

Signed-off-by: Shrinath Suresh <[email protected]>

Moving run.c and common variables to .cc file

Signed-off-by: Shrinath Suresh <[email protected]>

Moving run.c to a separate folder

Signed-off-by: Shrinath Suresh <[email protected]>

Uncommenting the original run.c main method

Signed-off-by: Shrinath Suresh <[email protected]>

Implemented destructor to free up resources

Signed-off-by: Shrinath Suresh <[email protected]>

Supporting files for unit test

Signed-off-by: Shrinath Suresh <[email protected]>

Processing all the batch inputs

Signed-off-by: Shrinath Suresh <[email protected]>

Setting InferenceMode guard

Signed-off-by: Shrinath Suresh <[email protected]>

Updating InferenceMode to use torch::InferenceMode

Signed-off-by: Shrinath Suresh <[email protected]>

Updating class name to BabyLlamaHandler

Signed-off-by: Shrinath Suresh <[email protected]>

Renaming llm_handler target to babyllama_handler

Signed-off-by: Shrinath Suresh <[email protected]>

Adding dummy pt file

Signed-off-by: Shrinath Suresh <[email protected]>

Typo Fix

Signed-off-by: Shrinath Suresh <[email protected]>

Calculate tokens per second for batch input

Signed-off-by: Shrinath Suresh <[email protected]>

Adding README.md for babyllama example

Signed-off-by: Shrinath Suresh <[email protected]>

Fixing out-of-bound mem access in babyllama example

Move model instance out of ts_backend

Use shared_ptr<void> for model to detangle from torchscript

Move BaseHandler to backends/handler

Move model instance into core

Remove Torchscript as a backend and implement it as a handler

Move torchscript test out of backend folder

Remove dummy.pt in babyllama + update README + move babyllama test to new examples/examples_test.cc file

* fix spell check

* Move cpp babyllama example to main example folder

* Add last successful location to error message in handle function

* Fix babyllama batching by changing input/output from tensor to IValue

* rename prompt file

* Fix spellcheck

---------

Co-authored-by: Shrinath Suresh <[email protected]>
mreso and shrinath-suresh authored Jan 26, 2024
1 parent 9e6f1c2 commit 3ecaf0b
Showing 41 changed files with 1,812 additions and 470 deletions.
69 changes: 46 additions & 23 deletions cpp/README.md
@@ -12,7 +12,7 @@ python ts_scripts/install_dependencies.py --cpp [--cuda=cu121|cu118]
### Building the backend
```
## Dev Build
cd serve/cpp
./build.sh [-g cu121|cu118]
## Install TorchServe from source
@@ -34,32 +34,60 @@ cd serve
torchserve --ncs --start --model-store model_store
```
## Backend
TorchServe cpp backend can run as a process, similar to the [TorchServe Python backend](https://github.com/pytorch/serve/tree/master/ts). By default, the cpp backend supports TorchScript models. Other platforms such as MXNet and ONNX can be supported through custom handlers, following the TorchScript example [src/backends/handler/torch_scripted_handler.hh](https://github.com/pytorch/serve/blob/master/src/backends/handler/torch_scripted_handler.hh).
### Custom Handler
By default, TorchServe cpp provides a handler for TorchScript, [src/backends/handler/torch_scripted_handler.hh](https://github.com/pytorch/serve/blob/master/src/backends/handler/torch_scripted_handler.hh). It uses the [BaseHandler](https://github.com/pytorch/serve/blob/master/src/backends/handler/base_handler.hh), which defines the APIs to customize a handler; a simplified sketch of such a handler follows the list below.
* [Initialize](serve/blob/cpp_backend/cpp/src/backends/handler/base_handler.hh#L29)
* [LoadModel](serve/blob/cpp_backend/cpp/src/backends/handler/base_handler.hh#L37)
* [Preprocess](serve/blob/cpp_backend/cpp/src/backends/handler/base_handler.hh#L40)
* [Inference](serve/blob/cpp_backend/cpp/src/backends/handler/base_handler.hh#L46)
* [Postprocess](serve/blob/cpp_backend/cpp/src/backends/handler/base_handler.hh#L53)
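To make the division of labor concrete, here is a minimal, self-contained sketch of a custom handler. It uses a simplified stand-in base class rather than the real `BaseHandler` (whose hooks operate on request/response batches and `torch::IValue`), so treat the signatures as illustrative and consult `base_handler.hh` (linked above) for the actual interface.
```cpp
// Illustrative only: SketchBaseHandler stands in for torchserve::BaseHandler;
// the real interface works on request/response batches and torch::IValue.
#include <iostream>
#include <string>
#include <vector>

struct SketchBaseHandler {
  virtual ~SketchBaseHandler() = default;
  virtual void Initialize(const std::string& model_dir) = 0;               // load config/artifacts
  virtual std::vector<float> Preprocess(const std::string& raw) = 0;       // request -> model input
  virtual std::vector<float> Inference(const std::vector<float>& in) = 0;  // run the model
  virtual std::string Postprocess(const std::vector<float>& out) = 0;      // model output -> response
};

struct MyHandler : SketchBaseHandler {
  void Initialize(const std::string& model_dir) override {
    std::cout << "loading artifacts from " << model_dir << "\n";
  }
  std::vector<float> Preprocess(const std::string& raw) override {
    return {static_cast<float>(raw.size())};  // toy featurization of the request body
  }
  std::vector<float> Inference(const std::vector<float>& in) override {
    return in;  // a real handler would run the (TorchScript or custom) model here
  }
  std::string Postprocess(const std::vector<float>& out) override {
    return std::to_string(out.front());
  }
};
```
In the BabyLlama example below, these hooks roughly map to turning the prompt into token ids (Preprocess), running the llama2.c model (Inference), and assembling the generated text into the response (Postprocess).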
#### Example
##### Using TorchScriptHandler
* set runtime as "LSP" in model archiver option [--runtime](https://github.com/pytorch/serve/tree/master/model-archiver#arguments)
* set handler as "TorchScriptHandler" in model archiver option [--handler](https://github.com/pytorch/serve/tree/master/model-archiver#arguments)
```
torch-model-archiver --model-name mnist_base --version 1.0 --serialized-file mnist_script.pt --handler TorchScriptHandler --runtime LSP
```
Here is an [example](https://github.com/pytorch/serve/tree/cpp_backend/cpp/test/resources/torchscript_model/mnist/base_handler) of an unzipped model mar file.
##### Using Custom Handler
* build a custom handler shared library, for example the [Mnist handler](https://github.com/pytorch/serve/blob/cpp_backend/cpp/src/examples/image_classifier/mnist)
* set runtime as "LSP" in model archiver option [--runtime](https://github.com/pytorch/serve/tree/master/model-archiver#arguments)
* set handler as "libmnist_handler:MnistHandler" in model archiver option [--handler](https://github.com/pytorch/serve/tree/master/model-archiver#arguments)
```
torch-model-archiver --model-name mnist_handler --version 1.0 --serialized-file mnist_script.pt --handler libmnist_handler:MnistHandler --runtime LSP
```
Here is an [example](https://github.com/pytorch/serve/tree/cpp_backend/cpp/test/resources/torchscript_model/mnist/mnist_handler) of an unzipped model mar file.
##### BabyLlama Example
The babyllama example can be found [here](https://github.com/pytorch/serve/blob/master/cpp/src/examples/babyllama/).
To run the example we need to download the model weights as well as the tokenizer file:
```bash
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
```
Subsequently, we need to adjust the paths according to our local file structure in [config.json](https://github.com/pytorch/serve/blob/master/serve/cpp/test/resources/torchscript_model/babyllama/babyllama_handler/config.json).
```json
{
"checkpoint_path" : "/home/ubuntu/serve/cpp/stories15M.bin",
"tokenizer_path" : "/home/ubuntu/serve/cpp/src/examples/babyllama/tokenizer.bin"
}
```
Then we can create the mar file and deploy it with:
```bash
cd serve/cpp/test/resources/torchscript_model/babyllama/babyllama_handler
torch-model-archiver --model-name llm --version 1.0 --handler libbabyllama_handler:BabyLlamaHandler --runtime LSP --extra-files config.json
mkdir model_store && mv llm.mar model_store/
torchserve --ncs --start --model-store model_store

curl -v -X POST "http://localhost:8081/models?initial_workers=1&url=llm.mar"
```
The handler name `libbabyllama_handler:BabyLlamaHandler` combines the name of our shared library (as defined in our [CMakeLists.txt](https://github.com/pytorch/serve/blob/master/serve/cpp/src/examples/CMakeLists.txt)) with the name of the class we chose for our [custom handler class](https://github.com/pytorch/serve/blob/master/serve/cpp/src/examples/babyllama/baby_llama_handler.cc), which derives from BaseHandler. At load time the backend uses these two pieces to open the library and resolve its factory symbols, as sketched below.
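The backend's `LoadHandler` code (shown further down in this diff) builds the symbol names as `allocator<ClassName>` and `deleter<ClassName>` and looks them up in the shared library. The following is a hedged sketch of what a handler library therefore exports; the inline `BabyLlamaHandler` declaration is only a stand-in for the real class in `baby_llama_handler.cc`.
```cpp
// Sketch of the factory symbols the backend resolves via dlopen/dlsym.
// Symbol names follow the allocator<ClassName>/deleter<ClassName> pattern that
// Backend::LoadHandler derives from the --handler string. The class below is a
// stand-in; the real BabyLlamaHandler derives from BaseHandler.
struct BabyLlamaHandler { /* derives from torchserve::BaseHandler in the real example */ };

extern "C" {
BabyLlamaHandler* allocatorBabyLlamaHandler() { return new BabyLlamaHandler(); }
void deleterBabyLlamaHandler(BabyLlamaHandler* p) { delete p; }
}
```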

To test the model we can run:
```bash
cd serve/cpp/test/resources/torchscript_model/babyllama/
curl http://localhost:8080/predictions/llm -T prompt.txt
```
##### Mnist example
* Transform the data on the client side. For example:
```python
import torch
from PIL import Image
from torchvision import transforms
# assumed MNIST preprocessing (ToTensor + Normalize), matching the transform used when scripting the model
image_processing = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
image = Image.open("examples/image_classifier/mnist/test_data/0.png")
image = image_processing(image)
torch.save(image, "0_png.pt")
```
* Run model registration and prediction: [Using BaseHandler](https://github.com/pytorch/serve/blob/cpp_backend/cpp/test/backends/torch_scripted/torch_scripted_backend_test.cc#L54) or [Using customized handler](https://github.com/pytorch/serve/blob/cpp_backend/cpp/test/backends/torch_scripted/torch_scripted_backend_test.cc#L72).
4 changes: 4 additions & 0 deletions cpp/build.sh
@@ -212,6 +212,10 @@ function build() {
mv $DEPS_DIR/../src/examples/libmnist_handler.so $DEPS_DIR/../../test/resources/torchscript_model/mnist/mnist_handler/libmnist_handler.so
fi

if [ -f "$DEPS_DIR/../src/examples/libbabyllama_handler.so" ]; then
mv $DEPS_DIR/../src/examples/libbabyllama_handler.so $DEPS_DIR/../../test/resources/torchscript_model/babyllama/babyllama_handler/libbabyllama_handler.so
fi

cd $DEPS_DIR/../..
if [ -f "$DEPS_DIR/../test/torchserve_cpp_test" ]; then
$DEPS_DIR/../test/torchserve_cpp_test
41 changes: 14 additions & 27 deletions cpp/src/backends/CMakeLists.txt
@@ -15,40 +15,27 @@ target_link_libraries(ts_backends_protocol PRIVATE ts_utils ${FOLLY_LIBRARIES})
install(TARGETS ts_backends_protocol DESTINATION ${torchserve_cpp_SOURCE_DIR}/_build/libs)

# build library ts_backend_core
set(BACKEND_SOURCE_FILES "")
list(APPEND BACKEND_SOURCE_FILES ${TS_BACKENDS_SRC_DIR}/core/backend.cc)
list(APPEND BACKEND_SOURCE_FILES ${TS_BACKENDS_SRC_DIR}/core/model_instance.cc)
list(APPEND BACKEND_SOURCE_FILES ${TS_BACKENDS_SRC_DIR}/handler/base_handler.cc)
list(APPEND BACKEND_SOURCE_FILES ${TS_BACKENDS_SRC_DIR}/handler/torch_scripted_handler.cc)
add_library(ts_backends_core SHARED ${BACKEND_SOURCE_FILES})
target_include_directories(ts_backends_core PUBLIC ${TS_BACKENDS_CORE_SRC_DIR})
target_link_libraries(ts_backends_core PUBLIC ts_utils ts_backends_protocol ${FOLLY_LIBRARIES})
install(TARGETS ts_backends_core DESTINATION ${torchserve_cpp_SOURCE_DIR}/_build/libs)

# build library ts_backend_torch_scripted
set(TS_BACKENDS_TORCH_SCRIPTED_SOURCE_FILES "")
list(APPEND TS_BACKENDS_TORCH_SCRIPTED_SOURCE_FILES ${TS_BACKENDS_TORCH_SCRIPTED_SRC_DIR}/torch_scripted_backend.cc)
list(APPEND TS_BACKENDS_TORCH_SCRIPTED_SOURCE_FILES ${TS_BACKENDS_TORCH_SCRIPTED_SRC_DIR}/handler/base_handler.cc)
add_library(ts_backends_torch_scripted SHARED ${TS_BACKENDS_TORCH_SCRIPTED_SOURCE_FILES})
target_include_directories(ts_backends_torch_scripted PUBLIC
${TS_BACKENDS_TORCH_SCRIPTED_SRC_DIR} ${TS_BACKENDS_TORCH_SCRIPTED_SRC_DIR}/handler ${TORCH_INCLUDE_DIRS})
target_link_libraries(ts_backends_torch_scripted PUBLIC ts_utils ts_backends_core ${TORCH_LIBRARIES})
install(TARGETS ts_backends_torch_scripted DESTINATION ${torchserve_cpp_SOURCE_DIR}/_build/libs)

# build library ts_backend_torch_deploy
#set(TS_BACKENDS_TORCH_DEPLOY_SOURCE_FILES "")
#add_library(ts_backends_torch_deploy SHARED ${TS_BACKENDS_TORCH_DEPLOY_SOURCE_FILES})
#target_include_directories(ts_backends_torch_deploy PUBLIC ${TS_BACKENDS_TORCH_DEPLOY_SRC_DIR})
#target_link_libraries(ts_backends_torch_deploy PRIVATE ts_utils ts_backends_core ${TORCH_LIBRARIES})

# build exe model_worker_socket
add_executable(model_worker_socket
"${TS_BACKENDS_PROCESS_SRC_DIR}/model_worker_socket.cc"
"${TS_BACKENDS_PROCESS_SRC_DIR}/model_worker.cc"
)
target_include_directories(model_worker_socket PRIVATE
${TS_BACKENDS_CORE_SRC_DIR}
${TS_BACKENDS_PROTOCOL_SRC_DIR}
${TS_BACKENDS_PROCESS_SRC_DIR}
${TS_BACKENDS_TORCH_SCRIPTED_SRC_DIR}
)
target_link_libraries(model_worker_socket
PRIVATE ts_backends_core ts_backends_protocol ${FOLLY_LIBRARIES} ${TORCH_LIBRARIES})
install(TARGETS model_worker_socket DESTINATION ${torchserve_cpp_SOURCE_DIR}/_build/bin)
96 changes: 92 additions & 4 deletions cpp/src/backends/core/backend.cc
@@ -1,6 +1,63 @@
#include "src/backends/core/backend.hh"

#include <memory>

#include "src/backends/handler/handler_factory.hh"

namespace torchserve {
Backend::Backend() {}

Backend::~Backend() {
handler_.reset();
model_instance_table_.clear();
// Todo: do proper cleanup
// dl_loader_->CloseDL();
}

bool Backend::Initialize(const std::string &model_dir) {
random_generator_.seed(time(0));
manifest_ = std::make_shared<torchserve::Manifest>();
// TODO: windows
if (!manifest_->Initialize(
fmt::format("{}/MAR-INF/MANIFEST.json", model_dir))) {
return false;
}

LoadHandler(model_dir);

if (!handler_) {
return false;
}

handler_->Initialize(model_dir, manifest_);

return true;
}

void Backend::LoadHandler(const std::string &model_dir) {
const std::string &handler_str = manifest_->GetModel().handler;
std::size_t delimiter_pos = handler_str.find(manifest_->kHandler_Delimiter);
if (delimiter_pos != std::string::npos) {
#ifdef __APPLE__
std::string lib_path = fmt::format("{}/{}.dylib", model_dir,
handler_str.substr(0, delimiter_pos));
#else
std::string lib_path = fmt::format("{}/{}.so", model_dir,
handler_str.substr(0, delimiter_pos));
#endif
std::string handler_class_name = handler_str.substr(delimiter_pos + 1);
std::string allocator_func = fmt::format("allocator{}", handler_class_name);
std::string deleter_func = fmt::format("deleter{}", handler_class_name);
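// e.g. for handler "libbabyllama_handler:BabyLlamaHandler" the loader below
// resolves the exported symbols "allocatorBabyLlamaHandler" and
// "deleterBabyLlamaHandler" from libbabyllama_handler.so (.dylib on macOS).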

dl_loader_ = std::make_unique<DLLoader<BaseHandler>>(
lib_path, allocator_func, deleter_func);
dl_loader_->OpenDL();
handler_ = dl_loader_->GetInstance();
} else {
handler_ = HandlerFactory::GetInstance().createHandler(handler_str);
}
}

std::unique_ptr<torchserve::LoadModelResponse> Backend::LoadModel(
std::shared_ptr<torchserve::LoadModelRequest> load_model_request) {
/**
@@ -13,12 +70,43 @@ std::unique_ptr<torchserve::LoadModelResponse> Backend::LoadModel(
* - status_READY: return the model instance if it is already.
*
* Common steps:
* https://github.com/pytorch/serve/blob/master/ts/model_loader.py#L62
*/

// TODO: support request envelope:
// serve/tree/master/ts/torch_handler/request_envelope

return LoadModelInternal(std::move(load_model_request));
}

std::unique_ptr<LoadModelResponse> Backend::LoadModelInternal(
std::shared_ptr<LoadModelRequest> load_model_request) {
std::string model_instance_id = BuildModelInstanceId(load_model_request);
try {
model_instance_table_[model_instance_id] = {
ModelInstanceStatus::INIT, std::shared_ptr<ModelInstance>(nullptr)};

auto result = handler_->LoadModel(load_model_request);
SetModelInstanceInfo(model_instance_id, ModelInstanceStatus::READY,
std::make_shared<ModelInstance>(
model_instance_id, std::move(result.first),
handler_, std::move(result.second)));

ready_model_instance_ids_.emplace_back(model_instance_id);
std::string message =
fmt::format("loaded model {}", load_model_request->model_name);
return std::make_unique<LoadModelResponse>(
// TODO: check current response msg content
200, message);
} catch (const c10::Error &e) {
SetModelInstanceInfo(model_instance_id, ModelInstanceStatus::FAILED,
std::shared_ptr<ModelInstance>(nullptr));
return std::make_unique<LoadModelResponse>(
// TODO: check existing
500, e.msg());
}
}

std::string Backend::BuildModelInstanceId(
std::shared_ptr<torchserve::LoadModelRequest> load_model_request) {
std::string device_type("cpu");
@@ -30,15 +118,15 @@ std::string Backend::BuildModelInstanceId(
}

void Backend::SetModelInstanceInfo(
const std::string &model_instance_id, ModelInstanceStatus new_status,
std::shared_ptr<torchserve::ModelInstance> new_model_instance) {
model_instance_table_[model_instance_id].status = new_status;
model_instance_table_[model_instance_id].model_instance =
std::move(new_model_instance);
}

torchserve::Backend::ModelInstanceStatus Backend::GetModelInstanceStatus(
const std::string &model_instance_id) {
auto model_instance_info = model_instance_table_.find(model_instance_id);
if (model_instance_info == model_instance_table_.end()) {
return torchserve::Backend::ModelInstanceStatus::NOT_INIT;
@@ -47,7 +135,7 @@ torchserve::Backend::ModelInstanceStatus Backend::GetModelInstanceStatus(
}

std::shared_ptr<torchserve::ModelInstance> Backend::GetModelInstance(
const std::string &model_instance_id) {
auto model_instance_info = model_instance_table_.find(model_instance_id);
if (model_instance_info == model_instance_table_.end()) {
return std::shared_ptr<torchserve::ModelInstance>(nullptr);