[CI/Docs/Examples] - Replace llama with llama2 model (#1219)
* replace llama with llama2

* shellcheck

* rename test

* fix

* rename test

* fix

* use text completion prompt, turn off hf sampling by default

* fix output name

* formatting

* avoid python 3.12 for now

* fix

* fixes for falcon

* fix
goliaro authored Nov 5, 2023
1 parent 1105f4e commit bd305f7
Showing 21 changed files with 193 additions and 156 deletions.
24 changes: 12 additions & 12 deletions .github/README.md
@@ -72,7 +72,7 @@ ff.init(
Second, we specify the LLM to serve and the SSM(s) used to accelerate LLM serving. The list of supported LLMs and SSMs is available at [supported models](#supported-llms-and-ssms).
```python
# Specify the LLM
-llm = ff.LLM("decapoda-research/llama-7b-hf")
+llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms=[]
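# (Editor's sketch, not part of this diff excerpt: the list is then populated with
#  one or more small draft models, e.g. the boost-tuned LLaMA SSMs listed in the
#  table further down. The exact call shape is an assumption based on the ff.SSM class.)
ssm = ff.SSM("JackFram/llama-160m-base")
ssms.append(ssm)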
@@ -116,7 +116,7 @@ ff.init(
)

# Create the FlexFlow LLM
-llm = ff.LLM("decapoda-research/llama-7b-hf")
+llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Create the sampling configs
generation_config = ff.GenerationConfig(
@@ -152,8 +152,8 @@ A C++ example is available at [this folder](../inference/spec_infer/). After bui
* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0)
* `-ll:fsize`: size of device memory on each GPU in MB
* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) in MB. FlexFlow Serve keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters.
-* `-llm-model`: the LLM model ID from HuggingFace (e.g. "decapoda-research/llama-7b-hf")
-* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
+* `-llm-model`: the LLM model ID from HuggingFace (e.g. "meta-llama/Llama-2-7b-hf")
+* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m-base"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
* `-cache-folder`: the folder used to cache the model weights and tokenizer files downloaded from HuggingFace
* `-data-parallelism-degree`, `-tensor-parallelism-degree` and `-pipeline-parallelism-degree`: parallelization degrees in the data, tensor, and pipeline dimensions. Their product must equal the number of GPUs available on the machine. When any of the three parallelism degree arguments is omitted, a default value of 1 will be used.
* `-prompt`: (optional) path to the prompt file. FlexFlow Serve expects a json format file for prompts. In addition, users can also use the following API for registering requests:
@@ -162,7 +162,7 @@ A C++ example is available at [this folder](../inference/spec_infer/). After bui
For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-68M models for speculative inference.

```bash
-./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
+./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model meta-llama/Llama-2-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
```
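
The `-prompt` file is read by the Python examples in this commit with a plain `json.load` and iterated as a list of strings (see `inference/python/incr_decoding.py` further down), so a minimal prompt file is just a JSON array of prompts. A small sketch that writes one:

```python
import json

# Minimal prompt file: a JSON array of prompt strings, matching how the
# Python inference examples in this commit read it (json.load + list).
prompts = [
    "Three tips for staying healthy are: ",
    "Here are some travel tips for Tokyo:\n",
]
with open("prompt.json", "w") as f:
    json.dump(prompts, f, indent=2)
```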
</details>

@@ -193,13 +193,13 @@ Below is a list of models that we have explicitly tested and for which a SSM may

| Model | Model id on HuggingFace | Boost-tuned SSMs |
| :---- | :---- | :---- |
-| LLaMA-7B | decapoda-research/llama-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| LLaMA-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
| OPT-6.7B | facebook/opt-6.7b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
| OPT-13B | facebook/opt-13b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
| OPT-30B | facebook/opt-30b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
6 changes: 3 additions & 3 deletions .github/workflows/gpu-ci-skip.yml
@@ -15,7 +15,7 @@ on:
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
- "tests/inference_tests.sh"
- "tests/multi_gpu_tests.sh"
- "tests/training_tests.sh"
- "tests/python_interface_test.sh"
workflow_dispatch:

@@ -44,8 +44,8 @@ jobs:
steps:
- run: 'echo "No gpu-ci required"'

-gpu-ci-flexflow:
-name: Single Machine, Multiple GPUs Tests
+training-tests:
+name: Training Tests
runs-on: ubuntu-20.04
# if: ${{ github.event_name != 'pull_request' || github.base_ref != 'inference' }}
needs: inference-tests
15 changes: 8 additions & 7 deletions .github/workflows/gpu-ci.yml
@@ -15,7 +15,7 @@ on:
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
- "tests/inference_tests.sh"
- "tests/multi_gpu_tests.sh"
- "tests/training_tests.sh"
- "tests/python_interface_test.sh"
push:
branches:
@@ -34,7 +34,7 @@ on:
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
- "tests/inference_tests.sh"
- "tests/multi_gpu_tests.sh"
- "tests/training_tests.sh"
- "tests/python_interface_test.sh"
workflow_dispatch:

@@ -141,7 +141,8 @@ jobs:
run:
shell: bash -l {0} # required to use an activated conda environment
env:
CONDA: "3"
CONDA: "3"
HUGGINGFACE_TOKEN: ${{ secrets.HUGGINGFACE_TOKEN }}
needs: gpu-ci-concierge
container:
image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest
@@ -185,7 +186,7 @@ jobs:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
# GPT tokenizer test
-./tests/gpt_tokenizer_test.sh
+# ./tests/gpt_tokenizer_test.sh
# Inference tests
source ./build/set_python_envs.sh
@@ -209,8 +210,8 @@
if: always()
run: sudo rm -rf ~/.cache

-gpu-ci-flexflow:
-name: Single Machine, Multiple GPUs Tests
+training-tests:
+name: Training Tests
runs-on: [self-hosted, gpu]
# skip this time-consuming test for PRs to the inference branch
# if: ${{ github.event_name != 'pull_request' || github.base_ref != 'inference' }}
@@ -266,5 +267,5 @@ jobs:
# C++ tests
./tests/cpp_gpu_tests.sh 4
# Python tests
-./tests/multi_gpu_tests.sh 4
+./tests/training_tests.sh 4
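
Note on the new `HUGGINGFACE_TOKEN` environment variable added to this workflow: the Llama-2 checkpoints referenced throughout this commit are gated on HuggingFace, so the CI presumably needs an access token to download them. A minimal sketch of how a test script could pick the token up (the variable name matches the workflow above; `login` is the standard `huggingface_hub` call):

```python
import os
from huggingface_hub import login

# Authenticate only if the CI exported a token; gated repos such as
# meta-llama/Llama-2-7b-hf cannot be downloaded anonymously.
token = os.environ.get("HUGGINGFACE_TOKEN")
if token:
    login(token=token)
```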
6 changes: 3 additions & 3 deletions .github/workflows/multinode-test.yml
@@ -78,7 +78,7 @@ jobs:
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_btl_vader_single_copy_mechanism=none
-./tests/multi_gpu_tests.sh 2 2
+./tests/training_tests.sh 2 2
multinode-gpu-test-ucx:
name: Multinode GPU Test with UCX
@@ -129,7 +129,7 @@
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_btl_vader_single_copy_mechanism=none
-./tests/multi_gpu_tests.sh 2 2
+./tests/training_tests.sh 2 2
multinode-gpu-test-native-ucx:
name: Multinode GPU Test with native UCX
@@ -177,7 +177,7 @@
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_btl_vader_single_copy_mechanism=none
-./tests/multi_gpu_tests.sh 2 2
+./tests/training_tests.sh 2 2
notify-slack:
name: Notify Slack in case of failure
2 changes: 1 addition & 1 deletion INSTALL.md
@@ -97,7 +97,7 @@ source ./build/set_python_envs.sh
cd "$FF_HOME"
./python/flexflow_python examples/python/native/mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize <size of gpu buffer> -ll:zsize <size of zero buffer>
```
-A script to run all the Python examples is available at `tests/multi_gpu_tests.sh`
+A script to run all the Python examples is available at `tests/training_tests.sh`

### Run FlexFlow C++ examples

24 changes: 12 additions & 12 deletions SERVE.md
@@ -32,7 +32,7 @@ ff.init(
Second, we specify the LLM to serve and the SSM(s) used to accelerate LLM serving. The list of supported LLMs and SSMs is available at [supported models](#supported-llms-and-ssms).
```python
# Specify the LLM
-llm = ff.LLM("decapoda-research/llama-7b-hf")
+llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms=[]
@@ -78,7 +78,7 @@ ff.init(
)

# Create the FlexFlow LLM
-llm = ff.LLM("decapoda-research/llama-7b-hf")
+llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Create the sampling configs
generation_config = ff.GenerationConfig(
@@ -116,8 +116,8 @@ A C++ example is available at [this folder](../inference/spec_infer/). After bui
* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0)
* `-ll:fsize`: size of device memory on each GPU in MB
* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) in MB. FlexFlow Serve keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters.
-* `-llm-model`: the LLM model ID from HuggingFace (e.g. "decapoda-research/llama-7b-hf")
-* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
+* `-llm-model`: the LLM model ID from HuggingFace (e.g. "meta-llama/Llama-2-7b-hf")
+* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m-base"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
* `-cache-folder`: the folder used to cache the model weights and tokenizer files downloaded from HuggingFace
* `-data-parallelism-degree`, `-tensor-parallelism-degree` and `-pipeline-parallelism-degree`: parallelization degrees in the data, tensor, and pipeline dimensions. Their product must equal the number of GPUs available on the machine. When any of the three parallelism degree arguments is omitted, a default value of 1 will be used.
* `-prompt`: (optional) path to the prompt file. FlexFlow Serve expects a json format file for prompts. In addition, users can also use the following API for registering requests:
@@ -126,7 +126,7 @@ A C++ example is available at [this folder](../inference/spec_infer/). After bui
For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-68M models for speculative inference.

```bash
-./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
+./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model meta-llama/Llama-2-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
```
</details>

@@ -157,13 +157,13 @@ Below is a list of models that we have explicitly tested and for which a SSM may

| Model | Model id on HuggingFace | Boost-tuned SSMs |
| :---- | :---- | :---- |
-| LLaMA-7B | decapoda-research/llama-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
-| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| LLaMA-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
+| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m-base) |
| OPT-6.7B | facebook/opt-6.7b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
| OPT-13B | facebook/opt-13b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
| OPT-30B | facebook/opt-30b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
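
The `ff.GenerationConfig(` call in the SERVE.md Python snippet above is cut off by the diff hunk. A minimal, hedged sketch of how the sampling config and a generation call might fit together; the parameter names and values are illustrative assumptions, not taken from this commit, and only `llm.generate(...)` with a plain string appears verbatim in the example files below:

```python
import flexflow.serve as ff

# Assumes ff.init(...) has run and `llm` is the ff.LLM created in the snippet above.
# Illustrative sampling settings; check SERVE.md in the repository for the exact
# arguments FlexFlow's GenerationConfig accepts.
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Grounded in this commit: generate() accepts a single prompt string.
result = llm.generate("Three tips for staying healthy are: ")
```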
2 changes: 1 addition & 1 deletion conda/environment.yml
@@ -3,7 +3,7 @@ channels:
- defaults
- conda-forge
dependencies:
-- python>=3.6
+- python>=3.6,<3.12
- cffi>=1.11.0
- Pillow
- pybind11
2 changes: 1 addition & 1 deletion conda/flexflow.yml
@@ -3,7 +3,7 @@ channels:
- defaults
- conda-forge
dependencies:
-- python>=3.6
+- python>=3.6,<3.12
- cffi>=1.11.0
- Pillow
- pybind11
2 changes: 1 addition & 1 deletion inference/MODEL_WEIGHTS.md
@@ -2,7 +2,7 @@ To convert the weights of a HuggingFace LLM to SpecInfer's weight format, we fir

```python
from transformers import AutoModelForCausalLM
-model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

for name, params in model.named_parameters():
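    # (Illustrative continuation; the hunk ends here. One plausible way to dump each
    #  tensor to a flat binary file, assuming a "weights/" directory already exists.
    #  The exact path and naming scheme FlexFlow expects may differ.)
    params.detach().cpu().numpy().tofile(f"weights/{name.replace('.', '_')}")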
4 changes: 2 additions & 2 deletions inference/python/incr_decoding.py
@@ -43,7 +43,7 @@ def get_configs():
# required parameters
"num_gpus": 4,
"memory_per_gpu": 14000,
"zero_copy_memory_per_node": 30000,
"zero_copy_memory_per_node": 40000,
# optional parameters
"num_cpus": 4,
"legion_utility_processors": 4,
@@ -108,7 +108,7 @@ def main():
prompts = [s for s in json.load(open(configs.prompt))]
results = llm.generate(prompts)
else:
result = llm.generate("Here are some travel tips for Tokyo:\n")
result = llm.generate("Three tips for staying healthy are: ")


if __name__ == "__main__":
6 changes: 3 additions & 3 deletions inference/python/spec_infer.py
@@ -43,7 +43,7 @@ def get_configs():
# required parameters
"num_gpus": 4,
"memory_per_gpu": 14000,
"zero_copy_memory_per_node": 30000,
"zero_copy_memory_per_node": 40000,
# optional parameters
"num_cpus": 4,
"legion_utility_processors": 4,
@@ -60,7 +60,7 @@ def get_configs():
}
llm_configs = {
# required llm arguments
"llm_model": "decapoda-research/llama-7b-hf",
"llm_model": "meta-llama/Llama-2-7b-hf",
# optional llm parameters
"cache_path": "",
"refresh_cache": False,
@@ -154,7 +154,7 @@ def main():
prompts = [s for s in json.load(open(configs.prompt))]
results = llm.generate(prompts)
else:
result = llm.generate("Here are some travel tips for Tokyo:\n")
result = llm.generate("Three tips for staying healthy are: ")


if __name__ == "__main__":
2 changes: 1 addition & 1 deletion inference/utils/compress_llama_weights.py
@@ -91,7 +91,7 @@ def decompress(packed_data, config):
if __name__ == "__main__":
# torch.set_default_tensor_type(torch.HalfTensor)
# torch.set_default_tensor_type(torch.cuda.HalfTensor)
-model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = CompressionConfig(
num_bits=8, group_size=32, group_dim=0, symmetric=False)
for name, params in model.named_parameters():
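
For context on the `CompressionConfig(num_bits=8, group_size=32, group_dim=0, symmetric=False)` call above: asymmetric group-wise quantization maps each group of 32 consecutive values along dimension 0 onto 8-bit integers with a per-group scale and offset. A generic sketch of that scheme, not the code in `compress_llama_weights.py`:

```python
import numpy as np

def quantize_group(x: np.ndarray, num_bits: int = 8):
    """Asymmetric quantization of one group onto [0, 2**num_bits - 1]."""
    qmax = (1 << num_bits) - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_group(q: np.ndarray, scale: float, lo: float) -> np.ndarray:
    return q.astype(np.float32) * scale + lo

# Example: quantize a 1-D weight slice in groups of 32 along dim 0.
w = np.random.randn(64).astype(np.float32)
groups = [quantize_group(w[i:i + 32]) for i in range(0, w.shape[0], 32)]
w_hat = np.concatenate([dequantize_group(*g) for g in groups])
```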
4 changes: 2 additions & 2 deletions python/flexflow/serve/serve.py
@@ -81,7 +81,7 @@ def __init__(
):
"""Create the LLM object
-:param model_name: The name of the HuggingFace model to use. E.g. 'decapoda-research/llama-7b-hf'
+:param model_name: The name of the HuggingFace model to use. E.g. 'meta-llama/Llama-2-7b-hf'
:type model_name: str
:param data_type: The data type to use for the tensors (e.g. DataType.DT_FLOAT for full precision, or DataType.DT_HALF for half precision), defaults to DataType.DT_HALF
:type data_type: DataType, optional
@@ -439,7 +439,7 @@ def __init__(
):
"""Create the SSM object
-:param model_name: The name of the HuggingFace model to use. E.g. 'decapoda-research/llama-7b-hf'
+:param model_name: The name of the HuggingFace model to use. E.g. 'meta-llama/Llama-2-7b-hf'
:type model_name: str
:param data_type: The data type to use for the tensors (e.g. DataType.DT_FLOAT for full precision, or DataType.DT_HALF for half precision), defaults to DataType.DT_HALF
:type data_type: DataType, optional
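
The docstrings above describe only the `model_name` and `data_type` arguments. A hedged usage sketch combining the two; it assumes `ff.init(...)` has already been called as in SERVE.md, and that `DataType` is exposed on the `flexflow.serve` module as the docstrings suggest:

```python
import flexflow.serve as ff

# Half-precision LLM plus a small draft model (SSM) for speculative decoding;
# the model ids are the ones used elsewhere in this commit.
llm = ff.LLM("meta-llama/Llama-2-7b-hf", data_type=ff.DataType.DT_HALF)
ssm = ff.SSM("JackFram/llama-160m-base", data_type=ff.DataType.DT_HALF)
```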