Benchmarking Various LLM Inference Engines
LLMs excel at text-generation applications such as chat and code completion, producing fluent, context-aware output. However, their large size also makes inference challenging. Basic inference is slow because LLMs generate text autoregressively, one token at a time: each new token requires another forward pass, and the cost of each pass grows as the sequence gets longer. In addition, LLMs have billions of parameters, which makes storing and managing all of those weights in memory difficult. Several frameworks and packages have emerged to optimize LLM inference and serving, and in this blog I'll use and compare the following inference engines:

- TensorRT-LLM
- vLLM
- LMDeploy
- MLC-LLM
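To make the cost of naive decoding concrete, here is a minimal sketch (not part of the benchmark itself) of token-by-token greedy generation with Hugging Face transformers. The model name "gpt2" and the prompt are placeholders; the point is that every new token re-runs the model over the whole sequence generated so far.

```python
# Minimal sketch of naive autoregressive (token-by-token) decoding.
# "gpt2" is only a placeholder model for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

max_new_tokens = 20
with torch.no_grad():
    for _ in range(max_new_tokens):
        # Each step re-processes the entire sequence so far,
        # which is why naive inference slows down as the output grows.
        logits = model(input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Optimized inference engines like the four compared here avoid much of this overhead through techniques such as KV caching, batching, and kernel-level optimizations.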