Adding a New Model#
This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to refer to vllm official doc: Adding a New Model first.
Step 1: Implementing Models with torch and torch_npu#
This section provides instructions for implementing new models compatible with vllm and vllm-ascend.
Before starting:
- Verify whether your model already exists in vllm’s models directory.
- Use existing models’ implementations as templates to accelerate your development.
Method 1: Implementing New Models from Scratch#
Follow vllm’s OPT model adaptation example for guidance.
Key implementation requirements:
- Place model files in the vllm_ascend/models/ directory.
- Follow the standard module structure for decoder-only LLMs (please check out vllm’s implementations for other kinds of models):
  - *ModelForCausalLM (top-level wrapper)
  - *Model (main architecture)
  - *DecoderLayer (transformer block)
  - *Attention and *MLP (specific computation units)
Note
* denotes your model’s unique identifier.
Critical Implementation Details:
All modules must include a prefix argument in __init__(); vllm uses this prefix to match checkpoint weight names and per-module quantization settings.
Required interfaces (see the template below):

| Module Type | Required Methods |
|---|---|
| *ModelForCausalLM | get_input_embeddings(), forward(), compute_logits(), load_weights() |
| *Model | get_input_embeddings(), forward(), load_weights() |
Attention Backend Integration:
Importing attention via from vllm.attention import Attention automatically leverages the attention backend routing of vllm-ascend (see get_attn_backend_cls() in vllm_ascend/platform.py).
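For illustration, a typical construction of the Attention layer looks like the sketch below. SketchAttention and the config lookups are placeholders, and the keyword arguments follow recent vllm releases, so check them against the version you build on:

from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig


class SketchAttention(nn.Module):
    """Illustrative only: the arguments Attention usually needs."""

    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        hf_config = vllm_config.model_config.hf_config
        num_heads = hf_config.num_attention_heads
        head_dim = hf_config.hidden_size // num_heads
        # vllm routes this layer to the backend chosen by
        # get_attn_backend_cls(); no NPU-specific code is needed here.
        self.attn = Attention(
            num_heads=num_heads,
            head_size=head_dim,
            scale=head_dim**-0.5,
            num_kv_heads=getattr(hf_config, "num_key_value_heads", num_heads),
            cache_config=vllm_config.cache_config,
            quant_config=vllm_config.quant_config,
            prefix=f"{prefix}.attn",
        )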
Tensor Parallelism:
Use vllm’s parallel layers (ColumnParallelLinear, VocabParallelEmbedding, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations of some of these layers (RMSNorm, VocabParallelEmbedding, etc.) are implemented in the vllm_ascend/ops/ directory.
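As an illustration, a tensor-parallel MLP built from these layers might look like the following sketch (modeled on vllm’s Llama-style MLP; CustomMLP and the size arguments are placeholders):

from torch import nn
from vllm.model_executor.layers.activation import SiluAndMul
from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                               RowParallelLinear)


class CustomMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, prefix: str = ""):
        super().__init__()
        # Column-parallel: the output dimension is split across tensor-parallel
        # ranks; gate_proj and up_proj are merged into a single GEMM.
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2, bias=False,
            prefix=f"{prefix}.gate_up_proj")
        # Row-parallel: the input dimension is split and the partial results
        # are all-reduced across ranks.
        self.down_proj = RowParallelLinear(
            intermediate_size, hidden_size, bias=False,
            prefix=f"{prefix}.down_proj")
        self.act_fn = SiluAndMul()

    def forward(self, x):
        gate_up, _ = self.gate_up_proj(x)  # parallel layers return (out, bias)
        x = self.act_fn(gate_up)
        x, _ = self.down_proj(x)
        return x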
Reference Implementation Template (assumed path: vllm_ascend/models/custom_model.py):
from collections.abc import Iterable
from typing import Optional, Union

import torch
from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig
from vllm.sequence import IntermediateTensors
from vllm.model_executor.sampling_metadata import SamplingMetadata


class CustomAttention(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.attn = Attention(prefix=f"{prefix}.attn")

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Implement attention logic
        ...


class CustomDecoderLayer(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn")

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Implement decoder layer
        ...


class CustomModel(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.layers = nn.ModuleList([
            CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}")
            for i in range(vllm_config.model_config.hf_config.num_hidden_layers)
        ])

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...

    def load_weights(self,
                     weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        ...


class CustomModelForCausalLM(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        self.model = CustomModel(vllm_config, prefix=f"{prefix}.model")

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...

    def compute_logits(self,
                       hidden_states: torch.Tensor,
                       sampling_metadata: SamplingMetadata) -> torch.Tensor:
        ...

    def load_weights(self,
                     weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        ...
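The load_weights stubs above can usually delegate to vllm’s AutoWeightsLoader helper, which walks the module tree and matches checkpoint tensor names against each submodule’s prefix. A minimal sketch, assuming your vllm version ships vllm.model_executor.models.utils.AutoWeightsLoader:

from collections.abc import Iterable

import torch
from vllm.model_executor.models.utils import AutoWeightsLoader


# A possible body for CustomModelForCausalLM.load_weights in the template above:
def load_weights(self,
                 weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    loader = AutoWeightsLoader(self)
    # Returns the set of checkpoint weight names that were loaded into
    # this module's parameters.
    return loader.load_weights(weights)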
Method 2: Customizing Existing vLLM Models#
For most use cases, extending an existing implementation is preferable. Below is an example that inherits from an existing vllm model class to implement a custom DeepSeek model (assumed path: vllm_ascend/models/deepseek_v2.py).
from typing import List, Optional, Union

import torch
from vllm.attention import AttentionMetadata
from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM
from vllm.sequence import IntermediateTensors


class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
    # Define merged weights for quantization/efficiency
    packed_modules_mapping = {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    }

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        kv_caches: Optional[List[torch.Tensor]] = None,
        attn_metadata: Optional[AttentionMetadata] = None,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        # Custom forward logic
        hidden_states = self.model(
            input_ids,
            positions,
            kv_caches,
            attn_metadata,
            intermediate_tensors,
            inputs_embeds
        )
        return hidden_states
Note
For a complete implementation reference, see: vllm_ascend/models/deepseek_v2.py.
Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM#
vllm provides a plugin mechanism for registering externally implemented models without modifying its codebase.
To integrate your implemented model from the vllm_ascend/models/ directory:
1. Import your model implementation in vllm_ascend/models/__init__.py using relative imports.
2. Register the model wrapper class via the vllm.ModelRegistry.register_model() function.
Reference Registration Template (an example of registering new models in vllm_ascend/models/__init__.py):
from vllm import ModelRegistry


def register_model():
    from .custom_model import CustomModelForCausalLM  # New custom model
    from .deepseek_v2 import CustomDeepseekV2ForCausalLM  # Customized DeepSeek

    # For NEW architectures: Register with a unique name
    ModelRegistry.register_model(
        "CustomModelForCausalLM",  # Must match config.json's 'architectures'
        "vllm_ascend.models.custom_model:CustomModelForCausalLM"
    )

    # For MODIFIED architectures: Use the original name
    ModelRegistry.register_model(
        "DeepseekV2ForCausalLM",  # Original architecture identifier in vLLM
        "vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM"
    )
Note
The first argument of vllm.ModelRegistry.register_model() indicates the unique architecture identifier, which must match architectures in the config.json of the model.
{
  "architectures": [
    "CustomModelForCausalLM"
  ]
}
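Optionally, before running full inference you can sanity-check the registration from a Python shell. This is only a sketch: it assumes register_model() is exposed from vllm_ascend/models/__init__.py as in the template above, and that your vllm version provides ModelRegistry.get_supported_archs():

from vllm import ModelRegistry
from vllm_ascend.models import register_model

register_model()  # explicit call for this check

# Both the new and the overridden architecture names should now resolve.
supported = ModelRegistry.get_supported_archs()
assert "CustomModelForCausalLM" in supported
assert "DeepseekV2ForCausalLM" in supported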
Step 3: Verification#
Case 1: Overriding Existing vLLM Model Architecture#
If you’re registering a customized model architecture based on vllm’s existing implementation (overriding vllm’s original class), then when executing vllm offline/online inference (using any model) you’ll observe warning logs similar to the following output from vllm/model_executor/models/registry.py.
Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
Case 2: Registering New Model Architecture#
If you’re registering a novel model architecture not present in vllm (creating a completely new class), the logs won’t provide explicit confirmation by default. It’s recommended to add the following logging statement at the end of the register_model method in vllm/model_executor/models/registry.py.
logger.info(f"model_arch: {model_arch} has been registered here!")
After adding this line, you will see confirmation logs shown below when running vllm offline/online inference (using any model).
model_arch: CustomModelForCausalLM has been registered here!
This log output confirms your novel model architecture has been successfully registered in vllm.
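For reference, a minimal offline-inference run is enough to trigger the registry logs above (a sketch using vllm’s standard LLM API; the model path is a placeholder):

from vllm import LLM, SamplingParams

# Loading any model initializes the registry, so the registration or
# override logs described above are printed during this call.
llm = LLM(model="/path/to/your/custom-model")

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, max_tokens=32))
for output in outputs:
    print(output.outputs[0].text)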
Step 4: Testing#
After adding a new model, we should run basic functional tests (offline/online inference), an accuracy test, and a performance benchmark for the model.
Find more details at:
Step 5: Updating Supported Models Doc#
Finally, once all the steps above are completed, add the new model to our Supported Models doc.