Adding a New Model#

This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to read vllm's official doc Adding a New Model first.

Step 1: Implementing Models with torch and torch_npu#

This section provides instructions for implementing new models compatible with vllm and vllm-ascend.

Before starting:

  • Verify whether your model already exists in vllm’s models directory.

  • Use existing models' implementations as templates to accelerate your development.

Method 1: Implementing New Models from Scratch#

Follow vllm’s OPT model adaptation example for guidance.

Key implementation requirements:

  1. Place model files in vllm_ascend/models/ directory.

  2. Standard module structure for decoder-only LLMs (please check out vllm's implementations for other kinds of models):

  • *ModelForCausalLM (top-level wrapper)

  • *Model (main architecture)

  • *DecoderLayer (transformer block)

  • *Attention and *MLP (specific computation units)

Note

* denotes your model’s unique identifier.

  3. Critical Implementation Details:

All modules must include a prefix argument in __init__().

Required interfaces:

| Module Type       | Required Methods                                    |
|-------------------|-----------------------------------------------------|
| *ModelForCausalLM | get_input_embeddings, compute_logits, load_weights  |
| *Model            | get_input_embeddings, load_weights                  |

  4. Attention Backend Integration:

Importing the attention layer via from vllm.attention import Attention automatically leverages vllm-ascend's attention backend routing (see get_attn_backend_cls() in vllm_ascend/platform.py).

  5. Tensor Parallelism:

Use vllm’s parallel layers (ColumnParallelLinear, VocabParallelEmbedding, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in vllm_ascend/ops/ directory (RMSNorm, VocabParallelEmbedding, etc.).
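
For illustration, below is a minimal sketch of such a *MLP block built from vllm's tensor-parallel layers. The layer names (gate_up_proj, down_proj) and the SiluAndMul activation follow the common LLaMA-style convention and are assumptions here; adapt them to your model's architecture.

import torch
from torch import nn
from vllm.model_executor.layers.activation import SiluAndMul
from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                               RowParallelLinear)

class CustomMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, prefix: str = ""):
        super().__init__()
        # Column-parallel layer computing the gate and up projections in one GEMM.
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2, bias=False,
            prefix=f"{prefix}.gate_up_proj")
        # Row-parallel layer projecting back to the hidden size.
        self.down_proj = RowParallelLinear(
            intermediate_size, hidden_size, bias=False,
            prefix=f"{prefix}.down_proj")
        self.act_fn = SiluAndMul()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # vllm's parallel linear layers return an (output, output_bias) tuple.
        gate_up, _ = self.gate_up_proj(x)
        x = self.act_fn(gate_up)
        x, _ = self.down_proj(x)
        return x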

Reference Implementation Template (assumed path: vllm_ascend/models/custom_model.py):

from collections.abc import Iterable
from typing import Optional, Union

import torch
from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig
from vllm.sequence import IntermediateTensors
from vllm.model_executor.sampling_metadata import SamplingMetadata

class CustomAttention(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
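        # Note: besides prefix, vllm's Attention constructor also requires
        # num_heads, head_size and scale (and optionally num_kv_heads,
        # cache_config, quant_config), typically derived from
        # vllm_config.model_config.hf_config.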
        self.attn = Attention(prefix=f"{prefix}.attn")

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Implement attention logic
        ...

class CustomDecoderLayer(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn")

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Implement decoder layer
        ...

class CustomModel(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.layers = nn.ModuleList([
            CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") 
            for i in range(vllm_config.model_config.hf_config.num_hidden_layers)
        ])

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...

    def load_weights(self, 
                    weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        ...

class CustomModelForCausalLM(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        self.model = CustomModel(vllm_config, prefix=f"{prefix}.model")

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...

    def compute_logits(self,
                      hidden_states: torch.Tensor,
                      sampling_metadata: SamplingMetadata) -> torch.Tensor:
        ...

    def load_weights(self, 
                    weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        ...

Method 2: Customizing Existing vLLM Models#

For most use cases, extending an existing implementation is preferable. Below is an example that inherits from the base class to implement a customized DeepSeek model (assumed path: vllm_ascend/models/deepseek_v2.py).

from typing import List, Optional, Union
import torch
from vllm.attention import AttentionMetadata
from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM
from vllm.sequence import IntermediateTensors

class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
    # Define merged weights for quantization/efficiency
    packed_modules_mapping = {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    }

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        kv_caches: Optional[List[torch.Tensor]] = None,
        attn_metadata: Optional[AttentionMetadata] = None,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        # Custom forward logic
        hidden_states = self.model(
            input_ids, 
            positions, 
            kv_caches,
            attn_metadata, 
            intermediate_tensors,
            inputs_embeds
        )
        return hidden_states

Note

For a complete implementation reference, see: vllm_ascend/models/deepseek_v2.py.

Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM#

vllm provides a plugin mechanism for registering externally implemented models without modifying its codebase.

To integrate your implemented model from vllm_ascend/models/ directory:

  1. Import your model implementation in vllm_ascend/models/__init__.py using relative imports.

  2. Register the model wrapper class via vllm.ModelRegistry.register_model() function.

Reference Registration Template (an example of registering new models in vllm_ascend/models/__init__.py):

from vllm import ModelRegistry

def register_model():
    from .custom_model import CustomModelForCausalLM        # New custom model
    from .deepseek_v2 import CustomDeepseekV2ForCausalLM   # Customized DeepSeek

    # For NEW architectures: Register with unique name
    ModelRegistry.register_model(
        "CustomModelForCausalLM",  # Must match config.json's 'architectures'
        "vllm_ascend.models.custom_model:CustomModelForCausalLM"
    )

    # For MODIFIED architectures: Use original name
    ModelRegistry.register_model(
        "DeepseekV2ForCausalLM",   # Original architecture identifier in vLLM
        "vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM  "
    )

Note

The first argument of vllm.ModelRegistry.register_model() indicates the unique architecture identifier which must match architectures in config.json of the model.

{
  "architectures": [
    "CustomModelForCausalLM"
  ],
}

Step 3: Verification#

Case 1: Overriding Existing vLLM Model Architecture#

If you're registering a customized model architecture based on vllm's existing implementation (overriding vllm's original class), you'll observe warning logs similar to the following output from vllm/model_executor/models/registry.py when running vllm offline/online inference (with any model).

Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.

Case 2: Registering New Model Architecture#

If you're registering a novel model architecture not present in vllm (creating a completely new class), the logs won't provide explicit confirmation by default. It is recommended to add the following logging statement at the end of the register_model method in vllm/model_executor/models/registry.py.

logger.info(f"model_arch: {model_arch} has been registered here!")

After adding this line, you will see confirmation logs shown below when running vllm offline/online inference (using any model).

model_arch: CustomModelForCausalLM has been registered here!

This log output confirms your novel model architecture has been successfully registered in vllm.

Step 4: Testing#

After adding a new model, you should run basic functional tests (offline/online inference), accuracy tests, and performance benchmarks for the model.

Find more details in the corresponding accuracy test and performance benchmark docs.
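
As a quick functional check, a minimal offline-inference smoke test might look like the following sketch (the model path is a hypothetical placeholder for your newly added model's checkpoint):

from vllm import LLM, SamplingParams

# Load the newly registered model; replace the path with your local checkpoint.
llm = LLM(model="/path/to/CustomModel", max_model_len=4096)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
for output in outputs:
    print(output.outputs[0].text)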

Step 5: Updating Supported Models Doc#

Finally, once all the steps above are completed, add the new model to our Supported Models doc.