vllm.model_executor.layers.fused_moe.experts.xpu_moe

XPUExpertsWNA16

Bases: XPUExperts

W4A16 INT4-symmetric MoE backed by xpu_fused_moe(is_int4=True).

Weight layout when is_int4=True (per xpu_fused_moe docstring):

    w13:        [num_experts, 2*inter_size, hidden_size]                contiguous int4-packed
    w13_scales: [num_experts, 2*inter_size, hidden_size // group_size]
    w2:         [num_experts, hidden_size, inter_size]                  contiguous int4-packed
    w2_scales:  [num_experts, hidden_size, inter_size // group_size]

Pairs with INCXPULinearMethod for the linear layers; together they cover full-attn + MoE on Intel XPU end-to-end without IPEX.
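To make the layout concrete, here is a minimal sketch that computes the logical shapes listed in the docstring for a given set of dimensions. The helper name `wna16_logical_shapes` is hypothetical and not part of vLLM, and the divisibility check on `group_size` is an assumption rather than documented behaviour. The sketch reports only the logical dimensions; how the int4 values are physically packed into storage elements is not stated here and is not modelled.

```python
def wna16_logical_shapes(
    num_experts: int,
    inter_size: int,
    hidden_size: int,
    group_size: int,
) -> dict[str, tuple[int, int, int]]:
    """Return the logical tensor shapes from the XPUExpertsWNA16 docstring.

    Hypothetical helper for illustration only; not a vLLM API.
    """
    # Assumption: each quantized dimension splits evenly into scale groups.
    if hidden_size % group_size or inter_size % group_size:
        raise ValueError("hidden_size and inter_size must be multiples of group_size")

    return {
        # Fused gate/up projection: 2*inter_size output rows per expert.
        "w13": (num_experts, 2 * inter_size, hidden_size),
        # One scale per group of `group_size` input channels.
        "w13_scales": (num_experts, 2 * inter_size, hidden_size // group_size),
        # Down projection back to hidden_size.
        "w2": (num_experts, hidden_size, inter_size),
        "w2_scales": (num_experts, hidden_size, inter_size // group_size),
    }


shapes = wna16_logical_shapes(
    num_experts=8, inter_size=1024, hidden_size=2048, group_size=128
)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

With these example dimensions, `w13_scales` has 2048 // 128 = 16 groups along its last axis and `w2_scales` has 1024 // 128 = 8.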

Source code in vllm/model_executor/layers/fused_moe/experts/xpu_moe.py
class XPUExpertsWNA16(XPUExperts):
    """W4A16 INT4-symmetric MoE backed by `xpu_fused_moe(is_int4=True)`.

    Weight layout when `is_int4=True` (per `xpu_fused_moe` docstring):
        w13: [num_experts, 2*inter_size, hidden_size]   contiguous int4-packed
        w13_scales: [num_experts, 2*inter_size, hidden_size // group_size]
        w2:  [num_experts, hidden_size, inter_size]     contiguous int4-packed
        w2_scales:  [num_experts, hidden_size, inter_size // group_size]

    Pairs with `INCXPULinearMethod` for the linear layers; together they
    cover full-attn + MoE on Intel XPU end-to-end without IPEX.
    """

    def __init__(
        self,
        moe_config: FusedMoEConfig,
        quant_config: FusedMoEQuantConfig,
        max_num_tokens: int | None = None,
        num_dispatchers: int | None = None,
    ):
        super().__init__(
            moe_config,
            quant_config,
            max_num_tokens,
            num_dispatchers,
        )
        self.is_int4 = True

    @staticmethod
    def _supports_quant_scheme(
        weight_key: QuantKey | None,
        activation_key: QuantKey | None,
    ) -> bool:
        return (weight_key, activation_key) == (kInt4Static, None)