Qwen3-VL

This model was released on 2025-09-23 and added to Hugging Face Transformers on 2025-09-15.

PyTorch FlashAttention SDPA

Qwen3-VL is a multimodal vision-language model series, available in both dense and MoE variants and in Instruct and Thinking versions. Building on its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong text-only capabilities. Key architectural advances include an enhanced MRoPE with an interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based timestamp alignment (evolving from T-RoPE to text timestamp alignment) for more precise temporal grounding. These innovations collectively enable Qwen3-VL to achieve superior performance on complex multimodal tasks.

Model usage

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model and processor. device_map="auto" places the weights on the
# available accelerator(s); SDPA is used as the attention backend.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL")

# A single-turn conversation containing one image and a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {
                "type": "text",
                "text": "Describe this image.",
            },
        ],
    }
]

# Apply the chat template, tokenize, and move the inputs to the model's device.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs.pop("token_type_ids", None)
inputs = inputs.to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is trimmed off).
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
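
Video understanding follows the same pattern. The sketch below is a minimal illustration rather than an official example: it reuses the model and processor loaded above and assumes the chat template accepts a "video" content entry with a "path" field, as in earlier Qwen VL processors. The video path is a placeholder.

# Minimal video-inference sketch (assumption: the chat template accepts a "video"
# entry with a "path" field, as in earlier Qwen VL processors).
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path/to/video.mp4"},  # placeholder path
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]
video_inputs = processor.apply_chat_template(
    video_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
video_inputs = video_inputs.to(model.device)
video_ids = model.generate(**video_inputs, max_new_tokens=128)
video_ids_trimmed = [out[len(inp):] for inp, out in zip(video_inputs.input_ids, video_ids)]
print(processor.batch_decode(video_ids_trimmed, skip_special_tokens=True))

If FlashAttention 2 is installed, attn_implementation="flash_attention_2" can be passed to from_pretrained in place of "sdpa".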

[[autodoc]] Qwen3VLConfig

[[autodoc]] Qwen3VLTextConfig

[[autodoc]] Qwen3VLProcessor

[[autodoc]] Qwen3VLVideoProcessor

[[autodoc]] Qwen3VLVisionModel
    - forward

[[autodoc]] Qwen3VLTextModel
    - forward

[[autodoc]] Qwen3VLModel
    - forward

[[autodoc]] Qwen3VLForConditionalGeneration
    - forward