Qwen3-VL

This model was released on 2025-09-23 and added to Hugging Face Transformers on 2025-09-15.

PyTorch FlashAttention SDPA

Qwen3-VL is a multimodal vision-language model series, available in both dense and MoE variants and in Instruct and Thinking versions. Building on its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong text-only capabilities. Key architectural advances include an enhanced MRoPE with an interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based timestamp alignment (evolving from T-RoPE to text timestamp alignment) for more precise temporal grounding. These innovations collectively enable Qwen3-VL to achieve superior performance on complex multimodal tasks.

Model usage

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model and processor. device_map="auto" places the weights on the
# available accelerator(s); SDPA is used as the attention backend.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL")

# A single-turn conversation containing one image and a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {
                "type": "text",
                "text": "Describe this image.",
            },
        ],
    }
]

# Apply the chat template, tokenize, and move the inputs to the model's device.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs.pop("token_type_ids", None)
inputs = inputs.to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is trimmed off).
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
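
Video understanding follows the same pattern. The sketch below is a minimal illustration rather than an official example: it reuses the model and processor loaded above and assumes the chat template accepts a "video" content entry with a "path" field, as in earlier Qwen VL processors. The video path is a placeholder.

# Minimal video-inference sketch (assumption: the chat template accepts a "video"
# entry with a "path" field, as in earlier Qwen VL processors).
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path/to/video.mp4"},  # placeholder path
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]
video_inputs = processor.apply_chat_template(
    video_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
video_inputs = video_inputs.to(model.device)
video_ids = model.generate(**video_inputs, max_new_tokens=128)
video_ids_trimmed = [out[len(inp):] for inp, out in zip(video_inputs.input_ids, video_ids)]
print(processor.batch_decode(video_ids_trimmed, skip_special_tokens=True))

If FlashAttention 2 is installed, attn_implementation="flash_attention_2" can be passed to from_pretrained in place of "sdpa".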

[[autodoc]] Qwen3VLConfig

[[autodoc]] Qwen3VLTextConfig

[[autodoc]] Qwen3VLProcessor

[[autodoc]] Qwen3VLVideoProcessor

[[autodoc]] Qwen3VLVisionModel
    - forward

[[autodoc]] Qwen3VLTextModel
    - forward

[[autodoc]] Qwen3VLModel
    - forward

[[autodoc]] Qwen3VLForConditionalGeneration
    - forward