Qwen3-VL
This model was released on 2025-09-23 and added to Hugging Face Transformers on 2025-09-15.
Qwen3-VL is a series of multimodal vision-language models that includes dense and MoE variants, in both Instruct and Thinking versions. Building on its predecessors, Qwen3-VL improves visual understanding while maintaining strong text-only capabilities. Key architectural advancements include an enhanced MRoPE with an interleaved layout for better spatial-temporal modeling; DeepStack integration, which leverages multi-level features from the Vision Transformer (ViT); and improved video understanding through text-based time alignment, which moves from T-RoPE to textual timestamps for more precise temporal grounding. Together, these changes are intended to improve performance on complex multimodal tasks.
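To make the "interleaved layout" idea concrete, the following is a minimal sketch in plain Python. It only illustrates how rotary frequency slots could be assigned to the temporal (T), height (H), and width (W) axes in a chunked layout versus an interleaved one. The function names and the section sizes are assumptions chosen for illustration, and the round-robin logic is a simplification, not the Transformers implementation.

```python
def chunked_layout(sections):
    """Chunked layout: all T slots first, then all H slots, then all W slots."""
    t, h, w = sections
    return ["T"] * t + ["H"] * h + ["W"] * w


def interleaved_layout(sections):
    """Interleaved layout: cycle through T, H, W while an axis still has slots left."""
    remaining = dict(zip("THW", sections))
    layout = []
    while any(remaining.values()):
        for axis in "THW":
            if remaining[axis] > 0:
                layout.append(axis)
                remaining[axis] -= 1
    return layout


# Hypothetical section sizes (T, H, W); real models define their own values.
small = (3, 2, 2)
print("".join(chunked_layout(small)))
print("".join(interleaved_layout(small)))
```

In the chunked layout each axis occupies one contiguous band of frequencies, whereas the interleaved layout spreads every axis across the whole frequency range, with any leftover slots of the largest axis ending up at the tail.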
Model usage
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model and its processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL")

# A single user turn containing one image and one text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Apply the chat template and tokenize in one step
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
# Move the inputs to the same device as the model
inputs = inputs.to(model.device)
inputs.pop("token_type_ids", None)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Qwen3VLConfig
[[autodoc]] Qwen3VLConfig
Qwen3VLTextConfig
[[autodoc]] Qwen3VLTextConfig
Qwen3VLProcessor
[[autodoc]] Qwen3VLProcessor
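The overview describes video time alignment as text-based, with timestamps expressed as text rather than only through positional encoding. As a rough illustration of that idea, the sketch below places a textual timestamp in front of each frame's placeholder. The helper name `interleave_timestamps`, the `<frame>` placeholder, and the exact timestamp string format are all assumptions made for this example; the processor's actual tokens and formatting may differ.

```python
def interleave_timestamps(timestamps, frame_placeholder="<frame>"):
    """Prefix each frame placeholder with its timestamp written as plain text."""
    parts = []
    for seconds in timestamps:
        # Hypothetical textual timestamp; the real format is defined by the processor.
        parts.append(f"<{seconds:.1f} seconds>")
        parts.append(frame_placeholder)
    return "".join(parts)


# Three sampled frames at 0.0 s, 0.5 s, and 1.0 s
prompt_fragment = interleave_timestamps([0.0, 0.5, 1.0])
print(prompt_fragment)
```

Because the timestamps are ordinary text tokens, the language model can read and reproduce them directly when asked where in a video an event occurs.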
Qwen3VLVideoProcessor
[[autodoc]] Qwen3VLVideoProcessor
Qwen3VLVisionModel
[[autodoc]] Qwen3VLVisionModel
    - forward
Qwen3VLTextModel
[[autodoc]] Qwen3VLTextModel
    - forward
Qwen3VLModel
[[autodoc]] Qwen3VLModel
    - forward
Qwen3VLForConditionalGeneration
[[autodoc]] Qwen3VLForConditionalGeneration
    - forward