GLM-4.1V
This model was released on 2025-07-01 and added to Hugging Face Transformers on 2025-06-25.
Overview
GLM-4.1V-9B-Thinking is a bilingual vision-language model optimized for reasoning, built on GLM-4-9B. It introduces a “thinking paradigm” trained with reinforcement learning, achieving state-of-the-art results among 10B-class models and rivaling 72B-scale models. It supports a 64k context length, 4K image resolution, and arbitrary aspect ratios, and an open-source base model is available for further research. See the paper for details; its abstract follows.
We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.
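Because of the “thinking paradigm” described in the overview, generated text usually carries a reasoning trace before the final answer. The sketch below shows one way to separate the two in plain Python. It assumes the reasoning is wrapped in `<think>...</think>` and the answer in `<answer>...</answer>`; those tag names and the helper `split_reasoning` are assumptions for illustration, not part of the Transformers API, so check them against real generations (the tags may also be dropped during decoding if the tokenizer treats them as special tokens).

```python
import re

# Assumed tag names; verify against real model output before relying on them.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Hypothetical helper: return (reasoning, answer) from decoded output."""
    think = THINK_RE.search(text)
    reasoning = think.group(1).strip() if think else ""

    answer_match = ANSWER_RE.search(text)
    if answer_match:
        # Explicit answer tags take priority.
        answer = answer_match.group(1).strip()
    elif think:
        # No answer tags: treat everything after the reasoning as the answer.
        answer = text[think.end():].strip()
    else:
        # No tags at all: the whole text is the answer.
        answer = text.strip()
    return reasoning, answer


sample = (
    "<think>The photo shows a large cat lying on snow.</think>"
    "<answer>A chubby cat on snow.</answer>"
)
reasoning, answer = split_reasoning(sample)
print(answer)
```

The helper falls back to returning the full text as the answer when no tags are found, so it is safe to call on output from short generations that were cut off before the reasoning finished.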
The example below demonstrates how to generate text based on an image with Pipeline or the AutoModel class.
```python
# Option 1: Pipeline
import torch
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="THUDM/GLM-4.1V-9B-Thinking",
    device=0,
    dtype=torch.bfloat16,
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
pipe(text=messages, max_new_tokens=20, return_full_text=False)
```

```python
# Option 2: AutoModel
import torch
from transformers import AutoProcessor, Glm4vForConditionalGeneration

model = Glm4vForConditionalGeneration.from_pretrained(
    "THUDM/GLM-4.1V-9B-Thinking",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Video input works the same way as image input: pass a `video` entry in the message content, and the model generates text based on the content of the video.
```python
import torch
from accelerate import Accelerator
from transformers import AutoProcessor, Glm4vForConditionalGeneration

# Use whichever device Accelerate detects
device = Accelerator().device

processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
model = Glm4vForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path="THUDM/GLM-4.1V-9B-Thinking",
    dtype=torch.bfloat16,
    device_map=device,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4",
            },
            {
                "type": "text",
                "text": "Describe this video.",
            },
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=1.0)
# Decode only the tokens generated after the prompt
output_text = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1] :], skip_special_tokens=True
)
print(output_text)
```

Glm4vConfig
[[autodoc]] Glm4vConfig
Glm4vVisionConfig
[[autodoc]] Glm4vVisionConfig
Glm4vTextConfig
[[autodoc]] Glm4vTextConfig
Glm4vImageProcessor
[[autodoc]] Glm4vImageProcessor
    - preprocess
Glm4vVideoProcessor
[[autodoc]] Glm4vVideoProcessor
    - preprocess
Glm4vImageProcessorFast
[[autodoc]] Glm4vImageProcessorFast
    - preprocess
Glm4vProcessor
[[autodoc]] Glm4vProcessor
Glm4vVisionModel
[[autodoc]] Glm4vVisionModel
    - forward
Glm4vTextModel
[[autodoc]] Glm4vTextModel
    - forward
Glm4vModel
[[autodoc]] Glm4vModel
    - forward
Glm4vForConditionalGeneration
[[autodoc]] Glm4vForConditionalGeneration
    - forward