# MLCD

This model was released on 2024-07-24 and added to Hugging Face Transformers on 2025-04-15.

Supported backends: PyTorch (with SDPA attention).

The MLCD models were released by the DeepGlint-AI team in the unicom repository. They are foundational visual encoders for large multimodal language models, trained on large-scale datasets such as LAION400M and COYO700M with a sample-to-cluster contrastive learning objective. MLCD models are primarily used as the vision tower in multimodal large language models such as LLaVA.

The 🔥MLCD-ViT-bigG🔥 series is a state-of-the-art vision transformer enhanced with 2D Rotary Position Embedding (RoPE2D), achieving superior performance on document understanding and visual question answering tasks. Developed by DeepGlint AI, this model demonstrates exceptional capabilities in processing complex visual-language interactions.
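The sketch below illustrates the general idea behind 2D rotary position embeddings on a grid of patch tokens: the feature dimension is split in half, and standard 1D RoPE is applied to one half using the patch's row index and to the other half using its column index. This is a minimal illustration, not the exact MLCD implementation; the helper names `rope_1d` and `rope_2d` are hypothetical.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    # x: (..., dim) with dim even; pos: (...,) integer positions
    dim = x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos[..., None].float() * freqs                            # (..., half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x, rows, cols):
    # First half of the feature dim encodes the row position, second half the column position.
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)

# Example: query vectors for a 4x4 grid of patches, head dimension 64
h = w = 4
q = torch.randn(h * w, 64)
rows = torch.arange(h).repeat_interleave(w)   # row index of each patch
cols = torch.arange(w).repeat(h)              # column index of each patch
q_rot = rope_2d(q, rows, cols)
print(q_rot.shape)  # torch.Size([16, 64])
```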

Tips:

Results:

| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| MLCD (ViT-bigG-14-336px) | ✔ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
| MLCD (ViT-bigG-14-448px) | ✔ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

# Process a single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Run the forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Get visual features
features = outputs.last_hidden_state
print(f"Extracted features shape: {features.shape}")
```

[[autodoc]] MLCDVisionConfig

[[autodoc]] MLCDVisionModel
    - forward