MetaCLIP 2

This model was released on {release_date} and added to Hugging Face Transformers on 2025-08-20.

Overview

MetaCLIP 2 is a replication of the original CLIP model, trained on data covering 300+ languages. It achieves state-of-the-art (SOTA) results on multilingual benchmarks (e.g., XM3600, CVQA, Babel-ImageNet), surpassing the previous SOTA models mSigLIP and SigLIP-2. The authors show that the English and non-English worlds can benefit each other.

This model was contributed by nielsr. The original code can be found here.

You can find all the MetaCLIP 2 checkpoints under the Meta organization.

The examples below demonstrate how to calculate similarity scores between multiple text descriptions and an image, first with Pipeline and then with the AutoModel class. MetaCLIP 2 models are used in the same way as CLIP models; the only difference is that you use the MetaClip2Model class instead of CLIPModel.

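# Option 1: the high-level Pipeline API
import torch
from transformers import pipeline

clip = pipeline(
    task="zero-shot-image-classification",
    model="facebook/metaclip-2-worldwide-huge-quickgelu",
    dtype=torch.bfloat16,
    device=0,  # first GPU; use device=-1 to run on CPU
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Returns a list of {"score": ..., "label": ...} dicts, one per candidate label
result = clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
print(result)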

# Option 2: the AutoModel class
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

checkpoint = "facebook/metaclip-2-worldwide-huge-quickgelu"
model = AutoModel.from_pretrained(checkpoint, dtype=torch.bfloat16, attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Tokenize the labels and preprocess the image in a single call
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_labels)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
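
To make the `logits_per_image` and softmax steps above more concrete, here is a minimal sketch of how CLIP-style models generally turn embeddings into label probabilities: each embedding is L2-normalized, the image embedding is compared to every text embedding by dot product (cosine similarity), the result is multiplied by a logit scale, and a softmax is taken over the labels. The three-dimensional vectors, the `zero_shot_probs` helper, and the logit scale of 100.0 are all illustrative assumptions, not values or functions taken from MetaCLIP 2; the sketch uses only the Python standard library so it runs without downloading a checkpoint.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_embed, text_embeds, logit_scale=100.0):
    """Hypothetical helper: scaled cosine similarities followed by softmax.

    logit_scale=100.0 is an assumed value chosen for illustration only.
    """
    image = l2_normalize(image_embed)
    logits = []
    for text_embed in text_embeds:
        text = l2_normalize(text_embed)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    return softmax(logits)


labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Toy embeddings (made up for this sketch, not produced by any model)
image_embed = [0.9, 0.1, 0.2]
text_embeds = [
    [1.0, 0.0, 0.1],  # closest in direction to the image embedding
    [0.1, 1.0, 0.0],
    [0.0, 0.2, 1.0],
]

probs = zero_shot_probs(image_embed, text_embeds)
best = max(range(len(labels)), key=probs.__getitem__)
print(f"Most likely label: {labels[best]} with probability: {probs[best]:.3f}")
```

Because the similarities are multiplied by a large scale before the softmax, even a modest gap in cosine similarity yields a sharply peaked distribution, which is why zero-shot probabilities from CLIP-style models often look very confident.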

MetaClip2Config

[[autodoc]] MetaClip2Config

MetaClip2TextConfig

[[autodoc]] MetaClip2TextConfig

MetaClip2VisionConfig

[[autodoc]] MetaClip2VisionConfig

MetaClip2Model

[[autodoc]] MetaClip2Model
    - forward
    - get_text_features
    - get_image_features

MetaClip2TextModel

[[autodoc]] MetaClip2TextModel
    - forward

MetaClip2TextModelWithProjection

[[autodoc]] MetaClip2TextModelWithProjection
    - forward

MetaClip2VisionModelWithProjection

[[autodoc]] MetaClip2VisionModelWithProjection
    - forward

MetaClip2VisionModel

[[autodoc]] MetaClip2VisionModel
    - forward

MetaClip2ForImageClassification

[[autodoc]] MetaClip2ForImageClassification
    - forward