# SmolLM3
This model was released on 2025-07-08 and added to Hugging Face Transformers on 2025-06-25.
SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a Transformer decoder architecture with Grouped Query Attention (GQA) to reduce the KV cache, and NoPE, where rotary position embeddings are removed from a subset of layers, to improve performance on long-context tasks. It is trained using a multi-stage training approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.
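To see these architecture choices concretely, you can inspect the checkpoint's configuration without downloading the weights. This is a minimal sketch: `num_attention_heads` and `num_key_value_heads` are standard config fields, while the NoPE-related field name (`no_rope_layers`) is assumed from the released config and may differ across Transformers versions, hence the `getattr` guard.

```python
from transformers import AutoConfig

# Fetching only the config avoids downloading the full checkpoint.
config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# GQA: fewer key/value heads than query heads shrinks the KV cache.
print("query heads:", config.num_attention_heads)
print("kv heads:   ", config.num_key_value_heads)

# NoPE: rotary embeddings are dropped in a subset of layers.
# Field name assumed from the released config; guarded in case it differs.
print("no-RoPE layer mask:", getattr(config, "no_rope_layers", "n/a"))
```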
The examples below demonstrate how to generate text with `Pipeline`, `AutoModel`, and from the command line using the instruction-tuned model.
```python
import torch
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="HuggingFaceTB/SmolLM3-3B",
    dtype=torch.bfloat16,
    device_map=0
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about yourself."},
]
outputs = pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"][-1]["content"])
```
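The instruct chat template also supports an extended-thinking mode. The sketch below follows the model card's convention of a `/no_think` flag in the system prompt to suppress the reasoning trace (the model card also documents an `enable_thinking` argument to `apply_chat_template`); treat both switches as model-card conventions rather than core Transformers API.

```python
import torch
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="HuggingFaceTB/SmolLM3-3B",
    dtype=torch.bfloat16,
    device_map=0
)

# "/no_think" in the system prompt disables the reasoning trace
# (convention documented on the SmolLM3 model card).
messages = [
    {"role": "system", "content": "You are a helpful assistant. /no_think"},
    {"role": "user", "content": "In one sentence, what is Grouped Query Attention?"},
]
outputs = pipe(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1]["content"])
```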
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    cache_implementation="static",
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

```bash
# pip install -U flash-attn --no-build-isolation
transformers chat HuggingFaceTB/SmolLM3-3B --dtype auto --attn_implementation flash_attention_2 --device 0
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to quantize the weights to 4-bit precision.
```python
# pip install -U flash-attn --no-build-isolation
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2"
)

inputs = tokenizer("Gravity is the force", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
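To verify the savings, you can print the loaded model's footprint, continuing from the block above. `get_memory_footprint()` is a standard `PreTrainedModel` helper; the ~6 GB bf16 baseline is a back-of-the-envelope estimate (2 bytes per parameter for a 3B model), not a measured figure.

```python
# Continuing from the 4-bit load above: the footprint should come in well
# under the ~6 GB a 3B-parameter model needs in bf16 (2 bytes/parameter).
print(f"memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```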
inputs = tokenizer("Gravity is the force", return_tensors="pt").to(model.device)outputs = model.generate(**inputs, max_new_tokens=100)print(tokenizer.decode(outputs[0], skip_special_tokens=True))- Ensure your Transformers library version is up-to-date. SmolLM3 requires Transformers>=4.53.0 for full support.
## SmolLM3Config

[[autodoc]] SmolLM3Config

## SmolLM3Model

[[autodoc]] SmolLM3Model
    - forward

## SmolLM3ForCausalLM

[[autodoc]] SmolLM3ForCausalLM
    - forward

## SmolLM3ForSequenceClassification

[[autodoc]] SmolLM3ForSequenceClassification
    - forward

## SmolLM3ForTokenClassification

[[autodoc]] SmolLM3ForTokenClassification
    - forward

## SmolLM3ForQuestionAnswering

[[autodoc]] SmolLM3ForQuestionAnswering
    - forward