Cohere 2
This model was released on 2024-12-13 and added to Hugging Face Transformers on 2024-12-13.
Cohere Command R7B is an open weights research release of a 7 billion parameter model. It is a multilingual model trained on 23 languages and has a context window of 128k tokens. The model features three layers with sliding window attention and RoPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
This model is optimized for speed, cost-performance, and compute resources.
You can find all the original Command-R checkpoints under the Command Models collection.
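The local/global attention layout described above is reflected in the checkpoint's configuration and can be read without downloading any weights. The sketch below is a minimal illustration; the `sliding_window` and `sliding_window_pattern` field names are assumptions about the released config rather than something stated on this page, so verify them against `config.to_dict()`.

```python
from transformers import AutoConfig

# Loads only config.json from the Hub, not the model weights.
config = AutoConfig.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")

# Assumed field names: the local attention window size and how often a
# global-attention layer appears in the repeating layer pattern.
print(config.sliding_window)
print(config.sliding_window_pattern)
print(config.max_position_embeddings)
```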
The examples below demonstrate how to generate text with Pipeline, with the AutoModel class, and from the command line.
```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="text-generation",
    model="CohereLabs/c4ai-command-r7b-12-2024",
    dtype=torch.float16,
    device_map=0
)

messages = [
    {"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"},
]
pipeline(messages)
```

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/c4ai-command-r7b-12-2024",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

# format message with the Command-R chat template
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

```bash
# pip install -U flash-attn --no-build-isolation
transformers chat CohereLabs/c4ai-command-r7b-12-2024 --dtype auto --attn_implementation flash_attention_2
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to quantize the weights to 4 bits.
```python
import torch
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/c4ai-command-r7b-12-2024",
    dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config,
    attn_implementation="sdpa"
)

# format message with the Command-R chat template
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Cohere2Config
[[autodoc]] Cohere2Config
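For quick experimentation, a configuration can also be built from scratch. The sketch below is a minimal example with deliberately tiny, arbitrary sizes; they do not correspond to the released 7B checkpoint, whose values are the defaults of `Cohere2Config`.

```python
from transformers import Cohere2Config, Cohere2Model

# Arbitrary small sizes for illustration only.
config = Cohere2Config(
    vocab_size=1000,
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,
)
model = Cohere2Model(config)  # randomly initialized weights, no download
print(sum(p.numel() for p in model.parameters()))
```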
Cohere2Model
[[autodoc]] Cohere2Model - forward
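The base model returns hidden states rather than logits. Below is a small sketch of its forward pass, again using a tiny, randomly initialized configuration so it runs without downloading the checkpoint; it is only meant for checking shapes.

```python
import torch
from transformers import Cohere2Config, Cohere2Model

config = Cohere2Config(
    vocab_size=1000,
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,
)
model = Cohere2Model(config)

# Random token ids just to exercise the forward pass; use_cache=False keeps the call simple.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
with torch.no_grad():
    outputs = model(input_ids=input_ids, use_cache=False)
print(outputs.last_hidden_state.shape)  # expected: torch.Size([1, 16, 256])
```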
Cohere2ForCausalLM
[[autodoc]] Cohere2ForCausalLM - forward