GPT-2
This model was released on 2019-02-14 and added to Hugging Face Transformers on 2020-11-16.
GPT-2 is a scaled-up version of GPT, a causal transformer language model, with 10x more parameters and training data. The model was pretrained on a 40GB dataset to predict the next word in a sequence based on all the previous words. This approach enabled the model to perform many downstream tasks in a zero-shot setting. The blog post released by OpenAI can be found here.
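As a rough illustration of this next-word objective (not part of the original documentation), the minimal sketch below scores a sentence with GPT2LMHeadModel by passing the input ids as labels; the model shifts the labels internally and returns the average next-token cross-entropy loss. The example sentence is arbitrary.

```py
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

inputs = tokenizer("GPT-2 predicts the next word", return_tensors="pt")

# Passing labels=input_ids makes the model shift the labels internally and
# compute the loss of predicting each token from all the tokens before it.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

print(outputs.loss)  # average next-token negative log-likelihood
```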
The model architecture uses a unidirectional (causal) attention mechanism where each token can only attend to previous tokens, making it particularly effective for text generation tasks.
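The toy sketch below (not taken from the GPT-2 implementation) shows what such a causal mask looks like: a lower-triangular matrix that lets position i attend only to positions at or before i, with disallowed positions set to -inf before the softmax.

```py
import torch

# Toy causal (unidirectional) attention mask for a sequence of 5 tokens.
# Row i marks the positions token i may attend to: itself and earlier tokens only.
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Disallowed positions are filled with -inf so they get zero weight after softmax,
# which means each token's output depends only on previous tokens.
scores = torch.randn(seq_len, seq_len)
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
attn_weights = masked_scores.softmax(dim=-1)
print(attn_weights)
```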
You can find all the original GPT-2 checkpoints under the OpenAI community organization.
The examples below demonstrate how to generate text with Pipeline, with the AutoModel class, and from the command line.
```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="openai-community/gpt2", dtype=torch.float16, device=0)
pipeline("Hello, I'm a language model")
```

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to(model.device)

output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

```bash
echo -e "Hello, I'm a language model" | transformers run --task text-generation --model openai-community/gpt2 --device 0
```

One can also serve the model using vLLM with the transformers backend.
```bash
vllm serve openai-community/gpt2 --model-impl transformers
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to only quantize the weights to 4-bits.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2-xl",
    quantization_config=quantization_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-xl")
inputs = tokenizer("Once upon a time, there was a magical forest", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

- Pad inputs on the right because GPT-2 uses absolute position embeddings (see the sketch after this list).
- GPT-2 can reuse previously computed key-value attention pairs. Access this feature with the past_key_values parameter in forward.
- Enable the scale_attn_by_inverse_layer_idx and reorder_and_upcast_attn parameters to apply the training stability improvements from Mistral.
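The snippet below is a minimal sketch of these notes, assuming the openai-community/gpt2 checkpoint: it pads a batch on the right, reuses past_key_values across two forward calls, and builds a GPT2Config with the two stability flags enabled.

```py
import torch
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

# 1. Right padding: GPT-2 has no pad token by default, so reuse the EOS token
#    and pad on the right because of the absolute position embeddings.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
batch = tokenizer(["Hello, I'm a language model", "Hi"], padding=True, return_tensors="pt")

# 2. Reusing past_key_values: the first call returns the cached key-value pairs,
#    which the second call consumes so only the new token is processed.
with torch.no_grad():
    out = model(**tokenizer("Hello, I'm a language", return_tensors="pt"), use_cache=True)
    next_token = out.logits[:, -1:].argmax(dim=-1)
    out = model(input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)

# 3. Training stability flags from Mistral (a randomly initialized model here;
#    these only matter if you train or fine-tune with this config).
config = GPT2Config(scale_attn_by_inverse_layer_idx=True, reorder_and_upcast_attn=True)
stable_model = GPT2LMHeadModel(config)
```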
GPT2Config
[[autodoc]] GPT2Config
GPT2Tokenizer
[[autodoc]] GPT2Tokenizer
    - save_vocabulary
GPT2TokenizerFast
[[autodoc]] GPT2TokenizerFast
GPT2 specific outputs
[[autodoc]] models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
GPT2Model
[[autodoc]] GPT2Model
    - forward
GPT2LMHeadModel
[[autodoc]] GPT2LMHeadModel
    - forward
GPT2DoubleHeadsModel
[[autodoc]] GPT2DoubleHeadsModel
    - forward
GPT2ForQuestionAnswering
[[autodoc]] GPT2ForQuestionAnswering
    - forward
GPT2ForSequenceClassification
[[autodoc]] GPT2ForSequenceClassification
    - forward
GPT2ForTokenClassification
[[autodoc]] GPT2ForTokenClassification
    - forward