CLIP was trained on 400 million image-text pairs

OpenAI’s 2021 CLIP paper, “Learning Transferable Visual Models From Natural Language Supervision,” reports that the model was trained on “a dataset of 400 million (image, text) pairs collected from the internet.” Learning from naturally occurring captions at that scale, rather than from a fixed set of hand-labeled categories, is what let CLIP recognize visual concepts it was never explicitly trained to classify.
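As a rough illustration of how a model trained on captions can classify without a fixed label set, the sketch below scores one image embedding against the embeddings of a few candidate captions and picks the closest. It is a toy under stated assumptions, not CLIP's actual code: the three-dimensional vectors are hand-picked stand-ins for the outputs of trained image and text encoders, and the names `cosine` and `zero_shot_classify` are hypothetical helpers introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, caption_embeddings):
    """Return the best-matching caption and the similarity score of every caption."""
    scores = {
        caption: cosine(image_embedding, embedding)
        for caption, embedding in caption_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Assumption: hand-picked toy vectors standing in for text-encoder outputs.
caption_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a zebra": [0.0, 0.1, 0.9],
}

# Assumption: a toy vector standing in for the image-encoder output of one picture.
image_embedding = [0.1, 0.2, 0.95]

best, scores = zero_shot_classify(image_embedding, caption_embeddings)
print(best)
```

Because the candidate labels are ordinary text, adding a category in this sketch only means adding another caption to the dictionary; nothing is retrained.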

Sources

Last verified June 6, 2026