CLIP was trained on 400 million image-text pairs

fact

OpenAI’s 2021 CLIP paper, “Learning Transferable Visual Models From Natural Language Supervision,” reports that the model was trained on “a dataset of 400 million (image, text) pairs collected from the internet.” Learning from naturally occurring captions at that scale, rather than from a fixed set of hand-labeled categories, is what let CLIP recognize visual concepts it was never explicitly trained to classify.
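As a rough illustration of how a model trained this way can classify without fixed categories, the sketch below scores one image against candidate captions by cosine similarity and picks the closest. The embedding vectors are made-up toy values, not CLIP outputs, and the helper names (`cosine`, `zero_shot_classify`) are hypothetical. A real setup would obtain the embeddings from the model's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, caption_embeddings):
    """Return the caption whose embedding is closest to the image embedding."""
    scores = {
        caption: cosine(image_embedding, vector)
        for caption, vector in caption_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumed for illustration only, not produced by CLIP).
caption_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

best, scores = zero_shot_classify(image_embedding, caption_embeddings)
print(best)
```

Because the candidate labels are ordinary text, new categories can be added by writing new captions, with no retraining of a classification head.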

Sources

PRIMARY https://arxiv.org/abs/2103.00020

Last verified June 6, 2026
