Self-Supervised Learning

Self-supervised learning is a way of training models on data that has no human-applied labels. Instead, the learning task is constructed from the data itself: part of each input is hidden, and the model learns to predict the hidden part from the rest. Predicting the next word in a sentence, filling in a masked word, and reconstructing a corrupted patch of an image are all self-supervised tasks. Because the “answer” comes from the data, these methods can use the vast amounts of unlabeled text and images on the internet rather than expensive hand-labeled datasets.
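To make the idea concrete, here is a minimal Python sketch of how training examples can be built from raw text with no human labels. It is an illustration only, not how any particular model is trained: the function names and the "[MASK]" placeholder are assumptions made for this example, and real systems work on tokens rather than whitespace-separated words.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def next_word_pairs(sentence):
    """Turn one unlabeled sentence into (context, next word) pairs."""
    words = sentence.split()
    # Each prefix of the sentence is an input; the word that follows is the target.
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_word_example(sentence, rng):
    """Hide one randomly chosen word; the hidden word becomes the target."""
    words = sentence.split()
    i = rng.randrange(len(words))
    masked = words[:i] + [MASK] + words[i + 1:]
    return masked, words[i]


if __name__ == "__main__":
    text = "the cat sat on the mat"
    for context, target in next_word_pairs(text):
        print(context, "->", target)
    print(masked_word_example(text, random.Random(0)))
```

In both functions the target is taken directly from the sentence, which is the sense in which the data supplies its own “answer.”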

The 2023 survey “A Cookbook of Self-Supervised Learning,” whose authors include Yann LeCun, calls the approach “the dark matter of intelligence” and lays out the recipes practitioners use, from choosing the pretext task to setting training hyperparameters. Self-supervised pretraining is the engine behind modern large language models, which learn by predicting the next token, and behind many vision systems that learn useful features before any labeled fine-tuning.

Why business readers should care: Self-supervised learning is why today’s AI models can be trained on enormous unlabeled datasets instead of costly human-annotated ones. It is the reason a single pretrained model can be adapted to many downstream tasks, lowering the cost of building useful AI.
