Foundation model for science

A foundation model for science applies the recipe behind large language models to scientific domains instead of text: pretrain a large network on enormous amounts of data, then adapt it to many downstream tasks. The data might be protein sequences, genomes, single-cell measurements, weather states, chemical structures, or astronomical observations, but the pattern is the same: one large pretrained model that can be fine-tuned or prompted for a wide range of specific problems.

The appeal is that scientific data is often abundant in aggregate but scarce for any one question. A foundation model learns general structure during pretraining, so a problem with only a handful of labeled examples can build on what the model has already learned. This is why these models tend to do best in transfer-learning settings, where training from scratch on so few labeled examples would not be feasible.
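As a rough illustration of the pretrain-then-adapt pattern, the Python sketch below learns a representation from plentiful unlabeled data, freezes it, and then fits a small linear head on just 20 labeled examples. It is a toy under stated assumptions: the data is synthetic, the "pretraining" step is a simple principal-component projection standing in for a large neural network, and the names `sample`, `embed`, and `with_bias` are hypothetical helpers introduced here, not part of any real model's interface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for scientific data (an assumption for illustration):
# 50-dimensional measurements driven by 2 hidden factors plus noise.
n_features, n_latent = 50, 2
mixing = rng.normal(size=(n_latent, n_features))


def sample(n):
    """Draw n synthetic measurements and a +1/-1 label for each."""
    latent = rng.normal(size=(n, n_latent))
    x = latent @ mixing + 0.5 * rng.normal(size=(n, n_features))
    y = np.where(latent[:, 0] > 0, 1.0, -1.0)  # label depends on one hidden factor
    return x, y


# "Pretraining": learn general structure from abundant unlabeled data.
# PCA via SVD is a deliberately simple stand-in for a large pretrained network.
x_unlabeled, _ = sample(5000)
mean = x_unlabeled.mean(axis=0)
_, _, vt = np.linalg.svd(x_unlabeled - mean, full_matrices=False)
components = vt[:n_latent]  # frozen after pretraining


def embed(x):
    """Map raw measurements to the frozen, pretrained representation."""
    return (x - mean) @ components.T


def with_bias(z):
    """Append a constant column so the linear head can learn an offset."""
    return np.hstack([z, np.ones((len(z), 1))])


# "Adaptation": fit a small linear head on only 20 labeled examples.
x_few, y_few = sample(20)
head, *_ = np.linalg.lstsq(with_bias(embed(x_few)), y_few, rcond=None)

# Evaluate on held-out data drawn from the same generator.
x_test, y_test = sample(2000)
pred = np.sign(with_bias(embed(x_test)) @ head)
accuracy = float(np.mean(pred == y_test))
print(f"held-out accuracy with 20 labels: {accuracy:.2f}")
```

The head has only three parameters because the representation was learned beforehand from unlabeled data. In this sketch, that is what lets a small labeled set suffice.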

The genomics model Evo, the single-cell models Geneformer and scGPT, and protein models such as the ESM family are leading examples, and machine-learned weather models play a similar role in the earth sciences. Each is general within its domain and reusable across tasks.

For a general reader, the rise of scientific foundation models is one of the clearest signs that the methods powering consumer AI are being absorbed into research itself. The bet is that, just as one model now handles many text tasks, one model per scientific domain could become the default substrate on which discovery is built.
