Data-Centric AI

Data-centric AI is an approach to building machine-learning systems that puts deliberate, systematic work on the data at the center of the effort, rather than treating the dataset as fixed and focusing only on improving the model. The term was popularized by Andrew Ng, who, through DeepLearning.AI and Landing AI, ran a Data-Centric AI Competition in 2021 that inverted the usual contest format: instead of building the best model for a fixed dataset, participants had to improve a dataset given a fixed model.

The motivating observation is that strong model architectures are now widely available and easy to reuse, so for many real-world problems the remaining gains come from the data. Data-centric techniques include correcting mislabeled examples, making labeling guidelines consistent, adding examples that cover edge cases, removing bad or duplicate data, and applying targeted augmentation. The argument is that careful, repeatable data work often improves a deployed system more than another round of architecture tweaks, and that data quality deserves the same engineering rigor usually reserved for code.
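Two of the techniques above, removing duplicates and surfacing inconsistently labeled examples, can be sketched in a few lines of Python. This is a minimal illustration and not a standard tool: the function names `normalize` and `audit_labels`, the sample reviews, and the rule that two texts count as duplicates when they match after lowercasing and collapsing whitespace are all assumptions made for the example.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    # Assumed duplicate rule: same text after lowercasing and collapsing whitespace.
    return " ".join(text.lower().split())


def audit_labels(examples):
    """Take (text, label) pairs; return (deduplicated, conflicts).

    deduplicated: one (text, label) pair per distinct text whose labels all agree.
    conflicts: (text, sorted labels) for texts that received more than one label.
    """
    labels_by_text = defaultdict(set)
    first_seen = {}
    for text, label in examples:
        key = normalize(text)
        labels_by_text[key].add(label)
        first_seen.setdefault(key, text)  # keep the first spelling encountered

    deduplicated, conflicts = [], []
    for key, labels in labels_by_text.items():
        if len(labels) == 1:
            deduplicated.append((first_seen[key], next(iter(labels))))
        else:
            conflicts.append((first_seen[key], sorted(labels)))
    return deduplicated, conflicts


# Hypothetical sample data, invented for illustration.
examples = [
    ("Great product", "positive"),
    ("great  product", "positive"),  # duplicate of the first row
    ("Arrived late", "negative"),
    ("arrived late", "neutral"),     # same text, different label
    ("Works fine", "positive"),
]

deduplicated, conflicts = audit_labels(examples)
print(deduplicated)
print(conflicts)
```

The conflicting examples are returned for human review rather than resolved automatically, since deciding which label is right usually depends on the labeling guidelines.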

Why a business reader should care: data-centric AI changes where effort and budget should go. The unglamorous work of cleaning and curating data is frequently the highest-return investment in a practical AI project.

Sources

Related