Common Crawl’s overview page describes its corpus as containing “petabytes of data, regularly collected since 2008.” Common Crawl, a nonprofit, hosts this web archive on Amazon Web Services and makes it freely available to download or to analyze in the cloud. The corpus is one of the most common starting points for the text used to train large language models.