CSV

CSV, short for Comma-Separated Values, is one of the oldest and most widely used formats for exchanging tabular data. Each line represents a record, and within a record the fields are separated by commas. Because it is plain text, CSV can be produced and consumed by virtually any tool, from spreadsheets and databases to one-line shell scripts, which is why it remains a lingua franca for moving data between systems.

For decades CSV existed only as a loose convention with many incompatible variations. RFC 4180, “Common Format and MIME Type for Comma-Separated Values (CSV) Files,” authored by Yakov Shafranovich and published by the IETF in October 2005, documents the format that most implementations appear to follow and registers the “text/csv” MIME type. The RFC is informational rather than a binding standard, and it openly acknowledges that implementations differ considerably, which is exactly why a reference description was needed.

The RFC’s core grammar is simple. Records are separated by a CRLF line break, fields within a record are separated by commas, and an optional header line may appear first to name the columns. The document also defines an optional “header” parameter for the MIME type so that a recipient can know whether the first line is data or column names.
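
The grammar described above can be illustrated with a minimal Python sketch that uses the standard library's csv module. The function name write_rfc4180 and the sample column names are hypothetical, chosen only for illustration; the sketch assumes a header line is wanted and that every record ends in CRLF.

```python
import csv
import io


def write_rfc4180(header, rows):
    """Serialize a header and rows as CSV text with CRLF after every record.

    Hypothetical helper for illustration; not part of the RFC or of any library.
    """
    buf = io.StringIO(newline="")  # newline="" leaves "\r\n" untranslated
    writer = csv.writer(
        buf,
        delimiter=",",            # fields are separated by commas
        lineterminator="\r\n",    # records are separated by CRLF
        quoting=csv.QUOTE_MINIMAL,
    )
    writer.writerow(header)       # optional header line naming the columns
    writer.writerows(rows)
    return buf.getvalue()


text = write_rfc4180(["name", "city"], [["Ada", "London"], ["Linus", "Helsinki"]])
print(repr(text))
```

Printing the repr makes the CRLF terminators visible. Whether a recipient treats the first line as column names is not encoded in the file itself, which is the gap the MIME type's “header” parameter is meant to fill.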

Most of CSV's subtleties concern quoting and escaping, and this is where naive parsers fail. RFC 4180 specifies that fields containing line breaks, double quotes, or commas should be enclosed in double quotes. When a field is quoted, any literal double quote inside it must be escaped by doubling it, so a field reading She said "hi" is written as "She said ""hi""". Fields may also be quoted when they contain none of those special characters, so a parser must be prepared for quotes around any field. Because a quoted field can itself contain commas and line breaks, a correct parser cannot simply split on commas or on line endings.
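
The quoting rules can be sketched in a few lines of Python. The helper encode_field and the sample record are hypothetical and exist only to illustrate the rule; the comparison uses the standard library's csv.reader as one example of a quote-aware parser.

```python
import csv
import io


def encode_field(value):
    """Quote a field only if it contains a comma, a double quote, CR, or LF.

    Hypothetical helper for illustration: embedded double quotes are doubled.
    """
    needs_quotes = any(ch in value for ch in ',"\r\n')
    if not needs_quotes:
        return value
    return '"' + value.replace('"', '""') + '"'


# One record with three fields; two of them need quoting.
line = 'Ada,"She said ""hi""","London, UK"\r\n'

# Naive approach: split on every comma, ignoring quotes.
naive = line.rstrip("\r\n").split(",")

# Quote-aware approach: let a real CSV parser handle the record.
parsed = next(csv.reader(io.StringIO(line, newline="")))

print(naive)   # the comma inside "London, UK" splits the field in two
print(parsed)  # three fields, with quotes unescaped
```

The naive split yields four pieces and leaves the doubled quotes in place, while the parser returns the three intended values.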

These rules explain the format’s most notorious pitfalls. Splitting on commas without honoring quotes corrupts any field that legitimately contains a comma or an embedded newline; mishandling doubled quotes mangles values; and real-world files freely diverge from the RFC by using semicolons (common in locales where the comma is a decimal separator), tabs, varying line endings, or byte-order marks. The RFC’s guidance, echoing the robustness principle of RFC 793, to be conservative in what you produce and liberal in what you accept captures the practical reality of working with CSV.
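
A lenient reader for such files might look like the following Python sketch. The names CANDIDATES, guess_delimiter, and read_lenient are hypothetical, and the delimiter guess is a deliberately simple heuristic (try each candidate on the first line and keep the one that yields the most fields), not a method prescribed by the RFC. It assumes UTF-8 input and a first record that contains no embedded line break.

```python
import csv
import io

CANDIDATES = (",", ";", "\t")  # assumed set of delimiters worth trying


def guess_delimiter(first_line):
    """Pick the candidate that splits the first record into the most fields."""
    def field_count(delim):
        return len(next(csv.reader([first_line], delimiter=delim), []))
    return max(CANDIDATES, key=field_count)


def read_lenient(raw):
    """Decode bytes, drop a UTF-8 byte-order mark, and parse with a guessed delimiter."""
    text = raw.decode("utf-8-sig")  # "utf-8-sig" strips a leading BOM if present
    first_line = text.splitlines()[0] if text else ""
    delim = guess_delimiter(first_line)
    # newline="" passes CRLF, LF, and CR through for the csv module to interpret
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delim))


# Semicolon-delimited export with a BOM, decimal commas, and mixed line endings.
raw = b'\xef\xbb\xbfname;price\n"Widget";"1,50"\r\n"Gadget";"2,75"\n'
rows = read_lenient(raw)
print(rows)
```

Python's standard library also offers csv.Sniffer, which applies its own heuristics to guess a dialect; the hand-written guess here is used only because its behavior is easy to reason about. When producing files, the conservative choice is the opposite: commas, CRLF, and quoting wherever the RFC calls for it.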

Despite these hazards, CSV endures because of its simplicity, human readability, and universal tooling support. Formats like JSON offer richer structure and unambiguous typing, but for flat, row-and-column data exported from spreadsheets and databases, CSV remains the default exchange format, and RFC 4180 remains the closest thing to a canonical specification.

Sources

Last verified June 8, 2026