Syntax Highlighting

Syntax highlighting is the editor feature that displays source code in different colors and font styles according to the lexical or syntactic role of each token. Keywords, strings, comments, numbers, types, and function names each get a distinct appearance, so the structure of a program is visible at a glance. The technique can reduce reading errors, makes mismatched brackets and unterminated strings easy to spot, and is among the most widely recognized features of a modern code editor.

The traditional approach is regex-based tokenization. Vim’s syntax documentation describes a system in which a syntax file defines keywords, matches, and regions using pattern rules, then links the resulting syntax groups to highlight groups that the color scheme renders. A match item highlights text matching a pattern, a region highlights everything between a start pattern and an end pattern, and contained groups allow nesting, such as recognizing an escape sequence only inside a string. TextMate-style grammars, later adopted by many editors, work similarly, layering regular expressions to assign scope names to spans of text.
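The match, region, and contained-group ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Vim's or TextMate's actual engine: the rule set, the group names (Comment, String, Escape and so on), and the highlight function are all hypothetical, and the toy language is limited to a handful of keywords.

```python
import re

# Hypothetical keyword list for a toy language.
KEYWORDS = {"def", "return", "if", "else"}

# "Contained" pattern: only ever tried inside a String span.
ESCAPE = re.compile(r"\\.")

# Top-level rules, tried in order at each position.
TOKEN = re.compile(r'''
    (?P<Comment>\#[^\n]*)               # match item: comment to end of line
  | (?P<String>"(?:\\.|[^"\\\n])*")     # region: opening quote to closing quote
  | (?P<Number>\b\d+(?:\.\d+)?\b)       # match item: integer or decimal
  | (?P<Word>\b[A-Za-z_]\w*\b)          # candidate keyword
''', re.VERBOSE)


def highlight(source):
    """Return (start, end, group) spans for the given source text."""
    spans = []
    for m in TOKEN.finditer(source):
        group = m.lastgroup
        if group == "Word":
            if m.group() not in KEYWORDS:
                continue  # plain identifiers stay uncolored in this sketch
            group = "Keyword"
        spans.append((m.start(), m.end(), group))
        if group == "String":
            # Search for escapes only between the quotes of this string.
            for e in ESCAPE.finditer(source, m.start() + 1, m.end() - 1):
                spans.append((e.start(), e.end(), "Escape"))
    return spans
```

Calling highlight on a line such as return "a\n" # done yields one Keyword span, one String span with an Escape span nested inside it, and one Comment span. Restricting the escape pattern to the string's own range is what plays the role of a contained group here.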

Regex tokenizers are fast and simple but fundamentally limited: most programming languages are not regular languages in the formal sense, and many constructs require real parsing to disambiguate. A regex highlighter cannot reliably tell a type name from a variable, handle deeply nested or context-dependent constructs, or recover cleanly from an edit in the middle of a file. As editors demanded richer and more accurate coloring, the limits of pattern matching pushed the field toward grammar-based approaches.
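Nestable block comments are one concrete case of a construct that a single pattern cannot bound. The example is an illustration chosen here rather than one taken from the sources above; it assumes a language in which /* ... */ comments nest, and both function names are hypothetical. The regex stops at the first closing delimiter, while a small scanner that counts depth finds the real end.

```python
import re

# Non-greedy region pattern: ends at the FIRST "*/" it sees.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def regex_comment_end(src):
    """End offset of the comment at the start of src, according to the regex."""
    m = BLOCK_COMMENT.match(src)
    return m.end() if m else None


def nested_comment_end(src):
    """End offset of the comment at the start of src, tracking nesting depth."""
    if not src.startswith("/*"):
        return None
    depth, i = 0, 0
    while i < len(src):
        if src.startswith("/*", i):
            depth += 1
            i += 2
        elif src.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return None  # comment is unterminated


src = "/* outer /* inner */ still comment */ code"
regex_part = src[:regex_comment_end(src)]
nested_part = src[:nested_comment_end(src)]
```

With this input, regex_part ends after the inner comment, so a pattern-based highlighter would color "still comment" as code. The depth counter is a tiny piece of the extra state a real parser carries.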

Tree-sitter represents that shift. Its highlighting documentation explains that coloring is driven by query files run against a concrete syntax tree, where a highlights query uses captures to assign highlight names such as keyword or function to specific nodes. Because the highlighter operates on an actual parse tree rather than line-by-line regular expressions, it can color tokens based on their true syntactic role, support language injection (for example highlighting embedded SQL or HTML inside another language), and distinguish local-variable references from other identifiers.
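The effect of capturing nodes by their position in a tree can be approximated without Tree-sitter itself. The sketch below swaps in Python's standard ast module as the parser, so it is not Tree-sitter's query language or API; the capture_names function is hypothetical and the highlight names (function, type, variable, variable.parameter) are chosen for illustration. Each branch plays the part of one query pattern: it names a highlight for a node based on where that node sits in the tree.

```python
import ast


def capture_names(source):
    """Return sorted (line, col, text, highlight_name) tuples for identifiers."""
    tree = ast.parse(source)

    # First pass: let parent nodes decide the role of the Name nodes they own.
    roles = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            roles[id(node.func)] = "function"
        elif isinstance(node, (ast.arg, ast.AnnAssign)) and isinstance(
            node.annotation, ast.Name
        ):
            roles[id(node.annotation)] = "type"
        elif isinstance(node, ast.FunctionDef) and isinstance(node.returns, ast.Name):
            roles[id(node.returns)] = "type"

    # Second pass: emit one capture per identifier-like node.
    captures = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            role = roles.get(id(node), "variable")
            captures.append((node.lineno, node.col_offset, node.id, role))
        elif isinstance(node, ast.arg):
            captures.append(
                (node.lineno, node.col_offset, node.arg, "variable.parameter")
            )
    return sorted(captures)


sample = "def scale(p: Point, k):\n    return Point(p.x * k)\n"
captures = capture_names(sample)
```

In the sample, the word Point appears twice with identical spelling: once as a parameter annotation, captured as type, and once as the callee of a call, captured as function. A pattern that looks only at the characters of the token has no basis for that distinction, whereas the tree position settles it.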

The move from regex tokenizers to grammar-driven highlighting also unified previously separate features. The same parse tree that powers highlighting can drive code folding, structural selection, and navigation, so a single accurate model of the source serves several editor capabilities at once. Syntax highlighting thus traces the broader arc of editor tooling: from cheap approximations that look right most of the time to precise, parser-backed analysis fast enough to run on every keystroke.
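As a rough sketch of how one tree can serve more than highlighting, the code below derives fold ranges and a crude expanding selection from a single parse. It again assumes Python's ast module as a stand-in parser (end_lineno requires Python 3.8 or later); the FOLDABLE tuple and both function names are hypothetical, and a real editor would work with byte ranges and incremental reparsing rather than whole-file line numbers.

```python
import ast

# Node types treated as foldable in this sketch (an arbitrary choice).
FOLDABLE = (
    ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
    ast.For, ast.While, ast.If, ast.With, ast.Try,
)


def fold_ranges(tree):
    """Sorted (first_line, last_line) pairs for multi-line foldable nodes."""
    ranges = {
        (node.lineno, node.end_lineno)
        for node in ast.walk(tree)
        if isinstance(node, FOLDABLE) and node.end_lineno > node.lineno
    }
    return sorted(ranges)


def expand_selection(tree, line):
    """Fold ranges that contain the given line, innermost first."""
    hits = [r for r in fold_ranges(tree) if r[0] <= line <= r[1]]
    return sorted(hits, key=lambda r: r[1] - r[0])


sample = (
    "class A:\n"
    "    def f(self):\n"
    "        if self:\n"
    "            return 1\n"
    "        return 2\n"
)
tree = ast.parse(sample)
folds = fold_ranges(tree)
around_line_4 = expand_selection(tree, 4)
```

Both functions read the same tree object, so a highlighter walking that tree for captures would add no second parse. Repeatedly taking the next entry of around_line_4 widens the selection from the if block to the method and then to the class.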
