Full-Text Search

Full-text search matches against the natural-language content of documents rather than only against exact field values. The PostgreSQL documentation defines it as the “capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query.” This is what lets a search for “running” also find documents that contain “run” or “runs,” and return the most relevant documents first.

The first step is tokenization: parsing each document into tokens. PostgreSQL notes that it is “useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently.” A parser breaks the raw text into these candidate terms.
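The parsing step can be sketched in a few lines of Python. This is an illustration only, not PostgreSQL's parser: the token class names and the regular expressions below are assumptions chosen to mirror the classes named in the quotation.

```python
import re

# Hypothetical token classes, loosely modeled on the kinds PostgreSQL lists.
# Order matters: more specific patterns are tried before more general ones.
TOKEN_PATTERNS = [
    ("email", r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    ("number", r"\d+(?:\.\d+)?"),
    ("complex_word", r"[A-Za-z]+(?:-[A-Za-z]+)+"),
    ("word", r"[A-Za-z]+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def tokenize(text):
    """Return (token_class, token_text) pairs in document order."""
    # Characters that match no pattern (spaces, punctuation) are skipped.
    return [(match.lastgroup, match.group()) for match in TOKEN_RE.finditer(text)]


tokens = tokenize("Email bob@example.com about 42 up-to-date reports.")
```

Tagging each token with a class lets later stages treat them differently, for example stemming plain words while leaving email addresses and numbers untouched.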

Tokens are then normalized into lexemes. As the PostgreSQL docs put it, a lexeme “has been normalized so that different forms of the same word are made alike,” typically by folding upper-case letters to lower-case and by “removal of suffixes (such as s or es in English),” often using Snowball stemmer rules. Stemming is what collapses “run,” “runs,” and “running” to a common form so they match each other. Irregular forms such as “ran” are not reached by suffix stripping; matching them requires a dictionary-based normalizer.
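A toy normalizer makes the idea concrete. The rules below are a deliberately crude stand-in, not the Snowball algorithm: the function name, the suffix list, and the exceptions are all assumptions made for illustration.

```python
VOWELS = set("aeiou")


def normalize(token):
    """Fold case, then strip a few common English suffixes (toy rules, not Snowball)."""
    word = token.lower()
    if word.endswith("ing") and VOWELS & set(word[:-3]):
        # Strip "ing" only when a vowel remains, so "string" is left alone.
        word = word[:-3]
        # Undo consonant doubling: "running" -> "runn" -> "run".
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "lsz":
            word = word[:-1]
    elif word.endswith(("ches", "shes", "sses", "xes")):
        word = word[:-2]  # "boxes" -> "box"
    elif len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]  # "runs" -> "run"
    return word


lexemes = [normalize(t) for t in ["Run", "runs", "Running", "boxes", "ran"]]
```

Here “Run,” “runs,” and “Running” all become “run,” while “ran” passes through unchanged because no suffix rule applies to it. Real stemmers apply many more rules and conditions than this sketch.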

The normalized terms are stored in an inverted index, a structure that maps each term to the documents containing it, so a lookup does not have to scan every document. Results are then ranked rather than merely matched. Apache Lucene, the library behind many search engines, advertises ranked searching with “best results returned first” and “pluggable ranking models, including the Vector Space Model and Okapi BM25.” PostgreSQL similarly offers a ranking function, ts_rank_cd, that takes the proximity of matching lexemes into account, so a document with a dense cluster of query words scores higher than one where they are scattered. Tokenization, stemming, an inverted index, and relevance ranking together are what distinguish full-text search from a plain substring or exact-field match.
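The indexing and ranking steps can be sketched as follows. This is a minimal in-memory illustration, not how Lucene or PostgreSQL is implemented: the function names are hypothetical, whitespace splitting stands in for real tokenization and stemming, and the scoring uses one common form of the Okapi BM25 formula with the conventional defaults k1 = 1.2 and b = 0.75.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency}; also record document lengths."""
    postings = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()  # stand-in for tokenization and stemming
        lengths[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
    return postings, lengths


def bm25_search(query, postings, lengths, k1=1.2, b=0.75):
    """Return (doc_id, score) pairs, best match first."""
    if not lengths:
        return []
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        matches = postings.get(term, {})
        if not matches:
            continue
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - len(matches) + 0.5) / (len(matches) + 0.5))
        for doc_id, tf in matches.items():
            # Term frequency saturates, and longer documents are penalized.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "fox fox fox jumps over the fox",
}
postings, lengths = build_index(docs)
results = bm25_search("fox", postings, lengths)
```

Only documents listed under the query term in the index are scored, which is the point of the inverted structure. Document 3 ranks above document 1 because it mentions “fox” more often, and document 2 is never considered. This sketch does not model proximity.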
