Full-text and phrase search in PostgreSQL

Full-text and phrase search in PostgreSQL - Coletiv Blog

What are full-text and phrase search?

As per the Wikipedia, full-text search refers to techniques for searching a single or a collection of computer-stored documents in a full-text database. In this type of search, a search engine examines all of the words in every stored document as it tries to match the search criteria (for example, the text specified by a user).

Why do we need it?

Like I’ve said above, if a user specifies a text to search in some database we use this kind of technique to try to match most of the text that the user has inputted. This leads to nearly spot-on results and it lets the user fail some bits of text and still get results. We call this search technique fuzzy search.

What options did we try to solve the problem?

SQL LIKE

Well, this one is the easiest solution to integrate and surely it will work. LIKE is a well-known search pattern that can easily fulfill most of the day-to-day problems. It works on every SQL-based database since it was introduced right at the beginning of the query language.

An example of usage would be as such:

SELECT column_name FROM table_name WHERE column_name LIKE 'pattern'
-- The query will return column_names from table_name whose column_name matches the 'pattern'

Also available to this LIKE pattern is a set of wildcards. For example, % (matches any sequence of zero or more characters) and _ (matches any single character). If you want to negate just stick a NOT before LIKE. If you want case insensitive search, you can use ILIKE instead of LIKE.

To know more about using the LIKE pattern consider visiting PostgreSQL Tutorial. Additionally, we would love to know what technique you prefer most!

tsvector and tsquery

Being featured in PostgresSQL docs as the data types designed to support full-text search, tsvector and tsquery are another set of tools that can help us solve our problem in PostgreSQL.

Although both help us solve our problem, they use distinct approaches: tsvector represents a document in a form optimized for text search and tsquery represents a text query.

tsvector value is a sorted list of distinct lexemes (words that have been normalized to merge different variants of the same word). tsquery is a list of words that will be compared to the normalized vector. An example of usage would be as such

SELECT column_name FROM table_name WHERE to_tsquery('pattern')
-- The query will return the column_names from table_name whose column_name matches the pattern that is being compared to the normalized rows from column_name.

This type of full-text search supports some wildcards too. If you want to use this approach, I recommend checking this article that’s a more in-depth review of both tsvector and tsquery. If you're a documentation kind of person, you can read the docs of PostgreSQL on a full-text search where is covered this approach.

similarity

The third and last option I have researched, being the one we currently use in some projects at Coletiv, is similarity. Similarity uses the concept of Trigram. The solutions are found comparing the number of trigrams that both pattern and resulting string have in common. An example of usage would be as such

SELECT column_name FROM table_name WHERE similarity('pattern', column_name) > similarity_threshold
-- The query will return the column_names from table_name whose column_name is similar to the pattern given a certain similarity_threshold. Adjusting the similarity_threshold can improve the time of the response but can lead to improper results. You have to find the best value for your specific needs. Personally, I start at 0.1 and then start moving the value 0.01 points depending on whether I want the query to be faster or the results to be more precise.

You can find more details about this approach in the PostgreSQL docs. In order for this approach to work, you have to add the pg_trgm extension to your database and create git_trgm_ops index to the table you want to proceed with the search. This implementation supports wildcards too that you can check on the PostgreSQL docs.

In conclusion…

As usual, pick the right tool for the problem you have at hand. In our case, different projects have different requisites. As such we always take the time to analyze which one of the solutions presented better suits the project needs.

For reference, tsvector and tsquery first, normalize and then start the search which will output better precision results. On the other hand, similarity just goes for an approximation of the result but has other benefits.

The good news is that, if you like both of them, you can always implement them together as presented here, in section F.35.4. Text Search Integration.

Thanks for reaching this far, if you have other suggestions to tackle the problem please drop us a line.

Thank you for reading!

Thank you so much for reading, it means a lot to us! Don’t forget to follow Coletiv on Facebook, Twitter, and LinkedIn as we keep posting interesting articles on technologies, processes, and experiences.


If you'd like to work with us on a digital product just drop us a message here.




Do you want to become part of our community?

Join our newsletter 😎