At Lateral we use PostgreSQL to store documents for our visualiser. Each document consists of a
text column and a JSON column with meta data inside such as a title, date and URL. We wanted to create a fast search experience for the visualiser that lets you search the full text of documents as well as their titles to quickly find a document to get Lateral recommendations for.
To achieve this we assessed our options; we could use an open source solution such as Apache Solr or Elasticsearch, we could use a managed solution such as Elastic or Algolia or we could use PostgreSQL’s full text search. We decided to use PostgreSQL for the following reasons:
- There’s no need to install extra software or libraries
- We can use the same database interface we use in our app to perform search queries (in our case ActiveRecord)
- We don’t need to provision any additional servers
- There’s no extra cost to us
- The data is stored in one place that we control
- There’s no requirement to keep our database rows in sync with another source
The main arguments against using PostgreSQL search are accuracy and speed at scale. The accuracy of PostgreSQL searches is not the best. In the examples that follow I’ve explained how to weight titles over text, but we found a few cases where typing the exact title of a document wouldn’t return that document as the first result. This isn’t ideal and please let us know if you have any feedback about this. We’ve had issues in the past with the speed of PostgreSQL searches. This was for 23 million records and we found the speed deteriorates around 1 to 2 million rows.
So even though the dedicated solutions could probably work a bit better, we decided that for this use case they weren’t necessary.
To get optimum performance with PostgreSQL full text search you need to create a column to store the tsvector values with an index on it. Then you need to fill this column with the tsv values and create a trigger to update the field on INSERT and UPDATE. Then you’ll be able to query the table quickly.
To create the column for the tsvector values run:
ALTER TABLE documents ADD COLUMN tsv tsvector;
And then to create the index on this column run:
CREATE INDEX tsv_idx ON documents USING gin(tsv);
Now you have somewhere to store the tsvectors you can populate the column. In this example, I am assuming a table with the structure I mentioned above, with a
text column and a
meta column that contains a JSON object with a
title. To populate the tsv column with tsvectors, run the following:
UPDATE data_rows SET tsv = setweight(to_tsvector(coalesce(meta->>'title','')), 'A') || setweight(to_tsvector(coalesce(text,'')), 'D');
This query gets the
title from the meta JSON column and gives it the heighest weight of A. Then it gets the text value and weights it D. Then it combines the two tsvectors and writes them to the tsv column we just created.
At this point, if your data was static you could stop and start querying. But we want all future rows and updates to have up-to-date tsv columns so we need to create a trigger to do this. Firstly, you’ll need to create a function to take a column and update the
CREATE FUNCTION documents_search_trigger() RETURNS trigger AS $$
setweight(to_tsvector(coalesce(new.meta->>'title','')), 'A') ||
$$ LANGUAGE plpgsql;
This basically does the same as the update query above but as a function. Now you need to create a trigger to execute that function when any row is updated or inserted:
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON data_rows FOR EACH ROW EXECUTE PROCEDURE documents_search_trigger();
Now you can search the database. Replace both occurences of
YOUR QUERY text with what you want to search and run the following query:
SELECT id, meta->>'title' as title, meta FROM (
SELECT id, meta, tsv
FROM data_rows, plainto_tsquery('YOUR QUERY') AS q
WHERE (tsv @@ q)
) AS t1 ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('YOUR QUERY')) DESC LIMIT 5;
This query performs two queries; firstly it performs the search on the indexed tsv column. Then it ranks and sorts those results and returns 5.
This query takes about 50ms! We experimented with returning the document text with results and found that if we returned the full text of documents, it added about 350ms to the query time which seems to be network overhead rather than a slower query. If we only returned 200 characters from the texts then it only added around 100ms.
There is some added overhead because we’re using Rails to query PostgreSQL, so it’s not as fast as something like Algolia but it’s good enough for our use case.
See the search in action in one of our recommender demos:
Making stuff with computers at Lateral.
Original URL: http://feedproxy.google.com/~r/feedsapi/BwPx/~3/N_EVHQcdai8/