dbi Blog

Indexing queries for like/similarity conditions is not that easy with the usual index types. The only option you have with btree indexes (especially if the wild-card is at the beginning of the filter) is to create a partial index on that columns for a very specific query.
Read More
Let’s do a simple example with a btree index. The test data:

drop table if exists t1;
 create table t1 ( a varchar(50) );
 insert into t1 ( select md5(tt::text)
 from generate_series ( 1, 100000) tt
 );

This created 1’000’000 distinct rows:

postgres=# with aa as ( select a, count(*) from t1 group by a having count(*) = 1  )
select count(*) from aa;
 count  
--------
 100000
(1 row)

This is a perfect distribution for creating an index if we look for few rows. What options do we have? The first, obvious one, is to create a btree index:

postgres=# create index i_btree on t1(a);
CREATE INDEX

This will perfectly work for queries like this:

postgres=# explain analyze select * from t1 where a = 'd0e2dc293c21d9628c65b5032fe9dc76';
                                                    QUERY PLAN                                               
     
-------------------------------------------------------------------------------------------------------------
-----
 Index Only Scan using i_btree on t1  (cost=0.42..8.44 rows=1 width=33) (actual time=0.041..0.041 rows=1 loop
s=1)
   Index Cond: (a = 'd0e2dc293c21d9628c65b5032fe9dc76'::text)
   Heap Fetches: 1
 Planning time: 0.135 ms
 Execution time: 0.060 ms
(5 rows)

But what if we want to query the data like this (Please excuse the double quotes around the like. I found no other ways to prevent this blog software from doing funny things with it):

postgres=# explain analyze select * from t1 where a like '"%"e2dc293c21d9628c65b5032fe9dc76';
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..2084.00 rows=10 width=33) (actual time=11.656..14.481 rows=1 loops=1)
   Filter: ((a)::text~~ '"%"e2dc293c21d9628c65b5032fe9dc76'::text)
   Rows Removed by Filter: 99999
 Planning time: 0.069 ms
 Execution time: 14.493 ms
(5 rows)
postgres=# explain analyze select * from t1 where a like '"%"e2dc93"%"c21d9628c65b5032fe9dc76';
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..2084.00 rows=10 width=33) (actual time=10.990..14.112 rows=1 loops=1)
   Filter: ((a)::text ~~ '"%"e2dc93"%"c21d9628c65b5032fe9dc76'::text)
   Rows Removed by Filter: 99999
 Planning time: 0.046 ms
 Execution time: 14.183 ms
(5 rows)

This will trigger a full table scan on the table as the index can not be used. Any chance to fix this with a btree index? We could create a partial index like this for the first query:

postgres=# create index i_btree_t1 on t1(a) where a like '"%"e2dc293c21d9628c65b5032fe9dc76'; CREATE INDEX postgres=#

Does it help?:

postgres=# explain analyze select * from t1 where a like '"%"e2dc293c21d9628c65b5032fe9dc76';
                                                     QUERY PLAN                                              
       
-------------------------------------------------------------------------------------------------------------
-------
 Bitmap Heap Scan on t1  (cost=4.13..40.97 rows=10 width=33) (actual time=0.013..0.013 rows=1 loops=1)
   Recheck Cond: ((a)::text ~~ '"%"e2dc293c21d9628c65b5032fe9dc76'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on i_btree_t1  (cost=0.00..4.13 rows=10 width=0) (actual time=0.010..0.010 rows=1 lo
ops=1)
 Planning time: 0.208 ms
 Execution time: 0.029 ms
(6 rows)

Yes, it does help. But only in this case. As soon as the query is slightly modified this will not work anymore:

postgres=# explain analyze select * from t1 where a like '"%"XXdc293c21d9628c65b5032fe9dc76';
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..2084.00 rows=10 width=33) (actual time=11.529..11.529 rows=0 loops=1)
   Filter: ((a)::text ~~ '"%"XXdc293c21d9628c65b5032fe9dc76'::text)
   Rows Removed by Filter: 100000
 Planning time: 0.080 ms
 Execution time: 11.541 ms
(5 rows)

So this is only an option for very static queries asking for the same “like” over and over again.
Luckily, in PostgreSQL, there is the pg_trgm extension which is shipped by default. From the documentation: “The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.”
Can this help us? Let’s try. Installing extensions which are shipped by default is just matter of:

postgres=# create extension pg_trgm;
 CREATE EXTENSION
 postgres=#
Btw.: There is a plsql shortcut to list all the extension currently installed;
postgres=# \dx
                                     List of installed extensions
    Name    | Version |   Schema   |                            Description                            
------------+---------+------------+-------------------------------------------------------------------
 cstore_fdw | 1.2     | public     | foreign-data wrapper for flat cstore access
 hstore     | 1.3     | public     | data type for storing sets of (key, value) pairs
 pg_trgm    | 1.1     | public     | text similarity measurement and index searching based on trigrams
 plpgsql    | 1.0     | pg_catalog | PL/pgSQL procedural language
(4 rows)

Now that the extension is installed let’s create an index using the extension:

postgres=# drop index i_btree;
DROP INDEX
postgres=# drop index i_btree_t1;
DROP INDEX
postgres=# 
postgres=# create index i_rtgm on t1 using gist ( a gist_trgm_ops);
CREATE INDEX
postgres=#

Does it help for the queries above? :

postgres=# explain analyze select * from t1 where a like '"%"e2dc293c21d9628c65b5032fe9dc76';
                                                   QUERY PLAN                                                
   
-------------------------------------------------------------------------------------------------------------
---
 Bitmap Heap Scan on t1  (cost=4.36..41.20 rows=10 width=33) (actual time=1.654..1.654 rows=1 loops=1)
   Recheck Cond: ((a)::text ~~ '"%"e2dc293c21d9628c65b5032fe9dc76'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on i_rtgm  (cost=0.00..4.36 rows=10 width=0) (actual time=1.646..1.646 rows=1 loops=
1)
         Index Cond: ((a)::text ~~ '"%"e2dc293c21d9628c65b5032fe9dc76'::text)
 Planning time: 0.113 ms"
 Execution time: 1.682 ms
(7 rows)
postgres=# explain analyze select * from t1 where a like '"%"XXdc293c21d9628c65b5032fe9dc76';
                                                   QUERY PLAN                                                
   
-------------------------------------------------------------------------------------------------------------
---
Bitmap Heap Scan on t1  (cost=4.36..41.20 rows=10 width=33) (actual time=1.633..1.633 rows=0 loops=1)
   Recheck Cond: ((a)::text ~~ '"%"XXdc293c21d9628c65b5032fe9dc76'::text)
   ->  Bitmap Index Scan on i_rtgm  (cost=0.00..4.36 rows=10 width=0) (actual time=1.630..1.630 rows=0 loops=
1)
         Index Cond: ((a)::text ~~ '"%"XXdc293c21d9628c65b5032fe9dc76'::text)
 Planning time: 0.058 ms
 Execution time: 1.650 ms
(6 rows)

Even for multiple wild-cards;

postgres=# explain analyze select * from t1 where a like '"%"e2dc"%"93c21d"%%"62"%%"65b5032fe9dc76';
                                                   QUERY PLAN
   
-------------------------------------------------------------------------------------------------------------
---
 Bitmap Heap Scan on t1  (cost=4.36..41.20 rows=10 width=33) (actual time=1.480..1.481 rows=1 loops=1)
   Recheck Cond: ((a)::text ~~ '"%"e2dc"%"93c21d"%%"62"%%"65b5032fe9dc76'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on i_rtgm  (cost=0.00..4.36 rows=10 width=0) (actual time=1.475..1.475 rows=1 loops=
1)
         Index Cond: ((a)::text ~~ '"%"e2dc"%"93c21d"%%"62"%%"65b5032fe9dc76'::text)
 Planning time: 0.100 ms
 Execution time: 1.501 ms
(7 rows)

Isn’t that cool?

Post Views: 373

Indexing for like/similarity operations

Daniel Westermann

Leave a Reply:

Related blog articles

Daniel Westermann

Leave a Reply:

Related blog articles

A first look at openSUSE Leap Micro 6.0

Another file system for Linux: bcachefs (3) – Mirroring/Replicas

Another file system for Linux: bcachefs (2) – multi device file systems

Another file system for Linux: bcachefs (1) – basics