PgSQL Feature van de Week: Arrays voor Faceted Search stage

PostgreSQL kan velden vullen met arrays van integers of strings, en daar geindexeerd in zoeken met o.a. intersect en overlap, wat betekent dat je er eenvoudig een faceted search mee kunt maken.

De staat van dit artikel is "proof of concept". Het is geen tutorial of best-practice, het is inspiratie.

Eerst maak ik een producten tabel met een id, een titel en een array met de eigenschappen van het product. Ik gebruik een array van strings voor de eigenschappen, maar dat is puur om het leesbaar te houden. Een array van integers die verwijzen naar een tabel met de eigenschapdetails werkt ook.

Vervolgens vul ik de productentabel met 100k producten, elk voorzien van een array met daarin drie willekeurige attributen voor kleur, maat en materiaal.

code

DROP TABLE IF EXISTS products;
CREATE TEMPORARY TABLE products
(
    id         INT,
    title      TEXT,
    attributes TEXT[]
);

-- Genereer een lading producten met willkeurige attributen.
WITH colortable AS (SELECT ARRAY ['red', 'green', 'blue', 'yellow'] AS colors)
   , sizetable AS (SELECT ARRAY ['xs', 's', 'm', 'l', 'xl', 'xxl'] AS sizes)
   , materialtable AS (SELECT ARRAY ['wood', 'metal', 'plastic', 'diamond', 'lapiz', 'saffron', 'jeans'] AS materials)
INSERT
INTO products
SELECT i,
       'p_' || i,
       ARRAY ['size:' || sizes[1 + (4 * RANDOM())::int],
           'color:' || colors[1 + (5 * RANDOM())::INT],
           'material:' || materials[1 + (6 * RANDOM())::INT]
           ]
FROM colortable,
     sizetable,
     materialtable,
     generate_series(1, 100000) AS i;



SELECT * FROM products LIMIT 10;
 id | title |               attributes               
----+-------+----------------------------------------
  1 | p_1   | {size:m,color:blue,material:wood}
  2 | p_2   | {size:xl,color:blue,material:metal}
  3 | p_3   | {size:l,color:yellow,material:plastic}
  4 | p_4   | {size:xs,color:green,material:jeans}
  5 | p_5   | {size:xs,color:green,material:diamond}
  6 | p_6   | {size:s,color:blue,material:metal}
  7 | p_7   | {size:m,color:green,material:saffron}
  8 | p_8   | {size:s,color:blue,material:diamond}
  9 | p_9   | {size:s,color:yellow,material:lapiz}
 10 | p_10  | {size:l,color:blue,material:saffron}
(10 rows)

PostgreSQL's array functies staan fraai uitgelegd in de handleiding dus dat ga ik hier niet allemaal herhalen.

Hieronder gebruik ik && voor overlaps, @> voor contains en UNNEST() voor het uitpakken van arrays naar rijen.

Overlap:
SELECT ARRAY[1, 2, 3, 5, 6] && ARRAY[1, 2 ,3 ,4];
geeft TRUE omdat beide arrays de elementen 1, 2 en 3 gemeen hebben, ookal zit 4 niet in [1, 2, 3, 5, 6].

Contains:
SELECT ARRAY[1,2,3] @> array[1, 3];|
geeft True want [1, 2, 3] bevat 1 en 3.

SELECT ARRAY[1,2,3] @> array[1, 4];
geeft False want 1 zit wel in [1, 2 ,3] maar 4 niet.

UNNEST()
SELECT * FROM UNNEST(ARRAY[1,2,3]) AS arrvalues;

arrvalues
--------
1
2
3
(3 rows)

code

-- Geef alle producten waarvan de kleur rood is.
SELECT * FROM products WHERE attributes @> ARRAY['color:red'];

 id  | title |              attributes              
-----+-------+--------------------------------------
  11 | p_11  | {size:s,color:red,material:metal}
  28 | p_28  | {size:s,color:red,material:saffron}
  52 | p_52  | {size:l,color:red,material:metal}
  53 | p_53  | {size:l,color:red,material:wood}
  66 | p_66  | {size:m,color:red,material:wood}
  72 | p_72  | {size:xl,color:red,material:saffron}
  82 | p_82  | {size:xl,color:red,material:lapiz}
  85 | p_85  | {size:l,color:red,material:diamond}
  91 | p_91  | {size:m,color:red,material:metal}
 109 | p_109 | {size:xl,color:red,material:diamond}
...

code

-- Geef alle producten waarvan de kleur rood is en de maat valt in s, m en l.
-- Let op dat overlaps een match goedkeurt zodra er tenminste één element overeenkomt,
-- dit geval is de kleur verplicht dus dat moet met een aparte controle.
SELECT *
FROM products
WHERE attributes @> ARRAY ['color:blue']
  AND attributes && ARRAY ['size:s','size:m','size:l'];

 id | title |              attributes              
----+-------+--------------------------------------
  9 | p_9   | {size:m,color:blue,material:metal}
 20 | p_20  | {size:m,color:blue,material:metal}
 28 | p_28  | {size:s,color:blue,material:lapiz}
 29 | p_29  | {size:s,color:blue,material:diamond}
 40 | p_40  | {size:m,color:blue,material:plastic}
 56 | p_56  | {size:l,color:blue,material:lapiz}
 62 | p_62  | {size:l,color:blue,material:diamond}
 63 | p_63  | {size:s,color:blue,material:metal}
 72 | p_72  | {size:m,color:blue,material:saffron}
 74 | p_74  | {size:m,color:blue,material:wood}
...

Via UNNEST() kun je arrays uitpakken tot rijen en dus kun je bijvoorbeeld unieke waarden tellen.

code

-- Geef een telling van alle attributen van alle producten waar color:red in staat.
WITH red_products AS (SELECT * FROM products WHERE attributes && ARRAY ['color:red'])
SELECT u, COUNT(*)
FROM red_products,
     UNNEST(attributes) AS u
GROUP BY u;

        u         | count 
------------------+-------
 material:diamond |  1673
 color:red        |  9823
 size:s           |  2520
 material:jeans   |   833
 material:wood    |   812
 size:xs          |  1156
 material:saffron |  1594
 size:l           |  2502
 material:metal   |  1582
 material:lapiz   |  1617
 material:plastic |  1712
 size:xl          |  1193
 size:m           |  2452
(13 rows)

code

-- Een kleine variatie geeft alle waarden van de verschillende attributen voor alle rode producten.
WITH red_products AS (SELECT * FROM products WHERE attributes && ARRAY ['color:red'])
SELECT split_part(u, ':', 1), ARRAY_AGG(DISTINCT split_part(u, ':', 2))
FROM red_products,
     UNNEST(attributes) AS u
GROUP BY split_part(u, ':', 1);


 split_part |                    array_agg                     
------------+--------------------------------------------------
 color      | {red}
 material   | {diamond,jeans,lapiz,metal,plastic,saffron,wood}
 size       | {l,m,s,xl,xs}
(3 rows)

code

-- Geef hetzelfde terug als JSON (en nee dit zou je in de praktijk natuurlijk nooit zo doen, PostgreSQL heeft een JSON datatype)
WITH red_products AS (SELECT * FROM products WHERE attributes && ARRAY ['color:red'])
   , uniques AS (SELECT DISTINCT split_part(u, ':', 1) as attrib, split_part(u, ':', 2) as val
                 FROM red_products,
                      UNNEST(attributes) AS u)
   , jsons as (SELECT json_build_array(attrib, json_agg(val)) AS j FROM uniques GROUP BY attrib)
SELECT json_agg(j)
FROM jsons;

                                                                    json_agg                                                                     
-------------------------------------------------------------------------------------------------------------------------------------------------
 [["color", ["red"]], ["material", ["wood", "jeans", "saffron", "metal", "lapiz", "diamond", "plastic"]], ["size", ["s", "xs", "m", "xl", "l"]]]

code

-- En als we toch bezig zijn, geef dan alles terug als een JSON object
WITH red_products AS (SELECT * FROM products WHERE attributes && ARRAY ['color:red'] LIMIT 4)
   , facets AS (SELECT split_part(u, ':', 1) as attr, ARRAY_AGG(DISTINCT split_part(u, ':', 2)) as val
                FROM red_products,
                     UNNEST(attributes) AS u
                GROUP BY split_part(u, ':', 1))
   , product AS (SELECT json_agg(row_to_json(red_products.*)) AS prod FROM red_products)
   , facet AS (SELECT JSON_AGG(json_build_array(facets.attr, facets.val)) AS att FROM facets)
SELECT jsonb_build_object('products', prod, 'attributes', facet.att)
FROM product,
     facet;

--------------------------------
{
  "products": [
    {
      "id": 9,
      "title": "p_9",
      "attributes": [
        "size:m",
        "color:red",
        "material:metal"
      ]
    },
    {
      "id": 10,
      "title": "p_10",
      "attributes": [
        "size:m",
        "color:red",
        "material:diamond"
      ]
    },
    {
      "id": 11,
      "title": "p_11",
      "attributes": [
        "size:m",
        "color:red",
        "material:lapiz"
      ]
    },
    {
      "id": 15,
      "title": "p_15",
      "attributes": [
        "size:xl",
        "color:red",
        "material:lapiz"
      ]
    }
  ],
  "attributes": [
    [
      "color",
      [
        "red"
      ]
    ],
    [
      "material",
      [
        "diamond",
        "lapiz",
        "metal"
      ]
    ],
    [
      "size",
      [
        "m",
        "xl"
      ]
    ]
  ]
}

Kun je dit hele geintje niet n JSON doen? Opzich wel maar er is nog geen handige manier om een overlap te controleren in JSON.

Performance en indexes

Via EXPLAIN ANALYSE kun je achterhalen wat de queryplanner allemaal met de query doet en hoeveel elk onderdeel kost in tijd en moeite.

code

EXPLAIN ANALYSE
SELECT *
FROM products
WHERE attributes @> ARRAY ['color:blue']
  AND attributes @> ARRAY ['material:jeans']
  AND attributes && ARRAY ['size:s', 'size:m', 'size:l'];


QUERY PLAN
Seq Scan on products  (cost=0.00..3141.00 rows=1 width=68) (actual time=0.033..24.376 rows=1279 loops=1)
  Filter: ((attributes @> '{color:blue}'::text[]) AND (attributes @> '{material:jeans}'::text[]) AND (attributes && '{size:s,size:m,size:l}'::text[]))
  Rows Removed by Filter: 98721
Planning Time: 0.044 ms
Execution Time: 24.433 ms

Die sequential scans zijn sowieso niet goed, maar gelukkig kan PostgreSQL ook arrays indexeren:

code

CREATE INDEX idx_attribs ON products USING GIN (attributes);


EXPLAIN ANALYSE
SELECT *
FROM products
WHERE attributes @> ARRAY ['color:blue']
  AND attributes @> ARRAY ['material:jeans']
  AND attributes && ARRAY ['size:s', 'size:m', 'size:l'];


QUERY PLAN
Bitmap Heap Scan on products  (cost=100.00..104.02 rows=1 width=68) (actual time=2.249..5.008 rows=1279 loops=1)
  Recheck Cond: ((attributes @> '{color:blue}'::text[]) AND (attributes @> '{material:jeans}'::text[]) AND (attributes && '{size:s,size:m,size:l}'::text[]))
  Heap Blocks: exact=816
  ->  Bitmap Index Scan on idx_attribs  (cost=0.00..100.00 rows=1 width=0) (actual time=2.132..2.132 rows=1279 loops=1)
        Index Cond: ((attributes @> '{color:blue}'::text[]) AND (attributes @> '{material:jeans}'::text[]) AND (attributes && '{size:s,size:m,size:l}'::text[]))
Planning Time: 0.140 ms
Execution Time: 5.076 ms

En dat kan nog sneller door gebruikte maken van partial indexes. Onderstraande index bevat alleen de producten waarvan de kleur 'blue' is, dus als de query daar specifiek om vraagt(!) dan kan deze veelkleinere index worden gebruikt.

Zie hoe de cost van de bitmap scan op de idx_attribus_blue index nu 36 is, terwijl de cost van de scan op de idx_attribs eerst 100 was.

code

CREATE INDEX idx_attribs_blue ON products USING GIN (attributes) WHERE attributes @> ARRAY['color:blue'];

EXPLAIN ANALYSE
SELECT *
FROM products
WHERE attributes @> ARRAY ['color:blue']
  AND attributes @> ARRAY ['material:jeans']
  AND attributes && ARRAY ['size:s', 'size:m', 'size:l'];



QUERY PLAN
Bitmap Heap Scan on products  (cost=36.00..40.02 rows=1 width=68) (actual time=0.535..1.087 rows=1279 loops=1)
  Recheck Cond: ((attributes @> '{material:jeans}'::text[]) AND (attributes && '{size:s,size:m,size:l}'::text[]) AND (attributes @> '{color:blue}'::text[]))
  Heap Blocks: exact=816
  ->  Bitmap Index Scan on idx_attribs_blue  (cost=0.00..36.00 rows=1 width=0) (actual time=0.450..0.450 rows=1279 loops=1)
        Index Cond: ((attributes @> '{material:jeans}'::text[]) AND (attributes && '{size:s,size:m,size:l}'::text[]))
Planning Time: 0.072 ms
Execution Time: 1.142 ms

Hoe efficient een partial index is hangt sterk af van hoeveel records de queryplanner verwacht te kunnen afstrepen door middel van die index. Het zal in de praktijk dus nodig zijn om meerdere partiele indexes te maken voor de attributen waar de gebruikers vaak op zoeken.

Als gebruikers veel zoeken op vaste combinaties van attributen dan kan de partial index daar ook op worden gemaakt, waardoor er nog meer rijen kunnen worden uitgesloten.

Let op dat de query daarvoor in de WHERE dezelfde clausule moet gebruiken die ook in de index is gebruikt:

code

CREATE INDEX idx_attribs_blue_jeans ON products USING GIN (attributes) WHERE attributes @> ARRAY['color:blue', 'material:jeans'];


EXPLAIN ANALYSE
SELECT *
FROM products
WHERE attributes @> ARRAY ['color:blue', 'material:jeans']
  AND attributes && ARRAY ['size:s', 'size:m', 'size:l'];


QUERY PLAN
Bitmap Heap Scan on products  (cost=16.00..20.02 rows=1 width=68) (actual time=0.286..0.746 rows=1279 loops=1)
  Recheck Cond: ((attributes && '{size:s,size:m,size:l}'::text[]) AND (attributes @> '{color:blue,material:jeans}'::text[]))
  Heap Blocks: exact=816
  ->  Bitmap Index Scan on idx_attribs_blue_jeans  (cost=0.00..16.00 rows=1 width=0) (actual time=0.207..0.207 rows=1279 loops=1)
        Index Cond: (attributes && '{size:s,size:m,size:l}'::text[])
Planning Time: 0.084 ms
Execution Time: 0.798 ms

LoadTesting

Losse queries zijn tot daaraantoe maar hoe houd dit zich als het opschaalt?

https://medium.com/@FranckPachot/do-you-know-what-you-are-measuring-with-pgbench-d8692a33e3d6

PgBench

PgBench is een standaard commandlinetool om eenvoudig heel veel queries te draaien en te meten hoelang het duurde. Dit geeft je niet bijzonder veel inzicht

Je kunt scripts schrijven die één of meerdere queries uitvoeren, al dan niet voorzien van willekeurige waarden. Je kunt bepalen hoe vaak het script uitgevoerd moet worden, met hoeveel verbindingen tegelijk exc. Zie handleiding.

Een pgbench script zit er alsvolg uit:

code

\set COL random(1, 5)

BEGIN;
SELECT * FROM products WHERE attributes @> ARRAY ['color:color' || :COL] LIMIT 1;
END;

LIMIT 1 zit er alleen in omdat ik in dit geval niet geinteresseerd ben in hoe lang het duurt om de resultaten over te sturen, alleen in hoe lang het duurt op op te zoeken welke resultaten er zouden zijn.

Om het uit te voeren gebruik je iets als

pgbench -f pgbench_script_1.sql -f pgbench_script_2.sql -M prepared -n -t 20 -j 10 -c 10 -r

-f <scriptnaam> Voer ditr script uit.
-M prepared Geef de quer door als een prepared statement
-n Doe geen vacuum aan het begin dan de test.
-t 20 Voer elk script 20 keer uit.
-j 10 Gebruik tien

Jamaar en wat-als.

Een array is geen foreign-key, wat doe je als 'rood' 'wordt vervangen door 'red'?

Hoe de waarde van het attribuut opslaat en hoe je het presenteert aan de buitenwereld zijn twee verschillende zaken. Je zou 'rood' ook kunnen opslaan als 25 en dan in een koppeltabel opzoeken wat 25 is. Dan is het nog steeds geen foreign key, klopt, get over it. Referentiële integriteit kun je eenvoudig afdwingen met een trigger. Deze situatie komt te zelden voor om heir een punt van te maken.

Krijg je zo niet heel erg grote tabellen?

SELECT nspname || '.' || relname AS "relation", pg_size_pretty(pg_relation_size(C.oid)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(C.oid) DESC
LIMIT 20;