The comprehensive JSON support in PostgreSQL is one of its best-loved features. Many people – particularly those with a stronger background in JavaScript programming than in relational databases – use it extensively. However, my experience is that the vast majority of people don't use it correctly. That causes problems and unhappiness in the long run.
In this article, I will try to point out good and bad uses of JSON in PostgreSQL, and provide you with guidelines that you can follow.
This data model exemplifies everything that you can do wrong:
/* this table is fine */
CREATE TABLE people (
   id bigint PRIMARY KEY,
   name text NOT NULL
);

INSERT INTO people VALUES
   (1, 'laurenz'),
   (2, 'julian'),
   (3, 'ants'),
   (4, 'kaarel');

/* this table is ill-defined */
CREATE TABLE rooms (
   id bigint PRIMARY KEY,
   data jsonb
);

INSERT INTO rooms VALUES
   (1, '{ "name": "Room 1",
          "reservations": [
             { "who": 1, "from": "2021-06-01 09:00:00", "to": "2021-06-01 10:00:00" },
             { "who": 3, "from": "2021-06-01 10:00:00", "to": "2021-06-01 11:30:00" },
             { "who": 2, "from": "2021-06-01 13:15:00", "to": "2021-06-01 14:45:00" },
             { "who": 1, "from": "2021-06-01 15:00:00", "to": "2021-06-01 16:00:00" }
          ] }'),
   (2, '{ "name": "Room 2",
          "reservations": [
             { "who": 2, "from": "2021-06-01 09:30:00", "to": "2021-06-01 10:30:00" }
          ] }');
There is no reason not to have the room name as a regular column. After all, every room will have a name, and we may want to enforce constraints like uniqueness on the room name.
The room reservations are perfectly regular tabular data that define a many-to-many relationship between the rooms and the people. It would have been simple to model the same data with a junction table:
/* no primary key - we'll get to that later */
CREATE TABLE reservations (
   people_id bigint REFERENCES people NOT NULL,
   room_id bigint REFERENCES rooms NOT NULL,
   reserved tsrange NOT NULL
);

INSERT INTO reservations VALUES
   (1, 1, '[2021-06-01 09:00:00,2021-06-01 10:00:00)'),
   (3, 1, '[2021-06-01 10:00:00,2021-06-01 11:30:00)'),
   (2, 1, '[2021-06-01 13:15:00,2021-06-01 14:45:00)'),
   (1, 1, '[2021-06-01 15:00:00,2021-06-01 16:00:00)'),
   (2, 2, '[2021-06-01 09:30:00,2021-06-01 10:30:00)');
Many people seem to think that storing a few large rows in a table is more efficient than storing many small rows. There is some truth to that, since every row has some overhead, and PostgreSQL compresses large data. But if you want to retrieve only parts of the data, or want to modify them, many small rows are much more efficient - as we will see below.
The "who"
attribute stores a foreign key reference to people
. That is not a good idea, because it is impossible for the database to enforce such a constraint: I could just as well have inserted a reference to a non-existing person. With the junction table from above, defining a foreign key is trivial.
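A quick check with the junction table (using the sample data from above): inserting a reservation for a person that doesn't exist fails immediately, while the JSON model would have accepted it silently:

INSERT INTO reservations
VALUES (42, 1, '[2021-06-02 09:00:00,2021-06-02 10:00:00)');
/* ERROR:  insert or update on table "reservations"
           violates foreign key constraint "reservations_people_id_fkey" */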
Moreover, you often want to join on foreign keys. With JSON, that would require a cross join with the unnested JSON array:
SELECT rooms.data ->> 'name',
       people.name
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS res(j)
   JOIN people
      ON res.j ->> 'who' = people.id::text;
With the junction table, that would be
SELECT rooms.name,
       people.name
FROM rooms
   JOIN reservations AS r ON r.room_id = rooms.id
   JOIN people ON r.people_id = people.id;
You can probably guess which of these two queries will be more efficient.
If you want to add a new reservation, you have to execute a statement like
UPDATE rooms
SET data = jsonb_set(
       data,
       '{reservations,100000}',
       '{"who": 3, "from": "2021-06-01 11:00:00", "to": "2021-06-01 12:00:00"}',
       TRUE
    )
WHERE id = 2;
This will fetch the complete JSON object, construct a new JSON from it and store that new object in the table. The whole JSON object has to be read and written, which is more I/O than you would want - particularly if the JSON object is large and stored out of line.
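To get a feeling for how much data each such update has to rewrite, you can look at the stored size of the JSON values (pg_column_size shows the size as stored, after any compression):

SELECT id, pg_column_size(data) AS stored_bytes
FROM rooms;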
Compare how simple the same exercise would be with the junction table:
INSERT INTO reservations
VALUES (3, 2, '[2021-06-01 11:00:00,2021-06-01 12:00:00)');
This statement will only write a small amount of data.
Deleting a reservation is just as complicated and expensive, and is left as an exercise to the reader.
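For the impatient, here is a sketch of what it might look like: with JSON, you have to rebuild the whole reservation array without the offending element; with the junction table, it is a plain DELETE (the timestamps are taken from the sample data above):

/* JSON: read the array, filter it, write the whole object back */
UPDATE rooms
SET data = jsonb_set(
       data,
       '{reservations}',
       coalesce(
          (SELECT jsonb_agg(res.j)
           FROM jsonb_array_elements(data -> 'reservations') AS res(j)
           WHERE res.j ->> 'from' <> '2021-06-01 09:30:00'),
          '[]'
       )
    )
WHERE id = 2;

/* junction table: a plain DELETE */
DELETE FROM reservations
WHERE room_id = 2
  AND reserved = '[2021-06-01 09:30:00,2021-06-01 10:30:00)';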
So far, our data model offers no protection against overlapping reservations, which would be good to enforce in the database.
With JSON, we are pretty much out of luck here. The best that comes to mind is a constraint trigger, but that would require elaborate locking or the SERIALIZABLE transaction isolation level to be free from race conditions. Also, the code would be far from simple.
With the junction table, the exercise is simple; all we have to do is to add an exclusion constraint that checks for overlaps with the && operator:
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE reservations ADD EXCLUDE USING gist (
   reserved WITH &&,
   room_id WITH =
);
The extension is required to create a GiST index on a bigint column.
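A quick test with the sample data (the constraint name is auto-generated by PostgreSQL, so yours may differ): an overlapping reservation for room 1 is now rejected:

INSERT INTO reservations
VALUES (4, 1, '[2021-06-01 09:30:00,2021-06-01 10:30:00)');
/* ERROR:  conflicting key value violates exclusion constraint
           "reservations_reserved_room_id_excl" */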
Simple searches for equality can be performed with the JSON containment operator @>, and such searches can be supported by a GIN index, as the sketch below shows. But complicated searches are a pain.
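Here is a minimal sketch of such an equality search - all rooms with a reservation by person 2 - assuming we create a GIN index on the data column first:

CREATE INDEX ON rooms USING gin (data);

SELECT id
FROM rooms
WHERE data @> '{"reservations": [{"who": 2}]}';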
Imagine we want to search for all rooms that are occupied at 2021-06-01 15:30:00. With JSON, that would look somewhat like
SELECT id
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS elem(j)
WHERE CAST(elem.j ->> 'from' AS timestamp) <= TIMESTAMP '2021-06-01 15:30:00'
  AND CAST(elem.j ->> 'to' AS timestamp) > TIMESTAMP '2021-06-01 15:30:00';
With our junction table, the query becomes
SELECT room_id
FROM reservations
WHERE reserved @> TIMESTAMP '2021-06-01 15:30:00';
That query can use the GiST index from the exclusion constraint we created above.
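You can verify that with EXPLAIN (given enough rows, that is - on a table as tiny as ours, the planner may well prefer a sequential scan):

EXPLAIN
SELECT room_id
FROM reservations
WHERE reserved @> TIMESTAMP '2021-06-01 15:30:00';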
Don't get me wrong: JSON support in PostgreSQL is a wonderful thing. It is just that many people don't understand how to use it right. For example, the majority of questions about PostgreSQL and JSON asked on Stack Overflow are about problems that arise from using JSON where it would have been better avoided.
Follow these guidelines when you consider using JSON in PostgreSQL:

* Don't use JSON for data that can easily be stored in database tables.
* Avoid large JSON objects if you want to modify individual attributes.
* Don't use JSON if you want to use attributes in complicated WHERE conditions.
* Avoid JSON if you want to join on some of the attributes.
* Don't use JSON if you want constraints on the attributes.
* Don't use JSON if attributes are foreign keys referencing other tables.

Often it may be a good idea to store some attributes as regular table columns and others in a JSON. The less you need to process the data inside the database, the better it is to store them as JSON.
To show an example of how JSON in the database can be used to advantage, let us consider a shop that sells all kinds of mixed goods. There are some properties that all or most of the goods will have, like price, weight, manufacturer, number available or package size. Other attributes may be rare, like the type of power plug, or ambiguous, like the pitch of screws and tuning forks.
Rather than defining a table with hundreds of columns for all the possible attributes, most of which will be NULL, we model the most frequent attributes with normal table columns and use a JSON for the rest:
CREATE TABLE wares (
   id bigint PRIMARY KEY,
   name text UNIQUE NOT NULL,
   price numeric(15,2) NOT NULL,
   weight_kg double precision NOT NULL,
   available integer NOT NULL CHECK (available >= 0),
   manufacturer text,
   package_size integer NOT NULL CHECK (package_size >= 0),
   attributes jsonb
);

/* to search in 'attributes' */
CREATE INDEX ON wares USING gin (attributes);

/* for similarity search on 'name' */
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ON wares USING gin (name gin_trgm_ops);
This will allow efficient queries like
SELECT name, price, available
FROM wares
WHERE name LIKE '%tuning fork%'
  AND attributes @> '{"pitch": "a"}';
There is a lot you can do wrong when using JSON in PostgreSQL, especially if you are not acquainted with relational databases. However, it can be a powerful tool - when used properly.
Minor fix: you should probably remove the primary key on reservations (room_id, people_id). It doesn't make sense, since the same person can book the same room at another time.
You are right.
I have fixed it to use an exclusion constraint.
The one in "Fifth mistake" is the right one 😉
(without people_id)
Anyway, not very important.
Ah, yes. I didn't read that far. Thanks again!
Good article, but I'd actually argue that even the good example doesn't need a jsonb field. For example:
* The "attributes" column could be its own "Attributes" table that stores all attributes as rows (including the ones that already exist for the "wares" table, so those would need to be removed from the wares table).
* Next, an associative/junction table called "WaresAttributes" (or something more descriptive) can store all of the many-to-many relationships between the "wares" and "Attributes" tables. Just the IDs of the wares and attributes is enough.
* Then one can use JOINs for the SELECT queries when needed.
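In SQL, that design might look something like this (a sketch only - all table and column names, as well as the typed value columns, are my assumptions, and the wares table is kept as defined above for simplicity):

CREATE TABLE attributes (
   id bigint PRIMARY KEY,
   name text NOT NULL,
   text_value text,        /* typed value columns; one of them is used per row */
   numeric_value numeric
);

CREATE TABLE wares_attributes (
   ware_id bigint REFERENCES wares NOT NULL,
   attribute_id bigint REFERENCES attributes NOT NULL,
   PRIMARY KEY (ware_id, attribute_id)
);

/* the JOIN-based equivalent of the jsonb containment query above */
SELECT w.name, w.price, w.available
FROM wares AS w
   JOIN wares_attributes AS wa ON wa.ware_id = w.id
   JOIN attributes AS a ON a.id = wa.attribute_id
WHERE w.name LIKE '%tuning fork%'
  AND a.name = 'pitch'
  AND a.text_value = 'a';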
The Dev in me understands wanting to use jsonb fields and be done with all of the hassle. However, the DBA in me knows that there is usually a better way as shown above (in my opinion).
You are proposing a kind of entity-attribute-value design. That is usually a bad idea, as I tried to describe in this article.
Thanks for responding! Though after reading that article, I don't think my suggestion is the same as what you put in the examples. Even though they're both based on entity-relationship modeling, those examples are not as efficient. For example:
* Your examples have multiple attribute tables per type such as "attstring" and "attint". However, my suggestion has only one table ("Attributes") to store all of the attributes for the "wares" table. Any variation in types can be taken care of by using multiple columns. Admittedly, this does require more null checking, but hasn't been a problem in practice. Other solutions I have seen are to just store everything as text and convert to whatever is necessary.
* Your examples require the object ID to be inserted into every row of every attribute table. However, my suggestion has one associative table ("WaresAttributes") that stores the mappings between the wares and Attributes tables - meaning the Attributes table itself doesn't need to be aware of any wares entries.
* The number of SQL statements (along with their complexity) in your examples grows unnecessarily - which you stated in the article - since it depends on how many types of attribute tables exist, and on the fact that you have to put the object ID in each INSERT. My suggestion would only ever deal with three tables at most, and generally just one.
* As far as performance goes, my suggestion above is what we use at work (a large company) with many products and attributes, and we have not had a single performance issue with PostgreSQL.
At the end of the day, I appreciate your articles, and I think both methods (your suggestion of using jsonb as well as what I suggested) can achieve the same thing. However, I'd be curious to see benchmarks comparing jsonb fields with what I suggested above, rather than with what is shown in the other article. If they're faster - and I mean enough faster to make a difference - then I may be convinced!
What would be a convenient way to model services with attributes, where the attributes themselves have types? The attribute types can have single or multiple values. Here is a quick example: the service is T-shirt printing, in the category of apparel printing.
Pocket size (options: left, right, or both left and right)
Back (options: top, bottom, or both top and bottom)
DTF, screen printing or embroidery
There are also other categories, like book printing, but with different attributes. To point out: there are common attributes with the same values, but others with different values. For example, the sizes of clothes are different from the sizes of paper, but the printing options are the same for clothes, book covers and notebooks.
Is jsonb a good approach in this case?
You are right that yours is a much less harmful case of EAV, and it might work well enough for your use case.
It probably won't outperform the JSON solution, but without testing it, I cannot be certain.
Just to clarify a bit more: my suggestion follows the standard principle of using an "associative entity/table" to model many-to-many relationships. This is a part of relational modeling/theory that existed long before the "EAV" design pattern and certainly before JSON's creation (and especially before PostgreSQL integrated it). So labeling it as a "much less harmful EAV" is not accurate, in my opinion. It is probably more accurate to say that "EAV" is a less efficient way of using the standard associative entity/table design of the relational model.
Regarding performance, I'd say the burden of proof falls squarely on the jsonb type, since it's the new kid on the block competing with an established standard. I don't disregard JSON's flexibility; I'm just saying that many large businesses did just fine before jsonb - even in scenarios with many attributes. So if you ever do a benchmark between the two, I'd be one of the first to read it!