The concept of "normalization" is often the first thing people who are new to databases are going to learn. We are talking about one of the fundamental principles in the realm of databases. But what is the use of normalization in the first place? Well, we want to avoid redundancies in the data and make sure that information is stored in a way that helps reduce mistakes and inconsistencies. Ultimately, that is all there is to it: No redundancies, no mistakes, no inconsistencies.
Before we dive into more detail, we will first take a look at a simple example of a normalized table in PostgreSQL. The purpose of the following data structure is to store products and categories:
CREATE TABLE t_category (
    id       int PRIMARY KEY,
    cat_name text
);

INSERT INTO t_category
VALUES (1, 'Shoes'),
       (2, 'Skiing equipment'),
       (3, 'Plants');

CREATE TABLE t_product (
    id           int PRIMARY KEY,
    cat_id       int REFERENCES t_category (id)
                     ON UPDATE CASCADE
                     ON DELETE CASCADE,
    product_name text,
    price        numeric
);

INSERT INTO t_product
VALUES (1, 1, 'Blue boots', 154.98),
       (2, 1, 'Yellow flip flop', 32.79),
       (3, 2, 'Blizzard carver', 459.98),
       (4, 3, 'Oak', 17.95),
       (5, 3, 'Birch', 15.39),
       (6, 3, 'Elder flower', 9.75);
The goal here is to store each piece of information only once. In this case, that means the name of each product category is stored exactly once: "Shoes" and "Plants" do not show up in the product table - instead, we only store the category ID. The advantage is that we can rename a category without having to modify any of its products.
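One detail worth pointing out as an aside (we won't actually run this, since we still need the data): the ON UPDATE CASCADE ON DELETE CASCADE clauses mean that changes to a category's ID propagate to its products, and that deleting a category deletes all of its products as well:

-- careful: because of ON DELETE CASCADE this would also
-- remove 'Blizzard carver' from t_product
DELETE FROM t_category WHERE id = 2;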
A join will reveal the content of our two simple tables:
test=# SELECT p.*, c.cat_name
       FROM   t_category AS c, t_product AS p
       WHERE  c.id = p.cat_id
       ORDER BY p.id;
 id | cat_id |   product_name   | price  |     cat_name
----+--------+------------------+--------+------------------
  1 |      1 | Blue boots       | 154.98 | Shoes
  2 |      1 | Yellow flip flop |  32.79 | Shoes
  3 |      2 | Blizzard carver  | 459.98 | Skiing equipment
  4 |      3 | Oak              |  17.95 | Plants
  5 |      3 | Birch            |  15.39 | Plants
  6 |      3 | Elder flower     |   9.75 | Plants
(6 rows)
So far, normalization has just been one more concept in the vast ocean of database ideas. However, it becomes relevant as soon as we update categories:
test=# UPDATE t_category
       SET    cat_name = 'Footwear'
       WHERE  id = 1
       RETURNING *;
 id | cat_name
----+----------
  1 | Footwear
(1 row)

UPDATE 1
The important point here is that we changed data in a single place and thereby affected multiple rows without ever touching them. This gives us something of great importance: consistency and integrity.
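To double-check, we can simply re-run the join from before, restricted to the renamed category. Given the data above, both shoe products now carry the new name, even though their rows were never modified:

test=# SELECT p.id, p.product_name, c.cat_name
       FROM   t_category AS c, t_product AS p
       WHERE  c.id = p.cat_id AND c.id = 1;
 id |   product_name   | cat_name
----+------------------+----------
  1 | Blue boots       | Footwear
  2 | Yellow flip flop | Footwear
(2 rows)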
Normalization sounds like the "holy grail" of database design; however, as with everything in life, there are no simple answers that can blindly be applied to all problems. The same is true for normalization. So when does it break down? When is over-normalization essentially a bug?
Consider the following example:
test=# CREATE TABLE t_invoice (
           id         int PRIMARY KEY,
           tstamp     date,
           units      int,
           product_id int REFERENCES t_product (id)
       );
CREATE TABLE
test=# \d t_invoice
                Table "public.t_invoice"
   Column   |  Type   | Collation | Nullable | Default
------------+---------+-----------+----------+---------
 id         | integer |           | not null |
 tstamp     | date    |           |          |
 units      | integer |           |          |
 product_id | integer |           |          |
Indexes:
    "t_invoice_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "t_invoice_product_id_fkey" FOREIGN KEY (product_id) REFERENCES t_product(id)
What is the problem here? We have created a table that stores invoices. It contains a listing of products we have sold.
Let us see what happens:
test=# INSERT INTO t_invoice
       VALUES (1, '2025-01-22', 22, 2),
              (2, '2025-01-23', 19, 2)
       RETURNING *;
 id |   tstamp   | units | product_id
----+------------+-------+------------
  1 | 2025-01-22 |    22 |          2
  2 | 2025-01-23 |    19 |          2
(2 rows)

INSERT 0 2
In this example, imagine that we sold shoes twice in the past. What we want to do next is pull a report to figure out how much turnover we made.
Here is how it works:
test=# SELECT *
       FROM   t_product AS p, t_invoice AS v
       WHERE  v.product_id = p.id;
 id | cat_id |   product_name   | price | id |   tstamp   | units | product_id
----+--------+------------------+-------+----+------------+-------+------------
  2 |      1 | Yellow flip flop | 32.79 |  1 | 2025-01-22 |    22 |          2
  2 |      1 | Yellow flip flop | 32.79 |  2 | 2025-01-23 |    19 |          2
(2 rows)

test=# SELECT sum(units * price)
       FROM   t_product AS p, t_invoice AS v
       WHERE  v.product_id = p.id;
   sum
---------
 1344.39
(1 row)
We made a stunning 1344.39 Postgres bucks. But what is the problem? Remember, we used a foreign key to link the invoice table to the product table. As the following listing reveals, this was a big mistake:
test=# UPDATE t_product
       SET    price = 984.34
       WHERE  id = 2
       RETURNING *;
 id | cat_id |   product_name   | price
----+--------+------------------+--------
  2 |      1 | Yellow flip flop | 984.34
(1 row)

UPDATE 1

test=# SELECT sum(units * price)
       FROM   t_product AS p, t_invoice AS v
       WHERE  v.product_id = p.id;
   sum
----------
 40357.94
(1 row)
As soon as product prices change, our reporting data is destroyed. The problem is that we are not interested in the current price of the product - for a report, we need the price as it was back when we sold those shoes. By fully normalizing, we made a major mistake and destroyed vital sales statistics.
The question that naturally arises is: what is the better way to store this type of data? We have to store the data redundantly - capturing the values as they were at the time of sale.
Here is how we might want to store the data:
test=# CREATE TABLE t_invoice (
           id            int4 PRIMARY KEY,
           tstamp        date,
           units         int,
           product_name  text,
           product_price numeric
       );
CREATE TABLE
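(Note that this reuses the name t_invoice, so it assumes the previous version of the table has been dropped first - e.g. with DROP TABLE t_invoice;.)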
The name of the product, as well as its price, have to be stored as fields in the invoice table itself, so that the row does not change later on. Products might be removed, and prices as well as names might change - our tables containing historic data have to preserve the state at the time of the transaction.
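One way to capture those values (a sketch, assuming an application records sales against the product table shown earlier) is to copy name and price from t_product at the moment the sale happens:

-- hypothetical sale: snapshot the product's name and current price
INSERT INTO t_invoice (id, tstamp, units, product_name, product_price)
SELECT 3, '2025-01-24', 5, p.product_name, p.price
FROM   t_product AS p
WHERE  p.id = 2;

A report based on this table uses only the stored values, so later price changes in t_product can no longer distort it:

SELECT sum(units * product_price) FROM t_invoice;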
As simple as it sounds, this type of mistake happens frequently in real-world applications, and developers have to be aware of these issues when designing proper data structures (especially when generative AI is used to create table definitions).
Read more about "normalization" in other blog postings:
Data Normalization in PostgreSQL
Practical Examples of Data Normalization in PostgreSQL