The concept of "normalization" is often the first thing people who are new to databases are going to learn. We are talking about one of the fundamental principles in the realm of databases. But what is the use of normalization in the first place? Well, we want to avoid redundancies in the data and make sure that information is stored in a way that helps reduce mistakes and inconsistencies. Ultimately, that is all there is to it: No redundancies, no mistakes, no inconsistencies.
Before we dive into more detail, we will first take a look at a simple example of a normalized table in PostgreSQL. The purpose of the following data structure is to store products and categories:
CREATE TABLE t_category (
    id       int PRIMARY KEY,
    cat_name text
);

INSERT INTO t_category
VALUES (1, 'Shoes'),
       (2, 'Skiing equipment'),
       (3, 'Plants');

CREATE TABLE t_product (
    id           int PRIMARY KEY,
    cat_id       int REFERENCES t_category (id)
                     ON UPDATE CASCADE
                     ON DELETE CASCADE,
    product_name text,
    price        numeric
);

INSERT INTO t_product
VALUES (1, 1, 'Blue boots', 154.98),
       (2, 1, 'Yellow flip flop', 32.79),
       (3, 2, 'Blizzard carver', 459.98),
       (4, 3, 'Oak', 17.95),
       (5, 3, 'Birch', 15.39),
       (6, 3, 'Elder flower', 9.75);
The goal here is to store each piece of information only once. In this case, that means the name of each product category is stored exactly once: "Shoes" and "Plants" do not show up in the product table - instead, we only store the category ID. The advantage is that we can rename a category without having to modify any of its products.
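One detail worth pointing out as an aside (we won't actually run this, since we still need the data): the ON UPDATE CASCADE ON DELETE CASCADE clauses mean that changes to a category's ID propagate to its products, and that deleting a category deletes all of its products as well:

-- careful: because of ON DELETE CASCADE this would also
-- remove 'Blizzard carver' from t_product
DELETE FROM t_category WHERE id = 2;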
A join will reveal the content of our two simple tables:
test=# SELECT p.*, c.cat_name
       FROM   t_category AS c, t_product AS p
       WHERE  c.id = p.cat_id
       ORDER BY p.id;
 id | cat_id |   product_name   | price  |     cat_name
----+--------+------------------+--------+------------------
  1 |      1 | Blue boots       | 154.98 | Shoes
  2 |      1 | Yellow flip flop |  32.79 | Shoes
  3 |      2 | Blizzard carver  | 459.98 | Skiing equipment
  4 |      3 | Oak              |  17.95 | Plants
  5 |      3 | Birch            |  15.39 | Plants
  6 |      3 | Elder flower     |   9.75 | Plants
(6 rows)
So far, normalization has just been one more concept in the vast ocean of database ideas. However, it becomes relevant as soon as we update categories:
test=# UPDATE t_category
       SET    cat_name = 'Footwear'
       WHERE  id = 1
       RETURNING *;
 id | cat_name
----+----------
  1 | Footwear
(1 row)

UPDATE 1
The important point here is that we changed data in a single place and thereby affected multiple rows without ever touching them. This gives us something of great importance: consistency and integrity.
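To double-check, we can simply re-run the join from before, restricted to the renamed category. Given the data above, both shoe products now carry the new name, even though their rows were never modified:

test=# SELECT p.id, p.product_name, c.cat_name
       FROM   t_category AS c, t_product AS p
       WHERE  c.id = p.cat_id AND c.id = 1;
 id |   product_name   | cat_name
----+------------------+----------
  1 | Blue boots       | Footwear
  2 | Yellow flip flop | Footwear
(2 rows)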
Normalization sounds like the "holy grail" of database design; however, as with everything in life, there are no simple answers that can blindly be applied to all problems. The same is true for normalization. So when does it break down? When is over-normalization essentially a bug?
Consider the following example:
test=# CREATE TABLE t_invoice (
           id         int PRIMARY KEY,
           tstamp     date,
           units      int,
           product_id int REFERENCES t_product (id)
       );
CREATE TABLE
test=# \d t_invoice
                Table "public.t_invoice"
   Column   |  Type   | Collation | Nullable | Default
------------+---------+-----------+----------+---------
 id         | integer |           | not null |
 tstamp     | date    |           |          |
 units      | integer |           |          |
 product_id | integer |           |          |
Indexes:
    "t_invoice_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "t_invoice_product_id_fkey" FOREIGN KEY (product_id) REFERENCES t_product(id)
What is the problem here? We have created a table that stores invoices. It contains a listing of products we have sold.
Let us see what happens:
test=# INSERT INTO t_invoice
       VALUES (1, '2025-01-22', 22, 2),
              (2, '2025-01-23', 19, 2)
       RETURNING *;
 id |   tstamp   | units | product_id
----+------------+-------+------------
  1 | 2025-01-22 |    22 |          2
  2 | 2025-01-23 |    19 |          2
(2 rows)

INSERT 0 2
In this example, imagine that we sold shoes twice in the past. What we want to do next is pull a report to figure out how much turnover we made.
Here is how it works:
test=# SELECT *
       FROM   t_product AS p, t_invoice AS v
       WHERE  v.product_id = p.id;
 id | cat_id |   product_name   | price | id |   tstamp   | units | product_id
----+--------+------------------+-------+----+------------+-------+------------
  2 |      1 | Yellow flip flop | 32.79 |  1 | 2025-01-22 |    22 |          2
  2 |      1 | Yellow flip flop | 32.79 |  2 | 2025-01-23 |    19 |          2
(2 rows)

test=# SELECT sum(units * price)
       FROM   t_product AS p, t_invoice AS v
       WHERE  v.product_id = p.id;
   sum
---------
 1344.39
(1 row)
We made a stunning 1344.39 Postgres bucks. But what is the problem? Remember, we used a foreign key to link the invoice table to the product table. As the following listing reveals, this was a big mistake:
test=# UPDATE t_product
       SET    price = 984.34
       WHERE  id = 2
       RETURNING *;
 id | cat_id |   product_name   | price
----+--------+------------------+--------
  2 |      1 | Yellow flip flop | 984.34
(1 row)

UPDATE 1

test=# SELECT sum(units * price)
       FROM   t_product AS p, t_invoice AS v
       WHERE  v.product_id = p.id;
   sum
----------
 40357.94
(1 row)
As soon as product prices change, our reporting data is destroyed. The problem is that we are not interested in the current price of the product - for a report, we need the price as it was back when we sold those shoes. By fully normalizing, we made a major mistake and destroyed vital sales statistics.
The question that naturally arises is: what is the better way to store this type of data? We have to store the data redundantly - capturing the values as they were at the time of sale.
Here is how we might want to store the data:
test=# CREATE TABLE t_invoice (
           id            int4 PRIMARY KEY,
           tstamp        date,
           units         int,
           product_name  text,
           product_price numeric
       );
CREATE TABLE
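(Note that this reuses the name t_invoice, so it assumes the previous version of the table has been dropped first - e.g. with DROP TABLE t_invoice;.)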
The name of the product, as well as its price, have to be stored as fields in the invoice table itself, so that the row does not change later on. Products might be removed, and prices as well as names might change - our tables containing historic data have to preserve the state at the time of the transaction.
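One way to capture those values (a sketch, assuming an application records sales against the product table shown earlier) is to copy name and price from t_product at the moment the sale happens:

-- hypothetical sale: snapshot the product's name and current price
INSERT INTO t_invoice (id, tstamp, units, product_name, product_price)
SELECT 3, '2025-01-24', 5, p.product_name, p.price
FROM   t_product AS p
WHERE  p.id = 2;

A report based on this table uses only the stored values, so later price changes in t_product can no longer distort it:

SELECT sum(units * product_price) FROM t_invoice;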
As simple as it sounds, this type of mistake happens frequently in real-world applications, and developers have to be aware of these issues when designing proper data structures (especially when generative AI is used to create table definitions).
Read more about "normalization" in other blog postings:
Data Normalization in PostgreSQL
Practical Examples of Data Normalization in PostgreSQL