Although I installed my first PostgreSQL database on RedHat Linux (not RHEL) 6.0 around 1999, I probably limited my interaction with this 6.5 release to starting the server with SysinitV scripts. PostgreSQL was at that time an underdog to MySQL (or so I thought) so it was cool. At that time I knew nothing about SQL (except the name), databases or data modelling.
Table of Contents
Not long after that, I found myself a student of a big technical university - and to my surprise, I was hired by the university IT centre to help with the deployment of a brand-new student records management system being developed internally. Suddenly, without any magical improvement in my knowledge and skills, I became an Oracle DBA. You’ll learn on the job, they said. Indeed, on the job I learned that Big Red is a thing (or rather — a beast).
I served my days as best as I could, sitting between two huge ERD diagrams (A0 size) I printed out and meticulously glued together from smaller sheets. I still was not much smarter about data modelling, but since the system was developed by the university’s most experienced academic staff, it was supposed to be designed correctly. And correctly designed it was.
(no money for anything more that 70 or so named user licenses — Win SQLPlus was all I had) I was trying to find my way through the maze of tables, views and foreign keys. I found myself writing ad-hoc reports and queries, correcting operators’ mistakes, and so on. Over the years, as if by some magic influence, the ancient knowledge of how to properly decompose and normalize data diffused into my mind, my thinking and my reflexes from those two huge diagrams, as if by osmosis.
Since then, whenever I see a database schema, I can immediately say whether or not it feels right. Much later on, I learned the reasons behind those feelings.
And yes, the database was designed in (mostly) Codd’s 3rd normal form.
So… what am I talking about?
The concept of…
…is older than SQL, or practical relational database software, for that matter. It was introduced by F. Edgar Codd @ IBM in his seminal work in 1970. That work laid the foundation for all relational databases we use today, including PostgreSQL.
Unfortunately, despite the fact that these concepts have very practical applications, they are somewhat academic in nature and origin; in order to talk about them, I need to introduce some theory and define some terms.
Note: The following few paragraphs are not specific to PostgreSQL, so you can apply them to any relational database. Also, if you already know all this like the back of your hand, feel free to skip ahead as much as is comfortable for you.
The important thing to remember is: The objective of introducing the concept of normal forms was to avoid…
…by means of the elimination of all possible data duplication. That’s what it is really all about:
How do anomalies look in practice? Let’s imagine we are back in the early 1970’s, and our thriving paperback romance bookstore uses an application that keeps all data in one set of records (e.g., in an ISAM file):
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
Angie Nadajoskova | Bratislava, SK | Irritated soul | 234 | $23.99 |
Rommy la Dequell | Altdorf, AT | Desire to suffer | 456 | $1.99 |
And we need to add another new author we signed up, but do not yet have details about the new thriller-romance he is writing:
Michael Birg | Bludenz, AT | $4.99 |
First, our best selling author of two books (record 1 & 3) has an inconsistency: Is Altdorf in Hungary or in Austria? Or has the author moved? Second, when the book “Irritated Soul” ran out of print, while removing it from the record we also lost all the information about the author - that we might very well want to keep (record 2). And finally, the new record cannot be added — it just fails to deliver the essential information. We cannot use it as it is.
These problems might seem naive, or even ridiculous today, now that we are all accustomed to the fantastic capabilities of PostgreSQL and other RDBMSes. But remember, this is 50 years ago when all joins were programmed by hand, and more importantly, remember that those anomalies can still appear today, although in a less obvious form.
As I do not expect anyone to have the same knowledge diffusion experience I have described above, there are a few…
…and terms I need to clarify first:
We have a record (also called a tuple or row) consisting of named and typed fields (also called attributes or columns). A collection of identically typed rows is a table (also known as a relation). Fields in a record or records in a table are not guaranteed to have any particular order. So far so good; all users of any relational database should already be familiar with these terms.
Now about…
A trivial way to identify any row in a table is to create a key that is a combination of all fields: this is the trivial superkey (values which are part of a key are underlined below)
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
It is usually possible to create superkeys that can still identify rows, but do not contain all fields.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
or
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
...it is a prime attribute (or key attributes). If it does not, it is a non-prime attribute (non-key attribute). This distinction is important in describing what can depend on what. E,g, in the examples above, the numbers of pages or the prices may be the same for other books; there is no functional dependency.
If it is no longer possible to remove a field from a superkey without losing its ability to identify a row, we have come to a candidate key — there may be more than one such combination of fields.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
or
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
To reliably access any particular row and enforce uniqueness, one of the candidate keys must be selected as the primary key — it consists of a designated combination of one or more columns (when there are multiple columns it is a compound key), and their combined contents must be unique and non empty.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
Keys of the record can be natural — when they are derived from natural, inherent attributes of a real live thing described by the record — or surrogate. A natural key comes straight out of one or more than one existing columns; an example would be a social security number. A surrogate key has its own, separate column and is artificially created due to the need to uniquely identify each record. Surrogate keys have their benefits: they can be easily generated, cheaply stored and sorted (bigint). If necessary, they can have guaranteed global uniqueness (UUID). This type of key is often automatically used when a schema is autogenerated by some persistence framework or ORM.
they do not say anything about the actual properties of whatever real-world object the row describes. As a consequence, they cannot tell us anything about it; in particular, they can’t identify its uniqueness. As such, the only uniqueness a surrogate key can guarantee is their own one. And that means that in practice, duplicates can occur; the very thing normalization is trying to avoid.
On the other hand, since in the real world, objects are identified rather by a combination of features and not by a single one, natural primary keys are usually compound — that makes them rather inconvenient to use in practice. Also, repeating them in other tables as foreign keys leads to data repetition in the form of multiple rows and tables containing the same (possibly long) values over and over again which is just an inefficient use of space.
Since we haven't even started normalizing, and yet the use of a surrogate key in the role of primary key is common practice, here is an example:
Let’s create a table describing people, and let’s include a sufficient variety of attributes to capture their uniqueness: name, surname, place and date of birth, height, eye-colour, net worth, favourite dish, pet name and so on. Let’s decide on the combination of these attributes that will for our use case uniquely identify each person. Now, that exact combination will be our compound primary key — so that the database can assure only one record with exactly these values of selected attributes could exist in a table.
However, if we just decide to give such a person a number and make it a primary key, nothing will prevent us from having multiple identical records — for all intents and purposes, describing exactly the same person — differing just by the meaningless value of that one artificially assigned number (row 1 & 4) or worse, containing small differences caused by typos or mistakes (row 3). Note however, that a typo made in row 2 would not have been prevented anyway, since it happened in a column which was originally part of a primary key.
Id | Author | Origin | Title | Pages | Price |
1 | Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
2 | Rommy la Dequell | Altdorf, AT | Power of silence | 123 | $0.99 |
3 | Rommy la Dequel | Altdorf, HU | Power of silence | 123 | $0.99 |
4 | Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
Note: To cope with the issue of assigning uniqueness and yet still help the database do its job in an efficient manner, the primary key should always be natural, compound if necessary to assure uniqueness. To conveniently refer to the table rows, however, create one additional surrogate key and make it unique. Then we will separate the concern of assuring row uniqueness (server just in one place by primary key) from the concern of providing a way to conveniently refer to that row from other tables (server by column with just unique constraint). That will suffice. See the CYBERTEC blog on unique constraints for more info.
The last concept I need to introduce is the functional dependency between attributes. An attribute is functionally dependent on another attribute if its value always appears with the particular value of an attribute it is dependent on. This dependency is unidirectional. You can imagine it as a function or as a mapping table, unequivocally transforming one value into another.
Trivial example would be to have two columns expressing some physical property (e.g. weight), one metric, the other – imperial. The value of one is functionally dependent on the other, since the conversion formula is well known.
Less trivial example would be to assign a less precise property (e.g. category) based on some more specific property (e.g. price or number of pages). There is somewhere (usually in application) a precise formula to derive first from the second. If it is applied consistently and without exceptions — we have a functional dependency. That dependency can be encoded in an additional table in relation form and then calculated upon each query or in a view. Note however, that if there is a business intention to have exceptions (that is to assign category in an arbitrary way — even if only incidentally) there is no strict functional dependency and it needs to be reflected in schema design.
Back to normal forms —some are mostly…
…to daily practice, but with the constant evolution of database software capabilities (especially PostgreSQL) the division “relevant/ not relevant” should be taken with a grain of salt.
Each normal form has a precise requirement that must be satisfied, and builds upon the requirement expressed in the previous form, so let’s talk about them in order:
Person | Born in | Born on | Mother’s name | Born in Land | Mother born in | Mother born on | Lived in | ||
Anna Gruber | Klagenfurt | 12.10.1967 | Angela Bauer | Kärnten | Salzburg | 30.02.1948 |
| ||
Betty Huber | Bregenz | 14.05.1989 | Regina Wagner | Vorarlberg | Graz | 22.11.1970 |
|
Person | Born in | Born on | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
Person | Born in | Born on | Lived in |
Anna Gruber | Klagenfurt | 12.10.1967 | Wien |
Anna Gruber | Klagenfurt | 12.10.1967 | Linz |
Betty Huber | Bregenz | 14.05.1989 | Linz |
Betty Huber | Bregenz | 14.05.1989 | Salzburg |
Person | Born in | Born on | Person ID | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
Person ID | Lived in |
1 | Wien |
1 | Linz |
2 | Linz |
2 | Salzburg |
Person | Born in | Born on | Person ID | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
Person | Born in | Born on | Person ID | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Regina Wagner | Graz | 22.11.1970 |
This assumes that for the cities table below, the city name and only the city name is a primary key (which usually is not the case in the real world).
City | Land |
Klagenfurt | Kärnten |
Bregenz | Vorarlberg |
Person | Born in | Born on | Person ID | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Regina Wagner | Graz | 22.11.1970 |
Person | Born in | Born on | Person ID | Mother ID |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | 3 |
Betty Huber | Bregenz | 14.05.1989 | 2 | 4 |
Angela Bauer | Salzburg | 30.02.1948 | 3 | «null» |
Regina Wagner | Graz | 22.11.1970 | 4 | «null» |
Person | Born in | Born on | Person ID | Mother ID | Birth Post Code |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | 3 | 9073 |
Betty Huber | Bregenz | 14.05.1989 | 2 | 4 | 6900 |
Angela Bauer | Salzburg | 30.02.1948 | 3 | «null» | 5020 |
Regina Wagner | Graz | 22.11.1970 | 4 | «null» | 8032 |
Person | Birth Post Code | Born on | Person ID | Mother ID |
Anna Gruber | 9073 | 12.10.1967 | 1 | 3 |
Betty Huber | 6900 | 14.05.1989 | 2 | 4 |
Angela Bauer | 5020 | 30.02.1948 | 3 | «null» |
Regina Wagner | 8032 | 22.11.1970 | 4 | «null» |
Post Code | City | Land |
9073 | Klagenfurt | Kärnten |
6900 | Bregenz | Vorarlberg |
5020 | Salzburg | Salzburg |
8032 | Graz | Steiermark |
This is usually where information about normal forms ends, particularly because the remaining normal forms are the ones any practitioner would usually not be able to quote from memory, and are generally considered…
…for practical applications. This however is not quite true as some are currently — even non-deliberately — applied in day-to-day design practice and it is a good thing.
These forms were introduced after the original Edgar F. Codd work. Tables violating them actually can happen, and should be taken into account. I have given you examples below:
Trainer | Course | Language |
Michal | PostgreSQL Professional | EN |
Michal | PostgreSQL Professional | PL |
Pavlo | PostgreSQL Professional | EN |
Pavlo | PostgreSQL Professional | SK |
Michal | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | PL |
Hans | Administration & Performance Tuning | EN |
Hans | Administration & Performance Tuning | DE |
Hans | PostgreSQL Professional | EN |
Hans | PostgreSQL Professional | DE |
Trainer | Course |
Michal | PostgreSQL Professional |
Michal | Administration & Performance Tuning |
Pavlo | PostgreSQL Professional |
Hans | PostgreSQL Professional |
Hans | Administration & Performance Tuning |
Trainer | Language |
Michal | EN |
Michal | PL |
Pavlo | EN |
Pavlo | SK |
Hans | EN |
Hans | DE |
Trainer | Course | Language |
Michal | PostgreSQL Professional | EN |
Michal | PostgreSQL Professional | PL |
Pavlo | PostgreSQL Professional | EN |
Pavlo | PostgreSQL Professional | SK |
Pavlo | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | PL |
Hans | Administration & Performance Tuning | EN |
Hans | Administration & Performance Tuning | DE |
Hans | PostgreSQL Professional | EN |
Hans | PostgreSQL Professional | DE |
Trainer | Course |
Michal | PostgreSQL Professional |
Michal | Administration & Performance Tuning |
Pavlo | PostgreSQL Professional |
Hans | PostgreSQL Professional |
Hans | Administration & Performance Tuning |
Trainer | Language |
Michal | EN |
Michal | PL |
Pavlo | EN |
Pavlo | SK |
Hans | EN |
Hans | DE |
Course | Language |
PostgreSQL Professional | EN |
PostgreSQL Professional | DE |
PostgreSQL Professional | PL |
PostgreSQL Professional | SK |
Administration & Performance Tuning | EN |
Administration & Performance Tuning | DE |
Administration & Performance Tuning | PL |
Course | Hours | Days | Length | Pace |
PostgreSQL Professional | 15 | 3 | normal | standard |
Administration & Performance Tuning | 25 | 5 | long | standard |
High Availability & Patroni | 24 | 3 | normal | intense |
Course | Hours | Days |
PostgreSQL Professional | 15 | 3 |
Administration & Performance Tuning | 25 | 5 |
High Availability & Patroni | 24 | 3 |
Length | min days | max days |
short | 1 | 1 |
normal | 2 | 3 |
long | 4 | 5 |
Pace | min hours per day | max hours per day |
introductory | 3 | 4 |
standard | 5 | 5 |
intense | 6 | 8 |
Now that you are familiar with data normalization in PostgreSQL and normal forms, keep alert for my next blog on the practical usage of these forms. With a little practice, you'll soon have an intutitve feeling for which form belongs to what situation. Or, you can try printing out this blog and posting it on the wall behind your computer, if you too want to try out "Normalization by Osmosis!"
To continue reading, see Part 2 of this series, Practical Examples of Data Normalization.
You may also be interested in this CYBERTEC blog on auto-generating primary keys.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
One of the excellent post I read anytime recently. I liked the way normalization is explained with the help of anomalies. Awesome job. This article is worth a book.