Having gone through a theoretical introduction in part 1, it is now time to discuss some more practical examples. I'll first talk about the…
… of columns. It is important to realize that at the time Codd formulated the first three normal forms (see Part 1 for more details), thinking about the data types a database should support was shaped by the relatively primitive set of types available in commonly used programming languages. The model itself makes no assumptions about types, but the reality of using PostgreSQL today - with its fabulously rich set of available data types - is very, very different. Particularly once we realize how many of these types are considered simple, which of them can really be considered…
…and, when and why is it appropriate to see them as such?
For the most extreme example of such structures, we can consider simple textual data, namely the char, varchar and text types. In their most simplistic form (hello, C) they can be treated as arrays of bytes, accessible by offset. With a multibyte encoding like UTF-8, that becomes an array of characters. Not very practical, but thanks to PostgreSQL's rich set of operators it is still quite doable.
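As a minimal sketch of that offset-based view (the notes table and its body column are invented purely for illustration):

```sql
-- Hypothetical table, just to have something to select from
CREATE TABLE notes (id bigint PRIMARY KEY, body text);

-- Access by offset: the 3rd character, and characters 5 through 9
SELECT substr(body, 3, 1),
       substring(body FROM 5 FOR 5)
FROM   notes;

-- Or split the value into an actual array of characters
SELECT string_to_array(body, NULL) FROM notes;
```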
On the level of standard features, a database offers capabilities for performing equal/unequal comparisons and some other scalar operations, and for accelerating those operations with indices.
Textual data can be processed and constrained with pattern matching (e.g. regular expressions), and those operations can be accelerated by GIN indices using trigrams. Fuzzy matching can be performed (with trigrams or Levenshtein distance), answering questions in a more subtle manner than a plain true/false. A domain can be defined over a typical textual type, giving it much more internal structure. That means the database starts to understand its data better.
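To sketch what that can look like (the customers table and its email column are assumptions made up for this example):

```sql
-- Trigram support for accelerating pattern and fuzzy matching
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX customers_email_trgm_idx
    ON customers USING gin (email gin_trgm_ops);

SELECT * FROM customers WHERE email ILIKE '%@example.org';   -- pattern match
SELECT * FROM customers WHERE email % 'jon.doe@exmple.org';  -- trigram similarity
-- levenshtein() from the fuzzystrmatch extension is another fuzzy option

-- A domain gives a plain text type more internal structure
CREATE DOMAIN email_address AS text
    CHECK (VALUE ~ '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$');
```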
Going even further, a full-text search mechanism can be defined over textual data, turning the array of characters into information: it has a language, discernible data items (email addresses, phone numbers, prices, words), and some of those items carry additional structure that can be processed, such as a base form that can be derived from them or synonyms to other base forms.
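A minimal full-text search setup, again assuming the hypothetical notes(body) column from above, could look like this:

```sql
-- Generated tsvector column plus a GIN index (PostgreSQL 12+)
ALTER TABLE notes
    ADD COLUMN body_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;

CREATE INDEX notes_body_tsv_idx ON notes USING gin (body_tsv);

-- Matches all documents containing word forms derived from 'normalize'
SELECT id FROM notes
WHERE  body_tsv @@ to_tsquery('english', 'normalize');
```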
Other “base” types can provide similarly extended capabilities — date, timestamp and interval, despite being simple scalars under the hood, are equipped with so many operators that their internal structure, intricate and irregular, can be both processed naturally and accessed easily.
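A rough illustration of that richness (the dates are chosen arbitrarily):

```sql
SELECT now() + interval '3 months'                          AS due_date,
       age(timestamp '2024-02-29', timestamp '1989-11-09')  AS elapsed,
       date_trunc('week', now())                            AS week_start,
       extract(isodow FROM date '2024-12-24')               AS weekday;
```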
And then there are types openly defined by PostgreSQL as complex, and thus violating 1NF outright: starting in the twilight zone with geometric types and range types defined over scalar types, and then boldly moving (in historical order) into the realm of arrays, composite types, hstore, XML and JSON. These come with extremely rich sets of operators, operator classes and index methods that allow you to process them efficiently and conveniently at the database level, without having to move data back and forth into the application realm.
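As a sketch of just two of those families, ranges and arrays, here is an invented bookings table that combines a range type, an exclusion constraint and an array column:

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;   -- needed for "room WITH =" below

CREATE TABLE bookings (
    room   int,
    during tstzrange,
    tags   text[],
    EXCLUDE USING gist (room WITH =, during WITH &&)  -- no overlapping bookings per room
);

-- Range and array operators, usable in WHERE clauses and index scans
SELECT *
FROM   bookings
WHERE  during @> now()            -- the range contains this point in time
AND    tags   && ARRAY['vip'];    -- the arrays overlap
```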
At the end of the day, what counts is not whether data elements (fields) are really atomic, but whether they are at the same time…
…inaccessible from within a database. The very essence of an opaque structure (or file) is that the database kernel neither “understands” its content nor does it have any specific means of processing it. Two archetypical examples: First, a JPG image is stored in a bytea field. For the database kernel it is just a string of bytes. Second, a CSV file is stored in a text field. The database can sort the column according to text collation, can access words inside it using string processing functions, and can even apply a full-text search. But the database has no “idea” that the field stores a set of records with syntax and meaning.
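To illustrate with a made-up CSV line: the database can cut the string apart, but it attaches no meaning to the pieces.

```sql
-- The string can be manipulated with text functions ...
SELECT string_to_array('Doe,John,1989-11-09', ',');
-- {Doe,John,1989-11-09}

-- ... but nothing tells the database that the third element is a date,
-- so a row like 'Doe;John;yesterday' would be stored just as happily.
```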
For PostgreSQL, formats like XML and JSON are certainly not opaque structures. By storing them as plain text we are not using all of the server's capabilities and are essentially wasting its potential, both for simplifying the application code and for achieving the best performance.
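As a sketch of the difference (the events and events_text tables and the payload column are invented for illustration):

```sql
-- Stored as text: every access needs a cast, and invalid JSON is only
-- detected at query time
SELECT payload::jsonb -> 'customer' ->> 'name'
FROM   events_text;

-- Stored as jsonb: validation on insert, rich operators, and GIN indexing
CREATE INDEX events_payload_idx ON events USING gin (payload jsonb_path_ops);

SELECT *
FROM   events
WHERE  payload @> '{"status": "shipped"}';
```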
At any rate, it is important to distinguish between two use cases:
Building a hybrid data model which encompasses both a relational and non-relational approach is possible and well supported in PostgreSQL. I will discuss hybrid data models in the next part. In any case, even if a formal approach like normalization cannot be directly applied, it is important to remember what its objective is: to reduce data duplication and unintentional dependencies, in this way reducing the opportunity for anomalies (inconsistencies and errors) to occur.
To end our discussion of practical examples of data normalization, I want to talk about a few typical situations…
…in simple cases like deciding how to model structures on the level of a single (or a few) columns.
When trying to create a representation of various real-life objects in a database, it is often helpful to remember that…
People do not have primary keys
…and not only real people, for that matter. Things that in daily use might seem to have a clearly identifiable structure that everyone intuitively understands might not really need that structure from the modeling point of view:
And I am not talking about some requirements that might occur in the future — I am referring specifically to requirements that are known now to the designer but might be masked by assumptions and preconceptions.
Some situations where this can manifest itself are:
All of them have a particularly nasty habit of popping up when the set of values to be stored in the database is a bit more international than the designer expected.
All of these situations, which in theory would aid in finding and modeling the data structures within the database, can in practice make data less useful and more prone to errors, require lots of work on ever more elaborate data validation and sanitization rules, and in the end lead to a much less stable schema than anyone would like.
Yes, there are situations where modeling must be done precisely:
However, neither of those will make the use of data easier or more natural.
So all the benefits provided by a well-normalized schema should be balanced against project practicalities and the designer's ability to support his work; otherwise, it can easily turn into an exercise in…
…with unmaintainable, underperforming code delivering an overall bad experience to all parties involved. Because while being able to answer how many users have Bettina as a second name, or to compute the distribution of family names by the building floor their bearers live on, might be an interesting possibility, it is often an exercise in futility (unless the model is supposed to support a national census application).
So before applying data normalization to a model, I would strongly recommend:
Now that I've given you an overview of practical examples of data normalization in PostgreSQL, be sure to stay tuned for the next installment of this blog, which will describe building a hybrid data model that encompasses both a relational and a non-relational approach. Hybrid data modeling is not only possible but also well-supported in PostgreSQL.
Read Part 1 of Normalization by Osmosis: Data Normalization in PostgreSQL
For further reading about data normalization, see the CYBERTEC blogpost on primary keys and unique constraints.