Summary: In this tutorial, we will learn what is normalization and what is its purpose in DBMS. In addition, we will also look at different types of normal forms (such as 1NF, 2NF, 3NF, and BCNF) and learn how to normalize data or tables in a SQL database.
What is Normalization in DBMS?
Normalization in DBMS is the process of structuring a database, usually a relational database, in such a way that there is minimum data redundancy and better data integrity.
In normalization, the tables of relational databases are structured according to a series of so-called normalization forms such as 1NF, 2NF, 3NF, etc, so that the data redundancies and anomalies do not occur.
If the database is not normalized and its tables have data redundancies (have multiple copies of the same data), it may cause anomalies in the DBMS.
Anomalies in DBMS
Anomalies in DBMS are any undesirable side-effects that arise in a database, that have not been sufficiently normalized, when an attempt is made to modify (update, insert, or delete) a relation.
There are three types of anomalies that occur when a database is not normalized. These are: update anomaly, insert anomaly and, delete anomaly.
Example of Update, Delete and, Insert Anomalies
For example, suppose there is a ‘user_details’ table in a SQL database that contains the following records:
id | name | dob | zip | city | language |
---|---|---|---|---|---|
1 | Harry | 15/05/1995 | 563503 | NY | English |
1 | Harry | 15/05/1995 | 563503 | NY | Korean |
2 | Sam | 01/12/2000 | 563503 | NY | English |
3 | Rock | 15/05/1995 | 856246 | DL | Japanese Hindi |
Clearly, the table has a lot of duplicate values and is not normalized. If we make any attempt to modify the above table the following anomalies may occur:
Update Anomaly: The above table has two rows for Harry. If we have to update the zip of Harry, we need to update both the rows for the data to be consistent. If by chance only one row is updated then Harry will have two different zip values in the same database, which is inconsistent.
Insert Anomaly: Every row of the table has non-nullable values for zip and city. If we have to insert a new user he should have an associated zip and city otherwise, we may not be able to insert him in the database table.
Delete Anomaly: The Rock in the above table is the only row that has ‘DL’ as a city. If we delete the row, we will also lose the only available information about the city ‘DL’.
To prevent such anomalies from happening, we need to normalize the table into suitable normal forms.
Types of Normal Forms
These are some of the normal forms which are enough to have a well-normalized database:
Normal Form | Description |
---|---|
1NF | A table is in 1NF if every cell of the table has an atomic value (cells cannot have tables as values). |
2NF | A table is in 2NF if it is in 1NF and has no partial functional dependencies of non-prime attributes on candidate keys. |
3NF | A table is in 3NF if it is in 2NF and has no transitive functional dependencies of non-prime attributes on candidate keys. |
BCNF (3.5NF) | A table is in BCNF if it is in 3NF and every non-trivial functional dependency begins with a superkey. |
4NF | A table is in 4NF if it is in BCNF and every non-trivial multivalued dependency begins with a superkey. |
To know more advanced normal forms, refer here.
First Normal Form (1NF)
The last row of the ‘user_details’ table contains two values for the language column:
id | name | dob | zip | city | language |
---|---|---|---|---|---|
1 | Harry | 15/05/1995 | 563503 | NY | English |
1 | Harry | 15/05/1995 | 563503 | NY | Korean |
2 | Sam | 01/12/2000 | 563503 | NY | English |
3 | Rock | 15/05/1995 | 856246 | DL | Japanese Hindi |
The table is clearly is not in 1NF as the cell doesn’t have an atomic value.
To transform the table in 1NF, we have to make the cell atomic as follows:
id | name | dob | zip | city | language |
---|---|---|---|---|---|
1 | Harry | 15/05/1995 | 563503 | NY | English |
1 | Harry | 15/05/1995 | 563503 | NY | Korean |
2 | Sam | 01/12/2000 | 563503 | NY | English |
3 | Rock | 15/05/1995 | 856246 | DL | Japanese |
3 | Rock | 15/05/1995 | 856246 | DL | Hindi |
Second Normal Form (2NF)
The table is in 1NF but not in 2NF because the table has partial functional dependencies:
partial functional dependencies: id → name, id → dob
id | language | name | dob | zip | city |
---|---|---|---|---|---|
1 | English | Harry | 15/05/1995 | 563503 | NY |
1 | Korean | Harry | 15/05/1995 | 563503 | NY |
2 | English | Sam | 01/12/2000 | 563503 | NY |
3 | Japanese | Rock | 15/05/1995 | 856246 | DL |
3 | Hindi | Rock | 15/05/1995 | 856246 | DL |
candidate key: {id, language}
non-prime attributes: name, dob, zip, city
The non-prime attributes (name and dob) have functional dependencies on id, which is the proper subset of the candidate key.
To convert the table to 2NF, we must have a separate table for the name and dob column.
id | language | zip | city |
---|---|---|---|
1 | English | 563503 | NY |
1 | Korean | 563503 | NY |
2 | English | 563503 | NY |
3 | Japanese | 856246 | DL |
3 | Hindi | 856246 | DL |
id | name | dob |
---|---|---|
1 | Harry | 15/05/1995 |
2 | Sam | 01/12/2000 |
3 | Rock | 15/05/1995 |
Third Normal Form (3NF)
The ‘user_details’ table is now in 2NF but not in 3NF. It is because it has a transitive functional dependency:
transitive functional dependency: {id, language} → zip, zip → city
id | language | zip | city |
---|---|---|---|
1 | English | 563503 | NY |
1 | Korean | 563503 | NY |
2 | English | 563503 | NY |
3 | Japanese | 856246 | DL |
3 | Hindi | 856246 | DL |
candidate_key: {id, language}
non-prime attributes: zip, city
id | name | dob |
---|---|---|
1 | Harry | 15/05/1995 |
2 | Sam | 01/12/2000 |
3 | Rock | 15/05/1995 |
The columns in a candidate key are called prime attributes, and a column that does not occur in any candidate key is called a non-prime attribute.
The table has a transitive functional dependency because a non-prime attribute (zip) is determining another non-prime attribute (city).
To make the table in accordance with 3NF, we have to separate zip and city from the user_details table as follows.
id | language | zip |
---|---|---|
1 | English | 563503 |
1 | Korean | 563503 |
2 | English | 563503 |
3 | Japanese | 856246 |
3 | Hindi | 856246 |
zip | city |
---|---|
563503 | NY |
856246 | DL |
Boyce Codd Normal Form (BCNF)
For the table to be in BCNF, every non-trivial functional dependency should begin with a super key.
A dependency X → Y is non-trivial if Y is not a subset of X.
There is no table that we have normalized so far that have a dependency that does not start with a super key:
- user_zips: {id, language} → zip
- zips: {zip} → city
- user_names: {id} → name, {id} → dob
Therefore all the above tables are in BCNF.
Fourth Normal Form (4NF)
A table or relation is in 4NF if, for every one of its non-trivial multivalued dependencies XY, X is a superkey.
A multivalued dependency exists when there are at least three attributes (like X,Y and Z) in a relation, Y and Z are dependent on X, however, the set of values of Y is independent of set Z and vice versa.
Wikipedia
In the following ‘user_zips’ table because the language is attached to the id and the zip is attached to the id are independent of each other, the table has a multivalued dependency.
id | language | zip |
---|---|---|
1 | English | 563503 |
1 | Korean | 563503 |
2 | English | 563503 |
3 | Japanese | 856246 |
3 | Hindi | 856246 |
If we were to add a new language for id=1, we would have to add one record for each of the zip on that id, and vice versa.
Overall, there are two multivalued dependencies on this table:
- id
language
- id
zip
Similarly, the ‘user_names’ table has two multivalued dependencies:
- id
name
- id
dob
id | name | dob |
---|---|---|
1 | Harry | 15/05/1995 |
2 | Sam | 01/12/2000 |
3 | Rock | 15/05/1995 |
However, in the ‘user_names’ table, the multivalue dependency begins with a superkey, whereas in the ‘user_zips’ table the dependency begins from id which is not a super key but a subset of the super key.
To make the ‘user_zips’ table in 4NF, we have to separate the id language and id
zip into two different tables as follows:
id | language |
---|---|
1 | English |
1 | Korean |
2 | English |
3 | Japanese |
3 | Hindi |
id | zip |
---|---|
1 | 563503 |
2 | 563503 |
3 | 856246 |
Conclusion
Normalization in DBMS is the process to structure relation or talbes in a database in such a way such that there is minimum data redundancies and improved data integrity.
The objectives of normalization beyond 1NF (first normal form) as stated by Codd is as follows:
- To free the database from undesirable insert, update and delete anomalies.
- To reduce the need for restructuring of tables, as new types of data are introduced.
- To make the relational model more informative to users.