Microsoft.NET

……………………………………………….Expertise in .NET Technologies

What Is Normalization?

Posted by Ravi Varma Thumati on December 4, 2009

Normalization is a set of rules to help database designers develop a schema that minimizes logic problems. Each rule builds on the previous rule. Normalization was adapted because the old style of putting all the data in one place, such as a file or database table, was inefficient and led to logic errors when trying to manipulate the contained data. For example, look at the Meet_A_Geek database. If you stored all the data in the Customers table, the table would look like something like the following:

Customers
Customer_ID
Last_Name
First_Name
Address
Product_Name1
Product_Cost1
Product_Picture1
Product_Name2
Product_Cost2
Product_Picture2
Order_Date
Order_Quantity
Shipper_Name

The table has been abbreviated, but it still portrays the general idea. Now, in your Customers table, how could you add a new customer? You would have to add a product and an order as well. What if you wanted to run a report that shows all the products you sell? You could not easily separate products from customers in a simple SQL statement. The beauty of a relational database, if designed correctly, is that you can do just that.

Normalization also makes things easier to understand. Humans tend to break things down to the lowest common denominator. We do it with almost everything—from animals to cars. We look at a big picture and make it less complex by grouping similar things together. The guidelines that normalization provides create the framework to break down the structure. In your sample database, It is easy to see that you have three distinct groups: customers, products, and orders. Following normalization guidelines, you would create your tables based on these groups.

The normalization process has a name and a set of rules for each phase of breakdown/grouping. This all may seem a little confusing at first, but I hope you will understand the process as well as the reasons for doing it this way. Most people are happy with a spreadsheet that holds all their pertinent data. The time it takes to break down your schema by going through the normalization process is well spent. It will require less time to go through the process than it would to cut and paste your columns of data so they fit the report the boss wants.

Another advantage to normalizing your database is space consumption. A normalized database will take up less space overall than one that is not normalized. There is less repetition of data, so the actual disk space that is consumed holding your data will be much smaller.

Degrees of Normalization

There are basically three steps of normalization. They are First Normal Form (1NF), Second Normal Form (2NF) and Third Normal Form (3NF). Each form has its own set of rules. After a database conforms to a level, it is considered normalized to that form. Say, for example, that your database conforms to all the rules of the second level of normalization. It is then considered to be in Second Normal Form. Sometimes it is not always the best idea to have a database conform to the highest level of normalization. It may cause an unnecessary level of complexity that could be avoided if it were at a lower form of normalization.

Note There are a total of nine different rules of normalization. They are First Normal Form, Second Normal Form, Third Normal Form, Boyce-Codd Normal Form, Fourth Normal Form, Fifth Normal Form or Join-Projection Normal Form, Strong Join-Projection Normal Form, Over-Strong Join-Projection Normal Form, and Domain Key Normal Form. This article will only cover the first three forms of normalization.

First Normal Form

The rule of First Normal Form states that all repeating columns should be eliminated and put into separate tables. This is a pretty easy rule to follow. Take a look at the schema for the Customers database in Table 5.1.

Table 5.1 Schema for Customers Database

Customers
Customer_ID
Last_Name
First_Name
Address
Product_Name1
Product_Cost1
Product_Picture1
Product_Name2
Product_Cost2
Product_Picture2
Order_Number
Order_Date
Order_Quantity
Shipper_Name

In Table 5.1, you have several repeating columns. They mostly deal with products. So, according to the rule, you must eliminate the repeaters and give them their own table. That’s easy to do. The resulting database tables are shown in Table 5.2.

Table 5.2 Eliminating Data Repetition in a Database

Customers Products
Customer_ID Product_Name
Last_Name Product_Cost
First_Name Product_Picture
Address
Order_Number
Order_Date
Order_Quantity
Order_Shipper
Shipper_Name

Now there are two tables. There still is a problem. There is no way currently to relate the data from the original table to the data in the new table. To do that, a key must be added to the second table to establish the relationship. To do this, add a primary key to the Products table called Product_ID, and add a key to Customers table that relates the Products table to the Customers table. The Product_ID field is an ideal candidate. The resulting tables resemble Table 5.3:

Table 5.3 First Normal Form

Customers Products
Customer_ID Product_ID
Product_ID Product_Name
Last_Name Product_Cost
First_Name Product_Picture
Address
Order_Number
Order_Date
Order_Quantity
Shipper_Name

Now, a one-to-many relationship has been established. This represents what the database will be doing in real life. The client will have many products to sell, regardless of how many customers there are to buy them. Also, a customer still needs to have ordered a product to be a customer. You are no longer obligated to add a new customer every time you add a new product to your inventory.

Bringing a database to First Normal Form solves the multiple column heading problem. Too often, inexperienced database designers will do something similar to the non-normalized table in today’s first example. They will create many columns representing the same data over and over again. In an electric company in the Northwest, there was a database that tracked nuclear power plant parts. The table in their database, which contained the part numbers, had a repeated column that numbered well into the 30s. Every time a new item was stored for this part, they created a new column to store the information. Obviously, this was a poorly designed database and a programmer’s/administrator’s nightmare.

Normalization helps to clarify the database and break it down into smaller, more understandable pieces. Instead of having to understand a huge, monolithic table that has many different aspects, you only have to understand smaller, more tangible objects and the simple relationships they share with all the other smaller objects. Needless to say, a better understanding of how a database works leads to a better utilization of your assets.

Second Normal Form

The rule of Second Normal Form states that all partial dependencies must be eliminated and separated into their own tables. A partial dependency is a term to describe data that doesn’t rely on the table key to uniquely identify it. In the sample database, the order information is in every record. It would be simpler to use just the order number. The rest of the information could reside in its own table. After breaking out the order information, your schema would resemble Table 5.4.

Table 5.4 Eliminating Partial Dependencies—Second Normal Form

Customers Products Orders
Customer_ID Product_ID Order_Number
Product_ID Order_Date Product_Name
Order_Number Product_Cost Order_Quantity
Last_Name Product_Picture
First_Name
Address
Shipper_Name

Again, by arranging the schema in this way, you have reflected the real world in your database. You would have to make some changes for your business rules to be applicable, but for illustrating normalization, this is okay.

By now you should be noticing some things. The table that was once hard to read and understand is now making more sense. Relationships between the information that is going to be stored is clearer and easier to understand. Things appear to be more logical. These are some of the advantages to normalizing a database.

One of the major disadvantages of normalization is the time it takes to do. Most people are busy enough, and to spend time making sure their data is normalized when it works just fine is perceived as a waste of time. This is not so. You will spend way more time fixing a broken, non-normalized database than you would a normalized, well-designed database.

By achieving the Second Normal Form, you enjoy some of the advantages of a relational database. For example, you can now add new columns to the Customers table without affecting the Products or the Orders tables. The same applies to the other tables. Getting to this level of normalcy allows data to fall naturally into the bounds for which it was intended.

After you have reached the level of Second Normal Form, most of the logic problems are taken care of. You can insert a record without excess data in most tables. Looking closer at the Customers table, there is a Shipper_Name column. This column is not dependant on the customer. The next level of normalization will explain how to clear this up.

Third Normal Form

The rule of Third Normal Form is to eliminate and separate any data that is not a key. This column must depend on the key for its value. All values must be uniquely identified by the key. In the sample database, the Customers table contains the Shipper_Name column. The Shipper_Name is not uniquely identified by the key. You could separate this data from the current table and put it into its own table. Table 5.5 shows the resulting database schema:

Customers Products OrderMaster OrderDetail Shippers
Customer_ID Product_ID Order_Number Order_Detail_ID Shipper_ID
Product_ID Product_Name Order_Date Order_Number Shipper_Name
Order_Number Product_Cost Order_Quantity Order_Date
Shipper_ID Product_Picture Order_Quantity
Last_Name
First_Name
Address

Table 5.5 Eliminating Non-Key Data for Third Normal Form

Now all your tables are in Third Normal Form. This provides the most flexibility and prevents any logic errors when inserting or deleting records. Each column in the table is uniquely identified by the key, and no data is repeated. This provides a clean, elegant schema that is easy to work with and easy to expand.

How Far to Take Normalization

The next decision is how far to go with normalization. Normalization is a subjective science. It is up to you to determine what needs to be broken down. If your database is just going to provide data to a single user for a simple purpose and there is little to no chance of expansion, taking your data to 3NF might be a little extreme. The rules of normalization exist as guidelines to create easily manageable tables that are flexible and efficient.

There are times when normalizing your data to the highest level doesn’t make sense. For example, suppose you added another address column to your database. It is quite normal to have two lines for an address. The table schema might look like the following:

Customer_ID
Last_Name
First_Name
Address1
Address2

According to the rules that would make this table compliant with First Normal Form, the address columns would be taken out and replaced with the key for the new table. The following is the resulting schema:

Customer_ID
Address_ID
Last_Name
Customer_ID
First_Name
Address

The database is now First Normal Form compliant. Your customers can have more than one address. The problem that exists is that you have overcomplicated a simple idea because you were trying to follow the rules of normalization. In the example, the second address is totally optional. It is there just to collect information that might be used for contact information. There is really no need to break it into its own table and force the rules of normalization on it. In this instance, taking it to a form of normalcy defeats the purpose for which the data is used. It adds another layer of complexity that is not needed. A good way to determine if your normalizing is getting carried away is to look at the number of tables you have. A large number of tables may indicate that you are normalizing too much. Take a step back and look at your schema. Are you breaking things down just to follow the rules, or is it a practical breakdown. These are the things that you, the database designer, need to decide. Experience and common sense will guide you to make the right decisions. Normalizing is not an exact science; it is a subjective one.

There are six more levels of normalization that have not been discussed so far. They are Boyce-Codd Normal Form, Fourth Normal Form (4NF), Fifth Normal Form (5NF), Strong Join-Protection Normal Form, Over-Strong Join-Protection Normal Form, and Domain Key Normal Form. These forms of normalization may take things further than they need to go. They exist to make a database truly relational. They mostly deal with multiple dependencies and relational keys.

Summary

Normalization is a technique used to create good logical relationships between tables in a database. It helps prevent logical errors when manipulating data. Normalization also makes adding new columns easier without disrupting the current schema and relationships.

There are several levels of normalization: First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form, Fourth Normal Form (4NF), Fifth Normal Form (5NF), Strong Join-Protection Normal Form, Over-Strong Join-Protection Normal Form, and Domain Key Normal Form. Each new level or form brings the database closer to being truly relational. The first three forms were discussed. They provided enough normalization to meet the needs of most databases.

Going overboard with normalization can lead to an inefficient database and can make your schema too complex with which to work. A proper balance of common sense and practicality can help you decide when to normalize and when to let sleeping dogs lie.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: