Sunday, April 5, 2015

Database Normalization

With normalization, the goal is to reduce or eliminate data redundancy. As data grows, it becomes increasingly complex for software developers to maintain the same information in multiple places, as you've learned in our relational data modeling article. Now, there are instances where data redundancy is beneficial. It can increase performance by reducing the time it takes to access your data, for example, with fewer tables to query across. Or increase reliability by giving you multiple locations to access the data in case one of your nodes goes down. As with any architectural problem, you have to weigh the quality attributes against each other. There's never one solution for all of your problems, but different solutions at different stages of the company and you have to be strategic. This article is going to focus on a scenario where the cost of maintaining the data far outweighs the performance benefits of keeping that data in one table.

The Original Dataset We Want to Normalize

Order
OrderId OrderDate CustomerName CustomerAddress
1001 2015-03-23 00:04:21 Dan Burke 35 East Parson St. Yonkers, NY 10701
1002 2015-03-23 00:06:07 Elena Burris 78 Darly Rd. Addison, IL 60101
1003 2015-03-23 00:010:02 John Smith 700 Burnet Dr. Addison, IL 60101
1004 2015-03-23 00:010:11 Don Mckee 3 Spring County Blvd. Coward, SC 29530
1005 2015-03-23 00:11:41 Kristina Farrell 803 Potter Rd. Seattle, WA 98109
1006 2015-03-23 00:11:41 Denise Marsh 419 Lowel Rd. Chicago, IL 60607
1789 2017-05-24 12:12:06 John Smith 700 Burnet Dr. Addison, IL 60101

First Normal Form (1NF)

First Normal Form is defined by wikipedia as a property of a relation in a relational database. A relation is in first normal form if the domain of each field contains only atomic values, and the value of each field contains only a single value from that domain. Now, what does this actually mean? What we'll do is go over all of the qualifiers and explain each with examples from our original dataset. Most of these rules reinforce the idea that your data must live in a two dimensional table.

Columns and Rows

Simply put, the table must consist of a column and a row to represent the data. Moving on, one column must equal one field and have a unique field name. So this table would not be compliant with this 1NF rule:

OrderId, OrderDate <--violation CustomerName CustomerAddress
1001
2015-03-23 00:04:21
Dan Burke 35 East Parson St. Yonkers, NY 10701

However, this table would:

OrderId OrderDate CustomerName CustomerAddress
1001 2015-03-23 00:04:21 Dan Burke 35 East Parson St. Yonkers, NY 10701

By complying with this rule, tracking down records by OrderDate is much easier.

Strict Types

Secondly, column entries must be of the same kind. As you've noticed, all of our entries have the same consistent format. All names include a first and last name, and all addresses contain the address, city, state, and zip code. If we were to have inconsistent address formats, it would be more difficult to query against and perform analyses. The following table would not be compliant:

OrderId OrderDate CustomerName CustomerAddress
1001 2015-03-23 00:04:21 Dan Burke 35 East Parson St. Yonkers, NY 10701
1002 2015-03-23 00:06:07 Elena Burris Addison, IL <--violation

Unique Rows

Third, rows must be unique. If you scroll back to the top of this page and look at the original dataset, you'll see that we haven't violated this rule. While you will see repeated values, no row is a complete copy of another because the OrderID is always unique. This can also be accomplished with other fields, such as social security number or email address, but we aren't capturing those.

1 Column/Field = 1 Value

Fourth, each field must also contain one value and can't be in multiple columns to be two dimensional. To better illustrate this, we'll simplify our data to focus only on the customers and addresses. The following tables would violate that rule:

CustomerName CustomerAddress
Dan Burke 35 East Parson St. Yonkers, NY 10701
1003 Brighton St. Arlington, TX 76010 <--violation
Elena Burris 78 Darly Rd. Addison, IL 60101

CustomerName CustomerAddress1 CustomerAddress2
Dan Burke 35 East Parson St. Yonkers, NY 10701 1003 Brighton St. Arlington, TX 76010
Elena Burris 78 Darly Rd. Addison, IL 60101 Null <--violation

See how Dan Burke now has two addresses under the same field? Now Elena Burris, who lives in one location has a null value in her CustomerAddress2 column. You want to avoid this. We can solve this problem by splitting up customer data from address data, like so:

Customer
CustomerId CustomerName
2001 Dan Burke
2002 Elena Burris

Address
AddressId Address
3001 35 East Parson St. Yonkers, NY 10701
3002 1003 Brighton St. Arlington, TX 76010
3003 78 Darly Rd. Addison, IL 60101

Customer-Address
CustomerId AddressId
2001 3001
2001 3002
2002 3003

Now, we have the ability to keep both sets of data isolated and communicate the many-to-many, or n...:n..., relationships defined in a separate Customer-Address table.

Atomic Values

Lastly, you want to make sure your value are atomic, storing the smallest possible unit.

OrderId OrderDate CustomerFirstName CustomerLastName CustomerAddress CustomerCity CustomerState CustomerZipCode
1001 2015-03-23 00:04:21 Dan Burke 35 East Parson St. Yonkers NY 10701
1002 2015-03-23 00:06:07 Elena Burris 78 Darly Rd. Addison IL 60101
1003 2015-03-23 00:010:02 John Smith 700 Burnet Dr. Addison IL 60101
1004 2015-03-23 00:010:11 Don Mckee 3 Spring County Blvd. Coward SC 29530
1005 2015-03-23 00:11:41 Kristina Farrell 803 Potter Rd. Seattle WA 98109
1006 2015-03-23 00:11:41 Denise Marsh 419 Lowel Rd. Chicago IL 60607
1789 2017-05-24 12:12:06 John Smith 700 Burnet Dr. Addison IL 60101

Second Normal Form (2NF)

The 2NF is a superset of the 1NF, meaning 2NF = 1NF + additional qualification. This additional qualification is that each column depends on the entire primary key. So let's focus in on two entries of our dataset to get a better understanding of this. In the previous article, we mentioned having two orders from a John Smith from the same location, but because the gap between the two order was so large, we couldn't tell if this was the same John Smith or if this was a new John Smith that had moved into the residence much later.

OrderId OrderDate CustomerFirstName CustomerLastName CustomerAddress CustomerCity CustomerState CustomerZipCode
1003 2015-03-23 00:010:02 John Smith 700 Burnet Dr. Addison IL 60101
1789 2017-05-24 12:12:06 John Smith 700 Burnet Dr. Addison IL 60101

2NF addresses this by making sure that every entry can be associated with a primary key. So when you have duplicate data, you can tell whether it's unique or not. We should have stored the customer data in a separate Customer table from the beginning and given each customer a CustomerId in order to uniquely identify accounts from orders. Take a look at the following data model, which is 2NF compliant:

Order
OrderId OrderDate CustomerId
1003 2015-03-23 00:010:02 2003
1789 2017-05-24 12:12:06 3002

Customer
CustomerId CustomerFirstName CustomerLastName CustomerAddress CustomerCity CustomerState CustomerZipCode
2003 John Smith 700 Burnet Dr. Addison IL 60101
3002 John Smith 700 Burnet Dr. Addison IL 60101

Now you can see that the CustomerIds don't match, meaning the order was made from two different John Smith's who happened to live in the same residence. Of course, you can also assume that this is the same person who just so happened to make an order from a different account, but eh, you get the point. If you really want to make sure, you should also store the user's email address or any other unique values.

Primary Keys vs Foreign Keys

So in our Order table, we've designated the OrderId as our primary key. Primary keys are used to uniquely identify a record in the same table. When defining the relationship between two tables, there must always be one common data element between them. That common data element is the foreign key. Our Order table references the Customer table by way of the CustomerId.

Third Normal Form (3NF)

3NF is very similar to 2NF in that it all ties back to the primary key, except it deals with a scenario where a column depends upon another column which depends on the primary key. This is also known as a transitive dependency. Let's take a look at our newly formed Customer table to get a better understanding of this concept:

CustomerId CustomerFirstName CustomerLastName CustomerAddress CustomerCity CustomerState CustomerZipCode
2003 John Smith 700 Burnet Dr. Addison IL 60101
3002 John Smith 700 Burnet Dr. Addison IL 60101

As you can see, although we have rows that are tied to the primary key (CustomerId), there are still dependencies within the data. For example, all of the address-related fields are unbreakably bound to the CustomerZipCode. To solve this, you can split some of the data into a a separate ZipCode table.

Customer
CustomerId CustomerFirstName CustomerLastName ZipCode
2003 John Smith 60101
3002 John Smith 60101

ZipCode
ZipCode Address City State
60101 700 Burnet Dr. Addison IL

Creating Our Data Model in MySQL

Follow this guide in order to have a mysql instance ready to connect to.

Set Up MySQL Tables

# ssh from the root of vagrant-ansible-mysql
vagrant ssh

# log in to mysql
mysql -uroot

# all of the following commands are...
# ...executed from within mysql>

SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.01 sec)

# use mysql db
USE mysql;

# create schema for zipcode table
CREATE TABLE zip_codes( zip_code VARCHAR(20) NOT NULL, address VARCHAR(100) NOT NULL, city VARCHAR(100), state VARCHAR(100), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(zip_code) );

# create schema for customer table
CREATE TABLE customers( id INT NOT NULL AUTO_INCREMENT, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, zip_code VARCHAR(20) NOT NULL, created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id), FOREIGN KEY(zip_code) REFERENCES zip_codes(zip_code) ON UPDATE CASCADE ON DELETE CASCADE );

# create schema for order table
CREATE TABLE orders( id INT NOT NULL AUTO_INCREMENT, order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP, customer_id INT NOT NULL, PRIMARY KEY(id), FOREIGN KEY(customer_id) REFERENCES customers(id) ON UPDATE CASCADE ON DELETE CASCADE );

Great, now let's take a look at all of our schemas.

# show orders schema
SHOW COLUMNS FROM mysql.orders;
+-------------+-----------+------+-----+-------------------+----------------+
| Field       | Type      | Null | Key | Default           | Extra          |
+-------------+-----------+------+-----+-------------------+----------------+
| id          | int(11)   | NO   | PRI | NULL              | auto_increment |
| order_date  | timestamp | NO   |     | CURRENT_TIMESTAMP |                |
| customer_id | int(11)   | NO   | MUL | NULL              |                |
+-------------+-----------+------+-----+-------------------+----------------+
3 rows in set (0.00 sec)

# show customers schema using describe
DESCRIBE mysql.customers;
+------------+--------------+------+-----+-------------------+----------------+
| Field      | Type         | Null | Key | Default           | Extra          |
+------------+--------------+------+-----+-------------------+----------------+
| id         | int(11)      | NO   | PRI | NULL              | auto_increment |
| first_name | varchar(100) | NO   |     | NULL              |                |
| last_name  | varchar(100) | NO   |     | NULL              |                |
| zip_code   | varchar(20)  | NO   | MUL | NULL              |                |
| created    | timestamp    | NO   |     | CURRENT_TIMESTAMP |                |
+------------+--------------+------+-----+-------------------+----------------+
5 rows in set (0.00 sec)

# show zip_codes schema
DESCRIBE mysql.zip_codes;
+----------+--------------+------+-----+-------------------+-------+
| Field    | Type         | Null | Key | Default           | Extra |
+----------+--------------+------+-----+-------------------+-------+
| zip_code | varchar(20)  | NO   | PRI | NULL              |       |
| address  | varchar(100) | NO   |     | NULL              |       |
| city     | varchar(100) | YES  |     | NULL              |       |
| state    | varchar(100) | YES  |     | NULL              |       |
| created  | timestamp    | NO   |     | CURRENT_TIMESTAMP |       |
+----------+--------------+------+-----+-------------------+-------+
5 rows in set (0.00 sec)

No comments:

Post a Comment