Sunday, April 5, 2015

Relational Data Modeling

Think back to the last time you filled out a form. Online or on paper, the process is much the same: you're presented with a series of fields to fill in, usually things like First Name, Last Name, and D.O.B. Before you ever saw that form, someone had to design the structure of that data, choosing the fields and their data types (string, number, boolean) to make the resulting records as manageable as possible. Data modeling is the practice of applying various methodologies to a list of data elements in order to arrive at an optimal relational database design, one that makes your data easy to manage and inexpensive to scale.

Coming Up with a Data Model

Let's say you wanted to store customer and order data for an application that sells a desktop app through a website. You might come up with the following data model:

  • OrderID (Integer)
  • OrderDate (Timestamp)
  • CustomerName (String)
  • CustomerAddress (String)
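To make the model concrete, here's a minimal sketch of it as a SQLite table, created from Python. The TEXT/INTEGER column types are SQLite conventions chosen for illustration, not part of the model itself (note that Order is a reserved word in SQL, so the table name has to be quoted):

```python
import sqlite3

# Sketch of the data model above as a SQLite table. "Order" is a
# reserved word in SQL, hence the double quotes around the name.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE "Order" (
        OrderID         INTEGER,
        OrderDate       TEXT,     -- SQLite commonly stores timestamps as text
        CustomerName    TEXT,
        CustomerAddress TEXT
    )
""")

# Insert one sample order using a parameterized statement.
conn.execute(
    'INSERT INTO "Order" VALUES (?, ?, ?, ?)',
    (1001, "2015-03-23 00:04:21", "Dan Burke",
     "35 East Parson St. Yonkers, NY 10701"),
)
row = conn.execute('SELECT * FROM "Order"').fetchone()
print(row)
```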

With relational data, you want to follow these three steps after coming up with your data model, ultimately ending up with an Entity Relationship Model.

  • Identify candidate keys
  • Select a primary key
  • Apply normalization

Candidate Keys

Keys are essentially data elements. When identifying candidate keys, you want to determine which of them, if any, are unique. So in our data model, we need to identify the keys for which no two orders can share the same value. Take a second to work this out for yourself, then read on for the answer.

The correct answer is OrderID. OrderDate isn't unique because multiple customers can place orders at the same time, and multiple customers can share the same CustomerName and CustomerAddress. OrderID is the only truly unique key in this data model. If that's a little hard to grasp, don't worry; we'll revisit it as we enter data into our table. We'll also indirectly be learning about normal forms before covering them explicitly in a follow-up article.

Before we do that, I'd like to point out that this was an example where we came up with one candidate key. I'd also like to provide an example where you'd have two candidate keys. You may have multiple keys: one that's explicitly designed to be unique by the software architect, and another that's unique by the nature of the data itself. If we had an Employee table, the explicitly designed unique key would be the EmployeeID, and the key that's unique by nature could be a Social Security Number. Whichever one you select as the primary key, unique keys give the system quicker access to the data instance you're looking for.
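The Employee example above can be sketched in SQLite: EmployeeID becomes the primary key, and the SSN column gets a UNIQUE constraint as the alternate candidate key, so the database itself rejects a duplicate. The table and sample values here are hypothetical, chosen just to demonstrate the constraint:

```python
import sqlite3

# Hypothetical Employee table with two candidate keys: EmployeeID
# (explicitly designed) as the primary key, and SSN (unique by the
# nature of the data) as an alternate key with a UNIQUE constraint.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        EmployeeID INTEGER PRIMARY KEY,
        SSN        TEXT UNIQUE,
        Name       TEXT
    )
""")
conn.execute("INSERT INTO Employee VALUES (1, '078-05-1120', 'Dan Burke')")

# A second employee with the same SSN violates the UNIQUE constraint.
rejected = False
try:
    conn.execute("INSERT INTO Employee VALUES (2, '078-05-1120', 'Elena Burris')")
except sqlite3.IntegrityError:
    rejected = True
print("duplicate SSN rejected:", rejected)
```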

When you have multiple candidate keys, any remaining candidates that are not selected as the primary key are classified as secondary, or alternate, keys. Having alternate keys gives you multiple quick ways of hunting down one specific record or line item.

Entering Data and Understanding Unique Keys

Now, let's organize this data into a table. As you'll see below, our fields form the header, each column holds that field's values, and each row (line item) represents one instance of the data. This data will live in an Order table. Remember, our goal here is to identify why OrderID is our unique candidate key.

Note: rig is a handy command-line tool for generating random identities and addresses, used to produce the sample data below. Install it with brew, apt, or yum.

Order

OrderID | OrderDate           | CustomerName     | CustomerAddress
--------|---------------------|------------------|---------------------------------------
1001    | 2015-03-23 00:04:21 | Dan Burke        | 35 East Parson St. Yonkers, NY 10701
1002    | 2015-03-23 00:06:07 | Elena Burris     | 78 Darly Rd. Addison, IL 60101
1003    | 2015-03-23 00:10:02 | John Smith       | 700 Burnet Dr. Addison, IL 60101
1004    | 2015-03-23 00:10:11 | Don Mckee        | 3 Spring County Blvd. Coward, SC 29530
1005    | 2015-03-23 00:11:41 | Kristina Farrell | 803 Potter Rd. Seattle, WA 98109
1006    | 2015-03-23 00:11:41 | Denise Marsh     | 419 Lowel Rd. Chicago, IL 60607
1789    | 2017-05-24 12:12:06 | John Smith       | 700 Burnet Dr. Addison, IL 60101

As you can see from the data above, we've already run into a scenario where two orders (1005 and 1006) came in at the same time, so OrderDate is definitely not a candidate key. We also have two John Smiths with the same address. John Smith is a common name, and there's a long gap between the two order numbers and dates, so perhaps this is a different John Smith who moved into the same location. We can't really tell using this data model, and this is what normalization is for. We do, however, know that OrderID is our only unique key.
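We can confirm this programmatically: a column is a candidate key only if every row holds a distinct value for it. The sketch below runs that check against the sample rows from the table above (addresses omitted for brevity):

```python
# Sample rows from the Order table above, as
# (OrderID, OrderDate, CustomerName) tuples.
orders = [
    (1001, "2015-03-23 00:04:21", "Dan Burke"),
    (1002, "2015-03-23 00:06:07", "Elena Burris"),
    (1003, "2015-03-23 00:10:02", "John Smith"),
    (1004, "2015-03-23 00:10:11", "Don Mckee"),
    (1005, "2015-03-23 00:11:41", "Kristina Farrell"),
    (1006, "2015-03-23 00:11:41", "Denise Marsh"),
    (1789, "2017-05-24 12:12:06", "John Smith"),
]

def is_candidate_key(rows, column):
    """True only if every row holds a distinct value in this column."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

print(is_candidate_key(orders, 0))  # OrderID: True
print(is_candidate_key(orders, 1))  # OrderDate: False (1005 and 1006 collide)
print(is_candidate_key(orders, 2))  # CustomerName: False (two John Smiths)
```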

Improving Cohesion Between Data Types

Now, imagine this data continued to grow and our stakeholders decided they wanted to use it for more than just identifying unique orders. What if they also wanted to leverage the customer data to better understand their demographics? Say they wanted to know where most of their customers lived so they could open more stores in those areas. Well, we do record addresses, but because of the structure of the data, we're going to see a lot of duplicates. If John Smith were to place 50 more orders, we'd have 50 more entries for 700 Burnet Dr. in Addison, IL.

Sure, we could run a query and use the CustomerName and CustomerAddress to recognize it's the same person and throw out the redundant data, but what if that person changed their address? We'd have to make sure to update every order entry with the new address. And even if we had a system to do that whenever we needed to, you can see how the cost of these operations grows with the database. Why not have a separate table for customers? And what if Addison were renamed to something different? Well, then we should have a separate table for addresses as well. Normalization to the rescue, again.
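The fix this paragraph motivates can be sketched as follows: customer data moves into its own table, and each order references it by a CustomerID. The Customer table, its columns, and the replacement address are illustrative assumptions, not part of the original model, but they show why one UPDATE then fixes every order at once:

```python
import sqlite3

# Normalization sketch: Customer becomes its own table, and orders
# reference it by CustomerID instead of repeating name and address.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        Name       TEXT,
        Address    TEXT
    );
    CREATE TABLE "Order" (
        OrderID    INTEGER PRIMARY KEY,
        OrderDate  TEXT,
        CustomerID INTEGER REFERENCES Customer(CustomerID)
    );
    INSERT INTO Customer VALUES (1, 'John Smith', '700 Burnet Dr. Addison, IL 60101');
    INSERT INTO "Order" VALUES (1003, '2015-03-23 00:10:02', 1);
    INSERT INTO "Order" VALUES (1789, '2017-05-24 12:12:06', 1);
""")

# If John Smith moves, a single UPDATE fixes the address for every order.
conn.execute(
    "UPDATE Customer SET Address = '12 New St. Chicago, IL 60607' "
    "WHERE CustomerID = 1"
)
rows = conn.execute("""
    SELECT o.OrderID, c.Address
    FROM "Order" o JOIN Customer c ON o.CustomerID = c.CustomerID
""").fetchall()
print(rows)
```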
