There are a few important concepts to discuss when it comes to database architecture: the CAP Theorem, data models, and how to scale your database implementation.
The CAP Theorem states that it is impossible for a distributed database to provide more than two of the following three guarantees: Consistency (C), Availability (A), and Partition tolerance (P). Any implementation of a database system can therefore support at most two: Consistency + Availability (CA), Consistency + Partition tolerance (CP), or Availability + Partition tolerance (AP).
See the diagram below:
Since the CAP Theorem says you can only have two of the three guarantees, let's go over each one.
Consistency - Any data read is guaranteed to return the most recent version of the data.
Availability - Every request receives a response, though it may not contain the most recent version of the data.
Partition tolerance - The cluster continues to function even when a network partition occurs, i.e. when nodes can no longer communicate with each other.
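The trade-off becomes concrete when a partition actually happens. The sketch below is illustrative only (the class and behavior are hypothetical, not any real product's API): a CP-leaning node refuses to answer when it can't confirm the latest value, while an AP-leaning node answers anyway and risks staleness.

```python
# Hypothetical replica node showing the CP vs AP trade-off
# during a network partition.

class ReplicaNode:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.data = {"balance": 100}
        self.partitioned = False  # True when cut off from peers

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # CP choice: refuse to answer rather than risk staleness.
            raise RuntimeError("unavailable: cannot confirm latest value")
        # AP choice: always answer, even if the value may be stale.
        return self.data.get(key)

cp = ReplicaNode("CP")
ap = ReplicaNode("AP")
cp.partitioned = ap.partitioned = True

print(ap.read("balance"))  # AP stays available: 100 (possibly stale)
try:
    cp.read("balance")
except RuntimeError as e:
    print(e)               # CP sacrifices availability during the partition
```

Neither choice is wrong; it depends on whether a stale read or a refused request hurts your application more.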
Pick 2 out of 3 options
Based on the premise that you can only pick two of the three guarantees, below are the combinations and models used in the industry today.
Consistency + Availability (CA)
Relational Database Management Systems (RDBMS) implement the Consistency + Availability model via the Relational Data Model. Examples of RDBMS systems include SQL Server, MySQL and Postgres.
Consistency + Partitioning (CP)
Key-value, column-oriented, and document-oriented database systems can implement the Consistency + Partition tolerance model. Examples of these systems include BigTable, MongoDB, and HBase.
Availability + Partitioning (AP)
Key-value and column-oriented database systems can implement the Availability + Partition tolerance model. Examples of these systems include Cassandra, DynamoDB, and Riak.
Data Models
Now that we've discussed CAP Theorem and the various implementations, it's time to take a closer look at the data models implemented by some of the technologies mentioned above.
Below is a chart I made to help summarize the differences.
| Data Model | Structure | Queries | Primary Consistency |
|------------|-----------|---------|---------------------|
| Relational | Row based | Complex | Immediate |
| Columnar | Column based | Simple, joins are slow | Eventual |
| Document | Document based | Simple, no joins | Eventual |
| Key-value | Key based | Simple, no joins | Eventual |
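To make the table concrete, here is the same person record shaped for a document store versus a key-value store. This is a plain-Python sketch; the structures, key names, and hobby values are illustrative and not tied to any specific product.

```python
# Document model: the whole record, including hobbies, lives in one
# self-contained document, so no joins are needed to read it back.
person_document = {
    "_id": 1,
    "first_name": "Bill",
    "last_name": "Gates",
    "hobbies": ["Bridge", "Tennis", "Reading"],  # hypothetical values
}

# Key-value model: opaque values looked up by key; the application,
# not the database, knows what is inside each value.
kv_store = {
    "person:1": '{"first_name": "Bill", "last_name": "Gates"}',
    "person:1:hobbies": '["Bridge", "Tennis", "Reading"]',
}

print(person_document["hobbies"])
print(kv_store["person:1:hobbies"])
```

Notice that both models trade query flexibility for simple, fast lookups: there is no way to ask "who plays Tennis?" without scanning everything or maintaining your own secondary index.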
Relational model
In an RDBMS, data is organized into tables, and relationships are defined between tables containing additional relevant information.
See the oversimplified example below.
The relational model is implemented in RDBMS systems. There are several benefits of an RDBMS, which I'll list here without detail: atomic transactions, transaction management (roll forward, roll back), lock escalation, and built-in facilities for administrative functions like security management, backups, and restores. You could read several books on these topics, but that's not the focus of this post.
Thinking about the diagram above, the SQL statement to retrieve Bill Gates's hobbies requires two joins. That's not a big deal; it's common and by design in the RDBMS world. In the NoSQL world, however, joins are expensive and often unsupported.
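The work those two joins do can be simulated in plain Python. The table contents below are hypothetical stand-ins for the diagram's data; the point is that each join is another pass matching keys across tables.

```python
# Toy versions of the three tables from the diagram (made-up rows).
person = [{"PersonID": 1, "FirstName": "Bill", "LastName": "Gates"}]
person_hobby = [{"PersonID": 1, "HobbyID": 10},
                {"PersonID": 1, "HobbyID": 11}]
hobby = [{"HobbyID": 10, "Hobby": "Bridge"},
         {"HobbyID": 11, "Hobby": "Tennis"}]

# Join 1: Person -> PersonHobby on PersonID.
# Join 2: PersonHobby -> Hobby on HobbyID.
results = [
    (p["FirstName"], p["LastName"], h["Hobby"])
    for p in person
    for ph in person_hobby if ph["PersonID"] == p["PersonID"]
    for h in hobby if h["HobbyID"] == ph["HobbyID"]
]
print(results)  # [('Bill', 'Gates', 'Bridge'), ('Bill', 'Gates', 'Tennis')]
```

A real RDBMS does this far more efficiently with indexes and join algorithms, but the key-matching work is fundamentally the same, which is why distributed NoSQL systems tend to avoid it.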
Let's get into the background of how data is stored in relational databases. In the example above, each record in each table is contained in a row; behind the scenes, rows are stored on pages, and pages belong to extents. This storage layout becomes important because it impacts how the data is read.
How the SELECT works in RDBMS
Let's consider that the tables above have millions of records, and a DBA just wants to see one of Bill Gates's hobbies. The SQL statement would look something like the following:
```sql
select top 1 p.FirstName, p.LastName, h.Hobby
from Person p
inner join PersonHobby ph on p.PersonID = ph.PersonID
inner join Hobby h on ph.HobbyID = h.HobbyID
where p.FirstName = 'Bill' and p.LastName = 'Gates';
```
In order to find one of the records associated with Bill's hobbies, the RDBMS needs to retrieve all the columns from all the tables in the SELECT statement. In addition, it must scan all the pages those rows exist on. Most RDBMS systems also implement a read-ahead algorithm to anticipate the need for data close to the data you are selecting, so all the pages in the extent where the row resides are also scanned. Eight 8 KB pages make up a single extent.
Below is what a typical data page diagram looks like. How many rows fit onto a data page depends on the data types and the size of each row.
We previously mentioned that Data Pages belong to Extents in groups of eight, so consider the following:
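Some back-of-the-envelope arithmetic makes the fan-out visible. The 8 KB page and 8-page extent sizes match SQL Server; the 200-byte row size is a hypothetical example, and page header/slot overhead is ignored for simplicity.

```python
PAGE_SIZE_BYTES = 8 * 1024        # SQL Server data page: 8 KB
PAGES_PER_EXTENT = 8              # pages are grouped into extents of eight
EXTENT_SIZE_BYTES = PAGE_SIZE_BYTES * PAGES_PER_EXTENT

# Hypothetical 200-byte row; real pages also carry header and slot overhead.
row_size = 200
rows_per_page = PAGE_SIZE_BYTES // row_size
rows_per_extent = rows_per_page * PAGES_PER_EXTENT

print(EXTENT_SIZE_BYTES)  # 65536 bytes (64 KB) per extent
print(rows_per_page)      # 40 rows per page at 200 bytes/row
print(rows_per_extent)    # 320 rows pulled in by one extent's read-ahead
```

So even when the query wants a single row, the read-ahead described above can drag hundreds of neighboring rows through the I/O path.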
In the above example, if a DBA writes a SQL statement that only retrieves a few columns for a single record, all of the data above must still be scanned and retrieved in order to return the one row and three columns requested.
Before we wrap up with part 1 of this blog series, let's examine the life cycle of a SELECT statement in SQL Server.
- The SQL Server Network Interface (SNI) on the client establishes a connection to SQL Server
- A connection to a Tabular Data Stream (TDS) endpoint is made
- The SELECT statement is sent to SQL Server as a TDS message
- The SNI reads the TDS message and sends it to the command parser
- The command parser checks the plan cache in the buffer pool for an existing query plan. If a plan is not found, it generates a query tree and sends it to the optimizer
- The optimizer generates the most cost-effective plan and sends it to the query executor
- The query executor determines whether data needs to be read to complete the query plan, and passes the plan to the Access Methods in the Storage Engine
- The access methods need to read a page from the database in order to complete the request from the query executor. They do this by asking the buffer manager to provide the data page
- The buffer manager checks the data cache to see if it already has the page. If not, the page is read from disk, copied into the cache, and sent back to the access methods
- The access methods pass the data back to the relational engine, which sends it back to the client
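The steps above can be sketched as a simplified control flow. This is a toy model in plain Python with made-up cache structures and page contents; the real SQL Server components are, of course, far more involved.

```python
# Simplified model of the SELECT life cycle described above.
plan_cache = {}   # query text -> compiled plan (plan cache in the buffer pool)
data_cache = {}   # page id -> page contents (data cache in the buffer pool)
disk = {42: "page 42: rows for Gates"}  # pretend on-disk pages

def get_plan(query):
    # Command parser: reuse a cached plan if one exists; otherwise
    # "optimize" a new one and cache it for next time.
    if query not in plan_cache:
        plan_cache[query] = {"query": query, "page_id": 42}
    return plan_cache[query]

def read_page(page_id):
    # Buffer manager: serve from the data cache, else read the page
    # from disk, copy it into the cache, and return it.
    if page_id not in data_cache:
        data_cache[page_id] = disk[page_id]
    return data_cache[page_id]

def execute_select(query):
    plan = get_plan(query)             # parser + optimizer
    return read_page(plan["page_id"])  # access methods + buffer manager

print(execute_select("select ... where LastName = 'Gates'"))
print(42 in data_cache)  # True: the page stays cached for the next query
```

Running the same query a second time hits both caches, which is exactly why plan reuse and buffer pool sizing matter so much for RDBMS performance.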