
Wednesday, July 28, 2021

Percentage of Full Table Scans on MySQL

Below is a useful script I use to determine what percentage of the queries on my MySQL server are performing full table scans.  Reducing full table scans is a good place to start when improving database performance.


SELECT VARIABLE_VALUE INTO @Handler_read_rnd_next
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Handler_read_rnd_next';
SELECT VARIABLE_VALUE INTO @Handler_read_rnd
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Handler_read_rnd';
SELECT VARIABLE_VALUE INTO @Handler_read_first
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Handler_read_first';
SELECT VARIABLE_VALUE INTO @Handler_read_next
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Handler_read_next';
SELECT VARIABLE_VALUE INTO @Handler_read_key
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Handler_read_key';
SELECT VARIABLE_VALUE INTO @Handler_read_prev
FROM information_schema.GLOBAL_STATUS
WHERE VARIABLE_NAME = 'Handler_read_prev';


SELECT ((@Handler_read_rnd_next + @Handler_read_rnd) /
        (@Handler_read_rnd_next + @Handler_read_rnd + @Handler_read_first +
         @Handler_read_next + @Handler_read_key + @Handler_read_prev)) * 100
       AS PercentFullTableScans;
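On MySQL 5.7 and later, these status counters moved to performance_schema.global_status (the information_schema copy was removed).  As a sketch, the same percentage can be computed there in a single statement:

```sql
-- Full-table-scan percentage from performance_schema (MySQL 5.7+).
-- Handler_read_rnd and Handler_read_rnd_next count row reads done
-- without an index; the other four counters are index-based reads.
SELECT 100 *
       SUM(IF(VARIABLE_NAME IN ('Handler_read_rnd', 'Handler_read_rnd_next'),
              VARIABLE_VALUE, 0)) / SUM(VARIABLE_VALUE)
       AS PercentFullTableScans
FROM performance_schema.global_status
WHERE VARIABLE_NAME IN ('Handler_read_rnd', 'Handler_read_rnd_next',
                        'Handler_read_first', 'Handler_read_next',
                        'Handler_read_key', 'Handler_read_prev');
```

SUM() coerces the VARCHAR counter values to numbers, so no explicit CAST is needed.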



Thursday, May 6, 2021

grep for DBAs

Many times I'll want to inspect a file, and because I'm a DBA, my first inclination is to load the file into a database table so that I can write SQL queries against it.  However, it's not always possible or practical to load the data into a database server first.  I've put together some useful grep commands below that I run to get a sense of a file's data attributes.

For pretend purposes, let's say we have a file named Person.csv that contains records about people, with Id, FirstName, LastName, DateRecordCreated and DateRecordUpdated columns.  See below.
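The original sample data was shown as an image, so here is an illustrative stand-in.  The rows with Ids 1, 3, 6 and 8 come from the grep output in the examples; the rows with Ids 2, 4, 5, 7 and 9 are invented to stay consistent with the counts shown.

```shell
# Illustrative Person.csv -- rows 2, 4, 5, 7 and 9 are invented
# placeholders, chosen so the grep examples below produce the
# outputs shown in this post.
cat > Person.csv <<'EOF'
Id,FirstName,LastName,DateRecordCreated,DateRecordUpdated
1,Bart,Simpson,1990-01-01,2021-01-01
2,Homer,Simpson,1995-06-10,2019-03-15
3,Seth,Rogen,2000-07-01,2021-04-30
4,Lisa,Simpson,1998-09-09,2019-11-20
5,Marge,Simpson,1997-03-03,2020-06-15
6,Bill,Gates,1966-05-04,2021-05-01
7,Maggie,Simpson,2002-12-12,2017-05-05
8,Drew,Brees,1999-02-15,2021-02-01
9,Ned,Flanders,1981-09-26,2017-08-08
EOF
```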
To count the number of lines containing "2021" in a file.

$ grep -c -i 2021 Person.csv

Output:

4


To search for the existence of a pattern in a file

$ grep -i '2021' Person.csv

Output:

1,Bart,Simpson,1990-01-01,2021-01-01

3,Seth,Rogen,2000-07-01,2021-04-30

6,Bill,Gates,1966-05-04,2021-05-01

8,Drew,Brees,1999-02-15,2021-02-01


To obtain the line number(s) of a pattern in a file

$ grep -n -i 2021 Person.csv

Output:

2:1,Bart,Simpson,1990-01-01,2021-01-01

4:3,Seth,Rogen,2000-07-01,2021-04-30

7:6,Bill,Gates,1966-05-04,2021-05-01

9:8,Drew,Brees,1999-02-15,2021-02-01


To count the number of lines not matching a pattern

$ grep -v -c 2021 Person.csv

Output:

6


To count the number of instances of multiple patterns

$ grep -ioh "2018\|2019\|2020\|2021" Person.csv | sort | uniq -c | sort -n

Output:

1 2020

2 2019

4 2021


To count the number of lines in a file

$ grep -c ^ Person.csv

Output:

10

Wednesday, November 25, 2020

SOUNDEX - A real world example

The Works Progress Administration created SOUNDEX in the late 1930s to help with searching for and storing surnames that are pronounced the same but spelled differently.  Soundex is a way of assigning sound codes to names.  Interestingly, the US government indexed surnames by SOUNDEX when collecting census records.

Census example below

select SOUNDEX('Smith') 

select SOUNDEX('Smyth') 

select SOUNDEX('Smithe') 

select SOUNDEX('Smythe')  

All return S530


Other real world use cases

Misspelled words:

select SOUNDEX('Chicken') 

select SOUNDEX('Chikin')  

Both return C500
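Putting this to work in a query: assuming a hypothetical Person table with FirstName and LastName columns, a phonetic surname search might look like the sketch below.  In MySQL, "a SOUNDS LIKE b" is shorthand for SOUNDEX(a) = SOUNDEX(b).

```sql
-- Find everyone whose surname sounds like "Smith"
-- (Person is a hypothetical table for illustration).
SELECT FirstName, LastName
FROM Person
WHERE LastName SOUNDS LIKE 'Smith';

-- Equivalent long form, more portable across databases:
SELECT FirstName, LastName
FROM Person
WHERE SOUNDEX(LastName) = SOUNDEX('Smith');
```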


Another real world use case

Some states use SOUNDEX to generate driver's license numbers

http://www.highprogrammer.com/alan/numbers/dl_us_shared.html



Wednesday, April 29, 2020

CAP Theorem and Database Architecture - Part 1

I was recently asked to give a talk about database architecture to a group of DevOps and Site Reliability Engineers, so I wrote this blog post to capture some of the important talking points around the topic.

There are a few important concepts to discuss about database architecture implementations: the CAP Theorem, data models, and how to scale your database implementation.

The CAP Theorem states that it is impossible for a distributed database to provide more than two of the following three guarantees: Consistency (C), Availability (A) and Partition Tolerance (P).  Any implementation of a database system can only support two: Consistency + Availability (CA), Consistency + Partition Tolerance (CP), or Availability + Partition Tolerance (AP).

See the diagram below:
The CAP Theorem says you can only have two of the three guarantees, so let's go over each one.

Consistency - Every read is guaranteed to return the most recent version of the data.

Availability - Every request eventually receives a response, though it may not contain the most recent version of the data.

Partition Tolerance - The cluster continues to function even when nodes can't communicate with each other, for example because a node goes down.


Pick 2 out of 3 options
Based on the theory that you can only pick 2 out of the 3 concepts for implementation, below are the combinations and models that are used in the industry today.

Consistency + Availability (CA)
Relational Database Management Systems (RDBMS) implement the Consistency + Availability model via the relational data model.  Examples include SQL Server, MySQL and Postgres.

Consistency + Partitioning (CP)
Key-value, column-oriented, and document-oriented database systems can implement the Consistency + Partition Tolerance model.  Examples include BigTable, MongoDB and HBase.

Availability + Partitioning (AP)
Key-value and column-oriented database systems can implement the Availability + Partition Tolerance model.  Examples include Cassandra, DynamoDB and Riak.


Data Models
Now that we've discussed CAP Theorem and the various implementations, it's time to take a closer look at the data models implemented by some of the technologies mentioned above.  

Below is a chart I made to help summarize the differences.


Data Model    Structure        Queries                   Primary Consistency
Relational    Row based        Complex                   Immediate
Columnar      Column based     Simple; joins are slow    Eventual
Document      Document based   Simple; no joins          Eventual
Key-Value     Key based        Simple; no joins          Eventual



Relational model

In an RDBMS, data is organized into tables, and relationships are built to other tables containing additional relevant information.

See the oversimplified example below.
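The diagram itself isn't reproduced here.  As a sketch, the three tables it implies (table and column names inferred from the SELECT statement further down in this post) could be created like this:

```sql
-- Hypothetical reconstruction of the example schema:
-- a Person, a Hobby, and a junction table resolving the
-- many-to-many relationship between them.
CREATE TABLE Person (
    PersonID  INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName  VARCHAR(50)
);

CREATE TABLE Hobby (
    HobbyID INT PRIMARY KEY,
    Hobby   VARCHAR(50)
);

CREATE TABLE PersonHobby (
    PersonID INT,
    HobbyID  INT,
    PRIMARY KEY (PersonID, HobbyID),
    FOREIGN KEY (PersonID) REFERENCES Person(PersonID),
    FOREIGN KEY (HobbyID)  REFERENCES Hobby(HobbyID)
);
```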
The relational model is implemented in RDBMS systems.  There are several benefits of RDBMS that I'll list here without detail: ACID transactions, transaction management (roll forward, roll back), lock escalation, and built-in facilities for administrative functions like security management, backups and restores.  You could read several books on these topics, but they're not the focus of this post.

Thinking about the diagram above, the SQL statement to retrieve Bill Gates's hobbies requires two joins.  That's not a big deal; it's very common and by design in the RDBMS world.  In the NoSQL world, however, joins are complex.

Let's get into the background of how data is stored in relational databases.  In the example above, each record is contained in a row; behind the scenes, rows are stored on pages, and pages belong to extents.  This storage layout becomes important because it impacts how the data is read.

How the SELECT works in RDBMS
Let's say the tables above have millions of records, and a DBA just wants to see one of Bill Gates's hobbies.  The SQL statement would look something like the following:

select top 1 p.FirstName, p.LastName, h.Hobby 
from Person p
inner join PersonHobby ph on p.PersonID = ph.PersonID
inner join Hobby h on ph.HobbyID = h.HobbyID
where p.FirstName = 'Bill' and p.LastName = 'Gates';

In order to find one of the records associated with Bill's hobbies, the RDBMS needs to retrieve all the columns from all the tables in the SELECT statement.  In addition, it must scan all the pages the rows exist on.  Most RDBMS systems also implement a read-ahead algorithm that anticipates the need for data close to the data you are selecting, so all the pages in the extent where the row resides are scanned as well.  In SQL Server, eight 8K pages fit into a single extent.

Below is what a typical data page diagram looks like.  How many rows fit onto a data page depends on the data types and the size of each row.

We previously mentioned that Data Pages belong to Extents in groups of eight, so consider the following:

In the above example, even if a DBA's SQL statement only retrieves a few columns for a single record, all of the data above must be scanned and retrieved in order to return the one row and three columns requested.

Before we wrap up with part 1 of this blog series, let's examine the life cycle of a SELECT statement in SQL Server.

  1. The SQL Server Network Interface (SNI) on the client establishes a connection to SQL Server.
  2. A connection to a Tabular Data Stream (TDS) endpoint is made.
  3. The SELECT statement is sent to SQL Server as a TDS message.
  4. The SNI on the server reads the TDS message and sends the statement to the command parser.
  5. The command parser checks the plan cache in the buffer pool for an existing query plan.  If a plan is not found, it generates a query tree and sends it to the optimizer.
  6. The optimizer generates the most cost-effective plan and sends it to the query executor.
  7. The query executor decides whether data needs to be read to complete the query plan.  It passes the plan to the Access Methods in the Storage Engine.
  8. The Access Methods need to read a page from the database in order to finish the request from the query executor.  They do this by asking the buffer manager to provide the data page.
  9. The buffer manager checks the data cache to see if it already has the page.  If not, the page is read from disk, copied into the cache and returned to the Access Methods.
  10. The Access Methods pass the data back to the relational engine, which sends it back to the client.

Monday, April 20, 2020

Planning for your next database disaster recovery

As DBAs, we are accustomed to various recovery scenarios when restoring databases.  Sometimes database restores are simple, planned recoveries.  Hopefully only once in a while are recoveries unplanned.

I don't have to tell you, if you're reading this blog, that DBAs perform a lot of database backups and a fair number of database restores.  We spend a lot of time thinking about backups, even dreaming about them.  Sometimes databases are restored as part of a server migration; this is a great way to test your backups, since a backup is only good if it's restorable.  In other cases, you might be restoring a database because of a badly written DML statement, and you are only concerned with a single table.  And hopefully, only once in a great while, you are restoring a database due to a real disaster.

Regardless of the scenario, this is how I plan and organize myself for successful database disaster recoveries.
  1. Understand your risks
  2. Mitigate the risks
  3. Practice the recovery
  4. Know what to do when disaster strikes

Understand your risks
There are a lot of risks when it comes to operating a database that is part of a strategic business process.  Make sure you document the risks and communicate them to your manager and upper management so that everyone in the decision-making process is well informed.  Don't be afraid to remind them once a quarter :)

Here are a few typical risk factors:
  • Single Points of Failure (SPOF)
  • Old\Aging Infrastructure
  • Unmaintained Infrastructure
  • Backup devices with limited storage
  • Rapidly changing\unpredictable workloads (Capacity)

Mitigate the risks
There are two types of risks: the knowable and the unknowable.

Knowable risks:
  • SPOF - Identify Single Points of failure and work towards redundancy.  
  • Old\Aging Infrastructure - If the infrastructure hosting your database is 2 years old, you should be planning and budgeting for replacement during year 3.
  • Unmaintained Infrastructure - Is the firmware up to date, OS patches up to date, infrastructure under support?  
  • Backup devices with limited storage - Know your backup requirements and how much space your backups require
  • Capacity - Does your infrastructure support the workload it currently serves?

SPOF
There are two aspects: hardware and software.  You can reduce this risk by implementing redundant database hardware and storage.  From a software perspective, there are many solutions you can implement: log shipping, replication, multi-node clustering, etc.  It is also a best practice to capture requirements for building redundancy within the application.

Old\Aging Infrastructure
You can reduce risks by planning, budgeting and communicating the need for new hardware.  However, in the short term, you should keep additional capacity on standby in the event of a fatal infrastructure issue.

Unmaintained infrastructure
The first thing your hardware support person is going to ask when you file an incident is "Is your device up to date on firmware, patches, service packs, etc.?"  You should be keeping up on this; not doing so slows down any recovery, since it's the first thing a support person will ask you to do.  Falling behind here is also usually a key finding in a Root Cause Analysis (RCA).

Backup devices with limited storage
Monitor your backup device for free disk space and alert on it.  Build reporting and alerting on the successes and failures of backups, and on the date of the last successful backup for each of your databases.  I can't tell you how many times I've seen a potential recovery situation where nobody knew when the last backup occurred.
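As a sketch of that last-backup report on SQL Server, where backup history is recorded in msdb, a query might look like the following:

```sql
-- Most recent full backup per database (SQL Server).
-- Databases with no recorded backup show NULL, which sorts first.
SELECT d.name AS DatabaseName,
       MAX(b.backup_finish_date) AS LastFullBackup
FROM sys.databases d
LEFT JOIN msdb.dbo.backupset b
       ON b.database_name = d.name
      AND b.type = 'D'              -- 'D' = full database backup
GROUP BY d.name
ORDER BY LastFullBackup;
```

Wiring the result into your alerting (for example, flagging anything older than 24 hours) is what turns this from a report into a safety net.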

Capacity
Monitoring and reporting on resource utilization is key to predicting when you will run into resource constraints.  Publish weekly reports, and share monthly or quarterly reports with your manager and upper management.  Publishing and reviewing reports regularly can help tie a trend to something that changed in the environment (a software release, a new customer with a large user base, etc.).  Alerting on resource utilization is important as well.

Unknowable risks
  • Quality of your most recent backup - Practice your recovery.
  • Geographic based natural disasters - Have an off premise plan in the event there is a geographic disaster where your database is hosted.

Quality of your most recent backup
There are a few sayings in the industry: "Your backup is only good if you can restore it", "We have a backup, but we don't know if it's restorable", and so on.  The point is that a great way to test your backups and have confidence in the process is to regularly plan and test your recoveries.

Geographic based natural disasters
An earthquake, fire or hurricane can impact a region and cause outages.  In a perfect world, you have a redundant data center or cloud presence in a separate geographic area (different flood plain, tectonic plate, power grid, etc.).  Many times you won't have this luxury, so come up with a couple of different scenarios with cost, effort and downtime, get management sign-off on an approach, and implement it.  At a minimum, make sure backups are copied to another facility (cloud, data center, device, etc.).  For the most part, re-deploying or rebuilding the application is much easier than reconstructing the data.

Practice the Recovery
I touched upon this a bit in a previous section, so I'll repeat it: a great way to test your backups and have confidence in the recovery process is to regularly plan and test your recoveries.  Practicing also gives you a few more data points: 1) if you time the recovery process, you can communicate what downtime will look like in the future based on how long it takes to recover, and 2) testing the recovery process after each software release helps you identify new areas that may need to be added to the recovery process (engineering adds new database tech, a new database, etc.).


What to do when disaster strikes
The hardest thing to do during a disaster is trying to communicate the status and answer questions while also focusing on doing the recovery.  You could write an entire blog post about this so I’ll summarize what I feel are important talking points.
  • Implement a status page - Post updates to the site.  Get your teams used to checking the status page for uptime metrics, downtime for any releases or patches, etc.
  • Nominate someone to organize the recovery - This person creates a Slack channel or Zoom meeting for the folks working on the technical side of the problem, starts an internal email thread, updates the status page, opens a ticket with the appropriate hardware vendor, etc.
  • Prepare communication talking points - There are usually three audiences here: the executive team, the customer success managers (or customers themselves), and the technical team working on the recovery.  Cover how long the recovery will take (you know this from practicing), what is impacted, who is impacted, the current status, and when the next communication will occur.
  • Host an RCA Meeting (Root Cause Analysis) and come up with action items to reduce risk going forward.

I've given you a lot to think about.  You may not be able to work on all of the recommendations above, and maybe they are not enough to help you recover successfully during your next outage.  The most important things are to know and communicate your risks and the cost of remediating them, and to document and communicate evidence of improvement.


Thursday, March 26, 2020

Searching columns for patterns in MySQL


There are many different ways to search a column for a pattern.  You can use string functions and operators or regular expressions.

Below are some (not all) of the ways to search a column of phone numbers for the digit sequence 415.

To find numbers starting with 415:

SELECT PhoneNumber FROM ContactInformation WHERE PhoneNumber LIKE '415%';
SELECT PhoneNumber FROM ContactInformation WHERE LEFT(PhoneNumber,3) = '415';
SELECT PhoneNumber FROM ContactInformation WHERE PhoneNumber REGEXP '^415';
SELECT PhoneNumber FROM ContactInformation WHERE SUBSTRING(PhoneNumber,1,3) = '415';

To find numbers containing 415 in the number:

SELECT PhoneNumber FROM ContactInformation WHERE PhoneNumber LIKE '%415%';
SELECT PhoneNumber FROM ContactInformation WHERE PhoneNumber REGEXP '415';
SELECT PhoneNumber FROM ContactInformation WHERE INSTR(PhoneNumber, '415') > 0;

Sunday, February 23, 2020

MySQL Date Formatting

I always feel like I'm digging around to find a list of date format specifiers for MySQL, and then end up doing some trial and error to get the desired format.


Below is a helpful list to refer to for common date formatting.



Example                            Result
SELECT NOW()                       2020-02-23 19:45:14
SELECT DATE_FORMAT(NOW(),"%a")     Sun
SELECT DATE_FORMAT(NOW(),"%b")     Feb
SELECT DATE_FORMAT(NOW(),"%c")     2
SELECT DATE_FORMAT(NOW(),"%D")     23rd
SELECT DATE_FORMAT(NOW(),"%d")     23
SELECT DATE_FORMAT(NOW(),"%e")     23
SELECT DATE_FORMAT(NOW(),"%f")     000000
SELECT DATE_FORMAT(NOW(),"%H")     19
SELECT DATE_FORMAT(NOW(),"%h")     07
SELECT DATE_FORMAT(NOW(),"%I")     07
SELECT DATE_FORMAT(NOW(),"%i")     45
SELECT DATE_FORMAT(NOW(),"%j")     054
SELECT DATE_FORMAT(NOW(),"%k")     19
SELECT DATE_FORMAT(NOW(),"%l")     7
SELECT DATE_FORMAT(NOW(),"%M")     February
SELECT DATE_FORMAT(NOW(),"%m")     02
SELECT DATE_FORMAT(NOW(),"%p")     PM
SELECT DATE_FORMAT(NOW(),"%r")     07:45:14 PM
SELECT DATE_FORMAT(NOW(),"%S")     14
SELECT DATE_FORMAT(NOW(),"%s")     14
SELECT DATE_FORMAT(NOW(),"%T")     19:45:14
SELECT DATE_FORMAT(NOW(),"%U")     08
SELECT DATE_FORMAT(NOW(),"%u")     08
SELECT DATE_FORMAT(NOW(),"%V")     08
SELECT DATE_FORMAT(NOW(),"%v")     08
SELECT DATE_FORMAT(NOW(),"%W")     Sunday
SELECT DATE_FORMAT(NOW(),"%w")     0
SELECT DATE_FORMAT(NOW(),"%X")     2020
SELECT DATE_FORMAT(NOW(),"%x")     2020
SELECT DATE_FORMAT(NOW(),"%Y")     2020
SELECT DATE_FORMAT(NOW(),"%y")     20
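The specifiers can be combined with literal text in a single format string; any character that isn't part of a specifier passes through unchanged.  For example:

```sql
-- Build a human-friendly timestamp in one DATE_FORMAT call.
SELECT DATE_FORMAT(NOW(), '%W, %M %D, %Y at %r');
-- At the timestamp above this would produce:
-- Sunday, February 23rd, 2020 at 07:45:14 PM
```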
