Wednesday, September 5, 2018

Loading Unknown CSV Data into a Table Easily

Peter Zaitsev of Percona posted a Tweet about an interactive book Select Star SQL which 'aims to be the best place on the internet for learning SQL'.  This interactive book uses a data set that can be downloaded or you could work the queries online.  I showed the website to a neighbor who was asking 'what is the best way to learn SQL'.   The neighbor wanted their own copy of the data and asked how do you turn a raw CSV file into a table.

Well, that can get complicated.

CSV


CSV files use commas to designate each field in the file.  This is a great idea until someone plants a comma in the middle of a field but it is sort of a least common denominator way of passing data between systems.

But a CSV file does not directly fit into a relational table.  You could shove all the data into a JSON column but that is not what we want.   What is needed is a peek at the data in the CSV file to see what the various fields are and possible hints at the types of data therein.  So download the data and please follow along.

Traditionally it took a bit of data sleuthing to examine the CSV file to determine how the data needed to be handled to fit into the relational table even before you fired up your database client. 

If you open the file with a spreadsheet program like Libre Calc they usually will make a best effort at putting each field in its own column.  Luckily this data set has the first row which names the columns. Are here are the column headings:

Execution
Date of Birth
Date of Offence
Highest Education Level
Last Name
First Name
TDCJ Number
Age at Execution
Date Received
Execution Date
Race
County
Eye Color
Weight
Height
Native County
Native State
Last Statement

Looking at the first few rows under the headings we can get a good glimpse at the data types for the columns.  What we want to do on the first pass is to see if values are strings (CHAR), Integers, Realm and the like.  It also helps to look for some field that can be used for a primary key, hopefully a column with unique values.

Execution  INT Primary key??  Unique??
Date of Birth DATE
Date of Offence DATE
Highest Education Level INT
Last Name CHAR
First Name CHAR
TDCJ Number INT
Age at Execution INT
Date Received   DATE
Execution Date DATE
Race CHAR
County CHAR
Eye Color CHAR
Weight INT
Height CHAR
Native County  CHAR
Native State CHAR
Last Statement  LONG TEXT

Second Pass


In this case it looks like we could use Execution as the primary key.  Now we need to go through the data to see what fields need special handling.  Are the possible INT fields all positive values (unsigned), could we use ENUMs anywhere (Maybe Native Country or Native State or Eye Color). 

Also, for metric fans, the Height column uses imperial measurements which may not fit a company policy of metric only (or cubits). Plus Height has two components -- Feet and Inches -- that we may need to break down later or we may want to record all this data in inches. We will not cover that in this blog post (but we could cover if I get enough folks asking for how to do this).

Also we need to examine each of the CHAR fields to check their maximum lengths.  This can be dangerous for future addendum where suddenly the maximum length set by our first pass at the data is swamped by some much longer fields.  If the only Last Names we see are Smith and Jones, setting the maximum length of the CHAR field works until we run into someone named Buchanan or Schwarzenegger.    But in this case we have a closed system (NO updates to the data) we will not worry about this, at least for his data

So making a best guess, lets assign lengths to the CHAR fields.

Execution  INT Primary key??  Unique??
Date of Birth DATE
Date of Offence DATE
Highest Education Level INT  unsigned
Last Name CHAR  25
First Name CHAR  25
TDCJ Number INT
Age at Execution INT
Date Received   DATE
Execution Date DATE
Race CHAR  10
County CHAR 20
Eye Color CHAR 10
Weight INT
Height CHAR 10
Native County  CHAR 25
Native State CHAR 25
Last Statement  LONG TEXT

Please note that I will be using VARCHAR instead of CHAR in the above and use CHAR just for clarity.

Create Our Table

We could of course create a table like that below and then load the data file.



MySQL Workbench makes quick work of creating the schema.  But Workbench will also has a wizard to open the CSV, let us name the columns, and import the data.

The Easy Way To Go From CSV to Relational Table

MySQL Workbench's import wizard is reached by clicking on a table icon and selecting Table Data Import Wizard.

Tell the import wizard where the CSV is located


Start the table import wizard

The CSV is split up by fields and we could edit field types 
And the data is loaded



Now I have a big warning in that column names are copied from the CSV table first row.  We could have changed Date of Birth to something like Birth_date but chose not to. This means we have to make queries like SELECT `Date of Birth` FROM tx_deathrow.

But now we have a local copy of the data to use when working through the interactive book.