======= Introduction to Database Management Systems with MySQL =======
Workshop given by Guillaume Larocque, research professional at the QCBS, at McGill University on November 2 and 9, 2012.
[[http://prezi.com/vswzpe89tbwp/present/?auth_key=gv8krfa&follow=ktr6emnemif_|Link to Prezi presentation - Simultaneous presentation]]
[[http://prezi.com/vswzpe89tbwp/introduction-to-database-management-with-mysql/|Link to Prezi presentation]]
===== Useful resources =====
* [[http://dev.mysql.com/doc/refman/5.5/en/|MySQL 5.5. Reference Manual]]
* [[http://www.artfulsoftware.com/infotree/queries.php|Common MySQL Queries]] - Very useful website with hundreds of example MySQL commands.
* [[http://www.springerlink.com/content/978-1-85233-716-2/|Database Design Manual: using MySQL for Windows]] - This book can be downloaded for free from most universities.
====== Installing MySQL ======
* Visit the [[http://www.mysql.com/downloads/mysql/|Download section of the MySQL website]] and download the version of MySQL Community Server appropriate for your platform (Windows users: choose the 32 bit version!). Note that you DO NOT have to register. Simply click on "No thanks, just take me to the downloads!" at the bottom.
* Follow the installation instructions. Use the Typical Setup Type.
* If a Window appears to register for a MySQL Entreprise support, **close the window**.
* Choose to proceed to create and configure a MySQL instance. Select the standard configuration type.
* On Windows, choose to install MySQL as a Windows service. Also choose a password for the root user (administrator). **WRITE DOWN THIS PASSWORD!!**
* Download [[http://www.libreoffice.org/download/|LibreOffice 3.6]] for your platform and follow the installation instructions. Choose the Typical installation option.
* On Windows: Install the [[http://dev.mysql.com/downloads/connector/odbc/|32 bit version of the MySQL ODBC connector]].
====== Access to PHPMyAdmin ======
https://secure2.creationmw.com/systeme/phpmyadmin/
login:qcbsworkshop-ro
password:YyQphZLHxWtPR7fN
====== Group discussion ======
[[http://qcbs.ca/wp-content/uploads/2012/10/Crustacean_plankton_Canada.ods|Click here to download the original spreadsheet]]
====== MySQL Exercises ======
==== CREATE YOUR FIRST DATABASE ====
**On Linux**:
Open a terminal and type
mysql -u root -p
and type your password. You are now in MySQL.
**On MAC**:
/usr/local/mysql/bin/mysql -u root
**On Windows**: In the Start menu, find MySQL Server... Command Line Client.
Create a new database:
CREATE DATABASE QCBSWORK;
Use this database:
USE QCBSWORK;
To see the list of available databases:
SHOW DATABASES;
==== MAJOR MySQL commands ====
|[[http://dev.mysql.com/doc/refman/5.6/en/show.html|SHOW]]|show databases or tables|
|[[http://dev.mysql.com/doc/refman/5.6/en/describe.html|DESCRIBE]]|describe table|
|[[http://dev.mysql.com/doc/refman/5.6/en/create-table.html|CREATE]]|create database or table|
|[[http://dev.mysql.com/doc/refman/5.6/en/insert.html|INSERT]]|insert a row in a table|
|[[http://dev.mysql.com/doc/refman/5.6/en/select.html|SELECT]]|perform queries|
|[[http://dev.mysql.com/doc/refman/5.6/en/update.html|UPDATE]]|modify existing columns|
|[[http://dev.mysql.com/doc/refman/5.6/en/delete.html|DELETE]]|delete columns|
|[[http://dev.mysql.com/doc/refman/5.6/en/alter-table.html|ALTER TABLE]]|add new columns, or modify the column types|
|[[http://dev.mysql.com/doc/refman/5.6/en/drop-table.html|DROP]]|delete [[http://dev.mysql.com/doc/refman/5.5/en/drop-database.html|database]] or [[http://dev.mysql.com/doc/refman/5.5/en/drop-table.html|table]]|
===== Exercise 1 - CREATE AND INSERT statements =====
[[http://dev.mysql.com/doc/refman/5.0/en/data-type-overview.html|Click here]] or [[http://www.htmlite.com/mysql003.php|here]] for the full list of MySQL data types
More common data types
^function call^meaning^example^
|int()|integer(maximum length)|5|
|decimal(,)|decimal(total digits, digits after period)|122.52|
|varchar()|character(maximum length)| 'tree swallow'|
|text|text of unspecified length| 'Once upon a time, there was...'|
|datetime|date and time| '2012-10-21 10:03:21'|
CREATE TABLE Bird_sightings (Sight_ID int(5) NOT NULL AUTO_INCREMENT PRIMARY KEY,
Obs_Time datetime,Species text,Sp_count int(10),Lat decimal(8,5),`Long` decimal(8,5));
SHOW TABLES;
DESC Bird_sightings;
INSERT INTO Bird_sightings SET Obs_Time='2012-06-19 12:31:16',Species='Tachycineta bicolor',Sp_count=10,Lat=45.21231,`Long`=-73.79102;
Note that we did not specify the Sight_ID field since it is set to AUTO_INCREMENT and it will be assigned a value automatically.
Other syntax for the insert command:
INSERT INTO Bird_sightings (Obs_Time,Species,Sp_count,Lat,`Long`) VALUES ('2012-06-19 12:31:16','Sturnus vulgaris',40,45.3522,-73.7930);
To see the table:
SELECT * FROM Bird_sightings;
**Question 1** :?: Reproduce the following table in MySQL, name it 'Actors' and visualize it in MySQL
^first_name^last_name^Birth_date^gender^height^Trade_mark^
|Bruce|Willis|1955-03-19|M|1.81|Frequently plays a man who suffered a tragedy, had lost something or had a crisis of confidence or conscience.|
|Emma|Watson|1990-04-15|F|1.65|Often portrays literary characters|
===== Exercise 2 - Import data into MySQL =====
Download the following files to your computers (right-click, save link as...):
[[http://qcbs.ca/wp-content/uploads/2012/10/Lakes.csv|Lakes.csv]]
[[http://qcbs.ca/wp-content/uploads/2012/10/Lakes_Species.csv|Lakes_Species.csv]]
[[http://qcbs.ca/wp-content/uploads/2012/10/Species_Acro.csv|Species_Acro.csv]]
On Windows: Copy those three files to a folder named 'mysqlwork' in the C: folder (i.e. C:\mysqlwork)
On Mac or Linux: Copy those three files to a folder in your path that you can easily identify (e.g. /home/myname/mysqlwork). Modify the commands below accordingly by replacing "C:/mysqlwork" with your path.
Create the table containing the environmental information of each lake - **Lakes**
CREATE TABLE Lakes (Lake_ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,numero INT, Lake_name text,
province text, latitude DEC(10,5),longitude DEC(10,5), Number_of_species INT, ECOPROV VARCHAR(20),
ECOZONE VARCHAR(20), GSS DEC(10,2),GSE DEC(10,2), GSL DEC(10,2), GDD10 DEC(10,2), EGDD DEC(10,2),
MEAN_ELE DEC(10,2), PE_ANN_P DEC(10,2), TOTP_ANN DEC(10,2), SRANN_ME DEC(10,2), SHANN_ME DEC(10,2),
TMAX_ANN DEC(10,2), TMEAN_AN DEC(10,2), VPANN_ME DEC(10,2));
Load the CSV file into this table
LOAD DATA LOCAL INFILE 'C:/mysqlwork/Lakes.csv' INTO TABLE Lakes FIELDS TERMINATED BY ','
ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES;
Note: if the above command gives you 'command not allowed' error. Type 'exit;' and restart MySQL with the following command mysql -u root -p --local-infile=1;
Create the table containing the Species presence at each lakes and load CSV into it - **Lakes_Species**
CREATE TABLE Lakes_Species (Lake_ID INT NOT NULL, Species_ID varchar(25) NOT NULL);
LOAD DATA LOCAL INFILE 'C:/mysqlwork/Lakes_Species.csv' INTO TABLE Lakes_Species
FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES;
For the third table, we could use the command line with the following:
++++ Command |
CREATE TABLE Species_Acro (Unique_ID INT NOT NULL, Full_name varchar(100), Species_ID Varchar(25));
LOAD DATA LOCAL INFILE 'C:/mysqlwork/Species_Acro.csv' INTO TABLE Species_Acro
FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES;
++++
However, we will use LibreOffice instead...
===== Importing data with LibreOffice =====
==== Configuring the LibreOffice/MySQL connexion====
**On Windows**
Open LibreOffice Base. Select database...Connect to an existing database...MySQL. On the next screen, choose 'Connect using ODBC'. On the next screen, click on Browse and choose organize in the bottom right. Then, choose to add a data source and find 'MySQL ODBC 5.2w driver'. In the next screen... Data source name: MySQL ODBC. TCP/IP Server : localhost. User: root. Password: Your password. Database: QCBSWORK. Then click OK, Select MySQL ODBC and click OK again.
On the next screen: Username:root. Select 'Password required'. Select Next...Finish and then save the database in a folder of your choice.
**On Mac**
[[http://dev.mysql.com/downloads/connector/j/|Download the tar.gz archive here]]. Open LibreOffice Writer. In the top menu find Preferences...LibreOffice>>Java>>Class path>>Add archive and select the archive you just downloaded.
Open LibreOffice Base. Select database...Connect to an existing database...MySQL. On the next screen, choose 'Connect using JDBC'. On the next screen... Database name: QCBSWORK. Server: localhost.
**On linux**
Open LibreOffice Base. Select database...Connect to an existing database...MySQL. On the next screen, choose 'Connect using JDBC'. On the next screen... Database name: QCBSWORK. Server: localhost. If this fails, go to the top menu in LibreOffice and choose Tools>>Options>>LibreOffice>>Java and select an installation of a JRE.
=== Importing a table ===
* Open the Species_Acro.csv file with LibreOffice Calc (In Windows, find LibreOffice Calc in the start menu). Specify that the file is separated by comma.
* Select the entire worksheet with ctrl-a;
* Open LibreOffice Base and go to Edit...paste. Select to use the column headers as headers for the table
* Follow the wizard... Select all fields except the Unique_ID field. Specify the proper data format (verify the types of attributes in the table) for the other columns.
* When prompted to create a primary key, click yes.
===== REFERENCES =====
** MATH AND GROUP BY (Aggregate) Functions**
|avg() |Return the average value of the argument|
|count(DISTINCT) |Return the count of a number of different values|
|count() |Return a count of the number of rows returned|
|group_concat() |Return a concatenated string|
|max() |Return the maximum value|
|min() |Return the minimum value|
|stddev() |Return the population standard deviation|
|stddev_pop() |Return the population standard deviation|
|stddev_samp() |Return the sample standard deviation|
|sum() |Return the sum|
|var_pop() |Return the population standard variance|
|var_samp() |Return the sample variance|
|variance() |Return the population standard variance|
[[http://www.postgresql.org/docs/9.3/static/functions-aggregate.html|Full list]]
\\
\\
**Conditional selections used in the where clause**
|= |Equal|
|> |Greater than|
|< |Less than|
|>= |Greater than or equal|
|< = |Less than or equal|
|<> or !=|Not equal to|
|LIKE|String matching|
**Order of operators in a SELECT statement**:
SELECT columns FROM tables WHERE conditions JOIN GROUP BY columns HAVING condition ORDER BY columns;
===== Exercise 3 - SELECT statement =====
Select 'Bonjour';
Select cos(4*3);
Select all rows from the Lakes table.
SELECT * FROM Lakes;
Select just the lake names and province.
SELECT Lake_name, province FROM Lakes;
Show all province names in table
SELECT DISTINCT Province from Lakes;
Show all province names in table, use an alias for province and order by descending alphabetical order.
SELECT DISTINCT Province as `Province name` FROM Lakes ORDER BY Province DESC;
Show the birth year of each actor
SELECT First_name, Last_name, year(Birth_date) FROM Actors;
Show initials of each actor
SELECT First_name, Last_name, concat(substr(Firstname,1,1),' ',substr(Lastname,1,1)) as Initials FROM Actors;
Select all lakes from Quebec where the mean annual temperature is below -5.
SELECT Lake_name, TMEAN_AN FROM Lakes WHERE Province='Quebec' AND `TMEAN_AN`<-5;
Or simply, to count the number of lakes
SELECT count(*) FROM Lakes WHERE Province='Quebec' AND `TMEAN_AN`<-5;
or equivalently:
SELECT count(Lake_ID) FROM Lakes WHERE Province='Quebec' AND `TMEAN_AN`<-5;
**Question 1** :?: - How many lakes in British Columbia receive more than 3000 mm of precipitation (TOTP_ANN column)?
\\ ++answer| 12 ++ \\
**Question 2** :?: - What is the average Elevation (MEAN_ELE column) of all lakes in the Montane Cordillera Ecozone (use the avg() operator)?
\\ ++answer| 1364.36 ++ \\
\\
Note: the % operator is a wildcard used to replace the beginning or the end of a string.
\\
Select all lake names that contain the word 'Small'
SELECT Lake_name FROM Lakes WHERE Lake_name like '%Small%';
Select all lake names that contain the word 'Small' and that are not located in Ontario.
SELECT Lake_name, province
FROM Lakes
WHERE Lake_name like '%Small%' AND province!='ONTARIO';
other possibility:
SELECT Lake_name, province
FROM Lakes
WHERE Lake_name like '%Small%' AND province NOT LIKE 'ONTARIO';
**Question 3** :?: - What is the maximum latitude of all lakes in Ecozones with names that start with 'Taiga'
\\ ++answer| 68.8 ++ \\
**Question 4** :?: - How many species of Daphnia are there in the Species_Acro table?
\\ ++answer| 17 (21 have the word 'Daphnia' somewhere in their name) ++ \\
===== Exercise 4 - GROUPING =====
Calculate the average annual temperature of all lakes within each province
SELECT province, AVG(TMEAN_AN) as Mean_AN_T
FROM Lakes
GROUP BY Province;
Same table ordered by increasing temperatures
SELECT province, AVG(TMEAN_AN) as Mean_AN_T
FROM Lakes
GROUP BY Province ORDER BY AVG(TMEAN_AN);
**Question 5** :?: - Which province has the lake with the highest maximum precipitation (column TMAX_ANN)?
\\ ++answer| BRITISH COLUMBIA 14.60 ++ \\
Update statement:
Give all rows where the province name includes NWT the same province name 'NWT' - **Careful, this actually modifies the table!**
UPDATE Lakes
SET Province='NWT'
WHERE Province like '%NWT%';
===== Exercise 5 - Working with multiple tables =====
Count the number of species in each lake. Display lake name and species number.
SELECT Lake_name, COUNT(DISTINCT Species_ID) as Num_Species FROM Lakes, Lakes_Species
WHERE Lakes.Lake_ID=Lakes_Species.Lake_ID
GROUP BY Lakes.Lake_ID
ORDER BY Num_Species DESC;
Same query, but only display the first 20.
SELECT Lake_name, COUNT(DISTINCT Species_ID) as Num_Species FROM Lakes, Lakes_Species
WHERE Lakes.Lake_ID=Lakes_Species.Lake_ID
GROUP BY Lakes.Lake_ID
ORDER BY Num_Species DESC LIMIT 0,20;
Count the number of species in each Ecozone and Ecoprovince. Display lake name and species number.
SELECT ECOZONE, ECOPROV, COUNT(Species_ID) as Num_Species FROM Lakes, Lakes_Species
WHERE Lakes.Lake_ID=Lakes_Species.Lake_ID
GROUP BY ECOPROV
ORDER BY ECOZONE, COUNT(Species_ID);
List the species that are found in only one Ecoprovince.
SELECT Species_ID, ECOPROV FROM Lakes, Lakes_Species
WHERE Lakes.Lake_ID=Lakes_Species.Lake_ID
GROUP BY Species_ID HAVING Count(DISTINCT ECOPROV)=1
ORDER BY Species_ID;
**Question 6** :?: Generate a table that lists the full name of each species of Daphnia and the number of lakes where this species is present?
++++ Answer |
SELECT Full_name, count(DISTINCT a.Lake_ID) as `Count` FROM Lakes a, Species_Acro b, Lakes_Species c WHERE Full_name like 'Daphnia%' AND a.Lake_ID=c.Lake_ID AND b.Species_ID=c.Species_ID GROUP BY Full_name;
^Full_name^Count^
| Daphnia pulex Leydig, 1860 | 234 |
| Daphnia similis Claus, 1876 | 17 |
| Daphnia ambigua Scourfield, 1947 | 59 |
| Daphnia catawba Coker, 1926 | 28 |
| Daphnia dubia Herick, 1883 | 71 |
| Daphnia galeata Sars, 1864 | 17 |
| Daphnia longiremis G. O. Sars, 1862 | 426 |
| Daphnia longispina hyalina ceresiana | 3 |
| Daphnia longispina hyalina microcephela | 27 |
| Daphnia magna Straus, 1820 | 18 |
| Daphnia mendotae Birge, 1918 | 433 |
| Daphnia middendorffiana Fischer, 1851 | 117 |
| Daphnia parvula Fordyce, 1901 | 28 |
| Daphnia pulicaria Forbes, 1893 | 110 |
| Daphnia retrocurva Forbes, 1882 | 324 |
| Daphnia rosea G.O. Sars, 1862 | 52 |
| Daphnia thorata Forbes, 1893 | 14 |
++++ \\
[[http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html|Click here for a list of MySQL mathematical functions]]
===== Exercise 6 - JOIN Operations =====
|JOIN (INNNER JOIN, CROSS JOIN)| Keep all possible combinations of rows from A and B|
|LEFT JOIN| Keep all records of table A, join with elements from B that are present in A|
|RIGHT JOIN| Keep all records of table B, join with elements from A that are present in B|
|OUTER JOIN| Keep all records from table A and B|
Create a list of all the species found in the lakes in Quebec, with the full names of the species and the number of occurences
SELECT Full_name, count(Full_name)
FROM Species_Acro,Lakes_Species,Lakes
WHERE Species_Acro.Species_ID=Lakes_Species.Species_ID AND Lakes.Lake_ID=Lakes_Species.Lake_ID AND province='QUEBEC'
GROUP BY Full_name;
Alternatively, and equivalently...
SELECT Full_name, count(Full_name)
FROM Lakes
LEFT JOIN (Species_Acro, Lakes_Species)
ON (Species_Acro.Species_ID=Lakes_Species.Species_ID AND Lakes.Lake_ID=Lakes_Species.Lake_ID) WHERE province='QUEBEC'
GROUP BY Species_ID;
For all possible combination of lakes found in the Baffin Uplands Ecoprovince, return the difference in mean annual temperature
SELECT concat(a.Lake_name,'-',b.Lake_name) as Lakes, a.TMEAN_AN-b.TMEAN_AN as Temp_Diff
FROM Lakes a, Lakes b
WHERE a.ECOPROV='Baffin Uplands' AND b.ECOPROV='Baffin Uplands';
Equivalently:
SELECT concat(a.`Lake_ID`,'-',b.`Lake_ID`) as Lakes, a.`TMEAN_AN`-b.`TMEAN_AN` as Temp_Diff
FROM `Lakes_1665` a JOIN `Lakes_1665` b
WHERE a.`ECOPROV`='Baffin Uplands' AND b.`ECOPROV`='Baffin Uplands';
===== Exercise 7 - Subqueries =====
Generate a table containing the count of the presence of each species in lakes North and South of the 50th parallel (latitude).
SELECT a.SPECIES_ID, Count_50South, Count_50North
FROM
(SELECT SPECIES_ID, Count(d.SPECIES_ID) as Count_50South FROM Lakes c, Lakes_Species d WHERE c.Lake_ID=d.Lake_ID AND `latitude`<=50 GROUP BY Species_ID) a,
(SELECT SPECIES_ID, Count(f.SPECIES_ID) as Count_50North FROM Lakes e, Lakes_Species f WHERE e.Lake_ID=f.Lake_ID AND `latitude`>50 GROUP BY Species_ID) b
WHERE a.Species_ID=b.Species_ID;
Equivalently,
SELECT SPECIES_ID, Count(d.SPECIES_ID) as Count_50South FROM Lakes c, Lakes_Species d WHERE c.Lake_ID=d.Lake_ID AND `latitude`<=50 GROUP BY Species_ID
UNION SELECT SPECIES_ID, Count(f.SPECIES_ID) as Count_50North FROM Lakes e, Lakes_Species f WHERE e.Lake_ID=f.Lake_ID AND `latitude`>50 GROUP BY Species_ID;
:?: **Question 7** Using a subquery, find the mean elevation of lakes (MEAN_ELE column) where are species of Daphnia is present
\\ ++ Answer | 474.627172 \\ Code: SELECT avg(MEAN_ELE) FROM (SELECT DISTINCT a.Lake_ID, MEAN_ELE FROM Lakes a,Species_Acro b,Lakes_Species c WHERE b.Full_name like 'Daphnia%' AND a.Lake_ID=c.Lake_ID AND c.Species_ID=b.Species_ID ORDER BY a.Lake_ID) d;\\ ++
====== Using the LibreOffice Base front-end ======
===== Creating tables and forms =====
You are given a dataset in the following format:
^Plot 1^
|Acer rubrum|12.4|alive|
|Acer saccharum|25.3|alive|
|Fagus grandifolia|14.1|alive|
|Fagus grandifolia|66.0|Diseased bark|
|Fraxinus americana|30.1|alive|
^Plot 2^
|Quercus rubra|12.4|alive|
|Fraxinus americana|64.2|diseased bark|
|Fraxinus americana|53.1|dead|
|Carpinus caroliana|10.3|alive|
:?: You are told that thousands more trees will have to be entered in this way. How would you go about designing a database to store this information?
++++ FOLLOW THESE STEPS TO CREATE THE DATABASE IN LIBREOFFICE BASE|
:?: Use the Table design view to create the following table. Specify the Species_ID field as the primary key (++ right click on row| {{:screen.png}} ++) and with AutoValue: on.
Table name: Tree_species
^Species_ID^Species_name^
|1|Quercus rubra|
|2|Acer saccharum|
|3|Acer rubrum|
|4|Fagus grandifolia|
|5|Fraxinus americana|
|6|Carpinus caroliana|
:?: Create a form (with the form wizard) which allows you to fill the table above.
\\
:?: Repeat the process with this table. You can fill this one directly in the design view if you want. Health_ID is the primary key.
Table name: Health
^Health_ID^Health_details^
|1|Healthy|
|2|Diseased bark|
|3|Diseased branches|
|4|Dead|
:?: Create this empty table in the design view and create a form to fill it in the form design view, with drop-down menus (list boxes) to select the species names and the tree health from the tables above. Tree_ID is the primary key.
Table name: Tree_Plots
^Plot_ID^Tree_ID^Species_ID^DBH^Health_ID^
|1|1|3|12.4|1|
|1|2|2|25.3|1|
|1|3|4|14.1|1|
|1|4|4|66.0|2|
|1|5|4|30.1|1|
|2|6|1|40.1|1|
|2|7|5|64.2|3|
|2|8|5|53.1|4|
|2|9|6|10.3|1|
++++
====== Linking R with MySQL ======
From within R:
On first use only:
install.packages('RMySQL')
library(RMySQL)
con <- dbConnect(MySQL(), user="root", password="your_password", dbname="QCBSWORK");
lakes <- dbGetQuery(con,'SELECT * FROM Lakes');
lakesqc <- dbGetQuery(con,'SELECT * FROM Lakes WHERE province="Quebec"')
hist(lakes$TMEAN_AN)