Table of Contents

Introduction to database management with open source tools

Workshop offered by Guillaume Larocque (guillaume.larocque@mcgill.ca), research professional at the Quebec Centre for Biodiversity Science.

Link to Prezi presentation

Useful resources

Installation of PostgreSQL

On Windows and Mac

On Linux/Ubuntu

In a terminal, type:

sudo apt-get install postgresql

It is possible that the version of postgresql available in your repositories is earlier. You can install the version you have available.

For other distributions, check this page.

Accessing PSQL

On Windows and Mac:

If that doesn't work...

On Linux Simply type:

psql

in a terminal.

If that doesn't work...



If that doesn't work on Mac...

Group discussion

Click here to download the original spreadsheet (ODS format)

Click here to download the original spreadsheet (XLSX format)




Creating a new user and database

If desired, you can create a superuser other than the default postgres user. To do so, type the following command, replacing your_username and your_password with values of your choice.

CREATE USER your_username WITH SUPERUSER PASSWORD 'your_password';

Creating a new database for the workshop

CREATE DATABASE workshop;

Important step to secure access to database.

REVOKE CONNECT ON DATABASE your_database FROM PUBLIC;
CREATE USER user_name WITH PASSWORD 'your_password';
GRANT ALL ON DATABASE your_database TO user_name;

Connecting PSQL and DBEAVER

Basic commands

List all databases

\l

Connect to a database

\c workshop

List and describe all tables in the current database

\dt

… with more details

\dt+

View a list of all tables in database

\d+

Describe a particular table

\d tablename

List all users

\du

Quit Postgresql

\q

Exercise 1 - CREATE AND INSERT statements

More common data types

function callmeaningexample
integer-2147483648 to +21474836475
decimal(,)decimal(total digits, digits after period)122.52 decimal(5,2)
real6 decimal digits precision122.527964
double15 decimal digits precision122.527964
varchar()character(maximum length) 'tree swallow'
texttext of unspecified length 'Once upon a time, there was…'
timestampdate and time '2012-10-21 10:03:21'

Full list of PostgreSQL data types.

CREATE TABLE bird_sightings (sight_id SERIAL PRIMARY KEY,
obs_time timestamp,species text,sp_count integer,lat real,"long" real);

We use “” when we want to store a column or table name that is a reserved word or that contains spaces or special characters.

MySQL

To add a new row with values to the table

INSERT INTO bird_sightings (obs_time,species,sp_count,lat,"long") VALUES ('2012-06-19 12:31:16','Sturnus vulgaris',40,45.3522,-73.7930);

Note that we did not specify the sight_id field since it is set to SERIAL (auto-increment) and it will be assigned a value automatically.

To see the table:

SELECT * FROM bird_sightings;

If you need to delete the table

DROP TABLE bird_sightings;

Question 1 :?: Reproduce the following table in PostgreSQL, name it 'actors' and visualize it in PostgreSQL

first_namelast_namebirth_dategenderheighttrade_mark
BruceWillis1955-03-19M1.81Frequently plays a man who suffered a tragedy, had lost something or had a crisis of confidence or conscience.
EmmaWatson1990-04-15F1.65Often portrays literary characters

Exercise 2 - Importing data into PostgreSQL

Download the following files to your computers (right-click, save link as…): Lakes.csv lakes_species.csv species_acro.csv

On Windows: Copy those three files to a folder named 'opendbwork' in the C: folder (i.e. C:\opendbwork) On Mac or Linux: Copy those three files to a folder in your path that you can easily identify (e.g. /home/myname/opendbwork). Modify the commands below accordingly by replacing "C:/opendbwork" with your path.

Create the table containing the environmental information of each lake - Lakes

CREATE TABLE lakes (lake_id SERIAL PRIMARY KEY,numero INT, lake_name text, 
province text, latitude DEC(10,5),longitude DEC(10,5), number_of_species INT, ecoprov VARCHAR(50), 
ecozone VARCHAR(50), gss DEC(10,2),gse DEC(10,2), gsl DEC(10,2), gdd10 DEC(10,2), egdd DEC(10,2), 
mean_ele DEC(10,2), pe_ann_p DEC(10,2), totp_ann DEC(10,2), srann_me DEC(10,2), shann_me DEC(10,2), 
tmax_ann DEC(10,2), tmean_an DEC(10,2), vpann_me DEC(10,2));

MySQL

Load the CSV file into this table

COPY lakes FROM 'C:/opendbwork/lakes.csv' WITH csv HEADER DELIMITER AS ',';

If this command gives you an access denied error message in Windows

If this command gives you an access denied error message on Mac

MySQL

Create the table containing the Species presence at each lakes and load CSV into it - lakes_species

CREATE TABLE lakes_species (lake_id INT NOT NULL, species_id INT NOT NULL);
COPY lakes_species FROM 'C:/bdlibre/lakes_species.csv' WITH csv HEADER DELIMITER AS ',';

MySQL

For the third table:

CREATE TABLE species (species_id INT NOT NULL, full_name varchar(255), short_name VARCHAR(100));
COPY species FROM 'C:/bdlibre/species_acro.csv' WITH csv HEADER DELIMITER AS ',';

REFERENCES

MATH AND GROUP BY (Aggregate) Functions

avg() Return the average value of the argument
count(DISTINCT) Return the count of a number of different values
count() Return a count of the number of rows returned
group_concat() Return a concatenated string
max() Return the maximum value
min() Return the minimum value
stddev() Return the population standard deviation
stddev_pop() Return the population standard deviation
stddev_samp() Return the sample standard deviation
sum() Return the sum
var_pop() Return the population standard variance
var_samp() Return the sample variance
variance() Return the population standard variance

Full list

Conditional selections used in the where clause

= Equal
> Greater than
< Less than
>= Greater than or equal
< = Less than or equal
<> or !=Not equal to
LIKEString matching

Order of operators in a SELECT statement:

SELECT columns FROM tables WHERE conditions JOIN GROUP BY columns HAVING condition ORDER BY columns;

Exercise 3 - SELECT statement

Select all rows from the lakes table.

SELECT * FROM lakes;

In the terminal, when you are done browsing a table, hit 'q' to go back to the prompt.

Select just the lake names and province.

SELECT lake_name, province FROM lakes;

Show all province names in table

SELECT DISTINCT province from lakes;

Show all province names in table, use an alias for province and order by descending alphabetical order.

SELECT DISTINCT province as "province name" FROM lakes ORDER BY province DESC;

Show the birth year of each actor

SELECT first_name, last_name,  EXTRACT(YEAR FROM birth_date) FROM actors;

Show initials of each actor

SELECT first_name, last_name, concat(substr(first_name,1,1),' ',substr(last_name,1,1)) as initials FROM actors;

Select all lakes from Quebec where the mean annual temperature is below -5.

SELECT lake_name, tmean_an FROM lakes WHERE province='QUEBEC' AND tmean_an<-5;

Or simply, to count the number of lakes

SELECT count(*) FROM lakes WHERE province='QUEBEC' AND tmean_an<-5;
or equivalently:
SELECT count(lake_id) FROM lakes WHERE province='QUEBEC' AND tmean_an<-5;

Question 1 :?: - How many lakes in British Columbia receive more than 3000 mm of precipitation (totp_ann column)?
answer
Question 2 :?: - What is the average Elevation (mean_ele column) of all lakes in the Montane Cordillera Ecozones (use the avg() operator)?
answer

Note: the % operator is a wildcard used to replace the beginning or the end of a string in a query using LIKE.
Select all lake names that contain the word 'Small'

SELECT lake_name FROM lakes WHERE lake_name like '%Small%';

Select all lake names that contain the word 'Small' and that are not located in Ontario.

SELECT lake_name, province 
FROM lakes
WHERE lake_name like '%Small%' AND province!='ONTARIO';
other possibility:
SELECT lake_name, province 
FROM lakes 
WHERE lake_name like '%Small%' AND province NOT LIKE 'ONTARIO';

Question 3 :?: - What is the maximum latitude of all lakes in Ecozones with names that start with 'Taiga'
answer
Question 4 :?: - How many species of Daphnia are there in the species table?
answer

Exercise 4 - GROUPING

Calculate the average annual temperature of all lakes within each province

SELECT province, AVG(tmean_an) as mean_an_t
FROM  lakes
GROUP BY province;

Same table ordered by increasing temperatures

SELECT province, AVG(tmean_an) as Mean_AN_T
FROM  lakes
GROUP BY province ORDER BY AVG(tmean_an);

Question 5 :?: - Which province has the lake with the highest maximum temperature (column tmax_ann)?
answer

Update statement: Give all rows where the province name includes NWT the same province name 'NWT' - Careful, this actually modifies the table!

UPDATE lakes
SET province='NWT'
WHERE province like '%NWT%';

Exercise 5 - Working with multiple tables

Count the number of species in each lake. Display lake name and species number.

SELECT lake_name, COUNT(DISTINCT species_id) as num_species FROM lakes, lakes_species 
WHERE lakes.lake_id=lakes_species.lake_id 
GROUP BY lakes.lake_id 
ORDER BY num_species DESC;

Same query, but only display the first 20.

SELECT lake_name, COUNT(DISTINCT species_id) as num_species FROM lakes, lakes_species 
WHERE lakes.lake_id=lakes_species.lake_id 
GROUP BY lakes.lake_id 
ORDER BY num_species DESC LIMIT 20;

Count the number of species in each Ecozone and Ecoprovince. Display lake name and species number.

SELECT ecozone, ecoprov, COUNT(species_id) as num_species FROM lakes, lakes_species 
WHERE lakes.lake_id=lakes_species.lake_id 
GROUP BY ecozone, ecoprov
ORDER BY ecozone, COUNT(species_id);

Question 6 :?: Generate a table that lists the full name of each species of Daphnia and the number of lakes where this species is present?

Answer


Exercise 6 - JOIN Operations

JOIN (INNNER JOIN, CROSS JOIN) Keep all possible combinations of rows from A and B
LEFT JOIN Keep all records of table A, join with elements from B that are present in A
RIGHT JOIN Keep all records of table B, join with elements from A that are present in B
OUTER JOIN Keep all records from table A and B

Create a list of all the species found in the lakes in Quebec, with the full names of the species and the number of occurences

SELECT full_name, count(full_name) 
FROM species,lakes_species,lakes 
WHERE species.species_id=lakes_species.species_id AND lakes.lake_id=lakes_species.lake_id AND province='QUEBEC' 
GROUP BY full_name;

Alternatively, and equivalently…

SELECT full_name, count(full_name) 
FROM lakes
LEFT JOIN lakes_species ON (lakes.lake_id=lakes_species.lake_id)
LEFT JOIN species ON (species.species_id=lakes_species.species_id)
WHERE province='QUEBEC' 
GROUP BY full_name;

For every possible combinations of lakes found in the Lake of the Woods Ecoprovince, return the difference in mean annual temperature between those lakes.

SELECT concat(a.lake_name,'-',b.lake_name) as lakes, a.TMEAN_AN-b.TMEAN_AN as temp_diff 
FROM lakes a, lakes b 
WHERE a.ecoprov='Lake of the Woods' AND b.ecoprov='Lake of the Woods';

Exercise 7 - Subqueries

Generate a table containing the count of the presence of each species in lakes North and South of the 50th parallel (latitude).

SELECT a.species_id, count_50south, count_50north
FROM 
 (SELECT species_id, count(d.species_id) as count_50south FROM lakes c, lakes_species d WHERE c.lake_id=d.lake_id AND latitude<=50 GROUP BY species_id) a,
 (SELECT species_id, count(f.species_id) as count_50north FROM lakes e, lakes_species f WHERE e.lake_id=f.lake_id AND latitude>50 GROUP BY species_id) b 
WHERE a.species_id=b.species_id;

:?: Question 7 Using a subquery, find the mean elevation of lakes (MEAN_ELE column) where are species of Daphnia is present
Answer

Exporting a table

Create a new table with just lakes in Quebec

CREATE TABLE lakes_qc AS SELECT * FROM lakes WHERE province='QUEBEC';
COPY lakes_qc TO 'C:/opendbwork/lakes_qc.csv' WITH csv HEADER DELIMITER AS ',';

Using the LibreOffice Base front-end

Creating a form

List Box

Radio buttons

Exercise

You are given a dataset in the following format:

Plot 1
species_namedbhstatuscomment
Acer rubrum12.4alive
Acer saccharum25.3alive
Fagus grandifolia14.1dead
Fagus grandifolia66.0alive
Fraxinus americana30.1alivediseased bark
Plot 2
Quercus rubra12.4alive
Fraxinus americana64.2alivebig hole in trunk
Fraxinus americana53.1deaduprooted
Carpinus caroliana10.3alive

:?: You are told that thousands more trees will have to be entered in this way. How would you go about designing a database and form system to store this information?

:?: Create the following table in PostgreSQL. You can then fill the table with its values by clicking on the table name in Base and entering values manually. Table name: tree_species

species_idspecies_nameenglish_name
1Quercus rubraRed oak
2Acer saccharumSugar maple
3Acer rubrumRed maple
4Fagus grandifoliaAmerican beech
5Fraxinus americanaWhite ash
6Carpinus carolianaMusclewood


:?: Create this table structure and create a form to fill it in the form design view, with a drop-down menus (list box) to select the species names and radio buttons for the tree health. tree_id is the primary key. For technical reasons, make sure that you specify varchar as the format for the status field. Table name: tree_plots

plot_idtree_idspecies_iddbhhealthcomment
11312.4alive
12225.3alive
13414.1dead
14466.0alive
15430.1alivediseased bark
26140.1alive
27564.2alivebig hole in trunk
28553.1deaduprooted
29610.3alive

Using Microsoft Access with PostgreSQL

Linking R with PostgreSQL

Using the RPostgreSQL package

From within R: On first use only:

install RPostgreSQL package
install.packages('RPostgreSQL')
R Code
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user="your_username", password="your_password", dbname="workshop");
lakes <- dbGetQuery(con,'SELECT * FROM lakes');
Plot some data
lakesqc <- dbGetQuery(con,"SELECT * FROM lakes WHERE province='QUEBEC'")
hist(lakes$tmean_an)

Using the dplyr package

library(dplyr)
src<-src_postgres(dbname="workshop",host="localhost", port="5432",user="your_username",password="your_password")
lakes <- tbl(src, "lakes") # Define lakes table
lakes_qc<-filter(lakes, province  %=% 'QUEBEC') # Select lakes in Quebec
prov_tmean<-summarise(group_by(lakes, province), mean(tmean_an)) # Mean annual temperature per province
prov_tmean=collect(prov_tmean) # Transfer result to standard R data frame
lakes_qc2<-tbl(src, sql("SELECT * FROM lakes WHERE province='QUEBEC'")) #Perform any SQL statement