Data Manipulation with sqldf
Paul Shannon
Belfast R Users
29/08/13
I HATE data manipulation with R. Look at how ugly and unreadable this is:
df <- airquality
# Group Ozone levels into 'High', 'Medium' & 'Low'
df$Ozone_Grpd[df$Ozone < 18] <- "Low"
df$Ozone_Grpd[df$Ozone >= 18 & df$Ozone < 63] <- "Medium"
df$Ozone_Grpd[df$Ozone >= 63] <- "High"
And that’s something very basic!
SQLDF uses an SQLite database to perform querys and updates. However it’s good to know that once R kicks data to the database, SQLite only will return queries. This means any data manipulation done in SQLite will only remain in SQLite.
Introduction to SQL
Queries
Queries begin with the select statement, followed by what we want to select:
# This query will return everything from df
sqldf("select * from df")
# This will only return 2 columns, Wind and Ozone, from df
sqldf("select Ozone, Wind from df")
These queries can be assigned to R objects. (I think select Ozone, Wind from df is much nicer than df[, names(df) %in% c(“Ozone, Wind”)]).
We can easily create aggregated data using expressions in the select statement. Expressions include:
- AVG: The avg() function returns the average value of all non-NULL X within a group.
- COUNT: The count(X) function returns a count of the number of times that X is not NULL in a group. The count(*) function (with no arguments) returns the total number of rows in the group.
- MAX: The max() aggregate function returns the maximum value of all values in the group.
- MIN: The min() aggregate function returns the minimum non-NULL value of all values in the group.
- SUM: The sum() and total() aggregate functions return sum of all non-NULL values in the group.
Unfortunately SQLite in quite limited and doesn’t support other expressions such as std (stand deviation) and var (variance).
Query Examples
titantic <- data.frame(Titanic)
head(titantic)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
sqldf("select sum(Freq) as Count from titantic")
## Loading required package: tcltk
## Count
## 1 2201
sqldf("select sum(Freq) as Count from titantic group by Sex")
## Count
## 1 470
## 2 1731
sqldf("select Sex, sum(Freq) as Count from titantic group by Sex")
## Sex Count
## 1 Female 470
## 2 Male 1731
sqldf("select Sex, Class, Age, sum(Freq) as Count from titantic where Survived = 'No' group by Sex, Class, Age")
## Sex Class Age Count
## 1 Female 1st Adult 4
## 2 Female 1st Child 0
## 3 Female 2nd Adult 13
## 4 Female 2nd Child 0
## 5 Female 3rd Adult 89
## 6 Female 3rd Child 17
## 7 Female Crew Adult 3
## 8 Female Crew Child 0
## 9 Male 1st Adult 118
## 10 Male 1st Child 0
## 11 Male 2nd Adult 154
## 12 Male 2nd Child 0
## 13 Male 3rd Adult 387
## 14 Male 3rd Child 35
## 15 Male Crew Adult 670
## 16 Male Crew Child 0
Inserting Variables
Previously I’ve written sql strings with placeholders for variable names. The use gsub to swap them out. Eg. “select avg(predicted) as Pred, avg(variable) from table group by variable”. This can be done using the unix bash script style variable notation and fn$sqldf. This uses gsubfn, a quasi-perl-style string interpolation. gsubfn is used by sqldf so its already loaded. Note the fn prefix to invoke the interpolation functionality.
Eg.
Windvalue <- 10
Ozonevalue <- "High"
fn$sqldf("select * from df where Wind > $Windvalue and Ozone_Grpd = '$Ozonevalue'")
## Ozone Solar_R Wind Temp Month Day Ozone_Grpd
## 1 71 291 13.8 90 6 9 High
## 2 63 220 11.5 85 7 20 High
## 3 89 229 10.3 90 8 8 High
Updates
We can update our table within the database and apply all the transformations we want. This only applies the changes to the database not the table in our workspace! To account for this we can then select everything back from the database by running select * from main .(table name). This selects from the database table rather than the workspace table.
library("sqldf")
df <- sqldf(c("alter table df add month_grpd char(30)",
"alter table df add Wind_Ozone num",
"alter table df add Wind_1 num",
"alter table df add Wind_2 num",
"alter table df add Wind_norm num",
"update df set
month_grpd = case
when Month = 5 then 'May'
when Month = 6 then 'Jun'
when Month = 7 then 'Jul'
when Month = 8 then 'Aug'
when Month = 9 then 'Sep'
else 'Not bothered'
end,
Wind_Ozone = Wind*Ozone,
Wind_1 = case
when Wind > 9.7 then 0
else Wind
end,
Wind_2 = case
when Wind <= 9.7 then 0
else Wind
end,
Wind_norm = Wind/(select avg(Wind) from df)",
"select * from main.df"))
head(df)
## Ozone Solar_R Wind Temp Month Day Ozone_Grpd month_grpd Wind_Ozone
## 1 41 190 7.4 67 5 1 Medium May 303.4
## 2 36 118 8.0 72 5 2 Medium May 288.0
## 3 12 149 12.6 74 5 3 Low May 151.2
## 4 18 313 11.5 62 5 4 Medium May 207.0
## 5 NA NA 14.3 56 5 5 <NA> May NA
## 6 28 NA 14.9 66 5 6 Medium May 417.2
## Wind_1 Wind_2 Wind_norm
## 1 7.4 0 0.7432
## 2 8.0 0 0.8034
## 3 0.0 12 1.2654
## 4 0.0 11 1.1549
## 5 0.0 14 1.4361
## 6 0.0 14 1.4964
Joins
Database style joins are possibly the easiest way to merge and join data together. Data can be stacked on up of each other (a la rbind) using UNION and UNION ALL.
Different joins supported by SQLite include:
- Left Join (merges what it can from the right onto the left)
- Inner Join (merges common rows from both tables)
- Cross Join (cartesian product of tables)
# lets create a fake variable
df2 <- data.frame(mon = rep(5:9, each=16), day = seq(1, 31, 2))
df2$crazy_var <- df2$mon * rnorm(80, 30, 2)
sqldf("select Month, Day, month_grpd, Ozone, Wind_norm, crazy_var
from
(
select a.*, b.crazy_var
from
df a
left join
df2 b
on a.Month = b.mon and a.Day = b.day
where b.crazy_var is not NULL
union all
select a.*, b.crazy_var
from
(
select a.*
from
df a
left join
df2 b
on a.Month = b.mon and a.Day = b.day
where b.crazy_var is NULL
) a
left join
(
select mon, avg(crazy_var) as crazy_var
from
df2
group by mon
) b
on a.Month = b.mon
)")
## Month Day month_grpd Ozone Wind_norm crazy_var
## 1 5 1 May 41 0.7432 131.7
## 2 5 3 May 12 1.2654 134.5
## 3 5 5 May NA 1.4361 145.4
## 4 5 7 May 23 0.8637 157.9
## 5 5 9 May 8 2.0186 140.5
## 6 5 11 May 7 0.6929 151.9
## 7 5 13 May 11 0.9239 154.1
## 8 5 15 May 18 1.3256 156.6
## 9 5 17 May 34 1.2051 138.7
## 10 5 19 May 30 1.1549 158.9
## 11 5 21 May 1 0.9741 146.8
## 12 5 23 May 4 0.9741 171.3
## 13 5 25 May NA 1.6671 141.8
## 14 5 27 May NA 0.8034 144.0
## 15 5 29 May 45 1.4964 153.2
## 16 5 31 May 37 0.7432 147.9
## 17 6 1 Jun NA 0.8637 169.1
## 18 6 3 Jun NA 1.6169 184.5
## 19 6 5 Jun NA 0.8637 186.0
## 20 6 7 Jun 29 0.9741 181.6
## 21 6 9 Jun 71 1.3859 171.6
## 22 6 11 Jun NA 1.0947 180.5
## 23 6 13 Jun 23 0.8034 195.7
## 24 6 15 Jun NA 1.1549 197.5
## 25 6 17 Jun 37 2.0788 188.9
## 26 6 19 Jun 12 1.1549 185.3
## 27 6 21 Jun NA 0.6327 179.8
## 28 6 23 Jun NA 0.4620 162.1
## 29 6 25 Jun NA 0.8034 171.3
## 30 6 27 Jun NA 1.0344 174.5
## 31 6 29 Jun NA 1.4964 192.6
## 32 7 1 Jul 135 0.4117 232.3
## 33 7 3 Jul 32 0.9239 229.8
## 34 7 5 Jul 64 0.4620 203.6
## 35 7 7 Jul 77 0.5122 173.1
## 36 7 9 Jul 97 0.5724 225.8
## 37 7 11 Jul NA 0.8637 225.0
## 38 7 13 Jul 27 1.4964 204.2
## 39 7 15 Jul 7 1.4361 214.7
## 40 7 17 Jul 35 1.0344 174.1
## 41 7 19 Jul 79 0.5122 195.2
## 42 7 21 Jul 16 0.6929 209.1
## 43 7 23 Jul NA 1.1549 185.3
## 44 7 25 Jul 108 0.8034 225.1
## 45 7 27 Jul 52 1.2051 216.2
## 46 7 29 Jul 50 0.7432 209.8
## 47 7 31 Jul 59 0.9239 221.1
## 48 8 1 Aug 39 0.6929 239.4
## 49 8 3 Aug 16 0.7432 252.6
## 50 8 5 Aug 35 0.7432 208.1
## 51 8 7 Aug 122 0.4017 250.5
## 52 8 9 Aug 110 0.8034 254.5
## 53 8 11 Aug NA 1.1549 236.7
## 54 8 13 Aug 28 1.1549 239.8
## 55 8 15 Aug NA 1.1549 268.2
## 56 8 17 Aug 59 0.6327 255.6
## 57 8 19 Aug 31 1.0947 223.3
## 58 8 21 Aug 21 1.5566 222.5
## 59 8 23 Aug NA 1.2654 236.2
## 60 8 25 Aug 168 0.3415 250.7
## 61 8 27 Aug NA 0.5724 259.1
## 62 8 29 Aug 118 0.2310 248.3
## 63 8 31 Aug 85 0.6327 224.6
## 64 9 1 Sep 96 0.6929 284.1
## 65 9 3 Sep 73 0.2812 268.8
## 66 9 5 Sep 47 0.7432 255.4
## 67 9 7 Sep 20 1.0947 289.0
## 68 9 9 Sep 21 1.0947 300.5
## 69 9 11 Sep 44 1.4964 256.4
## 70 9 13 Sep 28 0.6327 243.0
## 71 9 15 Sep 13 1.1549 277.5
## 72 9 17 Sep 18 1.3859 282.0
## 73 9 19 Sep 24 1.0344 266.3
## 74 9 21 Sep 13 1.2654 261.3
## 75 9 23 Sep 36 1.0344 241.9
## 76 9 25 Sep 14 1.6671 233.2
## 77 9 27 Sep NA 1.3256 283.9
## 78 9 29 Sep 18 0.8034 258.8
## 79 5 2 May 36 0.8034 148.5
## 80 5 4 May 18 1.1549 148.5
## 81 5 6 May 28 1.4964 148.5
## 82 5 8 May 19 1.3859 148.5
## 83 5 10 May NA 0.8637 148.5
## 84 5 12 May 16 0.9741 148.5
## 85 5 14 May 14 1.0947 148.5
## 86 5 16 May 14 1.1549 148.5
## 87 5 18 May 6 1.8479 148.5
## 88 5 20 May 11 0.9741 148.5
## 89 5 22 May 11 1.6671 148.5
## 90 5 24 May 32 1.2051 148.5
## 91 5 26 May NA 1.4964 148.5
## 92 5 28 May 23 1.2051 148.5
## 93 5 30 May 115 0.5724 148.5
## 94 6 2 Jun NA 0.9741 182.3
## 95 6 4 Jun NA 0.9239 182.3
## 96 6 6 Jun NA 1.4361 182.3
## 97 6 8 Jun NA 0.6929 182.3
## 98 6 10 Jun 39 1.1549 182.3
## 99 6 12 Jun NA 0.9239 182.3
## 100 6 14 Jun NA 1.3859 182.3
## 101 6 16 Jun 21 1.4964 182.3
## 102 6 18 Jun 20 0.9239 182.3
## 103 6 20 Jun 13 1.0344 182.3
## 104 6 22 Jun NA 0.1707 182.3
## 105 6 24 Jun NA 0.6327 182.3
## 106 6 26 Jun NA 0.8034 182.3
## 107 6 28 Jun NA 1.1549 182.3
## 108 6 30 Jun NA 0.8034 182.3
## 109 7 2 Jul 49 0.9239 209.0
## 110 7 4 Jul NA 1.0947 209.0
## 111 7 6 Jul 40 1.0947 209.0
## 112 7 8 Jul 97 0.6327 209.0
## 113 7 10 Jul 85 0.7432 209.0
## 114 7 12 Jul 10 1.4361 209.0
## 115 7 14 Jul NA 1.4964 209.0
## 116 7 16 Jul 48 0.6929 209.0
## 117 7 18 Jul 61 0.6327 209.0
## 118 7 20 Jul 63 1.1549 209.0
## 119 7 22 Jul NA 0.9741 209.0
## 120 7 24 Jul 80 0.8637 209.0
## 121 7 26 Jul 20 0.8637 209.0
## 122 7 28 Jul 82 0.7432 209.0
## 123 7 30 Jul 64 0.7432 209.0
## 124 8 2 Aug 9 1.3859 241.9
## 125 8 4 Aug 78 0.6929 241.9
## 126 8 6 Aug 66 0.4620 241.9
## 127 8 8 Aug 89 1.0344 241.9
## 128 8 10 Aug NA 0.8637 241.9
## 129 8 12 Aug 44 1.1549 241.9
## 130 8 14 Aug 65 0.9741 241.9
## 131 8 16 Aug 22 1.0344 241.9
## 132 8 18 Aug 23 0.7432 241.9
## 133 8 20 Aug 44 1.0344 241.9
## 134 8 22 Aug 9 1.4361 241.9
## 135 8 24 Aug 45 0.9741 241.9
## 136 8 26 Aug 73 0.8034 241.9
## 137 8 28 Aug 76 0.9741 241.9
## 138 8 30 Aug 84 0.6327 241.9
## 139 9 2 Sep 78 0.5122 264.2
## 140 9 4 Sep 91 0.4620 264.2
## 141 9 6 Sep 32 1.5566 264.2
## 142 9 8 Sep 23 1.0344 264.2
## 143 9 10 Sep 24 0.9741 264.2
## 144 9 12 Sep 21 1.5566 264.2
## 145 9 14 Sep 9 1.0947 264.2
## 146 9 16 Sep 46 0.6929 264.2
## 147 9 18 Sep 13 1.0344 264.2
## 148 9 20 Sep 16 0.8034 264.2
## 149 9 22 Sep 23 0.9239 264.2
## 150 9 24 Sep 7 1.0344 264.2
## 151 9 26 Sep 30 0.6929 264.2
## 152 9 28 Sep 14 1.4361 264.2
## 153 9 30 Sep 20 1.1549 264.2
Comparison to ddply
Previously I’ve used ddply to summarize data, however I found if the variable I’m summarizing over is very granular the performance bombs.
library("plyr")
df_test1 <- data.frame(x = round(rnorm(1e+05), 3), y = rnorm(1e+05))
t <- Sys.time()
df_test2 <- ddply(df_test1, "x", summarise, y = mean(y))
nrow(df_test2)
## [1] 5779
Sys.time() - t
## Time difference of 2.302 secs
t <- Sys.time()
df_test3 <- sqldf("select x, avg(y) as y from df_test1 group by x")
nrow(df_test3)
## [1] 5779
Sys.time() - t
## Time difference of 0.4117 secs
# Test to see if they are the same
all(round(df_test2$y, 10) == round(df_test3$y, 10))
## [1] TRUE
Links
https://code.google.com/p/sqldf/