August 29, 2013

Data Manipulation with sqldf

Paul Shannon

Belfast R Users

29/08/13

I HATE data manipulation with R. Look at how ugly and unreadable this is:

df <- airquality

# Group Ozone levels into 'High', 'Medium' & 'Low'
df$Ozone_Grpd[df$Ozone < 18] <- "Low"
df$Ozone_Grpd[df$Ozone >= 18 & df$Ozone < 63] <- "Medium"
df$Ozone_Grpd[df$Ozone >= 63] <- "High"

And that’s something very basic!

SQLDF uses an SQLite database to perform querys and updates. However it’s good to know that once R kicks data to the database, SQLite only will return queries. This means any data manipulation done in SQLite will only remain in SQLite.

Introduction to SQL

Queries

Queries begin with the select statement, followed by what we want to select:

# This query will return everything from df
sqldf("select * from df")
# This will only return 2 columns, Wind and Ozone, from df
sqldf("select Ozone, Wind from df")

These queries can be assigned to R objects. (I think select Ozone, Wind from df is much nicer than df[, names(df) %in% c(“Ozone, Wind”)]).

We can easily create aggregated data using expressions in the select statement. Expressions include:

  • AVG: The avg() function returns the average value of all non-NULL X within a group.
  • COUNT: The count(X) function returns a count of the number of times that X is not NULL in a group. The count(*) function (with no arguments) returns the total number of rows in the group.
  • MAX: The max() aggregate function returns the maximum value of all values in the group.
  • MIN: The min() aggregate function returns the minimum non-NULL value of all values in the group.
  • SUM: The sum() and total() aggregate functions return sum of all non-NULL values in the group.

Unfortunately SQLite in quite limited and doesn’t support other expressions such as std (stand deviation) and var (variance).

Query Examples

titantic <- data.frame(Titanic)
head(titantic)
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

sqldf("select sum(Freq) as Count from titantic")
## Loading required package: tcltk
##   Count
## 1  2201
sqldf("select sum(Freq) as Count from titantic group by Sex")
##   Count
## 1   470
## 2  1731
sqldf("select Sex, sum(Freq) as Count from titantic group by Sex")
##      Sex Count
## 1 Female   470
## 2   Male  1731
sqldf("select Sex, Class, Age, sum(Freq) as Count from titantic where Survived = 'No' group by Sex, Class, Age")
##       Sex Class   Age Count
## 1  Female   1st Adult     4
## 2  Female   1st Child     0
## 3  Female   2nd Adult    13
## 4  Female   2nd Child     0
## 5  Female   3rd Adult    89
## 6  Female   3rd Child    17
## 7  Female  Crew Adult     3
## 8  Female  Crew Child     0
## 9    Male   1st Adult   118
## 10   Male   1st Child     0
## 11   Male   2nd Adult   154
## 12   Male   2nd Child     0
## 13   Male   3rd Adult   387
## 14   Male   3rd Child    35
## 15   Male  Crew Adult   670
## 16   Male  Crew Child     0

Inserting Variables

Previously I’ve written sql strings with placeholders for variable names. The use gsub to swap them out. Eg. “select avg(predicted) as Pred, avg(variable) from table group by variable. This can be done using the unix bash script style variable notation and fn$sqldf. This uses gsubfn, a quasi-perl-style string interpolation. gsubfn is used by sqldf so its already loaded. Note the fn prefix to invoke the interpolation functionality.

Eg.

Windvalue <- 10
Ozonevalue <- "High"
fn$sqldf("select * from df where Wind > $Windvalue and Ozone_Grpd = '$Ozonevalue'")
##   Ozone Solar_R Wind Temp Month Day Ozone_Grpd
## 1    71     291 13.8   90     6   9       High
## 2    63     220 11.5   85     7  20       High
## 3    89     229 10.3   90     8   8       High

Updates

We can update our table within the database and apply all the transformations we want. This only applies the changes to the database not the table in our workspace! To account for this we can then select everything back from the database by running select * from main .(table name). This selects from the database table rather than the workspace table.

library("sqldf")

df <- sqldf(c("alter table df add month_grpd char(30)",
              "alter table df add Wind_Ozone num",
              "alter table df add Wind_1 num",
              "alter table df add Wind_2 num",
              "alter table df add Wind_norm num",
              "update df set

                  month_grpd = case
                    when Month = 5 then 'May'
                    when Month = 6 then 'Jun'
                    when Month = 7 then 'Jul'
                    when Month = 8 then 'Aug'
                    when Month = 9 then 'Sep'
                    else 'Not bothered'
                  end,

                  Wind_Ozone = Wind*Ozone,

                  Wind_1 = case
                    when Wind > 9.7 then 0
                    else Wind
                  end,

                  Wind_2 = case
                    when Wind <= 9.7 then 0
                    else Wind
                  end,

                  Wind_norm = Wind/(select avg(Wind) from df)", 
              "select * from main.df"))

head(df)
##   Ozone Solar_R Wind Temp Month Day Ozone_Grpd month_grpd Wind_Ozone
## 1    41     190  7.4   67     5   1     Medium        May      303.4
## 2    36     118  8.0   72     5   2     Medium        May      288.0
## 3    12     149 12.6   74     5   3        Low        May      151.2
## 4    18     313 11.5   62     5   4     Medium        May      207.0
## 5    NA      NA 14.3   56     5   5       <NA>        May         NA
## 6    28      NA 14.9   66     5   6     Medium        May      417.2
##   Wind_1 Wind_2 Wind_norm
## 1    7.4      0    0.7432
## 2    8.0      0    0.8034
## 3    0.0     12    1.2654
## 4    0.0     11    1.1549
## 5    0.0     14    1.4361
## 6    0.0     14    1.4964

Joins

Database style joins are possibly the easiest way to merge and join data together. Data can be stacked on up of each other (a la rbind) using UNION and UNION ALL.

Different joins supported by SQLite include:

  • Left Join (merges what it can from the right onto the left)
  • Inner Join (merges common rows from both tables)
  • Cross Join (cartesian product of tables)
# lets create a fake variable
df2 <- data.frame(mon = rep(5:9, each=16), day = seq(1, 31, 2))
df2$crazy_var <- df2$mon * rnorm(80, 30, 2)

sqldf("select Month, Day, month_grpd, Ozone, Wind_norm, crazy_var 
      from
      (
        select a.*, b.crazy_var
        from
        df a
        left join
        df2 b
        on a.Month = b.mon and a.Day = b.day
        where b.crazy_var is not NULL

        union all

        select a.*, b.crazy_var
        from
        (
          select a.*
          from
          df a
          left join
          df2 b
          on a.Month = b.mon and a.Day = b.day
          where b.crazy_var is NULL
        ) a
        left join
        (
          select mon, avg(crazy_var) as crazy_var
          from
          df2
          group by mon
        ) b
        on a.Month = b.mon
      )")
##     Month Day month_grpd Ozone Wind_norm crazy_var
## 1       5   1        May    41    0.7432     131.7
## 2       5   3        May    12    1.2654     134.5
## 3       5   5        May    NA    1.4361     145.4
## 4       5   7        May    23    0.8637     157.9
## 5       5   9        May     8    2.0186     140.5
## 6       5  11        May     7    0.6929     151.9
## 7       5  13        May    11    0.9239     154.1
## 8       5  15        May    18    1.3256     156.6
## 9       5  17        May    34    1.2051     138.7
## 10      5  19        May    30    1.1549     158.9
## 11      5  21        May     1    0.9741     146.8
## 12      5  23        May     4    0.9741     171.3
## 13      5  25        May    NA    1.6671     141.8
## 14      5  27        May    NA    0.8034     144.0
## 15      5  29        May    45    1.4964     153.2
## 16      5  31        May    37    0.7432     147.9
## 17      6   1        Jun    NA    0.8637     169.1
## 18      6   3        Jun    NA    1.6169     184.5
## 19      6   5        Jun    NA    0.8637     186.0
## 20      6   7        Jun    29    0.9741     181.6
## 21      6   9        Jun    71    1.3859     171.6
## 22      6  11        Jun    NA    1.0947     180.5
## 23      6  13        Jun    23    0.8034     195.7
## 24      6  15        Jun    NA    1.1549     197.5
## 25      6  17        Jun    37    2.0788     188.9
## 26      6  19        Jun    12    1.1549     185.3
## 27      6  21        Jun    NA    0.6327     179.8
## 28      6  23        Jun    NA    0.4620     162.1
## 29      6  25        Jun    NA    0.8034     171.3
## 30      6  27        Jun    NA    1.0344     174.5
## 31      6  29        Jun    NA    1.4964     192.6
## 32      7   1        Jul   135    0.4117     232.3
## 33      7   3        Jul    32    0.9239     229.8
## 34      7   5        Jul    64    0.4620     203.6
## 35      7   7        Jul    77    0.5122     173.1
## 36      7   9        Jul    97    0.5724     225.8
## 37      7  11        Jul    NA    0.8637     225.0
## 38      7  13        Jul    27    1.4964     204.2
## 39      7  15        Jul     7    1.4361     214.7
## 40      7  17        Jul    35    1.0344     174.1
## 41      7  19        Jul    79    0.5122     195.2
## 42      7  21        Jul    16    0.6929     209.1
## 43      7  23        Jul    NA    1.1549     185.3
## 44      7  25        Jul   108    0.8034     225.1
## 45      7  27        Jul    52    1.2051     216.2
## 46      7  29        Jul    50    0.7432     209.8
## 47      7  31        Jul    59    0.9239     221.1
## 48      8   1        Aug    39    0.6929     239.4
## 49      8   3        Aug    16    0.7432     252.6
## 50      8   5        Aug    35    0.7432     208.1
## 51      8   7        Aug   122    0.4017     250.5
## 52      8   9        Aug   110    0.8034     254.5
## 53      8  11        Aug    NA    1.1549     236.7
## 54      8  13        Aug    28    1.1549     239.8
## 55      8  15        Aug    NA    1.1549     268.2
## 56      8  17        Aug    59    0.6327     255.6
## 57      8  19        Aug    31    1.0947     223.3
## 58      8  21        Aug    21    1.5566     222.5
## 59      8  23        Aug    NA    1.2654     236.2
## 60      8  25        Aug   168    0.3415     250.7
## 61      8  27        Aug    NA    0.5724     259.1
## 62      8  29        Aug   118    0.2310     248.3
## 63      8  31        Aug    85    0.6327     224.6
## 64      9   1        Sep    96    0.6929     284.1
## 65      9   3        Sep    73    0.2812     268.8
## 66      9   5        Sep    47    0.7432     255.4
## 67      9   7        Sep    20    1.0947     289.0
## 68      9   9        Sep    21    1.0947     300.5
## 69      9  11        Sep    44    1.4964     256.4
## 70      9  13        Sep    28    0.6327     243.0
## 71      9  15        Sep    13    1.1549     277.5
## 72      9  17        Sep    18    1.3859     282.0
## 73      9  19        Sep    24    1.0344     266.3
## 74      9  21        Sep    13    1.2654     261.3
## 75      9  23        Sep    36    1.0344     241.9
## 76      9  25        Sep    14    1.6671     233.2
## 77      9  27        Sep    NA    1.3256     283.9
## 78      9  29        Sep    18    0.8034     258.8
## 79      5   2        May    36    0.8034     148.5
## 80      5   4        May    18    1.1549     148.5
## 81      5   6        May    28    1.4964     148.5
## 82      5   8        May    19    1.3859     148.5
## 83      5  10        May    NA    0.8637     148.5
## 84      5  12        May    16    0.9741     148.5
## 85      5  14        May    14    1.0947     148.5
## 86      5  16        May    14    1.1549     148.5
## 87      5  18        May     6    1.8479     148.5
## 88      5  20        May    11    0.9741     148.5
## 89      5  22        May    11    1.6671     148.5
## 90      5  24        May    32    1.2051     148.5
## 91      5  26        May    NA    1.4964     148.5
## 92      5  28        May    23    1.2051     148.5
## 93      5  30        May   115    0.5724     148.5
## 94      6   2        Jun    NA    0.9741     182.3
## 95      6   4        Jun    NA    0.9239     182.3
## 96      6   6        Jun    NA    1.4361     182.3
## 97      6   8        Jun    NA    0.6929     182.3
## 98      6  10        Jun    39    1.1549     182.3
## 99      6  12        Jun    NA    0.9239     182.3
## 100     6  14        Jun    NA    1.3859     182.3
## 101     6  16        Jun    21    1.4964     182.3
## 102     6  18        Jun    20    0.9239     182.3
## 103     6  20        Jun    13    1.0344     182.3
## 104     6  22        Jun    NA    0.1707     182.3
## 105     6  24        Jun    NA    0.6327     182.3
## 106     6  26        Jun    NA    0.8034     182.3
## 107     6  28        Jun    NA    1.1549     182.3
## 108     6  30        Jun    NA    0.8034     182.3
## 109     7   2        Jul    49    0.9239     209.0
## 110     7   4        Jul    NA    1.0947     209.0
## 111     7   6        Jul    40    1.0947     209.0
## 112     7   8        Jul    97    0.6327     209.0
## 113     7  10        Jul    85    0.7432     209.0
## 114     7  12        Jul    10    1.4361     209.0
## 115     7  14        Jul    NA    1.4964     209.0
## 116     7  16        Jul    48    0.6929     209.0
## 117     7  18        Jul    61    0.6327     209.0
## 118     7  20        Jul    63    1.1549     209.0
## 119     7  22        Jul    NA    0.9741     209.0
## 120     7  24        Jul    80    0.8637     209.0
## 121     7  26        Jul    20    0.8637     209.0
## 122     7  28        Jul    82    0.7432     209.0
## 123     7  30        Jul    64    0.7432     209.0
## 124     8   2        Aug     9    1.3859     241.9
## 125     8   4        Aug    78    0.6929     241.9
## 126     8   6        Aug    66    0.4620     241.9
## 127     8   8        Aug    89    1.0344     241.9
## 128     8  10        Aug    NA    0.8637     241.9
## 129     8  12        Aug    44    1.1549     241.9
## 130     8  14        Aug    65    0.9741     241.9
## 131     8  16        Aug    22    1.0344     241.9
## 132     8  18        Aug    23    0.7432     241.9
## 133     8  20        Aug    44    1.0344     241.9
## 134     8  22        Aug     9    1.4361     241.9
## 135     8  24        Aug    45    0.9741     241.9
## 136     8  26        Aug    73    0.8034     241.9
## 137     8  28        Aug    76    0.9741     241.9
## 138     8  30        Aug    84    0.6327     241.9
## 139     9   2        Sep    78    0.5122     264.2
## 140     9   4        Sep    91    0.4620     264.2
## 141     9   6        Sep    32    1.5566     264.2
## 142     9   8        Sep    23    1.0344     264.2
## 143     9  10        Sep    24    0.9741     264.2
## 144     9  12        Sep    21    1.5566     264.2
## 145     9  14        Sep     9    1.0947     264.2
## 146     9  16        Sep    46    0.6929     264.2
## 147     9  18        Sep    13    1.0344     264.2
## 148     9  20        Sep    16    0.8034     264.2
## 149     9  22        Sep    23    0.9239     264.2
## 150     9  24        Sep     7    1.0344     264.2
## 151     9  26        Sep    30    0.6929     264.2
## 152     9  28        Sep    14    1.4361     264.2
## 153     9  30        Sep    20    1.1549     264.2

Comparison to ddply

Previously I’ve used ddply to summarize data, however I found if the variable I’m summarizing over is very granular the performance bombs.

library("plyr")

df_test1 <- data.frame(x = round(rnorm(1e+05), 3), y = rnorm(1e+05))

t <- Sys.time()
df_test2 <- ddply(df_test1, "x", summarise, y = mean(y))
nrow(df_test2)
## [1] 5779
Sys.time() - t
## Time difference of 2.302 secs


t <- Sys.time()
df_test3 <- sqldf("select x, avg(y) as y from df_test1 group by x")
nrow(df_test3)
## [1] 5779
Sys.time() - t
## Time difference of 0.4117 secs

# Test to see if they are the same
all(round(df_test2$y, 10) == round(df_test3$y, 10))
## [1] TRUE

Links

https://code.google.com/p/sqldf/

Blog comments powered by Disqus