In R, I typically use data.frames to hold all my data. I've wondered when I should just use a matrix instead of a data.frame and this was nicely answered. It is also quite easy to perform operations on data.frames to obtain a subset of the data.
However if I had two data frames and wanted to connect them based on a common column, nothing seems more appropriate than a SQL join clause (especially if you've worked with relational databases before). Then I found this package in R that lets you perform SQL queries on data.frames (for example, joins!).
Installing:
install.packages("sqldf") library("sqldf")
Testing:
library(sqldf) sqldf("select * from iris limit 5") Loading required package: tcltk Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class = "tclObj") : [tcl] unknown math function "min".
After reading the troubleshooting page, I found out there's some problem with the tcltk library on the server. I don't have root access to the server running CentOS, so I just went with using the slower R engine.
#make sure your gsubfn version is gsubfn 0.6-4 or later packageVersion("gsubfn") [1] '0.6.5' options(gsubfn.engine = "R") sqldf("select * from iris limit 5") Sepal_Length Sepal_Width Petal_Length Petal_Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa
On the contrary, I installed sqldf on my Windows 7 laptop running R 2.15.2 and it worked without any errors.
Getting table information
sqldf("pragma table_info(iris)") cid name type notnull dflt_value pk 1 0 Sepal_Length REAL 0 <NA> 0 2 1 Sepal_Width REAL 0 <NA> 0 3 2 Petal_Length REAL 0 <NA> 0 4 3 Petal_Width REAL 0 <NA> 0 5 4 Species TEXT 0 <NA> 0
More examples
For more examples, see the documentation.
This work is licensed under a Creative Commons
Attribution 4.0 International License.