Learning R through a mini game part 3

This is my third post on learning R through the BetaBit package, which contains three mini games for learning R. I wrote about the first game, called proton, late last year and the second game, called frequon, a week and a half ago. The third game is called regression and it’s much more statistical than the other two. I actually couldn’t complete the last task of the game, so if you know how to approach it, please let me know!

To get started, load the BetaBit package and type regression().

library(BetaBit)

regression()
#  _____ _          _____                         _            _____
# |_   _| |_ ___   | __  |___ ___ ___ ___ ___ ___|_|___ ___   |   __|___ _____ ___
#   | | |   | -_|  |    -| -_| . |  _| -_|_ -|_ -| | . |   |  |  |  | .'|     | -_|
#   |_| |_|_|___|  |__|__|___|_  |_| |___|___|___|_|___|_|_|  |_____|__,|_|_|_|___|
#                            |___|
# 
# During their summer internship Beta and Bit are helping professor Pearson to analyse data from his educational research on Polish upper-secondary schools students.
# 
# Professor Pearson is mostly interested in educational inequalities, particularly in how the socio-economic status of the parents affects their children's educational attainment. He hopes his research will help to develop programmes providing support to young people from disadvantaged families and reducing inequalities in our society.
# 
# If you want to help Beta and Bit in their analysis, type:
#   regression(subject = "Summer internship")
# 
# If you need help, try to add `hint = TRUE` and/or `techHint = TRUE` argument to the `regression()` function call.

To continue, simply reply with regression(subject = "Summer internship").

regression(subject = "Summer internship")

# Dear Beta and Bit,
# 
# I sent you a dataset named 'dataFSW' for an analysis. An additional dataset called 'varLabels' contains variable labels so that you can check what each variable measures.
# 
# Let's start with something simple. Can you compute for me the correlations between the measures of cognitive abilities (`MATH_2009`, `READ_2009`, `SCIE_2009`) and the highest parental International Socio-Economic Index `hisei`?
# 
# You should get three correlations.
# 
# Please send me back a vector containing these three correlations by calling:
#   `regression(subject = "Correlations", content = "vector of correlations")`
# 
# Best regards
# 
# Professor Pearson

I assumed that with a name like Professor Pearson, he/she wanted the Pearson correlation. There’s some missing data in the data set, so I needed to specify use='na.or.complete'.

# check the dimension of the data
dim(dataFSW)
# [1] 3796   54

# check what the columns are
my_column <- c('MATH_2009', 'READ_2009', 'SCIE_2009', 'hisei')
varLabels[varLabels$variable_name %in% my_column,]
#    variable_name                                                                    variable_label
# 14     MATH_2009                              Score on PISA mathematics test (pseudoEAP estimator)
# 15     READ_2009                                  Score on PISA reading test (pseudoEAP estimator)
# 16     SCIE_2009                                  Score on PISA science test (pseudoEAP estimator)
# 27         hisei Highest parental occupational status (highest International Socio-Economic Index)

# calculate the correlations
a <- cor(dataFSW$MATH_2009, dataFSW$hisei, use='na.or.complete')
b <- cor(dataFSW$READ_2009, dataFSW$hisei, use='na.or.complete')
c <- cor(dataFSW$SCIE_2009, dataFSW$hisei, use='na.or.complete')

cor_vector <- c(a, b, c)
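# the same three correlations in one call (a sketch): cor() also accepts a data frame,
# and with pairwise deletion this matches the separate calls above
cor_vector_alt <- drop(cor(dataFSW[, c('MATH_2009', 'READ_2009', 'SCIE_2009')],
                           dataFSW$hisei, use = 'pairwise.complete.obs'))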

# task 1
regression(subject = "Correlations", content = cor_vector)

# Dear Beta and Bit,
# 
# The correlations you sent me are very strange. They should be positive and about twice as high! Perhaps there is something wrong with the dataset... I think that I mixed names of the variables when I was preparing the dataset. I'm sure the first sixteen names are correct, but the rest might be a mess.
# 
# Can you try to identify the name of the variable that measures HISEI in the dataset? You should be able to find it because I remember that the relationship between `READ_2009` and HISEI was nearly perfectly linear (although not very strong).
# 
# When you're done, please send me back the correct name of the variable that describes HISEI by calling:
#   `regression(subject = "Name of the variable", content = "name of the variable describing HISEI")`
# 
# Best regards
# 
# Professor Pearson

For the next bit, I just manually went through all the scatter plots and correlations.

# step through a scatter plot of READ_2009 against each remaining column,
# printing the correlation as we go; par(ask = TRUE) pauses before each plot
par(ask = TRUE)
for (i in 17:ncol(dataFSW)){
  x <- cor(dataFSW$READ_2009, dataFSW[, i], use = 'na.or.complete')
  plot(dataFSW$READ_2009, dataFSW[, i], pch = 19, main = names(dataFSW)[i])
  print(paste(names(dataFSW)[i], x))
}
par(ask = FALSE)
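# a non-interactive sketch of the same idea: compute all the correlations in one
# pass (assuming, as in the loop above, that columns 17 onwards are numeric) and
# then only inspect the scatter plots of the most promising candidates
cors <- sapply(dataFSW[, 17:ncol(dataFSW)],
               function(v) cor(dataFSW$READ_2009, v, use = 'na.or.complete'))
head(sort(cors, decreasing = TRUE))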

regression(subject = "Name of the variable", content = "highconf")

# Good work!
# 
# Luckily, I found the correct version of the dataset. I sent it to you. It is named simply `FSW`. Please, use it for further analysis.
# 
# Now I want you to help me with analysing the relationship between the students' attainment and the income of their parents. I estimated an OLS regression model in which the reading test score is predicted by two variables: `cultpos` (index describing the availability of cultural resources in a household) and `income` (monthly household income):
#   `lm(READ_2009 ~ cultpos + income, FSW)`
# Although I don't expect the income effect to be strong when the availability of cultural resources is controlled for, it should be statistically significant.
# 
# I think the skewness of the `income` distribution might cause some problems in the model. Perhaps you could propose a (nonlinear) transformation of `income`, so that the transformed variable is more strongly related to `READ_2009` and statistically significant (on a 0.05 significance level) when put instead of the original variable `income` into the regression model described above.
# 
# Please, send me the expression describing such a transformation by calling:
#   `regression(subject = "transformation", content = expression(your transformation of income))`
# 
# Best regards
# 
# Professor Pearson

Income distributions are usually heavily right-skewed, so I thought it was a good idea to log-transform the variable.
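A quick pair of histograms makes the skew obvious. This is a rough sketch; it assumes income is positive wherever it is observed.

# compare the raw and log-transformed income distributions
par(mfrow = c(1, 2))
hist(FSW$income, main = 'income', xlab = 'income')
hist(log10(FSW$income), main = 'log10(income)', xlab = 'log10(income)')
par(mfrow = c(1, 1))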

# income not significant
summary(lm(READ_2009 ~ cultpos + income, FSW))

# Call:
# lm(formula = READ_2009 ~ cultpos + income, data = FSW)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -323.90  -54.01    6.18   59.49  217.27 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 517.4101     2.2232 232.731   <2e-16 ***
# cultpos      28.7648     1.3877  20.729   <2e-16 ***
# income        0.1096     0.1887   0.581    0.561    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 83.05 on 3783 degrees of freedom
#   (10 observations deleted due to missingness)
# Multiple R-squared:  0.1083,	Adjusted R-squared:  0.1079 
# F-statistic: 229.8 on 2 and 3783 DF,  p-value: < 2.2e-16

# income significant
summary(lm(READ_2009 ~ cultpos + log10(income), FSW))

# Call:
# lm(formula = READ_2009 ~ cultpos + log10(income), data = FSW)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -323.81  -53.87    6.66   58.62  217.52 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    504.881      5.094  99.109  < 2e-16 ***
# cultpos         27.881      1.404  19.860  < 2e-16 ***
# log10(income)   15.288      5.537   2.761  0.00579 ** 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 82.97 on 3783 degrees of freedom
#   (10 observations deleted due to missingness)
# Multiple R-squared:  0.1101,	Adjusted R-squared:  0.1096 
# F-statistic: 233.9 on 2 and 3783 DF,  p-value: < 2.2e-16

regression(subject = "transformation", content = expression(log10(income)))

# Very good!
# 
# It's nice you decided to use logarithmic transformation. The slope parameter for transformed income has clear interpretation: that's the change in prediction when the value of income rises ten times.
# 
# Please look at a somewhat more complicated model. I wanted to examine impact of multiple variables on reading test scores (`READ_2009`). There are variables describing sex and school type (track), and indicators of socio-economic status of student's family (`log(income)`, `homepos`, `hisei`, `csesi`) and of psychological tests scores (`RAVEN_WYN`, `STAI_C_WYN`, `STAI_S_WYN`, `SES_WYN`, `ZAMPS_WYN`). I estimate the model with:
#   `lm(READ_2009 ~ SEX + SCHOOL_TYPE + log(income) + homepos + hisei + csesi + RAVEN_WYN + STAI_C_WYN + STAI_S_WYN + SES_WYN + ZAMPS_WYN, FSW)`
# 
# However, although each variable alone is a statistically significant predictor of `READ_2009` most of them turns out to be insignificant in this model.
# 
# Perhaps the problem lies in some relationship between the predictors ... it's called "collinearity", or something like that... I think that removing some variables from the model will make the remaining ones statistically significant.
# 
# Please find out which variables should be removed so that the remaining ones are significant at the 0.05 significance level. You should remove as few variables as possible.
# 
# If you're done, send me the names of the variables to remove by calling:
#   `regression(subject = "Collinearity", content = character_vector_with_names_of_removed_variables)`
# 
# Best regards
# 
# Professor Pearson

Firstly, I wanted to visualise all the correlations by using pairs().

# lower-panel function for pairs(): print the pairwise correlation instead of a scatter plot
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- cor(x, y, use = 'na.or.complete')
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  text(0.5, 0.5, txt, cex = 2)
}

pairs(~ SEX + SCHOOL_TYPE + log(income) + homepos + hisei + csesi + RAVEN_WYN + STAI_C_WYN + STAI_S_WYN + SES_WYN + ZAMPS_WYN, FSW, lower.panel=panel.cor)

(Pairs plot of the predictors, with correlations in the lower panels.)

We can definitely remove some variables from the multiple regression.
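A more formal collinearity check than eyeballing the pairwise correlations is the variance inflation factor. A minimal sketch, assuming the car package is installed:

library(car)
# VIFs (generalised VIFs for factors) for the full model; values well above the
# usual 5-10 rule of thumb flag predictors largely explained by the others
vif(lm(READ_2009 ~ SEX + SCHOOL_TYPE + log(income) + homepos + hisei + csesi +
         RAVEN_WYN + STAI_C_WYN + STAI_S_WYN + SES_WYN + ZAMPS_WYN, FSW))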

I removed a few combinations of correlated variables by hand but couldn't find the optimal solution, so I wrote some code to count the significant coefficients after removing variables one at a time. There is some essential reading on this at Cross Validated.

# drop one predictor at a time, refit, and count how many coefficients
# stay significant at the 0.05 level
feature <- c('READ_2009', 'SEX', 'SCHOOL_TYPE', 'income', 'homepos', 'hisei', 'csesi',
             'RAVEN_WYN', 'STAI_C_WYN', 'STAI_S_WYN', 'SES_WYN', 'ZAMPS_WYN')
for (i in 2:length(feature)){
  my_subset <- FSW[, feature[-i]]
  my_model <- lm(READ_2009 ~ ., my_subset)
  x <- summary(my_model)$coefficients[, 4]   # p-values
  cat(feature[i], '\n')                      # the variable that was dropped
  print(table(x < 0.05))
}

# removing ZAMPS_WYN gave 9 significant variables and 4 insignificant
# this was the best ratio among all the other variables
# ZAMPS_WYN
# FALSE  TRUE 
#     4     9
# after removing ZAMPS_WYN, I repeated the process
# these were the variables that gave the best ratio
# csesi
# FALSE  TRUE 
#     2    10 
# RAVEN_WYN
# FALSE  TRUE 
#     2    10 
# STAI_C_WYN
# FALSE  TRUE 
#     2    10
# after some testing, I came up with
regression(subject = "Collinearity", content = c('ZAMPS_WYN', 'csesi', 'STAI_C_WYN'))

# It looks fine!
# 
# I wonder whether you have tried to remove variables of some special properties or have simply written a code that checks all possible combinations of the predictors...
# 
# Let's get back to the relationship between `hisei` and `READ_2009`. I wonder if it looks similar within different schools in the sample. It might be a general relationship or it might depend on a school context.
# 
# Can you provide me a data frame consisting of two columns: `SCHOOL_ID` and `par_hisei`? The first column is self-descriptive and the second one should contain values of the slope parameter for `hisei` (from the OLS regression model) in each school.
# 
# When you finish, please send me the data frame by calling:
#   `regression(subject = "Groups", content = data_frame_containig_results`)
# 
# Best regards
# 
# Professor Pearson

I relied on packages from the Hadleyverse (plyr and dplyr) to fit a separate regression for each school.

library(plyr); library(dplyr)

# fit READ_2009 ~ hisei separately within each school
models <- dlply(FSW, 'SCHOOL_ID', function(df)
  lm(READ_2009 ~ hisei, data = df))

# apply coef() to each model and return a data frame (SCHOOL_ID, intercept, slope)
all_model <- ldply(models, coef)

# drop the intercept column and rename the slope column
school_id_slope <- all_model[, -2]
names(school_id_slope)[2] <- 'par_hisei'
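# a base-R alternative (sketch): one lm() per school via split() + lapply();
# note that SCHOOL_ID comes back as character here and may need converting
slopes <- lapply(split(FSW, FSW$SCHOOL_ID),
                 function(df) coef(lm(READ_2009 ~ hisei, data = df))[['hisei']])
school_id_slope_alt <- data.frame(SCHOOL_ID = names(slopes),
                                  par_hisei = unlist(slopes),
                                  stringsAsFactors = FALSE)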

regression(subject = "Groups", content = school_id_slope)

# Great!
# 
# Did you perform different regressions in each group or used one model with an interaction term between `hisei` and `factor(SCHOOL_ID)`?
# 
# The variability between schools does not seem to be high with respect to the `hisei` parameter. However let's try to identify schools with bigger differences.
# 
# The mean value of the slope parameter among schools is about 0.434. Can you identify schools for which the slope parameter value of `hisei` is statistically significantly different (at the 0.05 significance level) from the mean value among schools (0.434)? I want you to do this using the two-sided Wald t test (or the F test, which is equivalent in this case) on the basis of the OLS regression model (or its reparametrisation):
#   lm(READ_2009 ~ hisei * factor(SCHOOL_ID), FSW)
# 
# As a result please send me a vector of `SCHOOL_ID` of the schools for which the slope parameter value is statistically significantly different from mean by calling:
#   `regression(subject = "Significant differences", content = vector_of_school_ids)`
# 
# Best regards
# 
# Professor Pearson

I tried all sorts of approaches to the next task before giving in and resorting to the hint system, then read the contrast coding page it suggested to come up with the answer. The trick was to use the contr.sum() function, so that each school's slope is contrasted against the overall mean rather than against a reference school.

regression(subject = "Significant differences", content = 'blah', hint=TRUE)
# You can try to reparametrize the regression model by using proper contrasts so that the differences of interest will be model parameters. For a brief introduction to contrasts you can consult 
# http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm.
# Alternatively you can compute the Wald test for all schools slope coefficients.
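To see what contr.sum() actually does, here is the contrast matrix for a four-level factor: every level except the last gets its own column and the last level is coded -1 throughout, so each coefficient measures a deviation from the (unweighted) mean across levels rather than from a reference level.

contr.sum(4)
#   [,1] [,2] [,3]
# 1    1    0    0
# 2    0    1    0
# 3    0    0    1
# 4   -1   -1   -1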

# keep only the variables we need and recode SCHOOL_ID with sum contrasts, so each
# school's hisei slope is compared against the mean slope across schools
new <- select(FSW, READ_2009, hisei, SCHOOL_ID)
new$SCHOOL_ID <- factor(new$SCHOOL_ID)
contrasts(new$SCHOOL_ID) <- contr.sum(nrow(school_id_slope))
z <- lm(READ_2009 ~ hisei * SCHOOL_ID, new)
summary(z)
# output not pasted
# I eyeballed the school IDs that were significant
# these are NOT the school IDs but only the indexes
answer_index <- c(43, 61, 73, 83, 94, 95, 96, 105, 112, 117, 133, 136, 146, 155, 187)
answer <- school_id_slope[answer_index,1]
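# a less manual alternative (sketch): pull the significant interaction terms out of
# summary(z) directly; with contr.sum the coefficients are named hisei:SCHOOL_ID1,
# hisei:SCHOOL_ID2, ... in the order of the factor levels (the last school gets no
# coefficient of its own, so it cannot show up this way)
coefs <- summary(z)$coefficients
sig <- grepl('^hisei:SCHOOL_ID', rownames(coefs)) & coefs[, 4] < 0.05
idx <- as.integer(sub('^hisei:SCHOOL_ID', '', rownames(coefs)[sig]))
levels(new$SCHOOL_ID)[idx]  # should match `answer` above (as characters)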
regression(subject = "Significant differences", content = answer)

# Well done!
# 
# That's nice you took into account that the mean value of slope parameters is also estimated with error.
# 
# Now let's turn to a little different issue. It's well known that in lower grades the cognitive abilities, as measured by for example in `READ_2009`, depend on pupils' age. At the upper-secondary level this effect is probably much lower or even non-existent but we should check it.
# 
# There is another issue that complicates the analysis. In the sample there are students who started school one year earlier than they were supposed to (normally children in Poland start primary school at the age of 7 but under certain conditions they can start when they are 6 years old). These pupils are not representative for their cohort because they are selective (they are brighter than the average). On the other hand there are also pupils who are older than they should be. These are pupils who either started school one year later than they were supposed to or repeated a grade. Whatever the reason we can describe them as "negatively selected", i.e. they have usually lower cognitive abilities than other pupils of their age (who are not in our sample because when the study was conducted they were already in an upper grade).
# 
# Because of that the relationship between `RAVEN_AGE` and `READ_2009` is perhaps not continuous: it can look different in some ranges of age than in others. Your task is to check how this relationship looks like and to propose a linear model describing it in a proper way.
# 
# Please, send me the formula of the model by calling:
#   `regression(subject = "Age", content = READ_2009 ~ your_formula`
# 
# Generally, your formula should contain transformations of only one variable: `RAVEN_AGE`. If you find it more convenient not to use explicit transformations in the formula (using `I()` or other functions), you can send me a formula containing additional variables you want to create. In such case you must call `regression()` function with an additional argument `vars` that will contain a named list of the expressions describing how to compute the additional variables. Nevertheless you can construct them only as transformations of `RAVEN_AGE`. Here is an example of such a call:
#   `regression(subject = "age", content = READ_2009 ~ RAVEN_AGE + AGE3, vars = list(AGE3 = expression(RAVEN_AGE^3)))`
# 
# Best regards
# 
# Professor Pearson

Firstly, I visualised the variables using ggplot2.

library(ggplot2)
ggplot(FSW, aes(RAVEN_AGE, READ_2009)) + geom_point() + geom_smooth()
# Warning messages:
# 1: Removed 322 rows containing non-finite values (stat_smooth). 
# 2: Removed 322 rows containing missing values (geom_point). 

(Scatter plot of RAVEN_AGE against READ_2009 with a smoother.)

How do you transform the densely packed region between 17 and 18?

I tried all sorts of transformations but nothing was good enough. I looked at the underlying code of the package and found that I needed a deviance of at most 24,446,455; the best I could manage was 25,396,703. I read up on the Box-Cox transformation and, using the MASS package, it suggested a lambda of around 2, but that model had a deviance of 25,445,454. I decided to throw in the towel for now.

age_lm <- lm(READ_2009 ~ RAVEN_AGE, data = FSW)
deviance(age_lm)
# [1] 25463885

library(MASS)
# profile the Box-Cox log-likelihood for lambda between 1 and 3
boxcox(age_lm, lambda = seq(1, 3, by = .1))

(Box-Cox plot: the profile log-likelihood peaks at a lambda of about 2.)
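Since the game scores the model by its deviance, candidate specifications of RAVEN_AGE can at least be screened on a common scale. A small sketch of that comparison; the formulas below are only illustrative, not the answer:

# compare candidate specifications of RAVEN_AGE by deviance (lower is better)
candidates <- list(
  linear    = READ_2009 ~ RAVEN_AGE,
  quadratic = READ_2009 ~ RAVEN_AGE + I(RAVEN_AGE^2),
  boxcox2   = READ_2009 ~ I(RAVEN_AGE^2)
)
sapply(candidates, function(f) deviance(lm(f, data = FSW)))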

Summary

If you complete the game, you should get the message:

I wish you many interesting and challenging analytical problems to solve in the future! Remember: per aspera ad astra! You have solved all the problems and helped me analyse the data! I’m sure that your solutions will enable me to understand the causes of educational inequalities better and thus develop effective ways to reduce them. Congratulations!

Unfortunately, I couldn’t solve the last task…

Useful resources

  1. Regression Models in R
  2. Handling Multicollinearity in Regression Analysis
Comments
  1. Can you explain why the answer to the second question is “highconf”? There are other variables, such as “csesi” or even “attcomp”, which IMO match the conditions better.

  2. Spline regression gives a deviance below the threshold:


    regression(subject = "Age", content = READ_2009 ~ ifelse(RAVEN_AGE<16.9,RAVEN_AGE,0) + ifelse(RAVEN_AGE>18,RAVEN_AGE,0))

    Note that in this model there is no dependence between y and x in the
    interval between 16.9 and 18 on the x axis.
