Introduction
By default, R uses only one CPU (or core) when running tasks.
However, you may have access to a computer cluster that offers
more RAM and CPUs. Using more than one CPU at the same time is known as
parallel computing. The goal of this tutorial is to provide the
basics of using the batchtools package to make use of multiple
cores in a cluster.
This tutorial builds on the batchtools tutorial provided by UCR’s High
Performance Computing Center (HPCC). It uses a simulation study to show you
the power of batchtools. For more information, visit their help
documentation for batchtools.
This tutorial is meant to be run on the HPCC cluster at UCR. You will need an
account on the HPCC cluster. If you are a graduate student in UCR’s Statistics Department, contact
the UCR Statistics Graduate Student Association to gain access.
This tutorial runs several simulation scenarios that are submitted
as jobs. It requires some setup to be effective. There is an
R script that may provide better insight into using
batchtools. You can access the R script here
Simulation Example
To demonstrate how to use the batchtools package in R,
we conduct several simulation studies showing that the
ordinary least squares estimator yields unbiased estimates. We will
simulate data from the following models:
\[
Y \sim N(20+30X,\ 3)
\] \[
Y \sim N(5-8X_1+2X_2,\ 3)
\]
\[
Y \sim N(5+4X_1-5X_2-3X_3,\ 3)
\]
\[
Y \sim N(5+4X_1-5X_2-3X_3+6X_4,\ 3)
\]
Each simulation scenario will have 500 data sets with 200
observations. The values of the predictor variables will be simulated
from multivariate normal distributions. Following the parameter list used in the code, the mean vectors for the
predictors are \(0\), \((-2, 0)^T\), \((-2, 0, 2)^T\), and \((-2, 0, 2, 8)^T\). The covariance matrix of the
predictors in each scenario is the identity matrix.
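In general form, every scenario is a linear model, and the property being checked is the unbiasedness of the ordinary least squares estimator. The matrix notation below (\(\mathbf{X}\) for the design matrix with a leading column of ones, \(\boldsymbol{\beta}\) for the coefficient vector, \(\boldsymbol{\varepsilon}\) for the error term) is introduced here for illustration only:
\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad E(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{0}
\]
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}, \qquad E(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}
\]
Averaging the estimates over many simulated data sets should therefore recover the true coefficients, up to simulation error.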
Simulation Parameters
The simulation parameters will be stored in a list. Each element in
the list will contain the settings for one simulation scenario and the
formula for the lm() function.
sim_list <- list(
  list(N = 500,                                # Number of data sets
       nobs = 200,                             # Number of observations
       beta = c(20, 30),                       # True beta parameters (intercept first)
       xmeans = c(0),                          # Means for predictors
       xsigs = diag(rep(1, 1)),                # Covariance matrix for predictors
       sigma = 3,                              # Error term spread (used as sd in rnorm())
       formula = y ~ x),                       # Formula
  list(N = 500,                                # Number of data sets
       nobs = 200,                             # Number of observations
       beta = c(5, -8, 2),                     # True beta parameters (intercept first)
       xmeans = c(-2, 0),                      # Means for predictors
       xsigs = diag(rep(1, 2)),                # Covariance matrix for predictors
       sigma = 3,                              # Error term spread (used as sd in rnorm())
       formula = y ~ x.1 + x.2),               # Formula
  list(N = 500,                                # Number of data sets
       nobs = 200,                             # Number of observations
       beta = c(5, 4, -5, -3),                 # True beta parameters (intercept first)
       xmeans = c(-2, 0, 2),                   # Means for predictors
       xsigs = diag(rep(1, 3)),                # Covariance matrix for predictors
       sigma = 3,                              # Error term spread (used as sd in rnorm())
       formula = y ~ x.1 + x.2 + x.3),         # Formula
  list(N = 500,                                # Number of data sets
       nobs = 200,                             # Number of observations
       beta = c(5, 4, -5, -3, 6),              # True beta parameters (intercept first)
       xmeans = c(-2, 0, 2, 8),                # Means for predictors
       xsigs = diag(rep(1, 4)),                # Covariance matrix for predictors
       sigma = 3,                              # Error term spread (used as sd in rnorm())
       formula = y ~ x.1 + x.2 + x.3 + x.4)    # Formula
)
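Because each scenario bundles several pieces that must agree in size, a quick consistency check can catch typos before any jobs are submitted. The helper below is a hypothetical sketch, not part of the original tutorial; the name check_scenario and the specific checks are assumptions made for illustration.

```r
# Hypothetical helper (not from the original tutorial): verifies that the
# pieces of one scenario have compatible dimensions.
check_scenario <- function(s) {
  p <- length(s$xmeans)                     # number of predictors
  stopifnot(
    length(s$beta) == p + 1,                # intercept plus one slope per predictor
    identical(dim(s$xsigs), c(p, p)),       # covariance matrix is p x p
    length(all.vars(s$formula)) == p + 1    # response plus p predictors in the formula
  )
  invisible(TRUE)
}

# One scenario copied from the list above, so this block runs on its own
scenario <- list(N = 500, nobs = 200,
                 beta = c(5, -8, 2),
                 xmeans = c(-2, 0),
                 xsigs = diag(rep(1, 2)),
                 sigma = 3,
                 formula = y ~ x.1 + x.2)

ok <- check_scenario(scenario)
```

With the full list in memory, invisible(lapply(sim_list, check_scenario)) would run the same check on every scenario.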
Simulation Functions
The function below generates one data set from a simulation scenario
above and returns a data frame.
library(mvtnorm) # Provides rmvnorm()

data_set_sim <- function(seed, nobs, beta, sigma, xmeans, xsigs){ # Simulates one data set
  set.seed(seed) # Sets a seed
  xrn <- rmvnorm(nobs, mean = xmeans, sigma = xsigs) # Simulates predictors
  xped <- cbind(rep(1, nobs), xrn) # Creates the design matrix (intercept column first)
  y <- xped %*% beta + rnorm(nobs, 0, sigma) # Simulates Y
  df <- data.frame(x = xrn, y = y) # Creates the data frame
  return(df)
}
The function needs the following arguments:
seed: the value used to seed the random number generator
nobs: the number of observations
beta: a vector specifying the true values of the regression coefficients (\(\beta_0\), \(\beta_1\), \(\ldots\)), intercept first
sigma: the spread of the error term; note that the code passes it to rnorm() as the standard deviation
xmeans: a vector of means used to generate the values of the predictors (\(X_1\), \(X_2\), \(\ldots\))
xsigs: the covariance matrix of the predictors
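To see what a call returns, the sketch below runs the same steps on the second scenario. So that it runs without extra packages, it swaps rmvnorm() for a base-R stand-in built on the Cholesky factor; the names rmvnorm_base and data_set_sim_base are hypothetical and are not part of the tutorial or of mvtnorm.

```r
# Hypothetical base-R stand-in for mvtnorm::rmvnorm(), used only so this
# sketch runs without installing anything.
rmvnorm_base <- function(n, mean, sigma) {
  p <- length(mean)
  z <- matrix(rnorm(n * p), nrow = n, ncol = p)  # independent N(0, 1) draws
  sweep(z %*% chol(sigma), 2, mean, "+")         # impose covariance, then shift by the means
}

# Same steps as data_set_sim(), with the stand-in swapped in
data_set_sim_base <- function(seed, nobs, beta, sigma, xmeans, xsigs) {
  set.seed(seed)
  xrn <- rmvnorm_base(nobs, mean = xmeans, sigma = xsigs)
  xped <- cbind(rep(1, nobs), xrn)               # design matrix with intercept column
  y <- xped %*% beta + rnorm(nobs, 0, sigma)
  data.frame(x = xrn, y = y)
}

df <- data_set_sim_base(seed = 1, nobs = 200, beta = c(5, -8, 2), sigma = 3,
                        xmeans = c(-2, 0), xsigs = diag(rep(1, 2)))
dim(df)    # 200 rows, 3 columns
names(df)  # "x.1" "x.2" "y"
```

The column names x.1 and x.2 come from data.frame() expanding the unnamed predictor matrix, which is why the formulas in the parameter list refer to x.1, x.2, and so on.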
The function below generates all the data sets for a simulation scenario and
returns a list containing the data sets (themselves in a list) and the formula
to pass to the lm() function.
data_sim <- function(data){ # Simulates all data sets for one scenario
  df_list <- lapply(seq_len(data$N), data_set_sim, # Each index is used as the seed
                    nobs = data$nobs, beta = data$beta, sigma = data$sigma,
                    xmeans = data$xmeans, xsigs = data$xsigs)
  return(list(df_list = df_list, formula = data$formula))
}
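The key mechanism in data_sim() is that lapply() hands each element of the index vector to the first argument of the function (here the seed) and passes every other named argument unchanged. A toy sketch of that pattern, with an illustrative function name draw that is not part of the tutorial:

```r
# Toy function: seed receives each element of 1:3 in turn,
# while n is fixed for every call
draw <- function(seed, n) {
  set.seed(seed)   # a different seed per call
  rnorm(n)
}

draws <- lapply(1:3, draw, n = 2)
length(draws)                          # 3, one element per seed
identical(draws[[1]], draw(1, n = 2))  # TRUE: the same seed reproduces the same draw
```

Seeding each data set by its index is what makes every job reproducible regardless of which node runs it.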
The function below takes the data generated by
data_sim() and fits a linear regression model to each data set. The function
wraps around the lm() function and returns the estimated
regression coefficients.
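The code for this step is not reproduced here. As a minimal sketch of what such a wrapper might look like, assuming it receives the list returned by data_sim() and returns a matrix with one row per data set, consider the following. The name fit_lm_sketch and the return shape are assumptions, and the small stand-in data are generated inline so the block runs on its own.

```r
# Hypothetical fitting function (name and return shape are assumptions):
# fits lm() to every data frame and stacks the coefficients row by row.
fit_lm_sketch <- function(sim) {
  coefs <- lapply(sim$df_list, function(df) coef(lm(sim$formula, data = df)))
  do.call(rbind, coefs)   # one row per data set, one column per coefficient
}

# Small stand-in for the output of data_sim():
# 50 data sets from Y = 20 + 30X + error
make_df <- function(seed, nobs = 200) {
  set.seed(seed)
  x <- rnorm(nobs)
  data.frame(x = x, y = 20 + 30 * x + rnorm(nobs, 0, 3))
}
sim <- list(df_list = lapply(1:50, make_df), formula = y ~ x)

est <- fit_lm_sketch(sim)
dim(est)        # 50 rows, 2 columns
colMeans(est)   # should be close to the true values 20 and 30
```

Returning a matrix rather than a list of vectors is a design choice that lets colMeans() summarize a scenario in one call.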
Results
Once your jobs are completed, you can check the results. First you
will need to extract the results from the registry and store them in an R
object.
parallel_results <- lapply(seq_along(standard_data), loadResult) # Loads the result of each job and stores it as an element of a list
Now use the colMeans() function to check whether the simulation
study worked: in each scenario, the averages of the estimates should be
close to the true coefficients.
lapply(parallel_results, colMeans)
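To make the comparison with the true coefficients explicit, the averages can be subtracted from the true values to give an estimate of the bias. The sketch below assumes each result is a matrix with one column per coefficient. It uses simulated stand-in matrices (fake_results, not real job output) so that it runs without a registry.

```r
# Stand-in for parallel_results: matrices of simulated "estimates" centered
# on the true values, generated only so this block runs by itself.
set.seed(1)
true_betas <- list(c(20, 30), c(5, -8, 2))
fake_results <- lapply(true_betas, function(b) {
  matrix(rnorm(500 * length(b), mean = rep(b, each = 500), sd = 0.2),
         nrow = 500)   # column j holds draws around b[j]
})

# Estimated bias per coefficient: average estimate minus true value
bias <- Map(function(est, beta) colMeans(est) - beta, fake_results, true_betas)
bias   # every entry should be near zero for an unbiased estimator
```

With the real objects, and assuming the results are stored in the same order as the scenarios, the same idea would read Map(function(est, s) colMeans(est) - s$beta, parallel_results, sim_list).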
