12/18/2020

Outline

  • Introduction

  • Cluster

  • Parallel Processing in R

  • Simulation Study

Introduction

Terminology

  • job: A task a computer must complete

  • bash script: tells the computer what to do

My Cluster Usage

  • Computationally Intensive

  • Use R and C++ (via Rcpp)

HPCC Cluster

How do I view the cluster?

  • Virtual Desktop Computer
    • Set number of cores (max 32 cpus)
    • Set RAM (max 128 GB)
    • Set storage (20 GB)

How to access the cluster?

RStudio Server

  • Edit any text document

  • Submit jobs

  • Upload/Download Data or Documents

  • Do not do computationally-intensive tasks in RStudio Server

    • Light Tasks
    • Submit Jobs

Linux Commands

  • sbatch: submit a job to the cluster

  • scancel: cancel jobs

  • squeue: view jobs status

  • slurm_limits: view what is available to you

How I use the cluster?

  • Code on my computer at a smaller scale

    • Parallelize using my computer’s cores
  • Scale it up for the cluster

  • Have an R Script do everything for me

  • Save all results in an RData file

  • Use the try() function to catch errors

  • Use RProjects

Submitting Jobs

  • Use a bash script to specify the parameters for the cluster

  • Add the following line at the end of a bash script:

Rscript NAME_OF_FILE.R
  • Submit the job with the line below:
sbatch NAME_OF_BASH.sh

Anatomy of Bash Script

#!/bin/bash -l 

#SBATCH --nodes=1 # Setting the number of nodes, usually 1 node is used
#SBATCH --ntasks=1 # Setting the number of tasks, usually 1 task is used
#SBATCH --cpus-per-task=8 # Setting the number of cpus per task
#SBATCH --mem-per-cpu=1G # How much ram per cpu, usually 1 GB
#SBATCH --time=01:00:00    # Time to run task, changes based on predicted time of task
#SBATCH --output=my.stdout # Where to store the output, usually a standard output
#SBATCH --mail-user=NETID@ucr.edu # Where to email information about job
#SBATCH --mail-type=ALL # Not Sure
#SBATCH --job-name="Cluster Job 1" # Name of the job, can be anything
#SBATCH -p statsdept # statsdept is the only partition dept of stats students can use

Rscript Parallel_Job.R # The command that tells Linux to process your R Script

Anatomy of R Script

# Obtain System Date and Time
date_time <- format(Sys.time(),"%Y-%m-%d-%H-%M")

# Set Working Directory
setwd("~/rwork")

# Load libraries and functions
library(parallel)
source("Fxs.R")

# Pre - Parallel Analysis
# Parallel Analysis
results <- mclapply(data, FUN, mc.cores = number_of_cores)
# Post - Parallel Analysis

# Save Results
file_name <- paste("Results_", date_time, ".RData",sep = "")
save(results, file = file_name, version = 2)

Parallel Processing in R

How to parallelize your R Code

  • mclapply()

    • Recommended for the cluster

    • Has built-in try() function

    • Replace any lapply() with mclapply() and add mc.cores= argument.

  • parLapply()

    • Use if multiple nodes are involved

    • Use if on Windows PC

Where to parallelize?

  • Identify loops or *apply functions

    • Iterations must be independent of each other
  • Identify bottlenecks

    • Use benchmark R packages

How to speed up you R Code?

  • Vectorize your R code

  • Minimize loops

    • Use *apply functions
  • Use optimized functions

    • colMeans() and rowMeans()
  • Implement c++ via Rcpp

  • More Information: Advanced R (adv-r.hadley.nz)

Simulation Study

Simulation Study

  • Show that Ordinary Least Squares provides consistent estimates

  • Model: \(Y = \boldsymbol X^\mathrm T \boldsymbol \beta + \epsilon\)

    • \(\boldsymbol \beta = (\beta_0, \beta_1, \beta_2, \beta_3)^\mathrm T = (5, 4, -5, -3)^\mathrm T\)

    • \(\boldsymbol X = (1, X_1, X_2, X_3)^\mathrm T\)

    • \(\epsilon \sim N(0,3)\)

Simulation Parameters

  • Number of Data sets: 10000

  • Number of Observations: 200

  • \(\boldsymbol X \sim N\left(\left(-2,0,2\right)^\mathrm T, \boldsymbol I_3 \right)\)

  • \(\boldsymbol I_3\): \(3\times 3\) Identity Matrix

Parallelization of Simulation

Executing Simulation Study

  1. RStudio Server

    1. https://rstudio.hpcc.ucr.edu/
    2. Login with credentials
  2. Download Documents in R Console

    1. Type: download.file("https://ucrgradstat.github.io/presentations/Parallel_Job.R", "Parallel_Job.R")

    2. Type: download.file("https://ucrgradstat.github.io/presentations/Cluster_Script.sh", "Cluster_Script.sh")

  3. Bash Script

    1. Line 9: Change to your email
    2. Line 12: Change to intel or short if needed
  4. Submit Job

    1. Console Pane
    2. Terminal Tab
    3. Type: sbatch Cluster_Script.sh

Thank You!

Resources

More information of parallel processing:

Link