Parallel processing with R on Windows

Parallel programming can save you a lot of time when you are processing large amounts of data. Modern computers provide multiple processors and cores and hyper-threading ability; therefore, R has become compatible with it and enables multiple simultaneous computations on all resources. There are some discussions regarding when to parallelize, because there is no linear relationship between the number of processors and cores used simultaneously and the computational timing efficiency. In this blog post, I am going to utilize two packages in R, which allows parallelization, for a basic example of when each instance of computation is standalone and when there is no need for communication between cores that are being used in parallel.

Install.packages(“parallel”)
Install.packages(“doParallel”)
library(doParallel)
library(parallel)

If you enter Ctrl+Shift+Esc on your keyboard and click on the Performance tab in the Task Manager window, you will see how many actual logical processes, which are the combination of processors and cores, are available on your local Windows machine and can be used simultaneously for your analysis. We can also detect this number with the following command:

no_cores <- detectCores(logical = TRUE)  # returns the number of available hardware threads, and if it is FALSE, returns the number of physical cores

Now, we need to allocate this number of available cores to the R and provide a number of clusters and then register those clusters. If you specify all the cores to the R, you may have trouble doing anything else on your machine, so it is better not to use all the resources in R.

cl <- makeCluster(no_cores-1)  
registerDoParallel(cl)  

First, we are going to create a list of files that we want to analyze. You can download the example dataset here.

all_samples<- as.data.frame(list.files("your directory/R_Parallel_example/"))
seq_id_all<- seq_along(1:nrow(all_samples))

Then, we will create a function that we are will use for processing our data. Each file in this set has a daily value for several variables and 31 years. Columns 1, 2, and 3 are year, month, and day, respectively. I am going to extract the yield value for each year from column “OUT_CROP_BIOMYELD” and calculate the average yield for the entire period. All the libraries that you are going to use for your data process should be called inside the function. I am going to use “data.table” library to efficiently read my data into R. At the end of the function, I put the two outputs that I am interested in (“annual_yield” and “average_yield”) into one list to return from the function.

myfunction<- function(...) {
  i<-(...)
  library(data.table)
  setwd(paste("your directory/R_Parallel_example/"))
  sample<- fread(paste(all_samples[i,]))
      annual_yield<- subset(sample,sample$OUT_CROP_BIOMYELD>0)  # this column (OUT_CROP_BIOMYELD) is always zero except when the yield is reported which should be a value above zero.
      annual_yield$No<-  as.numeric(gsub("_47.65625_-117.96875","",all_samples[i,]))  # extract some part of the file name, use it as an identification for this dataset
      annual_yield<- annual_yield[,c(1,13,17)]  # extract just “Year”,”Yield” and “No” columns.
      colnames(annual_yield)<- c("Year","Yeild","No")
  average_yield<- colMeans(annual_yield[,c("Yeild","No")])  #calculate average year for each dataset
  return( list(annual_yield,average_yield))
}

Now, we need to export our function on the cluster. Because in the function we used “all_samples” data-frame, which was created outside the function, this should also be exported to the cluster:

clusterExport(cl,list('myfunction','all_samples'))

With the command line below, we are running the function across the number of cores that we specified earlier, and with “system.time,” the process time will be printed at the end:

system.time(
  results<- c(parLapply(cl,seq_id_all,fun=myfunction))
)

The function outputs are saved in the list “results” that we can extract:

k_1<- list()
k_2<- list()
for (k in 1: nrow(all_samples)){
  k_1[[k]]<- results[[k]][[1]]
  k_2[[k]]<- results[[k]][[2]]
}
annual_yield<- data.table::rbindlist(k_1)
period_yield<- data.table::rbindlist(k_2)

One thought on “Parallel processing with R on Windows

  1. Pingback: 12 Years of WaterProgramming: A Retrospective on >500 Blog Posts – Water Programming: A Collaborative Research Blog

Leave a comment