## Contents

- 1. Introduction
- 2. The parallel R taxonomy
- 3. lapply-based parallelism
- 4. foreach-based parallelism
- 5. Caveats with lapply- and foreach-based parallelism
- 6. Alternative forms of parallelism
- 7. Map-Reduce-based parallelism with Hadoop

## 1. Introduction

This tutorial goes through various parallel libraries available to R programmers by applying them all to solve a very simple parallel problem: k-means clustering. Although trivially parallel, k-means clustering is conceptually simple enough for people of all backgrounds to understand, yet it can illustrate most of the core concepts common to all parallel R scripts.

Algorithmically, k-means clustering involves arriving at some solution (a
*local minimum*) by iteratively approaching it from a randomly selected
starting position. The more random starts we attempt, the more local minima
we get. For example, the following diagram shows some random data (top left)
and the result of applying k-means clustering from three different random
starting guesses:

We can then calculate some value (I think of it as an *energy function*) that
represents the error in each of these local minima. Finding the smallest
error (the lowest "energy") from all of the starting positions (and their
resulting local minima) gives you the "best" overall solution (the *global
minimum*). However, finding this global minimum is what we call an *NP-hard
problem*, meaning no known algorithm is guaranteed to find the absolute best
answer in a practical amount of time. Thus, we rely on increasing the number
of random starts to get as close as we can to this one true global minimum.
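The idea of comparing the "energy" of several local minima can be sketched in a few lines of R. This is only a minimal illustration, not code from the repository: it assumes a small synthetic data set (the `data` matrix below is made up), uses one random start per `kmeans()` call, and treats `tot.withinss` as the error value being compared.

```r
# Synthetic 2-D data with four well-separated groups (an assumption made
# for illustration; it stands in for a real data set).
set.seed(42)
data <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 6, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 9, sd = 0.5), ncol = 2)
)

# Run k-means from several independent random starts, one start per call.
n_starts <- 10
fits <- lapply(seq_len(n_starts), function(i) {
  kmeans(data, centers = 4, nstart = 1)
})

# The "energy" of each local minimum: total within-cluster sum of squares.
energies <- sapply(fits, function(fit) fit$tot.withinss)

# Keep the local minimum with the lowest energy.
best <- fits[[which.min(energies)]]
print(energies)
print(best$tot.withinss)
```

Depending on the random starts, some entries of `energies` may be noticeably larger than others; those are the poorer local minima that additional starts help us avoid.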

The simplest example of a k-means calculation in R looks like

    data <- read.csv('dataset.csv')
    result <- kmeans(data, centers=4, nstart=100)
    print(result)

This code tries to find four cluster centers using 100 starting positions,
and the value of `result` is the k-means object with the smallest
`result$tot.withinss` value across all 100 starts. We'll now
look at a few different ways we can parallelize this calculation. All of
the example code presented here can be found in my Parallel R GitHub
repository.
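Because every random start is independent of the others, the 100 starts do not have to run inside a single `kmeans()` call. The following serial sketch shows that decomposition under stated assumptions: the synthetic `data` matrix and the choice of four chunks of 25 starts are illustrative, not taken from the repository.

```r
# Synthetic stand-in for dataset.csv (an assumption made for illustration).
set.seed(1)
data <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 6, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 9, sd = 0.5), ncol = 2)
)

# Four chunks of 25 starts each add up to the same 100 starts in total.
chunk_starts <- rep(25, 4)

# Each chunk is an independent kmeans() call that returns its own best fit.
chunk_fits <- lapply(chunk_starts, function(n) {
  kmeans(data, centers = 4, nstart = n)
})

# Compare the chunks and keep the fit with the smallest tot.withinss.
chunk_energies <- sapply(chunk_fits, function(fit) fit$tot.withinss)
result <- chunk_fits[[which.min(chunk_energies)]]
print(result$tot.withinss)
```

Nothing here runs in parallel yet; the point is only that the work splits cleanly into independent pieces whose results are combined by taking a minimum.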

This guide is adapted from a talk I give, and it assumes that you already know how to run R jobs on parallel computing systems. I wrote a guide, Running R on HPC Clusters, that goes through the basics of running these example codes.

## 2. The parallel R taxonomy

There are a number of different ways to use parallelism to speed up a given R script. I like to think of them as falling into one of a few broad categories of parallel R techniques:

- lapply-based parallelism
- foreach-based parallelism
- Poor-man's parallelism and hands-off parallelism
- Map-Reduce-based parallelism

An increasing number of additional libraries on CRAN provide ways to add parallelism that I have not included in this taxonomy, but they generally fall into (or close to) one of the above categories.

To illustrate how these forms of parallelism can be used in practice, the remainder of this guide will demonstrate how a solution to the aforementioned k-means clustering problem can be found using these parallel methods.

To begin, the most straightforward form of parallelism for R programmers is lapply-based parallelism, which is covered in the next section.