A Crash Course in R Part 1

What is R?  Also known as the language for statistical computing, it was developed in the 1990s, and provides the ability to use a wide variety of statistical techniques and visualization capabilities across a set of data. 

Pros for the language include the fact that it’s open source, has great graphical capabilities, runs via a CLI (so you can script and automate), has a huge community behind it, and is gaining wider adoption in business!

R can be used by a number of tools; the most common are R Tools for Visual Studio, RStudio standalone, or more recently R Scripts in Power BI.

 

Basics

Let’s start with the absolute basics. If you type 1+2 at the command line, the console will return 3. Type a quoted string such as "Hello" at the command line and the console will return the same text.

Everyone loves variables. To store values in variables, we can enter apples <- 4 and pears <- 2 at the command line to store 4 in the apples variable and 2 in the pears variable. There’s no printout here because an assignment doesn’t print anything. We can then do total_fruit <- apples + pears to create a new variable from existing variables.

As you create variables, you build up a workspace, which you can inspect using ls(). This lists all the variables created within the R session. You can then use rm(variable) to clean up the workspace and free up resources.
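
To see variables and the workspace in action, here’s a minimal sketch using the fruit example from above. Removing pears at the end is just an illustrative choice to show rm() at work.

```r
# Store values in variables; an assignment prints nothing
apples <- 4
pears <- 2

# Build a new variable from existing ones
total_fruit <- apples + pears
print(total_fruit)   # [1] 6

# List everything in the workspace (the output depends on your session)
ls()

# Remove a variable that is no longer needed, then check it has gone
rm(pears)
exists("pears")      # [1] FALSE
```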

Now, not everyone loves commenting, but you can comment your code using #. Here’s a simple example script to calculate the volume of a donut (a torus), where r is the radius of the tube and R is the distance from the centre of the hole to the centre of the tube:

# Create the variables r and R 
r <- 2 
R <- 6 

# Calculate the volume of a donut (torus)
vol_donut <- 2*pi^2*r^2*R 

# Remove all intermediary variables that you've used with rm() 
rm(r,R) 

# List the elements in your workspace 
ls() 

[1] "vol_donut"

Data Types

As with any language, there are a number of data types supported:

  • TRUE/FALSE are “logical”
  • “This is text” is “character”
  • 2 or 2.5 are “numerics”
  • You can add L to a whole number, such as 2L, to make it an integer (the output is the same, but the class is different). Integers are also numerics, which gives us what’s known as a type hierarchy.
  • Other types include double, complex, and raw.

We can use the function class() to determine the data type of a variable. We can also use the as.*() family of functions to coerce (transform) between data types: as.character(4) returns "4" and as.numeric("4.5") returns 4.5. To test for a type, use the is.*() functions: is.numeric(2) returns TRUE, while is.numeric("2") returns FALSE because "2" is a character string. NA is returned (with a warning) when trying to convert "hello" to numeric.
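
The checks and conversions above can be tried out with a short sketch like this one. The variable name not_a_number is made up for illustration, and suppressWarnings() is only there to keep the console quiet.

```r
# Inspect the class of a few values
class(TRUE)             # "logical"
class("This is text")   # "character"
class(2.5)              # "numeric"
class(2L)               # "integer"

# Coerce between types with the as.*() functions
as.character(4)         # "4"
as.numeric("4.5")       # 4.5

# Test for a type with the is.*() functions
is.numeric(2)           # TRUE
is.numeric("2")         # FALSE, because "2" is a character string

# Text that is not a number becomes NA (R also raises a warning)
not_a_number <- suppressWarnings(as.numeric("hello"))
is.na(not_a_number)     # TRUE
```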

Interested in testing your knowledge? Check out the DataCamp exercises here and here.

 

Vectors

A vector is a sequence of data elements of the same data type, which can be created using the c() function. Vectors can be of all the types seen previously. The three properties of a vector are type, length, and attributes.

For example, a vector of UK political parties (who doesn’t love politics?) assigned to the variable parties can be built with parties <- c("Labour","Conservative","Libdems","SNP"). As before, a check can be done to see whether the variable is a vector using is.vector(parties).

But what if the data has slightly more meaning behind it, for instance seat_count <- c(262,318,12,35)? You can attach labels to the vector elements by using the names() function: names(seat_count) <- parties. You could also do this in one line: c(Labour = 262, Conservative = 318, Libdems = 12, SNP = 35).

The variables created previously are actually stored as vectors of length 1. R does not provide a separate data structure to hold a single number or character string. If you build a vector from different data types, R performs coercion to make sure they are all of the same data type. For instance c(1,5,"A","B","C","D") becomes "1","5","A","B","C","D". If you need to keep different data types together, you can use a list instead!
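
Putting the vector basics together, a rough sketch might look like this. The names mixed and single are made up for illustration.

```r
# Build a character vector and a numeric vector
parties <- c("Labour", "Conservative", "Libdems", "SNP")
seat_count <- c(262, 318, 12, 35)

# Attach the party names as labels
names(seat_count) <- parties
seat_count

# Both are vectors
is.vector(parties)      # TRUE
is.vector(seat_count)   # TRUE

# A single value is just a vector of length 1
single <- 4
length(single)          # 1

# Mixing types triggers coercion, here to character
mixed <- c(1, 5, "A", "B")
class(mixed)            # "character"
mixed                   # "1" "5" "A" "B"
```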

 

Vector Arithmetic

Computations can be performed between vectors and are done in an element-wise fashion. This means you can do earnings <- c(10,20,30) * 3 to generate 30 60 90.

You can also do vector minus vector, so earnings - expenses (again done element-wise). Multiplication and division work the same way: earnings * expenses multiplies element by element, it is not matrix multiplication!

Other functions include sum(bank) to sum all elements of the vector, sum(bank > 0) to return a count of the elements greater than 0 (given bank contains numerics), or sum(bank == x) to return the count of elements equal to x.
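
Here’s a small sketch of element-wise arithmetic and the sum() tricks. The expenses values are made up for illustration, and bank is assumed here to simply hold the daily profit.

```r
earnings <- c(10, 20, 30)
expenses <- c(5, 25, 10)       # made-up values

# Scalar and vector arithmetic are both element-wise
earnings * 3                   # 30 60 90
earnings * expenses            # 50 500 300
bank <- earnings - expenses    # 5 -5 20

# Sum all elements
sum(bank)                      # 20

# Count the elements matching a condition (TRUE counts as 1)
sum(bank > 0)                  # 2
sum(bank == 20)                # 1
```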

 

Vector Subsetting

As the name suggests, this is basically selecting parts of your vector to end up as a new vector (a subset of your original vector).

If you want to select the first element from our seat_count vector, we write seat_count[1] and this will return Labour 262. Both the value and name are kept from the original vector. We can also use the name, so seat_count["Labour"] will return the same result.

If you want to select multiple elements, you can use the syntax seat_count[c(1,4)] by passing in a vector. To select elements in a sequence you can use 2:4 instead of c(2,3,4). You can also subset via an inverse, by using the syntax seat_count[-1], which returns all the seats apart from the first element.

One last method of subsetting is by logical vector, so by specifying seat_count[c(TRUE,FALSE,FALSE,FALSE)] we can return the equivalent of seat_count[1]. R is also able to “recycle” this type of vector, so if your logical vector is length 2, it will be repeated to fit the vector you are subsetting.

Remember we can also use the comparisons from the previous section to select vector contents for subsetting, for example main_parties <- seat_count[seat_count > 50].
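
All of the subsetting styles above can be tried on the seat_count vector. This is just a sketch, with the expected results noted in the comments.

```r
seat_count <- c(Labour = 262, Conservative = 318, Libdems = 12, SNP = 35)

# By position and by name
seat_count[1]               # Labour 262
seat_count["Labour"]        # Labour 262

# Several elements, a sequence, and everything except the first
seat_count[c(1, 4)]         # Labour and SNP
seat_count[2:4]             # Conservative, Libdems and SNP
seat_count[-1]              # Conservative, Libdems and SNP

# By logical vector; the length-2 vector is recycled to TRUE FALSE TRUE FALSE
seat_count[c(TRUE, FALSE, FALSE, FALSE)]   # Labour
seat_count[c(TRUE, FALSE)]                 # Labour and Libdems

# By condition
main_parties <- seat_count[seat_count > 50]
main_parties                # Labour and Conservative
```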

Interested in testing your knowledge? Check out the DataCamp exercises here, here, and here.

 

Matrices

While a vector represents a sequence of data elements (1 dimensional), a matrix is a similar collection of data elements but arranged as rows/columns (2 dimensional) – a very natural extension to a vector.

To build a matrix you’ll need to use the following format: matrix(1:6, nrow = 2), which creates a 2-by-3 matrix from the values 1 to 6. You can also specify columns rather than rows by using ncol = 3. R infers the other dimension by using the length of the input vector. To fill the matrix by row instead of by column, you can use the argument byrow = TRUE.

Another way to create a matrix is by using the functions rbind() and cbind(). These essentially take the vectors you pass to the function and stick them together, as rows with rbind() or as columns with cbind(). You can also use these functions to bind a new vector onto an existing matrix, for example my_matrix <- rbind(matrixA, vectorA).

To name the matrix, you can use rownames() and colnames(). For example rownames(my_matrix) <- c("row1","row2") and colnames(my_matrix) <- c("col1","col2").
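
As a quick sketch of the different ways to build and name a matrix (m_by_col and m_by_row are made-up names, and the extra row 7:9 is just for illustration):

```r
# Fill by column (the default)
m_by_col <- matrix(1:6, nrow = 2)
m_by_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Fill by row instead, giving the number of columns
m_by_row <- matrix(1:6, ncol = 3, byrow = TRUE)
m_by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# Bind a new vector onto an existing matrix as an extra row
my_matrix <- rbind(m_by_row, 7:9)
dim(my_matrix)   # 3 3

# Name the rows and columns
rownames(my_matrix) <- c("row1", "row2", "row3")
colnames(my_matrix) <- c("col1", "col2", "col3")
my_matrix
```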

You can also create and name a matrix in one go, by using the dimnames argument of matrix() and passing a list containing the vector of row names followed by the vector of column names.

my_matrix <- matrix(1:6, byrow = TRUE, nrow = 2,
dimnames = list(c("row1", "row2"), c("col1","col2","col3")))

Similar to vectors, matrices are also able to recycle themselves, only store a single atomic data type and perform coercion automatically.

Continuing on from vectors, matrices can also be subsetted. If you’re after a single element, you’ll need to specify both the row and the column of interest using the syntax m[1,3] for row 1 column 3. If you’re looking to subset an entire row or column, you can use the syntax m[3,] (notice the missing column value) to select the entirety of row 3. Columns can be selected the other way round via m[,3]. You can also select multiple elements using a similar method to vectors. This can be achieved by using the syntax m[2, c(2,3)] to select the 2nd and 3rd column values of row 2. Subsetting by name works just the same as by index, and you can even use a combination of both! The same is true of subsetting by a logical vector: just use c(FALSE,TRUE,TRUE) for the last 2 rows of a 3 row matrix. You can see some examples below.
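
Here’s each of those subsetting forms on a small 3-by-3 matrix. This is a minimal sketch, and the matrix m and its row and column names are made up for illustration.

```r
m <- matrix(1:9, nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))
m
#    c1 c2 c3
# r1  1  2  3
# r2  4  5  6
# r3  7  8  9

m[1, 3]                      # 3: row 1, column 3
m[3, ]                       # 7 8 9: all of row 3
m[, 3]                       # 3 6 9: all of column 3
m[2, c(2, 3)]                # 5 6: columns 2 and 3 of row 2
m["r2", 3]                   # 6: name and index combined
m[c(FALSE, TRUE, TRUE), ]    # the last 2 rows
```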

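# Create two vectors with named elements (values made up for illustration)
vectorA <- c(A = 1, B = 2, C = 3, D = 4)
vectorB <- c(A = 5, B = 6, C = 7, D = 8)

# Create a matrix using 2 vectors; rbind() stacks them as rows,
# giving a matrix with columns named A to D
my_mega_matrix <- rbind(vectorA, vectorB)

# Subset the matrix to get columns 2 and 3
my_subsetted_mega_matrix <- my_mega_matrix[, c(FALSE, TRUE, TRUE, FALSE)]

# Subset the matrix using names for columns 1 to 4
my_alt_subsetted_mega_matrix <- my_mega_matrix[, c("A", "B", "C", "D")]

# Calculate totals for the columns 2 and 3
total_mega_matrix <- colSums(my_subsetted_mega_matrix)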

As seen above, there are another 2 functions we can use on matrices, namely colSums() and rowSums(), to do column and row arithmetic. This is in addition to other standard arithmetic. All computations are performed element-wise, so we can do total_mega_matrix * 1.3 to convert the totals from GBP to USD (as an example). Performing calculations using 2 matrices is just the same (matrixA - matrixB). Be careful here though: if the matrices have the same dimensions, everything is done element-wise, but matrices with different dimensions produce a “non-conformable arrays” error. Recycling only occurs when you combine a matrix with a shorter vector.
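
To make the arithmetic rules concrete, here’s a rough sketch. The matrices matrixA and matrixB hold made-up values, and tryCatch() is only used so the script carries on after the error.

```r
matrixA <- matrix(1:6, nrow = 2)
matrixB <- matrix(6:1, nrow = 2)

# Column and row totals
colSums(matrixA)        # 3 7 11
rowSums(matrixA)        # 9 12

# Scalar arithmetic is applied to every element
matrixA * 1.3

# Matrices with the same dimensions are combined element-wise
matrixA - matrixB

# A shorter vector is recycled down the columns
matrixA + c(10, 20)
#      [,1] [,2] [,3]
# [1,]   11   13   15
# [2,]   22   24   26

# Matrices with different dimensions raise an error instead of recycling
result <- tryCatch(matrixA - matrix(1:4, nrow = 2),
                   error = function(e) conditionMessage(e))
result                  # the error message
```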

Notice the similarity between vectors and matrices – they’re both data structures that store elements of the same type. Both perform coercion, and recycling. Arithmetic is also similar as everything is performed element wise.

Interested in testing your knowledge? Check out the DataCamp exercises here, here, and here.

 

Factors

 

Unlike numeric variables, categorical variables can only take a limited number of different values. The specific data structure for this is known as a factor. A good example of this is blood type, which can only be A, B, AB, or O. We can define a vector of people’s blood types with blood <- c("B","AB","O","O","A","B","B"). To convert this vector to a factor we can use factor(blood). R scans the vector to check for categories, and stores the distinct list as levels (sorted alphabetically). Values in the vector are then replaced with integer values corresponding to the associated level. You can think of factors as integer vectors, where each integer refers to a category or level. To inspect the structure, we can use str(), for example str(factor(blood)).

Similar to the names() function, you can also use the levels() function and assign a vector to name the levels differently from the categories picked up in the scan, for instance levels(my_factor) <- c("BT_A","BT_AB","BT_B","BT_O"). Note the order: the new names must follow the alphabetically sorted levels (A, AB, B, O). It’s much safer to pass both the levels and the labels to factor(), for instance factor(blood, levels = c("O","A","B","AB"), labels = c("BT_O","BT_A","BT_B","BT_AB")), so that each label is explicitly matched to its level.
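
Here’s the blood type example as a short sketch. The names blood_factor and labelled are made up for illustration.

```r
blood <- c("B", "AB", "O", "O", "A", "B", "B")

# Convert to a factor; levels are sorted alphabetically
blood_factor <- factor(blood)
levels(blood_factor)        # "A" "AB" "B" "O"

# Under the hood each value is an integer pointing at a level
as.integer(blood_factor)    # 3 2 4 4 1 3 3
str(blood_factor)

# Safer renaming: give the levels and their labels together
labelled <- factor(blood,
                   levels = c("O", "A", "B", "AB"),
                   labels = c("BT_O", "BT_A", "BT_B", "BT_AB"))
as.character(labelled)      # "BT_B" "BT_AB" "BT_O" "BT_O" "BT_A" "BT_B" "BT_B"
```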

In statistics, there is also a difference between nominal categorical variables and ordinal categorical variables. Nominal variables have no implied order, i.e. blood type A is not greater or less than B. There are examples however where ordering does exist, for example with t-shirt sizes, and you can use R to impose this on the factor. To do this, you can set the ordered argument of factor() to TRUE, and then specify the levels in the correct order. You can then compare elements of the factor, for example temperature_factor[1] > temperature_factor[2]. An example can be seen below.

# Definition of temperature_vector
temperature_vector <- c("High", "Low", "High", "Low", "Medium")

# Encoded temperature_vector as a factor
temperature_factor <- factor(temperature_vector, 
                             ordered = TRUE,
                             levels = c("Low","Medium","High")
                             )

# Print out
temperature_factor
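
Once a factor is ordered, its elements can be compared. The sketch below rebuilds the same temperature factor so that it runs on its own.

```r
temperature_factor <- factor(c("High", "Low", "High", "Low", "Medium"),
                             ordered = TRUE,
                             levels = c("Low", "Medium", "High"))

# Is the first reading (High) greater than the second (Low)?
temperature_factor[1] > temperature_factor[2]    # TRUE

# Is the fifth reading (Medium) less than the first (High)?
temperature_factor[5] < temperature_factor[1]    # TRUE

# Functions such as max() also respect the ordering
max(temperature_factor)                          # High
```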

Interested in testing your knowledge? Check out the DataCamp exercises here.

This is only the start of understanding R. In the next blog I’ll look at lists, data frames and, most importantly, graphics! We can then start looking at some more complicated examples and use cases.