Introduction to Working with Data: R Version

Author

Nick Huntington-Klein

This page will serve as an entry-level introduction to how to code with data. If you are starting out for the first time using code and programming to work with data (as opposed to point-and-click statistics software like SPSS, or spreadsheet software like Excel) then this will get you up to the point where you can load data and make some basic changes.

This is companion material to my textbook on causal inference, The Effect. I will focus on the methods and tools used in the book. There’s plenty this page doesn’t cover. The focus is just the basics! Please check out the book website or my main website for more resources.

This page uses the R programming language. The page is also available for Stata and Python.

Getting Started

What We’re Learning (and not learning)

This page will teach you how to:

  • Get started with running code

  • Install and load libraries

  • Load in data from files, the internet, or libraries

  • Write good code and responsibly use AI in your coding

  • Look at your data and get basic statistical results from your data

  • Manipulate your data and get it ready for the kinds of analyses you do in The Effect

By the time we’re done, you should be where you need to be to take a data set that is already reasonably usable and prepare it for analysis for a research project. We won’t be getting super deep into the weeds - I will not be covering, for example, how to clean extremely messy or unstructured data. I am also not trying to make you an expert coder. Just the stuff to get you to where you can use the book and start on your own research project.

Further, I will be using the same tools and packages that are used in The Effect, and will be focusing on the kind of tools and procedures used in the book. That means I will be focusing on data sets that are small enough to be opened up on your computer (i.e. “in memory”) rather than interacting with huge databases.

The Very Start

You will need to start by getting R itself running.

There are many ways to run R, but I will assume that you are using RStudio, which is a very popular program for running R written by the company Posit. It has a lot of tools and add-ons that make working in R easier.

Installing RStudio takes two steps. First, you’ll need to install the R language itself.

  1. Go to the R-project mirrors page
  2. Click on one of the mirrors (it doesn’t matter which, although perhaps pick one close to you)
  3. Click “Download R for (your operating system)”
  4. Select the “base” installation, then install it as you would any other program

Once that’s done, you can install RStudio:

  1. Go to the RStudio download page on Posit’s website
  2. Download the program and install it
  3. Open it up!

Alternately, you can skip all of that and make an account on posit.cloud, which is a free Posit service that lets you run RStudio in your browser. It’s slower than running RStudio on your own computer, but you can use it without installing anything. Once you’re in your account, click on “New Project” and then “New RStudio Project” to open up RStudio.

You can check whether everything is working properly by going to the “Console” in the bottom-left part of the RStudio screen, putting in 2+2, and hitting enter. You should get back a 4.

Running a Script, and Very Basic Code

In RStudio, you can run code by typing it into the Console (bottom-left) and hitting Enter, like you just did with 2+2.

Most of the time, however, you’ll want to keep track of your code so you can rerun it later. So you’ll be writing scripts. The top-left pane is where you’ll have your scripts. You can make a new one using the File menu (way up top-left) and picking New File, then R Script.1

Let’s write one. Copy the following into the new script you just made:

2+2
a <- 1
a
c(4, 3, 1)
median(c(4, 3, 1))
help(median)

Click on the line of text with the 2+2.2 Then, either hit the “Run” button, or do Ctrl-Enter (on Windows) or Cmd-Enter (on Mac). This will run the line of code down in the Console. You should see an answer of 4 pop up in your Console.

This will also advance the cursor to the next line, a <- 1. This line of code takes the number 1 and stores it inside (<-) of a variable called a. Use Ctrl/Cmd-Enter on this line.

You’ve now stored a variable. Look in the top-right pane of RStudio to the Environment tab. It will show you that you’ve created a new variable called a, and it has a value of 1.

Now use Ctrl/Cmd-Enter to run the a line. This will show you the contents of a, which we know to be 1. You’ll see a 1 pop up in your Console.

Next, run the c(4,3,1) line. This (c)oncatenates several numbers together to make a vector of values. In other words, data!

Let’s use a function to do a calculation on our data. A function takes input (our data and settings), does a calculation, and gives us back an output. In R, functions use parentheses (). Run the median(c(4,3,1)) line. This takes our data c(4,3,1) and passes it to a function, median(), which calculates the median of our data. You should get back a value of 3 in the console, since that’s the median of 1, 3, and 4.

How did we know how to take a median? You can read a function’s help file using the help() function (or use the Help pane in the bottom-right of RStudio; it has a search bar). Run the help(median) line. You’ll see the list of arguments (options) you can set. From this page we learn that the median() function takes an argument x for the data (for us this was c(4, 3, 1)) and a na.rm option to tell it whether to drop missing values or not before taking the median. We can set options by naming them, for example median(c(4, 3, 1), na.rm = TRUE) to tell it to drop missing values.3
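You can check how na.rm works yourself with a quick test - for example, if the data did contain a missing value:

median(c(4, 3, 1, NA))               # returns NA, because of the missing value
median(c(4, 3, 1, NA), na.rm = TRUE) # drops the NA first, so we get 3 again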

At the bottom of the help file if you scroll down, you’ll see some example code for using the function. This is often handy since you can copy/paste the working code and then edit it for your case.

Now, use File -> Save (or the Save icon) to save your script somewhere on your computer.

Downloading and Using Packages

R has a lot of user-written libraries (packages) containing useful functions for making graphs, doing different kinds of estimates, and so on. These are maintained on a central repository called CRAN. If you’ve never used another coding language before, breathe a sigh of relief that you get to use CRAN because it is very user-friendly compared to a lot of other package management systems.

You can install packages from CRAN using the install.packages() function, or the Packages tab in the bottom-right pane of RStudio.

In The Effect (and on this page), we make heavy use of the tidyverse package, so let’s install that now. Run the code:

install.packages('tidyverse')

(this may take a while as the tidyverse is really a collection of a lot of different packages).

Once you’ve installed the package, you don’t need to install it again; it’s already on your computer. But you do need to load it every time you open up R in order to use it. Run:

library(tidyverse)

You can often get a walkthrough of how to use a package in its vignettes. If you like, run browseVignettes('tidyverse') to see what introductory vignettes they have. Often there is example code, too. Try taking a look at the “Welcome to the Tidyverse” vignette and then come on back here. Any other package you come across you may want to take a look at its vignettes as well for tips on usage.

Loading Data

Getting Data from Packages

Lots of R packages (as well as base R itself) come with data sets that you can load in directly. You’re unlikely to do your own research on these data sets as they’re typically just for demonstrations and to play around with. But The Effect makes use of the data sets in the causaldata package I wrote, so you’ll certainly get used to loading data from it! The example code on this page, as well as the example code in the book, will use the texas data from the causaldata package (the texas data is included in the package because it appears in Scott Cunningham’s Causal Inference: The Mixtape).

So, first, install causaldata by running install.packages('causaldata') if you haven’t already. Then, load the package and the texas data with the following code:

library(causaldata)
data(texas)

Easy! Click on texas in the Environment tab (top-right; it might be labeled as a “promise” rather than a data set at this point) to get it properly loaded, then click on it again to take a look at the data. Additionally, you can use help() to find documentation for data sets included in packages. Run help(texas) to see what all the variables in the data set mean. help() will not work for data sets you got from somewhere other than a package.

In addition to data sets stored inside packages, there are a number of packages that exist to help you download data from elsewhere, for example the ipumsr package for downloading data from IPUMS. However, these packages each work in different ways and I won’t be covering them here.

Loading Data from a File

Often your data will be stored in a file that you will need to load in. Examples of these kinds of files include:

  • Excel spreadsheets (.xlsx files)

  • Comma-separated or tab-separated values files (.csv and .tsv), which are like Excel spreadsheets but simpler and more universal

  • Special compressed data files like Stata datasets (.dta) or Parquet files (.parquet), which cater to the needs of data analysts

There are a lot of functions designed to read these files in. You can tell the function where the file is, for example giving it a URL if the file is on the internet, and it will load it in for you. For example, after loading the tidyverse you can use read_csv() to load in a comma-separated values file.
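For example, a minimal sketch (the file name my_data.csv here is just a placeholder - swap in your own file or a URL):

library(tidyverse)
# 'my_data.csv' is a hypothetical file name used for illustration
my_data <- read_csv('my_data.csv')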

However, I strongly recommend using the rio package, which has an import() function that works for just about any data file type you might want. No need to memorize a bunch of different functions. Install the rio package with install.packages('rio') if you haven’t already, and then run:

library(rio)
texas <- import('https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/texas.csv')

The texas data is stored at that URL. After running this code you’ll see the texas data loaded in your Environment tab on the top-right. Notice that we couldn’t just run import() on its own - that would load in the data briefly and show it to us but then it would be gone. If we want the data to stick around so we can use it, we need to store it as a variable (<-), just like we used a <- 1 to make a variable called “a” containing a 1 earlier.

The data doesn’t have to be stored on the internet. You can also load a file from your computer. However, this will require that you know how to use your computer’s file system… see the next section.

Files on Your Computer

import() and similar functions don’t just work with data from the internet. They also work with files loaded on your computer. You can give the import() function a filepath that points to a place on your computer, instead of a URL that points to a place on the internet.

If you, like most people, didn’t start using a computer to do work until after cloud storage services like Google Drive, Dropbox, and OneDrive became widespread in the early 2010s, you might not be very familiar with the proper way to store and retrieve files from your computer. Those services encourage you to store files in a way that makes using your file system difficult. This is fine for 99% of people, but not great if you want to write code that accesses files.

As someone writing code, you really want to USE your file system. Modern computer use for casual users tries to push you into just putting all your files in a single “Downloads” folder and using a search bar to find things rather than keeping things organized. This will quickly turn into a major headache if you’re using your computer heavily, and especially if you’re writing code that accesses files. If you have a project, make a folder for the project and keep all the files for that project in it. Maybe even make different subfolders to store data, code, writing, and so on!

Files on your computer are stored in a series of nested folders. On my computer, for instance, I have a file called test_data.csv stored in my Documents folder. If I’m clicking around then I’d just go to “Documents” and search for it. But really, that Documents folder is inside of my “Nick” folder, which is inside of my “Users” folder, which is in my “C:” drive (standard on Windows).

Now that we have things in folders, how can we load data from files? If I just tried to run import('test_data.csv') the code would have no idea that the file is inside of my Documents folder and would give me an error. I need to tell the computer where to look to find the file. That’s the filepath.

There are two approaches to filepaths. One is to use an absolute filepath. An absolute filepath tells the computer exactly where to look, starting from the topmost folder and working its way down. In this case that would be:

test <- import('C:/Users/Nick/Documents/test_data.csv')

The / forward slashes indicate that we’re going into a subfolder. So this starts in C: and then goes to the Users folder, then the Nick folder inside of that, then Documents inside of that, and finally the test_data.csv file.4

This will work. However, it’s a bad idea. Why? For one, if you give someone else your code, it won’t run any more. Unlike URLs, which point to a place on the internet everyone can see, a filepath points to a location on your computer that other people can’t access. Sending someone a filepath to a file on your computer does not send them the file.

Instead, you should use relative filepaths. Instead of starting from the topmost folder, relative filepaths start from a working directory. You can set your working directory, and then use the filepath from there. For example, if my working directory were the Nick folder, then I could do import('Documents/test_data.csv') and it would know where to find the file. Or if my working directory were the Documents folder, I could simply do import('test_data.csv'). Then, I could send someone my code and the test_data.csv file, and if they set their own working directory properly, they can run my code just fine.

The easiest way to do this is to ensure that your code and data are stored in the same folder. Then, in RStudio, use the Session menu (at the top) and do Set Working Directory -> To Source File Location (or, if you’re working in Rmarkdown or Quarto, your working directory will always automatically be the folder the code is stored in). This will set your working directory to the folder the code is stored in, which is also the folder the data is stored in. Then you can just do import('test_data.csv'). Easy!

One step more difficult, but better practice, is to make a data subfolder for your data. Then you can set the working directory to the folder the code is in and do import('data/test_data.csv'). Even better is to have a main project folder with a data subfolder and also a code subfolder. Put the code in the code folder and the data in the data folder. Then use .. to “hop up” a folder. import('../data/test_data.csv') will start in your code folder, go up one level to the project folder with ../, then go into the data folder, and finally find your file. Send someone your whole project folder (or have it on a shared cloud-storage folder like a OneDrive your research team can all access) and everything will run smoothly! Even better practice is to use the Projects system in RStudio and have the project folder be your working directory… but we’re getting too advanced here.

Looking at Data

An important part of working with data is looking at it. That includes looking directly at the data itself, as well as looking at graphs and summary statistics of the data.

When you look at data, what are you looking for? There are a few things:

  • What is the distribution of the data?

    • What values can it take?

    • How often does it take certain values?

    • If it’s numeric, what is its average?

  • Is there anything fishy?

    • What should this variable look like? Does it look like that?

    • Are there any values in the variable that don’t make sense?

    • Are there any missing values in the variable?

  • How are the variables related?

    • How does the distribution of one variable change with another?

There are many kinds of variables, and the task of calculating summary statistics and the relationships between variables is a huge subject. So instead of cramming it all in here I’ll point out that these are covered in Chapters 3 and 4 of the book, including with code examples in Chapter 4. I’ll add just a few additional notes here (with a combined code sketch after the list):

  1. You can look directly at your data, as if it were an Excel spreadsheet, by clicking on it in the Environment pane, or doing View(texas).
  2. To look at summary statistics for a specific variable, you can take the variable out of the data set with $. For example, texas$perc1519 returns the perc1519 variable from texas. From there you can send it to all sorts of summary statistics functions, like taking its mean with mean(texas$perc1519), getting a full range of summary statistics with summary(texas$perc1519), and so on.
  3. Many summary statistics functions will return a NA (missing) value if any of the data is missing. Most functions have a na.rm option that you can set to TRUE to drop missing values before doing the calculation. mean(c(1, NA, 3)) returns NA. But mean(c(1, NA, 3), na.rm = TRUE) returns 2. This isn’t the only (or even necessarily best) way to handle missing values, but it is definitely the most common. See Chapter 23 for more.
  4. For categorical or discrete variables, a good way to see how often each value is taken is table(). So table(texas$year) will show us how many observations there are for each year.
  5. The vtable package can automatically produce summary tables that show up in the RStudio Viewer tab (bottom-right) so you can refer back to this information as you work. Do install.packages('vtable') if you haven’t yet, then library(vtable) to load the package. vtable(texas) will show you all the variables and some basic information about each one (along with variable descriptions, if the data has them), and sumtable(texas) will show a full table of summary statistics at a glance.
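Putting a few of those tools together, here’s a quick sketch of a first look at the texas data (assuming you’ve installed vtable already):

library(tidyverse)
library(causaldata)
library(vtable)
data(texas)

# Look at the data itself, spreadsheet-style
View(texas)
# Pull out one variable with $ and get its summary statistics
summary(texas$perc1519)
# How many observations are there for each year?
table(texas$year)
# A full table of summary statistics for every variable
sumtable(texas)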

The Goal of Manipulating Data

What You’re Trying to Do

Your goals in preparing your data are the following:

  1. Take your raw data and put it in the form you need it for your analysis
    • Selecting the right subsample (if any)
    • Creating any variables you might need that aren’t in there
    • Merging with other data sets
  2. Checking if anything looks wrong
    • Does anything in the raw data look like an error?
    • Have you made any errors in your code?

Often, you will want to make sure your data is in tidy data format.

What is tidy data? Let’s take as an example a data set of the heights and weights of 30 different children in a classroom. We’ve taken one set of measurements per child, so the observation level is one child. Two of the children in this classroom are Maria and Tom.

In a tidy data set,

  • Each observation is one row of data (meaning: There are 30 children in the data, so the data set has 30 rows. Maria is one row of the data, and her height and weight are both on that row. We don’t have one row for Maria’s height and another row for her weight. We don’t have Maria’s height and Tom’s height on the same row.)

  • Each variable is a column, and each column is a variable (meaning: one column of the data is everyone’s height. We don’t have one column for Maria’s height and another column for Tom’s height.)

  • Each value is a cell, and each cell is a single value (meaning: we don’t try to stuff multiple values in a single cell. We don’t have one column just called “height and weight” and put “50 inches tall and 70 lbs” for Maria. We also don’t stuff all the children into one row and put [50 inches, 46 inches, 52 inches, …] in a single cell in the height column).

Tidy data is helpful because it makes data analysis on lots of observations much easier to perform. If you’re getting data from sources that tend to look at only one observation at a time (for instance, a quarterly business report with a single sales number, a single costs number, etc.), you’ll often get non-tidy data, since non-tidy formats make it easier to find a single data point of interest. Not so good for analysts though! For more information on tidy data, see this page.

If your data is heavily non-tidy and you need to tidy it, a useful tool you’ll probably need is the pivot. Pivoting data is a huge headache in any language, and it’s notoriously hard to get the syntax right, so I don’t cover it on this page. You can find out more here though!

Writing (or having AI Write) Acceptable Data Manipulation Code

There are a few things you should keep in mind when writing code to manipulate your data:

  • Add comments to your code. In R you can do this by starting the line with #. These comments will help others figure out what your code is trying to do. It will also help you figure out what your code is trying to do when you return to it later.

  • Make sure that all of your code is written in a script (or multiple scripts). If you write code directly in the Console, make sure you copy it into your script before moving on. Don’t fall prey to the trap of thinking that you’ll just remember it later! It should be possible to start from a fresh, brand-new RStudio session and run your script from start to finish without error. As you’re working, it’s not a terrible idea to occasionally save your work, close RStudio down, reopen it, and run your script line-by-line from the start to make sure you haven’t changed something that breaks your code.

  • You need to look at your data after every line of data manipulation code. Using the tools from the “Looking at Data” section above, you should be checking what your data looks like after every line of code. You just wrote a line of code that tries to do something - make a new variable, fill in a missing value, aggregate the data, etc. So ask yourself - did it do that thing? You should probably check. Especially when you’re starting out, your code might not do what you expect! I have been helping students with data manipulation code for a long, long time. If they run into an issue, the great majority of the time it’s not some complex technical problem, it’s just that they forgot to check their data to ensure the code did what they thought it did.

With that in mind, how about AI/LLM services like ChatGPT Canvas, Copilot, and others? Can’t we just have them write the code?

Well, yeah. Coding assistants are effective and powerful. And realistically, outside of a classroom setting where it might not be allowed, you’re probably going to be using them to help write your code the rest of your life.

So… use them now, right? Well, maybe. While AI is excellent at preparing code, it is prone to error. You simply can’t take it for granted that the code will work. This is especially true in coding for data manipulation, where the AI is often lacking crucial context necessary to write the correct code. That means that AI code is only acceptable if you are capable of checking its work.

That means two things:

  1. Don’t have AI write code you couldn’t write yourself. You need to understand how to code before you can use AI to write code. Otherwise you won’t know what the code is doing and can’t check whether it’s right.
  2. Don’t assume that the AI did it properly. This is also true if you’re writing code yourself, but especially if AI is writing code, you have to do the above step of going line-by-line and checking whether the code is actually doing what it’s supposed to do.

Running Into Errors

When you code you will almost certainly run into errors.

In R there are two kinds of messages you will receive. Warnings are what you get when your code runs and completes, but R sort of suspects it didn’t do what you want, or just wants you to be aware of something.

For example, consider the code as.numeric(c('hello','3')). as.numeric() turns strings into numbers, so '3' will get turned into a regular ol’ 3. However, 'hello' isn’t something it can convert into a number, so it will give you the result c(NA, 3) and the warning Warning: NAs introduced by coercion, meaning that it did run your code and give you a result, but, like… did you know you got some NA (missing) values? Just checking. Maybe you knew this would happen and it’s intentional, or maybe it means you have to fix something.

On the other hand, errors are what you get when R actually can’t run your code and stops. For example, consider the code as.numeric(median). median is a function that gives you a median. You can’t convert a function to a number; that doesn’t make any sense. R doesn’t know what to do and so tells you it can’t do that. You get Error in as.numeric(median): cannot coerce type 'closure' to vector of type 'double' and your code stops. When you get an error, it’s always something you have to fix.

How can you fix things when you run into an error? Errors are often confusingly worded and require some technical understanding to figure out what they’re saying. But I have some pointers.

  1. I have a page of the most common R errors my students run into and what they mean here. Check that page for your error and it will explain what’s going on and how to fix it.
  2. In general, try Googling or asking an AI what the error means. It can often tell you.
  3. The error message itself is telling you what the problem is. Even if you can’t understand it, it will usually tell you which part of the code the error is coming from. THIS is the part to look for. In the above error message, you might not know what a 'closure' type is, but you do know that it’s saying the error is in as.numeric(median). So look at that part of your code - you might be able to figure out for yourself what the problem is, even if you can’t understand the error message itself (“oh whoops I meant to calculate the median and then convert that to a number, not convert the median function itself”).

How to Manipulate Data

Now we will get into some actual data manipulation code. As in the book we will be working with dplyr tools, as available in the tidyverse package.

Something to note is that we’ll be using the pipe to work with data. Functions in the tidyverse are set up such that their first argument is the thing they’re manipulating, and then all the options come after that. The pipe looks like this: %>%.5 Its purpose is to take whatever’s on the left and make it the first argument of the function on the right. This lets you chain together multiple functions in a row, moving your data along a data-manipulating machine like it was a conveyor belt. This makes your code easier to read, and avoids having to nest a bunch of functions inside of each other, which can make it easy to make errors.

For example, consider this code:

c(1, 3, 8, 9) %>%
  rep(2) %>%
  length()
[1] 8

This code takes the vector of [1, 3, 8, 9]. Then it uses the pipe to pass it to the rep() function, which copies vectors and other things. I ask it to give me two copies, which gives me [1, 3, 8, 9, 1, 3, 8, 9]. Then I pass that to length() which calculates the length of a vector. The length is now 8, which is the result we get. Compare this to writing the same thing without a pipe, length(rep(c(1,3,8,9),2)) which is a bit harder to follow and has to be read from the inside-out to figure out what’s going on, rather than from the top down.

Also, note that all of the following code will show you how to manipulate data. However, often you will want to use that manipulated data. If this is the case, you’ll need to store the manipulated data in a new object (or overwrite the old one).

For example, try the following code:

library(tidyverse)
library(causaldata)
data(texas)

# make a new variable of just 1s
texas %>% mutate(just1s = 1)

This code will add the column, show you the data with the new column, and then immediately forget it ever did anything. If you check the texas data again there’s no new column, because you didn’t store your changes! For that you’d have to do:

library(tidyverse)
library(causaldata)
data(texas)

# make a new variable of just 1s and update the texas data with it
texas <- texas %>% mutate(just1s = 1)

NOW the texas data has the new column in it.

Picking Only Some of Your Data (Subsetting)

If you want to keep only certain columns (variables) from your data, you can use the select() function. select() is quite flexible - you can tell it what columns you want to keep by writing their names, giving their index position (the second column = 2, etc.), or which columns you want to drop by putting a - before them.

library(tidyverse)
library(causaldata)
data(texas)

# Keep the year and state variables
texas %>% select(year, statefip)
# or
texas %>% select('year', 'statefip')
# or, since they're the first two variables
texas %>% select(1, 2)

# Get rid of the bmprison variable
texas %>% select(-bmprison)

How about picking just some of the rows (observations) to keep? The primary way to do this is with the filter() function.

filter() wants you to give it a calculation using your data that is either true or false. If it’s true, it keeps the row. If it’s false, it drops it.

library(tidyverse)
library(causaldata)
data(texas)

# Keep only data from the state of Utah
# Note that == *checks* whether two things are equal,
# while = *assigns* the left thing to be the right
# so state == 'Utah' is TRUE whenever the state variable is 'Utah'
# and FALSE otherwise
texas %>%
  filter(state == 'Utah')

# Use | to say "or" to chain multiple statements,
# or & for "and"
# Keep data that is from Utah OR is from Montana and the year 1994 or later
# >= means "greater than or equal to"
texas %>%
  filter((state == 'Utah') | (state == 'Montana' & year >= 1994))

In addition to filter() there is also slice(). slice() keeps rows based not on a true/false statement, but rather their index position. texas %>% slice(1:10) will keep only the first ten rows of the texas data. We don’t tend to use slice() all that often in causal inference but it has occasional uses.

Now You

  1. Load up the tidyverse and causaldata packages, and load in the texas data.
  2. Use help(texas) to see the variable descriptions and find which variable is for the unemployment rate.
  3. Use any of the tools from the Looking at Data section to figure out whether the unemployment rate is on a 0-1 or a 0-100 scale.
  4. Use pipes (%>%) to chain together the following into a single set of commands:
  5. Use select() to drop the income variable. Then use it again to keep only the unemployment rate variable, as well as state and year.
  6. Use filter() to keep only rows that either have an unemployment rate above 5% or are in the state of Texas.
  7. Save the result as filtered_texas.
  8. Look at your result to make sure it worked as intended (hint: maybe check the unemployment rates for all non-Texas states (state != 'Texas') to ensure they’re above 5%? Check to make sure all the Texas observations are there?)
library(tidyverse)
library(causaldata)
data(texas)

help(texas)
# looks like the unemployment rate variable is ur

summary(texas$ur)
# It goes from 0 to 100, not 0 to 1

filtered_texas <- texas %>%
  # select to drop income
  select(-income) %>%
  # keep year, state, and ur
  select(state, year, ur) %>%
  # Keep rows if they have an ur above 5 or are in Texas
  filter(state == 'Texas' |  ur > 5)

# Check states
table(filtered_texas$state)
# check UR of non-Texas states
non_texas <- filtered_texas %>%
  filter(state != 'Texas')
summary(non_texas$ur)

Creating and Editing Variables

You can create and edit variables using the mutate() function. For example:

texas <- texas %>%
  mutate(just1s = 1,
         just2s = 2)

will create two new columns in the data: just1s, which is 1 in each row, and just2s, which is 2 in each row. We’re both naming the variables we add (just1s, just2s) and setting their values (1, 2) at the same time. If the name we give a column is the same as a name already in the data, this will overwrite the old column, letting us edit existing data.

Usually, you’re not setting all the values of your new variable to the same thing, as we are here. Typically you’re combining your existing variables and doing some calculations on them to make a new column. Let’s take our year variable and add 1 to it. And while we’re at it, let’s pretty up our naming with the rename() function and change our column names to be capitalized (which might look nicer on a graph we make later).

data(texas)
texas <- texas %>%
  mutate(next_year = year + 1) %>%
  rename(Year = year,
         `Next Year` = next_year)

What’s with the ` marks? Well, R variable names have to follow some rules: they can’t start with a number, they can’t have spaces in them, most punctuation isn’t allowed, and a few other things. Why not? Because if the variable name was “Next Year” it would be hard for R to tell when the variable name ended. Is it after Next or after Year? But we can allow rule-breaking column names if we surround the name in backticks (look at the top-left of your keyboard, usually on the ~ key; this isn’t the same as an apostrophe).
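Note that once a column has a backticked name, you’ll need those backticks any time you refer to it again later. For example:

# Refer to the backtick-named column by wrapping it in backticks again
texas %>% select(Year, `Next Year`)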

Now You

  1. Load up the tidyverse and causaldata packages, and load in the texas data.
  2. Use mutate() to make a new variable called ur_1scale by dividing the ur variable by 100.
  3. Then, use rename() to change the name of ur to unemprate and ur_1scale to Unemployment Rate (0-1 Scale)
  4. Save the result as texas_newvars and either use View(), head(), or click on the data set in the Environment tab to check that your new variable looks like it’s supposed to.
library(tidyverse)
library(causaldata)
data(texas)

texas_newvars <- texas %>%
  mutate(ur_1scale = ur/100) %>%
  rename(unemprate = ur,
    `Unemployment Rate (0-1 Scale)` = ur_1scale)
  
View(texas_newvars)
# Looks good!

Working With Numeric Variables

The easiest kinds of variables to create and edit are numeric variables. Just do a calculation on your existing columns to get your new variable! It will go row by row to do the calculation. Or, if you do some sort of calculation that only gives you back one value, it will apply that single value to all the rows.

From there it’s just learning the different functions and operations to do your math, and being careful with parentheses to keep your intended order of operations intact.

For example, the following code multiplies the share of the population in the state that is Black by the share who are 15-19 to get a (rough) share who are both Black and 15-19. Then, it uses the mean() function, which only gives back a single value, to calculate the state/year’s income relative to the mean:

texas <- texas %>%
  mutate(black_and_1519 = (black/100)*(perc1519/100),
         income_relative_to_mean = income - mean(income))

One important thing to note about numeric variables is that if you want a variable to behave like a number, you should make it a number. What do I mean? Let’s say you want to check whether the alcohol variable takes a value below 2. You’ll want to do this with alcohol < 2, and NOT with alcohol < '2'. The first version compares a number (since alcohol is a numeric variable) with a number (2), and the second compares a number with a string ('2'). This is a really bad idea since it sometimes causes no errors but other times completely messes everything up. See more detail in the footnote if you want to know why.6
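Here’s a quick demonstration of why - when you compare a number to a string, R silently converts the number to a string and then compares them character-by-character, which gives nonsense results:

10 < 2     # FALSE, as you'd expect
'10' < '2' # TRUE! Strings are compared character by character, and '1' comes before '2'
10 < '2'   # also TRUE - the number 10 gets silently converted to the string '10' first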

Now You

  1. Load up the tidyverse and causaldata packages, and load in the texas data.
  2. Use help(texas) to see the descriptions of all the variables.
  3. Use mutate() with the min() function (which calculates the minimum of a variable) to calculate the alcohol consumption per capita divided by the minimum alcohol consumption per capita across all values. Call this new variable alcohol_relative.
  4. Use mutate() to create a new variable called white_share equal to the number of white men in prison divided by (white men in prison + black men in prison).
  5. Save the resulting data set as texas_newvars
  6. Check the data using tools from Looking at Data to ensure the results look right (how might you do this? Well, one example is that white_share should never be below 0 or above 1, and alcohol_relative should never be below 1. Are both of those true? How did I come up with those rules?)
library(tidyverse)
library(causaldata)
library(vtable)
data(texas)

texas_newvars <- texas %>%
  mutate(alcohol_relative = alcohol/min(alcohol),
        # Careful with the parentheses to make sure both wmprison
        # and bmprison are in the denominator!
        white_share = wmprison/(wmprison + bmprison))

# Check the ranges of our variables (there are ways other than vtable to do this)
texas_newvars %>%
  select(alcohol_relative, white_share) %>%
  sumtable()
# Alcohol's min is 1 and white_share is never negative or above 1. Great!

Working With Strings (Text Variables)

It’s not uncommon to get data in the form of character variables, i.e. text. There’s a lot to say on how to work with these variables, but for our purposes we don’t need to get too deep into the weeds.

The stringr package, which is automatically loaded when you load the tidyverse package, contains basically anything you’re going to need to do with strings at the level we’ll need to. Thankfully we don’t have to get into the terror that is regular expressions.

str_sub() takes a substring out of a string based on its position. str_sub('hello', 2, 3) will take the characters of 'hello' from the 2nd to the 3rd position, and will give you 'el'.

str_split() takes a string and splits it into pieces based on a “splitting” character. str_split('Eggs, milk, cheese', ', ')[[1]] will give you back c('Eggs','milk','cheese') (the [[1]] picks the first part of what’s extracted, which you almost always want. str_split_1('Eggs, milk, cheese', ', ') does the same thing).

str_detect() checks whether a substring is present in a string. str_detect(c('water','food','shelter'),'t') will give you back c(TRUE, FALSE, TRUE) since “t” is in “water” and “shelter” but not “food”.

Finally, word() will give you back a word from a multi-word string, based on its position. word('Tale of Two Cities',2) will give you 'of'.

There’s plenty more, but this will be enough to get going and it may be a while before you need anything else. Try help(package = 'stringr') to see what else is out there.

One last thing you might need to do: bringing strings together! paste() will help you combine strings. For instance, paste('Two digit state code:', statefip) could be used to create a new variable that reads ‘Two digit state code: 48’ instead of just ‘48’ (48 is the FIPS code for Texas).
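As a minimal sketch of how that might look inside mutate() (the new column name statefip_label is just made up for this example):

texas %>%
  mutate(statefip_label = paste('Two digit state code:', statefip))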

Now You

  1. Load up the tidyverse and causaldata packages, and load in the texas data.
  2. Use select to limit the data to the state variable, and then pipe that to the unique() function which will get rid of all duplicates. Now you’ve got a data set that just lists the 50 state names (plus Washington DC).
  3. Use mutate() to make a new variable called first_letter, using str_sub() to get the first letter of every state (the state variable name should be the first argument of str_sub()).
  4. Use mutate() to make a new variable called is_virginia, using str_detect() to make the variable TRUE for every state that contains the string “Virginia”
  5. Use mutate() to create a new variable called new_state, using word() to get the first word of the state’s name, and then using == to check if that first word is “New”.
  6. Save the resulting data set as state_names
  7. Check the data using View() or clicking on it in the Environment pane to make sure everything worked properly.
library(tidyverse)
library(causaldata)
data(texas)

state_names <- texas %>%
  select(state) %>%
  unique() %>%
  # We want the first letter
  mutate(first_letter = str_sub(state, 1, 1),
    # check for 'Virginia'
    is_virginia = str_detect(state, 'Virginia'),
    # Get the first word out, then see if it's 'New'
    new_state = word(state, 1) == 'New')
          
# Check our work. Looks good!
View(state_names)

Working With Categories

Often, we have a variable that takes one of a limited set of possible mutually exclusive values. For example, days of the week can only be Sunday, Monday, Tuesday, etc. We might store these as strings, but really they’re not the same thing. We want R to recognize these as categorical variables and treat them appropriately. Plus, treating them as categorical lets the computer handle them more efficiently.7

Like with stringr for strings in the last section and lubridate for dates in the next section, there is a tidyverse package specifically for working with categorical variables called forcats. It’s worth looking into if you’re going to be working heavily with categorical variables, but for the basics we don’t really need it.

The factor() function can convert a string (or discrete numeric) variable to a factor. We can also use it to set the levels of the factor.

Some categorical variables are ordered. For example, if the variable is “country”, that’s not ordered. France and Australia are different countries, but one isn’t, like, more than the other. However, if the variable is “education”, then “High School” is more education than “Elementary School” and less education than “College”, so these categories are ordered. The levels of the factor are its order. The order matters! If you run a regression with the factor variable as a predictor, the reference category will by default be the first level of the factor. If you use the variable in a graph, then the levels of the factor will be used to determine the order the categories are shown on the graph.

data(texas)

# Turn the state variable into a factor
texas <- texas %>%
  mutate(state_factor = factor(state))

# Leave all the levels as-is but change the reference category to Texas
texas <- texas %>%
  mutate(state_tx_ref = relevel(state_factor, ref = 'Texas'))
# Notice now that Texas is shown first
table(texas$state_tx_ref)

               Texas              Alabama               Alaska 
                  16                   16                   16 
             Arizona             Arkansas           California 
                  16                   16                   16 
            Colorado          Connecticut             Delaware 
                  16                   16                   16 
District of Columbia              Florida              Georgia 
                  16                   16                   16 
              Hawaii                Idaho             Illinois 
                  16                   16                   16 
             Indiana                 Iowa               Kansas 
                  16                   16                   16 
            Kentucky            Louisiana                Maine 
                  16                   16                   16 
            Maryland        Massachusetts             Michigan 
                  16                   16                   16 
           Minnesota          Mississippi             Missouri 
                  16                   16                   16 
             Montana             Nebraska               Nevada 
                  16                   16                   16 
       New Hampshire           New Jersey           New Mexico 
                  16                   16                   16 
            New York       North Carolina         North Dakota 
                  16                   16                   16 
                Ohio             Oklahoma               Oregon 
                  16                   16                   16 
        Pennsylvania         Rhode Island       South Carolina 
                  16                   16                   16 
        South Dakota            Tennessee                 Utah 
                  16                   16                   16 
             Vermont             Virginia           Washington 
                  16                   16                   16 
       West Virginia            Wisconsin              Wyoming 
                  16                   16                   16 
# Set the factor levels directly
fake_data <- data.frame(Letters = c('A','B','C','C','D')) %>%
  mutate(Letters = factor(Letters, levels = c('C','D','A','B')))
table(fake_data$Letters)

C D A B 
2 1 1 1 

Another function to highlight is one that’s useful in creating factor variables, and that’s case_when(). case_when() lets you set different conditions and produce categories as an outcome.8 Note that for each row case_when() will return the first matching condition. So below if a state has a poverty rate of 8, it will get categorized as ‘Low Poverty’ since that’s the first matching condition (since 8 < 10), rather than ‘Medium Poverty’, even though 8 < 20 is also true.

texas <- texas %>%
  mutate(poverty_rate_categories = case_when(
    poverty < 10 ~ 'Low Poverty',
    poverty < 20 ~ 'Medium Poverty',
    poverty >= 20 ~ 'High Poverty'
  )) %>%
  mutate(poverty_rate_categories = factor(poverty_rate_categories,
                                          levels = c('Low Poverty',
                                                     'Medium Poverty',
                                                     'High Poverty')))
Now You

  1. Load up the tidyverse and causaldata packages, and load in the texas data.
  2. Use mutate() and case_when() to create alcohol_per_capita_bins equal to ‘Low Drinking’ if alcohol is below 2, ‘Medium-Low Drinking’ for 2-2.5, ‘Medium-High Drinking’ for 2.5-3, and ‘High Drinking’ for values above 3.
  3. Use mutate() and factor() to turn alcohol_per_capita_bins into a factor, with levels set in the order of Low, Medium-Low, Medium-High, and High.
  4. Use mutate() and relevel() to change alcohol_per_capita_bins so that its reference category is ‘High Drinking’.
  5. Check the data using table() to make sure the categories were created and ordered properly. After following all the instructions above, the order should be High, Low, Medium-Low, Medium-High.
library(tidyverse)
library(causaldata)
data(texas)

texas <- texas %>%
  mutate(alcohol_per_capita_bins = case_when(
    alcohol < 2 ~ 'Low Drinking',
    alcohol < 2.5 ~ 'Medium-Low Drinking',
    alcohol < 3 ~ 'Medium-High Drinking',
    alcohol >= 3 ~ 'High Drinking'
  )) %>%
  mutate(alcohol_per_capita_bins = factor(alcohol_per_capita_bins,
    levels = c('Low Drinking','Medium-Low Drinking',
              'Medium-High Drinking','High Drinking'))) %>%
  mutate(alcohol_per_capita_bins = relevel(alcohol_per_capita_bins,
    ref = 'High Drinking'))
    
table(texas$alcohol_per_capita_bins)

Working With Dates

Dates are the worst. Everyone who works with data hates them. They’re inconsistent (different number of days in the month, days in the year (leap years), time zones are wild, many different date formats with different levels of precision, etc. etc.). The tidyverse package lubridate makes working with dates much much easier and I strongly recommend using it whenever you’re dealing with dates.

At our basic level, there are only a few things we need to worry about with dates. The first is converting non-date data, like a string that reads ‘1999-02-09’, into a proper date that R recognizes. The second is pulling date information back out of that date variable (like getting that the year of that date is 1999). The third is “flattening” dates, for example getting the first day of the month that date falls in, which would be February 1, 1999. This can be handy when aggregating date-based information.

We can convert date data into a proper date format using the ymd(), ym(), mdy(), etc. series of functions. Just pick the function that accords with the format of your data. Do you have data that looks like ‘1999-02-18’, i.e. year-month-day? Use ymd() to make it a date. How about just the month and year, like ‘February 1999’? Use my() since it’s month-year in that order. There are similar functions with hms additions for hours-minutes-seconds. See help(ymd) and help(ymd_hms) for the full lists.

Once we have a proper date, we can use year(), month(), day(), week(), and so on to pull out information from the date. If mydate is the date March 8, 1992, then month(mydate) will give back 3 (since March is the 3rd month).

Finally, we can flatten dates with floor_date(). floor_date(mydate, 'month') will get the date at the start of the month that mydate is in, which would be March 1, 1992.
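Here’s a quick sketch of all three steps on a single made-up date (lubridate comes with the tidyverse, but loading it directly doesn’t hurt):

library(lubridate)

# Convert a year-month-day string into a proper date
mydate <- ymd('1992-03-08')

# Pull information back out of the date
year(mydate)  # 1992
month(mydate) # 3
day(mydate)   # 8

# "Flatten" the date down to the start of its month
floor_date(mydate, 'month') # 1992-03-01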

Now You

  1. Load up the tidyverse and causaldata packages, and load in the texas data.
  2. Use mutate() and paste() to paste together ‘February 2’ with the year variable, creating date_feb_2.
  3. Use mutate() and either ymd(), mdy(), ym() or a similar function (which one is correct?) to turn date_feb_2 into a proper date variable.
  4. Use mutate() and floor_date() to make date_jan_1 out of date_feb_2. date_jan_1 is January 1 of the year, i.e. the first day of that year.
  5. Use mutate() and year() to create new_year, taking the year of date_jan_1.
  6. Use mean() with new_year and year to check that they are always the same.
library(tidyverse)
library(causaldata)
data(texas)

texas <- texas %>%
  # paste together
  mutate(date_feb_2 = paste('February 2',year)) %>%
  # It's month, day, year, so use mdy()
  mutate(date_feb_2 = mdy(date_feb_2)) %>%
  # floor_date to go back to the first day of the year
  mutate(date_jan_1 = floor_date(date_feb_2, 'year')) %>%
  # get the year out
  mutate(new_year = year(date_jan_1))
  
# Check that they're always the same
mean(texas$new_year == texas$year)
# the mean is 1, so they're always the same

Aggregating Data

A data set’s observation level is whatever is represented by a single row of data. If you have a data set of 30 students in a classroom, and you have one measurement per student and thus one row of data per student, then the observation level is “one student”. If you follow those same students over many years and have one row of data per student per year, then the observation level is “student-year”.

Often you might have data at a more fine-grained observation level and want to broaden it. Perhaps you have that student-year data but want to do an analysis of students overall. To do this you can aggregate the data.

In the tidyverse this can be done with group_by(), which ensures that any calculations that follow (using dplyr/tidyverse functions) are done within-group. Then you’ll usually follow that with a summarize() including a function to aggregate multiple rows into one. The result will be a data set at the observation level of whatever you put in group_by().

data(texas)

# the original texas data has a state-year observation level

# Get the alcohol consumption by state,
# averaged over all the years
average_alcohol_by_state <- texas %>%
  group_by(state) %>%
  summarize(avg_alcohol = mean(alcohol))
head(average_alcohol_by_state)

Importantly, you have to actually tell it how to summarize the data. If you just do group_by(state) %>% summarize(alcohol) it won’t know what to do. It’s got multiple rows of alcohol data per state. Does it take the mean? The median? The first observation? What? You won’t get the one-row-per-state summary you’re after. You have to tell it to take the mean or the sum or whatever it is you want.

Now You

  1. Load up the tidyverse and causaldata packages, and load in the texas data.
  2. Use group_by() to group the data by state.
  3. Use summarize() to create the variables average_income and total_bmprison which contain the mean income and the sum of bmprison, respectively. Call the summarized data set stateavg.
  4. Look at the data with View() or click on it in the Environment pane to ensure it worked properly.
library(tidyverse)
library(causaldata)
data(texas)

stateavg <- texas %>%
  group_by(state) %>%
  summarize(average_income = mean(income),
            total_bmprison = sum(bmprison))
            
View(stateavg)

Multi-Row Calculations

It’s not uncommon to need to use multiple (but not all) rows of data to calculate something you’re interested in. Some ways this might occur include:

  • Calculating a statistic by group (for instance, calculating the mean of something separately for group A, then again for group B, and so on)

  • Calculating something like a growth rate where you want to know the value of something relative to the period just before, or the first period in the data

To calculate statistics by group, just like in the Aggregating Data section, we can use group_by() to do by-group calculations. However, if we want to calculate by group and create a new column with that calculation, rather than change the observation level, we want to follow that group_by() with a mutate() instead of a summarize().

One common application of group_by() followed by mutate() is to calculate variables that are de-meaned or standardized, where you want to subtract the group mean from something, and then (for standardizing) divide by the group standard deviation:

data(texas)
texas <- texas %>%
  # group by state to calculate an unemployment rate standardized by state
  group_by(state) %>%
  mutate(standardized_ur = (ur - mean(ur))/sd(ur))

For growth rates, you’ll need to keep track of the order of the data. arrange() lets you sort the data set by whatever variables you give it, presumably time order. Then, lag() lets you refer to an observation one row above (or any number of rows), or first() or last() let you refer to the first or last row in the group, respectively. There are also the functions cumsum() and cumprod() for cumulative sums and products of all the observations up to that point (and more stuff like rolling averages in the zoo package). You can also do all this while dropping the group_by() stuff if your data is just a single time series without groups.

data(texas)
texas <- texas %>%
  group_by(state) %>%
  # we want growth from year to year so set it in year order
  arrange(state, year) %>%
  # lag will take data one row above
  mutate(prison_growth = bmprison/lag(bmprison) - 1) %>%
  # although we can take longer lags if we like (handy if the data is, say, monthly)
  mutate(prison_growth_10yrs = bmprison/lag(bmprison, 10) - 1) %>%
  # perhaps we want growth since the start
  mutate(prison_index = bmprison/first(bmprison))

# let's look at one state's results
texas %>%
  filter(state == 'Texas') %>%
  select(year, bmprison, prison_growth, prison_growth_10yrs, prison_index)

Now You

  1. From the causaldata package load the castle data set (not texas!). Use help(castle) to see the descriptions of all variables
  2. Group the data by year
  3. Use mutate() to create avg_robbery equal to the mean robbery rate each year across all states.
  4. Now group by sid and use arrange() to sort the data by year
  5. Use mutate() to create the variable robbery_yoy to calculate the percentage growth in robbery rates by state from one year to the next
  6. Use mutate() to create the variable robbery_index which is the percentage growth in robbery since the start of the data
  7. Overwrite the original castle with your changed version. Then use filter() and select() to look at sid == 1 and just the variables year, robbery, and any variables you just created
library(tidyverse)
library(causaldata)
data(castle)
help(castle)

castle <- castle %>%
  # group by year and then calculate average robberies by year
  group_by(year) %>%
  mutate(avg_robbery = mean(robbery)) %>%
  # now group by state and put in order
  group_by(sid) %>%
  arrange(sid, year) %>%
  # Get year on year growth
  mutate(robbery_yoy = (robbery/lag(robbery)) - 1) %>%
  # and growth since the first observation
  mutate(robbery_index = (robbery/first(robbery)) - 1)
  
# Look at the results
castle %>%
  filter(sid == 1) %>%
  select(year, robbery, avg_robbery, robbery_yoy, robbery_index)

Merging Multiple Data Sets

Often you will have multiple data sets on the same sets of observations. For example, in a data set of companies you might have one file that contains those companies’ marketing spending budgets, and in a different file you might have data on their numbers of employees. Or perhaps you have one data set with the population of a bunch of countries in different years (with one row per country per year) and another data set with the names of the capital cities for each country (one row per country). If you want to analyze these variables together you must first merge (or join) the two data sets.

Merging works by identifying a set of key variables that are shared by both of the data sets. In at least one of the data sets, those key variables should be the observation level of the data, meaning that there is no more than one row with the same combination of those key variables.9

For example, consider the following two data sets A and B:

# A tibble: 4 × 3
  Person  Year DaysExercised
  <chr>  <dbl>         <dbl>
1 Nadia   2021           104
2 Nadia   2022           144
3 Ron     2024            87
4 Ron     2025            98
# A tibble: 3 × 2
  Person Birthplace
  <chr>  <chr>     
1 Nadia  Seville   
2 Ron    Paris     
3 Jin    London    

These data sets share the column “Person”. Notice that “Person” is the observation level of data set B - there’s only one row for each value of Person, with no duplicates. The observation level of data set A is not Person but Person and Year (there’s one row for each unique combination of Person and Year).

If I were to merge these data sets, it would go row by row in data set A. “Hmm, OK, the Person in the first row is Nadia. Let’s check for Nadia in B. In B, Nadia is born in Seville, so I can add a Birthplace column in data set A where Nadia is born in Seville.” Then on to the second row. “The Person in the second row is Nadia again. So the Birthplace column is Seville again” and so on.

# A tibble: 4 × 4
  Person  Year DaysExercised Birthplace
  <chr>  <dbl>         <dbl> <chr>     
1 Nadia   2021           104 Seville   
2 Nadia   2022           144 Seville   
3 Ron     2024            87 Paris     
4 Ron     2025            98 Paris     

Now the data is merged and we can do our Birthplace / DaysExercised analysis.

What about Jin? Jin was in data set B but not data set A. We have some options. One option is an “inner join” (inner_join()) which keeps only matching observations and drops anyone who didn’t match, which is what we did above. Another is a “full join” (full_join()) which keeps any non-matching observations and just puts missing values wherever we don’t have info, like this:

# A tibble: 5 × 4
  Person  Year DaysExercised Birthplace
  <chr>  <dbl>         <dbl> <chr>     
1 Nadia   2021           104 Seville   
2 Nadia   2022           144 Seville   
3 Ron     2024            87 Paris     
4 Ron     2025            98 Paris     
5 Jin       NA            NA London    

We can also choose whether or not to keep the non-matching observations based on which data set they’re in. In merging, there is a “left” data set (the first one we merge) and a “right” data set (the second). For us, the left data set is A and the right data set is B. We can do a “left join” (left_join()), which means “if an observation is in A but not B, keep it. If it’s in B but not A, drop it.” Similarly, there’s right_join(), which keeps observations in B but not A but drops those in A but not B. For us, Jin would stick around with a right_join() but be dropped with a left_join(), since he’s in the right data set but not the left one.

person_year_data <- data.frame(Person = c('Nadia','Nadia','Ron','Ron'),
       Year = c(2021, 2022, 2024, 2025),
       DaysExercised = c(104, 144, 87, 98))
person_data <- data.frame(Person = c('Nadia','Ron', 'Jin'),
       Birthplace = c('Seville','Paris', 'London'))

merged_data <- person_year_data %>%
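  # right_join() keeps every row of person_data (the right data set),
  # so Jin stays even though he never appears in person_year_data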
  right_join(person_data, by = 'Person')

merged_data
  Person Year DaysExercised Birthplace
1  Nadia 2021           104    Seville
2  Nadia 2022           144    Seville
3    Ron 2024            87      Paris
4    Ron 2025            98      Paris
5    Jin   NA            NA     London
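To make the comparison between the join types concrete, here is what the other join functions do with the same two data frames (a quick sketch reusing person_year_data and person_data from above):

# left_join() (or, with this data, inner_join()) keeps only the people who
# appear in person_year_data, so Jin is dropped
person_year_data %>%
  left_join(person_data, by = 'Person')

# full_join() keeps non-matching rows from both sides, which with this data
# gives the same result as the right_join() above
person_year_data %>%
  full_join(person_data, by = 'Person')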

Now You

  1. Run the below code to load in the two example data sets
  2. Merge data1 with data2, keeping both departments
  3. Use View() or click on the data set in the Environment pane to look at the result
library(tidyverse)
data1 <- data.frame(Month = c(1,1,2,2,3,3,4,4), 
                Sales = c(6,8,1,2,4,1,2,4),
                Department = rep(c('Sales','R&D'),4))
data2 <- data.frame(Department = 'Sales', Director = 'Lavy')

# We want to keep both departments, so if we do data1 first
# we want a left_join() or full_join()
# since that will maintain R&D which is in data1 but not data2

merged_data <- data1 %>%
  left_join(data2, by = 'Department')
  
View(merged_data)
# looks good!

Footnotes

  1. There are other ways besides scripts to organize your code, like Quarto or RMarkdown notebooks, or even Jupyter. I won’t get into that here, but you can find out more at this site.↩︎

  2. Or select a chunk of code to run more than one line.↩︎

  3. Note we didn’t have to do x = - you can skip argument names as long as you’re going in order. x is the first argument, so if we don’t give it an argument name it will assume that the first thing we put in there is for x.↩︎

  4. Note that Windows computers, unlike Mac or Linux, typically use backslashes (\) instead of forward slashes to go into a subfolder. However, most programming languages treat a single backslash in a string as an escape character rather than as part of the path, so when writing code you should use forward slashes instead.↩︎

  5. That’s the tidyverse pipe. More recent versions of R also have a base-R pipe that looks like |> and works almost (but not exactly) the same. You can use that one if you like! I usually do in my own work. But The Effect is old enough that it continues to use %>% so I’ll keep using that here too.↩︎

  6. Let’s say we want to check the share of observations that have an alcohol per capita value below 2. I can do this with mean(texas$alcohol < 2). This will take the texas$alcohol variable, check for each row whether it’s below 2, give a TRUE when it is and a FALSE when it’s not, and take the mean() of that, which in effect gives the proportion of observations that are below 2. So far so good! We get an answer of .235 so 23.5% of observations are below 2.

    But what if we write it like this instead, using a string version of 2: mean(texas$alcohol < '2')? Surprisingly, this gives us the same answer of .235! Seems like turning numbers into strings is a-OK, right? Well, no! Definitely not! Because this version isn’t actually comparing the alcohol value to 2. Consider an alcohol value of 3. What R is doing is saying “hmm, I can’t compare 3 to a string, so I’ll convert 3 to a string as well, and ask whether '3' < '2'.” It then makes the comparison alphabetically.

    So that means that while ‘3’ < ‘2’ is FALSE, as it should be, ‘03’ < ‘2’ is TRUE, even though 3 < 2 is FALSE, since ‘03’ comes alphabetically before ‘2’. Similarly, ‘11’ < ‘3’ is TRUE. Bad!↩︎

  7. Saying your data is “Sunday”, “Monday”, “Tuesday”, etc. is more memory-intensive and difficult for the computer to work with than saying your data is 1, 2, 3, and oh hey computer remember that 1 = Sunday, 2 = Monday, and so on. Numbers are way more efficient for computers to handle.↩︎

  8. You can also use it for non-categorical outcomes, for instance using it to check which country you’re in and returning a different currency conversion rate based on the result. But often you’re using it to create categories.↩︎

  9. Technically you can merge two data sets where the key variables are not the observation level of either data set. This is called a many-to-many merge. However, it’s usually a bad idea and if you’re doing it you’re almost certainly making a mistake.↩︎