Introduction to Working with Data: Python Version
This page will serve as an entry-level introduction to how to code with data. If you are starting out for the first time using code and programming to work with data (as opposed to point-and-click statistics software like SPSS, or spreadsheet software like Excel) then this will get you up to the point where you can load data and make some basic changes.
This is companion material to my textbook on causal inference, The Effect. I will focus on the methods and tools used in the book. There’s plenty this page doesn’t cover. The focus is just the basics! Please check out the book website or my main website for more resources.
This page uses the Python programming language. The page is also available for R and Stata.
Getting Started
What We’re Learning (and not learning)
This page will teach you how to:
Get started with running code
Install and load libraries
Load in data from files, the internet, or libraries
Write good code and responsibly use AI in your coding
Look at your data and get basic statistical results from your data
Manipulate your data and get it ready for the kinds of analyses you do in The Effect
By the time we’re done, you should be where you need to be to take a data set that is already reasonably usable and prepare it for analysis for a research project. We won’t be getting super deep into the weeds - I will not be covering, for example, how to clean extremely messy or unstructured data. I am also not trying to make you an expert coder. Just the stuff to get you to where you can use the book and start on your own research project.
Further, I will be using the same tools and packages that are used in The Effect, and will be focusing on the kind of tools and procedures used in the book. That means I will be focusing on data sets that are small enough to be opened up on your computer (i.e. “in memory”) rather than interacting with huge databases.
The Very Start
You will need to start by getting Python itself running.
There are many ways to run Python, but I will assume that you are using some implementation of Jupyter, which is a very popular notebook program for running Python, from within some popular IDE (integrated development environment) like VSCode or Spyder.1
If you are working with Anaconda / Spyder, you can go to the Anaconda download page to install it. It will come with Python and Jupyter already set up.
If you are working with VSCode, you can go to the VSCode download page to download it. Then, open it up and go to Extensions on the left-hand pane. Pick out and download the Python extension and the Jupyter extension. The Python extension may give you additional instructions and steps for how to install Python itself.
Alternately, you can skip all of that and make an account on Google Colab, which is a Google service that lets you run Jupyter in your browser. There are some limitations on its use, like how long your code can run or the size of your files, but you can use it without installing anything. Once you’re in your account, click on “New Notebook”.
You can check whether everything is working properly by opening a new Jupyter notebook (File -> New File -> New Jupyter Notebook in VSCode), making a Python code chunk (+Code on the top bar), putting in 2+2, and hitting Play / Ctrl-Enter / Cmd-Enter. It will ask you to pick a kernel (Python), then you should get back a 4.
Running a Script, and Very Basic Code
In Jupyter, you can run code by putting it in a code chunk and hitting Run (shortcut Ctrl-Enter on Windows and Cmd-Enter on Mac). Try it by putting 2+2 in a code chunk and hitting enter. You should get a 4.
Let’s take a look at some very basic example code.
Take the 2+2 and copy it into its own code chunk. Then, either hit the "Run" button, or do Ctrl-Enter (on Windows) or Cmd-Enter (on Mac). This will run the line of code and give you the result. Typically you'll include more than one line of code in a code chunk, but here we want to see the results one at a time, so we'll just run one line at a time.
Let’s look at the next line, a = 1
. This line of code takes the number 1
and stores it inside (=
) of a variable called a
.
Once you’ve run this line, you’ve now stored a variable. It will stay in memory even if you delete the line of code, until you restart your Python. Depending on the software you’re using Python in you can probably find a list of all the variables you’ve stored in memory, perhaps an Environment tab. In VSCode you can look for the “Jupyter” tab and click on “Jupyter Variables” to see that you’ve now stored a variable called a
with a value of 1.
Now run the a
line. This will show you the contents of a
, which we know to be 1
. You’ll see a 1
pop up in the results.
Next, run the [4,3,1] line. This puts several numbers together to make an array / list of values. In other words, data!
Let’s use a function to do a calculation on our data. A function takes input (our data and settings), does a calculation, and gives us back an output. In Python, functions use parentheses ()
. Let’s load the Run the sum([4,3,1])
line. This takes our data [4,3,1]
and passes it to a function, sum()
, which calculates the sum of all of our data. You should get back a value of 8
, since that’s the sum of 1, 3, ane 4.
How did we know how to take a sum? You can read a function's help file using the help() function. Run the help(sum) line. You'll see the list of arguments (options) you can set. From this page we learn that the sum() function takes an iterable as its first argument (anything you can loop over, like a list; for us this was [4, 3, 1]) and a start option to tell it what value to start adding to (0 by default). We can set options by naming them, for example sum([4,3,1], start = 1) to tell it to add 1 to our total sum to get us 9.
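For example:

sum([4,3,1])            # 8
sum([4,3,1], start = 1) # 9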
For many functions, especially ones from packages like pandas, the bottom of the help text includes some example code for using the function. This is often handy since you can copy/paste the working code and then edit it for your case.
Now, use File -> Save (or the Save icon) to save your Jupyter notebook somewhere on your computer.
Downloading and Using Packages
Python has a lot of user-written libraries (packages) containing useful functions for making graphs, doing different kinds of estimates, and so on. These are maintained on several different central repositories, the most popular of which is PyPI.
You can install packages from PyPI using the pip tool, which you can run from inside a Jupyter notebook by putting a ! in front of the command.
In The Effect (and on this page), we make heavy use of the pandas package, so let’s install that now. Run the code:
!pip install pandas
Once you’ve installed the package, you don’t need to install it again, it’s already on your computer. But you do need to load it to use it every time you open up Python. Run:
import pandas as pd
When you import a package, it loads all of the package's functions into memory so you can use them. To use them you need to specify which package the function is in! If we had just done import pandas, then we could access the pandas function DataFrame using pandas.DataFrame().

However, we're going to be using a lot of pandas functions, and repeatedly, so typing pandas. out every time would get old. So instead we do import pandas as pd, which will let us do pd.DataFrame() instead of pandas.DataFrame().

You can also import a function directly, for example from pandas import DataFrame. If we did this then we could just do DataFrame() without any pd.. However, this is generally only a good idea if you're importing just one or two functions from a package (and you'll see this as an example for how we load the texas data on this page below). Otherwise you're likely to run into issues with multiple packages using the same name for a function. So import pandas as pd is what you'll see as the standard, as well as in The Effect.
Loading Data
Getting Data from Packages
Some Python packages come with data sets that you can load in directly. This is not that common in Python, but this is how we get data for The Effect's code examples so I've included it here. The Effect makes use of the data sets in the causaldata package I wrote, so you'll certainly get used to loading data from it! The example code on this page, as well as the example code in the book, will use the texas data from the causaldata package (the texas data is included in the package because it appears in Scott Cunningham's Causal Inference: the Mixtape).

So, first, install causaldata by running !pip install causaldata if you haven't already. Then, load the package and the texas data with the following code:
from causaldata import texas
texas = texas.load_pandas().data
Easy! Click on texas in the Jupyter / Jupyter Variables tab to look at the data like it's a spreadsheet (you may need a data viewer extension if you're using VSCode; it will suggest one). Or use texas.head() to look at the first few rows of data. Additionally, you can use texas.DESCRLONG or texas.DESCRSHORT to find documentation for the data, and texas.NOTE to see variable definitions. Documentation will not be available from inside Python for data sets that don't come from packages.
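For example, after loading the texas data as above, you can run these (one at a time) to look at it:

texas.head()       # first few rows of the data
texas.DESCRLONG    # long description of the data
texas.NOTE         # variable definitions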
Loading Data from a File
Often your data will be stored in a file that you will need to load in. Examples of these kinds of files include:
Excel spreadsheets (.xlsx files)
Comma-separated or tab-separated values files (.csv and .tsv), which are like Excel spreadsheets but simpler and more universal
Special compressed data files like Stata datasets (.dta) or Parquet files (.parquet), which cater to the needs of data analysts
There are a lot of functions designed to read these files in. You can tell the function where the file is, for example giving it a URL if the file is on the internet, and it will load it in for you. For example, after loading pandas you can use pd.read_csv() to load in a comma-separated values file, pd.read_excel() for Excel sheets, or a whole bunch of other options that should pop up in the autocomplete if you start typing pd.read_. To see pd.read_csv() in action, run:
import pandas as pd
texas = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/texas.csv')
The texas data is stored at that URL. After running this code you'll see the texas data loaded in your Jupyter Variables tab. Notice that we couldn't just run pd.read_csv() on its own - that would load in the data briefly and show it to us, but then it would be gone. If we want the data to stick around so we can use it, we need to store it as a variable (=), just like we used a = 1 to make a variable called "a" containing a 1 earlier.
The data doesn’t have to be stored on the internet. You can also load a file from your computer. However, this will require that you know how to use your computer’s file system… see the next section.
Files on Your Computer
pd.read_csv() and similar functions don't just work with data from the internet. They also work with files stored on your computer. You can give the pd.read_csv() function a filepath that points to a place on your computer, instead of a URL that points to a place on the internet.
If you, like most people, didn’t start using a computer to do work until after Google Drive launched in 2012 (and was followed by other cloud storage services like Dropbox and OneDrive soon after), you might not be very familiar with the proper way to store and retrieve files from your computer. Those services encourage you to store files in a way that makes using your file system difficult. This is fine for 99% of people, but not great if you want to write code that accesses files.
As someone writing code, you really want to USE your file system. Modern computer use for casual users tries to push you into just putting all your files in a single “Downloads” folder and using a search bar to find things rather than keeping things organized. This will quickly turn into a major headache if you’re using your computer heavily, and especially if you’re writing code that accesses files. If you have a project, make a folder for the project and keep all the files for that project in it. Maybe even make different subfolders to store data, code, writing, and so on!
Files on your computer are stored in a series of nested folders. On my computer, for instance, I have a file called test_data.csv stored in my Documents folder. If I'm clicking around then I'd just go to "Documents" and search for it. But really, that Documents folder is inside of my "Nick" folder, which is inside of my "Users" folder, which is in my "C:" drive (standard on Windows).
Now that we have things in folders, how can we load data from files? If I just tried to run pd.read_csv('test_data.csv') the code would have no idea that the file is inside of my Documents folder and would give me an error. I need to tell the computer where to look to find the file. That's the filepath.
There are two approaches to filepaths. One is to use an absolute filepath. An absolute filepath tells the computer exactly where to look, starting from the topmost folder and working its way down. In this case that would be:
test = pd.read_csv('C:/Users/Nick/Documents/test_data.csv')
The / forward slashes indicate that we're going into a subfolder. So this starts in C: and then goes to the Users folder, then the Nick folder inside of that, then Documents inside of that, and finally the test_data.csv file.2
This will work. However, it’s a bad idea. Why? For one, if you give someone else your code, it won’t run any more. Unlike URLs, which point to a place on the internet everyone can see, a filepath points to a location on your computer that other people can’t access. Sending someone a filepath to a file on your computer does not send them the file.
Instead, you should use relative filepaths. Instead of starting from the topmost folder, relative filepaths start from a working directory. You can set your working directory, and then use the filepath from there. For example, if my working directory were the Nick folder, then I could do pd.read_csv('Documents/test_data.csv') and it would know where to find the file. Or if my working directory were the Documents folder, I could simply do pd.read_csv('test_data.csv'). Then, I could send someone my code and the test_data.csv file, and if they set their own working directory properly, they can run my code just fine.

Jupyter will typically run code using as a working directory whatever folder you've saved your Jupyter notebook in. So the easiest way to use relative filepaths is to ensure that your code and data are stored in the same folder. Then you can just do pd.read_csv('test_data.csv'). Easy!
One step more difficult, but better practice, is to make a data subfolder for your data. Then you can set the working directory to the folder the code is in and do pd.read_csv('data/test_data.csv'). Even better is to have a main project folder with a data subfolder and also a code subfolder. Put the code in the code folder and the data in the data folder. Then use .. to "hop up" a folder. pd.read_csv('../data/test_data.csv') will start in your code folder, go up one level to the project folder with ../, then go into the data folder, and finally find your file. Send someone your whole project folder (or have it on a shared cloud-storage folder like a OneDrive your research team can all access) and everything will run smoothly! Even better practice is to use a formal project manager and have the project folder be your working directory... but we're getting too advanced here.
Looking at Data
An important part of working with data is looking at it. That includes looking directly at the data itself, as well as looking at graphs and summary statistics of the data.
When you look at data, what are you looking for? There are a few things:
- What is the distribution of the data?
  - What values can it take?
  - How often does it take certain values?
  - If it's numeric, what is its average?
- Is there anything fishy?
  - What should this variable look like? Does it look like that?
  - Are there any values in the variable that don't make sense?
  - Are there any missing values in the variable?
- How are the variables related?
  - How does the distribution of one variable change with another?
The kinds of variables there are and the task of calculating summary statistics and the relationships between variables is a huge subject. So instead of cramming it all in here I’ll point out that these are covered in Chapters 3 and 4 of the book, including with code examples in Chapter 4. I’ll add just a few additional notes here:
- You can look directly at your data, as if it were an Excel spreadsheet, by clicking on it in the Jupyter Variables tab.
- To look at summary statistics for a specific variable, you can take the variable out of the data set with . or []. For example, texas.perc1519 returns the perc1519 variable from texas, as does texas['perc1519'] (which may be preferable if your variable name has a space in it or shares a name with a method). From there you can send it to all sorts of summary statistics functions, like taking its mean with texas.perc1519.mean(), getting a full range of summary statistics with texas.perc1519.describe(), and so on.
- Many of the summary statistics methods we use come from the numpy package, which pandas is based on. We can, if we like, do import numpy as np and then np.mean(texas.perc1519) instead of texas.perc1519.mean(). This may give us access to a wider range of descriptive functions.
- Many summary statistics functions will automatically drop missing (pd.NA) values if any of the data is missing. This isn't the only (or even necessarily best) way to handle missing values, but it is definitely the most common. See Chapter 23 for more.
- For categorical or discrete variables, a good way to see how often each value is taken is .value_counts(). So texas.year.value_counts() will show us how many observations there are for each year.
- .describe() can be used on the whole data set at once. Try texas.describe() to see summary statistics for all the variables.
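To put the tools from this list in one place:

import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

texas.perc1519.mean()        # the mean of one variable
texas.perc1519.describe()    # a full set of summary statistics for one variable
texas.year.value_counts()    # how many observations take each value
texas.describe()             # summary statistics for every variable at once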
The Goal of Manipulating Data
What You’re Trying to Do
Your goals in preparing your data are the following:
- Take your raw data and put it in the form you need it for your analysis
  - Selecting the right subsample (if any)
  - Creating any variables you might need that aren't in there
  - Merging with other data sets
- Checking if anything looks wrong
  - Does anything in the raw data look like an error?
  - Have you made any errors in your code?
Often, you will want to make sure your data is in tidy data format.
What is tidy data? Let’s take as an example a data set of the heights and weights of 30 different children in a classroom. We’ve taken one set of measurements per child, so the observation level is one child. Two of the children in this classroom are Maria and Tom.
In a tidy data set,
Each observation is one row of data (meaning: There are 30 children in the data, so the data set has 30 rows. Maria is one row of the data, and her height and weight are both on that row. We don’t have one row for Maria’s height and another row for her weight. We don’t have Maria’s height and Tom’s height on the same row.)
Each variable is a column, and each column is a variable (meaning: one column of the data is everyone’s height. We don’t have one column for Maria’s height and another column for Tom’s height.)
Each value is a cell, and each cell is a single value (meaning: we don’t try to stuff multiple values in a single cell. We don’t have one column just called “height and weight” and put “50 inches tall and 70 lbs” for Maria. We also don’t stuff all the children into one row and put [50 inches, 46 inches, 52 inches, …] in a single cell in the height column).
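As a small sketch of what a tidy version of that classroom data might look like in pandas (Maria's numbers come from the example above; Tom's are made up for illustration):

import pandas as pd

# One row per child, one column per variable, one value per cell
classroom = pd.DataFrame({
    'Name': ['Maria', 'Tom'],
    'Height': [50, 46],   # inches, made-up example values
    'Weight': [70, 72]    # lbs, made-up example values
})
classroom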
Tidy data is helpful because it makes data analysis on lots of observations much easier to perform. Often if you are getting data from sources that tend to only look at one observation at a time (for instance, in a quarterly report for a business where they look at a single sales number, a single costs number, etc.) you tend to get non-tidy data since non-tidy data often makes it easier to find a single data point of interest. Not so good for analysts though! For more information on tidy data, see this page (which is written for the R language but the tidy data concept applies to Python too).
If your data is heavily non-tidy and you need to tidy it, a useful tool you’ll probably need is the pivot. Pivoting data is a huge headache in any language and notoriously is hard to get the syntax right for, and I don’t cover it on this page. You can find out more here though!
Writing (or having AI Write) Acceptable Data Manipulation Code
There are a few things you should keep in mind when writing code to manipulate your data:
Add comments to your code. In Python you can do this by starting the line with #. These comments will help others figure out what your code is trying to do. It will also help you figure out what your code is trying to do when you return to it later.

Make sure that all of your code is written in a code chunk (or multiple scripts). If you use interactive Python (iPython) in addition to Jupyter, make sure you copy your code into your script / notebook before moving on. Don't fall prey to the trap of thinking that you'll just remember it later! Further, make sure your code is in order. You don't want to have to remember "oh okay in this notebook I'm supposed to run chunk 1 first, then chunk 5, then back to chunk 3..." It should be possible to start from a fresh brand-new Python session and run your notebook from start to finish without error. As you're working, it's not a terrible idea to occasionally save your work, close Python down, reopen it, and run your script chunk-by-chunk from the start to make sure you haven't changed something that breaks your code.
You need to look at your data after every line of data manipulation code. Using the tools from the “Looking at Data” section above, you should be checking what your data looks like after every line of code. You just wrote a line of code that tries to do something - make a new variable, fill in a missing value, aggregate the data, etc.. So ask yourself - did it do that thing? You should probably check. Especially when you’re starting out, your code might not do what you expect! I have been helping students with data manipulation code for a long long time. If they run into an issue, the great majority of the time it’s not some complex technical problem, it’s just that they forgot to check their data to ensure the code did what they thought it did.3
With that in mind, how about AI/LLM services like ChatGPT Canvas, Copilot, and others? Can’t we just have them write the code?
Well, yeah. Coding assistants are effective and powerful. And realistically, outside of a classroom setting where it might not be allowed, you’re probably going to be using them to help write your code the rest of your life.
So… use them now right? Well, maybe. While AI is excellent at preparing code, it is prone to error. You simply can’t take it for granted that the code will work. This is especially true in coding for data manipulation, where the AI is often lacking crucial context necessary to write the correct code. That means that AI code is only acceptable if you are capable of checking its work.
That means two things:
- Don’t have AI write code you couldn’t write yourself. You need to understand how to code before you can use AI to write code. Otherwise you won’t know what the code is doing and can’t check whether it’s right.
- Don’t assume that the AI did it properly. This is also true if you’re writing code yourself, but especially if AI is writing code, you have to do the above step of going line-by-line and checking whether the code is actually doing what it’s supposed to do.
Running Into Errors
When you code you will almost certainly run into errors.
For example, consider the code pd.to_numeric(['hello','3']). pd.to_numeric() turns strings into numbers, so '3' will get turned into a regular ol' 3. However, 'hello' isn't something it can convert into a number, so it will give you the error Unable to parse string 'hello'. When you get an error, it's always something you have to fix.
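If you want to see that error for yourself, run:

import pandas as pd
pd.to_numeric(['hello', '3'])   # raises ValueError: Unable to parse string "hello"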
How can you fix things when you run into an error? Errors are often confusingly worded and require some technical understanding to figure out what they’re saying. But I have some pointers.
- In general, try Googling or asking an AI what the error means. It can often tell you.
- The error message itself is telling you what the problem is. Even if you can't understand it, it will usually tell you which part of the code the error is coming from. THIS is the part to look for. In the above error message, you might not know what 'parse' means, but you do know that it's saying the error has to do with the 'hello' string. So look at that part of your code - you might be able to figure out yourself what the problem is, even if you can't understand the error message itself ("oh whoops I forgot that this file had some random words along with the numbers").
How to Manipulate Data
Now we will get into some actual data manipulation code. As in the book we will be working with pandas tools.
Something to note is that we'll be using method chaining to work with data. pandas lets us work with objects called DataFrames. These objects store data in a table-like structure, and they have methods that let you manipulate the data. Further, each method gives you back a new DataFrame that has been changed in the way you want... and has the same methods intact so you can keep modifying it further! This lets you chain together multiple manipulations in a row, moving your data along a data-manipulating machine like it was a conveyor belt. This makes your code easier to read, and avoids having to nest a bunch of functions inside of each other, which can make it easy to make errors.
For example, consider this code:
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

(texas
    .query('poverty < 20')
    .query('state == "Texas"'))
This code starts with the texas data, applies a .query() to it to limit the data to observations where the poverty rate is below 20 (we'll get to more on .query() in a bit), and then takes that and limits it even further to just observations from Texas.
Also, note that all of the following code will show you how to manipulate data. However, often you will want to use that manipulated data. If this is the case, you’ll need to store the manipulated data in a new object (or overwrite the old one).
For example, if you run that last code chunk I showed you, the code will limit the data to observations with poverty below 20% and in Texas, show you that limited data, and then immediately forget it ever did anything. If you check the texas data again, it still has all of its rows, because you didn't store your changes! For that you'd have to do:
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

texas = (texas
    .query('poverty < 20')
    .query('state == "Texas"'))
NOW the texas data is limited to just those rows.
Picking Only Some of Your Data (Subsetting)
If you want to keep only certain columns (variables) from your data, there are a few options. One option is to make a list of the column names you want to keep and send that to the DataFrame itself with [], or to .loc[:, ] (the : means "keep all the rows"). If you like you can use column positions instead of names and use .iloc[:, ]. If you want to drop a specific column or set of columns, you can use the .drop() method and its columns option.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

# Keep the year and state variables
texas[['year', 'statefip']]
# or
texas.loc[:, ['year','statefip']]
# or, since they're the first two variables
texas.iloc[:, [0,1]]

# Get rid of the bmprison variable
texas.drop(columns = 'bmprison')
How about picking just some of the rows (observations) to keep? The primary ways to do this are with .loc[] or with the more method-chaining friendly option .query().

Both of these approaches want you to give it a calculation using your data that is either true or false. If it's true, it keeps the row. If it's false, it drops it. .loc[] wants you to construct the calculation and then pass the result to .loc[], while .query() has you write a string that does the calculation, using direct references to the column names. The book mostly uses .query() so I'll focus on that for most of this page.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

# Keep only data from the state of Utah
# Note that == *checks* whether two things are equal,
# while = *assigns* the left thing to be the right
# so state == 'Utah' is TRUE whenever the state variable is 'Utah'
# and FALSE otherwise
texas.loc[texas.state == 'Utah', :]
# or
texas.query('state == "Utah"')

# Use | to say "or" to chain multiple statements,
# or & for "and"
# Keep data that is from Utah OR is from Montana and the year 1994 or later
# >= means "greater than or equal to"
texas.query('(state == "Utah") | (state == "Montana" & year >= 1994)')
In addition to .loc[] and .query() there is also .iloc[]. .iloc[] keeps rows based not on a true/false statement, but rather on their position in the data. texas.iloc[1:10] will keep the 2nd through 10th rows of the texas data (positions 1 through 9, since Python starts counting at 0 and a range stops just before its endpoint). We don't tend to use .iloc[] all that often in causal inference but it has occasional uses.

The index is something important to keep in mind when working with pandas. A row's index value (which is what .loc[] uses) is not the same thing as its row position (which is what .iloc[] uses). When you first load data, the two usually line up, so texas.loc[100] will give you the 101st row of the data. However, if you filter the data down and only keep some of the rows, then each data point's index stays the same, even as other rows disappear. If you drop the first 20 rows, for instance, then texas.loc[100] will give you the 81st row of the data, not the 101st.
There are some benefits to this approach, and the indexing system that pandas uses can sometimes let you pull some neat tricks. But for our current application it is mostly just a pain, and we'd usually rather that the row indices just be the row numbers. For that reason, we often follow any changing of the rows with the .reset_index() method, which resets the index to just be the row numbers (usually with the drop = True option, so the old index doesn't get added back as a column).
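For example, here's a small sketch combining .query() and .reset_index() (the particular filter is just for illustration):

import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

# Filtering keeps each row's original index value...
texas_subset = texas.query('year >= 1994')
# ...so reset the index to plain row numbers
# (drop = True stops the old index from being added back as a column)
texas_subset = texas_subset.reset_index(drop = True)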
Now You
- Load up the pandas and causaldata packages, and load in the texas object. Don't do the load_pandas() stuff yet.
- Use texas.NOTE to see the variable descriptions and find which variable is for the unemployment rate. Then load the actual data set as texas.
- Use any of the tools from the Looking at Data section to figure out whether the unemployment rate is on a 0-1 or a 0-100 scale.
- Use .drop() to drop the income variable and save the result as filtered_texas. For all the following changes, keep saving the changes to filtered_texas.
- Then, keep only the unemployment rate variable, as well as state and year.
- Use .query() to keep only rows that either have an unemployment rate above 5% or are in the state of Texas.
- Look at your result to make sure it worked as intended (hint: maybe check the unemployment rates for all non-Texas states (state != 'Texas') to ensure they're above 5%? Check to make sure all the Texas observations are there?)
import pandas as pd
from causaldata import texas
import pandas as pd
from causaldata import texas
texas.NOTE
# looks like the unemployment rate variable is ur
texas = texas.load_pandas().data

texas.ur.describe()
# It goes from 0 to 100, not 0 to 1

# drop income
filtered_texas = texas.drop(columns = 'income')

# keep year, state, and ur
filtered_texas = filtered_texas[['state','year','ur']]

# Keep rows if they have an ur above 5 or are in Texas
filtered_texas = filtered_texas.query('state == "Texas" | ur > 5')

# Check states
filtered_texas.state.value_counts()
# check UR of non-Texas states
filtered_texas.query('state != "Texas"').ur.describe()
Creating and Editing Variables
You can create and edit variables using the .assign() method, or by using []. For example:
texas = (texas
    .assign(just1s = 1)
    .assign(just2s = 2))

# or
texas['just1s'] = 1
texas['just2s'] = 2
will create two new columns in the data: just1s, which is 1 in each row, and just2s, which is 2 in each row. We're both naming the variables we add (just1s, just2s) and setting their values (1, 2) at the same time. If the name we give a column is the same as a name already in the data, this will overwrite the old column, letting us edit existing data.
Usually, you’re not setting all the values of your new variable to the same thing, as we are here. Typically you’re combining your existing variables and doing some calculations on them to make a new column. Let’s take our year
variable and add 1 to it. And while we’re at it, let’s pretty up our naming with the rename()
function and change our column names to be capitalized (which might look nicer on a graph we make later).
texas = (texas
    .assign(next_year = lambda x: x.year + 1)
    .rename(columns = {'year': 'Year', 'next_year': 'Next Year'}))

# or
texas['next_year'] = texas['year'] + 1
texas = texas.rename(columns = {'year': 'Year', 'next_year': 'Next Year'})
What’s with the lambda x
thing? Well, typically when we’re making a new variable in our context, we’re going row-by-row and doing a calculation using data in that row. In this case that’s taking the row’s year
value and adding 1. lambda x:
does that. It’s a little mini-function for calculating row-by-row. It iterates through each row with x
(or whatever you want to call it), so x.year
means “the year value on the current row”.
Now You
- Load up the pandas and causaldata packages, and load in the texas object and data.
- Use .assign() to make a new variable called ur_1scale by dividing the ur variable by 100.
- Then, use .rename() to change the name of ur to unemprate and ur_1scale to Unemployment Rate (0-1 Scale)
- Save the result as texas_newvars and use .head() to check that your new variable looks like it's supposed to.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

texas_newvars = (texas
    .assign(ur_1scale = lambda x: x.ur/100)
    .rename(columns = {'ur':'unemprate','ur_1scale':'Unemployment Rate (0-1 Scale)'}))

texas_newvars.head()
# Looks good!
Working With Numeric Variables
The easiest kinds of variables to create and edit are numeric variables. Just do a calculation on your existing columns to get your new variable! It will go row by row to do the calculation. Or, if you do some sort of calculation that only gives you back one value, it will apply that single value to all the rows.
From there it’s just learning the different functions and operations to do your math, and being careful with parentheses to keep your intended order of operations intact.
For example, the following code multiplies the share of the population in the state that is Black by the share who are 15-19 to get a (rough) share who are both Black and 15-19. Then, it uses the .mean() method, which only gives back a single value, to calculate the state/year's income relative to the mean:
texas = (texas
    .assign(black_and_1519 = lambda x: (x.black/100)*(x.perc1519/100))
    .assign(income_relative_to_mean = lambda x: x.income - texas.income.mean()))
Now You
- Load up the pandas and causaldata packages, and load in the texas object, but hold off on getting the data for now.
- Use texas.NOTE to see the descriptions of all the variables.
- Use .assign() with the .min() method (which calculates the minimum of a variable) to calculate the alcohol consumption per capita divided by the minimum alcohol consumption per capita across all values. Call this new variable alcohol_relative.
- Use .assign() to create a new variable called white_share equal to the number of white men in prison divided by (white men in prison + black men in prison).
- Save the resulting data set as texas_newvars
- Check the data using tools from Looking at Data to ensure the results look right (how might you do this? Well, one example is that white_share should never be below 0 or above 1, and alcohol_relative should never be below 1. Are both of those true? How did I come up with those rules?)
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data
import pandas as pd
from causaldata import texas
texas.NOTE
texas = texas.load_pandas().data

texas_newvars = (texas
    .assign(alcohol_relative = lambda x: x.alcohol/texas.alcohol.min())
    .assign(white_share = lambda x: x.wmprison/(x.wmprison+x.bmprison)))

# Check the ranges of our variables (there are other ways to do this, too)
texas_newvars[['alcohol_relative','white_share']].describe()
# Alcohol's min is 1 and white_share is never negative or above 1. Great!
Working With Strings (Text Variables)
It’s not uncommon to get data in the form of character variables, i.e. text. There’s a lot to say on how to work with these variables, but for our purposes we don’t need to get too deep into the weeds.
In Python in general, strings are stored as arrays of letters. So for example 'hello'[0] will return 'h' since that's the first letter (index position 0) of the string 'hello'. You can also combine together strings by adding them. 'hello'+' '+'there' gets you 'hello there'.
The pandas package gives us a lot more we can do, however. If you have a column of string data, you can access its string-specific functions with the .str accessor, which should then provide you a bunch of autocomplete options for things you can do with strings. We're not going to need to go much beyond that. Thankfully we don't have to get into the terror that is regular expressions. Note that these methods don't work on strings in general, they only work on strings stored as pandas data columns.
.str.slice() takes a substring out of a string based on its position. So if our data column words in the DataFrame called d contained a string 'hello', then d.words.str.slice(1,3) will take the characters of 'hello' from the 2nd to the 3rd position, and will give you 'el'.
.str.split() takes a string and splits it into pieces based on a "splitting" character. If our words column contains the string 'Eggs, milk, cheese' then d.words.str.split(', ') will give you back ['Eggs','milk','cheese'].
.str.contains() checks whether a substring is present in a string. If our words column has the entries ['water','food','shelter'] in its three rows, then d.words.str.contains('t') will give you back [True,False,True] since "t" is in "water" and "shelter" but not "food".
Finally, you might commonly want to get a specific word of the string. We can do this by combining our .str.split() from before with .str.get(), which will get an element from a list. So if our words column contains a string 'Tale of two cities', then d.words.str.split(' ').str.get(1) will give you 'of'.
There’s plenty more, but this will be enough to get going and it may be a while before you need anything else. Try help(pd.core.strings.accessor.StringMethods)
to see what else is out there.
Now You
- Load up the pandas and causaldata packages, and load in the texas data.
- Limit the data to the state variable, and then use the .drop_duplicates() method, which will get rid of all duplicates, and .reset_index(drop = True) to reset the index. Now you've got a data set that just lists the 50 state names (plus Washington DC).
- Use .assign() with lambda to make a new variable called first_letter, using .str.slice() to get the first letter of every state.
- Use .assign() with lambda to make a new variable called is_virginia, using .str.contains() to make the variable True for every state that contains the string "Virginia"
- Use .assign() with lambda to create a new variable called new_state, using .str.split().str.get() to get the first word of the state's name, and then using == to check if that first word is "New".
- Save the resulting data set as state_names
- Check the data by looking at it directly with texas or clicking on it in the Jupyter Variables tab to make sure everything worked properly.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

texas = texas[['state']].drop_duplicates().reset_index(drop = True)

texas = (texas
    .assign(first_letter = lambda x: x.state.str.slice(0,1))
    .assign(is_virginia = lambda x: x.state.str.contains('Virginia'))
    .assign(new_state = lambda x: x.state.str.split(' ').str.get(0) == 'New')
)
texas
Working With Categories
Often, we have a variable that takes one of a limited set of possible mutually exclusive values. For example, days of the week can only be Sunday, Monday, Tuesday, etc.. We might store these as strings, but really they're not the same thing. We want Python to recognize these as categorical variables and treat them appropriately. Plus, treating them as categorical lets the computer handle them more efficiently.4
The pd.Categorical() function can convert a string (or discrete numeric) variable to a factor. We can also use it to set the levels of the factor.
Some categorical variables are ordered. For example, if the variable is “country”, that’s not ordered. France and Australia are different countries, but one isn’t, like, more than the other. However, if the variable is “education”, then “High School” is more education than “Elementary School” and less education than “College”, so these categories are ordered. The levels of the factor are its order. The order matters! If you run a regression with the factor variable as a predictor, the reference category will by default be the first level of the factor. If you use the variable in a graph, then the levels of the factor will be used to determine the order the categories are shown on the graph.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

# Turn the state variable into a factor
texas = (texas
    .assign(state_factor = pd.Categorical(texas.state)))

# Leave all the levels as-is but change the reference category to Texas
# First, pull out the list of categories and put Texas at the front
cats = ['Texas'] + [cats for cats in texas.state_factor.cat.categories if cats != 'Texas']
# Re-make the factor but with the new categories, setting it to be ordered
texas.state_factor = pd.Categorical(texas.state_factor, categories=cats, ordered=True)
# See how Texas is now at the front
texas.state_factor.cat.categories
Index(['Texas', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa',
'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts',
'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana',
'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',
'South Dakota', 'Tennessee', 'Utah', 'Vermont', 'Virginia',
'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
dtype='object')
# Set the factor levels directly
fake_data = pd.DataFrame({'Letters': ['A', 'B', 'C', 'C', 'D']})
fake_data['Letters'] = pd.Categorical(fake_data['Letters'], categories=['C', 'D', 'A', 'B'], ordered=True)
fake_data['Letters'].value_counts()
Letters
C 2
D 1
A 1
B 1
Name: count, dtype: int64
Another function to highlight is one that's useful in creating factor variables, and that's pd.cut(). pd.cut() lets you carve up a numerical variable into bins, which are then considered an ordered factor.
texas = (texas
    .assign(poverty_rate_categories = pd.cut(texas.poverty,
        bins = [-float('inf'), 10, 20, float('inf')],
        labels = ['Low Poverty', 'Medium Poverty', 'High Poverty'],
        ordered = True)))
Now You
- Load up the pandas and causaldata packages, and load in the texas data.
- Use .assign() and pd.cut() to create alcohol_per_capita_bins equal to 'Low Drinking' if alcohol is below 2, 'Medium-Low Drinking' for 2-2.5, 'Medium-High Drinking' for 2.5-3, and 'High Drinking' for values above 3.
- Use .assign() and pd.Categorical() to turn alcohol_per_capita_bins into a factor, with levels set in the order of Low, Medium-Low, Medium-High, and High.
- Use pd.Categorical() to change alcohol_per_capita_bins so that its reference category is 'High Drinking'.
- Check the data using .value_counts() to make sure the categories were created and ordered properly. After following all the instructions above, the order should be High, Low, Medium-Low, Medium-High.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data
texas = (texas
    .assign(alcohol_per_capita_bins = pd.Categorical(
        pd.cut(texas.alcohol,
            bins = [-float('inf'), 2, 2.5, 3, float('inf')],
            labels = ['Low Drinking', 'Medium-Low Drinking', 'Medium-High Drinking', 'High Drinking']),
        ordered = True)))
texas.alcohol_per_capita_bins = pd.Categorical(texas.alcohol_per_capita_bins,
    categories = ['High Drinking', 'Low Drinking',
                  'Medium-Low Drinking', 'Medium-High Drinking'],
    ordered = True)
texas.alcohol_per_capita_bins.cat.categories
Working With Dates
Dates are the worst. Everyone who works with data hates them. They're inconsistent (different number of days in the month, days in the year (leap years), time zones are wild, many different date formats with different levels of precision, etc. etc.). Thankfully, pandas comes with date tools that make working with dates much, much easier, and I strongly recommend using them whenever you're dealing with dates.
At our basic level, there are only a few things we need to worry about with dates. The first is converting non-date data, like a string that reads '1999-02-09', into a proper date that Python recognizes. The second is pulling date information back out of that date variable (like getting that the year of that date is 1999). The third is "flattening" dates, for example getting that the first day in that month's date is February 1, 1999, which can be handy when aggregating date-based information.
We can convert date data into a proper date format using the pd.to_datetime() function. This function is fairly flexible. Do you have data that looks like '1999-02-18', i.e. year-month-day? It'll figure it out. How about just the year and month, like 'February 1999'? It'll figure that out too, and give you the first day of the month. Have data with time as well, like "Mar 10, 1998 08:00"? Yep, that works too.
Once we have a proper date, we can use .year, .month, .day, .week, and so on to pull out information from the date (for a whole column of dates, you go through the .dt accessor, like .dt.year). If mydate is the date March 8, 1992, then mydate.month will give back 3 (since March is the 3rd month).
Finally, we can flatten dates with .to_period(). .dt.to_period('M').dt.to_timestamp() will first get the start of the month that mydate is in (with .dt.to_period('M')), and then convert back to a datetime (with .dt.to_timestamp()), which includes the day (the first of the month, by default), resulting in March 1, 1992.
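Here's a small sketch putting those date tools together (the dates are the examples from the text):

import pandas as pd

mydate = pd.to_datetime('March 8, 1992')
mydate.year                             # 1992
mydate.month                            # 3
mydate.to_period('M').to_timestamp()    # March 1, 1992

# pd.to_datetime is flexible about formats
pd.to_datetime('February 1999')         # February 1, 1999
pd.to_datetime('Mar 10, 1998 08:00')    # March 10, 1998, 8:00 AM

# For a whole column of dates, go through the .dt accessor
dates = pd.to_datetime(pd.Series(['1999-02-09', '1998-03-10']))
dates.dt.year
dates.dt.to_period('M').dt.to_timestamp()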
Now You
- Load up the pandas and causaldata packages, and load in the texas data.
- Use .assign() and + to paste together 'February 2' with the year variable, creating date_feb_2.
- Use .assign() and pd.to_datetime() to turn date_feb_2 into a proper date variable.
- Use .assign() and .to_period() to make date_jan_1 out of date_feb_2. date_jan_1 is January 1 of the year, i.e. the first day of that year.
- Use .assign() and .year to create new_year, taking the year of date_jan_1.
- Use np.mean() with new_year and year to check that they are always the same.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

# Paste together
texas = texas.assign(date_feb_2 = lambda df: pd.to_datetime('February 2 ' + df['year'].astype(str)))
texas = texas.assign(date_feb_2 = lambda df: pd.to_datetime(df.date_feb_2))
texas = texas.assign(date_jan_1 = lambda df: df.date_feb_2.dt.to_period('Y').dt.to_timestamp())
texas = texas.assign(new_year = texas.date_feb_2.dt.year)

# Check that they're always the same
(texas.new_year == texas.year).mean()
# the mean is 1, so they're always the same
Aggregating Data
A data set’s observation level is whatever is represented by a single row of data. If you have a data set of 30 students in a classroom, and you have one measurement per student and thus one row of data per student, then the observation level is “one student”. If you follow those same students over many years and have one row of data per student per year, then the observation level is “student-year”.
Often you might have data at a more fine-grained observation level and want to broaden it. Perhaps you have that student-year data but want to do an analysis of students overall. To do this you can aggregate the data.
In pandas this can be done with .groupby(), which ensures that the aggregating .agg() methods or transforming .transform() methods that follow are done within-group. .agg() will aggregate multiple rows into one. The result will be a data set at the observation level of whatever you put in .groupby().
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

# the original texas data has a state-year observation level
# Get the alcohol consumption by state,
# averaged over all the years (note of course agg doesn't JUST do means)
average_alcohol_by_state = (texas
    .groupby('state')
    .agg(avg_alcohol = ('alcohol','mean')))
# Other more flexible ways of specifying the aggregation function:
# (allowing functions that they didn't build in like 'mean',
# or allowing multiple variables to be aggregated at once)
# .agg({'alcohol': 'mean'}))
# .agg({'alcohol': np.mean}))
# .agg({'alcohol': lambda x: x.mean()}))

average_alcohol_by_state.head()
Now You
- Load up the pandas and causaldata packages, and load in the texas data.
- Use .groupby() to group the data by state.
- Use .agg() to get mean income and the sum of bmprison, respectively. Call the summarized data set stateavg.
- Look at the data with .head() or click on it in the Jupyter Variables tab to ensure it worked properly.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

stateavg = (texas
    .groupby('state')
    .agg({'income':'mean','bmprison':'sum'})
)

stateavg.head()
Multi-Row Calculations
It’s not uncommon to need to use multiple (but not all) rows of data to calculate something you’re interested in. Some ways this might occur include:
Calculating a statistic by group (for instance, calculating the mean of something separately for group A, then again for group B, and so on)
Calculating something like a growth rate where you want to know the value of something relative to the period just before, or the first period in the data
To calculate statistics by group, just like in the Aggregating Data section, we can use .groupby() to do by-group calculations. However, if we want to calculate by group and create a new column with that calculation, rather than change the observation level, we want to follow that .groupby() with a .transform() instead of an .agg().
One common application of .groupby() followed by .transform() is to calculate variables that are de-meaned or standardized, where you want to subtract the group mean from something, and then (for standardizing) divide by the group standard deviation. Unfortunately .groupby() doesn't play nice with .assign() so we need to be a little laborious here:
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

texas['mean_income'] = (texas.groupby('state')['income'].transform('mean'))
texas['sd_income'] = (texas.groupby('state')['income'].transform('std'))
texas = texas.assign(
    income_z = lambda mydat: (mydat.income - mydat.mean_income) / mydat.sd_income
)
For growth rates, you'll need to keep track of the order of the data. .sort_values() lets you sort the data set by whatever variables you give it, presumably time order. Then, .shift() lets you refer to an observation one row above (or any number of rows), .pct_change() will specifically calculate a percentage change from the previous row, and .iloc[0] or .iloc[-1] let you refer to the first or last row in the group, respectively. There are also the functions .cumsum() and .cumprod() for cumulative sums and products of all the observations up to that point (and .rolling() for rolling averages). You can also do all this while dropping the .groupby() stuff if your data is just a single time series without groups.
import pandas as pd
from causaldata import texas
texas = texas.load_pandas().data

# we want growth from year to year so set it in year order
texas = texas.sort_values(by=['state', 'year'])
# pct_change will calculate the percentage change from the previous row
texas['prison_growth'] = texas.groupby('state')['bmprison'].pct_change()
# although we can take longer lags if we like (handy if the data is, say, monthly)
texas['prison_growth_10yrs'] = texas.groupby('state')['bmprison'].pct_change(periods=10)
# let's show a way to do it with .shift() for flexibility
texas['prison_growth_10yrs_shift'] = texas.groupby('state')['bmprison'].transform(lambda x: x / x.shift(10) - 1)
# perhaps we want growth since the start
texas['prison_index'] = texas.groupby('state')['bmprison'].transform(lambda x: x / x.iloc[0])

# let's look at one state's results
texas.query("state == 'Texas'")[['year', 'bmprison', 'prison_growth', 'prison_growth_10yrs', 'prison_growth_10yrs_shift', 'prison_index']]
Now You
- From the causaldata package load the castle data set (not texas!). Use castle.NOTE to see the descriptions of all variables after importing the castle object but before loading the data.
- Use .transform() to create avg_robbery equal to the mean robbery rate each year across all states, grouping the data by year.
- Now use .sort_values() to sort the data by sid and year
- Create the variable robbery_yoy to calculate the percentage growth in robbery rates by state from one year to the next, grouping by sid.
- Create the variable robbery_index which is the percentage growth in robbery since the start of the data, grouping by sid
- Use query() to look at sid == 1 and just the variables year, robbery, and any variables you just created
import pandas as pd
from causaldata import castle
castle = castle.load_pandas().data
import pandas as pd
from causaldata import castle
castle.NOTE
castle = castle.load_pandas().data

# Group the data by year and get average robbery rate
castle['avg_robbery'] = castle.groupby('year')['robbery'].transform('mean')

# now group by state and put in order
castle = castle.sort_values(by=['sid', 'year'])
# calculate the percentage change in robbery rates by state from one year to the next
castle['robbery_yoy'] = castle.groupby('sid')['robbery'].pct_change()
# calculate the percentage growth in robbery since the start of the data
castle['robbery_index'] = castle.groupby('sid')['robbery'].transform(lambda x: x / x.iloc[0] - 1)

# Use query() to look at sid == 1 and just the variables year, robbery, and any variables you just created
castle.query("sid == 1")[['year', 'robbery', 'avg_robbery', 'robbery_yoy', 'robbery_index']]
Merging Multiple Data Sets
Often you will have multiple data sets on the same sets of observations. For example, in a data set of companies you might have one file that contains those companies’ marketing spending budgets, and in a different file you might have data on their numbers of employees. Or perhaps you have one data set with the population of a bunch of countries in different years (with one row per country per year) and another data set with the names of the capital cities for each country (one row per country). If you want to analyze these variables together you must first merge (or join) the two data sets.
Merging works by identifying a set of key variables that are shared by both of the data sets. In at least one of the data sets, those key variables should be the observation level of the data, meaning that there is no more than one row with the same combination of those key variables.5
For example, consider the following two data sets A and B:
 | Person | Year | DaysExercised
---|---|---|---
0 | Nadia | 2021 | 104
1 | Nadia | 2022 | 144
2 | Ron | 2024 | 87
3 | Ron | 2025 | 98
 | Person | Birthplace
---|---|---
0 | Nadia | Seville
1 | Ron | Paris
2 | Jin | London
These data sets share the column "Person". Notice that "Person" is the observation level of data set B - there's only one row for each value of Person, with no duplicates. The observation level of data set A is not Person but Person and Year (there's one row for each unique combination of Person and Year).
If I were to merge these data sets, using either the .join() method or the pd.merge() function, it would go row by row in data set A. "Hmm, OK, the Person in the first row is Nadia. Let's check for Nadia in B. In B, Nadia is born in Seville, so I can add a Birthplace column in data set A where Nadia is born in Seville." Then on to the second row. "The Person in the second row is Nadia again. So the Birthplace column is Seville again" and so on.
 | Person | Year | DaysExercised | Birthplace
---|---|---|---|---
0 | Nadia | 2021 | 104 | Seville
1 | Nadia | 2022 | 144 | Seville
2 | Ron | 2024 | 87 | Paris
3 | Ron | 2025 | 98 | Paris
Now the data is merged and we can do our Birthplace / DaysExercised analysis.
What about Jin? Jin was in data set B but not data set A. We have some options. One option is an "inner join" (the how = 'inner' option in either .join() or pd.merge()), which keeps only matching observations and drops anyone who didn't match, which is what we did above. Another is a "full join" (how = 'outer'), which keeps any non-matching observations and just puts missing values wherever we don't have info, like this:
 | Person | Year | DaysExercised | Birthplace
---|---|---|---|---
0 | Nadia | 2021 | 104 | Seville
1 | Nadia | 2022 | 144 | Seville
2 | Ron | 2024 | 87 | Paris
3 | Ron | 2025 | 98 | Paris
4 | Jin | <NA> | <NA> | London
We can also choose whether or not to keep the non-matching observation based on which data set it's in. In merging, there is a "left" data set (the first one we merge) and a "right" data set (the second). For us, the left data set is A and the right data set is B. We can do a "left join" (how = 'left'), which means "if an observation is in A but not B, keep it. If it's in B but not A, drop it." Similarly there's how = 'right', which keeps observations in B but not A, but drops those in A but not B. For us, Jin would stick around with a how = 'right' join but be dropped with a how = 'left' join, since he's in the right data set but not the left one.
import pandas as pd

personyeardata = pd.DataFrame({
    'Person': ['Nadia', 'Nadia', 'Ron', 'Ron'],
    'Year': [2021, 2022, 2024, 2025],
    'DaysExercised': [104, 144, 87, 98]
})
persondata = pd.DataFrame({
    'Person': ['Nadia', 'Ron', 'Jin'],
    'Birthplace': ['Seville', 'Paris', 'London']
})

# Merge on 'Person'
merged_data = pd.merge(personyeardata, persondata, on='Person', how='right')

merged_data
 | Person | Year | DaysExercised | Birthplace
---|---|---|---|---
0 | Nadia | 2021.0 | 104.0 | Seville
1 | Nadia | 2022.0 | 144.0 | Seville
2 | Ron | 2024.0 | 87.0 | Paris
3 | Ron | 2025.0 | 98.0 | Paris
4 | Jin | NaN | NaN | London
Now You
- Run the below code to load in the two example data sets
- Merge data1 with data2 using pd.merge(), keeping both departments
- Print out the data set to look at the result
import pandas as pd
data1 = pd.DataFrame({
    'Month': [1, 1, 2, 2, 3, 3, 4, 4],
    'Sales': [6, 8, 1, 2, 4, 1, 2, 4],
    'Department': ['Sales', 'R&D'] * 4
})
data2 = pd.DataFrame({
    'Department': ['Sales'],
    'Director': ['Lavy']
})
import pandas as pd
data1 = pd.DataFrame({
    'Month': [1, 1, 2, 2, 3, 3, 4, 4],
    'Sales': [6, 8, 1, 2, 4, 1, 2, 4],
    'Department': ['Sales', 'R&D'] * 4
})
data2 = pd.DataFrame({
    'Department': ['Sales'],
    'Director': ['Lavy']
})

# We want to keep both departments, so if we do data1 first
# we want a left join or full/outer join
# since that will maintain R&D which is in data1 but not data2
merged_data = pd.merge(data1, data2, how = 'left', on = 'Department')

merged_data
# looks good!
Footnotes
Jupyter is far from the only way of working in Python. You can work directly with scripts of course. Even in an interactive environment there are alternative approaches like Quarto (which is what I actually use when working with Python myself). I won’t get into that here, but you can find out more at this site.↩︎
Note that Windows computers, unlike Mac or Linux, typically use backslashes (\) instead of forward slashes to go into a subfolder. However, when writing code in most languages, they won’t let you use the backslash and you must use a forward slash.↩︎
Jupyter makes this kind of hard by making it so you have to run an entire chunk of code rather than a single line at once. One way you can get around this is to have a chunk of your working code, and then write each new line of code in a new chunk that only has a single line (plus whatever code you need to see whether the outcome is as expected). If it works, then you copy it back up to your working code.↩︎
Saying your data is "Sunday", "Monday", "Tuesday", etc. is more memory-intensive and difficult for the computer to work with than saying your data is 1, 2, 3, and oh hey computer remember that 1 = Sunday, 2 = Monday, and so on. Numbers are way more efficient for computers to handle.↩︎
Technically you can merge two data sets where the key variables are not the observation level of either data set. This is called a many-to-many merge. However, it’s usually a bad idea and if you’re doing it you’re almost certainly making a mistake.↩︎