How to Use converters in pandas.read_excel When Using MultiIndex at the Same Time
Background/Problem
Ok, I think I had this problem about two years ago (not quite sure). Then a little over a year ago, I saw that someone had pretty much the exact same problem on stackoverflow and I answered it there. Well, I was kind of late and someone else had answered before, but I felt that my solution got to the point of the question just a little better. Then I started to write a post for my website about it, but somehow never quite finished and published it. Today I finally got to do so 😃.
Enough bla bla bla, let’s get to the problem.
Imagen you have an excel file with measurement data. The first header row tells
you what was measured, the second one what the measurement unit is. Every row
beneath that represents a set of measurements taken at a specific time. Now you
need to read in the data from the excel file into a pandas.DataFrame. While
doing so, you also want to convert the measurement data into SI units.
pandas.read_excel function has an optional argument converters. To use it,
you have to pass in a dictionary. Each key in that dictionary is a column name
and the corresponding value a lambda function that will be applied to that
column. The problem is, the excel file at hand has two header rows represented
as a MultiIndex in pandas. So what do you use for the keys in the
converters dictionary?
Solution
To better demonstrate the problem and to have an example to work with, imagine you had an excel file with the following table:
| width | height |
|---|---|
| nano_meter | milli_meter |
| 1 | 4 |
| 2 | 5 |
| 3 | 6 |
Back when I had the problem, it took me quite some time to figure out how to
set up the converters dictionary. When I look at it now, it’s actually pretty
easy. But I guess that’s just how it always is. Once you know how to do
something, it seems obvious and easy 🤓. Anyways, all you have to do is, use
tuples as keys in the dictionary. Here we go:
import pandas as pd
converters = {
("width", "nano_meter"): lambda nm: nm / 1_000_000_000,
("height", "milli_meter"): lambda mm: mm / 1_000,
}
data = pd.read_excel("<PATH/TO/EXCEL/FILE>", header=[0, 1], converters=converters)
print(data)
This short script produces the following output:
width height
nano_meter milli_meter
0 1.000000e-09 0.004
1 2.000000e-09 0.005
2 3.000000e-09 0.006
Of course after the conversion the units in the DataFrame are not correct
anymore. So you should rename them or remove them. Even though this was not
part of the exercise, here is how you can remove the units header:
data.columns = data.columns.droplevel(1)
Printing the data now yields the following:
width height
0 1.000000e-09 0.004
1 2.000000e-09 0.005
2 3.000000e-09 0.006
Take care,
Andreas
Leave a comment