In this post, I briefly review a few methods to select rows and/or columns of a DataFrame that satisfy one or more criteria. I then introduce two additional requirements that arises frequently in practice–slicing with previously unknown criteria and managing serialization and deserialization to recover the desired data structure.
Lever multiIndexes
I often find pandas’ multiIndex to be helpful, although I do not observe it
used very often. With a multi-indexed DataFrame, pandas’ .xs
method is a
clean way to select instances, as it can be executed
in a piecewise sequence of criteria. A downside is that you lose the
multiIndex level(s) from your table.
When you have a multi-indexed dataframe, we can use .loc
at varying depth. In
general, .loc
expects a row indexer followed by a column indexer. To select
multiple labels (values) of a level, list them; same for the columns you
want to tabulate:
df.loc[['NA', 'EMEA', 'LA'], ['post_dt', 'swift_msg']]
Note that the row indexer in this example is one list, as is that of the column indexer; this will look into level 0 of the multiIndex. To further utilize the multiIndex, provide multiple lists of labels, bounded as one indexer by a tuple:
df.loc[(['NA', 'EMEA', 'LA'], ['Mining', 'Retail']), ['post_dt', 'swift_msg']]
Pushing further, we arrive at the pinacle of dataframe slicing: the IndexSlice:
df.loc[pd.IndexSlice[:, ['Mining', 'Retail']], ['post_dt', 'swift_msg']]
:
is interpreted as “include every label in this level,” which in this example
implies “all geographies.”
Note that all of the above are more organized and readable version of
applying a boolean mask like df[(df['A']==0) & (df['B']==1)]
.
When slicing goes awry
Often, we try to progammatically select a (sub)category of something from a data set containing multiple groups. With commercial payments, the entire data set may contain multiple bank clients and payment channels (paper check, ACH, wires, etc.), but not every client may have made payments through all channels. We do not know this in advance and yet ask the program to slice thousands of samples.
Pandas’ .xs()
method will slice a dataframe on the basis of one level
of an index. If the value we specify is absent, a KeyError
is produced:
import pandas as pd
df = pd.DataFrame({'c': ['x', 'x', 'y'], 'b': [1, 2, 3]}).set_index('c')
df.xs('z')
KeyError: 'z'
How should we handle this error so that the program will proceed to slice the next requested subject?
The following block provides KeyError
(most specific)
versus Exception
(next level up); print(error)
will produce 'z'
.
df = pd.DataFrame({'c': ['x', 'x', 'y'], 'b': [1, 2, 3]}).set_index('c')
try:
df.xs('z')
except KeyError as error:
print(error)
print(f'custom message')
So with the code, we know to associate the output 'z'
as a key of some kind.
Let’s try to use exceptions’ internal attributes:
try:
df.xs('z')
except KeyError as error:
print(error)
print(error.__traceback__)
>> > 'z'
< traceback object at 0x7fea784e9308 >
'z'
is the __context__
and __cause__
.
Let’s say these operations are within a function that we call from an
external module. Where should we handle this potential error, in situ or
in main.py
? Intuitively in-line with the slicing operation:
def slice_it(data, rail):
try:
slice = data.xs(rail)
print(f'Sample successful')
except KeyError as error:
print(f'KeyError {error}; i.e., not found in data')
slice = pd.DataFrame()
return slice
df = pd.DataFrame({'rails': ['x', 'x', 'y', 'q'],
'b': [1, 2, 3, 9]}).set_index('rails')
channels = ['x', 'y', 'abc', 'q']
for channel in channels:
slice = slice_it(df, channel)
if slice.empty:
continue
The 3rd channel z
is not in the data; the error will be printed and the code
will
attempt to sample the next rail, per the continue
command. Any follow-on
operations will only occur upon successful completion of the slice. Note
that you cannot return None
and apply a pandas method to check if the slice
worked. Instead, we chose to return an empty DataFrame and check it with
the .empty
attribute.
If empty, we could log that the data were unavailable for that particular subject. Error and exception handling intertwines with how you design the scope of functions. The ease of deciding where to handle exceptions (in the processing function itself or a driver function) indicates how well you designed your function’s scope.