Aggregation: Implications of indexing

While there are multiple syntaxes and methods to produce the same aggregated data, those variations produce different indices. The format and contents of the index can impact other processes, such as serialization and deserialization.

Consider the following artificial transactional data.

txns = pd.concat([pd.DataFrame({'dt': pd.date_range("2022", freq="D", periods=10),
                                'amount': np.random.random(10),
                                'segment': ['ex'] * 10})] * 10, axis=0)

	dt	amount	segment
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex

Our primary objective is to produce a table of the highest amounts for one or more days. In this exercise, we try different approaches and pick one that yields the desired index. With any approach, .groupby places the grouping variable into the index of any aggregated DataFrame.

Apply a lambda function directly to the groupby object (gbo), in which x is the row of the aggregated DataFrame per the gbo directive. In this approach, we specify the column(s) with which to rank the instances. If present, other columns will be returned…

(txns.groupby('dt').apply(lambda x: x.nlargest(3, 'amount'))
 )[0:9]

	dt	amount	segment
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex

…Unless we follow that operation with a selection:

(txns.groupby('dt').apply(lambda x: x.nlargest(3, 'amount'))
 .loc[:, 'amount']
 )[0:9]

	amount
(Timestamp(‘2022-01-01 00:00:00’), 0)	0.992821
(Timestamp(‘2022-01-01 00:00:00’), 0)	0.992821
(Timestamp(‘2022-01-01 00:00:00’), 0)	0.992821
(Timestamp(‘2022-01-02 00:00:00’), 1)	0.954333
(Timestamp(‘2022-01-02 00:00:00’), 1)	0.954333
(Timestamp(‘2022-01-02 00:00:00’), 1)	0.954333
(Timestamp(‘2022-01-03 00:00:00’), 2)	0.489229
(Timestamp(‘2022-01-03 00:00:00’), 2)	0.489229
(Timestamp(‘2022-01-03 00:00:00’), 2)	0.489229

Either way, nlargest provides dt as the DataFrame index. If we want to serialize as JSON in a bundle of other objects (covered elsewhere), we need to convert the data to a dictionary. Note that .to_dict ignores the index, so .reset_index().

Notice how, in both of the previous two examples, the grouping variable _ combines_ with the original index to form a MultiIndex:

txnsg = (txns.groupby('dt').apply(lambda x: x.nlargest(3, 'amount'))
        )[0:9]
print(txnsg.index)

MultiIndex([('2022-01-01', 0),
            ('2022-01-01', 0),
            ('2022-01-01', 0),
            ('2022-01-02', 1),
            ('2022-01-02', 1),
            ('2022-01-02', 1),
            ('2022-01-03', 2),
            ('2022-01-03', 2),
            ('2022-01-03', 2)],
           names=['dt', None])

Before determining the downstream implications of this MultiIndex, note that we can produce the same output by reversing the lambda application with the column selection:

txns.groupby('dt')['amount'].apply(lambda x: x.nlargest(3))

dt           
2022-01-01  0    0.992821
            0    0.992821
            0    0.992821
2022-01-02  1    0.954333
            1    0.954333
            1    0.954333
2022-01-03  2    0.489229
            2    0.489229
            2    0.489229
2022-01-04  3    0.234701
            3    0.234701
            3    0.234701
2022-01-05  4    0.695798
            4    0.695798
            4    0.695798

If returning a multiIndexed DataFrame by a groupby seems unexpected, you’re not alone. It only happens when .groupby is used with .apply, via the group_keys=True default argument. If group_keys=False, only the original indices are retained:

(txns.groupby('dt', group_keys=False)
     .apply(lambda x: x.nlargest(3, 'amount'))
     .loc[:, 'amount']
)[0:9]

0    0.992821
0    0.992821
0    0.992821
1    0.954333
1    0.954333
1    0.954333
2    0.489229
2    0.489229
2    0.489229

Another variation is whether to retain the grouping variable values as_index=True (default) or replace them with integer indices as_index=False:

(txns.groupby('dt', as_index=False, group_keys=True).apply(
    lambda x: x.nlargest(3, 'amount'))
 .loc[:, 'amount']
 )[0:9]

0  0    0.992821
   0    0.992821
   0    0.992821
1  1    0.954333
   1    0.954333
   1    0.954333
2  2    0.489229
   2    0.489229
   2    0.489229

When we write a multi-indexed DataFrame or Series with .to_csv, it will put commas after level 0 and the nameless integer index (level 1). Remember, such multiIndexing only happens when .groupby is used with .apply, via the group_keys=True default, such that they will form two separate columns when opened in Excel or loaded by .read_csv. The issue is that level 1 remains nameless–a blank column header–in Excel. But notice that, upon reading the file, pandas provides the name Unnamed: 1 (note also dt.1):

txnsg.to_csv('multi_indexed_csv_test.csv')
reload = pd.read_csv('multi_indexed_csv_test.csv')

	dt	Unnamed: 1	dt.1	amount	segment
0	2022-01-01	0	2022-01-01	0.992821	ex
1	2022-01-01	0	2022-01-01	0.992821	ex
2	2022-01-01	0	2022-01-01	0.992821	ex
3	2022-01-02	1	2022-01-02	0.954333	ex
4	2022-01-02	1	2022-01-02	0.954333	ex
5	2022-01-02	1	2022-01-02	0.954333	ex
6	2022-01-03	2	2022-01-03	0.489229	ex
7	2022-01-03	2	2022-01-03	0.489229	ex
8	2022-01-03	2	2022-01-03	0.489229	ex

To ensure index levels are consistently labeled, give any unnamed levels a name with the .rename_axis method:

(txns.rename_axis('txn_id')
 .groupby('dt')
 .apply(lambda x: x.nlargest(3, 'amount'))
 )[0:9]

	dt	amount	segment
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-01 00:00:00’), 0)	2022-01-01 00:00:00	0.992821	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-02 00:00:00’), 1)	2022-01-02 00:00:00	0.954333	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex
(Timestamp(‘2022-01-03 00:00:00’), 2)	2022-01-03 00:00:00	0.489229	ex

The exported csv of the dataframe above will have all columns labeled. But what if, before the groupby operation, we slice the data with values contained in a non-index field (other than txn_id)? We must .set_index('slicer_col') before slicing the dataframe with those values, which typically would be in a pd.Index instance (idx, say) and applied via .loc[idx].

Whether we rename a pre-existing index or not, if we set the index with another column, that previous index will be lost, unless it is reset into columns.