Effective Pandas I Matt Harrison I PyData Salt Lake City Meetup - 2021

Details

Title: Effective Pandas I Matt Harrison I PyData Salt Lake City Meetup
Author(s): PyData
Link(s): https://www.youtube.com/watch?v=zgbUk90aQ6A

Rough Notes

Load data

Suppose we load a dataset autos with Pandas.

When chaining, wrap the chain in parentheses and put one operation per line, which is more readable:

cols = [...] # Names of the columns

(autos
 [cols]
 .select_dtypes(int)
 .describe()
)

If a column has only a few distinct values, it is good to check how they are distributed (including missing values):

autos.cylinders.value_counts(dropna=False)
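As a small illustration of the call above, here is a sketch on a made-up cylinders column (the values are hypothetical, not from the talk's dataset). With dropna=False the missing entries get their own row in the counts instead of being silently dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for autos.cylinders; values are invented.
cylinders = pd.Series([4, 4, 6, np.nan, 8, 4, np.nan], name="cylinders")

# dropna=False keeps NaN as its own category in the result.
counts = cylinders.value_counts(dropna=False)

# Cross-check the NaN count directly.
n_missing = int(cylinders.isna().sum())

print(counts)
print("missing:", n_missing)
```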

If the counts show missing data, we can look at the affected rows to investigate why:

(autos
 [cols]
 .query('cylinders.isna()')
)

We can fill in NaN entries and convert specific columns to specific types via:

(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
 displ=autos.displ.fillna(0))
 .astype({'highway08':'int8', 'city08':'int16'})
 .describe()
)

To see the limits of a specific NumPy data type, use np.finfo for floats (np.iinfo is the integer counterpart):

np.finfo(np.float16)
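One way to use these limits, sketched here with a made-up column, is to check that the data fits in a smaller type before casting. The column values and the fits_int8 helper name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mileage values; not taken from the talk's dataset.
highway = pd.Series([30, 100, 25, 18], name="highway08")

int8_info = np.iinfo(np.int8)        # integer types: min and max
float16_info = np.finfo(np.float16)  # float types: min, max, precision

# Only downcast when every value lies inside the int8 range.
fits_int8 = int8_info.min <= highway.min() and highway.max() <= int8_info.max
small = highway.astype("int8") if fits_int8 else highway

print(int8_info.min, int8_info.max, small.dtype)
```

Casting a value outside the target range would overflow or raise, depending on the library versions, so checking first is the safer habit.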

If categorical columns (i.e. columns with few distinct values) are stored as object types in the dataframe, converting them to the category type lets pandas save memory.
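A rough sketch of the memory effect, using an invented make column repeated many times so the difference is visible. The exact byte counts depend on the pandas version and string backend, so only the comparison is shown.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct values, 100,000 rows.
make = pd.Series(["Ford", "Toyota", "Honda", "Ford"] * 25_000, name="make")

# deep=True counts the memory of the string objects themselves.
as_strings = make.memory_usage(deep=True)
as_category = make.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings:,} bytes")
print(f"category: {as_category:,} bytes")
```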

Chaining, also called Flow Programming, leverages the fact that most operations in pandas return a new object. This contrasts with using intermediate variables. The chain should read like a recipe of ordered steps.
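To make the contrast concrete, here is a minimal sketch of the same three steps written both ways. The data, the gal_per_100mi column and the threshold are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the autos dataset.
autos = pd.DataFrame({
    "make": ["Ford", "Ford", "Toyota", "Honda"],
    "city08": [18, 22, 30, 28],
})

# Intermediate variables: every step leaves a name behind.
df1 = autos[autos.city08 > 20]
df2 = df1.assign(gal_per_100mi=100 / df1.city08)
step_by_step = df2.sort_values("city08")

# Chained: the same steps, read top to bottom like a recipe.
chained = (autos
    .query("city08 > 20")
    .assign(gal_per_100mi=lambda df: 100 / df.city08)  # lambda sees the filtered frame
    .sort_values("city08")
)

print(chained)
```

The lambda in assign matters here: it receives the dataframe as it exists at that point in the chain, rather than the original autos.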

Don't mutate! Operations with inplace=True rarely actually do anything in place, so in general they do not improve performance. Mutating also means you cannot chain.

The apply method has a lot of overhead. For example, the following two lines compute the same result, but the second (vectorized) one is roughly an order of magnitude faster:

autos.city08.apply(lambda x: 200/x)

200./autos.city08
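The gap can be measured with a quick sketch like the one below, which uses a synthetic column (an assumption, not the talk's data). Absolute timings will vary by machine, so only the relative comparison is meaningful.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic mileage column; starts at 1 to avoid division by zero.
city08 = pd.Series(np.arange(1, 100_001), name="city08")

# Both expressions compute the same values.
via_apply = city08.apply(lambda x: 200 / x)
vectorized = 200.0 / city08

# Time each approach over a few repetitions.
t_apply = timeit.timeit(lambda: city08.apply(lambda x: 200 / x), number=3)
t_vector = timeit.timeit(lambda: 200.0 / city08, number=3)

print(f"apply:      {t_apply:.4f}s")
print(f"vectorized: {t_vector:.4f}s")
```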

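If an apply call only implements conditional logic on the values, use the where method instead: it keeps the values where a condition holds and substitutes another value elsewhere.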
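As a sketch of swapping apply for where, the example below caps an invented mileage column at 50. The cap and the values are hypothetical.

```python
import pandas as pd

# Hypothetical mileage values, with one implausibly large entry.
city08 = pd.Series([12, 18, 25, 40, 120], name="city08")

# apply version: one Python-level function call per element.
capped_apply = city08.apply(lambda x: x if x <= 50 else 50)

# where version: keep values where the condition is True, use 50 elsewhere.
capped_where = city08.where(city08 <= 50, 50)

print(capped_where.tolist())
```

Note that where keeps the values for which the condition is True, which is the opposite of what the name may suggest at first glance.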

np.select is also helpful when there are several conditions; it can be thought of as a vectorized version of a switch statement.
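A minimal sketch of np.select, assuming made-up thresholds and labels: conditions are checked in order, the first match wins, and default covers everything else.

```python
import numpy as np
import pandas as pd

# Hypothetical mileage values; thresholds and labels are invented.
autos = pd.DataFrame({"city08": [12, 18, 25, 40]})

# Conditions are evaluated in order; the first True one picks the choice.
conditions = [autos.city08 < 15, autos.city08 < 30]
choices = ["low", "medium"]

labelled = autos.assign(
    economy=np.select(conditions, choices, default="high")
)

print(labelled)
```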

Use assign to create new columns.

Types

Chaining

Mutation

Apply

Aggregation
