Effective Pandas | Matt Harrison | PyData Salt Lake City Meetup - 2021
Details
Title: Effective Pandas | Matt Harrison | PyData Salt Lake City Meetup
Author(s): PyData
Link(s): https://www.youtube.com/watch?v=zgbUk90aQ6A
Rough Notes
Load data
Suppose we load a dataset autos with pandas.
When chaining, try the more readable approach:
cols = [...]  # Names of the columns
(autos
 [cols]
 .select_dtypes(int)
 .describe()
)
If a column has only a few distinct values, it is good to check how they are distributed (dropna=False also counts the missing entries):
autos.cylinders.value_counts(dropna=False)
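A small sketch of what this returns. The autos frame here is a made-up stand-in (the real dataset is not reproduced in these notes), so the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the autos data; values are invented.
autos = pd.DataFrame({"cylinders": [4, 4, 6, np.nan, 8, 4]})

# dropna=False keeps NaN as its own row in the tally.
counts = autos.cylinders.value_counts(dropna=False)
print(counts)

# Number of missing entries, read off the NaN row of the result.
n_missing = int(counts[counts.index.isna()].sum())
print("missing:", n_missing)
```

Without dropna=False the NaN row is silently left out, so the counts would not add up to the number of rows.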
Assuming the above has missing data, we can investigate why that is the case with the following:
(autos
[cols]
.query('cylinders.isna()')
)
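As a sketch of why this is useful: looking at the full rows where cylinders is missing often explains the gap. The frame and the make and fuelType columns below are hypothetical. engine="python" is passed explicitly as a precaution, since method calls inside query strings may not be supported by every expression engine.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; column names and values are invented.
autos = pd.DataFrame({
    "make": ["Tesla", "Honda", "Ford", "Nissan"],
    "fuelType": ["Electricity", "Regular", "Regular", "Electricity"],
    "cylinders": [np.nan, 4, 6, np.nan],
})
cols = ["make", "fuelType", "cylinders"]

# Keep only the rows where cylinders is missing.
missing = (autos
    [cols]
    .query("cylinders.isna()", engine="python")
)
print(missing)
```

In this toy data every row with missing cylinders is an electric car, which would suggest the values are absent by design rather than by error.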
We can assign values to NaN entries, and convert specific columns to specific types via:
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0))
 .astype({'highway08': 'int8', 'city08': 'int16'})
 .describe()
)
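A minimal sketch of the effect on memory, using an invented four-column frame in place of the real data. The column names follow the notes; the values are assumptions chosen to fit the target types.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; values chosen to fit int8 / int16.
autos = pd.DataFrame({
    "cylinders": [4.0, np.nan, 8.0],
    "displ": [2.0, np.nan, 5.0],
    "highway08": [33, 100, 22],
    "city08": [25, 120, 15],
})
cols = ["cylinders", "displ", "highway08", "city08"]

before = autos[cols].memory_usage(deep=True).sum()

slim = (autos
    [cols]
    .assign(cylinders=autos.cylinders.fillna(0).astype("int8"),
            displ=autos.displ.fillna(0))
    .astype({"highway08": "int8", "city08": "int16"})
)
after = slim.memory_usage(deep=True).sum()

print(slim.dtypes)
print(f"bytes: {before} -> {after}")
```

Note that fillna(0) has to come before astype('int8'), because NaN cannot be stored in a plain NumPy integer column.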
To learn the limits of a specific NumPy data type (np.finfo for floats, np.iinfo for integers):
np.finfo(np.float16)
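A short sketch of using these limits to decide whether a column can be downcast safely. The fits helper is a hypothetical name introduced for this example, not something from the talk.

```python
import numpy as np

print(np.iinfo(np.int8))     # limits of an 8-bit signed integer
print(np.iinfo(np.int16))    # limits of a 16-bit signed integer
print(np.finfo(np.float16))  # limits and precision of a 16-bit float

def fits(values, dtype):
    """Return True if every value lies within the integer dtype's range."""
    info = np.iinfo(dtype)
    return info.min <= min(values) and max(values) <= info.max

mpg = [15, 25, 150]          # invented values
print(fits(mpg, np.int8))    # 150 exceeds the int8 maximum of 127
print(fits(mpg, np.int16))
```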
If the dataframe has categorical columns (i.e. columns with few distinct values) stored as the object type, pandas can save memory by converting them to the category type.
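A rough sketch of the saving, assuming a made-up column of car makes with only three distinct values repeated many times:

```python
import pandas as pd

# Hypothetical low-cardinality column stored as Python strings.
makes = pd.Series(["Ford", "Honda", "Toyota"] * 1000, dtype="object", name="make")

as_object = makes.memory_usage(deep=True)
as_category = makes.astype("category").memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

The category type stores each distinct string once and keeps small integer codes per row, which is why the saving grows with the number of repeated values.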
Chaining, also called Flow Programming, leverages the fact that most operations in pandas return a new object. This contrasts with using intermediate variables. The chain should read like a recipe of ordered steps.
Don't mutate! Passing inplace=True rarely actually does anything in place, and in general it does not improve performance. It also means you cannot chain.
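One way to sketch the recipe idea is to wrap the chain in a function that takes the raw frame and returns a cleaned copy. The function name tweak_autos and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data.
autos = pd.DataFrame({
    "cylinders": [4.0, np.nan, 8.0],
    "city08": [25, 120, 15],
})

def tweak_autos(df):
    """Each line is one step of the recipe; every step returns a new object."""
    return (df
        .assign(cylinders=df.cylinders.fillna(0).astype("int8"))
        .sort_values("city08")
        .reset_index(drop=True)
    )

clean = tweak_autos(autos)
print(clean)
print(autos)  # the raw frame is left untouched
```

Because nothing is mutated, the function can be re-run on the raw data at any time and always gives the same result.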
The apply method has a lot of overhead. For example, the time difference between the following two lines of code is an order of magnitude:
autos.city08.apply(lambda x: 200/x)
200./autos.city08
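A hedged way to check this on your own machine, using random integers in place of the real city08 column. Exact timings depend on hardware and library versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random mileage values.
rng = np.random.default_rng(0)
autos = pd.DataFrame({"city08": rng.integers(10, 60, size=100_000)})

via_apply = autos.city08.apply(lambda x: 200 / x)  # one Python call per row
vectorized = 200.0 / autos.city08                  # one call for the whole column

slow = timeit.timeit(lambda: autos.city08.apply(lambda x: 200 / x), number=3)
fast = timeit.timeit(lambda: 200.0 / autos.city08, number=3)
print(f"apply: {slow:.4f}s  vectorized: {fast:.4f}s")
```

Both forms produce the same values; only the cost of getting them differs.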
Use the where method instead of apply if you plan to do something conditional with numbers.
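A minimal sketch with invented values: where keeps entries for which the condition is True and replaces the rest. The cap of 50 is an arbitrary threshold chosen for the example.

```python
import pandas as pd

# Hypothetical mileage values.
autos = pd.DataFrame({"city08": [15, 25, 45, 120]})

# Keep values <= 50, replace everything else with 50.
capped = autos.city08.where(autos.city08 <= 50, 50)

# The same logic with apply, shown only for comparison.
via_apply = autos.city08.apply(lambda x: x if x <= 50 else 50)

print(capped.tolist())
```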
np.select is also helpful; it can be thought of as a vectorized version of a switch statement.
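A small sketch of the switch analogy, with made-up thresholds and labels. Conditions are checked in order and the first match wins; rows matching none get the default.

```python
import numpy as np
import pandas as pd

# Hypothetical mileage values.
autos = pd.DataFrame({"city08": [15, 25, 45, 20]})

conditions = [autos.city08 < 20, autos.city08 < 30]  # checked in order
choices = ["low", "medium"]                          # one label per condition

economy = np.select(conditions, choices, default="high")
print(list(economy))
```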
Use assign to create new columns.
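A brief sketch, assuming two invented mileage columns. Passing a lambda lets a new column be computed from the frame as it exists at that point in the chain, including columns created earlier in the same assign call.

```python
import pandas as pd

# Hypothetical data.
autos = pd.DataFrame({"city08": [15, 25, 45], "highway08": [22, 33, 50]})

with_cols = (autos
    .assign(avg_mpg=lambda df: (df.city08 + df.highway08) / 2,
            above_30=lambda df: df.avg_mpg > 30)  # uses the column made just above
)
print(with_cols)
```

assign returns a new dataframe, so the original autos keeps only its two columns.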