Isaac Slavitt - The 10 Commandments of Reliable Data Science | PyData Global 2022 - 2023

Details

Title: Isaac Slavitt - The 10 Commandments of Reliable Data Science | PyData Global 2022
Author(s): PyData
Link(s): https://www.youtube.com/watch?v=Q21VLmnHvw8

Rough Notes

The talk lays out a set of guidelines for making data science work more reliable.

Motivation

There was once an airline disaster in which two planes, flown by highly experienced crews, collided. Accidents like this launched the field of accident analysis, which is about norms, culture, and continuous quality improvement, and which eventually gave commercial aviation one of the most successful safety records of any industry, if not the most successful.

What can we do about it

We need to go beyond the alchemy perspective of data science.

The speaker and their team's argument has 3 premises:

  • Data science work is a kind of software.
  • The correctness and reliability of software depend on development practices.
  • Therefore, data science quality depends on software engineering quality.

Scenario

Consider the typical data science project: a data scientist needs to analyze some data. This talk does not consider models in production or MLOps, integration with other software, causal analysis, etc.

Two environments stand out:

  • Data science consulting, where the work is often on end-to-end R&D projects.
  • Data science competitions.

Rules

Start organized, stay organized.

Avoid the "we can clean this latter" attitude. If teams follow guidelines, they are 80% of the way there. Consider Django projects in Python, the structure is the same throughout, the database models will be in models.py, and so on. This makes it easier to collaborate.

One solution is to use Cookiecutter Data Science as a starting template.
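For reference (and as an assumption about the current version of the template), Cookiecutter Data Science lays a project out roughly like this, abbreviated:

├── data
│   ├── raw            <- the original, immutable data
│   ├── interim
│   ├── processed
│   └── external
├── models
├── notebooks
├── reports
├── src                <- project source code
├── Makefile
└── README.md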

Everything comes from somewhere; raw data is immutable.

Every now and then one may realize that, for example, a CSV file is critical to the project, but nobody knows where it came from or what to do if it goes out of date.
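One way to keep that provenance explicit is a small download script that is the only thing allowed to write into the raw data directory. A minimal sketch, with a hypothetical URL and file paths:

from pathlib import Path
import urllib.request

RAW_URL = "https://example.com/exports/transactions.csv"  # hypothetical source
RAW_PATH = Path("data/raw/transactions.csv")

def download_raw() -> None:
    # Raw data is downloaded once and never edited; derived data goes elsewhere.
    RAW_PATH.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(RAW_URL, RAW_PATH)

if __name__ == "__main__":
    download_raw()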

Version control is basic professionalism.

Data itself should mostly not be kept in version control. Some options for storing it instead are:

  • Git Large File Storage.
  • Amazon S3.
  • Google Cloud Storage.
  • DVC.
  • Own disk space.

Version control for code enables code review, which is the most effective tool for maturing data science teams. The most effective software teams have a code review process: before code is accepted into the code base, it is read by another human, because code should primarily be written for humans to read. nbautoexport can help here; for example, it converts .ipynb notebook files into .py files, which are easier to review.

Notebooks are for exploration, source files are for repetition.

Notebooks make version control harder, and having a human manually open a notebook to run it is not automation. They also encourage copying code from one notebook to another, which can introduce errors.

A more balanced workflow is to use notebooks for quick feedback while getting things working, then move the stable pieces into source files.

Tests and sanity checks prevent catastrophes.

This is challenging, since:

  • Large datasets result in long running test suites.
  • Data will change over time, so downstream values fluctuate.
  • Using randomness means it is hard to know whether changes are meaningful or due to expected variation.
  • Visualizations are hard to test.
  • Notebooks do not have good tooling for test discovery and running.

If we move frequently used, common functions out into source code files, we can write tests for them.
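A minimal sketch of what such a test might look like, using pytest and a hypothetical cleaning function (in a real project the function would live in a source module such as src/data/clean.py rather than next to the test):

import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with negative amounts; shown inline here, would normally be imported.
    return df[df["amount"] >= 0].reset_index(drop=True)

def test_clean_transactions_drops_negative_amounts():
    raw = pd.DataFrame({"amount": [10.0, -5.0, 3.0]})
    cleaned = clean_transactions(raw)
    assert (cleaned["amount"] >= 0).all()
    assert len(cleaned) == 2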

Fail loudly, fail quickly.

Do not try to catch and quietly handle every exception; let failures surface immediately.

pandas will happily compute a mean from a DataFrame column even if it contains NaNs. One way to guard against this kind of silent problem is assertion testing, like:

assert np.allclose(arr, HISTORICAL_MEAN, atol=1.5 * HISTORICAL_SD)
assert df["transaction_type"].isin(KNOWN_ENUM_TYPES).all()

One package for assertion testing is Bulwark.

Overall, we should stick to the ideas of Defensive Programming.
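A fuller sketch of this fail-loud style, with hypothetical column names and thresholds: validate the data as soon as it is loaded so problems surface immediately.

import numpy as np
import pandas as pd

KNOWN_ENUM_TYPES = {"purchase", "refund", "transfer"}  # hypothetical values
HISTORICAL_MEAN, HISTORICAL_SD = 42.0, 5.0             # hypothetical thresholds

def load_transactions(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Fail loudly and quickly if the data looks wrong.
    assert df["amount"].notna().all(), "unexpected NaNs in amount"
    assert np.allclose(df["amount"].mean(), HISTORICAL_MEAN, atol=1.5 * HISTORICAL_SD)
    assert df["transaction_type"].isin(KNOWN_ENUM_TYPES).all()
    return df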

Project runs should be fully automated.

From the raw data to final outputs, the project run should be automated.

Consider GNU Make or something similar.
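A minimal sketch of what such a Makefile might look like, with hypothetical scripts and file names; each target rebuilds one stage, and `make all` runs the whole pipeline from raw data to model:

data/processed/clean.csv: src/data/clean.py data/raw/transactions.csv
	python src/data/clean.py data/raw/transactions.csv data/processed/clean.csv

models/model.pkl: src/models/train.py data/processed/clean.csv
	python src/models/train.py data/processed/clean.csv models/model.pkl

all: models/model.pkl
.PHONY: all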

Important parameters should be extracted and centralized.

Many of the functions in use will have dozens of parameters. On top of this there are environment variables, Make variables, central configuration variables for the project, command line arguments, constants at the top of files, and magic numbers sitting right where they are used inside some function.

One tool for this is Hydra.
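A minimal sketch of Hydra usage (assuming Hydra >= 1.2, with hypothetical config values): parameters live in one YAML file and can be overridden from the command line.

# conf/config.yaml:
#   model:
#     learning_rate: 0.01
#     n_estimators: 100
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # All important parameters come from the config and can be overridden, e.g.
    #   python train.py model.learning_rate=0.1
    print(cfg.model.learning_rate, cfg.model.n_estimators)

if __name__ == "__main__":
    main()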

Project runs should be verbose by default and give tangible artifacts.

The typical workflow is to start from a notebook and eyeball the results, going back and forth.

An improved workflow is to use the ideas and tools mentioned above and then eyeball the results.

Log files should be verbose, and they should always be saved.
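A minimal sketch of verbose-by-default logging that always writes to a timestamped file (hypothetical log directory):

import logging
from datetime import datetime
from pathlib import Path

log_dir = Path("logs")
log_dir.mkdir(exist_ok=True)
log_file = log_dir / f"run_{datetime.now():%Y%m%d_%H%M%S}.log"

logging.basicConfig(
    level=logging.DEBUG,  # verbose by default
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
)
logging.info("Pipeline run started")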

Start with the simplest possible end-to-end pipeline.

Get to the end product before it is good, and implement things when we need them, not when we think we need them (paraphrased from Mike Gancarz, Linux and the Unix Philosophy, 2nd edition).

In practice this might mean the following (see the sketch after this list):

  • Clean the data, but use a very small subset/collection.
  • Choose the relevant features, but take a few columns first.
  • Train the model, but start with a simple model like logistic regression, depending on the task.
  • Make predictions and benchmark them.
    • Print the results in a table.

(Repeat the above until satisfactory).

  • Then make the end web app etc.
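A minimal end-to-end sketch along these lines, with hypothetical file paths and column names: a small subset of rows, a few columns, a simple baseline model, and printed results.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/raw/transactions.csv", nrows=1_000)  # small subset first
X = df[["amount", "account_age_days"]]                      # a few columns first
y = df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)           # simple baseline model
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))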

Conclusion

Overall, the data science process can be summarized as:

  • Ask an interesting question.
    • What is the scientific goal?
    • What would you do if you had all the data?
    • What do you want to predict/estimate?
    • etc.
  • Get the data.
    • How were the data sampled?
    • Which data is relevant?
    • Are there privacy issues?
    • etc.
  • Explore the data.
    • Plot the data.
    • Are there anomalies?
    • Are there patterns?
  • Model the data.
    • Build a model.
    • Fit the model.
    • Validate the model.
  • Communicate and visualize the results.
    • What did we learn?
    • Do the results make sense?
    • Can we tell a story?

The book version of this talk is available here.
