Research Software Engineering

A typical workflow consists of the following:

Specify the desiderata.
Implement the specifications.
Write unit tests.
Document all code at every level.
Use version control.
Iterate the above steps.

Rough notes

One commit should reflect one unit of related work, e.g. one function, and should include all changes related to that function, instead of each related file being committed separately.
Use virtual environments to manage packages and related dependencies - e.g. conda, pipenv, poetry, venv, virtualenv, asdf or docker.
To make a new environment in conda, use conda create -n <env-name> python=3.9, then conda activate <env-name> to activate it. Miniconda can help save disk space.
After installing packages via conda install <package-name>, you can recreate your environment by first saving it to a YAML file via conda env export > environment.yml, followed by conda env create -n <new-env-name> -f environment.yml (note conda env create, not conda create, which does not read YAML environment files). Committing the YAML file means the list of packages and dependencies is in version control.
Difference between conda and pip: Conda is both a package manager and a virtual environment manager. It can install non-Python software like gcc, but not all Python packages are available via conda. Pip is just a package manager: it only installs Python packages, and can install every package on PyPI as well as local packages. Installing pip packages can interfere with conda's ability to install conda packages, so it is recommended to install the big conda packages first, then the small pip packages afterwards.
The directory structure is very important. Ideally, separate data, docs, results, scripts (or utils), src and tests. To create them manually, run mkdir {data,docs,results,scripts,src,tests}.
In Python, to use a function some_function defined in scripts/some_script.py from src/some_src.py, we need to tell Python where to look for the library code. This is done either by changing the Python path (not recommended) or by creating an installable package.
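The Python path is changed by adding sys.path.append("/home/name/projects/goodresearch") at the top of src/some_src.py, before loading some_function via from scripts.some_script import some_function. Note that the appended path is the project root, i.e. the parent of scripts; appending the scripts directory itself would instead require from some_script import some_function.
To create an installable package, first create a setup.py file with the minimal setup below.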
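A minimal sketch of the path-based approach, with the project root computed rather than hard-coded. The helper name add_project_root is hypothetical, and it inserts at the front of sys.path rather than appending; the layout (a scripts directory directly under the project root) is assumed from the note above.

```python
import sys
from pathlib import Path


def add_project_root(project_root):
    """Put the project root on sys.path (hypothetical helper).

    The *parent* of scripts/ has to be on the path for
    `from scripts.some_script import some_function` to resolve.
    """
    root = str(Path(project_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# In src/some_src.py, assuming the layout <project>/src/some_src.py:
# add_project_root(Path(__file__).resolve().parents[1])
# from scripts.some_script import some_function
```

This still ties the code to a particular directory layout, which is why the installable package is the preferred route.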
```
from setuptools import find_packages, setup

setup(
    name="src",
    packages=find_packages(),
)
```
Now, put an empty file named __init__.py in each directory you want to import functions from, here the scripts directory. It belongs in the package directory rather than the project root, because find_packages() only picks up directories that contain an __init__.py. Then install the package via pip install -e . from the project root (where setup.py lives); -e means the package is editable, so code changes take effect without reinstalling. The functions are now importable irrespective of which directory you are in. To change the name of the package, rename the package directory and reinstall via pip install -e . from the project root.
The cookiecutter package automates a lot of the work above. Install it via pip install cookiecutter in the conda-based environment. One example template is the true-neutral project skeleton, available via cookiecutter gh:patrickmineault/true-neutral-cookiecutter.
Use linters to detect bad code style, e.g. flake8 or pylint. Code formatters, e.g. black, automatically fix style issues instead of just reporting them.
Cleaning up dead code is very important, packages like vulture can help.
Use notebooks only when needed, not for long pipelines, classes and functions, or IO. Use them instead for mixing code, text explanations and graphics. Also make sure they run top to bottom. Refactoring notebooks is easier in text format; see e.g. jupytext.
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. - Kernighan's law.
Be wary of code smells: mysterious names, magic numbers, duplicated code, uncontrolled side effects and variable mutations, large functions, high cyclomatic complexity (nested ifs and for loops), globals, hard-coded configurations.
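As an illustration of two of these smells (a mysterious name and a magic number), here is a small before/after sketch. The function names and the 0.05 threshold are made up for the example.

```python
# Before: what is f, what is x, and why 0.05?
def f(x):
    return [v for v in x if v < 0.05]


# After: the name says what the function does, and the threshold is a
# named, overridable parameter instead of a number buried in the body.
SIGNIFICANCE_LEVEL = 0.05  # hypothetical project-wide default


def significant_p_values(p_values, alpha=SIGNIFICANCE_LEVEL):
    """Return the p-values that fall below the significance level."""
    return [p for p in p_values if p < alpha]
```

Both versions compute the same thing; only the second can be read and reconfigured without guessing.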
Good code will have separation of concerns: a function does one thing, a module assembles functions that all work towards the same goal, a class modifies its own members rather than other objects.
If the function you are writing makes sense as a pure function (i.e. deterministic and stateless), write it as such.
Side effects include: modifying a global variable, modifying a static local variable, modifying an argument, doing IO (printing to screen, calling remote server).
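In Python, a function that modifies its argument in place conventionally returns None (list.sort() is an example). Arguments are passed as references to objects, so mutable objects such as lists, dictionaries and class instances can be modified by the function, and the caller sees the change. A modified argument therefore acts as an extra output of the function, alongside whatever it returns.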
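A small sketch of the difference between mutating an argument and returning a new value. The function names are made up for illustration; the pairing mirrors list.sort() (mutates, returns None) versus sorted() (returns a new list).

```python
def append_in_place(values, item):
    """Mutate the caller's list and return None, like list.sort()."""
    values.append(item)


def appended(values, item):
    """Leave the input untouched and return a new list, like sorted()."""
    return values + [item]


data = [1, 2]
result = append_in_place(data, 3)
print(data, result)         # [1, 2, 3] None

original = [1, 2]
extended = appended(original, 3)
print(original, extended)   # [1, 2] [1, 2, 3]
```

In the first call the caller's list changes even though nothing is returned; that change is the side effect.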
Tips for the above: write functions that either modify arguments or return values, not both; put IO in its own functions instead of mixing it with computations; use classes (and private variables) to encapsulate state rather than using stateful functions and globals.
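One way these tips could look in practice is sketched below. All names are hypothetical and the JSON input format is an assumption made for the example: IO sits in its own functions, the computation is pure, and running state lives in a class instance rather than a global.

```python
import json
import statistics


def load_measurements(path):
    """IO only: read a JSON list of numbers from disk."""
    with open(path) as f:
        return json.load(f)


def summarize(measurements):
    """Pure computation: same input, same output, no side effects."""
    return {
        "mean": statistics.mean(measurements),
        "max": max(measurements),
    }


def report(summary):
    """IO only: print the result."""
    print(f"mean={summary['mean']:.2f} max={summary['max']}")


class RunningMean:
    """State lives in the instance rather than in a global."""

    def __init__(self):
        # Leading underscores mark these as private by convention.
        self._total = 0.0
        self._count = 0

    def update(self, value):
        self._total += value
        self._count += 1

    @property
    def value(self):
        return self._total / self._count if self._count else 0.0
```

Because summarize touches neither disk nor screen, it can be tested with a plain list, without any files.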
Python's collections.defaultdict
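A brief sketch of what collections.defaultdict offers: missing keys are created on first access with a default value, which removes the usual "if key not in d" boilerplate. The word list is just sample data.

```python
from collections import defaultdict

words = ["data", "docs", "results", "scripts", "src", "tests"]

# Group words by first letter; a missing key starts out as an empty list.
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

print(dict(by_letter))
# {'d': ['data', 'docs'], 'r': ['results'], 's': ['scripts', 'src'], 't': ['tests']}
```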
Use pytest for testing, ideally for both pure and non-pure functions. If something causes a bug, write a test for it: ~70% of bugs are old bugs that keep reappearing.
Use sphinx to publish documentation from docstrings, via pip install sphinx && cd docs && sphinx-quickstart && make html.
Write code that takes CLI arguments, using e.g. argparse or click.
Use shell scripts to execute pipelines, and put them in version control. Use shellcheck to make sure they do not contain common mistakes.
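A minimal sketch of what such tests might look like. The function normalize and the file name are hypothetical; pytest collects any function whose name starts with test_, so the file itself needs no pytest import. With pytest imported, the second test would more idiomatically use pytest.raises.

```python
# tests/test_normalize.py (hypothetical file name)
import math


def normalize(values):
    """Scale values so they sum to 1 (hypothetical function under test)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


def test_normalize_sums_to_one():
    assert math.isclose(sum(normalize([1, 2, 3])), 1.0)


def test_normalize_rejects_zero_sum():
    # Regression-style test: pin down the behaviour for an edge case
    # so that a fixed bug cannot quietly come back.
    try:
        normalize([0, 0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for zero-sum input")
```

Running pytest from the project root would discover and run both tests.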
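A short argparse sketch. The argument names (input, --n-iter, --out-dir) and their defaults are invented for illustration; in a real script main() would be called without arguments so that it reads sys.argv.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Run one analysis step.")
    parser.add_argument("input", help="path to the input data file")
    parser.add_argument("--n-iter", type=int, default=100,
                        help="number of iterations")
    parser.add_argument("--out-dir", default="results",
                        help="directory to write outputs to")
    return parser


def main(argv=None):
    # argv=None makes argparse fall back to sys.argv[1:].
    args = build_parser().parse_args(argv)
    print(f"input={args.input} n_iter={args.n_iter} out_dir={args.out_dir}")
    return args


# Explicit argument list here, so the example runs without a command line.
main(["data/raw.csv", "--n-iter", "10"])
# prints: input=data/raw.csv n_iter=10 out_dir=results
```

Accepting argv as a parameter keeps the entry point testable, since a test can pass a list instead of patching sys.argv.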

Resources

The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences (Reading section has interesting material).
Research Software Engineering with Python, UCL.
