Set up your project

Setting up an organized project will help you remain productive as your project grows. The broad steps involved are:

  1. Pick a name and create a folder for your project

  2. Initialize a git repository and sync to Github

  3. Set up a virtual environment

  4. Create a project skeleton

  5. Install a project package

The end result will be a logically organized project skeleton that’s synced to version control.

Warning

I will present most of the project setup in the terminal, but you can do many of these steps inside of an IDE or file explorer.

Pick a name and create a folder for your project

When you start a project, you will need to decide how to structure it. For an academic, a project tends to map naturally to a paper. Therefore, one project = one paper = one folder = one git repository is generally a good default structure.

Pick a short and descriptive name for your project and create a folder in your Documents folder. For instance, when I created the project for this book, the first step was to create the codebook folder:

~/Documents$ mkdir codebook

Initialize a git repository and sync to Github

Since git is such a core tool to manage code-heavy projects, I recommend that you set it up immediately. The way I prefer to do this is by going to Github and clicking the big green New button to create a new repository. I name the remote the same as my local folder and hit Create Repository.

_images/github-repo.png

Fig. 4 The big green New button.

I then follow Github’s instructions to initialize the repo. In ~/Documents/codebook, I run:

echo "# codebook" >> README.md
git init
git add README.md
git commit -m "first commit"
git branch -M main
git remote add origin https://github.com/patrickmineault/codebook.git
git push -u origin main

How often do you think you should commit to git?

The general rule of thumb is that one commit should represent a unit of related work. For example, if you made changes in 3 files to add a new functionality, that should be one commit. Splitting the commit into 3 would lose the relationship between the changes; combining these changes with 100 other changed files would make it very hard to track down what changed. Try to make your git commit messages meaningful, as it will help you track down bugs several months down the line.

If you don’t use git very often, you might not like the idea of committing to git daily or multiple times per day. The git command line can feel like a formidable adversary; GUIs can ease you into it. I used to use the git command line exclusively. These days, I tend to prefer the git panel in VSCode.

_images/git-vscode.png

Fig. 5 The git panel in VSCode.

Set up a virtual environment

Why do I use virtual Python environments? So I don’t fuck up all my local shit.

Nick Wan

_images/python_environment_2x.png

Fig. 6 Python environments can be a real pain. From xkcd.com by Randall Munroe.

Many novices starting out in Python use one big monolithic Python environment. Every package is installed in that one environment. The problem is that this environment is not documented anywhere. Hence, if they need to move to another computer, or they need to recreate the environment from scratch several months later, they’re in for several hours or days of frustration.

The solution is to use a virtual environment to manage dependencies. Each virtual environment specifies which versions of software and packages a project uses. The specs can be different for different projects, and each virtual environment can be easily swapped, created, duplicated or destroyed. You can use software like conda, pipenv, poetry, venv, virtualenv, asdf or docker - among others - to manage dependencies. Which one you prefer is a matter of personal taste and countless internet feuds. Here I present the conda workflow, which is particularly popular among data scientists and researchers.

Conda

Conda is the de facto standard package manager for data science-centric Python. conda is both a package manager (something that installs packages on your system) and a virtual environment manager (something that can swap out different combinations of packages and binaries - virtual environments - easily).

Once conda is installed - for instance, through miniconda - you can create a new environment and activate it like so:

~/Documents/codebook$ conda create --name codebook python=3.8
~/Documents/codebook$ conda activate codebook

From this point on, you can install packages through the conda installer like so:

(codebook) ~/Documents/codebook$ conda install pandas numpy scipy matplotlib seaborn

Now, you might ask yourself, can I use both pip and conda together?

Export your environment

To export a list of dependencies so you can easily recreate your environment, use the env export command:

(codebook) ~/Documents/codebook$ conda env export > environment.yml

You can then commit environment.yml to document this environment. You can recreate this environment - when you move to a different computer, for example - using:

$ conda env create --name recoveredenv --file environment.yml

This export method will create a well-documented, perfectly reproducible conda environment on your OS. However, it will document low-level, OS-specific packages, which means it won’t be portable to a different OS. If you need portability, you can instead write an environment.yml file manually. Here’s an example file:

name: cb
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip
  - pip:
    - tqdm==4.62.3

pip and conda packages are documented separately. Note that pip packages use == to pin a version, while conda packages use =. If you need to add dependencies to your project, change the environment.yml file, then run this command to update your conda environment:

(cb) $ conda env update --file environment.yml --prune
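The = versus == distinction is easy to get wrong when writing environment.yml by hand. Here is a minimal sketch that renders the example file above from plain Python dictionaries. The function name format_environment and its parameters are made up for illustration; this is not part of conda, and it only covers the simple layout shown in this section.

```python
def format_environment(name, channels, conda_deps, pip_deps):
    """Render a minimal environment.yml as text.

    conda_deps and pip_deps map package names to a version string,
    or to None for an unpinned package. (Hypothetical helper.)
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    for package, version in conda_deps.items():
        # conda pins use a single "=".
        lines.append(f"  - {package}={version}" if version else f"  - {package}")
    if pip_deps:
        # pip packages live in a nested list under the "pip:" key.
        lines.append("  - pip:")
        for package, version in pip_deps.items():
            # pip pins use "==".
            lines.append(f"    - {package}=={version}" if version else f"    - {package}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_environment(
        name="cb",
        channels=["conda-forge", "defaults"],
        conda_deps={"python": "3.8", "numpy": "1.21.2", "pip": None},
        pip_deps={"tqdm": "4.62.3"},
    ))
```

Writing the returned text to environment.yml should give the same file as the hand-written example. For anything more elaborate, editing the file directly is probably simpler.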

You can read more about creating reproducible environments in this Carpentries tutorial. You can also use the environment.yml file for this book’s repo as an inspiration.

Create a project skeleton

In many different programming frameworks - Ruby on Rails, React, etc. - people use a highly consistent directory structure from project to project, which makes it seamless to jump back into an old project. In Python, things are much less standardized. I went into a deep rabbit hole looking at different directory structures suggested by different projects. Here’s a consensus structure you can use as inspiration:

|-- data
|-- docs
|-- results
|-- scripts
|-- src
|-- tests
|-- .gitignore
|-- environment.yml
|-- README.md

Let’s look at each of these components in turn.

Folders

  • data: Where you put raw data for your project. You usually won’t sync this to source control, unless you use very small, text-based datasets (< 10 MB).

  • docs: Where you put documentation, including Markdown and reStructuredText (reST). Calling it docs makes it easy to publish documentation online through Github pages.

  • results: Where you put results, including checkpoints, hdf5 files, pickle files, as well as figures and tables. If these files are heavy, you won’t put these under source control.

  • scripts: Where you put scripts - Python and bash alike - as well as .ipynb notebooks.

  • src: Where you put reusable Python modules for your project. This is the kind of Python code that you import.

  • tests: Where you put tests for your code. We’ll cover testing in a later lesson.

You can create this project structure manually using mkdir on the command line:

$ mkdir {data,docs,results,scripts,src,tests}
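If your shell doesn't support brace expansion, the same folders can be created from Python. This is a minimal sketch using only the standard library. The function name create_skeleton is made up for illustration, and the demo writes into a temporary directory rather than a real project.

```python
from pathlib import Path
import tempfile

# Folder names taken from the skeleton described above.
FOLDERS = ("data", "docs", "results", "scripts", "src", "tests")


def create_skeleton(root):
    """Create the skeleton folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in FOLDERS:
        folder = root / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is written into a real project.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_skeleton(tmp):
            print(folder.name)
```

To use it on a real project, you would call create_skeleton with the project folder, for instance the codebook folder created earlier.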

Files

  • .gitignore contains a list of files that git should ignore.

  • README.md contains a description of your project, including installation instructions. This file is what people see by default when they navigate to your project on GitHub.

  • environment.yml contains the description of your conda environment.

.gitignore can be initialized to the following:

*.egg-info
data

A README.md should have already been created during the initial sync to Github. You can either create an environment.yml file manually or export an exhaustive list of the packages you are currently using:

$ conda env export > environment.yml

Install a project package

Warning

Creating a project package is slightly annoying, but the payoff is quite substantial: your project structure will be clean, you won’t need to change Python’s path, and your project will be pip installable.

You might notice a flaw in the preceding project structure. Let’s say you create a reusable lib.py under the src folder, with a function my_very_good_function. How would you reference that function in scripts/use_lib.py? This doesn’t work:

>>> from ..src.lib import my_very_good_function
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: attempted relative import with no known parent package

You need to tell Python where to look for your library code. You have two options: change your Python path, or create an installable package. I recommend the installable package route, but cover the Python path route first because you’re likely to encounter it in other projects.
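A minimal sketch of the Python path route, assuming the layout above with lib.py under src and the calling script under scripts. The helper name add_src_to_path is made up for illustration. It works by adding the src folder to sys.path, the list of locations Python searches when importing.

```python
import sys
from pathlib import Path


def add_src_to_path(project_root):
    """Put <project_root>/src at the front of sys.path, once."""
    src = str(Path(project_root).resolve() / "src")
    if src not in sys.path:
        sys.path.insert(0, src)
    return src


# In scripts/use_lib.py, the project root is one level above the script,
# so the import could look like this (assuming src/lib.py exists):
#
#     add_src_to_path(Path(__file__).resolve().parents[1])
#     from lib import my_very_good_function
```

This works, but every script has to repeat the path manipulation, and the import breaks if files move around. That fragility is the main reason to prefer an installable package.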

Use the true-neutral cookiecutter

If doing all this for every new project sounds like a lot of work, you can save yourself some time using the true neutral cookiecutter, which creates the project skeleton outlined above automatically. cookiecutter is a Python tool which generates project folders from templates. You can install it in the base conda environment with:

(base) ~/Documents $ pip install cookiecutter

To create the codebook folder with all its subfolders and setup.py, use the following:

(base) ~/Documents $ cookiecutter gh:patrickmineault/true-neutral-cookiecutter

This will create an instance of the true-neutral-cookiecutter project skeleton, which is hosted on my personal GitHub. Follow the prompts and it will create the folder structure above, including the setup file. Next, pip install the package you’ve created for yourself, and sync to your own remote repository, following the GitHub instructions.

Discussion

Using structured projects linked to git will help your long-term memory. You will be able to instantly understand how files are laid out months after you’ve last worked on that project. Using a virtual environment will allow you to recreate that environment in the far future. And git will give you a time machine to work with.

Writing for your future self has an added bonus: it can make it easier for other people to use your project. Consider this: everything at Google is in one giant repository with billions of lines of code. As a new software engineer, you’re invited to commit to that repository during your first week. Because everything is organized according to strict conventions, it’s not as terrifying as it sounds to jump in. Structure is what enables sustainable growth.

5-minute exercise

Create an empty project with the true-neutral cookiecutter.