Workflows and best practices for collaborative coding

Alexa Fredston

5/26/2020

Goals for today

Discuss how to write really reproducible code that enables others to:

Identify decision points in your code
Understand your methods
Regenerate your results

Do better science in less time! (Lowndes et al. 2017)

Housekeeping and disclaimers

Assuming you are familiar with R, RStudio, and GitHub

More lecturing and less demo-ing today

Using my own code as an example where possible

Please speak up or comment in the chat window

I have no CS background and learned everything I know from you and from Twitter

Outline

Tools, tips, and tricks for:

Organizing your data
Coding workflows and best practices
Repositories / projects / environments
Project management
Collaborating across skill levels and programming languages

1. Organizing your data

Anticipate that your code and data will be published with every completed project/paper

OK: package data with its own metadata in a universal file format
Better: deposit data in an archival repository with metadata
Best: include code to extract data from an online source
Divine: do all three!

unofficial, Alexa-determined ranking

From a real software engineer: “Much of the analysis I’ve done had a script to pull down the original data, transformed and stored it in a canonical format alongside the code (if small) and/or stored it somewhere long term to mirror. That way if the original went down the code wouldn’t break.”

1.1 Data: Web scraping / APIs

Allows others to reproduce your results directly from scripts without requiring any additional data files

Many R packages let you communicate with servers and access data directly from R; a lot of them are available through rOpenSci (recent roundup)

Can also write your own API in R (have you done this?)

Check regularly if these are available for major datasets you use

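Wrap downloads in if (!file.exists(...)) checks so data are only fetched when there is no local copy (use if, not the vectorized ifelse(), for this kind of control flow)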
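A minimal sketch of the download-only-if-missing pattern, using base R only. The function name fetch_if_missing, the URL, and the file path are hypothetical placeholders; swap in whatever package or API client serves your dataset.

```r
# Download a file only if there is no local copy yet.
# `downloader` defaults to base R's download.file(); it is an argument so a
# different fetch function (e.g. an API client) can be swapped in.
fetch_if_missing <- function(url, dest, downloader = utils::download.file) {
  if (!file.exists(dest)) {
    # Create the destination folder if needed, then fetch the file
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    downloader(url, dest)
  }
  invisible(dest)
}

# Hypothetical usage; the URL and path are placeholders, not real data sources:
# fetch_if_missing("https://example.org/survey.csv", "data/raw/survey.csv")
# survey <- read.csv("data/raw/survey.csv")
```

Because the check happens on every run, collaborators can source the script from scratch and repeat runs skip the download.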

1.2 Data: Depositing in archives

Forces you to organize your data and document it in a way that facilitates reproducible science

Includes using Ecological Metadata Language

Likely to last longer than project-specific data links, repositories, or R packages

Knowledge Network for Biocomplexity, KNB
FigShare

Lots of resources at NCEAS including the Reproducible Research course

1.3 Data: Packaging it yourself

Use flexible file formats that don’t require specific software (like .xlsx*) or a specific programming language (.RData, .rds)

.csv for tabular data, .nc (NetCDF) for gridded spatial data

Write metadata conforming to field standards and package with the data

Consider wrapping in an R package (requires more maintenance, less accessible to non-R users)

*Although apparently .xlsx files are really just zipped collections of .xml files that preserve data types!

Try to host data at a stable link
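One way to package data with its own metadata, sketched in base R: write the data as .csv and save a small data dictionary beside it. This is a lightweight stand-in, not a formal metadata standard such as EML; the function name write_with_metadata, the "_metadata.csv" suffix, and the example data are all made up for illustration.

```r
# Write a data frame as .csv next to a plain-text data dictionary.
# `descriptions` is a named character vector: one description per column.
write_with_metadata <- function(df, path, descriptions) {
  stopifnot(grepl("\\.csv$", path),
            setequal(names(df), names(descriptions)))
  write.csv(df, path, row.names = FALSE)
  dictionary <- data.frame(
    column      = names(df),
    class       = unname(vapply(df, function(x) class(x)[1], character(1))),
    description = unname(descriptions[names(df)]),
    stringsAsFactors = FALSE
  )
  dict_path <- sub("\\.csv$", "_metadata.csv", path)
  write.csv(dictionary, dict_path, row.names = FALSE)
  invisible(dict_path)
}

# Made-up example data, written to a temporary folder
catch <- data.frame(year = c(2018L, 2019L), biomass_kg = c(12.5, 9.8))
out <- file.path(tempdir(), "catch.csv")
write_with_metadata(
  catch, out,
  c(year = "Survey year", biomass_kg = "Total biomass caught, in kilograms")
)
```

Both files are plain text, so collaborators working in Python, MATLAB, or a spreadsheet can read the data and the column descriptions without R.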

2.1 Coding: script organization

My philosophy:

1 script, ~1 task (no more than one section of Methods)
List decision points / user choices at the top so it’s obvious what they are

A bad script

A better script
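As an illustration of this philosophy (not the script shown on the slide), here is a sketch of a one-task script with its decision points listed at the top. The script name, variable names, and simulated data are all hypothetical.

```r
# 01_fit_trend.R  (hypothetical script name)
# Task: fit a linear trend to annual biomass (one section of Methods).

# ---- Decision points / user choices ----
min_year  <- 2000   # earliest year included in the analysis
max_year  <- 2019   # latest year included in the analysis
log_scale <- TRUE   # model log(biomass) instead of raw biomass?

# ---- Data (simulated here so the sketch runs on its own) ----
set.seed(42)
dat <- data.frame(year = 1990:2019)
dat$biomass <- exp(0.02 * (dat$year - 1990) + rnorm(nrow(dat), sd = 0.1))

# ---- Analysis: uses only the choices defined above ----
dat_sub <- dat[dat$year >= min_year & dat$year <= max_year, ]
dat_sub$response <- if (log_scale) log(dat_sub$biomass) else dat_sub$biomass
fit <- lm(response ~ year, data = dat_sub)
coef(fit)
```

A reader who wants to test a different year range or an untransformed response only has to edit the first block, not hunt for hard-coded values further down.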

2.2 Coding: don’t repeat yourself

Try to spot patterns (operations you perform over and over) and pull out into functions

Data import (can iterate over list of file names)
Data cleaning/tidying
Applying many models (can vary data, model, or both)
Generating and saving plots

Simple functions can be defined within a script, and complex functions / functions used over many scripts can live in a “functions” folder

Apply your functions using apply, purrr, or for loops
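A small sketch of the "applying many models" case, holding the data fixed and varying the model. It uses base R and the built-in mtcars dataset so it runs anywhere; the function name fit_model and the two formulas are just examples.

```r
# A small function that captures a repeated operation: fit one model formula.
fit_model <- function(formula, data) {
  lm(formula, data = data)
}

# Vary the model while holding the data fixed (mtcars ships with R).
formulas <- list(
  weight_only = mpg ~ wt,
  weight_hp   = mpg ~ wt + hp
)

# lapply() calls fit_model() once per formula and returns a named list;
# purrr::map() could be used in the same way.
fits <- lapply(formulas, fit_model, data = mtcars)

# Collect one summary number per model instead of copy-pasting code.
r_squared <- vapply(fits, function(m) summary(m)$r.squared, numeric(1))
r_squared
```

Adding a third model is then a one-line change to the formulas list rather than another copied block of code.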

2.3 Coding: RMarkdown

Makes code much easier to annotate and visualize

Pros:

Great for describing a workflow or reporting out your findings
I most frequently use this to generate PDFs and share with collaborators
Can streamline analysis-to-paper pipeline, if you write papers with bookdown (anyone?)

Cons:

Clunky to do large analyses in .Rmd scripts

2.4 Coding: workflow

Try to pause coding when a task is complete
Leave a to-do list for others (and yourself!) of what, if anything, still needs to happen for that script to be functional
pull + commit + pull + push frequently
Restart R frequently (ensures your script loads every package and creates every object it needs, rather than relying on leftovers in your session)

3.1 Use version control

To use git and prevent problems: https://happygitwithr.com/

To deal with problems like merge conflicts: https://ohshitgit.com/

3.2 Use projects, seriously

Every project begins with a new GitHub repository and a new R project filed under ~/github/repo_name

Projects allow you to:

Work safely on different scripts in different windows
Develop custom environments (deal with versioning issues), e.g. with renv

3.3 Project/repository organization

Organize repositories consistently

Describe what each script does, somewhere (I use readme.md)

Use here() to manage file paths
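A short sketch of how here::here() can replace setwd() and hard-coded absolute paths. It assumes the here package is installed and that the project has "data/raw" and "figures" folders; those folder names and the helper function names are hypothetical.

```r
# Build file paths relative to the project root instead of calling setwd().
# here::here() locates the root (for example, the folder holding the .Rproj
# file or the .git directory) and joins the pieces into one path.
# The folder names "data", "raw", and "figures" are hypothetical.
raw_data_file <- function(filename) here::here("data", "raw", filename)
figure_file   <- function(filename) here::here("figures", filename)

# Usage inside any script in the project, whatever the working directory:
# surveys <- read.csv(raw_data_file("surveys.csv"))
# png(figure_file("biomass_trend.png"))
```

Because the paths are resolved from the project root, the same script runs unchanged on a collaborator's machine once they clone the repository.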

4.1 Project management: GitHub issues

GitHub Issues are designed for project management (example)

Useful whether or not everyone codes

4.2 Project management: “how we work”

Agree on shared practices for data storage, script development, communication, etc. at the start of a project, and document them in a Google Doc, GitHub Wiki, etc. (example)

5.1 Collaboration: editing other people’s code

Same as editing writing: find out what kind of feedback they want, be constructive, and be nice

Check if they have a contributing.md or “how we work” document

Fix bugs if you see any

Phrase code review comments / pull requests as questions: did you consider ____? Do you think it would be more readable if we moved ____? Does changing ____ to make the script faster make the extra 5 lines of code worth it?

5.2 Collaboration: different open-source programming languages

Extra important to have good communication and division of tasks (see “how we work”)

Can still use git for version control of scripts in ~/github/repo_name (possibly from the command line)

Write out and read in data in flexible formats that translate across the languages you’re using

reticulate lets you run Python code from RStudio (still need to understand Python a little bit)
Databricks

Other ideas?

5.3 Collaboration: proprietary software

Many scientists feel strongly about using ArcGIS, MATLAB, etc.

Best practices about version control and data management all apply

Half a reproducible project is better than none!

5.4 Collaboration: non-R / non-GitHub people

For collaborators who code, but aren’t following best practices:

workflowr
Offer to help set up and maintain reproducible workflow for them

For collaborators who don’t really code:

Use GitHub for communication (Issues) and sharing data and figures
RMarkdown/Shiny to show code and results
RMarkdown interface to Microsoft Word (has anyone tried this?)
Google Docs > Microsoft Word
People are worth more to projects than tools

Other thoughts?

Questions I couldn’t answer

How to get collaborators who operate on their own machine/server to share code and data

How to get collaborators to stop emailing code back and forth

Any other questions?