Workflows and best practices for collaborative coding

Alexa Fredston

5/26/2020

Who’s Alexa?

Goals for today

Discuss how to write really reproducible code that enables others to:

Do better science in less time! (Lowndes et al. 2017)

Housekeeping and disclaimers

Assuming you are familiar with R, RStudio, GitHub

More lecturing and less demo-ing today

Using my own code as an example where possible

Please speak up or comment in the chat window

I have no CS background and learned everything I know from you and from Twitter

Outline

Tools, tips, and tricks for:

  1. Organizing your data
  2. Coding workflows and best practices
  3. Repositories / projects / environments
  4. Project management
  5. Collaborating across skill levels and programming languages

1. Organizing your data

Anticipate that your code and data will be published with every completed project/paper

unofficial, Alexa-determined ranking

From a real software engineer: “Much of the analysis I’ve done had a script to pull down the original data, transformed and stored it in a canonical format alongside the code (if small) and/or stored it somewhere long term to mirror. That way if the original went down the code wouldn’t break.”

1.1 Data: Web scraping / APIs

Allows others to reproduce your results directly from scripts without requiring any additional data files

R packages that allow you to communicate with servers and access data from R, often through rOpenSci (recent roundup)

Can also write your own API in R (have you done this?)

Check regularly if these are available for major datasets you use

Wrap download calls in if (file.exists()) / else statements, so data are only fetched when no local copy exists
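
A minimal sketch of that caching pattern, assuming a plain if / else around file.exists() is what is meant (rather than the vectorized ifelse() function). The helper name get_cached_data, the file path, and fake_fetch are all hypothetical; in a real script the fetch function would call an API client or download.file().

```r
# Hypothetical helper: read a local copy if one exists, otherwise fetch and cache it
get_cached_data <- function(path, fetch) {
  if (file.exists(path)) {
    # Local copy found: skip the download entirely
    dat <- read.csv(path, stringsAsFactors = FALSE)
  } else {
    # No local copy: call the user-supplied fetch function, then cache the result
    dat <- fetch()
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    write.csv(dat, path, row.names = FALSE)
  }
  dat
}

# Stand-in for a real API call, so the sketch runs offline
fake_fetch <- function() {
  data.frame(site = c("A", "B"), count = c(3, 5), stringsAsFactors = FALSE)
}

cache_file <- file.path(tempdir(), "example_data", "counts.csv")
first  <- get_cached_data(cache_file, fake_fetch)  # fetches (or reuses an existing cache)
second <- get_cached_data(cache_file, fake_fetch)  # reads the cached file
```

Keeping the fetch step behind a function argument means the same wrapper works for any data source, and collaborators without network access can still run the script once the cached file exists.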

1.2 Data: Depositing in archives

Forces you to organize your data and document it in a way that facilitates reproducible science

Likely to last longer than project-specific data links, repositories, or R packages

Lots of resources at NCEAS including the Reproducible Research course

1.3 Data: Packaging it yourself

Use flexible file formats that don’t require certain software (like .xlsx*) or programming languages (.Rdata, .rds)

Write metadata conforming to field standards and package with the data

Consider wrapping in an R package (requires more maintenance, less accessible to non-R users)

Try to host data at a stable link
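
One way the points above might look in practice: a rough sketch that writes a dataset as plain-text CSV together with a simple data dictionary. The data, column descriptions, and file names are made up for illustration, and the dictionary here is not a formal field metadata standard, just a starting point.

```r
# Example data (made up for illustration)
dat <- data.frame(
  site  = c("A", "B"),
  date  = as.Date(c("2020-05-01", "2020-05-02")),
  count = c(3L, 5L)
)

out_dir <- file.path(tempdir(), "data_package")
dir.create(out_dir, showWarnings = FALSE)

# Plain-text CSV can be opened from R, Python, a spreadsheet, or a text editor
write.csv(dat, file.path(out_dir, "counts.csv"), row.names = FALSE)

# A simple column-level data dictionary, shipped alongside the data
metadata <- data.frame(
  column      = names(dat),
  description = c("Site identifier",
                  "Sampling date (YYYY-MM-DD)",
                  "Individuals counted"),
  stringsAsFactors = FALSE
)
write.csv(metadata, file.path(out_dir, "counts_metadata.csv"), row.names = FALSE)

# Round trip: the data can be read back without any R-specific format
check <- read.csv(file.path(out_dir, "counts.csv"))
```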

2.1 Coding: script organization

My philosophy:

A bad script

A better script

2.2 Coding: don’t repeat yourself

Try to spot patterns (operations you perform over and over) and pull out into functions

Simple functions can be defined within a script, and complex functions / functions used over many scripts can live in a “functions” folder

Apply your functions using apply, purrr, or for loops
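
A small sketch of the idea, using a hypothetical standardize() helper and made-up data: the repeated operation is pulled into one function, then applied to every column with either the apply family or a for loop.

```r
# Hypothetical helper: rescale a numeric vector to mean 0 and standard deviation 1
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Made-up example data
dat <- data.frame(temp = c(10, 12, 14), depth = c(5, 50, 500))

# Option 1: apply family (lapply returns a list, one element per column)
scaled_apply <- as.data.frame(lapply(dat, standardize))

# Option 2: for loop over column names
scaled_loop <- dat
for (col in names(dat)) {
  scaled_loop[[col]] <- standardize(dat[[col]])
}

# Option 3, if you use purrr: purrr::map(dat, standardize)
# returns the same list as lapply() above
```

All three options call the same function, so a bug fix or a change to the scaling only has to be made in one place.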

2.3 Coding: RMarkdown

Makes code much easier to annotate and visualize

Pros:

Cons:

2.4 Coding: workflow

3.1 Use version control

To use git and prevent problems: https://happygitwithr.com/

To deal with problems like merge conflicts: https://ohshitgit.com/

3.2 Use projects, seriously

Every project begins with a new GitHub repository and a new R project filed under ~/github/repo_name

Projects allow you to:

3.3 Project/repository organization

Organize repositories consistently

Describe what each script does, somewhere (I use readme.md)

Use here() to manage file paths
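
A brief sketch of here() in use, assuming the here package is installed and the script is run from inside an R project. The folder and file names are hypothetical.

```r
library(here)  # assumes the 'here' package is installed

# Build paths relative to the project root instead of hard-coding
# an absolute path that only exists on one person's machine
raw_path <- here("data", "raw", "survey.csv")
fig_path <- here("figures", "fig1.png")

# Then use the paths as usual, e.g. read.csv(raw_path)
```

Because the path is resolved from the project root, the same line works on any collaborator's computer and from any subfolder of the repository.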

4.1 Project management: GitHub issues

GitHub issues are designed for project management (example)

Useful whether or not everyone codes

4.2 Project management: “how we work”

Agree on shared practices for data storage, script development, communication, etc. at the start of a project, and document them in a Google Doc, GitHub Wiki, etc. (example)

5.1 Collaboration: editing other people’s code

Same as editing writing: find out what kind of feedback they want, be constructive, and be nice

Check if they have a contributing.md or “how we work” document

Fix bugs if you see any

Phrase code review comments / pull requests as questions: did you consider ____? Do you think it would be more readable if we moved ____? Does changing ____ to make the script faster make the extra 5 lines of code worth it?

5.2 Collaboration: different open-source programming languages

Extra important to have good communication and division of tasks (see “how we work”)

Can still use git for version control of scripts in ~/github/repo_name (possibly from the command line)

Write out and read in data in flexible formats that translate across the languages you’re using

Other ideas?

5.3 Collaboration: proprietary software

Many scientists feel strongly about using ArcGIS, MATLAB, etc.

Best practices about version control and data management all apply

Half a reproducible project is better than none!

5.4 Collaboration: non-R / non-GitHub people

For collaborators who code, but aren’t following best practices:

For collaborators who don’t really code:

Other thoughts?

Questions I couldn’t answer

How to get collaborators who operate on their own machine/server to share code and data

How to get collaborators to stop emailing code back and forth

Any other questions?

Thanks!

Other useful links:

Special thanks for all the great suggestions via Twitter