Alexa Fredston
5/26/2020
Discuss how to write really reproducible code that enables others to:
Do better science in less time! (Lowndes et al. 2017)
Assuming you are familiar with R, RStudio, GitHub
More lecturing and less demo-ing today
Using my own code as an example where possible
Please speak up or comment in the chat window
I have no CS background and learned everything I know from you and from Twitter
Tools, tips, and tricks for:
Anticipate that your code and data will be published with every completed project/paper
unofficial, Alexa-determined ranking
From a real software engineer: “Much of the analysis I’ve done had a script to pull down the original data, transformed and stored it in a canonical format alongside the code (if small) and/or stored it somewhere long term to mirror. That way if the original went down the code wouldn’t break.”
Allows others to reproduce your results directly from scripts without requiring any additional data files
R packages that let you communicate with servers and access data directly from R, often through rOpenSci
Can also write your own API in R (have you done this?)
Check regularly if these are available for major datasets you use
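Wrap downloads in if (!file.exists(...)) statements so data are fetched only when no local copy exists (use if/else rather than the vectorized ifelse())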
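A minimal sketch of the download-once pattern described above. The helper name `fetch_if_missing()`, the URL, and the file path are all hypothetical and not from the talk; swap in your own data source.

```r
# Download a file only if a local copy does not already exist.
# `url` and `dest` are placeholders for your own data source and path.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # Create the destination folder if needed, then download.
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  # Return the local path either way, so callers can read from it.
  dest
}

# Hypothetical usage:
# raw_path <- fetch_if_missing("https://example.org/data.csv", "data/raw/data.csv")
# raw <- read.csv(raw_path)
```

Because the function returns the local path whether or not it downloaded anything, the rest of the script reads from the same location on the first run and on every later run.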
Forces you to organize your data and document it in a way that facilitates reproducible science
Likely to last longer than project-specific data links, repositories, or R packages
Lots of resources at NCEAS including the Reproducible Research course
Use flexible file formats that don’t require certain software (like .xlsx*) or programming languages (.Rdata, .rds)
Write metadata conforming to field standards and package with the data
Consider wrapping in an R package (requires more maintenance, less accessible to non-R users)
Try to host data at a stable link
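As a rough illustration of the points above, the sketch below writes a dataset to CSV and stores a simple data dictionary next to it. The data values, column names, and file names are made up, and a plain data dictionary is only a stand-in for whatever metadata standard your field uses.

```r
# Toy data frame standing in for a cleaned dataset (values are made up).
dat <- data.frame(site = c("A", "B"), temp_c = c(12.5, 14.1))

# Write to a temporary folder here; a real project would use its data folder.
out_dir <- file.path(tempdir(), "data_package")
dir.create(out_dir, showWarnings = FALSE)

# CSV is readable without R or any proprietary software.
write.csv(dat, file.path(out_dir, "temperature.csv"), row.names = FALSE)

# A simple column-level data dictionary stored alongside the data.
meta <- data.frame(
  column = c("site", "temp_c"),
  description = c("Site identifier", "Water temperature"),
  unit = c(NA, "degrees Celsius")
)
write.csv(meta, file.path(out_dir, "temperature_metadata.csv"), row.names = FALSE)
```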
My philosophy:
Try to spot patterns (operations you perform over and over) and pull out into functions
Simple functions can be defined within a script, and complex functions / functions used over many scripts can live in a “functions” folder
Makes code much easier to annotate and visualize
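One possible example of pulling a repeated operation into a function. The function `standardize()`, the toy data, and the `functions/` file path are illustrative assumptions, not code from the talk.

```r
# Repeated operation: standardizing a numeric vector (z-score).
# Instead of copy-pasting the formula for each column, define it once.
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Toy data (values are made up).
dat <- data.frame(length_cm = c(10, 12, 14), mass_g = c(100, 150, 200))

# Apply the same function to every column.
dat_std <- as.data.frame(lapply(dat, standardize))

# Functions shared across scripts could live in their own file
# and be loaded with source():
# source("functions/standardize.R")  # hypothetical path
```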
Pros:
Cons:
To use git and prevent problems: https://happygitwithr.com/
To deal with problems like merge conflicts: https://ohshitgit.com/
Every project begins with a new GitHub repository and a new R project filed under ~/github/repo_name
Projects allow you to:
Organize repositories consistently
Describe what each script does, somewhere (I use readme.md)
Use here() to manage file paths
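A short sketch of path handling with `here()`, assuming the here package is installed. The folder and file names are hypothetical.

```r
library(here)  # assumes the here package is installed

# here() builds a path from the project root, so the same code works
# on any machine, regardless of the current working directory.
data_path <- here("data", "raw", "survey.csv")  # hypothetical file

# Read only if the file is present, since this path is illustrative.
if (file.exists(data_path)) {
  survey <- read.csv(data_path)
}
```

Compared with `setwd()` or absolute paths, this keeps scripts portable across collaborators' machines as long as they open the same project.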
GitHub issues are designed for project management
Useful whether or not everyone codes
Agree on shared practices for data storage, script development, communication, etc. at the start of a project, and document them in a Google Doc, GitHub Wiki, etc.
Same as editing writing: find out what kind of feedback they want, be constructive, and be nice
Check if they have a contributing.md or “how we work” document
Fix bugs if you see any
Phrase code review comments / pull requests as questions: Did you consider ____? Do you think it would be more readable if we moved ____? Is the speedup from changing ____ worth the extra 5 lines of code?
Extra important to have good communication and division of tasks (see “how we work”)
Can still use git for version control of scripts in ~/github/repo_name (possibly from the command line)
Write out and read in data in flexible formats that translate across the languages you’re using
Other ideas?
Many scientists feel strongly about using ArcGIS, MATLAB, etc.
Best practices about version control and data management all apply
Half a reproducible project is better than none!
For collaborators who code, but aren’t following best practices:
For collaborators who don’t really code:
Other thoughts?
How to get collaborators who operate on their own machine/server to share code and data
How to get collaborators to stop emailing code back and forth
Any other questions?
Other useful links:
Special thanks for all the great suggestions via Twitter