Harder, Better, Faster:

Case Studies in Reproducible Workflows

Kathryn Huff

NYU Reproducibility Symposium

May 03, 2016

  • Physics, University of Chicago
  • Nuclear Engineering and Engineering Physics, University of Wisconsin - Madison
  • Nuclear Engineering, University of California, Berkeley
  • Nuclear, Plasma, and Radiological Engineering, University of Illinois, Urbana-Champaign
[Logos: SciPy, Software Carpentry, THW, book, PyNE, Cyclus, PyRK]

  • Case Study Book Concept
  • Case Study Contributions
  • Lessons Learned
  • Next Steps!
[Figures: reproducibility mission; workflow diagrams by Justin Kitzes and Katy Huff; basic workflow]

Reproducibility and Open Science Conference

May 21-22, 2015

  • Three days
  • Invitation Only
  • Case Studies, Education, Self-assessment
  • https://github.com/BIDS/repro-conf
[Figures: workflow diagrams by Jan Gukelberger and Andy Krause]
  • Preface (Stark)
  • Introduction (Kitzes)
  • Assessing the Reproducibility of a Research Project (Rokem, Marwick, Staneva)
  • The Basic Reproducible Workflow Template (Kitzes, Turek)
  • Introducing the Case Studies (Imamoglu, Turek)
  • PART 1: High-Level Case Studies
  • PART 2: Low-Level Case Studies
  • Lessons Learned (Huff et al.)
  • Supporting Reproducible Science (Ram, Marwick)
  • Glossary of Terms and Techniques (Rokem, Chirigati)



Justin Kitzes, Fatma Imamoglu, Daniel Turek

Supplementary Chapter Authors

[Logos: BIDS, UW eScience Institute, NYU Center for Data Science]
  • Philip Stark
  • Justin Kitzes
  • Daniel Turek
  • Fatma Imamoglu
  • Kathryn Huff
  • Karthik Ram
  • Ariel Rokem
  • Ben Marwick
  • Valentina Staneva

Fernando Chirigati

Case Study Chapter Contributors!

  • Mary K. Askren
  • Anthony Arendt
  • Lorena A. Barba
  • Pablo Barberá
  • Kyle Barbary
  • Carl Boettiger
  • You-Wei Cheah
  • Garret Christensen
  • Devarshi Ghoshal
  • Chris Gorgolewski
  • Jan Gukelberger
  • Chris Holdgraf
  • Konrad Hinsen
  • David Holland
  • Chris Hartgerink
  • Kathryn Huff
  • Fatma Imamoglu
  • Justin Kitzes
  • Natalie Koh
  • Andy Krause
  • Randy LeVeque
  • Tara Madhyastha
  • José Manuel Magallanes
  • Ben Marwick
  • Olivier Mesnard
  • K. Jarrod Millman
  • K. A. S. Mislan
  • Kellie Ottoboni
  • Gilberto Pastorello
  • Russell Poldrack
  • Karthik Ram
  • Ariel Rokem
  • Rachel Slaybaugh
  • Valentina Staneva
  • Philip Stark
  • Daniel Turek
  • Daniela Ushizima
  • Zhao Zhang
Lessons Learned

  • Pain Points
  • Recommendations from the Authors
  • A Little Data
  • Needs

Pain Points

  • People and Skills
  • Dependencies, Build Systems, and Packaging
  • Hardware Access
  • Testing
  • Publishing
  • Data Versioning
  • Time and Incentives
  • Data Restrictions
Incentives

  • verifiability
  • collaboration
  • efficiency
  • extensibility
  • "focus on science"
  • "forced planning"
  • "safety for evolution"
Recommendations

  • version control your code
  • open your data
  • automate everywhere possible
  • document your processes
  • test everything
  • use free and open tools
Recommendations: Continued

  • avoid excessive dependencies
  • when dependencies can't be avoided, package their installation
  • host code on a collaborative platform (e.g. GitHub)
  • get a Digital Object Identifier for your data and code
  • avoid spreadsheets; plain text data is preferred ("timeless," even)
  • explicitly set pseudorandom number generator seeds
  • workflow and provenance frameworks may be too clunky for most scientists
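The seed recommendation above can be illustrated with the standard library's PRNG; a minimal sketch, assuming a Python analysis (the function name and draw count are illustrative, not from any case study):

```python
# Fixing the pseudorandom seed makes a stochastic analysis repeatable:
# two runs with the same seed draw the identical "random" sequence.
import random

def sample_run(seed, n=5):
    rng = random.Random(seed)   # seeded, isolated generator (no global state)
    return [rng.random() for _ in range(n)]

first = sample_run(seed=42)
second = sample_run(seed=42)
assert first == second          # identical draws -> reproducible result
```

Using an instance (`random.Random(seed)`) rather than the module-level `random.seed()` keeps the analysis reproducible even if other imported code also draws random numbers.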
Recommendations: Outliers

> ... in our estimation, if someone was to try to reproduce our research it would probably be more natural for them to write their own scripts as this has the additional benefit that they might not fall into any error we may have accidentally introduced in our scripts.
Recommendations: Outliers

> Scientific funding and the number of scientists available to do the work is finite. Therefore not every scientific result can, or should be reproduced.
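The "test everything" recommendation can be as lightweight as a few assertion-based checks run on every change; a minimal sketch (the physics function and its expected values here are illustrative, not drawn from any case study):

```python
# Minimal automated check: a unit test guarding one small analysis step.
# decay_fraction is an illustrative example, not code from the book.
import math

def decay_fraction(half_life, elapsed):
    """Fraction of a sample remaining after `elapsed` time units."""
    return 0.5 ** (elapsed / half_life)

def test_decay_fraction():
    # After one half-life, exactly half the sample should remain.
    assert math.isclose(decay_fraction(10.0, 10.0), 0.5)
    # After zero elapsed time, everything remains.
    assert math.isclose(decay_fraction(10.0, 0.0), 1.0)

if __name__ == "__main__":
    test_decay_fraction()
    print("all checks passed")
```

Run under a test runner (e.g. nose or pytest) and wired into a CI service such as Travis CI, both of which appear in the tool list from the case studies, checks like this execute automatically on every commit.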

Emergent Needs

  • Better education of scientists in more reproducibility-robust tools.
  • Widely used tools should be more reproducible, so that the lowest-common-denominator tool does not undermine reproducibility.
  • Improved configuration and build systems for portably packaging software, data, and analysis workflows.
  • Reproducibility at scale for high performance computing.
  • Standardized hardware configurations and experimental procedures for limited-availability experimental apparatuses.
  • Better understanding of why researchers don't respond to the delayed incentives of unit testing as a practice.
  • Greater adoption of unit testing, irrespective of programming language.
  • Broader community adoption of publication formats that allow parallel editing (i.e., any plain text markup language that can be version controlled).
  • Greater scientific adoption of new industry-led tools and platforms for data storage, versioning, and management.
  • Increased community recognition of the benefits of reproducibility.
  • Incentive systems where reproducibility is not self-incentivizing.
  • Standards for scrubbed and representational data, so that analyses can be investigated separately from restricted data sets.
  • Community adoption of file format standards within some domains.
  • Domain standards that translate well outside their own scientific communities.

Social Science Volume

Collecting Case Studies Spring/Summer 2016

  • Same format: 1,500-2,000 words plus one diagram
  • Bad Hessian blog: http://www.badhessian.org
  • GitHub repo: http://github.com/BIDS/ss-repro-case-public
  • Email Garret Christensen (garret@berkeley.edu) or Cyrus Dioun (dioun@berkeley.edu)


  • Justin Kitzes
  • Fatma Imamoglu
  • Daniel Turek
  • Chapter Authors
  • Case Study Authors
  • Reproducibility Working Group



Katy Huff

"Harder, Better, Faster: Case Studies in Reproducible Workflows" by Kathryn Huff is licensed under a Creative Commons Attribution 4.0 International License.
Based on a work at http://katyhuff.github.io/2016-05-03-nyu.
Tools mentioned across the case studies: connectome workbench, stata, zotero, travisci, vistrails, osf, testtools, nipy, coverage/coveralls, ferret, cmake, flickr api, amazon s3, nose, readthedocs, pypi, jira, jenkins, ec2 s3, sweave, shell, jupyter, sql, dataverse, rnw, spark, paraview, data science toolkit, overleaf, virtualenv, crossref, spyder, markdown, dropbox, scikit-image, awk, netcdf, petsc, figshare, sharelatex, pandoc, ibamr, dcvs, twitter api, mendeley, word, d3, beautiful soup, sed, devtools, activepapers, private git repo, cython, outreg2, rsync, zenodo, vagrant, c