The Data Analysis Tools Series

Archival Data Repositories

2018-11-19T00:00:00+00:00

Welcome!

Speakers

Joshua Quan - Data Librarian @ UC Berkeley Library, D-Lab

Content

This DATS session will introduce archival data repositories that researchers can use to discover datasets or to deposit their own data and code for long-term archiving and discovery by others. We will cover Dataverse, Dash/Dryad, and Zenodo.

Objectives:

Why sharing datasets is easier with a repository designed for archiving and discovery
Learn a little about Dataverse, Zenodo, Dash/Dryad, and OSF
Searching for Data in Repositories
APIs + Tools to work with repositories

Data Repository Defined

From Registry of Research Data Repositories: “subtype of a sustainable information infrastructure which provides long-term storage and access to research data that is the basis for a scholarly publication. Research data means information objects generated by scholarly projects for example through experiments, measurements, surveys or interviews.”

…So it’s a place to put your data and analysis scripts that will be accessible beyond the life of a research project, grant, or individual career.

A minimum rationale for depositing/sharing…

Sharing your data gives you credit for your work that everyone can see
Your hard work will persist and be discoverable

… fulfills the most basic components of F.A.I.R principles for scientific data

Things to Consider when choosing a Repository

Reputation

Is the repository endorsed by a funding agency, scholarly journal, professional society, library, etc?
Is it listed in the Registry of Research Data Repositories?

Sustainability

Is there evidence that the repository will be around in five years? Ten years?

Is the owner/manager of the content reliable?

Visibility

One of the primary reasons to deposit your data in a repository is to obtain a unique identifier that others can use to cite your data. This service will increase the visibility of your data within the scholarly literature and allows researchers to find it later on.
Ensure your data repository offers a DOI (digital object identifier), handle, or another unique identifier.

Usability

The usability of a data repository is also important in ensuring that others will be able to access your data. If your peers are unable to find and download your data it will limit the effectiveness of sharing your data.
A usable data repository should allow for users to easily upload, download, and cite data sets.

Features

Some data repositories have really great features like integrations with Open Science Framework, GitHub, or other commercial storage solutions. While these features may not be the keystones to providing long-term access to your data, they can help you share your data more frequently and effectively.
Comparative Overview of Features
You’ll want to review the upload and storage limits. Some repositories offer limited free storage before a fee is charged. Be sure to look over each data repository’s features and compare them with comparable services.

Formats

Be sure to take a look at the repository’s documentation to ensure it can store the data you’ve generated
Does the repository provide a way to preview data/scripts? e.g., rendering .ipynb in GitHub

Rights

Take time out to read the terms of use and to understand what permissions you’re giving the data repository.
For instance, does your data repository use common licensing agreements (Creative Commons) that will help others understand what they can and cannot do with your data?

General vs. Subject Specific Repositories

A “general” data repository is subject independent and will have data from many fields. General data repositories are often well-known solutions with large user communities.
General repositories are great places to store all your data because they tend to have robust features (like simple GitHub integration) and strong institutional backing, and they are indexed by search engines.
The downside of general repositories is that because there is a lot of everything, users might have more difficulty finding your work.

General repositories

Harvard Dataverse: Harvard’s Dataverse is both a platform for institutions and a data repository. Backed and developed by Harvard’s IQSS, Libraries, and Information Technology, Dataverse has 22 installations with over 48,000 datasets, and 2 million downloads.

-some cool ideas floating around

UC Dash is an open-source, self-service toolkit for managing, openly publishing, and effectively describing data for access and reuse. Dash features geolocation metadata, ORCID, DOI, and FundRef identifiers, and generates a citation for all of your datasets. Additionally, Dash allows you to set a timed-release of data while undergoing peer-review.
Zenodo: Funded by CERN, OpenAIRE, and Horizon 2020
- Zenodo accepts 50GB per dataset and integrates nicely with GitHub. While Zenodo doesn’t seem to detail its download numbers like other services, it is partnered with CERN, which stores more than 100PB (petabytes) of data.
- Starting to archive some of the lessons/modules created in the Division of Data Sciences
Open Science Framework integrates with major storage workflows like GitHub, Google Drive, Box, etc.

Subject repositories

Many subject-specific data repositories exist today. Unlike a general data repository, discipline-based repositories can be very specific and well-known within a particular field. This can be both a good thing and a bad thing.
Pro: If your field has a specific repository, your data will likely be seen by the right people, increasing its chance for reuse and further influence
Con: Researchers outside of that discipline might not know where to look for your data
Re3data.org: The Registry of Research Data Repositories is a service provided by DataCite (a global non-profit that provides DOIs - Digital Object Identifiers). With over 1,500 data repositories listed, re3data.org is likely to have a repository in your discipline.
OpenDOAR: OpenDOAR (Directory of Open Access Repositories) is a curated and authoritative list of academic open access repositories. Not only do staff of OpenDOAR visit each repository listed but they also review each repository for quality (a pretty big task considering they have 2,600 listings). Included in OpenDOAR are datasets, articles, books, and software.
Simmons College hosts the Open Access Directory’s list of Data Repositories. The Open Access Directory is maintained by the Open Access community and an editorial board. It includes repositories ranging from archaeology to physics.

APIs + Wrappers

Zenodo(R)
PyZenodo(Python)

Dataverse(R)
Dataverse-client(Python)

GitHub Search API
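As a rough illustration of what wrappers like the ones listed above do under the hood, here is a minimal Python sketch that builds a search request for Zenodo's REST API with only the standard library. The `https://zenodo.org/api/records` endpoint and its `q`, `size`, and `page` parameters are taken from Zenodo's public API documentation; the function name and the example query are made up for illustration.

```python
from urllib.parse import urlencode

ZENODO_RECORDS = "https://zenodo.org/api/records"  # public record search endpoint


def zenodo_search_url(query, size=10, page=1):
    """Return a URL that searches Zenodo records for `query`."""
    # urlencode escapes spaces and special characters for us
    params = {"q": query, "size": size, "page": page}
    return ZENODO_RECORDS + "?" + urlencode(params)


url = zenodo_search_url("precipitation california", size=5)
print(url)
```

Passing the resulting URL to `urllib.request.urlopen` (or `requests.get`) should return a JSON document describing the matching records.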

Dataverse Walk-through

Searching and using the website/GUI
- Demo Dataverse for fooling around with.
A toy example of using the dataverse package in R to search for data and download it.
- Check out the vignettes for more
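For readers who prefer Python, the same search step can be sketched against the Dataverse Search API without any wrapper. This is only a sketch under a few assumptions: it targets the demo installation at `https://demo.dataverse.org`, the helper names are hypothetical, and the `sample` response is hand-written to mimic the shape of a real reply (the DOI in it is a placeholder, not a real dataset).

```python
import json
from urllib.parse import urlencode


def dataverse_search_url(query, server="https://demo.dataverse.org", per_page=10):
    """Build a Search API URL that looks for datasets matching `query`."""
    params = {"q": query, "type": "dataset", "per_page": per_page}
    return server + "/api/search?" + urlencode(params)


def dataset_titles(response_text):
    """Pull (name, persistent id) pairs out of a Search API JSON response."""
    payload = json.loads(response_text)
    items = payload.get("data", {}).get("items", [])
    return [(item.get("name"), item.get("global_id")) for item in items]


# Hand-written sample shaped like a Search API response (not real data).
sample = json.dumps({
    "status": "OK",
    "data": {"total_count": 1, "items": [
        {"name": "Example dataset", "type": "dataset",
         "global_id": "doi:10.5072/FK2/EXAMPLE"}
    ]},
})

search_url = dataverse_search_url("rainfall")
titles = dataset_titles(sample)
print(search_url)
print(titles)
```

In a real session you would fetch `search_url` and hand the response body to `dataset_titles` instead of the sample string.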

On your own

Using the Comparative Overview of Features document as a template, think about your own research and the kind of repository (general vs. specific) that makes the most sense for your archival needs.

Contacts

https://researchdata.berkeley.edu/

http://dlab.berkeley.edu/

https://www.cdlib.org/services/uc3/dash.html

Intro to Machine Learning with scikit-learn -- Robert Martin-Short

2018-11-05T00:00:00+00:00

Welcome!

Speakers

Robert Martin-Short

PhD Candidate, Geophysics

Website: rmartinshort.jimdo.com

Content

Installation

This workshop will be using the following languages and software:

Python 3.6
Jupyter
scikit-learn

All of these requirements can be satisfied with Anaconda.

Materials

The Jupyter notebooks containing the workshop material can be found in the following repo: basics of machine learning with scikit-learn

DATS Round-table

2018-10-29T00:00:00+00:00

Welcome!

Please sign in at this google sheet!

DATS Meet up

2018-10-22T00:00:00+00:00

Welcome!

Please sign in at this google sheet!

Matplotlib Two Ways -- Caroline Cypranowska

2018-10-01T00:00:00+00:00

Welcome!

Speakers

Caroline Cypranowska

PhD Candidate, Department of Molecular and Cell Biology

Website: cypranowska.github.io

Content

Installation

This workshop will be using the following languages and software:

Python 2.7/3.6
Jupyter
Matplotlib
Numpy

All of these requirements can be satisfied with Anaconda.

Materials

The Jupyter notebooks containing the workshop material can be found in the following repo: code_examples/matplotlib

Data tidying in R & Python -- Caroline Cypranowska and Sara Stoudt

2018-09-24T00:00:00+00:00

Welcome!

Speakers

Caroline Cypranowska

PhD Candidate, Department of Molecular and Cell Biology

Website: cypranowska.github.io

Sara Stoudt

Graduate Student, Department of Statistics, Moore/Sloan Fellow @ BIDS

Website: sastoudt.github.io

Content

For this workshop we’ll be using materials created by Diya Das, David DeTomaso, and Andrey Indukaev. See the README.md file in Diya’s tutorial repo to get started.

Charles Frye -- Use You A Jupyter Notebook For Great Good!

2018-09-17T00:00:00+00:00

Welcome!

Agenda

Speakers

Charles Frye

Graduate Student, Helen Wills Neuroscience Institute

Website: charlesfrye.github.io

Bio: here.

Content

The content for this talk is available at this link. Head to the GitHub repo for DATS for instructions on access.

Mark Mikofski -- Git Version Control with GitHub

2018-09-10T00:00:00+00:00

Agenda

Requirements
Objectives
What is Git VCS?
GitHub
GitHub Pages
SSH or HTTPS
Git Primer
Winning Workflow

Requirements

To prepare for this tutorial make sure you have the following:

We’re going to use Git, so make sure you have Git installed on a laptop, and of course, don’t forget to bring your laptop to the tutorial.
- MacOS: you already have git, open a terminal and type git
- Windows: install Git-for-Windows, no admin
- Linux: use your app manager, eg Ubuntu: sudo apt install git
For more info, see the Git SCM Book on installing Git
We’re going to make a personal webpage on GitHub, so make sure your computer has working internet access. AFAIK anyone can use CalVisitor or AirBears WiFi connection for free.
If you are not already registered for GitHub, please create an account. I strongly recommend that you enable two factor authentication using an app like Google Authenticator.
You’ll probably want a basic editor like Notepad on Windows, TextEdit on Mac, and gedit in Linux, or you can also just edit your files directly on GitHub. Anything will do, but not a word-processor, no, and a fullblown IDE is also probably overkill. Something like Sublime Text or Notepad++ is just right IMHO.
A willingness to participate, try new things, make mistakes, learn and have fun!

Objectives

At the end of this tutorial you will be able to do the following:

explain to a colleague what version control is, why it’s important, what it’s important for, and when to use it
use Git to version control your documents between iterations
teach a coworker to use basic git commands, and to create a pull request on GitHub
collaborate with others on GitHub using a feature-branch workflow
make a personal webpage using GitHub Pages

Git VCS

What is Git? And why is it important?

In case of fire, git commit, git push and leave the building

Git on Git

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. [1]

XKCD on Git

Version Control Software (VCS) aka Source Code Management (SCM)

But what is Version Control?

… version control, aka source control, is the management of changes to documents, computer programs, large web sites, and other collections of information. [2]

Whether you’re writing a dissertation, developing an analysis, or writing code, you will revise, revise, and revise. Each iteration is important. Using Git VCS gives you the ability:

to reverse your work
take a new direction without losing your current position
recover from a hard drive crash
continue your work from a different laptop
collaborate with others

References

GitHub

Repeat the following 3 times out loud:

Git is not GitHub, and GitHub is not Git.

GitHub is an online hosted Git service that acts as a centralized repository for its users. You can create and clone Git repositories on GitHub, and you can pull from and push to Git repositories on GitHub, just as if they were on your own laptop, another networked laptop, or another online Git hosting service like Bitbucket or GitLab.

If you have not already created a GitHub account, you need to create one now to participate in this tutorial. Also, I encourage you to enable two-factor authentication (TFA) on your GitHub account, and store your backup codes in a safe location that you will remember. TFA makes it more difficult to hack your account.

GitHub Pages

GitHub allows users to host static content on GitHub Pages. Content written in Markdown is automatically rendered as HTML using Jekyll, a Ruby static site generator. GitHub offers themes to beautify your site’s look and layout. It’s a great place to host your personal website.

To create your personal GitHub Page, you need to create a new repository called <your-github-username>.github.io, for example mikofski.github.io.
After the new repository is created, open the repository settings, and select theme chooser.
After choosing a theme, an online editor opens with index.md. You can make edits to this file, such as changing the title to your name.
Scroll to the bottom, find where it says commit directly to master, in the first field enter, “initial commit”, and then press the commit button.

Congratulations! You’ve just made your first Git commit on GitHub, and created your personal website. But, it’s far from done. It could use a little more work. Let’s take it offline, and iterate on it, till it’s just the way you want.

SSH or HTTPS

In order to pull the repository to your laptop, you’ll have to prove to GitHub, that you are who you say you are, and that you have permission to edit the site. There are two ways to authenticate to GitHub:

SSH: you create a pair of keys, keep one private, and upload the public key to GitHub. (Recommended)
1. if your laptop has a folder called .ssh in your user profile and it contains two files called id_rsa and id_rsa.pub then skip to step 4.
2. if your laptop does not have a .ssh folder, then open a shell and type ssh-keygen
3. when prompted for a passphrase, enter something that is easy to remember
4. on your laptop in a shell, type
```
$ eval `ssh-agent`
$ ssh-add
```
5. if prompted for your passphrase and you know it, enter it, but if you don’t know it, then kill the shell, delete the .ssh folder, and restart from step 2
6. on your laptop, open the id_rsa.pub file in .ssh/ and copy the contents
7. online in your personal GitHub profile, in settings under SSH keys, click New SSH key, paste the contents of your public key and click Add SSH key to save
HTTPS: You use your GitHub username and password, but if you enabled TFA, this becomes more complicated. You have two more options:
- Windows: do nothing, Microsoft has already installed a credential manager that works with GitHub to prompt you for your TFA code.
- Mac/Linux Option A: create a personal access token with repo access
  1. in your personal GitHub profile under developer settings click generate new personal access token, and check the repo full access box
  2. on your laptop, enable Git’s credential store by typing git config --global credential.helper store (note that this saves the token unencrypted in ~/.git-credentials)
  3. then when prompted by Git, use your GitHub username, and the personal access token as your password.
- Mac/Linux Option B: download and install the Microsoft Git Credential Manager - this does everything in Option A for you (Recommended)
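Step 1 of the SSH instructions above (checking whether a key pair already exists) is easy to get wrong by eye, so here is a small Python sketch of that check. The function name is hypothetical, and it only looks for the default RSA file names mentioned in the steps; keys of other types or with custom names would be missed.

```python
import os
from pathlib import Path


def has_rsa_key_pair(ssh_dir=None):
    """Return True if both id_rsa and id_rsa.pub exist in `ssh_dir`."""
    if ssh_dir is None:
        # Default to the .ssh folder in the user profile
        ssh_dir = os.path.expanduser("~/.ssh")
    ssh_dir = Path(ssh_dir)
    return (ssh_dir / "id_rsa").is_file() and (ssh_dir / "id_rsa.pub").is_file()


if has_rsa_key_pair():
    print("Key pair found: skip to step 4 (ssh-agent and ssh-add).")
else:
    print("No key pair found: run ssh-keygen (step 2).")
```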

Git Primer

The most important Git command is git. If you type it in a terminal you get a list of the other most important Git commands such as init, clone, status, log, diff, add, commit, checkout, remote add, pull, and push.

The first thing you should do after setting up your SSH keys is to tell Git your full name and email address. Then we can get your new website and start hacking on it. The following commands are entered in a shell, in a folder you use for projects.

Add your name and email using git config:

$ git config --global user.name "Your Name Comes Here"
$ git config --global user.email you@yourdomain.example.com

Clone your GitHub repository to your laptop using git clone:

# if you're using SSH
$ git clone git@github.com:<github-username>/<github-username>.github.io.git

# if you're using HTTPS
$ git clone https://github.com/<github-username>/<github-username>.github.io.git
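The two clone URLs above differ only in their prefix, which a tiny Python sketch makes explicit. The helper name is made up for illustration; the username `mikofski` is the example used earlier in this tutorial.

```python
def clone_urls(username):
    """Return the SSH and HTTPS clone URLs for a user's GitHub Pages repo."""
    repo = f"{username}/{username}.github.io.git"
    return {
        "ssh": f"git@github.com:{repo}",         # authenticates with your SSH key
        "https": f"https://github.com/{repo}",   # authenticates with password or token
    }


urls = clone_urls("mikofski")
print(urls["ssh"])
print(urls["https"])
```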

Enter the newly cloned repo, then display the log and the remotes
```
$ cd <github-username>.github.io
$ git log
$ git remote
$ git remote show origin
```
Now open your editor and make some changes to your index.md file.
Before you make too many changes, go back to the shell and view the status, a diff from the previous version, and commit your changes
```
$ git status
$ git diff
$ git commit -am "put any message here, usually under 50 characters"
```

XKCD on Git Commit

Winning Workflow

The secret power of using Git with GitHub is how easy it makes collaborating with others. AFAIK the feature-branch workflow is the most frequent method of collaboration on GitHub. I outlined its steps in a THW-Berkeley talk last year on using GitHub in OSS.
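The talk's exact steps are not reproduced here, so the following Python sketch shows just one common version of a feature-branch round trip: branch, stage, commit, push. The branch name, file name, and commit message are placeholders, and the helper only lists the commands rather than running them.

```python
import shlex


def feature_branch_commands(branch, files, message):
    """List the Git commands for one pass through a feature-branch workflow."""
    return [
        ["git", "checkout", "-b", branch],        # start a branch for the change
        ["git", "add", *files],                   # stage the edited files
        ["git", "commit", "-m", message],         # record the change locally
        ["git", "push", "-u", "origin", branch],  # publish the branch to GitHub
    ]


commands = feature_branch_commands("add-bio", ["index.md"], "add a short bio")
for argv in commands:
    # shlex.quote keeps arguments with spaces copy-paste safe
    print("$", " ".join(shlex.quote(arg) for arg in argv))
```

To actually run them, pass each list to `subprocess.run(argv, check=True)` from inside the cloned repo, then open a pull request on GitHub for the pushed branch.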

Additional Info

GitHub help pages are a wealth of info.
Oh Shit Git! is a funny guide to getting out of common Git mistakes.
Git SCM Documentation is the official source.

First meeting of Fall 2018 Semester -- Organization

2018-08-27T00:00:00+00:00

Direct link: here

Agenda

4:10 - Intro to BIDS and our group // we’re on Berkeley time!
4:20 - Introductions (you!)
4:30 - What do we want to learn and what do we want to teach?
4:45 - Our GitHub repo and website

Speakers

Caroline Cypranowska

PhD Candidate, Department of Molecular & Cell Biology and Chief Organizer, Data Analysis Tools Series

Website: cypranowska.github.io

Caroline Cypranowska is a PhD candidate in the Department of Molecular & Cell Biology at UC Berkeley and a National Science Foundation Graduate Fellow. She’s currently studying the genetic mechanisms of synaptic plasticity as a member of the Isacoff lab. Caroline has technical expertise in single-cell RNA-sequencing, TIRF microscopy, and single-molecule pull-down.

Outside of lab, Caroline volunteers as a Math instructor with the Prison University Project at San Quentin and as an organizer for The Data Analysis Tools Series, formerly known as The Hacker Within. She also enjoys backpacking, snowboarding, bouldering, and any other activity that can be done in the great outdoors.

Diya Das

Postdoctoral Researcher, Department of Molecular & Cell Biology and Moore-Sloan Data Science Fellow, Berkeley Institute for Data Science

Website: diyadas.github.io

Diya is a postdoctoral researcher in the lab of John Ngai, where she studies regeneration in the olfactory epithelium, the tissue responsible for our sense of smell. She analyzes how olfactory stem cells contribute to both steady-state differentiation and injury-induced regeneration using single-cell RNA sequencing (scRNA-seq), assay for transposase-accessible chromatin sequencing (ATAC-seq) and other genomics techniques.

Diya also facilitates opportunities for fellow researchers to develop their data science skills. At BIDS, she coordinates Software/Data Carpentry workshops (she is a Software Carpentry instructor and lesson maintainer). Diya formerly organized The Hacker Within, which is now The Data Analysis Tools Series. She is also Fellow Lead of the Career Paths & Alternative Metrics Working Group (chaired by Henry Brady), which addresses the career paths available to data scientists within academia.

Tim Howes -- File syncing tools - syncthing, dat, git-annex

2018-05-02T00:00:00+00:00

File syncing tools

I will discuss open source tools that you can use to sync files directly between computers, rather than relying on paid cloud services such as Dropbox. These can be especially useful when dealing with large scientific datasets, which may be impractical to sync to the cloud, and for which you may want more control over versioning information. If you want something similar to a cloud service, but with more control, you can set up these tools on your own virtual private server.

syncthing

syncthing is a cross-platform tool that can be used to keep folders in sync between your own devices or to share with collaborators. The settings can be customized to ignore certain files or sub-directories on specific machines, and there are different options available for keeping copies of old versions of files.

dat

Dat is a protocol for peer-to-peer sharing of collections of files. This has similar advantages to sharing files using BitTorrent, but it also includes the ability to update the files in an archive and track the version history.

git-annex

git-annex is a tool that allows you to track large files within your git repositories, and it gives you a high level of control over which clones of the repository actually get the full file contents and which get only small placeholder files. This means that you can view and organize the full directory tree on your local machine without having to actually download all the files, and you can download the contents of individual files when needed using “git-annex get”. A special git-annex branch tracks the locations of the file contents and ensures that the correct number of copies exist on other machines before “dropping” the local file.

Usage notes

syncthing

Syncthing keeps folders in sync between machines by making a secure, direct connection between the machines (or optionally by using relay servers if a direct connection is not possible). It is a simple tool that can be started at the command line, run in the background, and viewed/controlled via a web browser.

Installation

https://docs.syncthing.net/intro/getting-started.html https://docs.syncthing.net/users/autostart.html

Install and enable on Ubuntu:

sudo apt install syncthing

# Enable as automatic background service
# replace 'myuser' with your username
sudo systemctl enable syncthing@myuser.service
sudo systemctl start syncthing@myuser.service

# or run `syncthing` manually on the command line

Check status on Ubuntu:

#Check service status
sudo systemctl status syncthing@myuser.service

#Check logs
sudo journalctl -e -u syncthing@myuser.service

Install and enable on macOS:
(First install homebrew: https://brew.sh/)

brew install syncthing

#Enable as automatic background service
cp /usr/local/Cellar/syncthing/latest/homebrew.mxcl.syncthing.plist ~/Library/LaunchAgents/syncthing.plist
launchctl load ~/Library/LaunchAgents/syncthing.plist

# or run `syncthing` manually on the command line

You may need to adjust firewall settings to allow incoming connections. On Mac, you will usually be prompted to allow this the first time you start syncthing.

https://docs.syncthing.net/users/firewall.html

Connect to a new machine

Visit http://localhost:8384 to view the GUI for your running syncthing.

Click “Add remote device” and enter the device’s long unique ID. If you’re on the same local network as the other device, it will show up as a suggestion so you don’t have to type it.

Give the device whatever nickname you like. Specify the IP address (if it is stable) or leave as ‘dynamic’ to find the device automatically based on the ID. Choose which folders to share with the device. Choose ‘introducer’ if you would like to receive other folders automatically from the device.

https://docs.syncthing.net/intro/getting-started.html#configuring

Set up a new folder

Ignore files

https://docs.syncthing.net/users/ignoring.html

Keep old versions

https://docs.syncthing.net/users/versioning.html

Other tips

Set up a virtual private server on a cloud provider if you want to have an always-on machine that can act as the central hub.
If syncing files between Mac and Linux, you might need to watch out for case sensitivity (Linux filesystems are case-sensitive, Mac by default is not). You can create a new APFS volume on your Mac hard drive with case sensitivity enabled, and put your sync folders there to avoid issues.
If running on a server where you don’t have root access, download and run syncthing manually or enable as a user service.

https://docs.syncthing.net/users/autostart.html#using-systemd
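Regarding the case-sensitivity tip above: before syncing a Linux folder to a Mac, it can help to check whether any names differ only by letter case. The Python sketch below is one way to do that. The function names are hypothetical, and `str.casefold` is only an approximation of how a case-insensitive filesystem compares names.

```python
from collections import defaultdict
from pathlib import Path


def case_collisions(paths):
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[str(path).casefold()].append(str(path))
    return [sorted(names) for names in groups.values() if len(names) > 1]


def scan_folder(root):
    """Check every file and folder under `root` for case-only name clashes."""
    root = Path(root)
    return case_collisions(p.relative_to(root).as_posix() for p in root.rglob("*"))


example = ["data/Run1.csv", "data/run1.csv", "notes.txt"]
collisions = case_collisions(example)
print(collisions)
```

Calling `scan_folder` on a sync folder on the Linux side lists the groups of names that would clash on a default Mac volume.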

See also the syncthing forum: https://forum.syncthing.net/

dat

https://docs.datproject.org/tutorial

Resources for data sharing with dat: https://datbase.org/ https://blog.datproject.org/tag/science/

Beaker, a web browser based on dat that enables peer-to-peer, editable websites: https://beakerbrowser.com/ https://beakerbrowser.com/2017/06/14/forking-websites-on-the-p2p-web.html

git-annex

http://git-annex.branchable.com/walkthrough

Example setup

Initialize a repository:

mkdir project
cd project
git init
git annex init --version=6 "My desktop"

Add files:

cp ~/Downloads/ubuntu.iso .
git annex add ubuntu.iso
git commit -a -m "Added a file"

Clone on another folder on the same computer (could be a removable drive):

cd /media/usb
git clone ~/project
cd project
git annex init --version=6 "Portable drive"

Sync between clones (takes care of committing, pushing, and pulling):

cd /media/usb/project
git annex sync

# To get the content of large files in this step, use --content
git annex sync --content

Get and drop files

Special remotes

git-annex assistant

Automated sync tool with a GUI

https://git-annex.branchable.com/assistant/

Accessing public data on .gov websites -- Caroline Cypranowska

2018-04-25T00:00:00+00:00

Accessing public data on .gov websites (or how to deal with bureaucrats)

Prerequisites

Today’s exercises will require Bash. If you have a Mac or Linux machine, you’re mostly good to go.

Windows

Most Windows users in need of a Bash terminal use Cygwin, a collection of Linux software tools compiled for Windows. Other options include Git for Windows and the Windows Subsystem for Linux (on Windows 10). The steps below walk through installing Cygwin and a few other tools required for this tutorial.

Download Cygwin and run setup.exe. Select ‘Install from Internet’ when prompted by the installation wizard. Choose your root directory and mirror for installation.
The installer will also download a list of available packages. Include the default packages, but make sure to search for and include curl and wget.
Add the Cygwin path to the Windows Environment Path Variable, which can be found in the ‘Advanced system settings’ menu. Append ;C:\cygwin\bin to the end of the variable value option (assuming this is where you installed Cygwin).

MacOS

The terminal in MacOS has the majority of the tools needed to make requests to government databases, as cURL comes with Macs out of the box. The main advantage of wget over curl is that it can download recursively. While you can choose to do the exercises without wget, it can be easily installed with Homebrew.

foo@bar:~$ brew install wget

A brief explanation of networking protocols

In networking, a protocol is a set of rules for communication. Peer-to-peer networks are composed of interconnected computers, but no computer has a privileged position. Client-server networks, on the other hand, are composed of servers that perform functions on behalf of other machines (clients). Both of these systems rely on protocols to send and receive data.

The set of protocols used on the Internet is called TCP/IP (Transmission Control Protocol/Internet Protocol). The TCP/IP model has a layered structure, and protocols like HTTP, FTP, and SSH run on the highest layer (the application layer).

HTTP (or hypertext transfer protocol) defines how computers exchange HTML documents, and FTP (or file transfer protocol) defines how computers move files between local and remote file systems. These are the primary tools we will use today to get our data.

HTTP and FTP each have methods for a client to make requests of the server, and for the server to return a response. HTTP requests and responses usually have a header, which contains metadata about the request or response.
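To make the idea of a request header concrete, here is a short Python sketch that builds an HTTP request object and prints its method and headers without sending anything over the network. The target URL and the `dats-demo/0.1` user agent string are arbitrary choices for illustration.

```python
from urllib.request import Request

# Build (but do not send) a GET request with two custom headers.
request = Request(
    "https://www.data.gov/",
    headers={"Accept": "text/html", "User-Agent": "dats-demo/0.1"},
)

method = request.get_method()  # "GET" because no request body is attached
# urllib normalizes header names, so "User-Agent" is stored as "User-agent"
header_items = dict(request.header_items())

print(method, request.full_url)
for name, value in header_items.items():
    print(f"{name}: {value}")
```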

APIs

Application programming interfaces (or APIs) are a set of rules for building application software. In this case it usually refers to accessing and posting data to a specific group of servers. Many government agency APIs for accessing data are catered towards people building web application software.

API documentation usually includes:

how to format query strings
what types/formats of data can be retrieved or posted with a request
authentication procedures

What is Data.gov?

Data.gov is mostly a catalog of data sets collected by the agencies of the US Federal Government. It includes information about the agency that collected the data, metadata, landing pages for the project, links to the web address where the data can be retrieved, the format of the data, and so on.

What Data.gov is not

Data.gov doesn’t host the data directly, and doesn’t have a unified API for accessing data from all government agencies. While Data.gov does have an API, the types of information accessed with the API are data on the types of data in the catalog. So you get meta meta data.

Exercises

Getting NOAA precipitation data from an FTP server

The U.S. Hourly Precipitation data set is hosted on an FTP server and is well documented. Here you’ll find that there is a page for downloading data from specific date ranges and location, but if you want to store them on a server then you’ll (obviously) need to use FTP.

The .pdf describes the naming scheme and the readme.txt instructs how to open a connection to the server and where to find files.

Exercise: Get precipitation records from CA from 2000-2009

According to the docs (don’t run this before we discuss)

Log into the FTP server

foo@bar:~$ ftp ftp.ncdc.noaa.gov

Navigate to the correct directory

ftp> cd pub/data/hourly_precip-3240/04

Use get to download one file, or mget to get multiple files

ftp> mget 3240_04_200*.tar.Z

Just a note: when logging into an FTP server, your username and password aren’t encrypted. There are ways of doing FTP over SSH or with the Secure Sockets Layer (SSL).

The safer way

curl has an option for using FTP with SSL. We should choose this instead, because it encrypts the traffic.

Navigate to your preferred directory
Use the --ftp-ssl flag, the --user flag, and the -o option

foo@bar:~$ curl --ftp-ssl --user anonymous:youremail@email.com ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/04/3240_04_2000-2000.tar.Z -o ca_2000.tar.Z

The safer (recursive) way

curl doesn’t have a built-in method for easily getting multiple files. Write a shell script that will get all the CA precipitation data from 2000-2009.
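If you would rather script this in Python than in shell, one possible starting point is to generate the ten curl commands and then run or paste them. The sketch below assumes every yearly file follows the `3240_04_YYYY-YYYY.tar.Z` pattern seen in the single-file example; check the naming-scheme PDF before relying on that. The function names are hypothetical.

```python
BASE = "ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/04/"


def yearly_file_urls(first_year=2000, last_year=2009):
    """Build one URL per year, assuming files are named 3240_04_YYYY-YYYY.tar.Z."""
    return [
        f"{BASE}3240_04_{year}-{year}.tar.Z"
        for year in range(first_year, last_year + 1)
    ]


def curl_command(url, email="youremail@email.com"):
    """Return the curl argument list for one file, using the same flags as above."""
    filename = url.rsplit("/", 1)[-1]  # save under the remote file name
    return ["curl", "--ftp-ssl", "--user", f"anonymous:{email}", url, "-o", filename]


urls = yearly_file_urls()
for url in urls:
    print(" ".join(curl_command(url)))
```

Each printed line can be pasted into a terminal, or each list can be passed to `subprocess.run` to download the files directly.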

wget has a -m option for mirroring sites, that will allow you to download the entire contents of a directory.

foo@bar:~$ wget -mc -nH --ftps-implicit --no-ftps-resume-ssl --user=anonymous --password=youremail@email.com ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/04/

Bonus

Write a script for downloading the files you want from the NOAA FTP server with curl.
FTP isn’t super great for transferring large files. How can you tell if the files downloaded by curl are identical to the ones you mirrored with wget from the command line?
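One way to approach the second bonus question is to compare checksums. On the command line, `sha256sum` (Linux) or `shasum -a 256` (macOS) prints a digest per file; the Python sketch below does the same comparison programmatically. The function names are made up for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # read() returns b"" at end of file, which stops the loop
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """Return True when both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_contents("ca_2000.tar.Z", "pub/data/hourly_precip-3240/04/3240_04_2000-2000.tar.Z")` would compare the curl download with the mirrored copy, assuming those are the paths the two commands wrote to.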

Getting USGS earthquake data using an API

Skim the docs. Place a query to return GeoJSON records of earthquakes occurring 1) on your birthday, 2) in your favorite region of the world, 3) with a magnitude > 2.5

foo@bar:~$ curl -O "https://earthquake.usgs.gov/fdsnws/event/1/query.geojson?starttime=1991-09-21&endtime=1991-09-21&maxlatitude=43.373&minlatitude=25.542&maxlongitude=-101.25&minlongitude=-120.234&minmagnitude=2.5&orderby=time"

The Python urllib and requests libraries are great for formatting query strings and headers for more sophisticated endeavors than the exercise above. (But you can also do fancy things in Bash.)
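As a minimal sketch of the Python route, the earthquake query from the exercise can be assembled with `urllib.parse.urlencode`. The parameter values are copied from the curl example; using the `query` endpoint with `format=geojson` instead of `query.geojson` is a choice made here, based on the USGS documentation.

```python
from urllib.parse import urlencode

USGS_QUERY = "https://earthquake.usgs.gov/fdsnws/event/1/query"

# Same search window, bounding box, and magnitude cutoff as the curl example
params = {
    "format": "geojson",
    "starttime": "1991-09-21",
    "endtime": "1991-09-21",
    "maxlatitude": 43.373,
    "minlatitude": 25.542,
    "maxlongitude": -101.25,
    "minlongitude": -120.234,
    "minmagnitude": 2.5,
    "orderby": "time",
}

query_url = USGS_QUERY + "?" + urlencode(params)
print(query_url)
```

Building the URL this way sidesteps shell quoting of the `&` characters, and the result can be passed to `urllib.request.urlopen` or `requests.get` and parsed with the `json` module.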

Mini-challenge!

(To be posted during the session)

Resources

Project Open Data

Project Open Data was an initiative created by the Obama Administration to promote accessibility and visibility of data sets collected and curated by the Federal government. The Project Open Data policy page is mostly geared towards government officials wanting to publish agency data, but also includes some resources for harvesting metadata, converting file types, etc.

There’s also a dashboard to check out how well each government agency is complying with the Project Open Data policies.

NASA

Fonts aside, NASA has their crap together.

Nima Hejazi & Jeremy Coyle -- Machine Learning Pipelines for R with sl3

2018-04-18T00:00:00+00:00

About Nima and Jeremy

Nima is a PhD student in the Group in Biostatistics, where he is jointly supervised by Mark van der Laan and Alan Hubbard. Nima is also affiliated with the UC Berkeley NIH Biomedical Big Data training program and the Center for Computational Biology. Currently, his research centers around nonparametric statistical and causal inference, machine learning, and statistical computing – focusing on the development of robust techniques for inference and estimation in an eclectic collection of problem settings, with applications often arising in precision medicine, vaccine efficacy trials, computational biology, and public policy.

Jeremy is a recent PhD graduate in Biostatistics who continues working with the department to translate statistical theory to software. During his PhD studies, Jeremy worked with Alan Hubbard and Mark van der Laan on a series of projects broadly related to computational statistics, including more efficient cross-validation routines for ensemble machine learning and a software framework for cross-validation (origami). His current research interests include causal inference, model selection, re-sampling techniques, statistical software development, and statistical methods for assessing time series data from sensor systems.

Machine Learning Pipelines for R with `sl3`

We present sl3, a recently developed software package for the R language and environment for statistical computing, designed to provide utilities for engaging in a host of common machine learning tasks. Topics to be addressed include efficient data organization and accession, the construction of pipelines for data munging and analysis (based on the idea popularized by Python’s scikit-learn), and methods for performing ensemble machine learning (e.g., optimal stacked regressions). sl3 is a core part of the tlverse, a new ecosystem of software packages currently being developed by a team in the Group in Biostatistics here at Berkeley.

Selected materials for this presentation are available on GitHub here.

Software Setup

R and RStudio Installation

You can download R here and the RStudio IDE here.

Jupyter R Kernel Installation

Please follow the instructions here to install an R kernel for Jupyter.

`sl3` Installation

library("devtools")
devtools::install_github("tlverse/sl3@devel")

`devtools` installation (if needed)

install.packages("devtools")

Joint meetup with the Graduate Data Science Organization

2018-04-11T00:00:00+00:00

Instead of our typical THW lesson format, this week we will be having a joint event with the Graduate Data Science Organization. As always, everyone is welcome to join, even if you’re not a graduate student or affiliated with UC-Berkeley.

The GDSO is a student led organization with the purpose of providing graduate students and postdoc fellows with resources to explore career opportunities in data science. In order to continue building connections between members on campus, we’ll be hosting monthly meetups for all of you interested in data science. These meetups will be informal and feature a few short talks about a variety of topics relevant to our organization. They will happen on the second Wednesday of every month at BIDS. Sign up for the GDSO mailing list or contact the organizers directly at officers@gdso.berkeley.edu.

TBD -- please volunteer to lead!

2018-04-04T00:00:00+00:00

Spring Break -- no THW

2018-03-28T00:00:00+00:00

Flask -- Mark Mikofski

2018-03-21T00:00:00+00:00

Agenda

Mini lesson on Flask apps with Bokeh plots
Mini sprint contest to develop a web app from NREL developer API
Miscellaneous odds and ends

Bokeh Plots

Intro

In my opinion, an interactive web application is a fun way to share an analysis. I believe users create deeper, more meaningful connections when they explore data interactively. The goal of this tutorial will be to teach you how to quickly make a simple web application that you can use to share your data analyses online.

Requirements

You will need a laptop with Python installed for this tutorial. If you need to install Python, please download the 64-bit Anaconda distribution for Python 3.6 before you attend this tutorial. During the tutorial we will use Flask, Bokeh, Jinja2, and Requests, so please install these packages in a new conda or virtual environment.

This is easiest with Anaconda:

(root) ~/Projects/myapp $ conda create -n myvenv python==3.6.3 flask bokeh jinja2 requests
(root) ~/Projects/myapp $ activate myvenv
(myvenv) ~/Projects/myapp $

Mini lesson

This mini lesson has 4 parts:

Flask
Bokeh
Jinja2
Bootstrap

Most of the snippets and examples from this mini-lesson are in the code examples folder of The Hacker Within - Berkeley GitHub repository here.

Flask

Flask is a micro framework for developing web applications. A web app runs in a browser. The web server can be run locally on your laptop, or it can be on a remote server. Making a Flask app is surprisingly easy! Copy the following into a new file and save it as hello.py.

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello World!'

if __name__ == '__main__':
    app.run()

This creates a new app that will listen and respond to the “root” URL, or /, with the callback function hello(). This decorated function can be called a “route”. Now open a terminal window, navigate to the folder containing hello.py, and run it!

(myvenv) ~/Projects/myapp/ $ python hello.py

You should see the following in your terminal:

* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Open a browser and enter the URL given. The address 127.0.0.1 is also called localhost, and the number after the colon, 5000, is the port. You should see

“Hello World!”

in your browser. Congratulations! You’ve just written your first web app! Now hit ctrl-c in your terminal to kill the app.
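If the pieces of that URL are new to you, here is a small sketch that uses only the Python standard library to pull apart the address Flask printed. Nothing here is Flask specific.

```python
from urllib.parse import urlsplit

# The address Flask printed when the app started
parts = urlsplit("http://127.0.0.1:5000/")

print(parts.scheme)    # http
print(parts.hostname)  # 127.0.0.1 (also known as localhost)
print(parts.port)      # 5000
print(parts.path)      # / (the "root" route that hello() answers)
```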

Bokeh

Bokeh is a Python library for making interactive “D3” style plots using an imperative style like Matplotlib (versus a declarative style like Altair). Bokeh is ideally suited for embedding plots in a web app like Flask. Let’s see how you can add a Bokeh plot to your hello.py Flask app.

Make a new folder called myapp-0/
Copy your old hello.py into the new folder

Change your new myapp-0/hello.py file as follows:

"""
My App 0: Hello with Bokeh plot.
"""
from bokeh.plotting import figure
from bokeh.resources import CDN
from bokeh.embed import file_html
from flask import Flask, Markup

app = Flask(__name__)

@app.route('/')
def hello():
    plot = figure()
    xdata = range(1, 6)
    ydata = [x*x for x in xdata]
    plot.line(xdata, ydata)

    return Markup(file_html(plot, CDN, "my plot"))

if __name__ == '__main__':
    app.run(debug=True)

Open a terminal and navigate to myapp-0/
```
$ cd ~/path/to/myapp-0/
```

Activate your conda environment

myapp-0/ $ source ~/miniconda/bin/activate myvenv

Start the web app:
```
(myvenv) myapp-0/ $ python hello.py
```
Open a browser to http://localhost:5000/ or http://127.0.0.1:5000/ and you should see a line plot that looks similar to this:

Bokeh gives you several interactive features for free!

link to the Bokeh documentation
pan, zoom, and reset
save

Finally hit ctrl-c in your terminal to kill the app.

There are at least two ways to “embed” a Bokeh plot in an HTML document:

HTML files: create a stand alone HTML file
components: return the individual components used to embed the plot in any HTML file.

This example used the first method. In the next example we’ll use the “components” method to embed the plot in our own custom HTML file.

Jinja2 Templates

Jinja2 is a Python library for making HTML files with dynamic content that is rendered using a subset of the Python language. The Jinja2 markup is enclosed in curly braces and can refer to variables and commands:

<!-- http://jinja.pocoo.org/docs/2.10/templates/#escaping -->
<ul>
{% for user in users %}
  <li><a href="{{ user.url }}">{{ user.username }}</a></li>
{% endfor %}
</ul>

attribution: snippet from the Jinja2 documentation

The HTML files with Jinja2 markup are called “templates”. Flask can use Jinja2 to render templates placed in a folder called templates next to your app. Use render_template to specify the name of the template file and pass the desired data as keyword arguments.

Let’s modify our “Hello” app to use a custom template and Bokeh components.

Create a new folder called myapp-1/ and copy the myapp-0/hello.py to it.

Create a templates folder inside myapp-1/ and save the following file as myapp-1/templates/hello.html:

<!-- http://jinja.pocoo.org/docs/2.10/templates/#escaping -->
<!DOCTYPE html>
<html lang="en">

    <head>
        <meta charset="utf-8">
        <title>{{ title }}</title>

<link
    href="https://cdn.pydata.org/bokeh/release/bokeh-0.12.14.min.css"
    rel="stylesheet" type="text/css">
<link
    href="https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.14.min.css"
    rel="stylesheet" type="text/css">
<link
    href="https://cdn.pydata.org/bokeh/release/bokeh-tables-0.12.14.min.css"
    rel="stylesheet" type="text/css">

<script src="https://cdn.pydata.org/bokeh/release/bokeh-0.12.14.min.js"></script>
<script src="https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.14.min.js"></script>
<script src="https://cdn.pydata.org/bokeh/release/bokeh-tables-0.12.14.min.js"></script>

    </head>

    <body>
        <h1>Hello!</h1>
        {{ plot_div|safe }}
        {{ plot_script|safe }}
    </body>

</html>

Now change myapp-1/hello.py to get the Bokeh components and render the hello.html template:

"""
My App 1: Hello with Bokeh plot and Jinja2 template.
"""
from bokeh.plotting import figure
from bokeh.embed import components
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def hello():
    plot = figure()
    plot.circle([1, 2], [3, 4])

    # components() returns the <script> and <div> needed to embed the plot
    plot_script, plot_div = components(plot)
    kwargs = {'plot_script': plot_script, 'plot_div': plot_div}
    kwargs['title'] = 'hello'
    # a route only accepts GET requests by default, so no method check is needed
    return render_template('hello.html', **kwargs)

if __name__ == '__main__':
    app.run(debug=True)

Open a terminal, navigate to your app, activate your conda environment, and start your app:

$ cd ~/path/to/myapp-1/
myapp-1/ $ source ~/miniconda/bin/activate myvenv
(myvenv) myapp-1/ $ python hello.py

Open your browser to localhost:5000 and you should see your app with the title

Hello!

above the plot.

In this example, we get the Bokeh components, plot_script and plot_div and pass them to the template, hello.html using render_template.

The template must have several links and scripts to run Bokeh. These are placed in the <head> section of the HTML file. PyData generously provides a content delivery network (CDN) to serve these files, but you can also download them and host them locally. There are 3 cascading style sheets (CSS) for styling and 3 JavaScript files with the scripts that Bokeh uses to make interactive plots.

Finally the template must have a <div> element where you want the Bokeh plot to appear, and a <script> which has your data and the Bokeh JavaScript code to make your interactive plot.

Bootstrap

Bootstrap is an HTML framework and component library of CSS and JavaScript files that takes the pain out of creating attractive content for folks who are not web designers. To use it, all you have to do is put the CSS and JavaScript links in your HTML. Follow the directions in their getting started introduction to see where to put these links.

Mini Sprint Contest!

We’re going to have a mini sprint so that you can practice what you’ve learned. The goal will be to create an interactive Bokeh plot in a Flask app using data from the NREL Developer Network.

Go to NREL Developer Network and register for an API key.
Start your engines.
Go!

See Flask code examples for some ideas. (wip)

Miscellaneous odds and ends

There are lots of other rabbit holes to jump down. Here are a few.

HTML, CSS, and JS

Understanding HTML 101 will make building your web app or generating static content much easier. But understanding HTML is just the tip. You may quickly find yourself dabbling in CSS and JS too. Embrace it. But beware of misinformation - avoid W3Schools and go straight to the horse’s mouth. Mozilla invented the internet, not Al Gore, (j/k) so if you need to find out anything about HTML, CSS, or JS, always check the Mozilla Developer Network (MDN) first!

Other HTML/CSS/JS frameworks

Writing your own CSS and JS is tiring. Making it look good, unless you’re a pro, is nearly impossible. These frameworks make it easy to look like a pro.

Static content

If you only need to generate your report once, or only occasionally, then a static site is fine. Your plots can still be interactive, static just means that the content on the site doesn’t change. A static site generator creates HTML, CSS, and JS content from some other markup like Markdown or ReST. Some hosts also offer static site generation and content management. And there are some tools that can generate static content in the form of HTML even though that’s not their primary function.

GitHub Pages and Jekyll
Pelican
Sphinx
Jupyter
Markdown
HTML/CSS/JS
Bokeh, d3, Vega, and Vega-Lite
Blogger, WordPress, and WordPress.com for hosting

Web frameworks for dynamic content

If your site depends on user input or if you want your site to update automatically with input from another source like an API or database, then you will need to use a web framework and a web server. A web framework combines the most common features from most web applications into a boilerplate design. Additional features can usually be added with extensions and plugins. Some frameworks are simpler than others, and some come with everything included.

AJAX

There is this crazy middle ground between static and dynamic content where you get content and modify the DOM using AJAX directly from the browser. This is way beyond the scope of this tutorial.

Web API

This is a web app that has a published interface or schema that users can use to programmatically interact with the application without a browser. There are several framework extensions that can be used to create a web API.

Embedded plotting libraries

Bokeh
Plot.ly
d3
Vega and Vega-Lite
Chaco
Matplotlib - static only AFAIK
mpld3

Database object relational mapper (ORM)

If your web app interacts with a database, then you should use an object relational mapper. This tool converts native objects into database records and generates database operations like SQL queries from native methods in the background, making it simpler to create, read, update, and destroy data.

Hosting

If you want to share your site, or have it visible outside of your network then you’ll need a host. Beware, once your data is public it’s on you to keep it secure. Web frameworks will handle the most obvious threats, but you still need to use common sense. Robots continuously crawl the internet and automatically attack anything new that they find, regardless of how insignificant it is.

Warning: If your application will require authentication, then you must use HTTPS!

Heroku
AWS
Google App Engine
Azure
local network or intranet

Web servers

With luck you won’t have to set up a web server yourself, since this is usually handled by your hosting service, but it’s useful to know about web servers at a high level. Typically you will see two pieces: a WSGI server (WSGI is a protocol for passing content to and from Python) and a web server that offers the content to web browsers requesting it and accepts content from browsers that send it. Most WSGI servers combine both roles, but a dedicated web server can offer more features and better performance. It’s not uncommon for a single web app to be running simultaneously on several web servers and several WSGI servers behind a single load balancer that also offers a CA certificate and port forwarding from HTTP (port 80) to HTTPS (port 443) to secure your site.

Apache + mod-wsgi
gunicorn or uwsgi
nginx + gunicorn or uwsgi
Werkzeug
Tornado
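To make the WSGI idea a bit more concrete, here is a minimal sketch using only the standard library’s wsgiref module. A WSGI app is just a callable that takes the request environment and a start_response function; a Flask app object is a callable of this same shape. The helper call_app is made up for illustration and stands in for what a real WSGI server does on each request.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A minimal WSGI app: the server calls this once per request."""
    body = b"Hello World!"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app directly, the way a WSGI server would."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a fake GET request
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(application)
print(status, body)
```

To serve this app in a browser you could hand application to wsgiref.simple_server.make_server, which is fine for experiments but not meant for production use.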

REST

In order for your application to run on several servers simultaneously, it needs to be RESTful. REST stands for representational state transfer and basically means that your app is stateless. In other words, all of the information that the servers need to run your app is contained in one of three (or maybe four) places:

request header
query string
URL
(maybe a cookie or other client side cache that is used for client side operations only, eg: with JavaScript)
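Here is a small sketch of what “stateless” means in practice. The handler below builds its answer only from the URL, the query string, and a request header, so it doesn’t matter which server runs it or how many times. The route /stations/42, the units parameter, and the handle function are all made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def handle(url, headers):
    """Build a response using only what the request itself carries."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "resource": parts.path,                               # from the URL
        "units": query.get("units", ["metric"])[0],           # from the query string
        "format": headers.get("Accept", "application/json"),  # from a request header
    }


request_url = "http://localhost:5000/stations/42?units=imperial"
request_headers = {"Accept": "text/csv"}

# Two "servers" handling the same request give the same answer,
# because neither one relies on anything remembered from earlier requests.
first = handle(request_url, request_headers)
second = handle(request_url, request_headers)
print(first == second)
```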

Glossary

DOM = document object model
HTML = hypertext markup language
CSS = cascading style sheets
JS = JavaScript - it has nothing to do with Java, used to manipulate the DOM from the browser.
REST = representational state transfer
URL/URI = uniform resource locator/identifier
HTTP = hypertext transfer protocol
HTTPS = HTTP with SSL or TLS
SSL = secure sockets layer, since replaced by TLS
TLS = transport layer security
WSGI = web server gateway interface
ORM = object relational mapping
MVC/MVW = model-view-controller or model-view-whatever
SQL = structured query language
API = application programming interface
AJAX = asynchronous JavaScript and XML - used for client side requests
CORS = cross-origin resource sharing
CSRF = cross-site request forgery
JSON = JavaScript object notation
XML = extensible markup language

Joint meetup with the Graduate Data Science Organization

2018-03-14T00:00:00+00:00

No THW -- SF Open Drinks Meetup

2018-03-07T00:00:00+00:00

** RSVP REQUIRED **

There will be no meeting of The Hacker Within at Berkeley on March 7th. But if you’d like to hang out with some likeminded people interested in open source, open data, open knowledge, and open everything, Wednesday evening, then head to San Francisco for the SF Open Drinks meetup. March 7th from 5:30-7:30pm at the Wikimedia Foundation’s headquarters at Montgomery St BART (120 Kearny Street, Suite 1600). Due to security of the building, you must RSVP on Eventbrite beforehand and bring an ID. See more info and details there.

Intro to D3.js -- Caroline Cypranowska

2018-02-28T00:00:00+00:00

d3_simplemap

D3 tutorial for making a map with data on US campgrounds from recreation.gov.

Intro to D3

How to prepare for this tutorial

Download and install Brackets. (This is my preferred tool for building visualizations with D3, but it isn’t strictly necessary. It has a nice live preview function that serves the page to your browser. Other options include using node.js.)
Fork or download the repository with the data (link coming soon). It has a template in the main directory that we’ll use to write our code, and our raw data in a .csv file in the /data directory.

So what is D3?

Data-Driven Documents, better known as D3, is a JavaScript library for creating interactive data visualizations for the web. Mike Bostock, the primary developer of D3, first published D3 in 2011, and it’s been a favorite data visualization tool ever since.

However, D3 has a reputation for being a challenging library to master. This is because it requires knowledge of how SVG works, a bit about HTML/CSS, and a large dose of JavaScript. The goal of this workshop is to help you get a good enough sense of how D3 works so that you can try things on your own!

Going in (SVG) circles

D3 visualizations usually begin with creating SVG objects. So let’s create 3 circles using SVG.

<svg width="720" height="120">
    <circle cx ="40" cy="60" r="10"></circle>
    <circle cx ="80" cy="60" r="10"></circle>
    <circle cx ="120" cy="60" r="10"></circle>
</svg>

D3 allows you to select elements and then manipulate them. Let’s change the color of the circles to steelblue and the radius to 30.

var circle = d3.selectAll("circle");
circle.style("fill", "steelblue");
circle.attr("r",30);

Now if you inspect the circles in your browser, the SVG markup should look like this:

<svg width="720" height="120">
    <circle cx ="40" cy="60" r="30" style="fill:steelblue;"></circle>
    <circle cx ="80" cy="60" r="30" style="fill:steelblue;"></circle>
    <circle cx ="120" cy="60" r="30" style="fill:steelblue;"></circle>
</svg>

Instead of passing a string or an integer to a .style or .attr call, you can also pass a function. Try adding this line to your javascript code. What do you think the result would be?

circle.attr("cx",function () { return Math.random()*720 });

Inspect the circles again in your browser. Now the cx parameter should be changing with each page refresh.

Binding data to HTML or SVG elements is the foundation of D3

How do I change the attributes of my SVG elements based on my data? The first step is to bind the data to the SVG elements. In the javascript portion of our document, delete the circle.attr("r",30) line and add the following:

circle.data([32,57,112]);
circle.attr("r", function(d) { return Math.sqrt(d); });

Here d refers to the data we bound to the circles. Open the web inspector and run console.log(d3.selectAll("circle")). Each element should have a __data__ parameter, and that value should correspond to the data value.

We can also pass the index of elements that are selected. After removing the circle.attr("cx", ...) line, add the following:

circle.attr("cx", function(d, i) {  return i * 100 + 30; });

Now the x location of each circle is a function of its index!

But what if I had 1000000000000000 rows of data!!!

With D3 you don’t need to explicitly write out every SVG element you want for your final data visualization. What you can do is make a virtual selection with D3, bind your data to it, and then create the elements that you want on the page. THIS is the magic of D3.

Go ahead, and delete the <circle> elements from the SVG portion of your document. The javascript portion should look like this:

/* create an svg canvas, 400 by 100 px, wide enough for all four circles */
var svgCanvas = d3.select("body").append("svg")
                  .attr("width", 400)
                  .attr("height", 100);

/* the data */
var dat = [32,57,112,293];

/* select circles virtually, bind the data, add attributes */
svgCanvas.selectAll("circle")
    .data(dat)
    .enter()
    .append("circle")
    .attr("cy", 60)
    .attr("cx", function (d, i) { return i * 100 + 30;})
    .attr("r", function (d) { return Math.sqrt(d); });

Appending to the virtual selection allows us to create circles for each data point, even if we don’t have those circles already drawn on the canvas.
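If the enter/append step feels like magic, it may help to see the same bookkeeping written out by hand. This Python sketch is not D3 and doesn’t touch the page; it just produces one circle tag per data point using the same index and value formulas as the chain above. The function name circle_markup is made up, and the radius is rounded only to keep the output readable.

```python
import math

dat = [32, 57, 112, 293]


def circle_markup(d, i):
    """Return the <circle> tag for data value d at index i."""
    cx = i * 100 + 30   # same as function (d, i) { return i * 100 + 30; }
    r = math.sqrt(d)    # same as function (d) { return Math.sqrt(d); }
    return '<circle cy="60" cx="{}" r="{:.2f}"></circle>'.format(cx, r)


# One element per data point, which is what .enter().append("circle") gives you
circles = [circle_markup(d, i) for i, d in enumerate(dat)]
print("\n".join(circles))
```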

You don’t need to reinvent the wheel

There are tons of resources for learning D3, and lots of code blocks to peruse.

Online learning resources

D3 documentation
Aligned Left
Dashing D3 – not all content on this site is free

Example galleries

Official D3 Gallery
https://bl.ocks.org/
http://christopheviau.com/d3list/gallery.html

Fancy examples

http://www.facesoffracking.org/data-visualization/
http://www.koalastothemax.com/

And if all else fails… there’s always Google.

Now let’s make a map!

Mark Mikofski -- SQL and relational databases

2018-02-21T00:00:00+00:00

Agenda

Requirements
Objectives
SQL Examples
Relational Databases
Summary

XKCD 327: Exploits of Mom

Requirements

To prepare for this tutorial make sure you have the following:

We’re going to use some Python, so make sure you have it installed on a laptop, and of course, don’t forget to bring your laptop to the tutorial.
We’re going to use an example database and a Jupyter notebook with some code examples, so make sure your computer has working internet access. AFAIK anyone can use the Cal AirBears WiFi connection for free.
A willingness to participate, try new things, make mistakes, learn and have fun!

Objectives

At the end of this tutorial you will be able to do the following:

define what a database is
describe the difference between a relational database and no-SQL databases
write SQL code to
- create a database, add a table to a database, and add a row to a table
- query a database by selecting fields that satisfy a condition
- join two or more tables along a common field
- calculate COUNT, MAX, and other aggregate functions
name some common relational databases
explain some common usage patterns for databases

SQL Examples

We’re going to use the examples from code_examples/SQL, so point your browser to this link or clone The Hacker Within - Berkeley and navigate to this folder.

Relational Databases

Wikipedia defines a database as …

An organized collection of data. A relational database, more restrictively, is a collection of schemas, tables, queries, reports, views, and other elements. … the most popular database systems since the 1980s have all supported the relational model - generally associated with the SQL language.

The main difference between a database and an object model like JSON or a simple spreadsheet is the size and complexity, necessitating database management software to quickly create, query, and retrieve data.

The relational database differs from other databases due to its strictly tabular structure consisting of rows of records and columns of fields. For example:

primary key	text field	integer field	date field	real field	boolean field
1	foo	234	2018-02-21T1700Z	5.67E-8	TRUE
2	bar	123	2018-02-21T1830Z	1.6E-19	FALSE

Other databases, called NoSQL, have a more flexible structure, allowing nested relations between keys, values, and arrays. Some NoSQL databases are more scalable than relational databases and can handle more data, making them useful for data science. Some examples of NoSQL databases are: CouchDB, MongoDB, Cassandra, AWS DynamoDB, etc.

Schema

The database schema formally describes the structure of a database. For example, the database in the table above could be described as a table with six fields:

a unique non-null field called the primary key.
a text field
an integer field
etc.
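As a rough sketch, here is how that six-field schema could be written down and filled with the two example rows using Python’s built-in sqlite3 module. The table and column names are made up for illustration. SQLite has no dedicated date or boolean types, so this sketch stores the date as text and the boolean as 0 or 1; other databases would use their own types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway database that lives in memory

# The schema: one table with six fields
conn.execute("""
    CREATE TABLE example (
        id            INTEGER PRIMARY KEY,  -- the primary key
        text_field    TEXT,
        integer_field INTEGER,
        date_field    TEXT,                 -- stored as text in SQLite
        real_field    REAL,
        boolean_field INTEGER               -- stored as 0 or 1 in SQLite
    )
""")

rows = [
    (1, "foo", 234, "2018-02-21T1700Z", 5.67e-8, True),
    (2, "bar", 123, "2018-02-21T1830Z", 1.6e-19, False),
]
# The ? placeholders let the driver handle quoting (see XKCD 327)
conn.executemany("INSERT INTO example VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

matches = conn.execute(
    "SELECT id, text_field, integer_field FROM example WHERE integer_field > ?",
    (200,),
).fetchall()
print(matches)
```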

SQL - A Structured Query Language

The language used to define the database schema, insert data, and make queries is called SQL or Structured Query Language.
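Here is a small taste of all three jobs in one place, again using Python’s built-in sqlite3 module so there is nothing to install. The stations and readings tables and their contents are made up for illustration; the query joins the two tables on a common field and computes COUNT and MAX for each station.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define the schema and insert some made-up data
conn.executescript("""
    CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE readings (
        id         INTEGER PRIMARY KEY,
        station_id INTEGER REFERENCES stations(id),
        precip_mm  REAL
    );
    INSERT INTO stations VALUES (1, 'Berkeley'), (2, 'Oakland');
    INSERT INTO readings (station_id, precip_mm) VALUES
        (1, 2.5), (1, 7.0), (2, 1.0);
""")

# Join the tables along station_id, then aggregate per station
query = """
    SELECT s.name, COUNT(r.id), MAX(r.precip_mm)
    FROM stations AS s
    JOIN readings AS r ON r.station_id = s.id
    GROUP BY s.name
    ORDER BY s.name
"""
result = conn.execute(query).fetchall()
print(result)
```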

Database Management Software

Database management typically consists of a server and a client. There are several popular relational databases:

Clients and APIs

There are many ways to interface with a SQL database. Most databases come with a command line client, e.g.: psql or a GUI, e.g.: pgAdmin. Most databases also provide an API for programmatic interaction, e.g.: libpq.

Python Bindings

There are Python bindings to most database APIs:

Object Relational Mapping

It is also possible to bind the database records directly to objects using object relational mapping (ORM) with software such as Django or SQLAlchemy. The advantage of using an ORM is that instead of using SQL commands, you create objects native to the language, and the ORM takes care of creating the corresponding schema in the database.
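To show the idea without installing anything, here is a toy sketch that maps a Python dataclass to a table using the built-in sqlite3 module. This is not how Django or SQLAlchemy are implemented or used, and the Particle class and helper functions are made up for illustration. It only demonstrates the principle: you work with objects, and the SQL is generated for you.

```python
import sqlite3
from dataclasses import astuple, dataclass, fields


@dataclass
class Particle:
    name: str
    charge: float


# How this toy mapper translates Python types to SQL column types
SQL_TYPES = {str: "TEXT", float: "REAL", int: "INTEGER"}


def create_table(conn, cls):
    """Generate and run CREATE TABLE from the class definition."""
    columns = ", ".join(
        "{} {}".format(f.name, SQL_TYPES[f.type]) for f in fields(cls)
    )
    conn.execute("CREATE TABLE {} ({})".format(cls.__name__.lower(), columns))


def save(conn, obj):
    """Insert one object as one row."""
    placeholders = ", ".join("?" for _ in fields(obj))
    table = type(obj).__name__.lower()
    conn.execute(
        "INSERT INTO {} VALUES ({})".format(table, placeholders), astuple(obj)
    )


def load_all(conn, cls):
    """Read every row back as an object of the given class."""
    names = ", ".join(f.name for f in fields(cls))
    cursor = conn.execute(
        "SELECT {} FROM {} ORDER BY rowid".format(names, cls.__name__.lower())
    )
    return [cls(*row) for row in cursor]


conn = sqlite3.connect(":memory:")
create_table(conn, Particle)
save(conn, Particle("electron", -1.0))
save(conn, Particle("proton", 1.0))
loaded = load_all(conn, Particle)
print(loaded)
```

A real ORM also handles relationships, migrations, queries, and much more, which is why it is worth using one instead of rolling your own.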

Extra SQL commands

When setting up a SQL database server, eg PostgreSQL, you will also need to create a user, set a password, and create a database. I’ll leave these to the reader to investigate on their own.

Summary

SQL is not glamorous, and it’s been around for a long time, but it’s not that difficult to teach yourself. There are a ton of links here and in the code_examples/SQL folder, so I hope this will serve as a good starting point, but there is still so much more to learn. If you have any suggestions, feel free to comment here or please send a PR to The Hacker Within, Berkeley.

Thanks!

Joint meetup with the Graduate Data Science Organization

2018-02-14T00:00:00+00:00

Stuart Geiger -- Intro to Jupyter Notebooks

2018-02-07T00:00:00+00:00

This session will be an introduction to using Jupyter notebooks. No specific programming language expertise is required, although I’ll show how to use Jupyter to write code in python, R, and bash. We’ll walk through some of the basics together, so you can install Jupyter on your computer with Anaconda or you can launch a temporary virtual server with our mybinder container.

Some links and resources

Official Jupyter Documentation
Gallery of interesting Jupyter notebooks
IPython magic commands
IPython minibook tutorial
Jupyter Cheat Sheet
MyBinder.org – turn any GitHub repo with notebooks into a live temporary server

Jupyter (and Python) is a REPL: Read-Evaluate-Print Loop

You might be familiar with a REPL – the BASH command line is one too!
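To make the acronym concrete, here is a toy REPL written in Python. It is only a sketch of the idea and nothing like how Jupyter or IPython are actually implemented. Instead of waiting at a prompt, it takes its “typed” lines from a list so that you can run it anywhere; the function name run_repl is made up.

```python
def run_repl(lines, namespace=None):
    """Read each line, evaluate it, and collect what would be printed."""
    namespace = {} if namespace is None else namespace
    outputs = []
    for line in lines:                      # Read
        try:
            result = eval(line, namespace)  # Evaluate an expression
        except SyntaxError:
            exec(line, namespace)           # statements like x = 2 have no value
            result = None
        if result is not None:
            outputs.append(repr(result))    # Print
    return outputs                          # the Loop ends when input runs out


print(run_repl(["x = 2", "x * 21", "'a' + 'b'"]))
```

Notice that x is remembered between lines because every line runs in the same namespace. A notebook kernel keeps state between cells in much the same spirit.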

Mapping out different uses

Note these are simplifications that aren’t 100% accurate – all models are wrong, but some are useful.

What you may be familiar with

What Jupyter notebook does (on your computer)

Basic structure

Writing output to a file

Reading a file with bash

Writing output to a file

Using many notebooks and kernels (on your computer)

Jupyter on a remote server

Diya Das and David DeTomaso -- Intro to BASH and the command-line shell

2018-01-31T00:00:00+00:00

In this session, we will attempt to teach Bash (i.e. the Unix shell, the thing that you have on your Mac that opens when you click Terminal) from an introductory to an intermediate/advanced level. Windows users: some versions of Windows 10 have a Linux subsystem, but you can also install Cygwin to follow along.

We’ll start from basics of the shell and attempt to get all the way to some advanced stuff (I’m being vague on purpose; we’ll stop and answer questions so that might determine our endpoint).

Invite your friends, especially those who are scared of command line interfaces but still want to know things! We’ll be in BIDS / 190 Doe Library starting at 5pm on Wednesday. As always, we’re a come when you can / leave when you need to sort of group, but we hope to see you there.

First Meeting -- What's on campus and what do we want to do this semester?

2018-01-24T00:00:00+00:00

Agenda

5:10 - Introductions // we’re on Berkeley time!
5:20 - Presentations
5:45 - Introduction to our GitHub repo
- How to edit the website
- Raising issues to request tutorials
6:00 - What do we want to learn and what do we want to teach?

Intro to Machine Learning with Scikit-Learn -- Qingkai Kong

2017-12-06T00:00:00+00:00

Goals of the workshop

In this session, I will give a quick overview of basic machine learning and an introduction to sklearn. The goals are:

Understand the basics of machine learning. We will cover classification and regression in this session.
Get familiar with the syntax of scikit-learn

After the workshop, you should be able to use popular models in your problems.

Tutorial material

Material for this session - Here

This tutorial is developed by Qingkai Kong

References

Mining scientific articles with Public Library of Science (PLoS) -- Elizabeth Seiver

2017-11-29T00:00:00+00:00

Link to presentation slides

CSV datasets

Have you ever wanted to learn how to mine the text and data from scientific articles? Come join us at The Hacker Within for a tutorial and mini-hackathon!

First will be a brief tutorial on the basic structure of XML documents, the JATS XML structure used by PLOS and other scientific publishers, as well as the XML parsing tools in allofplos, a Python library that downloads and parses PLOS articles. Then we’ll have some time to mine the corpus, contribute to the allofplos codebase, or whatever else you want to do with hundreds of thousands of research articles at your fingertips!
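As a warm-up, here is a sketch of parsing a tiny JATS-style document with the standard library’s xml.etree.ElementTree. The XML below is a made-up, heavily trimmed example rather than a real PLOS article, and this sketch does not use allofplos at all; it just shows the nested structure you will be navigating.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal JATS-style article (real articles are far longer)
SAMPLE = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>A Made-Up Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Roe</surname><given-names>Richard</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""

root = ET.fromstring(SAMPLE.strip())

# Walk down the tree with a path, one tag per level
title = root.findtext("front/article-meta/title-group/article-title")

# Or search the whole tree for every <name> element
authors = [
    "{} {}".format(name.findtext("given-names"), name.findtext("surname"))
    for name in root.iter("name")
]

print(title)
print(authors)
```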

Spots are limited, so please sign up here: https://www.eventbrite.com/e/plos-the-hacker-within-mining-scientific-articles-tutorial-hackathon-tickets-39877458552.

The tutorial portion will be broadcast live and recorded on YouTube. While a working knowledge of Python is helpful, we will also have .csv documents of allofplos’s metadata that can be parsed in R.

Pizza will be provided.

About the presenter

Elizabeth Seiver is a Researcher at the Public Library of Science, a non-profit Open Access publisher. She wrote the codebase for allofplos.

No THW on 11/22-- next meeting 11/29

2017-11-22T00:00:00+00:00

TBD -- TBD

2017-11-15T00:00:00+00:00

No THW this week -- next meeting 11/15

2017-11-08T00:00:00+00:00

Visualizations in R with ggplot2 -- Rebecca Barter

2017-11-01T00:00:00+00:00

In this session you will learn how to build impressive ggplot2 figures. To help along the way, I will teach you the grammar of graphics, the basic plot types available in ggplot2, and a plethora of ways to customize your figures (including, time permitting, making your own ggplot theme).

The Jupyter notebook containing the materials for this session can be found here: https://github.com/rlbarter/ggplot2-thw.

No THW this week -- next session on 11/1

2017-10-25T00:00:00+00:00

Utility Functions in R -- Diya Das

2017-10-18T00:00:00+00:00

This session will cover some useful R functions, with a focus on installing packages from various sources and managing environments. I’ll also present some customizations to RStudio that I’ve found helpful in my work.

A minimal background in R is recommended: be familiar with basic arithmetic and have previously installed a package in R.

Please make sure you have installed R and RStudio.

Using Jupyter Notebooks -- Stuart Geiger

2017-10-11T00:00:00+00:00

Some links and resources

Official Jupyter Documentation
Gallery of interesting Jupyter notebooks
IPython magic commands
IPython minibook tutorial
Jupyter Cheat Sheet
MyBinder.org – turn any GitHub repo with notebooks into a live temporary server

Jupyter (and Python) is a REPL: Read-Evaluate-Print Loop

You might be familiar with a REPL – the BASH command line is one too!
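To make the acronym concrete, here is a toy read-evaluate-print loop in a few lines of Python. It is only an illustration of the idea: tiny_repl is a hypothetical helper, it reads from a list instead of the keyboard, and it handles expressions only. It is not how Jupyter or IPython are actually implemented.

```python
def tiny_repl(lines):
    """Run each input line through a read-evaluate-print cycle."""
    namespace = {}   # names shared between inputs, like a notebook's kernel state
    outputs = []
    for line in lines:                    # Read: take the next input
        result = eval(line, namespace)    # Evaluate: run it as a Python expression
        outputs.append(repr(result))      # Print: record what would be displayed
    return outputs                        # Loop: ends when the input runs out


print(tiny_repl(["1 + 1", "len('abc')"]))
```

Each notebook cell plays the role of one "line" here: the kernel evaluates it and the notebook displays the result underneath.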

Mapping out different uses

Note these are simplifications that aren’t 100% accurate – all models are wrong, but some are useful.

What you may be familiar with

What Jupyter notebook does (on your computer)

Basic structure

Writing output to a file

Reading a file with bash

Writing output to a file

Using many notebooks and kernels (on your computer)

Jupyter on a remote server

Using GitHub in Open Source Software Projects -- Mark Mikofski

2017-10-04T00:00:00+00:00

Agenda

What is FOSS?
Why contribute to FOSS?
Different ways to participate
Nuts and bolts

Free and Open Source Software, FOSS or OSS

There are many definitions of “open source software” and even different names like “free and open source software”.

A Wikipedia post on open source software says the following:

Open-source software (OSS) is computer software with its source code made available with a license in which the copyright holder provides the rights to study, change, and distribute the software to anyone and for any purpose. Open-source software may be developed in a collaborative public manner. According to scientists who studied it, open-source software is a prominent example of open collaboration.[2] The term is often written without a hyphen as “open source software”.

GitHub’s Open Source Guide by Nadia Eghbal answers the question: What does “open source” mean? in a section called: Starting an Open Source Project.

When a project is open source, that means anybody can view, use, modify, and distribute your project for any purpose. These permissions are enforced through an open source license.

For more rigor, check out the Open Source Initiative (OSI) definition, but the bottom line is that open source code is free, as in free beer.

The importance of contributing to open source

Why do people create open source software? GitHub’s open source guide says, “There are many reasons”:

Collaboration: “Open source projects can accept changes from anybody in the world.”

Adoption: “Open source projects can be used by anyone for nearly any purpose. People can even use it to build other things.”

Transparency: “Anyone can inspect an open source project for errors or inconsistencies.”

Wikipedia discusses the “open source development model: advantages and disadvantages” (emphasis mine):

“Open source software is usually easier to obtain than proprietary software, often resulting in increased use.”

“Open source development offers the potential for a more flexible technology and quicker innovation.”

The OSI lists their reasons too (emphasis mine):

Developers: “Open source projects provide tremendous opportunities for developers to share and learn through collaboration.”

Business: “… enterprises have realized the promise of open source: higher quality, greater reliability, more flexibility, lower cost …”

Non-Profit: “… open source ethos of contribution & community helps make life for NPO & NGO staffers easier”

Google has also recently published their open source guidelines.

Ways to contribute

There are many ways to find and contribute to open source. Here are a few …

Using GitHub for Open Source projects

GitHub is an ideal tool for open source projects for many reasons. It’s free for open source projects. The issue, pull request and review tools make contributing to open source much easier. And other tools like a wiki, issue or pull request templates, and automatic detection of licenses, contribution guidelines, and codes of conduct are also very useful.

The license

Whether you are using, creating or contributing to open source, it’s useful to have a basic understanding of licenses. According to OSI there are at least 9 common licenses. GitHub created choose a license to help users choose and create a license. There are even licenses for works of art and prose by Creative Commons for use in blogs and other online creations that aren’t necessarily computer code.

Code of Conduct and Contribution Guidelines

You want to read these and follow them.

Issues

One of the easiest ways to contribute to open source is to create an issue. Issues can be technical, code-related or an improvement to the documentation. There is no issue too big or too small, and never any dumb questions, only dumb answers. However try to empathize with the other users and maintainers when reporting issues. They may be overwhelmed by a deluge of issues, and they are typically volunteering their precious free time. So a little preparation or ground work before submitting an issue will go a long way to getting the issue resolved.

Try to solve the issue yourself. Spend a reasonable amount of time on this to show that you’ve done your research.
- Check if the open source project has a Google group or a Slack or IRC channel and search for common questions or issues you have. Ask for help from the forum.
- Ditto for StackOverflow.
If there are submission guidelines or an issue template, read and follow them carefully, and complete all sections as thoroughly as possible.
- Include in your issue something that approaches a minimum complete verifiable example of your issue.
- It should go without saying, but be polite, respectful and constructive. Assume Good Faith
Scratch your own itch. Follow your issue with a pull request.

Pull Requests

Pull requests (PR’s) are one of the most useful keys to contributing to open source. With a few exceptions, PR’s are how most open source projects receive contributions. A PR is not a Git feature; a PR is a feature of GitHub and other online hosted repositories. A PR is defined by GitHub as follows:

Pull requests let you tell others about changes you’ve pushed to a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before the changes are merged into the repository.

I wrote a blog post called winning workflow about how we use PR’s in my team to collaborate.

Step 1: Fork the repository

The first step in contributing to an open source project should be to fork the repository. Forking a repository allows you to create pull requests for your contributions. From the main GitHub page for the project find the fork button and select your personal GitHub profile as the location for your fork.

Step 1-1/2: The shortcut

You can work, commit and submit a PR directly from GitHub by editing and creating new files directly in GitHub online. Make sure to select that you want GitHub to create a new branch and submit your PR when you commit your work.

Create a new branch for this commit and start a pull request.

Then for future commits you would commit directly to the “patch-N” branch created by GitHub for your pull request.

Commit directly to the patch-1 branch.

In fact this is exactly the shortcut I’m using to edit this file. However there are some limitations to this approach. You may not be able to upload images this way, but you can start with this shortcut and then continue with the remaining steps anytime.

Step 2: Clone your fork

The second step is to use Git to clone the fork you just created from GitHub.

me@mycomputer ~/projects
$ git clone git@github.com:me/oss-proj-fork.git  # your url might be https://github.com/me/oss-proj-fork.git

This copies the repository to your computer where you can work on it.

Step 3: Add “Upstream” Git Remote

The third step is to add a Git remote to the original open source project which you forked on GitHub. For convenience sake we’ll call this the “upstream” repository, but call it whatever you want.

me@mycomputer ~/projects/oss-proj-fork (master)
$ git remote add upstream git@github.com:oss-people/oss-proj.git  # your url might be https://github.com/oss-people/oss-proj.git

Step 4: Make a feature branch

The fourth step is to check out a new feature branch. This is a short-lived branch with a descriptive name, and it is the easiest path to submitting a PR because it has several advantages.

me@mycomputer ~/projects/oss-proj-fork (master)
$ git checkout -b my-feature-gh99  # put the issue number if there is one

Your branch name serves as a quick description of the feature or issue.
It’s easier to sync your feature branch with master if your feature takes a while to finish and other features get merged upstream before you’re done.
If the upstream project chooses to rebase and squash your work into a single commit, your feature branch can serve as a history of the changes you made, although typically after a PR is merged, the feature branch can be deleted.

Step 5: Make a test

The fifth step is to make a test using the unit test framework that the upstream repository uses. This test serves as the minimum acceptance criteria for the new feature. Testing in code development is very important. Tests ensure that a project is working as intended and, when issues arise, tell the maintainers and contributors exactly where the problem is. Often repositories are integrated with online build and test servers, called “continuous integration” or CI, that test every PR commit. These are helpful for communicating the state of the PR to collaborators.

There are several established unit test frameworks, including Python’s own built-in unittest module; however, most projects use either nose or pytest. If you can’t figure out what the maintainers use, then use pytest and simple assertions. Don’t be surprised if they ask you to adopt their own specific paradigm. Be flexible. This is an opportunity for you to collaborate with the maintainer and learn something new.

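from oss_proj.core.new_feature import new_calc  # the feature under test (not written yet)
import numpy as np
import pandas as pd
import os

# the known good values live in a CSV file next to this test module
BASE_DIR = os.path.dirname(__file__)
NEW_FEAT_TEST_DATA = os.path.join(
    BASE_DIR, "new_feature_test_data.csv"
)
A, B, C = 1, 2, 3  # example inputs for the new calculation
KNOWN_GOOD_VALUES = pd.read_csv(NEW_FEAT_TEST_DATA)

def test_new_feature_calculation():
    # the new calculation should reproduce the known good values
    calculated_values = new_calc(A, B, C)
    assert np.allclose(calculated_values, KNOWN_GOOD_VALUES)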

If you run the testrunner, pytest, from the command line now, your test will fail. Don’t cry, this is OK. Failure is not bad. We will work on this until it passes, but not yet. First we have to commit our changes and push them up to our fork.

me@mycomputer ~/projects/oss-proj-fork (my-feature-gh99)
$ git add oss-proj-fork/oss_proj/core/new_feature.py  # we may need to add files first
$ git commit -m "add test for new feature to fix #99"  # if there's an issue you can refer to it
$ git push -u origin my-feature-gh99

Step 6: Create a Pull Request on GitHub

Now is when you submit the PR! Not after you’ve done a bunch of work and find out that someone else already solved the problem, or that the maintainers don’t like your approach because it doesn’t fit into their long-term plan. But NOW, as soon as you start working, so that everyone else has a chance to collaborate with you on your cool new feature.

To create a PR for your feature, go online to GitHub. When you view either your fork or the upstream repo, you should see a message from GitHub that asks you if you want to create a PR for your new feature. Click it, and add some descriptive information about your plans for the feature, what you intend to do, if it relates to any issues, how long it might take, what help you need, etc. Then click submit.

If GitHub doesn’t automatically ask you, go to either your fork or the upstream repo and click New Pull Request. Then choose the upstream repo as the “base fork” and set the “head fork” to your feature branch.

Step 7: Hack, Communicate, Repeat

Now comes the fun. Hack! Communicate via the pull request with the other contributors. Collaborate and hack some more. Finally let them know when all of your changes are complete, your tests are all passing and you’re ready for the maintainers to review and merge your new feature. This may take many iterations. Be patient! Assume Good Faith

Conclusion

That’s it. There may be some nuances and differences between projects. There are some projects that want you to email patches, but that’s a subject for another discussion. Also you may be asked to add or update documentation. Often there is an AUTHORS file for contributors, feel free to add yourself or ask if you should. Also there may be a changelog that you should contribute to. Or you may choose to contribute to the wiki instead of the codebase. Communication is the key to finding out about all of these loose ends. Keep the channels open, stay positive and enjoy.

Cool Video

This is a cool video of Brett Cannon

No meeting this week -- THW postponed until Oct 4

2017-09-27T00:00:00+00:00

Version Control with git -- Mitch Negus and Yu Feng

2017-09-20T00:00:00+00:00

If you don’t have git installed:

Download for Windows (includes bash & git)
Download for Mac OS X
Linux: sudo apt-get install git

Git this

What happens when you don’t use version control?

A general life example:

You most likely started out doing something like this. Maybe you’ve become more sophisticated (or not) and now you

date files
append _vXXX

This is good, but you can still do better.

Why version control is amazing

Code that works will be saved permanently
If you break the code that works, reverting is easy
You still only need to keep one version around (the VC program does the rest)
Collaboration is kept smooth and coordinated
Productivity is not stifled by too many people working on one document

Git

Git is the method of version control we’re going to be working with. Other methods exist (SVN and Mercurial are the big names, and Wikipedia’s page history is a more commonly known example).

Git tracks your files by essentially taking a snapshot of a directory or subdirectory structure and saving it over time. The process of capturing one of these snapshots is called a commit.

To be a little more specific, you can think of Git as working in three different areas or bins. There is (1) the workspace, (2) the repository, and (3) the index.

Workspace:

The set of directories and files that you actually operate on.

Repository (repo)

The set of linked commits corresponding to snapshots of the workspace at specified points in time.

Index

The staging area where you “set up your snapshot”; files in the index that have changed since your last commit will be updated on your next commit.

Hashes

Each commit is given a unique label (called a hash) and is tied to the previous commit. This is created using a function which converts a set of information (i.e. your docs) into a string of letters and numbers. Hashes include a layer of security, since each relies on its parent’s hash.
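The chaining idea can be sketched with Python's hashlib. The first function follows the way Git names file contents in a default (SHA-1) repository: it hashes a short "blob" header plus the bytes of the file. The second, toy_commit_hash, is a deliberately simplified, hypothetical stand-in for a real commit object (real commits also record the author, dates, and a tree of files), so treat it as an illustration rather than Git's actual format.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash file contents the way Git names a blob: SHA-1 of a header plus the bytes."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def toy_commit_hash(snapshot_hash: str, parent_hash: str, message: str) -> str:
    """Simplified commit label: depends on the snapshot AND on the parent's hash."""
    payload = f"tree {snapshot_hash}\nparent {parent_hash}\n\n{message}\n".encode()
    return hashlib.sha1(payload).hexdigest()


# Two commits in a row: the second one records the first one's hash
first = toy_commit_hash(git_blob_hash(b"v1\n"), "", "first commit")
second = toy_commit_hash(git_blob_hash(b"v2\n"), first, "second commit")

# Tampering with the first snapshot changes its hash, and therefore every later hash
tampered_first = toy_commit_hash(git_blob_hash(b"v1 tampered\n"), "", "first commit")
second_after_tamper = toy_commit_hash(git_blob_hash(b"v2\n"), tampered_first, "second commit")

print(first[:7], second[:7])          # short labels, like the ones in the diagrams
print(second != second_after_tamper)  # True: history cannot be edited silently
```

Because each label depends on its parent's label, changing anything in the past changes every hash that comes after it.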

Diagrammatic Representation

Below is a diagram of the git process for a single file; I’ve named it file_1. You start by creating the file in your workspace.

Once you are satisfied with its progress, you decide that you want to commit your work. You add the file to the index.

Now you’re ready. You can commit the file to your repository. (Notice that in the repository file_1 is shown with the hash of the commit. In reality the full hash for a git commit is much longer.)

Now, if you make some changes to file_1, your workspace changes.

You add the changed file_1 to the index.

And again, you finally commit. The new commit becomes the most recent commit, and the previous one moves deeper into the repository’s history.

And the process repeats, on and on, continually building up your repository’s history.

Your turn to Git started

First we want to make a directory to track as the repository for this tutorial. Go somewhere in your file system and create this directory. Then navigate inside the newly created directory. (In the following code snippets, a $ indicates a command line prompt.)

$ mkdir git_tutorial
$ cd git_tutorial

Once you’re inside, it’s time to initialize the repository. Intuitively, this is done with the command

$ git init

Now, since this is likely the first time you’re creating a Git repo, you may want to set up some Git configurations. Feel free to skip this step (it is optional) but if you don’t do it now, Git will likely ask you for this information repeatedly in the future.

$ git config --global user.name "OskiBear"
$ git config --global user.email "oski@berkeley.edu"
$ git config --global core.editor /usr/bin/nano

Don’t worry, you can change this information later. Because we used the --global option, it is stored in the .gitconfig file in your home directory (settings made without --global go in the hidden .git/config file inside the directory where you ran git init). You can also use git config --list to print all of the configuration options for easy viewing.

If you already have a GitHub account, it is best to use the same username and email for both Git and that account. Additionally, if you have a preferred editor (vim, emacs, sublime, etc.), feel free to use that instead of nano as your default.

First off, we’ll make a very simple program. Open a new file called hello.sh and add the following inside:

echo "Hello, World!" > hello.txt

Save the script, and exit the text editor.

We can now see the options that Git provides by using just the git command, with no options or arguments. It should show the git usage statement, providing descriptions of the most commonly used git commands.

At this point we will follow the steps outlined above. First, we will add our new file to the index, as we prepare to save a snapshot of it to our repository. This is straightforward enough; the command is

$ git add hello.sh

If all goes well, nothing should be displayed on the console. If you want to check that your change was added to the index, type

$ git status

Git should then let you know that a new file, hello.sh has been staged, and is ready to be committed. When you are ready, commit the changes with the git commit command. Adding the -m option allows you to give a short message explaining the changes.

$ git commit -m "created a new repository containing a simple script"

Let’s keep practicing. Now make another, slightly more interesting program. This is a Python script that uses Monte Carlo rejection sampling to estimate the value of $\pi$. Open a file called picalc.py and include the following code inside.

import numpy as np

N = 1000
X = np.random.random(N)
Y = np.random.random(N)

scores = []
noscores = []
for n in range(N):
	x,y = X[n],Y[n]
	if x**2 + y**2 < 1:
		scores.append([x,y])
	else:
		noscores.append([x,y])
	if n%10 == 0:
		print(4*len(scores)/(len(scores)+len(noscores)))

Again, save and exit the editor, and now follow the same steps as before. First, add the new file; then, commit the changes and include a brief message.
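If you are wondering why picalc.py multiplies by 4: the points are drawn uniformly from the unit square, and the ones that "score" fall inside a quarter circle of radius 1, whose area is $\pi/4$. The fraction of scoring points therefore approaches $\pi/4$. The sketch below repeats the same estimate using only the standard library; the function name estimate_pi and the fixed seed are assumptions added here so the result is repeatable.

```python
import random


def estimate_pi(n_points, seed=0):
    """Estimate pi from the fraction of random points landing inside the quarter circle."""
    rng = random.Random(seed)      # seeded so that repeated runs give the same answer
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()   # a random point in the unit square
        if x**2 + y**2 < 1:                 # inside the quarter circle of radius 1
            hits += 1
    return 4 * hits / n_points     # quarter-circle area is pi/4, so scale by 4


print(estimate_pi(100_000))
```

The estimate improves slowly: the error shrinks roughly like one over the square root of the number of points, so 100 times more points buys about one more correct digit.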

Let’s Git a bit more complicated

At this point, our repository is starting to get interesting. We can see how things are evolving in our repo with the command

$ git log

You will see a history of each commit, from the most recent commit at the top to the oldest commit at the bottom.

Open picalc.py in an editor once again and change the script by adding the following lines:

Below import numpy as np, add
```
from matplotlib import pyplot as plt
```

At the bottom of the program, add (unindented)

scores = np.asarray(scores)
noscores = np.asarray(noscores)
plt.plot(scores[:,0], scores[:,1], 'bo', noscores[:,0], noscores[:,1], 'ro')
plt.show()

Save the script and exit the editor. Now, the version of picalc.py in our working directory is different than the version in the index (and by extension, in the repo). We can see the changes quickly by typing

$ git diff

on the command line.

Now, we’re going to take things even further. The really powerful parts of Git are used when two people are collaborating on a project (or when you are trying to multitask–working on two aspects of a single project independently from one another).

To see this, we are going to have you “break” one of your programs. To solve the issue, you are going to create a new branch where you go back and correct the problem. Now, the branch with the fix will be different from the original. When you try to bring the branches back together into the original “master” branch, the changes will conflict, and you will have to resolve the conflict by hand to complete the merge. (If the changes didn’t conflict, for example if you just added an extra line of code, Git is smart enough to notice this fact and merge the documents automatically.)

Open the hello.sh script and change the echo statement to be more like the original “hello, world” scripts

echo "hello, world" > hello.txt

Since we haven’t created a new file here, we can use a handy shortcut to avoid having to type both git add and git commit. Type

$ git commit -am "message of your choice"

The new option -a adds all modified documents to the index automatically before the commit is enacted.

If you run the program again, the file hello.txt will contain the statement in all lowercase.

Since this is improper English, we decide to change the shell script back to the way it was before. This time though, we’re going to work on a new branch. Creating a new branch allows us to make changes to the code, including as many commits as we’d like, without actually modifying the original “master branch” of the code. Let’s try it.

$ git checkout -b englishfix

The git checkout statement allows you to extract files and branches from the repository. Since our branch had not already been created, we needed to use the -b option. If the englishfix branch already existed, we could omit the -b.

Now we are on the englishfix branch. Let’s go ahead and fix the script, changing hello, world to Hello, World. Commit the changes. Now, if you use git checkout master (master is automatically named when you use git init) and look at the script (type cat hello.sh) you will see that your changes on the englishfix branch were not transferred. This is super useful if you have a working version of a piece of code and want to add a new feature without taking the risk of breaking your code in the meantime.

What if you had been working on the script in the meantime? Still on the master branch, open the script, and add the exclamation point back in. Change “hello, world” to “hello, world!”, save and exit the editor, and commit the change so that master and englishfix now differ on the same line.

Now, since we did want to incorporate the changes on the englishfix branch into the master branch, we should merge the englishfix branch into the master branch. Type

$ git merge englishfix

You should get an error (which includes a culprit file). Even though Git is usually smart enough to perform automatic merges, when a line of code is edited two different ways on two different branches, it doesn’t know what to make of the situation. The best solution: give it back to a human and let them make some sense out of it.

To resolve the conflict, use git diff to have the differences between the conflicting documents printed to the console, and then use an editor to fix the discrepancy. Finally, add and commit the changes, and then the merge is complete!


Bonus!

Configurations

As we mentioned earlier, you can manually edit the configuration file to update your Git settings. To do this, move to your home directory and open the file .gitconfig.

.gitignore

You can also tell git to ignore specific files by adding them to a .gitignore file. Find (or create) this file as .gitignore in the top level of the repo where you want it to apply.

Note you can use wildcards in these filenames! (e.g. *.log will ignore all files ending in .log)
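To get a feel for how such a wildcard selects files, here is a small sketch using Python's fnmatch module. This only approximates simple filename patterns; Git's own matching rules are richer (directory patterns, negation with !, and so on). The pattern and file lists are made up for illustration.

```python
from fnmatch import fnmatch

# Hypothetical .gitignore entries: one wildcard pattern and one exact filename
patterns = ["*.log", "hello.txt"]

# Hypothetical contents of the tutorial directory
filenames = ["hello.sh", "hello.txt", "picalc.py", "debug.log"]

# A file is ignored if it matches at least one pattern
ignored = [name for name in filenames if any(fnmatch(name, p) for p in patterns)]
tracked = [name for name in filenames if name not in ignored]

print("ignored:", ignored)
print("tracked:", tracked)
```

Generated files such as hello.txt (the output of hello.sh) are typical candidates for .gitignore, since they can always be recreated from the tracked script.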

Other things:

git checkout
git stash 
git reset
git rebase
gitk

Git Cheat Sheet

You can often think of the operations that Git performs on the three “areas”–workspace, index, and repo–as mathematical equations. Here are some examples (for each Git command, perform the steps in order):

git add

$ \text{staged index} = \text{workspace} - \text{current branch} $

git commit

(with -a option: $ \text{staged index} = \text{workspace} - \text{current branch} $)

$ \text{new commit} = \text{staged index} $

$ \text{current branch} = \text{new commit} $

$ \text{staged index} = 0 $

git checkout

$ \text{new workspace} = \left(\text{workspace} - \text{old branch}\right) + \text{new branch} $

git stash

$ \text{stash} = \text{workspace} - \text{current branch} $

$ \text{workspace} = \text{current branch} $

git reset --hard

$ \text{workspace} = \text{current branch} $
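One possible way to read these "equations" is with plain Python dictionaries mapping filenames to contents. This is a toy model under loud assumptions: the helper names (changes, git_add, git_commit, git_stash) are invented here, "minus" is read as "files that differ from the branch", and deletions are ignored. Real Git stores full snapshots and is considerably more subtle.

```python
def changes(workspace, branch):
    """'workspace - current branch': files whose contents differ from the branch tip."""
    return {name: text for name, text in workspace.items() if branch.get(name) != text}


def git_add(workspace, branch):
    return changes(workspace, branch)                 # the staged index


def git_commit(branch, staged):
    new_commit = {**branch, **staged}                 # branch tip with staged changes applied
    return new_commit, {}                             # new branch tip, emptied index


def git_stash(workspace, branch):
    return changes(workspace, branch), dict(branch)   # the stash, and a clean workspace


# Toy state: one committed file, one edited file, one brand-new file
branch = {"hello.sh": 'echo "Hello, World!"'}
workspace = {"hello.sh": 'echo "hello, world"', "picalc.py": "import numpy as np"}

staged = git_add(workspace, branch)
branch, staged = git_commit(branch, staged)
print(changes(workspace, branch))   # {} because nothing is left to commit

# Edit a file again, then stash the edit away
workspace["hello.sh"] = 'echo "hello, world!"'
stash, workspace = git_stash(workspace, branch)
print(sorted(stash))                # ['hello.sh']
```

In this model, git reset --hard is simply the last half of git_stash: the workspace is overwritten with the branch tip and the changes are thrown away instead of being saved.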

Install Party -- Aaron Culich and Stuart Geiger

2017-09-13T00:00:00+00:00

The Hacker Within Install Party!

Installing all the things is always a pain, so why don’t we try and get as much as we can out of the way all at once? So for next week’s The Hacker Within (Sept 13th), we will be having an install party, where we will all try and help each other get various kinds of programming languages, libraries, development tools, and package environments installed. Come if you need things installed or can help others install things.

The plan

Your session leaders (Aaron and Stuart) are still working out how we ought to organize the install party, but we think this might work best as a series of lightning talks that could split into different groups and one-on-ones as appropriate, rather than a single linear session where everyone does the same thing. I imagine that we’ll be spending some time trying to debug each other’s environments when the official instructions don’t work. :) So we’re gathering info about what people want to install, where they want to install it, and who can help with what.

First Meeting -- What do we want to learn and teach?

2017-09-06T00:00:00+00:00

We usually have a “what to learn and teach” session the first week of the semester. This is a nice time for us to get together, do a round of introductions, see what we want to learn and teach, and then try to set as much of the schedule for the semester as we can. I’ll also be sharing results from the topics survey, which are here in this Jupyter notebook.

Google doc for taking notes here

No THW -- BIDS Data Science Faire

2017-05-02T00:00:00+00:00

BIDS Data Science Faire: https://bids.berkeley.edu/events/bids-spring-2017-data-science-faire

Mapping and geospatial data -- Brian Hamlin

2017-04-25T00:00:00+00:00

Visualization in Python -- David DeTomaso

2017-04-18T00:00:00+00:00

Link to the repo containing the presentation notebook:

Repo Link

Clone the repo to follow along and open up the “Plotting in Python.ipynb” notebook.

Visualization in R -- Diya Das

2017-04-11T00:00:00+00:00

Please clone the repo at https://github.com/diyadas/tutorials

Containers with Docker -- Tony Kelman

2017-04-04T00:00:00+00:00

Tony is going to be using dply.co, which lets you set up a free cloud server for 2 hours, to walk us through containers. If you want to walk through it on your own laptop, you need to have a GitHub account, an SSH key, and the SSH key linked to your GitHub account (see this help page for instructions on that).

If you can create a server on dply.co and connect to it with the SSH key, then you’ll be good to go. If not, come a few minutes early and we can help you get set up.

Talk slides are available here.

Spring Break -- no meeting

2017-03-28T00:00:00+00:00

Neural Networks using Transfer Learning with Caffe -- Maryana Alegro

2017-03-21T00:00:00+00:00

Overview

Repository for a tutorial at THW, Berkeley on Caffe.

Running the tutorial

You can run the tutorial Jupyter notebooks:

locally on your computer: The easiest way is to run the Caffe Docker image. After installing Docker, type
```
docker run -ti -p 8888:8888 bvlc/caffe:cpu
```

Inside the container, install jupyter

pip install jupyter

Clone this repository and start Jupyter by typing

git clone https://github.com/mary-alegro/caffe_tutorial_thw
cd caffe_tutorial_thw
jupyter notebook --ip 0.0.0.0

Copy and paste the URL Jupyter outputs in your browser. You should now be able to access the notebook running inside the container.

Data tidying in R & Python -- Diya Das and David Detomaso

2017-03-14T00:00:00+00:00

For this tutorial, clone the github repo at https://github.com/diyadas/tutorials

Documentation and Continuous Integration in Python with Sphinx and Travis CI -- Nelle Varoquaux, Chris Holdgraf, Matthias Bussonnier

2017-03-07T00:00:00+00:00

Documentation and Travis

Welcome to this special session of The Hacker Within Berkeley, which will take place at the usual BIDS location but during the Docathon event that spans the week of March 6 to 10.

During the talks on Monday the 6th, you had a quick overview of Sphinx, RMarkdown, and how Travis CI can be used to deploy documentation.

Today we’ll get our hands dirty and try to deploy this ourselves using GitHub, Travis, and GitHub Pages, as well as describe what to do (and not to do) when doing so.

Requirements

The requirements are minimal, and the Hacker Within session itself should give you enough time to meet them, though getting these set up in advance will help you follow along.

get a GitHub account
Log in to Travis CI with your GitHub account

If possible:

install the travis ruby gem on your machine ($ gem install travis should be enough)
have doctr installed on your local machine.

High level overview

Understanding how to deploy documentation from Travis requires a minimal understanding of how Travis works.

In particular, we will discuss the safe ways to store credentials in the .travis.yml file, what to do and what not to do, and when these credentials get decrypted and when they do not.

We’ll set up a repository that deploys itself to GitHub Pages when pushed to master.

Visualization with D3.js -- Caroline Cypranowska and Luc Guillemot

2017-02-28T00:00:00+00:00

d3_fretgraph

D3 tutorial for building an animated line graph (with real FRET data) for The Hacker Within at UC Berkeley on February 28, 2017.

Intro to D3

Luc’s slides on the fundamentals of D3 (with code examples) are posted here.

How to prepare for this tutorial

Download and install Brackets
- (This is Caroline’s preferred tool for building visualizations with D3, but isn’t strictly necessary. It has a nice live preview feature that is handy if you’re building these visualizations to go on a webpage.)
Fork (or download) Caroline’s d3_fretgraph repository
- It has a template in the main directory that we’ll use to write our code, our raw data in a .csv file in the /data directory, a minified version of D3 in the /d3 directory, and a finished version of the visualization in the /finished_version directory
For Luc’s code example, navigate to this webpage and open developer tools. Click on the ‘Sources’ tab to grab the contents of the ‘d3-hackerwithin’ directory.

What is D3?

D3 stands for Data-Driven Documents, and is a JavaScript library for building interactive data visualizations to display on the web. It was developed primarily by Mike Bostock, his PhD adviser Jeffrey Heer, and Vadim Ogievetsky (Bostock, Ogievetsky & Heer, IEEE Trans. Visualization & Comp. Graphics, 2011).

D3 is notoriously challenging because it requires knowing a bit about JavaScript, a bit about HTML/CSS, and a bit about SVG. The goal with this workshop is to help you get a good enough sense of how D3 works to explore on your own.

D3 visualizations are built around binding data to HTML or SVG elements

What the heck does binding even mean? The idea here is that if you have a bunch of data and you want to use those data to manipulate elements on your webpage, then you need a way to select those elements and associate (or ‘bind’) your data to them.

Here’s an example of how to do this:

var sample = [1, 2, 3, 4];

d3.select('body').selectAll('p')  // this selects all paragraph elements within the body of your HTML file; if you don't have any <p> elements on your page then this is a virtual selection
  .data(sample)                   // this binds your data variable to your selection
  .enter()                        // THIS is the magic of D3! This method allows you to create NEW elements on the webpage based on your data
  .append('p')                    // for each datum in your variable, D3 will append a new <p> element to your page
  .text("I'm a paragraph!");      // the text in each newly created <p> element
  

If you were to put this code between <script> tags in an HTML document and then view it in a browser, you would see a page with 4 <p> elements with ‘I’m a paragraph!’ in them. If you then open your web inspector and run console.log(d3.selectAll("p")), you will see that each element has a __data__ property whose value is the corresponding value in sample.

The way you then manipulate elements on your HTML document is by writing functions that take those data as arguments and change some kind of attribute of the selected element.
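To make the binding idea concrete, here is a minimal sketch of what a data join does, written in plain Python rather than D3 itself. It assumes data and elements are paired by position, which is how D3 behaves when no key function is given. The function name join_by_index and the dictionaries standing in for <p> elements are made up for illustration and are not part of D3's API.

```python
def join_by_index(elements, data):
    """Pair existing elements with data by position (illustrative model only)."""
    n = min(len(elements), len(data))
    update = list(zip(elements[:n], data[:n]))  # existing elements that receive a datum
    enter = list(data[n:])                      # data with no element yet: new elements needed
    exit_ = list(elements[n:])                  # leftover elements with no datum
    return update, enter, exit_


# A page with no <p> elements and four data points: everything lands in "enter".
update, enter, exit_ = join_by_index([], [1, 2, 3, 4])

# Stand-in for .append('p').text(...): one new "element" per entering datum.
new_paragraphs = [
    {"tag": "p", "__data__": d, "text": "I'm a paragraph!"} for d in enter
]
print(len(new_paragraphs))  # 4
```

Each stand-in element keeps its datum under __data__, mirroring what the web inspector shows for real elements after a join.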

Showing things to scale

One of the other important D3 concepts is scale. For example, if you wanted to draw a circle on your document representing the US GDP ($18.56 trillion), you wouldn’t want a circle that has a diameter of 18.56 trillion pixels. D3’s scale functions (d3.scale.linear() in version 3, d3.scaleLinear() in version 4) map values from your data onto the size of the graphic that you want to create. We’ll discuss this more when we build our example.
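The arithmetic behind a linear scale is simple enough to sketch by hand. The following is a plain-Python illustration of the idea, not D3 code: linear_scale is a hypothetical helper, and the 0 to 200 pixel output range is an arbitrary choice for the example.

```python
def linear_scale(domain, output_range):
    """Return a function mapping values in `domain` linearly onto `output_range`."""
    d0, d1 = domain
    r0, r1 = output_range

    def scale(value):
        fraction = (value - d0) / (d1 - d0)  # position of value within the domain
        return r0 + fraction * (r1 - r0)     # same position within the output range

    return scale


# GDP in trillions of dollars mapped onto a diameter between 0 and 200 pixels.
gdp_to_pixels = linear_scale(domain=(0, 18.56), output_range=(0, 200))
print(gdp_to_pixels(18.56))  # 200.0
print(gdp_to_pixels(9.28))   # 100.0
```

The domain describes the data and the range describes the drawing, so changing the size of the graphic only means changing the range.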

You don’t need to reinvent the wheel

There are tons of resources for learning D3 and for perusing code blocks created by other people.

Online learning resources

D3 documentation
Aligned Left
Dashing D3 – not all content on this site is free

Example galleries

Official D3 Gallery
https://bl.ocks.org/
http://christopheviau.com/d3list/gallery.html

Fancy examples

http://www.facesoffracking.org/data-visualization/
http://www.koalastothemax.com/

Git and GitHub -- Ciera Martinez and Matthias Bussonnier

2017-02-21T00:00:00+00:00

Git and Github

Introduction to Git and GitHub

Whether you are lost in the woods trying to save a bear cub stuck in a tree, or defending Earth against an alien invasion, git is a tool of choice to collaborate and save your progress, so you can come back in time and save the day again if needed.

Still, using git (and GitHub) can be quite intimidating or look like dark magic. We will gently introduce you to simple git concepts, from “just memorize these shell commands” to some dark voodoo that lets you do a 66-way Cthulhu merge.

We will learn whether Linus Torvalds (the creator of Git) actually said the following statements, and whether they have a bit of truth in them:

“all meaningful operations can be expressed in terms of the rebase command”

“[git is] so hard to use, but that turns out to be its big appeal”

It is true that the actual manual pages can be hard to distinguish from Markov-chain text, but you probably don’t need to dive into them now.

What we’ll do

The basics

We’ll start pretty soft. Make sure you have git installed, and that it works.

We’ll make sure you know enough of the basics to use git on your own and to be ready to collaborate.

Clone a repository
Fork a GitHub repository
Create a repository from scratch
Make a commit
Make a branch
Create a Pull request on GitHub
Update your local repository

What is the difference between github and git?

Git

A lightweight version control system to track changes made to a project through time. There are many ways to use Git on your computer.

The main ways are:

Command Line - typing commands into a terminal (Mac)
GitHub desktop - GUI
In RStudio - GUI
SourceTree - GUI

Suggestion: Command Line

The command line is the most popular way to use git, so it is easy to get help. If you know how to run the command line version, you can probably also figure out how to run the GUI versions, while the opposite is not necessarily true. Don’t let your inexperience with the command line stop you; you only need to learn the very basics of Unix to use git.

Github - Remote Hosting

While Git stands alone as a system, Github is a website that hosts your project and its Git history. You can use it for collaboration, back-up, sharing, and learning. Github is just one of many places to host repositories.

The main ways are:

Suggestion: Github

The benefit of Github is that it is the most popular and has many tools to make it easy and fun to use. The main downside is that it does not allow free private repositories.

Why use Git?

Allows you to store versions (properly)
Makes you fearless
Restoring Previous Versions
Collaboration - Git allows groups of people to work on the same documents (often code) at the same time, and without stepping on each other’s toes (from tryGit).
Backup
Build easy to maintain websites

Learning Git

Learning git well is hard, but I would say only 5% of people who use git know exactly what they are doing.

Why is learning git hard?

Vocabulary is not intuitive and differs depending on the tool you use. Here is a cheatsheet for common vocabulary
Git is complex, with many ways to approach using it.
Git becomes more complex when working on a team, because there must be rules for how to collaborate and these rules differ depending on the team. You can learn how a team collaborates usually from a file in the project directory called CONTRIBUTING.md. Example contributing file: CONTRIBUTING.md file for ggplot2

Demo (Beginner)

Requirements

Git

Try to have git installed on your laptop before coming to The Hacker Within. If you are on Windows we recommend git-bash, which should be bundled with GitHub for Desktop.

Git should be bundled with recent Macs; you can also install it with GitHub for Desktop or Homebrew.

Linux users probably already have git installed, or know how to install it with their favorite package manager.

Activity

Basically we are all going to make a small edit to a file in a repository using basic git commands. Here is an overview with many of the commands we will use:

Go here: https://github.com/iamciera/THW_attendence
Press the Fork button (you’ll need a Github account)
In your terminal, execute git clone https://github.com/YOURUSERNAME/THW_attendence. Make sure you replace “YOURUSERNAME” with your Github name. For example mine is iamciera.
Enter the new directory with cd THW_attendence
Add the original remote repo with git remote add upstream https://github.com/iamciera/THW_attendence
Fetch information about the remote with git fetch upstream
Now check which branch you’re on with git branch. Make sure you are on the master branch.
Now we are ready to edit the file. Open the README.md file and add your name to the list, under the header for the letter your first name starts with. This is so we avoid merge conflicts.
Commit your change with git commit -am "I added files for the tutorial on my topic..". NOTE: -a tells git to stage every change to files it already tracks (brand-new files still need git add), and -m means you want to include a commit message.
Git push to your origin (your repo on Github) with git push origin master
Navigate in your browser to: https://github.com/YOURUSERNAME/THW_attendence and press the pull request button.

Demo (Advanced)

Advanced tactics!

Narrow down a bug? Let’s bisect. Want to hide your mistakes? rebase/amend. Erased a mistake from history that was not a mistake? reflog to the rescue.

Blips and Chitz!

Git is no fun without all the configuration options and tricks that make your life easier.

Check out a PR by its number? oowee! Diff words instead of lines? Can doooo! Local and global gitignore? Sure!

DON’T PANIC

Even if git looks insanely complicated to operate, and partly to keep intergalactic travelers from panicking, we’ll discuss what to do when things go south.

Long story short, if you are really scared, keep calm, git add -A and commit (and push). Nothing is ever lost.

What happens when something breaks? If you are in a “detached HEAD” state, hit a merge conflict, or anything else? We’ve got you covered!

Resources

Examples of how I use Github

SOM Tutorial: To host tutorials
My Website: To host website
Example Manuscript Repo: Host code for my papers
http://ropensci.github.io/reproducibility-guide/: Build things with strangers
Eisen Lab Github: Collaborate with lab members.

Learning Git

Software Carpentry Version Control lesson
You can train in your browser!
Spoon-Knife : https://github.com/octocat/Spoon-Knife

Adventure time prompt

Inspired by Stack Overflow

function we_are_in_git_work_tree {
    git rev-parse --is-inside-work-tree &> /dev/null
}

function parse_git_branch {
    if we_are_in_git_work_tree
    then
        local BR=$(git rev-parse --symbolic-full-name --abbrev-ref HEAD 2> /dev/null)
        if [ "$BR" == HEAD ]
        then
            local NM=$(git name-rev --name-only HEAD 2> /dev/null)
            if [ "$NM" != undefined ]
            then echo -n "@$NM"
            else git rev-parse --short HEAD 2> /dev/null
            fi
        else
            echo -n "$BR"
        fi
    fi
}

function parse_git_status {
    if we_are_in_git_work_tree
    then
        local ST=$(git status --short 2> /dev/null)
        if [ -n "$ST" ]
        then echo -n "| (• ︵•)| (❍ᴥ❍ʋ) "
        else echo -n "| (• ‿ •)| (❍ᴥ❍ʋ)"
        fi
    fi
}

function pwd_depth_limit_2 {
    if [ "$PWD" = "$HOME" ]
    then echo -n "~"
    else pwd | sed -e "s|.*/\(.*/.*\)|\1|"
    fi
}

export PS1="\[\033[32m\]\w\[\033[33m\]\$(parse_git_status)\[\033[00m\] $ "

Machine Learning with Neural Networks using Keras -- Remi Lehe

2017-02-14T00:00:00+00:00

Keras is a machine learning library that runs on top of the popular TensorFlow neural network library.

Overview

Repository for a tutorial at THW, Berkeley on Keras.

Running the tutorial

The tutorial is in the form of Jupyter notebooks. You can run these notebooks:

remotely on mybinder.org: to do so, click the above badge (although binder is temporarily down right now)
locally on your computer. To do so, install Anaconda and install the requirements by typing

conda install -c conda-forge jupyter keras pandas matplotlib

Then, clone this repository, and run the jupyter notebook:

git clone https://github.com/RemiLehe/thw_keras_introduction.git
cd thw_keras_introduction
jupyter notebook index.ipynb

Intro to Python -- Yu Feng and Stuart Geiger

2017-02-07T00:00:00+00:00

Intro to Python (and anaconda/Jupyter)

This session will be an intro to Python. We will also be using and helping set up Jupyter notebooks, a programming environment we frequently use for THW sessions, as well as Anaconda, a distribution and package manager that will install Python, Jupyter, and many other libraries and dependencies for you.

If you don’t have these libraries installed, follow the instructions here – you know you have it set up right if you can type “jupyter notebook” into a terminal / command prompt and the browser-based Jupyter interface pops up.

Note: this will make the anaconda version of python (which is python 3.6) your default python. If you already have a non-anaconda version of python installed and you are using this for important work, it may be best to create a new user account and install anaconda under that (selecting the options to only install it for that user, not system-wide).
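If you are unsure which Python ended up as your default after installing, a small check like the following can help. It is only a sketch using the standard library: describe_interpreter is a made-up helper name, and looking for the word "conda" is a heuristic that is not guaranteed to identify an Anaconda install.

```python
import sys


def describe_interpreter():
    """Collect a few facts about the Python that is running this code."""
    executable = sys.executable or ""  # can be empty in unusual embedded setups
    return {
        "executable": executable,                # full path to the interpreter
        "version": tuple(sys.version_info[:3]),  # for example (3, 6, 0)
        # Heuristic only: Anaconda builds and paths often mention "conda".
        "looks_like_conda": "conda" in (sys.version + executable).lower(),
    }


info = describe_interpreter()
for key, value in info.items():
    print(key, "=", value)
```

Run it with the plain python command from a terminal; if the path printed is not inside your Anaconda folder, another Python is still first on your PATH.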

If you have some experience with python, Jupyter, and/or anaconda, feel free to come and help others around you get up and running. Also, if you want to give a lightning talk on something in this area, please feel free to prepare a 3-5 minute demo on something you think might be interesting to THW.

Jupyter notebooks

View on the web

Or clone with git and run yourself:

git clone https://github.com/thehackerwithin/berkeley
jupyter notebook

Then navigate in the web interface to berkeley/code_examples/intropy_sp17

Navigating bash and UNIX environments -- Akos, Mitch, and Matthias

2017-01-31T00:00:00+00:00

Topics

UNIX intro (some history, UNIX in society)
UNIX design principles, or at least some of them, briefly
Shells and command-line interface
Shell scripting basics
Cool tricks

System requirements

Do you have a Mac? Open the Terminal app. You’re done.

Do you run Linux? Open your computer. You’re done.

Do you run Windows? See next section.

How to get bash or a Unix-like environment on Windows

Install Git Bash. (Instructions copied from here.)

Download the Git for Windows installer. Run the installer and follow the steps below:

Click on “Next”. (5 times)
Select “Use Git from the Windows Command Prompt” and click on “Next”. If you forget to do this, programs that you need for the workshop will not work properly. If this happens, rerun the installer and select the appropriate option.
Click on “Next”. Keep “Checkout Windows-style, commit Unix-style line endings” selected.
Select “Use Windows’ default console window” and click on “Next”.
Click on “Next”.
Click on “Finish”.

This will provide you with both Git and Bash in the Git Bash program.

Run Linux on a virtual machine, e.g., VirtualBox, or in a container, e.g., Docker.
Run Linux from an external USB storage device, e.g., live USB instructions for Ubuntu.
If you don’t want to do any of that
- Open a bash terminal at try.jupyter.org
- If you have a GitHub account and can use ssh, https://dply.co provides 2 hours free server time. Set that up by yourself though.

Learning resources

For those desiring something more structured, thoughtful, and professional…

List of Unix Commands
Software Carpentry Unix Shell Lessons
The Command Line Murders, a game to teach yourself the Unix CLI.
Advanced Bash-Scripting Guide from The Linux Documentation Project
O’Reilly books on Unix & shell topics
How to find files hidden inside a computer

What to Learn and Teach for Spring 2017

2017-01-24T00:00:00+00:00

Google doc for notes here

The first meeting of THW for Spring 2017 will be at 4:00pm in the Berkeley Institute for Data Science, Doe Library room 190. We will talk about what we want to learn and then try and fill up as much of the schedule as possible.

Ensemble (Machine) Learning with Super Learner and H2O in R -- Nima Hejazi and Evan Muzzall

2016-12-06T00:00:00+00:00

Nima Hejazi & Evan Muzzall

Nima is a graduate student in the Division of Biostatistics. His research combines aspects of causal inference, statistical machine learning, and nonparametric statistics, with a focus on the development of robust methods for addressing inference problems arising in precision medicine, computational biology, and clinical trials.

Evan earned his Ph.D. in Biological Anthropology from Southern Illinois University Carbondale where he focused on spatial patterns of skeletal and dental variation in two large necropoles of Iron Age Central Italy (1st millennium BC). He is currently R Lead Instructor, co-founder of the Machine Learning Working Group, and Research Associate in the D-Lab.

Ensemble (Machine) Learning with Super Learner and H2O in R

This presentation covers methods for performing ensemble machine learning with the Super Learner R package and H2O software platform, using the R language for statistical computing.

Materials for this presentation are available on GitHub here.

R & RStudio Installation

You can download R and RStudio here.

Jupyter R Kernel Installation

Please follow the instructions here to install an R kernel for Jupyter notebooks.

SuperLearner Installation

require("devtools")
devtools::install_github("ecpolley/SuperLearner")

H2O Installation

These installations are required to make H2O work in RStudio. Click the links to visit the download pages.

Download RStudio
Download Java Runtime Environment
Download H2O for R and dependencies (click the “Use H2O directly from R” tab and follow the copy/paste instructions)
Install the devtools and h2oEnsemble R packages.

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg, repos = "http://cran.rstudio.com/") }
}

# Now we download, install and call the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turing/10/R")))

# Install the "devtools" R package.
install.packages(c("devtools"))

# Install the "h2oEnsemble" R package (devtools is not attached yet, so call install_github through its namespace).
devtools::install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")

# Load packages
library(h2o)
library(devtools)
library(h2oEnsemble)

RStudio -- Diya Das and Wolf Ketter

2016-11-29T00:00:00+00:00

Diya Das and Wolf Ketter

R and RStudio

You can download R and RStudio at https://www.rstudio.com/products/rstudio/download/

Thanksgiving -- The Turkey Within (no meeting)

2016-11-22T00:00:00+00:00

There is no meeting this week

Machine learning with scikit-learn - Rochelle Terman and Christopher Hench

2016-11-15T00:00:00+00:00

Rochelle Terman and Christopher Hench

Machine learning with scikit-learn

Clone this Github repository.

Matplotlib - Yu Feng

2016-11-08T00:00:00+00:00

Yu Feng

matplotlib

A Jupyter notebook is here

Physical Computing - Brandon Curtis

2016-11-01T00:00:00+00:00

Brandon Curtis

Code examples

The Python Olympics - John Bohannon

2016-10-25T00:00:00+00:00

John Bohannon

The Python Olympics

The fastest way to learn a programming language is to use it. So why not turn that into a game?

All levels of experience welcome. We have Python puzzles for advanced coders and beginners alike.

This will also be the world debut of a new kind of interactive IPython Notebook designed for group coding games.

See you at the games!

Parallelization in Python - Remi Lehe

2016-10-18T00:00:00+00:00

Remi Lehe

Parallelization in Python

A Jupyter notebook is here, click the “launch binder” icon.

Natural Language Processing for Python with NLTK -- Christopher Hench

2016-10-11T00:00:00+00:00

Christopher Hench

NLTK

Text data requires a separate preprocessing stage often referred to as the ‘NLP pipeline’. One popular library for its implementation is Python’s NLTK (Natural Language Toolkit). This talk will cover how to clean text data, tag parts of speech (POS), identify named entities (NER), and quantify sentiment beyond dictionary look-up. While not explored in this talk, these preprocessing steps are often critical to developing more advanced, high-level models for document classifiers, topic modeling, and network models by providing targeted feature sets.
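As a rough picture of the first stage of such a pipeline, here is a cleaning step sketched with the Python standard library only. It is not NLTK's API: clean_text and the tiny STOPWORDS set are made up for illustration, and NLTK provides much richer tokenizers and stopword lists.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real pipelines use a much fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def clean_text(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())  # keeps letters and apostrophes
    return [t for t in tokens if t not in STOPWORDS]


tokens = clean_text("The cat sat in the hat, and the hat is RED!")
print(tokens)                          # ['cat', 'sat', 'hat', 'hat', 'red']
print(Counter(tokens).most_common(1))  # [('hat', 2)]
```

The cleaned token list is what later stages such as part-of-speech tagging or feature extraction would consume.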

Installation

We are using this Jupyter notebook in the thehackerwithin/berkeley repo, master branch, nltk folder.

For installation of Python and NLTK follow these instructions

If you installed anaconda:

conda install nltk

Otherwise:

pip install nltk

Lastly, the NER wrapper requires the Java Stanford NER, available here. Note: do not download the extension; just download Stanford Named Entity Recognizer version 3.6.0.

Git and Github -- Tony Kelman and Garret Christensen

2016-10-04T00:00:00+00:00

Presenters

Tony Kelman

Garret Christensen

Topics

Git

Github

Code examples

Github Pages and Jekyll - Stuart Geiger

2016-09-27T00:00:00+00:00

Stuart Geiger

I’m a postdoc at the Berkeley Institute for Data Science, and I completed my Ph.D. last December at the UC Berkeley School of Information next door. I’m an ethnographer of science and technology, and I study how people produce knowledge. My Ph.D. research was about Wikipedia’s volunteer editing community, and I’m now studying the emergence of this thing we like to call data science. In my work, I use many different kinds of methods – sometimes I look more like an anthropologist, a historian, or a philosopher, while other times I run surveys, experiments, and large-scale data analyses.

Github Pages and Jekyll

Github Pages is a free web hosting service by Github, which uses Jekyll to generate HTML files from files (themes, layouts, and data) in a special Github repository. Whenever you make a commit to a Github Pages repository, Github’s servers run the Jekyll parser on the files in that repository, which generates a set of static HTML and CSS files on a special subdomain. The result can look nearly identical to traditional content management systems (like Wordpress or Drupal) that dynamically process requests from browsers using languages like PHP and querying live databases like MySQL.

Advantages over the dynamic/CMS approach:

Fewer moving parts to configure and maintain
No need to be a systems administrator
More secure from hackers (the bad kind)
Uses existing Github infrastructure for logins and collaboration
Free hosting! (recommended max: 100,000 requests/month)

What you need

For most of this session, just a Github account and a web browser
For a few minutes at the end, I’ll walk people through running Jekyll locally. Install instructions are here for OS X and Linux (Windows is not officially supported).

Repositories to fork

Tips and tricks

Settings are in the settings tab of your repository, in the “GitHub Pages” section.
- You can see details about errors here, although they can be misleading / hard to decode
Jekyll’s markdown parser/renderer can be stricter than Github’s, and will just print raw markdown if it hits something it won’t parse
Go to the commit list (on your repo) to find the last version Github built with Jekyll.
- Green check: successful build
- Orange circle: building
- Red X: error
- No icon: not built
YAML is important and easy to mess up (YAML Ain’t Markup Language)
- The YAML format
- Invalid YAML declarations will cause builds to fail in ways that generate misleading errors
- Valid YAML declarations will be rendered by Github as a nice, formatted table.
- YAML uses C-style quote escape sequences

Examples of good/easy/interesting Github Pages sites

Themes

Real world examples

ACM Conference on Cloud Computing – Github repo
- Very detailed and polished (and complicated)
- Uses YAML to generate schedule
AstroHackWeek – Github repo
- Single page scrolling layout, based on Solid State by HTML5 UP
Switch2OSM – Github repo
- Uses Neo HPSTR theme

Lightning talks

Matthias Bussonnier

Cross language Jupyter

Machine Learning for Kaggle Competitions with R -- Jerry Chen

2016-09-20T00:00:00+00:00

Jerry Chen

Description

Kaggle is a data science platform where data scientists from all over the world work together and compete in real-world machine learning challenges. These public data sets cover a wide array of interesting problems from diagnosing eye problems based on images of the retina to recommending coupons to users who visit a site. On Tuesday, we will explore the machine learning process in the context of competitions and how Kaggle is becoming a really good starting point for machine learning enthusiasts to collaborate and learn new things.

Code:

The Bash Olympics -- Aaron Culich and John Bohannon

2016-09-13T00:00:00+00:00

Aaron Culich and John Bohannon

What To Learn and Teach - Everyone

2016-09-06T00:00:00+00:00

Attending

Anyone is welcome. I hope you’ll join us!

If you can’t join us, but would like to request to learn or teach a topic related to scientific computing, please fill out this google form.

Discussion: What Do You Want To Learn and What Can You Teach

Our first meeting of the semester will be focused on introductions and building this semester’s schedule of topics. To mold the upcoming schedule of topics to your needs and desires, please attend. We will engage in a fun democratic exercise in which we each offer and request knowledge. In this way, we’ll keep THW relevant by weighing in on what topics are important to us as a community.

To request particular sessions, volunteer some useful knowledge, or just hang out, please join us at 4:00pm in Room 190 of Doe Library.

First Time Attendees

We are very hopeful that many new faces will join us this semester. We would especially love your input at this meeting. Your voice will help us to make The Hacker Within as useful and peer-driven as possible.

More information on the how, when, where, and why of this meeting can be found at:

Results

I will list the results here when the meeting is over.

D3.js - Kai Chang

2016-04-27T00:00:00+00:00

Kai Chang

Kai Chang is an experienced user of D3.js, design technologist at Stamen Design and co-organizer of the Bay Area D3.js User Group.

D3.js for building Exploratory Visualization Tools

Who is here? Why are you interested in D3.js?

Data journalists?
Data scientists?
Real scientists?

Speaker Links

D3.js Resources

Parallel Coordinates

EcoEngine

Metagenomics

Parallel Coordinates

Radial Tree

Treemap

Partition Layout

Specific D3.js Techniques

d3-hierarchy
General Update Pattern - one of the big D3.js learning hurdles
Perceptual Color Spaces
Bivariate Hexbin
Dynamic Projections

Code examples can be found here.

Tableau - Harrison Dekker

2016-04-20T00:00:00+00:00

Attending

lots!

Harrison Dekker

Tableau

Build Systems - Tony Kelman

2016-04-13T00:00:00+00:00

Attending

lots!

Tony Kelman

Tony is a lecturer in Mechanical Engineering, and a core contributor to the Julia language. He likes building things, including scientific software.

Build systems

Yay open source! So there’s some cool library you want to use, and its author was kind enough to share the source code with the world. But maybe that’s all that they provided? Or you want to change something, fix a bug, add a feature, etc. For libraries written in compiled languages like C, C++, Fortran, etc, compilation and dependencies can be hard. There are a variety of build systems commonly used by open-source projects to assist in building libraries and managing dependencies across various platforms. I’ll talk about the GNU autotools (and Make), CMake, and briefly mention gyp. I’ll work through an example using a small but nontrivial C library.

Code examples can be found here.

Cython - Kyle Barbary

2016-04-06T00:00:00+00:00

Attending

Lots!

Kyle Barbary

Kyle is a Data Science Fellow at the Berkeley Institute for Data Science.

Cython

Code examples can be found here.

Lightning Talks

Qingkai Kong on line_profiler

kernprof! Google it.

Katy Huff on python’s cprofiler and snakeviz

Snakeviz! Google it.
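For anyone who wants a starting point, here is a minimal sketch of profiling with cProfile from the standard library. The function slow_sum and the file name slow_sum.prof are invented for the example; the saved stats file is the kind of file snakeviz is designed to open.

```python
import cProfile
import io
import os
import pstats
import tempfile


def slow_sum(n):
    """Deliberately naive loop so the profiler has something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100000)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())

# Save the raw stats to a temporary file for later inspection.
prof_path = os.path.join(tempfile.gettempdir(), "slow_sum.prof")
profiler.dump_stats(prof_path)
print("stats written to", prof_path)
```

With snakeviz installed, running snakeviz followed by the path of the saved file opens an interactive view of the same numbers in a browser.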

Seán Ó Nualláin on SONAS

Sonas: https://www.youtube.com/watch?v=gDZ_GOt13eg

Python For Plotting Timeseries & 3D Data - Qingkai Kong, Andy Haefner

2016-03-30T00:00:00+00:00

Qingkai Kong

I am a PhD student at the Berkeley Seismological Lab in the Earth and Planetary Science Department. My research area is earthquake early warning systems; I am working on using your smartphones to detect earthquakes. I am also really interested in data science, and am now working on how to apply data science skills back to seismology. You can check out my Github here.

Code examples for my presentation can be found here.

Andy Haefner

Code examples can be found here.

5:00pm Machine Learning Club

At 5:00pm, the Machine Learning Club will jump in and have a complementary talk on reproducible visualizations using Lightning.

Abstract

Creating reproducible scientific research has been a goal of the academic community for as long as I have been a part of it and has seen great successes (such as the interactive Nature article and the LIGO gravitational wave analysis), in part due to the efforts of the Python (and Jupyter) community. But I like to believe that these efforts stem from a more human root cause, the desire to understand the world around us, and as such should be relevant to anyone (not just the scientific Python community) trying to communicate the results of research.

In this talk I will give a brief history of why (and how) we need to make all of our analyses reproducible and how (web-based) interactive visualizations are essential to making research much more accessible to the world at large. By creating a reusable (and extensible) chart using the Lightning visualization library, I will highlight the role visualization plays in making analyses accessible to others and how web-based technologies such as JavaScript and D3 can liberate our results from the static prison of PDFs. And along the way I will (hopefully) show you the potential of interaction to change the hearts and minds of colleagues and the world.

No Meeting - Spring Break

2016-03-23T00:00:00+00:00

Attending

Don’t show up. Go on vacation. It’s spring break, fool.

matplotlib - Tenzing Joshi & Nick Swanson-Hysell

2016-03-16T00:00:00+00:00

Tenzing Joshi bio

I am a post-doc in the Applied Nuclear Physics Program at LBL.

Nick Swanson-Hysell bio

I am an Assistant Professor of Earth and Planetary Science here at UC Berkeley. My research is focused on reconstructing conditions on the ancient Earth with a particular focus on using magnetic data from rocks to determine the past positions of continents. You can learn more at my website. I seek to use tools that facilitate open and reproducible data analysis. You can find me and my research group on Github.

matplotlib presentation through notebook demos

Code to install (if you use Anaconda, use conda install instead of pip install):

pip install matplotlib

pip install Basemap

pip install mpld3

pip install folium

pip install bokeh

Introduction to matplotlib: Jupyter Notebook with example code

Using Basemap to plot geospatial data and other tricks/tools using matplotlib (“what used to bug me about using matplotlib, but doesn’t anymore”): Jupyter Notebook with example code

Lightning Talks

Will occur as the spirit moves THW attendees.

Handling and Visualizing Geospatial Data - Kevin Koy

2016-03-09T00:00:00+00:00

Attending

Approximately 35 people

Kevin Koy

Kevin is the Executive Director of the Berkeley Institute for Data Science (BIDS). He was previously executive director at the Geospatial Innovation Facility (GIF).

Geospatial Data

Resources

To get help beyond this talk, visit the Berkeley Geospatial Innovation Facility. They have resource guides, office hours, workshops, and more.

Data:

Tools:

Quantum GIS: open source, multiplatform geospatial software

Notes

There are two general types of GIS data. These are:

Vector data (in shapefiles, for example)
and Raster data (in pixels, numbered cells).
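
As a rough illustration of the difference, here is a minimal sketch in plain Python. The dictionary layout, the coordinates, and the `cell_value` helper are all made up for this example and are not any GIS library's API. A vector feature stores explicit coordinates plus attributes, while a raster stores a grid of values plus enough information to locate each cell:

```python
# Vector data: geometry is stored as explicit coordinates, with attributes attached.
# (Hypothetical feature; the coordinates are approximate and only illustrative.)
campus_point = {
    "type": "Point",
    "coordinates": (-122.259, 37.872),  # (longitude, latitude)
    "properties": {"name": "example site"},
}

# Raster data: a grid of numbered cells, located by an origin and a cell size.
raster = {
    "origin": (-123.0, 38.0),   # (longitude, latitude) of the upper-left corner
    "cell_size": 0.5,           # width and height of each cell, in degrees
    "values": [
        [10, 12, 15],
        [11, 14, 18],
    ],
}


def cell_value(raster, lon, lat):
    """Return the raster value of the cell containing (lon, lat)."""
    x0, y0 = raster["origin"]
    size = raster["cell_size"]
    col = int((lon - x0) // size)
    row = int((y0 - lat) // size)   # rows count downward from the top edge
    n_rows, n_cols = len(raster["values"]), len(raster["values"][0])
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise ValueError("point lies outside the raster")
    return raster["values"][row][col]


lon, lat = campus_point["coordinates"]
print(cell_value(raster, lon, lat))
```

Looking up a vector point in a raster like this is the simplest case of combining the two data types.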

Where to find Geospatial Data?

gif.berkeley.edu/resources/data.html

For the demo, there are a few places where the data will be downloaded from:

http://gadm.org http://prism.oregonstate.edu http://www.openstreetmap.org

Kevin demonstrated using QGIS (an open source alternative to ArcGIS). http://www.qgis.org/en/site/

He also showed us how to publish our map data on the web service CartoDB at https://cartodb.com.

Lightning Talks

Aji : TerraView

Aji shared TerraView, a node.js app that shows information about air quality. It has many layers on a straightforward map built on OpenStreetMap. On top of that, it uses an open source package called Leaflet, which allows lots of extra layers to be added to the map. New values are updated in real time.

Python Metaprogramming & Conversion to Python 3 - Ryan Pavlovsky & Matthias Bussonnier

2016-03-02T00:00:00+00:00

Ryan Pavlovsky

Matthias Bussonnier

Matthias is a PostDoc at BIDS, Jupyter and IPython core developer, as well as a pesky Python 3 evangelist.

Python Metaprogramming

An IPython Notebook on python metaprogramming can be found here.

Conversion to Python 3

Not everybody may be aware, but Legacy Python 2 reaches end of life in 2020, and it’s well past time to move to Python 3, which is a much better language.

I’ll show some of the reasons why you do not want to stay on Legacy Python, and the paths you can take to migrate your codebase (including notebooks!) to Python 3.

I’ll also show off some fancy Python 3 packages!
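
To give a flavor of what changes during a migration, here is a minimal sketch of a few well-known behavior differences. It is not taken from the talk or its notebook, and the variable names are only illustrative:

```python
# A few Python 3 behaviours that differ from Python 2 (this runs under Python 3).

# 1. print is a function, not a statement.
print("hello", "world", sep=", ")

# 2. "/" is true division; use "//" for floor division.
half = 1 / 2          # 0.5 in Python 3 (was 0 in Python 2)
floored = 1 // 2      # 0 in both

# 3. Text (str) and binary data (bytes) are distinct types.
text = "café"
data = text.encode("utf-8")        # bytes
roundtrip = data.decode("utf-8")   # back to str

# 4. Many built-ins return lazy iterators or views instead of lists.
squares = map(lambda x: x * x, range(4))
squares_list = list(squares)

print(half, floored, len(text), len(data), roundtrip == text, squares_list)
```

Integer division and the str/bytes split are the two items here that can silently change results in existing code, so they are worth checking first.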

Julia - Tony Kelman and Kyle Barbary

2016-02-24T00:00:00+00:00

Attending

Many folks showed up.

Tony Kelman

Tony Kelman (@tkelman) is a Julia contributor, software engineer at Julia Computing Inc, and lecturer in Mechanical Engineering. He recently completed his PhD doing research on optimization-based control.

Kyle Barbary

Kyle is a BIDS fellow.

Julia

Demos that you can use to follow along, as well as a PowerPoint presentation, can be found here: https://github.com/thehackerwithin/berkeley/tree/master/julia.

Scraping Wikipedia Data - Stuart Geiger

2016-02-17T00:00:00+00:00

Attending

About 30 folks!

Stuart Geiger

I’m a postdoc at the Berkeley Institute for Data Science and I recently completed my Ph.D last December at the UC-Berkeley School of Information next door. I’m an ethnographer of science and technology, and I study how people produce knowledge. A big focus of my work is about how new technologies change what it means to produce knowledge. In my work, I use many different kinds of methods – sometimes I look more like an anthropologist, a historian, or a philosopher, while other times I run surveys, experiments, and large-scale data analyses. My Ph.D research was about Wikipedia’s volunteer editing community, and I’m now studying the emergence of this thing we like to call data science.

Scraping Wikipedia data

We’ll be using two different resources to query Wikipedia. First, the Wikipedia API, which directly queries the text in Wikipedia articles, and second Wikidata, a new project that is trying to store all of the information in Wikipedia articles in a standardized, structured database.
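
As a rough sketch of what a query to each resource looks like (this is not code from the session notebooks), the snippet below only builds request URLs with the standard library. The helper names are made up for this example, and the parameter choices (`prop=extracts` for Wikipedia, `wbgetentities` for Wikidata) are one reasonable option among several:

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikipedia_extract_url(title):
    """URL asking the Wikipedia API for the plain-text extract of an article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    return WIKIPEDIA_API + "?" + urlencode(params)


def wikidata_entity_url(title):
    """URL asking Wikidata for the structured item linked to an English Wikipedia title."""
    params = {
        "action": "wbgetentities",
        "sites": "enwiki",
        "titles": title,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


print(wikipedia_extract_url("Mount Diablo"))
print(wikidata_entity_url("Mount Diablo"))
# To fetch a result, pass a URL to requests.get(url).json() (network access required).
```

The first URL returns article text, while the second returns structured statements about the same subject, which is the distinction drawn above.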

Things you will need

A clone of this directory, which has Jupyter notebooks
Jupyter notebook instance with the python kernel (I’m using python 3)
Python libraries (can be installed with ‘pip install …’): wikipedia, pywikibot, requests, nltk, pandas
A Wikipedia account (not required but highly recommended. Register here!)

Lightning Talks

Matthias : Hacker Within mybinder

Go check out mybinder.org. You can run the THW notebooks from your browser.

Brian : Where is a mountain, anyway

Inspired by the geocoordinates in Stuart’s talk, Brian pointed out that putting coordinates on a mountain is tricky. Where is a mountain, anyway?

Pandas - Tenzing Joshi

2016-02-10T00:00:00+00:00

Attending

Many Folks.

Tenzing Joshi

I am a postdoc in the Applied Nuclear Physics Program at LBL. I received my PhD from the Nuclear Engineering department at Berkeley. My current research is focused on using modern data analysis techniques to improve the sensitivity of mobile radiation detection platforms and using insights from this work to develop future radiation detection systems.

There is more than one way to skin a Panda.

In this edition of The Hacker Within I’ll introduce the pandas library. We’ll talk about Series and DataFrames. This includes a variety of ways to create them, index into them, manipulate them, and get data out of them.

Follow along with this Jupyter Notebook. In this notebook we’ll use some data.
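
For readers without the notebook at hand, here is a minimal sketch of those four steps (create, index, manipulate, get data out). The detector names and numbers are invented for illustration and are not the data used in the session:

```python
import pandas as pd

# Create a Series (one labelled column of values) from a dict.
counts = pd.Series({"detector_a": 120, "detector_b": 95, "detector_c": 143})

# Create a DataFrame (a table of labelled columns) from a dict of lists.
df = pd.DataFrame(
    {
        "detector": ["a", "b", "a", "b"],
        "energy_kev": [662.0, 662.0, 1460.0, 1460.0],
        "counts": [120, 95, 40, 31],
    }
)

# Index into them: by label (.loc), by position (.iloc), or with a boolean mask.
by_label = counts.loc["detector_b"]
by_position = df.iloc[0]["counts"]
high_energy = df[df["energy_kev"] > 1000]

# Manipulate: add a derived column, then group and aggregate.
df["fraction"] = df["counts"] / df["counts"].sum()
totals = df.groupby("detector")["counts"].sum()

# Get data out: convert back to plain Python objects.
totals_dict = totals.to_dict()
print(by_label, by_position, len(high_energy), totals_dict)
```

There are usually several equivalent ways to do each of these steps, which is the point of the section title.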

Resources

Pandas site
- http://pandas.pydata.org/pandas-docs/stable/overview.html
- http://pandas.pydata.org/pandas-docs/stable/tutorials.html
- There are loads of useful examples and tutorials on this site
- If you’re curious then take some time to look around
Wes’s Book
- Wes McKinney started Pandas
- Wes wrote a book titled Python for Data Analysis
- http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793
- This was my starting point and there is great stuff in this book
Other Pandas tutorials that are worth a read
- https://bitbucket.org/hrojas/learn-pandas
- http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
- http://synesthesiam.com/posts/an-introduction-to-pandas.html
- https://plot.ly/ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/
Stack Overflow
- There are a large number of Pandas related answers on here
- http://stackoverflow.com/questions/tagged/pandas
- It seems like this site is monitored for pandas tagged questions, if you’re stumped then this is a great place to ask a question.

Lightning Talks

Mike Pacer

Kunal Marwaha

LaTeX - Rachel Slaybaugh, Mike Pacer, and Katy Huff

2016-02-03T00:00:00+00:00

Attending

About 20.

Leaders

Rachel Slaybaugh

Rachel Slaybaugh is an Assistant Professor in the Department of Nuclear Engineering at UC Berkeley. She was one of the founding members of The Hacker Within at the University of Wisconsin.

Mike Pacer

Michael Pacer is a cognitive scientist at UC Berkeley.

Katy Huff

Katy Huff is a BIDS fellow and postdoctoral fellow in the Nuclear Science and Security Consortium.

$\LaTeX$

First, we’ll address an introduction to the basic concepts in $\LaTeX$. Then, we’ll share a few tips and tricks.

What is Markup?

HTML

HTML is just hypertext markup language. It provides a plain text way to describe objects and data that are encountered in the world wide web. Things like urls, text rendering in webpages, etc. are all easy to describe in HTML.

XML

XML is the extensible markup language. It generalizes where others specify. In the way that all reductionist things fail to get the specifics right, XML is great for general tasks in programming (input files, etc.), but not great for writing documents, where the needs are very specific.

MarkDown? RestructuredText? Where does it end?

There are a lot of markup languages. They all do different things. Restructured text is the standard in the world of python documentation. Markdown is the standard on github. Pick your poison.

How Do I install $\LaTeX$?

Linux

Everything in linux is simple.

sudo apt-get install texlive

OSX

You should use MacTeX. You can do this with macports or homebrew, or by downloading the whole shebang from the website.

Windows

I honestly have no idea. It looks like the TeX stack exchange may be able to help, though.

How do I write $\LaTeX$?

The not-so-short introduction to LaTeX is pretty great. http://tobi.oetiker.ch/lshort/lshort.pdf .

LyX

Max showed us LyX last time, which is a WYSIWYG editor for $\LaTeX$. That’s awesome. I recommend you give it a shot.

TeXShop

TeXShop is something that many folks use to write and render latex side by side. It’s cool. I don’t use it, but I can see where it would be great.

Text Editors

Some folks will find the text editor option the most extensible and glorious. I am one of those folks. I have a vim plugin for latex called, you guessed it, vim-latex and it does most of the typing for me. With syntax highlighting, it tells me where there’s a mistake, and by virtue of dealing directly with the content, I can ignore how it looks until the very end.

How do I pronounce $\LaTeX$?

Check it out, the last letter is the Greek letter $\chi$. So, it definitely has to end in a K sound. But, is it Lay or Lah? The developers say it’s up to you.

What are the Parts of a Document?

$\LaTeX$ documents have numerous parts.

The Preamble

In the preamble, there is a basic set of information that must be included in order to define the document. The real minimum set is just the “documentclass” parameter. Document classes include “article,” “book,” and “letter.” Options concerning the paper format and the font can be specified in the square brackets, while the documentclass type should be listed in the curly braces:

\documentclass[11pt]{article}

The preamble also holds the inclusion of any packages that you rely on. Standard packages include “amsmath,” “amsfonts,” “amssymb,” and “graphicx.”

\usepackage{amsmath}
\usepackage{amssymb}

If you are expecting a title to appear, parameters such as author and title should be filled in.

begin and end

You must begin and end the document.

\documentclass[11pt]{article}

\begin{document}

<stuff>

\end{document}

Now, that’s it. To create a beautiful pdf, you can place this text in a file called doc.tex, type “latex doc.tex” to create a dvi file, then type “dvipdf doc.dvi” to create a pdf file. Alternatively, “pdflatex doc.tex” produces the pdf directly.

The Title Elements

There are elements that help to define the title elements.

\documentclass[11pt]{article}
\author{The Hacker Within}
\title{Our New Document}

\begin{document}

<stuff>

\end{document}

Those variables are used by the maketitle command, which must be executed within the document boundaries.

\documentclass[11pt]{article}
\author{The Hacker Within}
\title{Our New Document}

\begin{document}
\maketitle

\end{document}

Books, Chapters, Sections, Subsections, Subsubsections, and Paragraphs

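These define the hierarchy of your document. Apart from the book, which is a document class, each level is set with a sectioning command (\chapter, \section, \subsection, \subsubsection, and \paragraph).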

Include and input

Rather than keep everything in one big file, you can include and input other latex files into a master. That acknowledgements section that you use in every paper? Keep it in its own file.

Examples

As we go along, you may consider cloning :

What To Learn and Teach - All

2016-01-27T00:00:00+00:00

Attending

Anyone is welcome. I hope you’ll join us!

If you can’t join us, but would like to request to learn or teach a topic related to scientific computing, please fill out this google form.

Discussion: What Do You Want To Learn and What Can You Teach

This semester, we’re going to try to have a visualization theme. Everyone visualizes results of some kind. So, bring us your tools, your examples, your demos, and your problems. Bring us your plots, timeseries, volumetric images, videos, or interactive charts and graphs. We’re ready to see it all.

First Time Attendees

More information on the how, when, where, and why of this meeting can be found at:

Results

I will list the results here when the meeting is over.

High Performance Python - Chick Markley

2015-12-02T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Chick Markley

Chick Markley does work with the Aspire lab at UC Berkeley.

Straw Man High Performance Python Example

First, some aphorisms:

Programmer hours are more important than CPU hours - Cook
Premature optimization is the root of all evil - Knuth
etc.

Next, an example of a Laplacian.

Chick put his arrays into various data structures (lists, numpy arrays, etc.)

Interestingly, lists performed better than naive numpy arrays, but once you vectorize the numpy operations, the code becomes much, much faster. It’s of course better still if you use the built-in scipy Laplacian (faster because it’s written in C). You can do well with Cython too, but ultimately you get the best performance by loading a C library.
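
To make "naive" versus "vectorized" concrete, here is a minimal sketch of a five-point Laplacian written both ways. This is not Chick's code: the function names and the test grid are assumptions for illustration, and the stencil is left unscaled (no division by the grid spacing squared).

```python
import numpy as np


def laplacian_loops(u):
    """Five-point Laplacian of the interior of u, using explicit Python loops."""
    out = np.zeros_like(u)
    rows, cols = u.shape
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i, j] = (u[i - 1, j] + u[i + 1, j] + u[i, j - 1] + u[i, j + 1]
                         - 4.0 * u[i, j])
    return out


def laplacian_vectorized(u):
    """Same stencil, expressed as whole-array slices so the loops run in C."""
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
                       - 4.0 * u[1:-1, 1:-1])
    return out


x = np.linspace(0.0, 1.0, 50)
u = np.add.outer(x ** 2, x ** 2)   # u(x, y) = x^2 + y^2 on a 50 x 50 grid
assert np.allclose(laplacian_loops(u), laplacian_vectorized(u))
```

Both functions give the same answer. Timing them with `timeit` on a larger grid is an easy way to see the gap on your own machine.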

We can also parallelize. Parallel operations vary from embarrassingly parallel to inscrutably parallel. One can do so on many devices (many nodes, MIC, GPU…) and with many frameworks (pyspark, openmp, opencl, cuda…). But one must inform the compiler which loops to parallelize, etc.

One can also “roofline” one’s system with “shocdriver” or a similar tool to benchmark the system. In particular, it shows what kind of performance constraints are characteristic of your system.
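
The idea behind a roofline can be sketched in a few lines: attainable performance is the smaller of the peak compute rate and the memory bandwidth times the kernel's arithmetic intensity (operations per byte moved). The machine numbers and kernel intensities below are hypothetical, and the function name is made up for this example:

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical machine: 500 GFLOP/s peak compute, 50 GB/s memory bandwidth.
PEAK, BANDWIDTH = 500.0, 50.0

for name, intensity in [("stencil-like kernel", 0.5), ("dense matrix multiply", 32.0)]:
    bound = attainable_gflops(PEAK, BANDWIDTH, intensity)
    limit = "memory-bound" if bound < PEAK else "compute-bound"
    print(f"{name}: at most {bound:.0f} GFLOP/s ({limit})")
```

A benchmarking tool supplies measured values for the peak and the bandwidth. The formula then shows which of the two limits a given kernel will hit.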

Another option is SEJITS, a framework that Chick works on. The name stands for Selective Embedded Just-In-Time Specialization (or, rather, optimization).

Tuning is another option. There’s something called OpenTuner. It will run your program numerous times with different settings to find the configuration that minimizes the run time.
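
The basic loop behind empirical tuning can be sketched with the standard library alone. This is not OpenTuner's API: `blocked_sum`, `tune`, and the candidate block sizes are invented for illustration, and the search here is a plain exhaustive sweep, whereas real autotuners explore much larger spaces with smarter search strategies.

```python
import timeit


def blocked_sum(values, block_size):
    """Sum a list in chunks; block_size is the tunable parameter."""
    total = 0
    for start in range(0, len(values), block_size):
        total += sum(values[start:start + block_size])
    return total


def tune(func, data, candidates, repeats=5):
    """Time func for every candidate setting and return (best_setting, timings)."""
    timings = {}
    for block_size in candidates:
        runs = timeit.repeat(lambda: func(data, block_size), number=10, repeat=repeats)
        timings[block_size] = min(runs)   # the minimum is the least noisy estimate
    best = min(timings, key=timings.get)
    return best, timings


data = list(range(100_000))
best, timings = tune(blocked_sum, data, candidates=[16, 256, 4096, 65536])
print("fastest block size on this machine:", best)
```

The winning setting depends on the machine, which is exactly why the measurement is automated rather than guessed.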

Wait - there’s more hardware. One can build new hardware to solve your problem. Hardware isn’t so hard anymore (maybe it should be called easyware.)

There’s an interesting “hardware construction language” that folks at Aspire came up with. It’s called Chisel.

No Meeting - Thanksgiving

2015-11-25T00:00:00+00:00

Attending

Please don’t attend. The library is closed for Thanksgiving.

scikit-learn - Ross Barnowski and Shannon McCurdy

2015-11-18T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Ross Barnowski

Ross is a Nuclear Engineering PhD student in Kai Vetter’s group.

Shannon McCurdy

Shannon is a postdoc in computational biology.

Discussion: scikit-learn

Ross walked us through a demo notebook which can be found here. You can clone it from github.com/thehackerwithin/berkeley.

Shannon walked us through some useful resources. The documentation for sklearn seems to parallel a book called The Elements of Statistical Learning, and Shannon recommends this as a resource.

Linear Regression

If y is n×1 and x is n×p, we have an unknown coefficient matrix W, which is p×1. The error term is then n×1. The assumption is that x and y are linearly related. The fit, W, minimizes the vertical error, that is, the sum of squared differences between y and the fitted values. The least squares cost function, which comes up in regression in this way, is a model for the error.
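
Written out with the symbols above (y, x, W), plus $\epsilon$ for the error term, which is a symbol introduced here and not in the original notes, the model and the least squares fit are:

$$
y = xW + \epsilon, \qquad y \in \mathbb{R}^{n \times 1},\quad x \in \mathbb{R}^{n \times p},\quad W \in \mathbb{R}^{p \times 1},\quad \epsilon \in \mathbb{R}^{n \times 1}
$$

$$
\hat{W} = \arg\min_{W} \lVert y - xW \rVert_2^2 = \arg\min_{W} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij} W_j \Big)^2
$$

When $x^{\top}x$ is invertible, the minimizer has the closed form $\hat{W} = (x^{\top}x)^{-1} x^{\top} y$.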

Note that in this example, when p>n, we enter a danger zone for validity of this model. Shannon wanted us to note, in this context, scikit-learn doesn’t necessarily warn you when this happens. So, don’t trust that scikit-learn will always warn you if you aren’t using the models in the appropriate regime.
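
A small experiment makes the danger visible. This sketch uses NumPy's `lstsq` as a stand-in for scikit-learn, and the sizes, seed, and random data are arbitrary assumptions. With more coefficients than observations, least squares fits pure noise essentially perfectly:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 5, 10                      # fewer observations than coefficients
X = rng.normal(size=(n, p))       # plays the role of the n x p matrix x
y = rng.normal(size=n)            # pure noise: y has no real relationship to X

# Least squares still returns an answer (the minimum-norm solution)...
W, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ W

# ...and it reproduces the noise essentially exactly, with no warning raised.
print("rank of X:", rank)
print("largest residual:", np.abs(residual).max())
```

A near-zero residual here says nothing about predictive power. It only reflects that there are enough free coefficients to match any y.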

Shrinkage Models

A bunch of different shrinkage models are included in scikit-learn. One that Shannon uses in her work is Lasso.

The idea, functionally, is that we add a penalty to the least squares cost function. The penalty is related to the magnitude of each coefficient. That is, if you are going to add some nonzero element in the matrix, it must contribute well to the fit with y. This is a parsimony metric which enforces sparsity in the solution vector. This helps with interpretability because it emphasizes the most important coefficients.
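
In symbols, using y, x, and W as defined in the linear regression section and a penalty weight $\lambda \ge 0$ (a symbol introduced here for illustration), the lasso fit is:

$$
\hat{W}_{\text{lasso}} = \arg\min_{W} \; \lVert y - xW \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert W_j \rvert
$$

Setting $\lambda = 0$ recovers ordinary least squares, and larger values of $\lambda$ push more coefficients to exactly zero. Note that scikit-learn's Lasso scales the squared-error term by $1/(2n)$ and calls the penalty weight alpha, so values of the weight are not directly comparable between the two conventions.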

In the Wild

Shannon has encountered least squares and lasso in two different problems in her work.

Example: In her research she looks into event times, where only a subset (half) of the events are recorded. Using an exponential probability and an indicator (whether or not an event was recorded), she can describe the probability of an event happening. Given this, she can separate the probability into a maximum likelihood problem which can be minimized (using exponential regression) to determine the least squares solution, and she can reframe the Newton-Raphson step into an ordinary least squares lasso situation. If you didn’t follow this completely, check out Tibshirani’s website on the general topic of lasso models.

Lightning Talks

Finally, there will be a time for a couple of Lightning Talks, which are 5-10 minute blasts of information about a particular topic or question of interest to the group. This topic can be anything useful, new, or interesting to scientists who compute. It may be some new skill you have recently picked up in your research, a productivity tool you have recently learned to love, a quick demo of a useful library, or anything you feel we would enjoy learning.
Note that the lightning talk time is a good way to bring a question to the group. If you have a bug you need help with, here’s the place to ask many ears about it at once.

No Meeting - Veterans' Day

2015-11-11T00:00:00+00:00

Attending

Please don’t attend. The library will be closed on November 11th.

scikit-image - Stefan van der Walt

2015-11-04T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Stefan van der Walt

Advanced Python - Sven Chilton, Matthias Bussonnier

2015-10-28T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Sven Chilton

Matthias Bussonnier

Postdoc in BIDS, mostly working on Jupyter and IPython.

Discussion: Advanced Python

We’ll discuss a bit of advanced Python: context managers, dunder methods, and a lot of things that might not be a good idea to do in production but are fun to play with.

If time permits, a little bit of AST.

Here is the notebook I used for the various examples.
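
For a taste of these topics without opening the notebook, here is a minimal sketch. It is not the notebook's code, and the `Timer` and `Vector` classes are invented for illustration. It shows a context manager, a few dunder methods, and a first look at the `ast` module:

```python
import ast
import time


class Timer:
    """Context manager: __enter__ and __exit__ run around the with-block."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        return False   # do not swallow exceptions raised inside the block


class Vector:
    """Dunder methods let instances work with operators and built-ins."""

    def __init__(self, *components):
        self.components = tuple(components)

    def __add__(self, other):
        return Vector(*(a + b for a, b in zip(self.components, other.components)))

    def __eq__(self, other):
        return isinstance(other, Vector) and self.components == other.components

    def __len__(self):
        return len(self.components)

    def __repr__(self):
        return f"Vector{self.components}"


with Timer() as timer:
    total = Vector(1, 2) + Vector(3, 4)

print(total, len(total), f"{timer.elapsed:.6f}s")

# A small taste of the AST: parse source text and list the node types it contains.
tree = ast.parse("x = 1 + 2")
print([type(node).__name__ for node in ast.walk(tree)])
```

The `with` statement calls `__enter__` and `__exit__`, the `+` operator calls `__add__`, and `len()` calls `__len__`. In each case, ordinary syntax is dispatched to a specially named method.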

GPUs and Parallelization - Biye Jiang, Aaron Culich

2015-10-21T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Biye Jiang

Biye Jiang is a PhD student at UC Berkeley in the CS department working with John Canny.

Aaron Culich

Aaron is a research computing architect at Berkeley.

Discussion: GPUs and Parallelization

Today’s topic is about GPUs and parallelism.

Survey of Needs and Resources – Aaron Culich

Aaron referenced a presentation on this topic. It can be found here.

Aaron started this presentation with a survey of what the attendees are actually using.

GPUs? 3 folks.
Other Parallelization? Lots of folks.

Python Parallelism

It was mentioned that, for some folks, Python is the language of choice. The Python multiprocessing module was mentioned. This was the topic of a THW session last year. The THW resources on this topic can be found here. That session was not on GPUs; however, the Python threading module can be used in conjunction with PyCUDA, a Python module for GPUs.
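
As a minimal sketch of the pattern (not code from the earlier THW session), the standard library's `concurrent.futures` can map one function over independent inputs using a pool of threads. The `simulate` function is a hypothetical stand-in for a real task:

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(seed):
    """Stand-in for one independent task (hypothetical workload)."""
    value = seed
    for _ in range(1000):
        value = (value * 1103515245 + 12345) % 2**31
    return value


seeds = list(range(8))

# Serial version.
serial = [simulate(s) for s in seeds]

# Threaded version: same function, same inputs, results returned in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(simulate, seeds))

assert threaded == serial
```

Threads pay off mainly when tasks spend their time waiting on I/O or on external libraries. For CPU-bound pure Python work, `multiprocessing.Pool` offers a similar `map` that runs in separate processes.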

Research IT – Krishna Muriki

Research IT is available as a resource for individuals who would like to test their code on GPU resources. Krishna Muriki explained that there is an institutional shared Linux cluster (Savio). Within that cluster, there are 6 compute nodes with 4 Kepler GPUs each. Those nodes are in testing, and BRC is interested in and open to new users.

Java runtime engine – Oliver

Oliver at ESPM has a javascript modeling project for agent based population models. They are working to make their software scalable from the desktop to the level of higher performance computing. The NOVA stack and XSEDE resources are core to their efforts.

Scala Demo – Biye Jiang

Biye demonstrated the speed of GPUs by conducting a matrix multiplication using GPUs versus conducting the same multiplication using CPUs.

GPU Discussion

Biye shared some of the diagrams from this presentation.

He noted

GPUs give excellent speed,
but GPU memory latency is also an issue.
So the throughput is high, but so is the memory latency.
If you want your GPU code to run quickly, optimize for throughput.
Always remember, GPU memory access is slower than computation.
Moving data between the GPU and the main memory should be avoided.
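
A back-of-the-envelope model shows why the last point matters. All hardware numbers below are hypothetical and chosen only to make the arithmetic easy, and `gpu_time_seconds` is a made-up helper, not part of BIDMat or any GPU library:

```python
def gpu_time_seconds(bytes_moved, flops, transfer_gb_s, compute_gflops):
    """Crude estimate: time to copy data to the GPU plus time to compute on it."""
    transfer = bytes_moved / (transfer_gb_s * 1e9)
    compute = flops / (compute_gflops * 1e9)
    return transfer, compute


# Hypothetical hardware numbers, chosen only to make the arithmetic easy.
TRANSFER_GB_S = 10.0      # host <-> GPU copy rate
COMPUTE_GFLOPS = 1000.0   # GPU arithmetic throughput

n = 4096
bytes_per_matrix = n * n * 4            # single-precision n x n matrix

# Element-wise add of two matrices: n*n operations on 2 inputs + 1 output.
add = gpu_time_seconds(3 * bytes_per_matrix, n * n, TRANSFER_GB_S, COMPUTE_GFLOPS)

# Matrix multiply: about 2*n^3 operations on the same amount of data.
matmul = gpu_time_seconds(3 * bytes_per_matrix, 2 * n**3, TRANSFER_GB_S, COMPUTE_GFLOPS)

print("add:    transfer %.4fs, compute %.6fs" % add)
print("matmul: transfer %.4fs, compute %.4fs" % matmul)
```

Under these assumptions the element-wise add spends nearly all of its time moving data, while the matrix multiply does enough arithmetic per byte to make the copy worthwhile.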

GPU BIDMat demo

Biye presented an IPython notebook to demonstrate how BIDMat works.

The ipython notebook demos are here.

Webscraping - John Bohannon, Sven Chilton

2015-10-14T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

John Bohannon

Sven Chilton

Free-Form Hacking

2015-10-07T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Many folks will be absent, due to a BIDS-related event elsewhere. However, you are welcome to gather, sit together, and get some work done in a collaborative environment.

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: Today is unstructured. Simply gather, sit together, and get some work done.

Pandas - Sean Wahl & Sven Chilton

2015-09-30T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Sean Wahl

Sven Chilton

Spark and Hadoop - Zhao Zhang

2015-09-23T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Zhao Zhang

Visualization - John Naulty, Ross Barnowski, Biye Jiang, Jennifer Jones

2015-09-16T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

John Naulty

Ross Barnowski

… likes computers

pyqtgraph

Install: pip install pyqtgraph

Demo: python -m pyqtgraph.examples

Description of pyqtgraph

My Take: pyqtgraph is less user-friendly than matplotlib (esp. the documentation; the gallery contains far fewer examples and doesn’t do a good job of covering all of the possible features and uses of pyqtgraph), but is very feature-rich and more performance-oriented, despite still being pure python. There are several scenarios in which pyqtgraph is definitely worth looking into:

The need for speed: pyqtgraph is in many cases much faster than matplotlib (see demo). Also has built-in support for remote plot updating.
Volumetric rendering: If you need to visualize in 3D, pyqtgraph has a lot to offer. The other de-facto python 3D-visualization library is mayavi — I would say pyqtgraph has a slightly steeper learning curve and is a little less pretty, but again is much faster than mayavi. I don’t have enough experience with yt to say how it compares.
Building Qt Applications: If you’re using python-ized Qt (either PySide or PyQt) to build a GUI, pyqtgraph integrates very nicely. It is built with the same tools!
Beyond Visualization: The author(s) of pyqtgraph had the goal of making it a general science/engineering tool. There are a lot of built-in features designed to aid in analyzing data visually and interactively. See the Data Slicing and Image Analysis examples to get a feel for this.

Jennifer Jones

This is my Bio

Biye Jiang

I am a third-year CS PhD student at Cal, working with Prof. John Canny on topics like making machine learning easier to use. Check out our BIDMach project.

Here is the ipython notebook I will use in the talk. This will be similar to our data science class.

Advanced Git and GitHub - Ross Barnowski, Kyle Barbary, Katy Huff

2015-09-09T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Ross Barnowski

Ross is a graduate student in Kai Vetter’s group in Nuclear Engineering. He has long hair.

Kyle Barbary

Kyle is a cosmologist and BIDS data science fellow. Kyle likes bicycles.

Katy Huff

Katy is a nuclear engineer and BIDS data science fellow.

Discussion: Advanced Git

We’ll be talking about a bunch of cool git stuff. This will range from powerful hacks everyone can use to awkward workarounds only a couple of people will ever use.

Undoing Stuff

git reset hard vs soft
revert, why to revert, how to revert
git stash and git stash pop
getting a specific file from checkout
```
git checkout <branch> -- <file>
```

Useful Configurations and Stuff

show your current branch in the terminal prompt
aliasing (very quick example with git config --global alias.unstage 'reset HEAD --')
the hub project to make interacting with github a little nicer (follows aliases nicely)

creating a template for git commit messages with git config (git config --global commit.template ~/.gitmessage.txt)

the mailmap, for normalizing the many possible commit names of your various contributors

Dealing with Branches, Remotes, and Collaboration

remotes
setting up SSH keys
the DAG
git flow for collaborating
git tagging

Rebasing

rebasing

Specialized Knowledge

cherry-picking a commit from one branch to another
detaching a single subdirectory and its history from a big repo to make it its own repo
the github api: futz with github from the command line

Lightning Talks

Additionally, there will be a time for a couple of Lightning Talks, which are 5-10 minute blasts of information about a particular topic or question of interest to the group. This topic can be anything useful, new, or interesting to scientists who compute. It may be some new skill you have recently picked up in your research, a productivity tool you have recently learned to love, a quick demo of a useful library, or anything you feel we would enjoy learning.
Note that the lightning talk time is a good way to bring a question to the group. If you have a bug you need help with, here’s the place to ask many ears about it at once.

Hacky Hour

Inspired by the hackers of Australia, we’re taking this opportunity to try out a Hacky Hour. After the meeting is over, folks can stick around to review one another’s code. This part of the meeting is meant to be very casual, so feel free to pop open a beverage if you need to take the edge off of the code reviews (byo).

Introductory Git and GitHub - Harrison Dekker and John Naulty Jr.

2015-09-02T00:00:00+00:00

Attending

Anyone is welcome. We hope you’ll join us!

Meeting Info

When: 4:00pm - 5:30pm
Where: BIDS, Room 190 of Doe Library.
Who: Anyone interested in software development best practices is welcome to come to our meetings.
How: A predetermined main topic (45 minutes) will be followed by impromptu lightning talks (5 minutes each)

Harrison Dekker

Harrison Dekker is the director of the Data Lab, an essential student resource on campus for data-related inquiry.

John Naulty

A Berkeley alum with experience in neuroscience and devices.

Introduction to Git and GitHub

Welcome to Git!

We will be using these resources:

Try Git is a live demo we will be going through first.
Git Cheatsheet is a useful reference.
Download Git. Training wheels are off, let’s get started!
Git Workflow. This is a model for a typical workflow using Git.

Challenge

Other sources not covered today:

More Git Workflow. Because workflow is important.
Katy’s Great Tutorial. This is Part I of II. I highly recommend it.

Lightning Talks

Aaron Culich : Two Factor Auth

Two factor auth is a way to robustify password use by combining it with hardware (like your phone).
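
The notes don't say which scheme Aaron showed, so the following is only a sketch of one common second factor: the time-based one-time password (TOTP) that phone authenticator apps display. It assumes the RFC 6238 algorithm with HMAC-SHA1 and a 30-second step; the function name `totp` and the demo secret are illustrative, not from the talk.

```python
import hashlib
import hmac
import struct
import time


def totp(secret, at_time, digits=6, step=30):
    """Return the time-based one-time password for `secret` at `at_time`."""
    counter = int(at_time) // step            # whole time steps since the Unix epoch
    msg = struct.pack(">Q", counter)          # counter as an 8-byte big-endian integer
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                # dynamic truncation, as in RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Assumption: this is the published RFC 6238 test key, not a real credential.
demo_secret = b"12345678901234567890"
current_code = totp(demo_secret, time.time())
```

The server and the phone share the secret once, then both compute the same short-lived code, so a stolen password alone is not enough to log in.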

What Do You Want To Learn and What Can You Teach - Everyone

2015-08-26T00:00:00+00:00

Attending

Anyone is welcome. I hope you’ll join us!

If you can’t join us, but would like to request to learn or teach a topic related to scientific computing, please fill out this google form.

Discussion: What Do You Want To Learn and What Can You Teach

Our first meeting of the year will be focused on introductions and building this semester’s schedule of topics. To mold the upcoming schedule of topics to your needs and desires, please attend. We will engage in a fun democratic exercise in which we each offer and request knowledge. In this way, we’ll keep THW relevant by weighing in on what topics are important to us as a community. To request particular sessions, volunteer some useful knowledge, or just hang out, please join us at 4:00pm in Room 190 of Doe Library.

First Time Attendees

More information on the how, when, where, and why of this meeting can be found at:

Results

Many of you suggested many cool things to learn and teach.
Based on the popularity of those sessions, the tentative schedule for the semester is here.

Lightning Talks

Chris Paciorek - A set of resources online

Chris pointed out an existing set of resources at link

Aaron Culich - Resources on Berkeley Campus

Aaron shared some insider knowledge about resources on campus in this presentation.

Thomas Kluyver - New Cool Thing In Jupyter

Thomas showed us a cool new thing in Jupyter. You had to be there to see it. Get excited, mergers of cells.

Biye - A New Course

Biye talked about a new course on campus: Introduction to Data Science (link).

Technology For Teaching - Matthew Brett

2015-05-13T00:00:00+00:00

Attending

More than 30 people attended!

Matthew Brett

I (Matthew) am an aged sort-of post-doc working at the UCB Brain Imaging Center.

How to use (how not to use) the IPython notebook for teaching

I am teaching a course called practical neuroimaging at UCB.

The course is half-flipped, in that the students do 30 minutes of reading before class, and spend about half of the 2 hour class time doing exercises.

Of course we make heavy use of the IPython notebook for the exercises, and this has worked very well.

But - using IPython for tutorials and reading for the class has been much more difficult because it does not yet fit well with static website builders like Sphinx.

It is still hard to write a lot of complicated text or explanation in the notebook because the web interface and cell structure make the environment cumbersome compared to a good text editor.

Others seem to have had the same experience working with the IPython notebook as an interactive code editor - see the very new rodeo project.

Maybe, by sharing our experiences, we can help to work out some solution that uses the IPython machinery, that is yet closer to perfection.

Lightning Talks

Jess Hamrick : nbgrader

Jess shared nbgrader, a cool tool written for creating and grading assignments in the IPython notebook.

Stéfan van der Walt : Elegant Scipy, Markdown for Books, etc.

Stéfan and Juan Nunez-Iglesias are writing a book called “Elegant Scipy” to collect and discuss elegant uses and implementations of scientific Python. He shared some details about the book and showed how he is using markdown as the native format to edit in, exportable to IPython notebooks and HTML.

Sean O’Nuallain : What Computers can’t do (even now) and why

Sean gave some context for an upcoming conference.

Matthias Bussonnier : Jupyter Sidecar

A tool for viewing/rendering rich Jupyter kernel output in HTML.

https://github.com/rgbkrk/jupyter-sidecar

Also, thebe:

https://github.com/oreillymedia/thebe.

Shiny - Karthik Ram

2015-05-06T00:00:00+00:00

Attending

About 20 folks!!

Karthik Ram

Karthik is a BIDS data science fellow, programmer extraordinaire, and leader of ROpenSci.

Shiny

Shiny is an R-language package that creates web applications to interact with analysis pipelines and visualizations.

Github repo with some example code: https://github.com/karthik/shiny
The most up to date resource on Shiny: http://shiny.rstudio.com/
Also see some amazing cheatsheets here: http://www.rstudio.com/resources/cheatsheets/

server.R holds the behind the scenes info
ui.R holds the interface

Best way to learn: Try building an app. Best resource: the cheatsheets at rstudio.com/resources/cheatsheets

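## ui.R
library(shiny)

shinyUI(fluidPage(
    titlePanel("This is a shiny app"),
    sidebarLayout(
        sidebarPanel(
            # dropdowns for choosing which iris columns to plot
            selectInput("x", "x variable", names(iris)),
            selectInput("y", "y variable", names(iris), names(iris)[[2]])
        ),
        # assumed: display the plot that server.R stores in output$gg
        mainPanel(plotOutput("gg"))
    )
))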

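## server.R
## any code that runs once on each server
## put that code *before* the shinyServer() call
library(shiny)
library(ggplot2)

shinyServer(function(input, output){
    ## assumed completion: the notes left output$gg unfinished;
    ## this draws a scatterplot of the two selected iris columns
    output$gg <- renderPlot({
        ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) + geom_point()
    })
})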

Lightning Talks

Ryan Pavlovsky : RadWatch Dosimeter

Ryan showed off a cool, small, cheap touchscreen silicon PIN detector module (“radiation thermostat!”) and the plotly interface that they have deployed!

Katy : Survey!!

Please fill this out: https://goo.gl/AIymbR

Jeroen Ooms

MongoDB Client for R called “mongolite”. http://cran.r-project.org/web/packages/mongolite/index.html. Showed off some in-database aggregations, mapreducing, binning, and the like.

Make - Chris Paciorek

2015-04-29T00:00:00+00:00

Attending

30 folks!

Chris Paciorek

Chris Paciorek is the statistical computing consultant in the Department of Statistics at Berkeley, as well as being a researcher and lecturer in the department. His research focuses on statistical methods (often Bayesian methods) applied to environmental and public health applications. He teaches the department’s graduate-level statistical computing class, Stat 243.

Make

Make is a ubiquitous command line tool that can help to automate building software and executing analysis pipelines.
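
To give a feel for the core idea before opening the repository, here is a minimal Python sketch of the rule Make applies to each target: rebuild it if it is missing or older than any of its dependencies. This is an illustration of the timestamp comparison only, not Make itself; the function name `needs_rebuild` and the file names are hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


with tempfile.TemporaryDirectory() as workdir:
    data = os.path.join(workdir, "data.csv")       # hypothetical input file
    figure = os.path.join(workdir, "figure.png")   # hypothetical output file
    for path in (data, figure):
        open(path, "w").close()
    os.utime(data, (1000, 1000))      # input last modified at t=1000
    os.utime(figure, (2000, 2000))    # output built later, at t=2000
    up_to_date = not needs_rebuild(figure, [data])
    os.utime(data, (3000, 3000))      # input edited after the output was built
    stale = needs_rebuild(figure, [data])
```

In a Makefile the same relationship is written declaratively as a target, its dependencies, and the command that rebuilds it.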

For the material for today, please clone this Github repository: https://github.com/berkeley-scf/make-thw-2015

The primary document is this IPython Notebook

Lightning Talks

Kelly Rowland : CMake

CMake, by Kitware, is an open source tool that automates the configuration and generation of makefiles for building software in a cross-platform way.

Jess Hamrick : SCons

SCons is a replacement for make. Interestingly, it was the result of a Software Carpentry code competition a very very long time ago.

C++ and Object Orientation - Sven Chilton

2015-04-22T00:00:00+00:00

Attending

About 20 folks.

Sven Chilton

Dr. Sven Chilton is an alumnus of the Nuclear Engineering department.

C++ and Object Orientation

C++ is a compiled programming language that offers low-level control and supports an object-oriented paradigm.

Code examples can be found here.

Lightning Talks

Brian Hamlin : A Benchmarking Exercise

Brian talked about a benchmarking exercise between C++ and Java within the world of maps. It seems like Java was able to hold its own.

Sean O’Nuallain : Limits of Current Genetics Work

Sean talked about the limitations in the approaches of two large genetics projects.

Microcontrollers - Anders Priest

2015-04-15T00:00:00+00:00

Attending

About 20 people.

Anders Priest

Anders is a graduate student in nuclear engineering at Berkeley.

Microcontrollers

Circuit boards, Arduinos, Raspberry Pis, oh my! Microcontrollers and similar digital devices enable you to sense and control the physical world using nothing but your programming skills.

Notes

Microcontrollers are small computers that range in size and scale. Some are more sophisticated than others.

They are found in a variety of devices - cars, microwaves, remote controls, digital clocks, etc. They are also used in industry and medicine.

The Arduino project provides its own IDE, which is fairly simple to use. The two necessary functions are setup() and loop(). Programs have to be written on a separate computer, however.

The Raspberry Pi is somewhat more sophisticated and runs a stripped-down version of Linux. You can do things like run Python scripts on the RPi.

Components to use with microcontrollers include:

the usual analog suspects (wires, resistors, etc.)
sensors (accelerometers, thermistors, joysticks, etc.)
“shields” are devices to mount on Arduino microcontrollers (for ethernet, WiFi, etc.)

The nuclear engineering department is working on a dosimeter network using Raspberry Pi devices.

The Internet has a lot of great resources if you’re interested in working with and learning about microcontrollers.

Today’s presentation was brought to us on a Raspberry Pi! Neat.

Lightning Talks

None today.

Julia - Kyle Barbary

2015-04-08T00:00:00+00:00

Attending

About 25 people.

Kyle Barbary

Kyle is a postdoc in the Berkeley Center for Cosmological Physics and a BIDS fellow. Like many people, he has a website.

Julia

Julia is a high-level language (like Python) that emphasizes performance. Slides and Jupyter notebooks from the talk can be found in this Github repository.

If you’re living in the future and the link is broken, look in https://github.com/kbarbary/talks/

Notes

Julia aims to solve the two-language problem, where high-level languages are easy to program in but rely on some other low-level language on the backend. Julia, however, only uses Julia.

Fundamentally, Julia was created around the idea that dynamic languages don’t need to be slow. Julia seeks to be as fast as C, dynamic as Ruby, useful as Python, etc…

Julia is pretty fast. In some examples, very very fast. One of the cool things it does is to compile a function the first time you run it, so later runs are faster than the first.

The syntax seems really similar to python, except:

; takes you to the shell
? takes you to help
backspace to get out of the shell or help
supports unicode, like python 3
you can use ipython notebook --profile julia to start IJulia
typeof(var) gives the type of the variable var
string interpolation is neat. You can use $var in a string and it will be expanded.
Functions and loops use an explicit end
Functions can be written like f(x) = 2x^2 + 3x + 1
arrays can be either homogeneous or heterogeneous. If heterogeneous, the array type is Any.
array type can be explicitly defined
one-based indexing
ranges are inclusive at both ends
the code_native function gives the machine code for any function. Pretty sweet.

Packages are typically loaded with the using keyword rather than a Python-style import, and the macros in a package are called with an @ prefix, as in @func.

Lightning Talks

No lightning talks.

R - Rochelle Terman, Daniel Turek

2015-04-01T00:00:00+00:00

Attending

about 25

Rochelle Terman

Rochelle is a Ph.D. Candidate in Political Science at the University of California, Berkeley.

Daniel Turek

Daniel Turek is a statistician and BIDS fellow.

R

R is a high-level programming language for statistical analysis.

Rochelle’s demonstration code and notes can be found in this github repo

Computer Architectures - Alex Chong

2015-03-18T00:00:00+00:00

Attending

Lots of folks. Wasn’t able to count.

Alex Chong

Alex is a student at Berkeley.

Computer Architectures

His talk about computer architecture can be found here.

Testing - Rachel Slaybaugh

2015-03-11T00:00:00+00:00

Attending

30 or so folks

Rachel Slaybaugh

Rachel Slaybaugh is an Assistant Professor of Nuclear Engineering at the University of California, Berkeley. At Berkeley, Prof. Slaybaugh’s research program is based in computational methods and applied to existing and advanced nuclear reactors, nuclear non-proliferation and security, and shielding applications. She received a BS in Nuclear Engineering from Penn State in 2006 where she served as a licensed nuclear reactor operator. Dr. Slaybaugh went on to the University of Wisconsin – Madison to earn an MS in 2008 and a PhD in 2011 in Nuclear Engineering and Engineering Physics along with a certificate in Energy Analysis and Policy. For her PhD she researched acceleration methods for massively parallel deterministic neutron transport codes. Dr. Slaybaugh then worked with hybrid (deterministic-Monte Carlo) methods for shielding applications at Bettis Laboratory while teaching at the University of Pittsburgh as an adjunct faculty member. Throughout her career Dr. Slaybaugh has been engaged in software carpentry education and training; she also contributes to the open source project PyNE. Prof. Slaybaugh was awarded the 2014 American Nuclear Society Young Member Excellence Award.

Testing

Today’s presentation can be found here.

Lightning Talks

Kelly Rowland : Sometimes the tests are wrong

But, it’s ok. We don’t need to enter an infinite recursive testing of tests. Just keep in mind that sometimes tests need to be updated when the code interface changes behavior.

Katy Huff : TravisCI

Check out this continuous integration service. TravisCI is free.

Brian Hamlin : More TravisCI

Brian gives an example of a travis.yml file.

Matplotlib and Seaborn - Caroline Sofiatti and Sean Wahl

2015-03-04T00:00:00+00:00

Attending

At least 35 people attended!

Caroline Sofiatti

I’m a PhD Candidate in the physics department. I work for the Supernova Cosmology Group and our goal is to unravel the mysteries of Dark Energy, one data point at a time!

Sean Wahl

PhD Candidate in the Earth and planetary science department. I study planetary interiors using first-principles material simulations. I use matplotlib for both routine plotting needs as well as for published journal figures.

Matplotlib

Find the IPython notebook here.

If you wish to follow along with the presentation you should have Python 2 installed with the following packages:

matplotlib, numpy, ipython, basemap(optional)

Seaborn

Seaborn is an awesome library for making beautiful and informative graphics in Python. Its mission is to make visualization a central part of exploring and understanding data. Adding import seaborn to your code will not only make your plots look amazing, it will also make your life easier!!!

Check out the IPython Notebook here.

Code examples can be found here.

Lightning Talks

Sean O’Nuallain : Homoiconicity in Programming Languages

See here for more.

IPython - Omoju Miller

2015-02-25T00:00:00+00:00

Attending

I counted 35 people. These included, at least:

Omoju
Kelly
Katy
John
Chris
Caroline
Min
Matthias
Thomas
Jess
Denia
Sven
Anders
Donny
Dan
Many others!
Add your name above if you aren’t on the list!

Omoju Miller

Omoju Miller is a PhD candidate at the University of California at Berkeley researching artificial intelligence. She is also a software technologist, start-up advisor, and educator.

IPython

IPython is an interactive interpreter for programming with Python (and now many other languages).

Easy, Peasy, Lemon Squeezy

Omoju suggests that, to work, teach, or collaborate, development tools need to be as easy as possible to install and use.

Things that she mentioned in this regard:

IPython is easy to install with “pip.” Just type pip install ipython in the terminal.
Wakari.io
NBViewer

The IPython Notebook

To start up the ipython notebook, crack open your terminal and type:

ipython notebook

That starts up a server which serves IPython notebooks (usually at localhost:8888 or similar). The command also automatically opens a browser instance with a view of your directory. This will allow you to open up any IPython notebooks in that directory. It also allows you (with a button) to create a new notebook in that directory.

The Oscars

Omoju showed an example from Mining the Social Web about using the Twitter API. She demonstrated how she was able to use the IPython notebook to access the Twitter firehose and filter out tweets concerning the Academy Awards.

LaTeX in Markdown cells

She also demonstrated how to include LaTeX in a markdown cell. First, create a markdown cell, then include math:

Courtesy of MathJax, you can include mathematical expressions both inline:
$e^{i\pi} + 1 = 0$  and displayed:

$$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

Using IPython Notebooks with GitHub

A troublesome issue with IPython notebooks is the extra information that is held in the json. That doesn’t version control as beautifully as plain text.

To avoid extra headaches, clear all output cells (Cell -> All Output -> Clear) before you commit your IPython notebooks.
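
If you forget the menu step, the same cleanup can be scripted. The sketch below assumes the notebook JSON follows the nbformat 4 layout (a top-level "cells" list in which code cells carry "outputs" and "execution_count"); older notebook formats are laid out differently. The function name `clear_outputs` and the tiny example notebook are made up for illustration.

```python
import json


def clear_outputs(notebook):
    """Return a copy of a notebook dict (nbformat 4 layout) with outputs removed."""
    cleaned = json.loads(json.dumps(notebook))   # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook, standing in for json.load() on a real .ipynb file.
example = {
    "nbformat": 4, "nbformat_minor": 0, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
         "execution_count": 3,
         "outputs": [{"output_type": "execute_result", "execution_count": 3,
                      "data": {"text/plain": ["2"]}, "metadata": {}}]},
    ],
}
stripped = clear_outputs(example)
```

Writing `stripped` back with json.dump gives a file whose diffs only show changes to the code and text.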

Magics in the Notebooks

Omoju describes these as Development Powertools. Magics are special tools and functions. They are often preceded by one or more percent signs (%). Some examples:

%timeit : times the execution of a single statement (for example, a function call)
%%timeit : times the execution of a whole cell
%%javascript : allows the use of javascript code in the notebook
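
Outside the notebook, roughly the same measurement as %timeit can be made with the standard library's timeit module. This is a sketch of the idea, not what the magic does internally; `sum_of_squares` is a made-up function to have something to time.

```python
import timeit


def sum_of_squares(n):
    """Toy workload to time."""
    return sum(i * i for i in range(n))


# Run the call 1000 times and compute the average time per call, in seconds.
total = timeit.timeit(lambda: sum_of_squares(100), number=1000)
per_call = total / 1000
```

The magic is more convenient because it picks the number of repetitions for you and prints a summary.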

Plotly Notebook Examples

Plotly is a plotting tool. On their website, there are some fun examples in the gallery of IPython notebooks.

Use Cases

Omoju suggests using IPython notebooks for lots of stuff, including:

Code Mentoring
Teaching
Data Analysis
Writing Books

Learning more with Books

Omoju recommends The IPython Cookbook.

Lightning Talks

Martin Magdinier : OpenRefine

OpenRefine is a tool for helping to clean and process data. You can do this with very limited data processing skills, but it is also useful for more skilled analysts.

Refine runs a Java-based server on your local machine and opens up a browser instance to provide an interface for loading data, parsing it, identifying close duplicate categories, cleaning it up, and exploring it somewhat with filters and views.
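
To make "identifying close duplicate categories" concrete, here is a simplified Python sketch in the spirit of OpenRefine's fingerprint key-collision clustering: normalise each value to a key, then group values that share a key. It is an approximation for illustration (the real method also handles accents and control characters), and the function names and sample values are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalise a string so that near-duplicate spellings share one key."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))   # word order and repeats are ignored
    return " ".join(tokens)


def cluster(values):
    """Group values whose fingerprints collide; keep groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = ["UC Berkeley", "uc berkeley", "Berkeley, UC", "Stanford"]
clusters = cluster(names)
```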

Brian Hamlin : Geospatial stuff!

Geospatial data is everywhere. An example is found at www.pism-docs.org

Does anyone know how to export NetCDF -> text?
Does anyone who uses Pandas or GeoPandas know how to join-on-attribute and get a GeoPandas object rather than a Pandas object?

OSGeo-Live

Daniel Wooten : Prompt Magic

Dan Wooten uses SSH to get to computers all over the place. He likes to be able to tell what computer he is on simply by the color of the prompt in his terminal.

In the .bashrc file, you can export a modified PS1 variable to change the content and color of your prompt.

BUT, the code necessary for specifying the right thing is pretty hideous. Let someone else figure out the syntax for you with : PROMPT MAGIC!!!
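
For the curious, this is roughly what such a prompt string looks like. The Python sketch below builds a bash PS1 value whose colour is derived from the hostname, so each machine gets a stable colour. It is not the Prompt Magic tool; the function name, colour list, and hostname are assumptions for illustration.

```python
import hashlib

# ANSI foreground colour codes: red, green, yellow, blue, magenta, cyan.
COLOR_CODES = [31, 32, 33, 34, 35, 36]


def prompt_for_host(hostname):
    """Build a bash PS1 string whose colour depends on the hostname."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    color = COLOR_CODES[digest[0] % len(COLOR_CODES)]
    # \[ and \] tell bash that the enclosed escape sequence has no screen width.
    return r"\[\e[%dm\]\u@\h\[\e[0m\]:\w\$ " % color


ps1 = prompt_for_host("example-host")    # hypothetical hostname
export_line = "export PS1='%s'" % ps1    # a line you could paste into .bashrc
```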

Advanced Git - Dav Clark

2015-02-18T00:00:00+00:00

Attending

Over 40 people! Too many to count! We have arrived.

Dav Clark

Dav Clark is the director of Glass Bead Labs, is employed by the D-Lab, and is supported by BIDS and the NIMH. His mission is to provide inclusive access to Data Science training, with a particular focus on social scientists.

Advanced Git

You may be keeping track of your work with git already. Learn some minimal skills, including the answer to “what’s a pull request?” You’ll get more done, both by managing your own work efficiently, and by effectively soliciting and incorporating work from others.

Dav has created a repository at github.com/tech4measurement/tech4measurement.github.io. This repository uses jekyll to make a website. In order to change a thing or two about the website, folks make pull requests.

We’ll be pretty interactive. Check out this overview of resources, pull requests welcome.

Lightning Talks

Thomas Kluyver

Thomas has created an exceptional little tool for teaching the shell!

Text Editors - Everyone

2015-02-11T00:00:00+00:00

Attending

Donny
Mathias
Chris
Kyle
Sven
Cameron
Joey
Anders
Katy
Mathias
Caroline
Matthew
Sean
David
Edward
Others (didn’t catch your name! sorry!!)

Everyone

This week will be a session full of lightning talks. All of the members of THW are encouraged to bring a lightning talk introducing some aspect of their favorite (or not their favorite) text editor.

Matthew Brett : Why Invest in a Text Editor?

Use a single editor well. “The Pragmatic Programmer” (Andrew Hunt & David Thomas). Vim/Emacs are productive if you use them well.

What is the cost to a scientist of being a bad programmer?

Maybe the good motivators are: taking it on faith, by watching others, and increasing efficiency of thought.

Matthew wants to do a study!!! It’s going to be cool.

Joey Curtis : Atom

Joey shared this text editor smackdown blogpost with us. He’s now going to show off a few things about Atom.

Atom is a lot like Sublimetext. Atom is GitHub’s text editor and it’s completely open source underneath, built on node.js.

On GitHub, there are tons of available packages to extend the program. There are tons of papers, even, on how people prefer to look at code (colors, appearance, eyestrain). The things that are successful are somewhat based on Sublimetext, which, in turn, is based on TextMate.

Katy Huff : Vim-LaTeX

Vi (vim) has a lot of plugins. Katy’s favorite way to discover plugins is VimAwesome. Her favorite way to then install most of those plugins is something called vim-pathogen.

Among all of these plugins, the one that has made the most difference in the life of Katy is vim-latex. She owes this knowledge to the great and wonderful RedBeard (@mrterry).

Donny : IPython Notebook

Check it out, you get a beautiful IPython prompt, you have the ability to edit cells with markdown, get python documentation, quickly interact with plots and whatnot.

Literate programming is the name of the game here. It’s a nice way to prototype code.

Chris Paciorek : LyX

LyX is a WYSIWYG-style LaTeX editor. You can do things like type “frac” and then “space” and it shows up beautifully rendered. It avoids the intermediate step of building the LaTeX file.

Sven Chilton : Emacs

The default Emacs on Mac OS X isn’t the best. You should install the new version. Ctrl-x is the key feature: it is the prefix you use to execute various commands. “Ctrl-x 3” splits the window into side-by-side panes. Lots of other things get shown off…. opening a file.

Anders Priest : Vim

Anders uses vim mostly in insert mode, but has recently started beefing up his vimrc. He went to vimdoc.sourceforge.net and learned more about all the options.

Colorschemes go in the colors folder.
You can use the cursor in insert mode if you “set mouse=i”
You can show the numbers or not show the numbers.
You can use mapping functions. Anders did this to make it so he can delete a line even from insert mode.

Cameron Bates : Textmate

Mac-only text editor. It supports the all-powerful command+ and command- view changers.

It is one of the first of the standalone text editors.
There aren’t many changes anymore, as it’s quite mature and stable.
Cameron uses it mostly for editing large files.
It also can let you search within a folder, rather than just in one file.
Search and replace, therefore, is nice and safe.

Kyle Barbary : Emacs line-wrapping

If you hit alt-q it will reflow the text to make it wrap nicely.

Caroline Sofiatti : Sublimetext

It’s beautiful and a lot like Atom. Sublime has a beautiful rendering of the whole file.

Parallel Programming - Chris Paciorek

2015-02-04T00:00:00+00:00

Attending

Chris Paciorek
Matias (Ipython)
Min RK (IPython)
Josh Howland (NE)
Greg (neuro)
Andrew
Kelly Rowland (NE)
Sven Chilton (NE)
Anders Priest (NE)
David (econ)
Rachel Slaybaugh (NE)
Caroline Sofiatti (astro)
Alex (undergrad)
Katy Huff (NE)
Ryan Pavlovsky (NE)
Zhangpeng Guo (NE)
Denia Djokic (NE)
Tenzing Joshi (NE)
Xin Wang (NE)
Nicholas Adams (DLab)
Vic Gehman (physics)
others

Chris Paciorek

Parallel Programming

For the material for today, please clone this github repository

https://github.com/berkeley-scf/parallel-thw-2015

The primary document is here

Lightning Talks

Rachel Slaybaugh : TotalView

A debugger that works reasonably well for distributed parallel tasks is TotalView. It’s developed by Livermore.

The Shell and The Filesystem Hierarchy Standard - Katy Huff

2015-01-28T00:00:00+00:00

Attending

About 35 folks! No attendance was taken, though.

Katy Huff

Katy Huff is a postdoc with NSSC and BIDS.

The Shell

There was some interest in the shell. In particular, someone was interested in ksh. So, let’s cover shells.

Various Shells

Shell programs are just programming languages. The flavors include:

sh
csh
tcsh
zsh
ksh
bash

What are the differences? Mostly syntax. For serious shell programming, they vary mostly in the way they treat arrays, their order of operations, and the way they treat variable scope.

Basics in the Shell

I’m actually going to cheat here and use the first chapter of my new book to cover the shell basics really quickly.

I know that’s not a very open source way to go about things, because you’d have to buy the book to get this material later. Be cool. The same material is covered beautifully by Software Carpentry.

Customizing Your Shell

In any shell, there are files that can be used to customize its behavior. These files hold shell commands that are run at the start of each shell session. For bash these are usually:

.bash_profile
.bashrc
.bash_aliases

The profile file is called first and sources other files (such as bashrc and aliases). Many people keep their bashrc files online. Let’s find some good ones and browse them. I keep mine online so that I can get back to work instantly if my laptop self-immolates. Let’s talk about some of the things you can do to make your life easier with bashrc.

The Filesystem

All of this is very exciting. The shell provides a nice transparent interface to the filesystem. But, what’s the point of having an interface to the filesystem?

Pretty much everything in a UNIX or Linux operating system is a file that you can look at. Since you’re a human with skillz, this means that pretty much everything in the operating system is something you can investigate, manipulate, and control.

The only way to know the potential power of the filesystem is to understand the filesystem hierarchy standard.

The Filesystem Hierarchy Standard

On a linux machine, the placement of directories at the top level of the filesystem is not just systematic, it is standardized. The standard provides a place for each thing that might be needed on your filesystem.

I feel like this kind of skill should be used in only two ways:

to be more efficient
to prank your friends

Thankfully, the filesystem provides plenty of opportunities for both.

/bin

Binary files are utilities like commands and programs. System level binary files are held in bin.

/lib

Libraries are compiled software with APIs that can be used by other source code on your system. System level libraries are held in lib. UNIX machines don’t have lib at the top level, but they do have it at lower levels. We’ll see this when we address opt and usr.

/dev

Even hardware has a filesystem representation. In dev, block and character devices are linked to the operating system through file-like objects. Browse dev… what devices do you see? Can you find your printer? What is zero? What is random?

It used to be the case that you could pipe random numbers into the file that held your speakers (try ‘cat /dev/random > /dev/dsp’). It isn’t true with modern linux, unfortunately. Now all audio moves through a program (on linux it is called aplay) before it hits the device.

On Linux, try:

cat /dev/urandom | aplay

On macs, try:

say the hacker within rocks
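
Because devices look like files, any language that can open a file can talk to them. As a small sketch, the Python below reads a few bytes from /dev/zero and /dev/urandom; it assumes a Linux or Mac system where those paths exist and returns None otherwise. The helper name `read_device` is made up.

```python
import os


def read_device(path, count):
    """Read `count` bytes from a device file, or return None if it is absent."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as device:
        return device.read(count)


zeros = read_device("/dev/zero", 4)      # an endless supply of zero bytes
noise = read_device("/dev/urandom", 4)   # an endless supply of random bytes
```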

/proc

The processes on your machine are represented in the filesystem by what appear to be files. This isn’t true on a mac. However, it’s really cool.
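
On Linux, each running process shows up as a directory in /proc named after its process ID. A minimal sketch that lists them, assuming a Linux-style /proc and returning an empty list elsewhere (on a Mac, for instance); the function name is hypothetical.

```python
import os


def list_process_ids(proc_root="/proc"):
    """Return the numeric directory names under /proc, one per running process."""
    if not os.path.isdir(proc_root):
        return []
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())


pids = list_process_ids()
```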

/boot

Macs don’t have this. Linux does. What do you think it holds? Why should this be part of the filesystem?

/mnt

This is where things get mounted (CDs, USB drives, etc.). Note that a lot of these will also be accessible via the device number of their port. Unlike the port, though, you can unmount things that are mounted.

/opt

When you want to install a library or a program, you might want to do it in this optional space. This space reflects the top-level system hierarchy.

/usr

An almost exactly equivalent space is here in usr.

Lightning Talks

Ryan Pavlovsky : ssh config files!

What Do You Want To Learn and What Can You Teach - Everyone

2015-01-21T00:00:00+00:00

Attending

Katy Huff
Rachel Slaybaugh
Rochelle Terman
Caroline Sofiatti
Denia Djokic
Britta Fiore
Chris Paciorek
Alex Chong
Greg Telian
Sean Wahl
Min RK
James Kendrick
Sven Chilton
Jose Buraschi
Andrew Greenop
Joey Curtis
Anders Priest
Daniel Turek
Karthik Ram
Tenzing Joshi
Kelly Rowland
Madicken Munk
Thomas Kluyver
Kyle Barbary
Daniel Wooten

Discussion: What Do You Want To Learn and What Can You Teach

First Time Attendees

More information on the how, when, where, and why of this meeting can be found at:

Results

You can see the results in the master branch of this repository here and you can see the logic behind the scheduling here in this ipython notebook.

Nuclear Data and Advanced Cython - Morgan White and Cameron Bates

2014-12-03T00:00:00+00:00

Attending

Alejandra Jolodosky
Denia Djokic
Katy Huff
Cameron Bates
Kyle Barbary
Kelly Rowland
Rachel Slaybaugh
Marissa Zweig
Ryan Pavlovsky
Ross Barnowski
Andy Haefner
Aaron Culich
Krishna Muriki
Morgan White
? New Belgian student

Nuclear Data at LANL - Morgan White

We have a distinguished visitor for the last meeting of the semester. Morgan will give us some thoughts on Nuclear Data at Los Alamos.

Morgan White

Morgan White joined the nuclear data team at LANL in X-division in 1998 as a summer student and has been part of that team ever since. Recently, Morgan has crossed from simulations to the dark side and begun working with the experimental community to better understand and reduce the systematic errors in the fundamental data necessary for such simulations.

Advanced Cython - Cameron Bates

Code examples can be found here.

Cameron Bates

Cameron is a PhD candidate in Nuclear Engineering who works as a graduate student researcher on nuclear data experiment and simulation at Lawrence Berkeley National Laboratory.

ORIGEN and Open Source

2014-11-19T00:00:00+00:00

Attending

Max Fratoni
Katy Huff
Kelly Rowland
Alejandra Jolodosky
Sandra Bogetic
Madicken Munk
Dan Wooten
Tenzing Joshi
Andrey Mironyuk

Discussion: ORIGEN - Max Fratoni

Max gave us an overview of ORIGEN, a depletion code. Max’s presentation can be found here.

Max Fratoni

Max is a professor in the Department of Nuclear Engineering.

ORIGEN

ORIGEN solves the Bateman equations.
What you need for the zero dimensional depletion equation to be accurate is simple: accurate cross sections.
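
As a toy illustration of what such equations describe, the sketch below evaluates the closed-form Bateman solution for the simplest case: a two-member decay chain (nuclide 1 decays to nuclide 2, which decays to something stable), with no neutron flux. ORIGEN handles far more nuclides and also includes transmutation through cross sections and flux, so this is not its method; the half-lives are made-up numbers and the formula requires the two decay constants to differ.

```python
import math


def two_member_chain(n1_0, lambda_1, lambda_2, t):
    """Bateman solution for 1 -> 2 -> stable, starting with only nuclide 1."""
    n1 = n1_0 * math.exp(-lambda_1 * t)
    n2 = (n1_0 * lambda_1 / (lambda_2 - lambda_1)
          * (math.exp(-lambda_1 * t) - math.exp(-lambda_2 * t)))
    return n1, n2


# Hypothetical half-lives, in arbitrary but matching time units.
half_life_1, half_life_2 = 10.0, 2.0
lambda_1 = math.log(2) / half_life_1
lambda_2 = math.log(2) / half_life_2
n1, n2 = two_member_chain(1000.0, lambda_1, lambda_2, t=10.0)
```

After one half-life of the parent (t = 10), half of the original 1000 atoms of nuclide 1 remain.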

ORIGEN-S is within the Scale package and is maintained by the Scale maintainers, whereas ORIGEN2 is standalone. ORIGEN-ARP is a graphical interface for ORIGEN-S. It’s possible to use 3 energy groups in ORIGEN-S and the cross sections are kept up-to-date.

ORIGEN-S tracks depletion for 1946 isotopes.
HOWEVER, there are only about 300 isotopes in the ENDF database.

So, how do we run the code? Max went over the various data we need to input:

material
data
depletion data
- power depletion : need power and time
- flux irradiation : need flux and time
- decay : need time

They produce:

activity
radiotoxicity
decay heat
absorption and fission rates
neutron emission
photon emission

Every material you provide must belong to one of three groups:

activation product (720)
actinide (130)
fission product (850)

Of course these groups overlap.

You also have to provide information about every nuclide (decay constants, decay heats, etc.) These decay data libraries are plaintext. ORIGEN comes packaged with this information.

You also have to provide the cross section libraries. ORIGEN comes with some of these. The cross section libraries have to be selected carefully.

The input files are TAPE files… because they used to actually be tapes.

Discussion: Open Source Contribution

We intentionally misspelled everyone’s names and went through the issue-pull-request-review-pull-close workflow seen in many open source projects.

Cython and the Python C/API - Ross Barnowski

2014-11-05T00:00:00+00:00

Attending

Ross Barnowski
Andy Haefner
Tenzing
Paul
Kelly Rowland
Daniel Wooten
Kyle Barbary
Cameron Bates
Aaron Culich
Katy Huff

Discussion: Extending Python with Cython and the C/API

Ross Barnowski

Ross Barnowski is a nuclear engineering PhD student in Kai Vetter’s research group.

Cython and the Python C/API

Code examples can be found here.


Jekyll - Katy Huff

2014-10-29T00:00:00+00:00

Attending

Sandra Bogetic
Alejandra Jolodosky
Staffan Qvist
Madicken Munk
Katy Huff
Jason Hou
Daniel Wooten
Ross Barnowski
Andy Haefner
Fatma Imamoglu
Rachel Slaybaugh
others…

Katy Huff

Katy Huff is a postdoc with NSSC and BIDS.

Jekyll

This very site is made with Jekyll. Jekyll is a Ruby-based, blog-aware, static site generator.

Two ways to host your Jekyll site for free on GitHub

Everybody needs a website. Google yourself. What happens? Let’s get you a website.

username.github.com master branch

Every time someone creates a username on GitHub, a special space on the internet is reserved for them at theirusername.github.com (and .io, it’s a long story).

If the user “lisemeitner” existed, then she could create a repository on github called “lisemeitner.github.com” (or .io, it’s a long story). If that repository has a master branch, then GitHub will try to render it with Jekyll and serve it up to the internet at lisemeitner.github.io. Note that jekyll plug-ins used by GitHub are very minimal. Try not

If Lise doesn’t want to use Jekyll, that’s cool. Sites on GitHub can be plain boring old html (like katyhuff.github.io). To keep GitHub from trying to render it with Jekyll, she has to add an empty file (.nojekyll) in her repository. Additionally, an index.html file has to exist at the top level of her repository, or else there will be nothing there.

gh-pages branch

If Lise also has a project called fission, she can have a website for it too. That website can sit on the internet at lisemeitner.github.io/fission. All she has to do is put either jekyll stuff or a static html page in the gh-pages branch. The same rules apply as far as .nojekyll and plug-ins are concerned.

For an example, check out katyhuff.github.io/cyder.

How does the THW site work?

Please look at the readme. We’re gonna make some changes.

What’s this config file?

It’s for configuring the site, silly! Let’s check it out.

What’s all this stuff at the top of the posts?

It’s YAML metadata. Let’s talk about it.

Serving it up locally

So, rather than rely on GitHub to render the Jekyll site and serve it up on the internet, you can also render it locally and check it out on your localhost. You’ll need to have Ruby installed. Then:

gem install jekyll

Then, if you navigate to a directory containing a jekyll site, you can serve it up:

jekyll serve

Now open a browser and navigate to the localhost url http://localhost:4000.

What about themes?

The THW page relies on an open source theme called left. We could swap that out for another theme really easily. There are lots on the internets. Try this page.


MocDown and PyNE Install - Phil Gorman and Kelly Rowland

2014-10-22T00:00:00+00:00

Attending

Phil Gorman
Phil
Xianlom Hou
Daniel Wooten
Alejandra Jolodosky
Madicken Munk
James Bevins
Kelly Rowland

Discussion: PyNE Install - Kelly

Today’s THW went really well! Kelly did a “choose your own adventure” livebuild of PyNE on a guest account on her computer.

Discussion: MocDown 2.0 - Phil

Phil continued his introduction to MocDown.

Code examples can be found here.

MocDown and Python Threading - George Zhang, Phil Gorman, Ross Barnowski

2014-10-15T00:00:00+00:00

Attending

George Zhang
Phil Gorman
Chick Markley
Max Fratoni
Aaron Culich
Xiao Fan
Kyle Barbary
Madicken Munk
Alejandra Jolodosky
Denia Djokic
Ross Barnowski
Katy Huff
Joey Curtis
Kelly Rowland
Andy Haefner
Caroline

Discussion: MocDown

George Zhang and Phil Gorman

George and Phil are both PhD students in the Berkeley neutronics group.

MocDown

MocDown is a neutron transport, transmutation, thermal fluids, and equilibrium search tool developed here at Berkeley primarily by Jeffrey Seifried.

George and Phil covered:

What does MocDown do?
What is going on in the input files?

Code examples and documentation can be found at the homepage.

Discussion: Threading with Python

Ross Barnowski

Ross Barnowski is a PhD student in Kai Vetter’s research group. His work focuses on nuclear instrumentation, including a 3D gamma ray imaging cart called the Compact Compton Imager II.

Threading in Python

Ross gave a talk that covered the concept of concurrency as well as how to make it happen in Python.

Code examples can be found here.

To see the ipython notebook in the notebook viewer try this link: Concurrency Notebook.
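For readers without the notebook at hand, the following is a minimal sketch of thread-based concurrency using only the standard library. It is not taken from Ross's examples; the worker function, the thread names, and the summed range are hypothetical stand-ins for real work.

```python
import threading


def worker(name, results, lock):
    """Do a small computation and record the result under a lock."""
    total = sum(range(1000))
    # The lock guards the shared dictionary against concurrent writes.
    with lock:
        results[name] = total


results = {}
lock = threading.Lock()

# Create four threads that all run the same worker function.
threads = [
    threading.Thread(target=worker, args=(f"worker-{i}", results, lock))
    for i in range(4)
]

for t in threads:
    t.start()
# join() blocks until each thread has finished.
for t in threads:
    t.join()

print(sorted(results))
```

In CPython the global interpreter lock means threads like these mostly help with I/O-bound work; CPU-bound work usually calls for multiprocessing or compiled extensions.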

Lightning Talks

Kelly : Test Your Code

Kelly, after having dedicated a ton of time this summer to building tests for the WARP code, now has a test suite for it. When her colleague, the main WARP developer, made an update to the API, her tests caught it (by failing) and she was alerted to the global effects of the change. Moral of the story: test your code!

Aaron Culich : BRC

Aside: One of the places where tests break down is in concurrency, actually! Aaron recommends a paper “The Problem With Threads” by Edward Lee. He also offers us some choice quotes:

“…non-trivial multi-threaded programs are incomprehensible to humans.”

and

“Threads must be relegated to the engine room of computing, to be suffered only by expert technology providers.”

Aaron also passed out a little handout about BRC. He encourages folks to reach out to him (as part of the Consulting and Community initiative). One of the ways for him to help out is here with THW, where he wants to hear our needs and feedback.

They’ve already benefitted from our feedback concerning Savio here. Please feel free to add more information to that file with a pull request.

In response to the need for a simpler Pledge setup documentation, they’ve created better docs here.
In response to the need for example run files, they’ve created a repository here!

Numpy Vectorization and Python Logging - Andy Haefner and Dan Wooten

2014-10-08T00:00:00+00:00

Attending

Dan Wooten
Rachel Slaybaugh
Madicken Munk
Alejandra Jolodosky
Tenzing Joshi
John Ready
Joey Curtis
Staffan Qvist
Ross Barnowski
Andy Haefner
Laazar Zupich
Aaron Culich
Katy Huff

Discussion: Vectorization with Numpy

Andy Haefner

Andy Haefner is a graduate student in Kai Vetter’s group.

Vectorization With Numpy

A tutorial and code examples can be found here.
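As a small illustration of what vectorization means, the sketch below computes the same quantity twice: once with an explicit Python loop and once with a single NumPy array expression. It is not drawn from Andy's tutorial; the function names and the sample points are made up for the example.

```python
import numpy as np


def distances_loop(points):
    """Distance of each (x, y) point from the origin, one point at a time."""
    out = []
    for x, y in points:
        out.append((x * x + y * y) ** 0.5)
    return out


def distances_vectorized(points):
    """Same calculation, expressed as whole-array operations."""
    pts = np.asarray(points, dtype=float)
    # Square every element, sum across each row, then take the square root.
    return np.sqrt((pts ** 2).sum(axis=1))


points = [(3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
loop_result = distances_loop(points)
vec_result = distances_vectorized(points)

# Both approaches should agree to floating-point tolerance.
print(np.allclose(loop_result, vec_result))
```

The vectorized version pushes the loop into compiled NumPy code, which is typically where the speedup on large arrays comes from.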

Discussion: The Python Logger Utility

Daniel Wooten

Daniel Wooten is a graduate student working for Max Fratoni.

The Python Logger Utility

Example code can be found here.
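For a quick feel of the standard library logging module, here is a minimal sketch, separate from Daniel's example code. The logger name, the messages, and the in-memory stream (standing in for a log file) are all assumptions made for illustration.

```python
import io
import logging

# An in-memory stream stands in for a log file in this sketch.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("thw_example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep messages out of the root logger

# DEBUG is below the INFO threshold, so this message is dropped.
logger.debug("not recorded")
logger.info("starting depletion step %d", 1)
logger.warning("cross section library not found, using default")

print(log_stream.getvalue())
```

Swapping the StreamHandler for a logging.FileHandler would send the same records to a file without touching the calls that emit them.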

HPC Module Installation and Plotting Tools - Everyone!

2014-10-01T00:00:00+00:00

Attending

Ross Barnowski
Sandra Bogetic
Aaron Culich
Denia Djokic
Andy Haeffer
Jason Hou
Katy Huff
Alejandra Jolodosky
Madicken Munk
Kelly Rowland
Rachel Slaybaugh
Daniel Wooten
Andy Haefner
Ryan Pavlovsky
Cameron Bates
Ross Barnowski
Tenzing Joshi
Dav Clark

Discussion: Installing Modules on the BRC Savio Cluster

Katy Huff

Katy Huff is a postdoctoral scholar with the Nuclear Science and Security Consortium and is a fellow with the Berkeley Institute for Data Science.

Module Installation Tips and Tricks

I spent some time last week installing MOOSE on the cluster. The dream was this: MOOSE should be a module that anyone can use on the cluster if they import it. There are a couple of catches to this.

MOOSE’s dependencies can each be compiled with an array of flags. Should I compile only debug versions, only non-debug versions, or both?
MOOSE has a bunch of associated libraries which do various physics. I would also like to install those, but they have varying permissions.

Logging in

Setting up an easy login is a two-step process:

install pledge
create aliases for the ssh commands

Installing Pledge Somewhere

Pledge is for generating time-sensitive one-time-use, two-factor-authentication passwords. That’s awesome. Many of you may have seen or used the passkey generating RSA keys that are used to log into the national laboratory networks. How many use google two-factor authentication for their email or something similar? I do. google2factor.

This is annoying because it takes a long time to get to the final url with which to install Pledge. But, you will eventually succeed. Use the username and password given to you by Krishna at LBL.

Install Pledge. The easiest way is likely to do this on your phone using whatever app installation store is appropriate.
Go here (https://identity.lbl.gov/PledgeEnrollment/enroll.jsp), select HPCS from the pulldown window, and enter the user name/password that Krishna from LBNL sent to you. This should provide an 8-digit profile ID.
Open Pledge and click the + button. It should ask for your profile ID (the thing you just generated); enter it, and it should download your “Pledge profile.” If you get an error, contact Phil Gorman for troubleshooting advice.
Make a pin number. The pin is specific to that profile.
When you log into the Savio cluster you will use this app to generate a new password every time.

Installing Dependencies

Typically, installation requires:

getting your environment right
downloading the source code for the dependencies
following the instructions for each of those

MOOSE relies on two main external dependencies:

HYPRE
PETSc

It also relies on one internal dependency, libMesh. LibMesh is independent of MOOSE, but since MOOSE has added non-standard features to libMesh, they keep their own flavor of libMesh in the MOOSE framework source code. Clear as mud?

Thankfully, MOOSE is a well-documented open source project. It walks through the installation of dependencies as well as the framework.

Environment

To deal with the environment, I edited ~/.bashrc so that it now looks like:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
     . /etc/bashrc
fi

# User specific aliases and functions

export CLUSTER_TEMP=`mktemp -d /tmp/cluster_temp.XXXXXX`

umask 0022

export GRP_DIR="/global/home/groups/ac_nuclear"

export PACKAGES_DIR="$GRP_DIR/MOOSE/moose-compilers"

That makes sure that the packages will be downloaded to the right place (CLUSTER_TEMP), installed in the right place (PACKAGES_DIR), and linked to the right place (GRP_DIR).

For this to take effect, the terminal needs to re-initialize itself with:

source ~/.bashrc

Downloading the Dependency Source

This can be done using curl.

curl -L -O --insecure https://computation.llnl.gov/casc/hypre/download/hypre-2.8.0b.tar.gz
curl -L -O http://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-3.4.3.tar.gz

Installing Hypre

First, I went to the place where I want to install it.

  cd $GRP_DIR

Install Hypre according to the instructions. That went well, creating the beginning of a module called moose-dev-gcc. So there’s nothing interesting to share. The interesting stuff is when things go wrong.

Installing PETSc

I started to configure PETSc:

Install PETSc - OOOPS - stop installing PETSc and install valgrind
load the moose-dev-gcc module that has now been created
load valgrind
configure PETSc

xxx=========================================================================xxx
 Configure stage complete. Now build PETSc libraries with (legacy build):
   make PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/stack_src/petsc-3.4.3 PETSC_ARCH=arch-linux2-c-debug all
 or (experimental with python):
   PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/stack_src/petsc-3.4.3 PETSC_ARCH=arch-linux2-c-debug ./config/builder.py
xxx=========================================================================xxx

Now what?

I read the docs, and chose the legacy build because the moose docs say:

During the configure/build process, you will be prompted to enter the correct make commands. Because this can be different from system to system, I leave that task to the reader. However, I have received better results when following the non-experimental commands.

make PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/stack_src/petsc-3.4.3 PETSC_ARCH=arch-linux2-c-debug all

It worked!

Completed building libraries
=========================================
making shared libraries in /global/home/groups/ac_nuclear/MOOSE/moose-compilers/stack_src/petsc-3.4.3/arch-linux2-c-debug/lib
building libpetsc.so
=========================================
Now to install the libraries do:
make PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/stack_src/petsc-3.4.3 PETSC_ARCH=arch-linux2-c-debug install
=========================================

So, I did that:

[huff@ln001 petsc-3.4.3]$ make PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/stack_src/petsc-3.4.3 PETSC_ARCH=arch-linux2-c-debug install
*** Using PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/stack_src/petsc-3.4.3 PETSC_ARCH=arch-linux2-c-debug ***
*** Installing PETSc at prefix location: /global/home/groups/ac_nuclear/MOOSE/moose-compilers/petsc/petsc-3.4.3/gcc-opt  ***
====================================
Install complete. It is useable with PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/petsc/petsc-3.4.3/gcc-opt [and no more PETSC_ARCH].
Now to check if the libraries are working do (in current directory):
make PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/petsc/petsc-3.4.3/gcc-opt test
====================================
[huff@ln001 petsc-3.4.3]$

So, I ran the tests:

make PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/petsc/petsc-3.4.3/gcc-opt test

Here’s the output:

[huff@ln001 petsc-3.4.3]$ make PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/petsc/petsc-3.4.3/gcc-opt test
Running test examples to verify correct installation
Using PETSC_DIR=/global/home/groups/ac_nuclear/MOOSE/moose-compilers/petsc/petsc-3.4.3/gcc-opt and PETSC_ARCH=arch-linux2-c-debug
Possible error running C/C++ src/snes/examples/tutorials/ex19 with 1 MPI process
See http://www.mcs.anl.gov/petsc/documentation/faq.html
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: ln001.brc
--------------------------------------------------------------------------
lid velocity = 0.0016, prandtl # = 1, grashof # = 1
Number of SNES iterations = 2
Possible error running C/C++ src/snes/examples/tutorials/ex19 with 2 MPI processes
See http://www.mcs.anl.gov/petsc/documentation/faq.html
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: ln001.brc
--------------------------------------------------------------------------
lid velocity = 0.0016, prandtl # = 1, grashof # = 1
Number of SNES iterations = 2
[ln001.brc:54921] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
[ln001.brc:54921] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
egrep: /global/home/groups/ac_nuclear/MOOSE/moose-compilers/petsc/petsc-3.4.3/gcc-opt/arch-linux2-c-debug/include/petscconf.h: No such file or directory
Possible error running Fortran example src/snes/examples/tutorials/ex5f with 1 MPI process
See http://www.mcs.anl.gov/petsc/documentation/faq.html
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: ln001.brc
--------------------------------------------------------------------------
Number of SNES iterations =     4
Completed test examples

On first glance, maybe it passed, right? WRONG! It failed, yo. The first rule of programming is: Google the error. Googling it, of course, sends us to a discussion on the petsc-users list host - exactly what we want - right here. Interestingly, the question comes from someone at LBL. Perhaps it’s even on the same BRC system? In any case, Barry Smith, who leads the PETSc project, responded…

Well it is running. It is just producing annoying warning messages. You need to talk to your local MPI expert on that system for how to get rid of the problem. Barry

Since I don’t have any idea who in BRC is the MPI guru willing to solve it, I guess we just make a note of it and go on with our lives. So, moving on, I had to clone moose.

I hate using GitHub’s ssh protocol, so I set up my ssh keys for the BRC cluster. This is how that works.
I have a fork of moose (which currently exactly parallels moose development), so I cloned from that.
I also fetched the upstream idaholab/moose repo so that I can keep up to date.
MOOSE likes a clean history, so every time you pull, you have to rebase (we all make choices…): git pull --rebase upstream master

Lightning Talks

Ross Barnowski : PyQT

So, pyqtgraph is good for volumetric rendering. There are a lot of example scripts, so you can copy those. Additionally it is good for making fast video graphics (better than matplotlib).

Alejandra : MATLAB ternary plots

Alejandra shared a ternary plotting thing.

Andy Haefner : Mayavi

Mayavi uses VTK, which is pretty powerful, but it’s a Python interface. Additionally, it uses syntax that will be familiar to MATLAB users.

There are various interesting features in Mayavi. Quiver, for example, is a really basic function call that generates vector fields. http://docs.enthought.com/mayavi/mayavi/

Ryan Pavlovsky : Dygraphs

Dygraphs is nice, lightweight, and interactive. So, it’s great for websites, because you just drop in a single JavaScript file.

Katy Huff : yt (and what is plotly?)

Katy likes and is impressed with yt. She is curious but nervous about plotly.

Dav Clark : Bokeh

It’s architected to have a javascript frontend and is meant to be hooked into generic data servers.

It has cool zooming capabilities in the GUI and has neato features like linked brushing, so that two plots are linked and can be interacted with using a single tool in one of the windows.

PARCS and RadWatch (without the physics) - Sandra Bogetic and Ryan Pavlovsky

2014-09-24T00:00:00+00:00

Attending

Sandra Bogetic
Christian DiSanzo
Alejandra Jolodosky
James
Kelly Rowland
Jasmina Vujic
Rachel Slaybaugh
Massimiliano Fratoni
Katy Huff
Dan Wooten
Aaron Culich
Madicken Munk
Kedar Kolluri
James Bevins

Discussion: PARCS

Sandra Bogetic

Sandra Bogetic is a first year graduate student in the Nuclear Engineering Department.

PARCS

PARCS is a powerful tool, but it seems to have struggles with version control, and would strongly benefit from a more transparent and controlled release procedure.

Since it is an NQA-1 code… it’s surprising that it is not under version control.

Input Files

Generation of cross sections can be done in various ways. These include CASMO, HELIOS, and TRITON.
The input for thermal hydraulic behavior can be entered into PARCS or coupled using PATH, TRACE, and/or RELAP. For a PWR, you can do it in TH.
Depletion can be done by PARCS, but sometimes you don’t want to do it with PARCS because perhaps you have done depletion in some other code (such as SIMULATE). PARCS allows you to input this external data.
The input file formatting is in blocks.
- CNTL, XSEC, GEOM, PARAM, TH, TRAN, etc.
- Data can be repeated using an asterisk
- Input ends with a .
- etc.

Options

There are many options that can or should be specified. These include the core type, core power, simulation behavior concerning Xe and Sm (will you input the values, do you want them to be at equilibrium, transient, etc.), control rod banking positions, external thermal hydraulics linkages, print options, whether or not to conduct depletion, etc.

Additionally, there is a tree variable for cross section definitions.

The geometry card of course is very important. The core compositions are all defined for the assemblies, reflector, etc. Typical boundary conditions are available.

Running the input

Examples

Examples can be found in the presentation, but will not be shared online.

Discussion: RadWatch (without all the physics)

Ryan Pavlovsky

Ryan is a graduate student in the Nuclear Engineering Department.

Linux and Unix tools within RadWatch

The stack

Sensor input, python, smtp, datetime, python, cron, scp, ssh, ssh-agent, pytables, matplotlib, scp, yes, drupal, jquery

Cron

Cron is for scheduling jobs.

crontab -e can be used to edit the cron file for your user space. Don’t freak out if it’s empty. Just use a template from your toplevel cron file or find a template on the internet to fill out.

Ryan reminds us of the importance of the man page. If you need more help with the crontab command, try man crontab in your terminal to figure out its secrets. (Man pages are opened in a program called less. So, to get out of the man page, type “q”.)

Note that your system may have a cron.allow file. That file, if it exists, names the people allowed to create cron jobs.

SSH

Ryan points out that there are two versions of ssh (client and server). They have their own configuration files!

Note the config file located in /etc/ssh/ssh_config, but also note the one in your home directory, ~/.ssh/config, and the one for server daemon configuration, /etc/ssh/sshd_config. NOTE: on Mac OS X, there may not be an additional ssh directory layer in /etc. So, find those files at /etc/ssh_config and /etc/sshd_config.

Fun fact: DSA has a stronger random number generator than RSA, but RSA is used more widely. This is likely because RSA encryption is faster and (more compressed?) than DSA.

Code examples from Ryan’s talk can be found here in the master branch.

When and Where Survey

2014-09-18T00:00:00+00:00

Space and Time

Space and time are complex, coupled problems for THW. We would like a place and a time that works well for everyone. But, schedules differ, geography is an issue, and space is hard to come by on this campus.

Space

There are three possible spaces.

2150 Shattuck, Suite 230

This is the space we’ve been using for the last year. Many of you know it. It has lovely light, a lot of chairs, and a nice big round table. Additionally, we never get kicked out of it for other events, because our events take priority here. The location may be a downside for those of you who drive or don’t like walking from Etcheverry.

190 Doe

This space is brand spanking new, and yet, very old. The Berkeley Institute for Data Science (driven by many of the same concerns that drive The Hacker Within) has an open, computational-science-focused space that is just being finished up. It’s in the historic Doe library, right at the entrance, so it’s very central and convenient for people coming from any corner of campus. Also, it looks like a beautiful startup space, it has a camera/screen portal with the capability to broadcast our meeting to remote viewers, and Katy has the keys. The construction crew is finishing up some of the A/C in the room this week, but all the furniture is in and it’s ready for excellent events like this. Go check it out if you don’t believe me.

4101 Etcheverry

This conference room is lovely and keeps its guests surrounded by nuclear engineering books. It’s a popular choice for many NE meetings and has a lot of charm. It is large enough to seat our current meeting attendees, but probably no more than that. So, consider the ideal size of THW, when you make this decision. Many might prefer that THW stay the same size it is now. It is a very convenient location for those of you in Etcheverry, but fairly inconvenient for those of us who sit in the NSSC space.

Time

In general, we’d like to have the meeting in the afternoon. Three possible start times have been suggested. These are 3:00, 3:30, and 4:00pm. The meeting nominally lasts between 1.5 and 2 hours.

Consider your class schedule. Please be generous. It’s ok if you have to be 10 minutes late. This is Berkeley.

Exercise your rights

We’re all equal here. Please exercise your rights by voting in this online poll that I’ve set up. Between this poll and the availability of key locations, we’ll find a place and time.

Serpent and LaTeX - Alejandra Jolodosky and Katy Huff

2014-09-17T00:00:00+00:00

Attending

Kelly Rowland
Madicken Munk
Sandra Bogetic
Daniel Wooten
Alejandra Jolodosky
Jasmina Vujic
Massimiliano Fratoni
Aaron Culich
Ross Barnowski
Ryan Pavlovsky
Kedar Kolluri
Xin Wang
Katy Huff
James Bevins
Jessica Roche

Discussion: Serpent

Alejandra Jolodosky

Alejandra is a graduate student in the nuclear engineering department. She’ll discuss the use of Serpent and how to find a bug when using it.

Serpent

Download the slides here.

Discussion: LaTeX, markup for science

Katy Huff

Katy Huff is a postdoctoral scholar in NSSC and BIDS.

LaTeX

Notes and code examples can be found here.

Lightning Talks

Kelly Rowland : Today I learned what a call stack is.

Everyone, check out the Wikipedia article for Call Stack.
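A call stack is easy to see from inside Python. The sketch below is an illustration added alongside Kelly's note, not something shown at the meeting; the three function names are hypothetical. Each call pushes a frame onto the stack, and the standard library inspect module can list the frames that are currently active.

```python
import inspect


def inner():
    # inspect.stack() returns one record per active frame, innermost first.
    return [frame.function for frame in inspect.stack()]


def middle():
    return inner()


def outer():
    return middle()


call_chain = outer()

# The first three entries are the functions defined above, most recent call first.
print(call_chain[:3])
```

The same ordering, read in reverse, is what a Python traceback prints when an exception escapes from the innermost function.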

Ross Barnowski : Did you know LaTeX + Matplotlib = Awesome

With dollar signs in plain text, matplotlib renders math on your plot, in the title, on the axes, in the labels… wherever!

CRAM and imagemagick - Dan Wooten and Madicken Munk

2014-09-10T00:00:00+00:00

Attending

Daniel Wooten
Kelly Rowland
Cameron Bates
Christian DiSanzo
George Zhang
Phil Gorman
Alejandra Jolodosky
Madicken Munk
Jasmina Vujic

Discussion: CRAM

Daniel Wooten

Dan Wooten is a second year graduate student in the Nuclear Engineering Department.

The CRAM method

Dan introduced the CRAM method.

Discussion: Imagemagick

Madicken Munk

Madicken Munk is a fourth year graduate student in the Nuclear Engineering Department.

Imagemagick

Madicken demonstrated the generation of gifs on the command line.

Code examples can be found here.

Discussion of Next Week

Proposed future talk(s) from Jasmina: Series of “how to” talks— how to find and install your software, how to set up your environments, etc.

It was decided that next week’s Nuclear Talk should be Serpent Tutorial and How to Approach a Bug: Alejandra Jolodosky

The non-nuclear talk should be an Introduction to LaTeX: how you should interface with TeX on your respective OS, how to format your paper, image additions, and managing citations. Katy can do this.

Computational Nuclear Engineering Overview & Bash - Max Fratoni & Katy Huff

2014-09-03T00:00:00+00:00

Attending

Kelly Rowland
Katy Huff
Alejandra Jolodosky
Blake Huff
Jasmina Vujic
Sven
Daniel Wooten
Rachel Slaybaugh
Denia Djokic
Madicken Munk
Sandra Bogetic
Phil Gorman
Naman
Max Fratoni

Computational Tools for Nuclear Engineering, An Overview

Speaker Intro: Massimiliano Fratoni

Max Fratoni is a professor (forever freshman) in the nuclear engineering department who specializes in computational neutronics methods, advanced reactors, and accident tolerant fuels.

Discussion: Computational Tools for Nuclear Engineering

Max would like to help define what to use when. The first question to ask is “What is your problem like?” Ask whether it is:

steady state or time dependent
over a short (reactivity excursion) or long (depletion) time frame

The most generic types of tools are either

stochastic (Monte Carlo)
or deterministic (many).

The question, again, is “What are you trying to model?” If your simulation has a common geometry and common materials, then deterministic tools are certainly likely to be the answer. For deterministic codes, there are many simplifications, so it’s likely to be fast, but perhaps not as flexible.

If your geometry is unusual or you have unusual materials, stochastic models are probably going to capture your problem the best. In general, you will choose either MCNP or Serpent. So, when do you use MCNP and when do you use Serpent? Serpent is very user friendly, but going by the theory part of the Serpent manual, it is very hard to be confident in your results, since there are so many knobs that can be turned, but don’t actually have to be turned.

Serpent, for example, can combine points and make up its own energy grid. When you do this, you can lose accuracy, in particular in the unresolved resonances. This unified energy grid (which is set by default) will definitely bias some of your isotopics.

That’s fine, but MCNP doesn’t do depletion in a reliable way.

There is also a suite of codes that are capable of transient solutions by coupling with a Monte Carlo or deterministic code. These are often specifically designed for a certain reactor. This includes PARCS, for example.

Besides coupling with a Monte Carlo or deterministic code, depletion can be handled, by and large, by ORIGEN. ORIGEN2 and ORIGEN-S are your options. The results from ORIGEN are going to be only as good as your cross sections.

Future Topics

What is the difference between the matrix exponential method and the CRAM method? (Daniel)
MocDown (Phil)
PARCS (Sandra)
Serpent&PARCS (Sandra)
COMSOL (Madicken)
MONTEBURNS (Alejandra)
MOOSE (Katy)

Madicken will show off COMSOL next week, and then Daniel will talk the week after that.

Discussion: Bash and Unix / Linux Environments

Speaker Intro: Katy Huff

Katy Huff is an NSSC Postdoctoral Scholar and a Berkeley Institute for Data Science Fellow.

Discussion: Bash

Code examples can be found here.

Lightning Talks

Kelly Rowland

Fall Kickoff 2014

2014-08-27T00:00:00+00:00

Wednesday at 4pm in 2150 Shattuck, Suite 230.

Attending

Kelly Rowland
Madicken Munk
Ross Barnowski
Daniel Wooten
Russell
Sven
Massimiliano Fratoni
Denia Djokic
Katy Huff

Discussion: Upcoming Topics

The Berkeley chapter of the Hacker Within scientific computing group (formerly known as the Berkeley NE computational methods group) will be kicking off the fall 2014 semester on Wednesday, August 27th, from 4pm-6pm.

The goal of this meeting was to plan the rest of the semester’s meetings. The time, frequency, and content of the upcoming semester’s meetings were all up for discussion. In particular, we were able to brainstorm a possible suite of software tools, resources, and practices to discuss this upcoming semester. If you have an idea, but didn’t make it to the meeting, reply on the UCB hackerwithin listhost.

Brainstorming Computational Topics

We thought of a number of cool things we’d like to talk about this semester.

LaTeX
- Resumes
PyNE
- Live build (Kelly)
Bash
Ubuntu install and dual boot
Fun hacky things
- Doxygen
- Cmake
Plotting
- Matplotlib, yt
- gnuplot
- 3D options
- Rules
- how to make a good plot
- animations (imagemagick)
Presentation rules
- tools
- rules
- Pre-ANS
Vectorized/Matrix computing
- formulating your problem correctly
- Andy’s diffusion example
Web tools (scraping, python urllib, wget)
Extending Python
- cython, C/API, boost.python
Threading (multiprocessing, ZMQ)

Bootcamp Series Ideas

Professor Fratoni has an excellent idea for embedding a seminar for neutronics specific toolsets into this general computational seminar. The topics will vary and will be the subject of discussion during the September 3rd meeting.

Tools
- Serpent
- MCNP
- ORIGEN
- MOCUP
- MOOSE
Data/Methods
Tricks ‘n tips
Overviews & Comparison
Increase interactivity, project/tutorial focused
Group attendees by interest/skill-level

General thoughts

Tutorial code should be posted on github prior to presentation
Reminders should be sent the day before the meeting
Also, the listhost reminders should go out to ne-grads for a while
Having the meetings in the Doe Library BIDS space seems feasible.

Meeting Structure ideas

1st hour -> nuclear tools seminar
1:00 - 1:45 -> computing skillz
1:45 - people burn out -> lightning/hanging out

Upcoming Talks

Bash (Katy)
LaTeX/resumes (Katy for resumes; Laurence or Rachel for general LaTeX?)
PyNE (Kelly)

Ordering of talks - Nuclear series

Max overview of what tools to use when
… figure out from Max’s talk

LaTeX - Laurence Lewis

2014-04-29T00:00:00+00:00

Attending

Katy Huff
Ryan Bergmann
Professor Rachel Slaybaugh
Professor Max Fratoni
Kelly Rowland
Daniel Wooten
Joshua Howland
Madicken Munk
and others… I failed at taking attendance this time.

Lesson: LaTeX

You can find a lot of Laurence’s examples in the master branch of our repository.

Lightning Talk: Rachel on drawing, Katy on FloatBarrier and Max on Easy LaTeX

Rachel shared her LaTeX homework assignments, Katy pointed out FloatBarrier, the best command ever, and Max showed off a WYSIWYG LaTeX editor called LyX.

So You Have A Software

2014-04-23T00:00:00+00:00

Attending

Katy Huff
Ryan Bergmann
Professor Rachel Slaybaugh
Joshua Howland
Anthony Scopatz
and many others… I failed at taking attendance this time.

Talk

This week Anthony Scopatz gave a talk on software architecture patterns. Hint: the most important file in your project is the license! Find the slides here.

Packaging and Distribution - Anthony Scopatz

2014-04-22T00:00:00+00:00

Lesson: Anthony Scopatz “So You Have a Software”

Anthony’s talk can be found here.

Emailing with Python - Ross Barnowski

2014-04-16T00:00:00+00:00

Attending

Ross Barnowski
Katy Huff
Ryan Bergmann
Professor Rachel Slaybaugh
Kelly Rowland
Daniel Wooten
Joshua Howland
Madicken Munk
…

Lesson: Emailing With Python

Tutorial for sending email using Python, smtplib, and the Gmail SMTP server.

Requires:

Python (2.6 or greater)
Python modules: smtplib, email, getpass, psutil (advanced example)

Example scripts (examples):

smtp_simple.py: Simplest example demonstrating the use of smtplib to send a “Hello World” style message.
smtp_mime.py: A more complicated example demonstrating the use of several MIME objects in the email module to construct a message out of formatted text (HTML) with an image attachment.
simulation_example_ : This folder contains an example python script that calls a simulation program (in this case, a plasma calculation from Prof. Morse’s 281 class). The simulation is launched from the python script, and psutil is used to do some rudimentary performance logging. When the calculation finishes, the results, simulation output, and performance statistics are all attached to an email and sent to the user.

NOTE: The logging in this example is for demonstration only. This simple logging is probably not the way you’d want to do it if you truly wanted to track the performance of a running calculation. May not work on all systems.
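
As a rough illustration of the kind of message the MIME example builds, here is a minimal sketch using only the standard library. It is not the code from the repository: the helper names build_message and send_via_gmail, the addresses, and the attachment contents are hypothetical, and it is written for Python 3 rather than the Python 2.6 the tutorial targeted. The host smtp.gmail.com and port 587 (STARTTLS) are the commonly documented Gmail settings and are assumed here; Gmail may also require an app-specific password.

```python
import smtplib
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(sender, recipient, subject, html_body, attachment=None,
                  attachment_name="results.dat"):
    """Assemble a multipart message with an HTML body and an optional attachment."""
    msg = MIMEMultipart()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.attach(MIMEText(html_body, "html"))
    if attachment is not None:
        part = MIMEApplication(attachment)  # raw bytes, base64-encoded
        part.add_header("Content-Disposition", "attachment",
                        filename=attachment_name)
        msg.attach(part)
    return msg


def send_via_gmail(msg, username, password):
    """Send msg through Gmail's SMTP server (host and port are assumptions)."""
    server = smtplib.SMTP("smtp.gmail.com", 587)
    try:
        server.starttls()  # upgrade the connection to TLS before logging in
        server.login(username, password)
        server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    finally:
        server.quit()


# Build (but do not send) a hypothetical "simulation finished" message.
message = build_message("me@example.com", "you@example.com",
                        "Simulation finished",
                        "<p>Hello <b>World</b></p>",
                        attachment=b"1.0 2.0 3.0\n")
```

In a real script, the password would typically be read with getpass.getpass() and the message sent with send_via_gmail(message, username, password).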

Lightning Talk: Rachel on pretty images and Madicken on slow MCNP

Rachel showed an excellent-looking, peacock-colored image of the ratio of two neutronics solutions.

Madicken discussed the behavior of MCNP when a single material is replaced by a material which causes more neutron scattering. Result: MCNP slows down a whole heck of a lot for such materials.

Raspberry Pi Hacking - Ryan Pavlovsky

2014-04-09T00:00:00+00:00

Attending

Ryan Pavlovsky
Katy Huff
Ryan Bergmann
Josh Howland
Prof. Rachel Slaybaugh
Kelly Rowland
Ross Barnowski
Tomi Akindele

Lesson: Raspberry Pi

Ryan Pavlovsky, a student in Kai Vetter’s research group, gave an excellent presentation about what he’s done with the Raspberry Pi.

Stuff that we discussed:

How did you get this?
What are the peripherals that work with it?
- gpu/cpu
- Broadcom video card
- ARM processor, 700 MHz
- 512 MB memory
- JTAG header?
- USB/Ethernet
- SD card additional memory
- Raspbian operating system
What example projects are cool?
- smart kegerator (monitors flow rates, temperatures, accounting, facial detection)
- Quake III
- cluster of Pis, with MPI built on it. Rack made of Legos!
Demos!
- pong, a ping sensor. Sends a ping, measures time to return.
- ping, a program that acquires pong senses over time.
- simon says, computer tells you what to do, based on ping
- GEANT4
  - 4.10 C++ implementation
  - networked raspberry pi
  - edited ~/.bashrc for data

Code examples for the demo can be found here.

Lightning Talks

We talked, in an ad hoc fashion, about the Heartbleed OpenSSL bug.

Testing Part II - Katy Huff

2014-04-02T00:00:00+00:00

Lesson: Introduction to Testing

Katy gave a very quick continuation of testing in the context of languages other than Python. She mostly did a tour through the Cyclus code and its tests, written using the Google Test framework and built into an executable with CMake.

Lightning Talks

Ross Barnowski gave a quick overview of how to mount remote drives on a Linux or Unix platform.

Testing - Katy Huff

2014-03-19T00:00:00+00:00

Lesson: Introduction to Testing

Katy gave a very quick intro to testing using nose, the Python testing package that provides the nosetests runner. There is a simple example here.
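
For readers who have not seen this style of testing, here is a minimal sketch of the kind of test functions nosetests collects: plain functions whose names start with test_ and that use assert. The function mean and its tests are hypothetical and are not the example linked from these notes. nose is no longer maintained, but pytest discovers tests written this way as well.

```python
def mean(numbers):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(numbers) == 0:
        raise ValueError("mean() requires at least one number")
    return float(sum(numbers)) / len(numbers)


def test_mean_of_integers():
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_of_single_value():
    assert mean([7]) == 7.0


def test_mean_of_empty_list_raises():
    # The function should refuse an empty input rather than divide by zero.
    try:
        mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an empty list")
```

Saved in a file whose name starts with test_, these functions would be picked up by running the test runner in that directory.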

IPython - Ross Barnowski

2014-03-12T00:00:00+00:00

Lesson: Introduction to IPython

Ross gave an introduction to one of the best tools in the Python development suite: IPython.

His notes for this tutorial can be found on GitHub.

Makefiles - Katy Huff

2014-03-05T00:00:00+00:00

Lesson: Introduction to Makefiles

Katy gave a very quick intro to makefiles. This was based largely on Software Carpentry material, replicated here.

Self Documenting Code - Rachel Slaybaugh

2014-02-26T00:00:00+00:00

Attending

Prof. Rachel Slaybaugh
Ryan Bergmann
Jiankai (Jack) Yu
Dan Wooten
Sandra Bogetic
Christian DiSanzo
Josh Howland
Alex Chong
Kelly Rowland
Phil Gorman
Jason Hou

Lesson: Code Documentation

Rachel gave a brief overview of a variety of documentation strategies, including how to write code comments that generate useful API documentation. Here is Rachel’s Tutorial: https://github.com/thehackerwithin/berkeley/tree/master/documentation/documentation.md

Lightning Talks

Documentation - Rachel Slaybaugh

Lesson: Introduction to Documentation

Professor Rachel Slaybaugh gave an introduction to documenting code. This covered:

Code Comments
API Documentation
Auto-Documentation
Self-Documenting Code
Readmes
User Guides
Developer Guides

You can find details about this topic from the meeting notes.
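
To make the idea of comments that feed API documentation concrete, here is a minimal sketch of a Python docstring that tools such as pydoc or Sphinx can extract. The function decay_constant and the NumPy-style section layout are illustrative choices, not taken from the meeting notes.

```python
import math


def decay_constant(half_life):
    """Return the decay constant for a given half-life.

    Parameters
    ----------
    half_life : float
        Half-life of the nuclide, in seconds. Must be positive.

    Returns
    -------
    float
        Decay constant in inverse seconds, computed as ln(2) / half_life.
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return math.log(2) / half_life
```

Calling help(decay_constant) in an interpreter prints the docstring, and the descriptive function and argument names carry part of the documentation on their own.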

Intro to Git Part II - Katy Huff

2014-02-19T00:00:00+00:00

Lesson: Introduction to Git Part II

Katy gave the second half of the introduction to version control using Git: remotes. Here is Katy’s Tutorial.

Intro to Git - Katy Huff

2014-02-12T00:00:00+00:00

Attending

Katy Huff
Ryan Bergmann
Jiankai (Jack) Yu
Dan Wooten
Sandra Bogetic
Christian DiSanzo
Madicken Munk
Josh Howland
Prof. Rachel Slaybaugh
Kelly Rowland
Phil Gorman
Alexjandra Jolodosky
Jason Hou

Lesson: Introduction to Git

Katy gave a very quick intro to version control using git. Here is Katy’s Tutorial.

Lightning Talks

Rachel gave an introduction to the IPython Notebook, an excellent tool for prototyping Python code.

GPUs and CUDA - Ryan Bergmann

2014-02-05T00:00:00+00:00

Attending

Ryan Bergmann
Katy Huff
Jiankai (Jack) Yu
Dan Wooten
Prof. Max Fratoni
Sandra Bogetic
Christian DiSanzo
Josh Howland
Prof. Rachel Slaybaugh
Kelly Rowland
Nikola Radnovic

Lesson: GPUs and CUDA

Ryan Bergmann covered various features of GPUs and CUDA. Here is Ryan’s Tutorial.

Things we learned include:

CUDA stands for Compute Unified Device Architecture.
SIMD stands for Single Instruction Multiple Data.
GPUs are good for turning compute-bound problems into memory-bound ones.
CUDA cores aren’t really cores; they behave more like SIMD lanes, and there are many CUDA cores per streaming multiprocessor.
You have to use the SIMD lanes in order to get good performance out of a GPU system.
Coalesced reading and writing means that your cores should be accessing adjacent pieces of memory simultaneously.
The memory latency is higher for GPUs than CPUs, but the GPU hides this better the more threads you’re running.
The host thread launches the GPU kernel.
Threads are organized into blocks.
Blocks are organized into grids.
The grid is the full set of blocks launched for the kernel you have loaded.
We learned how to launch a kernel for
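
The thread, block, and grid hierarchy in the list above can be sketched in plain Python. This is not CUDA code and not Ryan’s example: it serially emulates a one-dimensional kernel launch, using the common CUDA indexing idiom blockIdx.x * blockDim.x + threadIdx.x to give each thread one array element. The names blocks_needed, launch, and add_kernel are hypothetical.

```python
def blocks_needed(n_elements, threads_per_block):
    """Number of blocks so that every element gets a thread (ceiling division)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def launch(kernel, n_blocks, threads_per_block, *args):
    """Serially emulate a 1-D kernel launch over a grid of blocks."""
    for block_idx in range(n_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


def add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Each emulated thread handles one element, like a CUDA vector add."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):  # guard: the last block may have spare threads
        out[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
threads_per_block = 4
launch(add_kernel, blocks_needed(n, threads_per_block), threads_per_block,
       a, b, out)
```

On a real GPU the threads run concurrently rather than in nested loops. Because neighboring thread indices touch neighboring array elements, this layout also matches the coalesced access pattern described in the list.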

Lightning Talks

Katy gave a quick lightning talk on style guides for code.
Kelly gave a more in-depth lightning talk on Laser Doppler Vibrometry.

Bash Meeting - Katy Huff

2014-01-22T00:00:00+00:00

Attending

Katy Huff
Ryan Bergmann
Professor Rachel Slaybaugh
Kelly Rowland
Daniel Wooten
Christian Disanzo
Jiankai (Jack) Yu
Sandra Bogetic

Lesson: Bash

Katy will review various features of the powerhouse of programming, the *nix terminal. (Note that *nix is jargon intended to indicate both Linux and Unix operating systems.) You’ll find this lesson within our shared repository. Start with the tutorial.

First Meeting

2014-01-15T00:00:00+00:00

This was a planning meeting. Katy Huff, Ryan Bergmann, and Professor Rachel Slaybaugh attended. Together we discussed a possible suite of useful software practices to discuss this semester.

What is this?

We discussed that part of the purpose of these meetings is to restart a successful group that originated in Wisconsin, “The Hacker Within.” Ideally, this meeting will facilitate sharing skills and best practices for computational nuclear engineering applications. Last semester, we had a couple of such meetings. In spring semester I would like to share a number of skills for scientific software development (testing, data management, version control, literate programming, etc.) and to ask the rest of you to share the skills you have as well. The goal will be to incorporate these practices into our workflows. This would be a great venue for introducing new libraries, showing off useful features of a neutronics code you’re using, or bringing up a computational problem you’re having.

What can be expected?

We decided to try meetings with an agenda structured thus:

First, we will go around the room and attendees can introduce themselves.
The meeting will start with one 30-40 minute talk on a topic of import to scientists who use software. Particular emphasis is likely to be paid to topics useful to nuclear engineering researchers. To volunteer to give a talk, mention it at a meeting, or email Katy.
The talk will be followed by a short period for questions.
For up to 40 minutes, attendees will have the opportunity to give lightning talks on short topics. These may share a small skill snippet, demonstrate a computational issue you’re having with your research, or cover anything of interest to the group. Sometimes, lightning talk topics will be requested ahead of time on a theme (e.g., text editors). To give a lightning talk, just show up and speak up when the time comes. If you like, letting Katy know ahead of time is always welcome.
After the meeting, attendees can hang around in the space and hack together on their research codes, if they like.

What are the topics?

The topics for the first part of the semester will focus on reproducibility:

command line
gpus/cuda
version control
build systems
testing
self documenting code
cloud computing (amazon ec2, etc.)
parallelism
profiling

Lightning Talks

A number of good topics were identified for lightning talks or talk series. If you’re interested in talking about these or something else, just come prepared. If the talk you want to give is in a series, consider banding together a group of folks who would like to give the other parts of the series.

debugging
libraries/linking
scripted plotting
exceptions
text editors
licensing and export control

The Data Analysis Tools Series

Archival Data Repositories

Welcome!

Speakers

Content

Objectives:

Data Repository Defined

A minimum rationale for depositing/sharing…

Things to Consider when choosing a Repository

Reputation

Sustainability

Visibility

Usability

Features

Formats

Rights

General vs. Subject Specific Repositories

General repositories

Subject repositories

APIs + Wrappers

Dataverse Walk-through

On your own

Contacts

Intro to Machine Learning with scikit-learn -- Robert Martin-Short

Welcome!

Speakers

Robert Martin-Short

Content

Installation

Materials

DATS Round-table

Welcome!

Sign-In

DATS Meet up

Welcome!

Sign-In

Matplotlib Two Ways -- Caroline Cypranowska

Welcome!

Speakers

Caroline Cypranowska

Content

Installation

Materials

Data tidying in R & Python -- Caroline Cypranowska and Sara Stoudt

Welcome!

Speakers

Caroline Cypranowska

Sara Stoudt

Content

Charles Frye -- Use You A Jupyter Notebook For Great Good!

Welcome!

Agenda

Speakers

Charles Frye

Content

Mark Mikofski -- Git Version Control with GitHub

Agenda

Requirements

Objectives

Git VCS

In case of fire, git commit, git push and leave the building

Git on Git

XKCD on Git

Version Control Software (VCS) aka Source Code Management (SCM)

References

GitHub

GitHub Pages

SSH or HTTPS

Git Primer

XKCD on Git Commit

Winning Workflow

Additional Info

First meeting of Fall 2018 Semester -- Organization

Welcome! Please sign in at bit.do/dats-082718.

Agenda

Speakers

Caroline Cypranowska

Diya Das

Tim Howes -- File syncing tools - syncthing, dat, git-annex

File syncing tools

Machine Learning Pipelines for R with `sl3`

`sl3` Installation

`devtools` installation (if needed)