Accessing public data on .gov websites (or how to deal with bureaucrats)
Caroline Cypranowska
Prerequisites
Today’s exercises will require Bash. If you have a Mac or Linux machine, you’re mostly good to go.
Windows
Most Windows users in need of a Bash terminal use Cygwin, a collection of Linux software tools compiled for Windows. Other options include Git Bash (bundled with Git for Windows) and the Windows Subsystem for Linux (on Windows 10). The steps below cover installing Cygwin and the few other tools required for this tutorial.
- Download Cygwin and run `setup.exe`. Select 'Install from Internet' when prompted by the installation wizard. Choose your root directory and mirror for installation.
- The installer will also download a list of available packages. Include the default packages, but make sure to search for and include `curl` and `wget`.
- Add the Cygwin path to the Windows Environment Path Variable, which can be found in the 'Advanced system settings' menu. Append `;C:\cygwin\bin` to the end of the variable value (assuming this is where you installed Cygwin).
MacOS
The terminal in MacOS has the majority of the tools needed to make requests to government databases, as cURL comes with Macs out of the box. The main advantage of `wget` over `curl` is that it can download recursively. While you can choose to do the exercises without `wget`, it can be easily installed with Homebrew.
foo@bar:~$ brew install wget
A brief explanation of networking protocols
In networking, a protocol is a set of rules for communication. Peer-to-peer networks are composed of interconnected computers, but no computer has a privileged position. Client-server networks, on the other hand, are composed of servers that perform functions on behalf of other machines (clients). Both of these systems rely on protocols to send and receive data.
The set of protocols used on the Internet is called TCP/IP (Transmission Control Protocol/Internet Protocol). The TCP/IP model has a layered structure, and protocols like HTTP, FTP, and SSH run on the highest layer (the application layer).
HTTP (or hypertext transfer protocol) defines how computers exchange HTML documents, and FTP (or file transfer protocol) defines how computers move files between local and remote file systems. These are the primary tools we will use today to get our data.
HTTP and FTP each define methods for a client to make requests of a server, and for the server to return a response. HTTP requests and responses usually have a header, which contains metadata about the request.
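You can see response headers for yourself with curl's `-I` flag, which sends a HEAD request and prints only the headers the server returns (the exact headers will vary by server):

foo@bar:~$ curl -I https://www.data.gov/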
APIs
Application programming interfaces (or APIs) are sets of rules for building application software. In the context of web services, an API usually refers to the rules for retrieving data from (and posting data to) a specific group of servers. Many government agencies' data APIs are geared towards people building web applications.
API documentation usually includes:
- how to format query strings
- what types/formats of data can be retrieved or posted with a request
- authentication procedures
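Putting those three pieces together, a documented API request often looks something like this. The endpoint, parameter names, and API-key header below are purely hypothetical, for illustration only:

foo@bar:~$ curl -H "X-Api-Key: YOUR_KEY" "https://api.example.gov/v1/records?state=CA&year=2005&format=json"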
What is Data.gov?
Data.gov is mostly a catalog of data sets collected by the agencies of the US Federal Government. Each entry includes information about the agency that collected the data, metadata, a landing page for the project, links to the web address where the data can be retrieved, the format of the data, and so on.
What Data.gov is not
Data.gov doesn’t host the data directly, and doesn’t have a unified API for accessing data from all government agencies. While Data.gov does have an API, what it returns is data about the data sets in the catalog. So you get metadata about metadata.
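For example, you can search the catalog from the command line. This is a minimal sketch, assuming the Data.gov catalog exposes the standard CKAN search endpoint (which it does at the time of writing); `q` is the search term and `rows` limits the number of results:

foo@bar:~$ curl "https://catalog.data.gov/api/3/action/package_search?q=precipitation&rows=5"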
Exercises
Getting NOAA precipitation data from an FTP server
The U.S. Hourly Precipitation data set is hosted on an FTP server and is well documented. Here you’ll find a web page for downloading data from specific date ranges and locations, but if you want to pull the files onto your own machine or server in bulk, you’ll (obviously) need to use FTP.
The PDF documentation describes the file naming scheme, and the readme.txt explains how to open a connection to the server and where to find the files.
Exercise: Get precipitation records from CA from 2000-2009
According to the docs (don’t run this before we discuss)
- Log into the FTP server
foo@bar:~$ ftp ftp.ncdc.noaa.gov
- Navigate to the correct directory
ftp> cd pub/data/hourly_precip-3240/04
- Use `get` to download one file, or `mget` to download multiple files
ftp> mget 3240_04_200*.tar.Z
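Those steps can also be wrapped into a single non-interactive session by feeding commands to ftp on stdin. A minimal sketch, assuming a classic BSD-style ftp client (`user`, `prompt`, and `mget` are standard commands; `prompt` turns off the per-file confirmation that `mget` normally asks for):

foo@bar:~$ ftp -n ftp.ncdc.noaa.gov <<'END'
user anonymous youremail@email.com
cd pub/data/hourly_precip-3240/04
prompt
mget 3240_04_200*.tar.Z
bye
END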
Just a note: when logging into a plain FTP server, your username and password are sent unencrypted. There are ways of doing FTP over SSH (SFTP) or with SSL/TLS (FTPS).
The safer way
`curl` has an option for using FTP over SSL. We should choose this instead, because it will protect the traffic.
- Navigate to your preferred directory
- Use the `--ftp-ssl` flag, the `--user` flag, and the `-o` option
foo@bar:~$ curl --ftp-ssl --user anonymous:youremail@email.com ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/04/3240_04_2000-2000.tar.Z -o ca_2000.tar.Z
The safer (recursive) way
`curl` doesn’t have a built-in method for easily getting multiple files. Write a shell script that will get all the CA precipitation data from 2000-2009. Alternatively, `wget` has a `-m` option for mirroring sites, which will let you download the entire contents of a directory (below it is combined with `-c` to resume partial downloads and `-nH` to avoid creating a directory named after the host).
foo@bar:~$ wget -mc -nH --ftps-implicit --no-ftps-resume-ssl --user=anonymous --password=youremail@email.com ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/04/
Bonus
- Write a script for downloading the files you want from the NOAA FTP server with `curl` (a minimal sketch appears below).
- FTP isn’t super great for transferring large files. How can you tell, from the command line, if the files downloaded by `curl` are identical to the ones you mirrored with `wget`?
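Here is one way to approach both bonus items. This is a minimal sketch, assuming the files follow the 3240_04_YYYY-YYYY.tar.Z naming scheme shown above; comparing checksums at the end tells you whether two downloads are byte-for-byte identical:

#!/bin/bash
# Sketch: fetch CA (state code 04) hourly precipitation data for
# 2000-2009 over FTP-SSL with curl.
base="ftp://ftp.ncdc.noaa.gov/pub/data/hourly_precip-3240/04"
mkdir -p ca_precip

for year in $(seq 2000 2009); do
    file="3240_04_${year}-${year}.tar.Z"
    curl --ftp-ssl --user anonymous:youremail@email.com \
         -o "ca_precip/${file}" "${base}/${file}"
done

# Bonus 2: identical files produce identical checksums. Adjust the
# second path to wherever your wget mirror actually landed.
cksum ca_precip/*.tar.Z pub/data/hourly_precip-3240/04/*.tar.Z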
Getting USGS earthquake data using an API
Skim the docs. Place a query to return GeoJSON records of earthquakes occurring 1) on your birthday, 2) in your favorite region of the world, and 3) with a magnitude > 2.5.
foo@bar:~$ curl -O "https://earthquake.usgs.gov/fdsnws/event/1/query.geojson?starttime=1991-09-21&endtime=1991-09-21&maxlatitude=43.373&minlatitude=25.542&maxlongitude=-101.25&minlongitude=-120.234&minmagnitude=2.5&orderby=time"
The Python urllib and requests libraries are great for formatting query strings and headers for more sophisticated endeavors than the exercise above. (But you can also do fancy things in Bash.)
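For instance, curl itself can assemble and URL-encode a query string for you: the `-G` flag appends the data to the URL as query parameters, and `--data-urlencode` handles the encoding. A sketch against the same USGS endpoint (substitute your own parameters):

foo@bar:~$ curl -G "https://earthquake.usgs.gov/fdsnws/event/1/query" \
      --data-urlencode "format=geojson" \
      --data-urlencode "starttime=1991-09-21" \
      --data-urlencode "endtime=1991-09-21" \
      --data-urlencode "minmagnitude=2.5" \
      -o birthday_quakes.geojson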
Mini-challenge!
(To be posted during the session)
Resources
Project Open Data
Project Open Data was an initiative created by the Obama Administration to promote accessibility and visibility of data sets collected and curated by the Federal government. The Project Open Data policy page is mostly geared towards government officials wanting to publish agency data, but also includes some resources for harvesting metadata, converting file types, etc.
There’s also a dashboard to check out how well each government agency is complying with the Project Open Data policies.
NASA
Fonts aside, NASA has their crap together.