2016-03-07-uconn

Welcome to the Etherpad for the Data Carpentry Workshop at UConn on March 7th and 8th, 2016.

You are here: https://public.etherpad-mozilla.org/p/2016-03-07-uconn
Please see the workshop repository at http://jrherr.github.io/2016-03-07-uconn

=========================================================================================================================================================

Instructors:
Kate Hertweck, Assistant Professor, University of Texas at Tyler, @k8hert
Josh Herr, Assistant Professor, University of Nebraska, @number_three

Attendees:

R.C. Rizzitello, Plant Science, UConn
Maria Coman, University of Connecticut
Rebecca Acabchuk, Physiology and Neurobiology, UConn
Tracy Rittenhouse, Natural Resources & Environment
Jenny Miglus
Emil Coman, HDI-UConn Health
Meghan Bergin, University of Massachusetts Amherst Libraries
Lucy DeGozzaldi, University of Massachusetts Amherst Libraries
Dave Bretthauer, UConn Libraries Digital Scholarship
Hillary Kenyon, Northeast Aquatic Research, UCONN grad
Amanda Dick (MCB, Uconn)
Natalia Vorotyntseva, CT State Data Center
Sabina Perkins, Northeast Aquatic Research
Wanli Xu, School of Nursing
Steve Batt, CT State Data Center
Benjamin Gluck
Gaurav Joshi, Pharmaceutical Sciences, UConn
Hannah Kyer, Northeast Aquatic Research
Kendra Maas, Uconn Biotech MARS facility
Xiaomei Cong, UConn School of Nursing
Joan Smyth, Pathobiology & Veterinary Science, UCONN
Chuan-Jie Zhang, Plant Science Department, UConn
Kay Dion, University of Massachusetts Amherst

Options for lunch
==============

The Bookworms Café on the entrance floor of the library: http://nutritionanalysis.dds.uconn.edu/shortmenu.asp?sName=UCONN+Dining+Services&locationNum=26&locationName=UC+Cafes+-+Bookworms&naFlag=1
Lizzie’s Curbside food truck parked outside the south entrance to the Library: http://www.lizziescurbside.com/#!our-daily-specials/c1965
Food Court options at the Student Union, a short walk away: http://studentunion.uconn.edu/dining-retail/

=========================================================================================================================================================

Day 1
++++

Spreadsheets
==========
Data for spreadsheet: https://ndownloader.figshare.com/files/2252083
Link to lesson materials: http://www.datacarpentry.org/spreadsheet-ecology-lesson/

Spreadsheet notes:

csv: comma separated values
tsv: tab separated values
Microsoft Excel files are not human-readable outside of the program! tsv and csv files are.
our focus today is data wrangling (not analysis or building figures)
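
A quick sketch of why csv and tsv count as human-readable: both are plain text, and the only difference is the separator character. The sample rows and column names below are made up for illustration; Python's built-in csv module reads either format.

```python
import csv
import io

# Hypothetical sample data (not from the workshop file): same table, two separators.
csv_text = "species,weight\nDM,40\nPF,7\n"
tsv_text = "species\tweight\nDM\t40\nPF\t7\n"

def read_table(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

csv_rows = read_table(csv_text, ",")
tsv_rows = read_table(tsv_text, "\t")

# Both formats carry the same table once parsed.
print(csv_rows == tsv_rows)
print(csv_rows[0]["species"], csv_rows[0]["weight"])
```

Note that every value comes back as text; converting weight to a number is a separate, explicit step.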

Exercise: create a new tab that merges data from 2013 and 2014 tabs into one table

do not modify original data (tabs 2013 and 2014)
try to represent ALL DATA available
only one type of data per column!
Hints:

How do you make columns from one table match columns in another table?
Is there anything you can remove from a column?
Are there any columns you need to add to both tables?

Strategies to solve exercise:

match same columns from each table
leave missing data blank, or fill it with one consistent placeholder character
columns in final spreadsheet:

month, day, year, plot, sex, weight, species, calibration

can fill in missing data in Excel: http://www.accountingweb.com/technology/excel/three-ways-to-fill-blank-cells-within-excel-spreadsheets
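
The merge strategy above can also be sketched outside of Excel. This is only an illustration in Python: the rows are invented stand-ins for the 2013 and 2014 tabs, and the function name merge_tables is hypothetical. It copies rows into the shared column set, leaves missing values blank, and never changes the original tables.

```python
import csv
import io

# Final column set, taken from the notes above.
COLUMNS = ["month", "day", "year", "plot", "sex", "weight", "species", "calibration"]

# Hypothetical rows standing in for the two tabs (the real tabs differ).
table_2013 = [{"month": "7", "day": "16", "plot": "2", "sex": "F", "weight": "33", "species": "DM"}]
table_2014 = [{"month": "1", "day": "8", "plot": "1", "sex": "M", "weight": "40", "species": "DM", "calibration": "Y"}]

def merge_tables(tables_by_year):
    """Build new rows with identical columns; originals are left untouched."""
    merged = []
    for year, rows in tables_by_year.items():
        for row in rows:
            # Missing columns become blank cells rather than guessed values.
            new_row = {col: row.get(col, "") for col in COLUMNS}
            new_row["year"] = str(year)
            merged.append(new_row)
    return merged

merged = merge_tables({2013: table_2013, 2014: table_2014})

# Write the merged table as csv text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(merged)
print(out.getvalue())
```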

Common mistakes to avoid when setting up spreadsheets:

multiple tables in one spreadsheet
splitting similar data across multiple tabs
not filling in zeros
using inappropriate null values (missing data)
using formatting to convey information
using formatting to make sheet look pretty
adding comments/units in cells
placing more than one piece of information in a cell
special characters in column names
special characters in data
including metadata in data table (in
date formatting (super tricky!)

Dates in Excel

extract month, day, year (where A3 is cell of date in question):

=MONTH(A3)
=DAY(A3)
=YEAR(A3)

year-month-day is the international standard date format, ISO 8601 (e.g., 2013-03-07)
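
The same month/day/year split can be sketched in Python, as a rough counterpart to the Excel formulas above (the date is just the example from the notes; date.fromisoformat needs Python 3.7 or newer):

```python
from datetime import date

# Parse a year-month-day string into a date object.
d = date.fromisoformat("2013-03-07")

# Equivalent of =MONTH(A3), =DAY(A3), =YEAR(A3)
print(d.month, d.day, d.year)

# Rebuild the unambiguous year-month-day text from separate columns.
rebuilt = date(d.year, d.month, d.day).isoformat()
print(rebuilt)
```

Storing month, day, and year in separate columns avoids the ambiguity of text like 03/07/13, which different programs may read differently.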

The short story of trying to do data analysis in Excel: https://pbs.twimg.com/media/BRjnZqeCIAEl84N.png

Quality assurance:

can occur both before data entry starts and after data entry (to clean)
beforehand: in Excel, Data -> Data validation
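
For checking data after entry, the idea behind data validation can be sketched as a set of rules applied to each row. The rules below (allowed sex codes, positive numeric weight) and the function name validate_row are assumptions for illustration, not the rules used in the lesson.

```python
# Assumed rule: sex is M, F, or blank (blank meaning not recorded).
ALLOWED_SEX = {"M", "F", ""}

def validate_row(row):
    """Return a list of problems found in one row (an empty list means it passes)."""
    problems = []
    if row["sex"] not in ALLOWED_SEX:
        problems.append("sex must be M, F, or blank")
    if row["weight"] != "":
        try:
            weight = float(row["weight"])
        except ValueError:
            # Text such as "33g" mixes a unit into the cell.
            problems.append("weight must be a number")
        else:
            if weight <= 0:
                problems.append("weight must be positive")
    return problems

print(validate_row({"sex": "F", "weight": "33"}))
print(validate_row({"sex": "female", "weight": "33g"}))
```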

Export data from Excel as .csv, so other programs can read and interpret data appropriately

you can open and view your .csv data in a text editor (recommended text editors below, all free):

Mac: TextWrangler http://www.barebones.com/products/textwrangler/
Windows: Notepad++ https://notepad-plus-plus.org (this is different than the pre-installed Notepad program on your computer!)
I would also add Sublime Text https://www.sublimetext.com/3 (also great!)
Another free editor is Atom https://atom.io/

OpenRefine
=========

To view OpenRefine in your browser (once it is installed), start the program on your computer and then open this address in your favorite web browser: http://127.0.0.1:3333/

Data for OpenRefine: https://www.dropbox.com/s/kbb4k00eanm19lg/Portalrodents19772002_scinameUUIDs.csv?dl=0 (click blue button to download)
Link to lesson materials: http://www.datacarpentry.org/OpenRefine-ecology/

Why to use OpenRefine:

manage large data files that may have many errors
avoid introducing more errors
have tracked list of changes made to the file

For anyone interested in working with OpenRefine more, these screencasts are very helpful! https://github.com/OpenRefine/OpenRefine/wiki/Screencasts

Intro to R
=======

make sure you have both R and RStudio installed!
data URL for R: https://ndownloader.figshare.com/files/2292169
Socrative: http://www.socrative.com, room PAW5AYWM
Kate's R script: https://www.dropbox.com/s/h7ofv7maoundaim/data_carpentry.R.txt?dl=0

=========================================================================================================================================================

Day 2
++++

Attendees:

Steve Batt
Sabina Perkins
Amanda Dick
Benjamin Gluck
Hannah Kyer
Xiaomei Cong
Emil Coman
Hillary Kenyon
Jenny Miglus
Chuan-Jie Zhang
Dave Bretthauer
Gaurav Joshi
Becky Acabchuk

Data Analysis and Visualization in R
============================

Kate's R script: https://www.dropbox.com/s/a078ra2lv6pf0xu/data_carpentry_Day2.R.txt?dl=0
dplyr cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
referencing and customizing keyboard shortcuts in RStudio: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
difference between NA and NaN:

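NA: not available (no data, missing data)
NaN: not a number; the result of an undefined numeric operation (e.g., 0/0), so a function run on data with nothing left after removing missing values may return this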
more information: http://www.r-bloggers.com/difference-between-na-and-nan-in-r/
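
A rough analogy in Python, for anyone who also works there. This is not R code and the match is imperfect: here None plays the role of a missing value (like NA), while NaN is a real floating-point value produced by an undefined calculation. The weights are invented.

```python
import math

# None marks a measurement that was never recorded (loosely like R's NA).
weights = [40.0, None, 7.0]

# Drop missing values before computing, similar in spirit to na.rm = TRUE in R.
present = [w for w in weights if w is not None]
mean_present = sum(present) / len(present)
print(mean_present)

# NaN comes from an undefined numeric operation, such as infinity minus infinity.
undefined = float("inf") - float("inf")
print(math.isnan(undefined))

# NaN is never equal to anything, including itself, so test with math.isnan.
print(undefined == undefined)
```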

ggplot2 package manual: http://docs.ggplot2.org/current/

ggplot2 cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Another useful website for ggplot2, which has been expanded into a book: http://www.sthda.com/english/wiki/ggplot2-essentials
When adjusting ggplot2 themes, there's an R package called "cowplot" that is very helpful for making publication-quality graphics and for combining plots into one figure:

https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html

Cookbook for R: http://www.cookbook-r.com/ -- This website was expanded into a great book - R Graphics Cookbook: http://shop.oreilly.com/product/0636920023135.do
colors in ggplot2: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/

3D plotting (for Amanda)

Using RGL (which can be rotated or saved as video) http://www.sthda.com/english/wiki/a-complete-guide-to-3d-visualization-device-system-in-r-r-software-and-data-visualization
A simpler alternative using the scatterplot3d package: http://www.sthda.com/english/wiki/scatterplot3d-3d-graphics-r-software-and-data-visualization

creating fancy R documents with embedded graphics (using knitr package) https://onlinecourses.science.psu.edu/statprogram/markdown
searching for R information: http://rseek.org

Intro to SQL
==========

link to lesson material: http://www.datacarpentry.org/sql-ecology/
link to data: https://figshare.com/articles/Portal_Project_Teaching_Database/1314459 (download all)
SQL Server (for massive implementations): https://blogs.microsoft.com/blog/2016/03/07/announcing-sql-server-on-linux/ (Thanks, Emil!)
relational databases: data stored in tables (columns and rows)
cheatsheets:

http://www.sql-tutorial.net/sql-cheat-sheet.pdf
I like this tutorial. It's written for MySQL, but most of it is applicable: http://www.tizag.com/mysqlTutorial/
There is also an "interactive course" in SQL. Codeschool has a lot of courses like this (including R): https://www.codeschool.com/courses/try-sql
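
To make "data stored in tables (columns and rows)" concrete, here is a small sketch using SQLite through Python's built-in sqlite3 module. The table names, column names, and values are assumptions chosen for illustration; they are not the contents of the teaching database.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two related tables: each survey row points at a species by its ID.
conn.execute("CREATE TABLE species (species_id TEXT PRIMARY KEY, genus TEXT)")
conn.execute("CREATE TABLE surveys (record_id INTEGER PRIMARY KEY, species_id TEXT, weight REAL)")

conn.executemany("INSERT INTO species VALUES (?, ?)",
                 [("DM", "Dipodomys"), ("PF", "Perognathus")])
conn.executemany("INSERT INTO surveys VALUES (?, ?, ?)",
                 [(1, "DM", 40.0), (2, "DM", 44.0), (3, "PF", 7.0)])

# Join the tables on the shared column and average weight per genus.
rows = conn.execute("""
    SELECT species.genus, AVG(surveys.weight)
    FROM surveys
    JOIN species ON surveys.species_id = species.species_id
    GROUP BY species.genus
    ORDER BY species.genus
""").fetchall()

print(rows)
conn.close()
```

Keeping species details in their own table means each genus name is stored once, and the join brings the two tables back together only when a query needs both.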