Welcome to the Etherpad for the Data Carpentry Workshop at UCONN on March 7th to 8th.
You are here: https://public.etherpad-mozilla.org/p/2016-03-07-uconn
Please see the workshop repository at http://jrherr.github.io/2016-03-07-uconn
=========================================================================================================================================================
Instructors:
Kate Hertweck, Assistant Professor, University of Texas at Tyler, @k8hert
Josh Herr, Assistant Professor, University of Nebraska, @number_three
Attendees:
- R.C. Rizzitello Plant Science UConn
- Maria Coman University of Connecticut
- Rebecca Acabchuk, Physiology and Neurobiology UCONN
- Tracy Rittenhouse, Natural Resources & Environment
- Jenny Miglus
- Emil Coman, HDI-UConn Health,
- Meghan Bergin University of Massachusetts Amherst Libraries
- Lucy DeGozzaldi University of Massachusetts Amherst Libraries
- Dave Bretthauer, UConn Libraries Digital Scholarship
- Hillary Kenyon, Northeast Aquatic Research, UCONN grad
- Amanda Dick (MCB, Uconn)
- Natalia Vorotyntseva, CT State Data Center
- Sabina Perkins, Northeast Aquatic Research
- Wanli Xu School of Nursing
- Steve Batt, CT State Data Center
- Benjamin Gluck
- Gaurav Joshi, Pharmaceutical Sciences, UConn
- Hannah Kyer, Northeast Aquatic Research
- Kendra Maas, Uconn Biotech MARS facility
- Xiaomei Cong, UConn School of Nursing
- Joan Smyth, Pathobiology & Veterinary Science, UCONN
- Chuan-Jie Zhang, plant science department, Uconn
- Kay Dion, University of Massachusetts Amherst
Options for lunch
==============
=========================================================================================================================================================
Day 1
++++
Spreadsheets
==========
Data for spreadsheet: https://ndownloader.figshare.com/files/2252083
Link to lesson materials: http://www.datacarpentry.org/spreadsheet-ecology-lesson/
Spreadsheet notes:
- csv: comma separated values
- tsv: tab separated values
- Microsoft Excel files are not human readable outside of the program! tsv and csv files are
- our focus today is data wrangling (not analysis or building figures)
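To illustrate the csv/tsv note above, here is a minimal Python sketch (Python is not part of this lesson; the sample rows and the helper name read_table are made up for illustration). It shows that the same table stored as csv or tsv is plain text that any program can parse, and that only the separator differs.

```python
import csv
import io

# Hypothetical sample rows: the same small table written as csv and as tsv.
csv_text = "species,plot,weight\nDM,3,40\nPF,1,7\n"
tsv_text = "species\tplot\tweight\nDM\t3\t40\nPF\t1\t7\n"

def read_table(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

csv_rows = read_table(csv_text, ",")
tsv_rows = read_table(tsv_text, "\t")

# Both formats carry identical content once parsed.
print(csv_rows == tsv_rows)
```

Note that every value comes back as text; converting numbers is a separate, explicit step.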
Exercise: create a new tab that merges data from 2013 and 2014 tabs into one table
- do not modify original data (tabs 2013 and 2014)
- try to represent ALL DATA available
- only one type of data per column!
- Hints:
- How do you make columns from one table match columns in another table?
- Is there anything you can remove from a column?
- Are there any columns you need to add to both tables?
- Strategies to solve exercise:
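One possible strategy, sketched in Python rather than in the spreadsheet itself (all column names and values below are hypothetical, not taken from the workshop data): give both tables the same columns, add a year column, strip units out of cells, and build the merged table from copies so the originals stay untouched.

```python
# Hypothetical rows standing in for the 2013 and 2014 tabs.
rows_2013 = [
    {"Species": "DM", "Weight": "40g"},
    {"Species": "PF", "Weight": "7g"},
]
rows_2014 = [
    {"species": "DM", "weight_g": 38, "sex": "F"},
]

def tidy_2013(row):
    """Rename columns, strip the unit from weight, add the missing columns."""
    return {
        "year": 2013,
        "species": row["Species"],
        "weight_g": int(row["Weight"].rstrip("g")),  # one type of data per column
        "sex": None,  # not recorded in this tab; None marks missing data
    }

def tidy_2014(row):
    """Add the year column; the other columns already match the target layout."""
    return {
        "year": 2014,
        "species": row["species"],
        "weight_g": row["weight_g"],
        "sex": row["sex"],
    }

# New rows are built from the originals, which are never modified.
merged = [tidy_2013(r) for r in rows_2013] + [tidy_2014(r) for r in rows_2014]

for row in merged:
    print(row)
```

The unit moves into the column name (weight_g), which keeps the cells purely numeric.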
- Common mistakes to avoid when setting up spreadsheets:
- multiple tables in one spreadsheet
- splitting similar data across multiple tabs
- not filling in zeros
- using inappropriate null values (missing data)
- using formatting to convey information
- using formatting to make sheet look pretty
- adding comments/units in cells
- placing more than one piece of information in a cell
- special characters in column names
- special characters in data
- including metadata in data table (in
- date formatting (super tricky!)
- Dates in Excel
- extract month, day, year (where A3 is cell of date in question):
- =MONTH(A3)
- =DAY(A3)
- =YEAR(A3)
- year-month-day is the international standard date format, ISO 8601 (e.g., 2013-03-07)
- The short story of trying to do data analysis in Excel: https://pbs.twimg.com/media/BRjnZqeCIAEl84N.png
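For comparison with the Excel date functions above, a minimal Python sketch using only the standard library (the example date and the US-style input string are assumptions for illustration). It extracts month, day, and year, and converts a month/day/year string to the year-month-day format.

```python
from datetime import date, datetime

# Parse a date already written as year-month-day.
d = date.fromisoformat("2013-03-07")

# Equivalents of =MONTH(A3), =DAY(A3), =YEAR(A3).
month, day, year = d.month, d.day, d.year

# Convert a hypothetical US-style month/day/year string to year-month-day.
us_date = datetime.strptime("3/7/2013", "%m/%d/%Y").date()
iso_text = us_date.isoformat()

print(month, day, year)
print(iso_text)
```

Storing month, day, and year in separate columns, or as one year-month-day text value, avoids relying on how a particular program interprets an ambiguous date.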
- Quality assurance:
- can occur both before data entry starts, as well as after data entry (to clean)
- Beforehand: in Excel, Data -> Data validation
- Export data from Excel as .csv, so other programs can read and interpret data appropriately
- you can open and view your .csv data in a text editor (recommended text editors below, all free):
OpenRefine
=========
To use OpenRefine in your browser (once it is installed), start the program on your computer and then go to this address in your favorite web browser - http://127.0.0.1:3333/
Data for OpenRefine: https://www.dropbox.com/s/kbb4k00eanm19lg/Portalrodents19772002_scinameUUIDs.csv?dl=0 (click blue button to download)
Link to lesson materials: http://www.datacarpentry.org/OpenRefine-ecology/
Why use OpenRefine:
- manage large data files that may have many errors
- avoid introducing more errors
- have tracked list of changes made to the file
For anyone interested in working with OpenRefine more, these screencasts are very helpful! https://github.com/OpenRefine/OpenRefine/wiki/Screencasts
Intro to R
=======
=========================================================================================================================================================
Day 2
++++
Instructors:
Kate Hertweck, Assistant Professor, University of Texas at Tyler, @k8hert
Josh Herr, Assistant Professor, University of Nebraska, @number_three
Attendees:
- Steve Batt
- Sabina Perkins
- Amanda Dick
- Benjamin Gluck
- Hannah Kyer
- Xiaomei Cong
- Emil Coman
- Hillary Kenyon
- Jenny Miglus
- Chuan-Jie Zhang
- Dave Bretthauer
- Gaurav Joshi
- Becky Acabchuk
Data Analysis and Visualization in R
============================
Intro to SQL
==========