- R: R is an open-source environment for data analysis.
- The RStudio IDE is highly recommended. The Revolution IDE is also very good, but it is available only on Linux and Windows.
- R Task Views: The Machine Learning and Optimization Task Views list useful R packages that we may use.
- swirl is an interactive R (and general data analysis) tutorial.
- R/Matlab references: A short R guide for Matlab users. A longer one.
- R/Python references: A short R guide for Python users.
We will use Rocker, a project built on top of Docker, to manage our software installation. This will provide you with a working installation of R and the RStudio IDE, along with many packages that provide the analysis tools and datasets we will use throughout the semester. You can read more about Rocker in this introduction.
We will be working from a Docker image that extends a Rocker
container. To use it, run the following command:
docker run -d -p 8787:8787 -v <local_path>:/home/rstudio hcorrada/introdatascidocker
<local_path> is a path on your local machine that
will be mapped to the home directory (/home/rstudio) in the running Docker container.
You can find more information about this Docker container here.
I will announce updates to the image on Piazza.
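As a concrete illustration of the command above, here is a minimal shell sketch. The directory name introdatasci is a hypothetical placeholder, not something the course prescribes; substitute any directory you want to see from inside the container. The sketch only assembles and prints the command, so you can check it before running it yourself.

```shell
# Hypothetical working directory (an assumption for illustration); use any path you like.
local_path="$PWD/introdatasci"

# Make sure the directory exists before handing it to Docker as a volume.
mkdir -p "$local_path"

# Assemble the course command with <local_path> filled in.
#   -d  runs the container in the background
#   -p  publishes container port 8787 on host port 8787
#   -v  maps the local directory to /home/rstudio inside the container
cmd="docker run -d -p 8787:8787 -v ${local_path}:/home/rstudio hcorrada/introdatascidocker"

# Print the command for inspection; paste it into your terminal to start the container.
echo "$cmd"
```

Because of the -p 8787:8787 option, RStudio should then be reachable in a browser on port 8787 of the machine running Docker (for example http://localhost:8787 when Docker runs natively on your machine; the exact address depends on your Docker setup).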
The Rocker container you will install provides access to a large number of data repositories. We will list additional ones here as we go along.
- Kaggle: a site hosting data competitions. It’s a great source of datasets, questions, and tutorials.
- data.gov: The U.S. government’s open data portal.
- Global Health Observatory: The World Health Organization’s data repository.
- UCI Machine Learning Repository: contains many datasets useful for testing and benchmarking learning algorithms.
- StatLib: Statistical software and dataset portal maintained by CMU.
- Another large list of repositories: http://www.inside-r.org/howto/finding-data-internet