r_tutorial_ed/TutorialX_StatisticsRefresher.Rmd at master · Eemaa26/r_tutorial_ed

355 lines (273 loc) · 15.4 KB
% Basic Statistical Concepts
% R Bootcamp HTML Slides
% Jared Knowles
```{r loading, include=FALSE}
library(ggplot2)
library(eeptools)
opts_chunk$set(fig.path='figure/slidesX-', cache.path='cache/slidesX-',fig.width=12,fig.height=9,message=FALSE,error=FALSE,warning=FALSE,echo=TRUE,size='tiny',dev='png',out.width='650px',out.height='370px')
# Introduction
We will review the following statistical concepts here through the lens of R:
- Probability
- Statistical distributions
- Attributes of distributions
- Statistical models
- Univariate Regression
- Hypothesis testing
- Multivariate Regression
- Causality
# What is statistics?
- What do you think?
- From [Wikipedia](http://en.wikipedia.org/wiki/Statistics): 
> "Statistics is the study of the collection, organization, analysis, interpretation, and presentation of data. It deals with all aspects of this, including the planning of data collection "
- Statistics gives us a way to quantify and express our uncertainty about the future or about the relationship of a sample to the population
### Key Concepts to Statistical Thinking
1. Probability
2. How to describe data
3. Drawing inferences from data
4. Causation
# Probability
- Probability rests on the concepts of randomness
- Randomness is an abstract principle that describes the behavior of events in the world
- Random phenomena have outcomes we cannot perfectly predict, but that have a regular distribution over many repetitions
- Probability describes the the proportion of times an event will occur in many repeated trials of observation
- The probability of the Packers winning the coin toss is 0.5
- Let's look at an example of coin tosses!
# Probability's uses
- For truly random independent events like coin flips, this is easy and uncertainty can become certainty
- But, what about events that depend on one another?
- What does it mean then when someone says "Kobe Bryant's 3 has a 40% chance of going in."
- Or "Mason Crosby has a 65% chance of making that field goal?"
- These aren't random events, are they?
- But we can model them as such with large enough sample size
# Interdependence and Independence
- Coins are obviously independent in their flipping (though we can think of conditions where this is not true)
- Other events have probability that varies based on other factors - joint probabilities
- Let's look at a simple example of this
- [Coin Flip Demo](http://glimmer.rstudio.com/wisconsindpi/coin/)
# Describing Data
- At it's most basic level, statistics is about summarizing and understanding data
- Data are themselves abstractions of real world concepts we care about
- There are a number of useful ways to describe a set of data, some of which we have become familiar with so far
# Spread of the Data
- Data spread describes how scattered a data set is
- One type of data, categorical data, describes groups
```{r barplot,echo=FALSE}
qplot(factor(cyl),data=mtcars)+
  labs(x='cylinder',title='Car models by Cylinder Count')+theme_dpi()
- What can we learn here?
# Let's try another
- What can we learn from this chart type about the data?
```{r diamondplot,echo=FALSE}
data(diamonds)
qplot(factor(cut),data=diamonds)+labs(x='Cut',title='Diamonds by Cut Quality')+theme_dpi()
- What else might we want to learn?
# How about?
```{r diamondplot2,echo=FALSE}
qplot(factor(color),data=diamonds)+
  labs(x='Color',title='Diamonds by Color and Clarity')+
  facet_wrap(~clarity,nrow=2)+theme_dpi()
# Still more
- With diamonds we immediately want to look at price
```{r diamondplot3,echo=FALSE}
qplot(carat,price,data=diamonds,color=color)+geom_smooth(aes(group=1))+theme_dpi()
- What do you see?
- Outliers?
- Data modes?
- Clusters?
# Graphical Depictions of Data
- These are ways to show data with graphics
- Graphical displays are driven by the concept of dimensions
- One dimension--a single category
- Two dimensions--two categories
# Levels of Measurement
- Any given dimension may be measured at different [levels of measure](http://en.wikipedia.org/wiki/Level_of_measurement)
- **Nominal:** unordered categories of data
- **Ordinal:** ordered categories of data, relative size and degree of difference between categories is unknown
- **Interval:** ordered categories of data, fixed width, like discrete temperature scales
- **Continuous (ratio):** a measurement scale in a continuous space with a meaningful zero--physical measurements
- This classification was derived by Stanley Smith Stevens in the 1940s and 50s
- Color (of diamonds) is what level of measurement?
- **Nominal**
- Carats are what level of measurement?
- **Continuous**
# Levels of measurement matter
- How you depict the data
- What you can calculate using the data
# Describing Data with Numbers
- What types of measures can we use to describe different levels of measurement?
Level of Meas.   Stats
--------------- -----------------------------------------------
Nominal           mode, Chi-squared
Ordinal           median, percentile
Interval          mean, std. deviation, correlation, ANOVA
Continuous        geometric mean, harmonic mean, logarithms
# Let's talk about these statistics
- **STATISTIC:** a single measure of some attribute of a sample (e.g. its arithmetic mean value). It is calculated by applying a function (statistical algorithm) to the values of the items comprising the sample which are known together as a set of data. (Wikipedia)[http://en.wikipedia.org/wiki/Statistic]
- These statistics can measure a number of features of a dataset, but we tend to think of them as measuring either **central tendency**, **spread**, or **association**
- We'll focus on these today.
# Measures of Central Tendency
- These are the three canonical measures of central tendency:
- How are these different? What properties do they have? Why does this matter?
```{r centraltend,echo=FALSE}
qplot(hwy,data=mpg,geom='density')+geom_vline(xintercept=median(mpg$hwy),color=I("blue"),size=I(1.1))+
  geom_vline(xintercept=mean(mpg$hwy),color=I("gold"),size=I(1.1))+
  geom_vline(xintercept=26,color=I("orange"),size=I(1.1))+
  geom_text(aes(x=median(mpg$hwy)+1.5,y=0.08,label="Median"),size=I(4.5))+
  geom_text(aes(x=mean(mpg$hwy)-1.5,y=0.06,label="Mean"),size=I(4.5))+
  geom_text(aes(x=26+1.5,y=0.05,label="Mode"),size=I(4.5))+theme_dpi()
```{r mpgtable,results='asis'}
library(xtable)
print(xtable(table(mpg$hwy)),type="html")
- Commonly thought of as the average
- Add up all the values, divide by the number of observations
- 2 + 6 + 3 + 9 + 1 = 21 
- Divide by 5 
- The mean is sensitive to the number of observations and the spread of the data
# The Median
- This is a measure of the value the middle observation takes
- Order the data: 1 2 3 6 9 ; get the length (5)
- Count (length / 2) + 1 in this case, the 3rd observation = 3
- What if we add another number?
- Count (length /2) from each side and average the two observations that remain
- 3 + 6 = 9
- 9/2 = 4.5
- The median is sensitive to the spread of the data
- The most common observation 
- This can be problematic since skewed distributions can have a mode that diverges far from the mean or median
- The mode is useful in cases where you are looking for data errors (the most common value) and is also used a lot in fitting statistical models
# Measures of spread
- It is handy to describe the central tendency of the data, but we also need a sense of how much the data resembles the central tendency
- Measures of spread help us achieve this
# Quantiles
- Quantiles are used to divide the data into an even number of observations per "bin"
- An example is percentiles where data is divided evently into 100 groups, or quartiles where the data is divided into 4 groups
- This is handy to look at the way the distribution of the data behaves
# Standard Deviation and Variance
- When we have multiple observations of data we are interested in how spread apart these values can be
- The standard way to measure this is to use the variance and standard deviation (which is the square root of the variance)
- Variance measures how far away from the mean the data can be and can be easily calculated by subtracting each element of data from the mean, squaring that difference, adding them together, and dividing by the number of elements
- Skew describes the symmetry of the distribution
- Is it balanced? Are there clusters?
- Skew is in reference to a norm - most often the normal distribution, but not always
# Distributions of Data
- All of the above are the concepts we use to describe a univariate group of data
- Statisticians use the idea of a distribution to summarize how data generated through some process looks
- Think about the difference between a coin flip and a test score
- Both are generated by a stochastic process
- Both may generate the same amount of numbers
- But, we have very different expectations about what values those numbers take in the real world
- A **distribution** describes our expectation not just about the mean of the data (the average value), but also of how spread out it is likely to be
# Pictures of distributions
- Let's look at some distributions of data and talk about what they might represent
- Different distributions describe different data generating processes, and these processes can be used to represent a large array of data types
- The bell curve
```{r bellcurve}
qplot(rnorm(3000),geom='density',adjust=2)+theme_dpi()
```{r uniform}
qplot(runif(100000,min=-5,max=5),geom='bar')+theme_dpi()
```{r poisson}
qplot(rpois(3000,lambda=3),geom='density',adjust=2)+theme_dpi()
```{r binom}
qplot(rbinom(3000,1,.5),geom='bar')+theme_dpi()
```{r weibull}
qplot(rweibull(3000,shape=18,scale=1),geom='density')+theme_dpi()
# Distribution Demo
- [Drawing Distributions](http://glimmer.rstudio.com/wisconsindpi/distribution/)
- These parameters define distributions (there are others as well), but you can see what a distribution does--it summarizes data
- Much of traditional statistics is based on the idea of samples
- Sampling is the process of picking observations out of a population in a way that makes the smaller set of observations reflective of the population
- How we sample determines what conclusions we can draw in the population from our data
- The size of the sample relative to the population we are making inference about determines our confidence / precision
- Let's look at a demo at some different types of sampling-
- [Sampling Demo](http://glimmer.rstudio.com/wisconsindpi/sampling/)
# Measures of association
- Correlation is not causation
- Correlation is a measure of the dependence between two variables and there are a number of different versions of measuring correlation
- The most common is [Pearson's *r*](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) which measures linear dependence on a scale of -1 to 1
- Rank correlation is an option as well using [Spearman's rank correlation coefficient](http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) or [Kendall's tau rank correlation coefficient](http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient)
# Let's look at Pearson's Coefficient
- [Visualizing Correlations](http://glimmer.rstudio.com/wisconsindpi/correlation/)
# We can identify other types of correlation though:
```{r sophcorr,echo=FALSE}
mydat<-data.frame(x=runif(600),y=rnorm(600),z=sample(c("A","B","C"),600,replace=TRUE))
mydat$y[mydat$z=="C"]<-mydat$y[mydat$z=="C"]+2
mydat$y[mydat$z=="B"]<-mydat$x[mydat$z=="B"] * 
  runif(length(mydat$x[mydat$z=="B"]),1,3)+rnorm(length(mydat$x[mydat$z=="B"])) 
qplot(x,y,data=mydat)+facet_wrap(~z)+theme_dpi()+geom_smooth(se=FALSE,method='lm')
# Regression and Statistical Models
- Statistical models are mathematical representations of real world phenomena
- We use statistical models to summarize and describe the relationships between pieces of data
- Let's look at the simplest version of regression model and the basis of many statistical techniques: Ordinary Least Squares regression (OLS)
# Hypothesis testing
- What does it mean when someone says something is statistically significant?
- "A significance is a formal procedure for comparing observed data with a hypothesis whose truth we want to assess."~ Moore and McCabe p. 436
- Hypothesis testing is based on the idea of sampling and centers around the question of "Do we have enough data to believe the relationship we are seeing in our sample is true in the population?"
- In a hypothesis test we specify two hypotheses--the null hypothesis and the alternative hypothesis
- A test of significance then assesses the evidence against the null hypothesis in terms of probability
# Most Common Applications
- The most common application of these is when you compare a sample to a population or a true fixed value
- A classic example is fluctuation in weight, for an athlete for example
```{r classicweight,echo=TRUE}
wt<-c(190.5,189,195.5,187,191,190.4,186,183,193,188)
t.test(wt,mu=187,alternative="two.sided")
# Questions
```{r weightrepeat,echo=FALSE}
t.test(wt,mu=187,alternative="two.sided")
- the P value reflects our belief about the probability of the test statistic taking a value as extreme or more extreme than what we observe
- This translates into our belief about "extremeness" of the value of interest, the average weight, in relation to the comparison group, in this case **187**
- Traditionally, researchers set a fixed value of *p* for comparison before conducting the test, usually **.05**, sometimes **.10** and sometimes **.01**
- If our *p* is **less than** the pre-specified *p value* we set, then we say the observed value is **statistically significant**
# Let's look at an example
# Some practical advice about statistical signficance
- Statistical significance is very different from substantive significance. Variables can be statistically significant but have no observable difference. For example, with enough precise measurements we could demonstrate a statistically significant difference in weight of 0.2 pounds, but in most cases this is substantively meaningless.
- Statistical significance tells us nothing about the true value in the population except that it is **not** equal to the assumption of the null hypothesis. We are not testing what range of values our measurement could take in the real world, only if it is plausibly different from another set of values.
- Statistical significance is only valid under assumptions about the data. What might be a few ways that the data in our weight example could invalidate the usefulness of a significance test?
# Session Info
It is good to include the session info, e.g. this document is produced with **knitr** version `r packageVersion('knitr')`. Here is my session info:
```{r session-info}
print(sessionInfo(), locale=FALSE)
# Attribution and License
<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license" href="http://creativecommons.org/publicdomain/mark/1.0/">
<img src="http://i.creativecommons.org/p/mark/1.0/88x31.png"
     style="border-style: none;" alt="Public Domain Mark" />
This work (<span property="dct:title">R Tutorial for Education</span>, by <a href="www.jaredknowles.com" rel="dct:creator"><span property="dct:title">Jared E. Knowles</span></a>), in service of the <a href="http://www.dpi.wi.gov" rel="dct:publisher"><span property="dct:title">Wisconsin Department of Public Instruction</span></a>, is free of known copyright restrictions.
Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

TutorialX_StatisticsRefresher.Rmd

Latest commit

History

TutorialX_StatisticsRefresher.Rmd

File metadata and controls