class: center, middle, inverse, title-slide # Inference: One and Two-Samples ## Discrimination and Concrete ### Robert W. Walker ### 2020-10-13 --- # How do we learn from data? -- .center[] -- ### It is probably better posed as, what can we learn from data? --- # The Big Idea ## **Inference** In the most basic form, learning something about data that we do not have from data that we do. In the jargon of statistics, we characterize the probability distribution of some feature of the population of interest from a sample drawn randomly from that population. -- ### Quantities of Interest: **Single Proportion** *The probability of some qualitative outcome with appropriate uncertainty.<sup>1</sup>* **Single Mean** *The average of some quantitative outcome with appropriate uncertainty.<sup>1</sup>* .footnote[ <sup>1</sup> Which will itself be a probability measure. ] --- # The Big Idea ## **Inference** In the most basic form, learning something about data that we do not have from data that we do. In the jargon of statistics, we characterize the probability distribution of some feature of the population of interest from a sample drawn randomly from that population. ### Quantities of Interest: **Compare Proportions** *Compare key probabilities across two groups with appropriate uncertainty.<sup>1</sup>* **Compare Means** *Compare averages/means across two groups with appropriate uncertainty.<sup>1</sup>* .footnote[ <sup>1</sup> Which will itself be a probability measure. ] --- # The Big Idea ## **Inference** In the most basic form, learning something about data that we do not have from data that we do. In the jargon of statistics, we characterize the probability distribution of some feature of the population of interest from a sample drawn randomly from that population. ### Study Planning *Plan studies of adequate size to assess key probabilities of interest.* --- # Inference for Data Of two forms: 1. Binary/Qualitative -- 2. Quantitative -- We will first focus first on the former. But before we do, one new concept. --- ## The **standard error** It is the *degree of variability of a statistic* just as the standard deviation is the *variability in data*. - The standard error of a proportion is `\(\sqrt{\frac{\pi(1-\pi)}{n}}\)` while the standard deviation of a binomial sample would be `\(\sqrt{n*\pi(1-\pi)}\)`. - The standard deviation of a sample is `\(s\)` while the standard error of the mean is `\(\frac{s}{\sqrt{n}}\)`. The average has less variability than the data and it shrinks as n increases. --- ## Let's think about an election.... We are in the election season. The key to modeling the winner of a presidential election is the electoral college in the United States. In almost all cases, this involves a series of binomial variables. In particular, almost all states award electors for the statewide winner of the election and DC has electoral votes. We have polls that give us likely vote percentages at the state level. Call it `p.win` for the probability of winning. We can then deploy `rbinom(1000, size=1, p.win)` to derive a hypothetical election. Rinse and repeat to calculate a hypothetical electoral college winner. But how do we calculate `p-win`? ** We infer it from polling data ** --- ## Qualitative: The Binomial Is entirely defined by two parameters, `\(n\)` -- the number of subjects -- and `\(\pi\)` -- the probability of a positive response. The probability of exactly `\(x\)` positive responses is given by: $$ Pr(X = x | \pi, n) = {n \choose x} \pi^x (1-\pi)^{n-x} $$ The binomial is the `canonical` distribution for binary outcomes. *Assuming all `\(n\)` subjects are alike and the outcome occurs with `\(\pi\)` probability*,<sup>1</sup> then we must have a sample from a binomial. *A binomial with `\(n=1\)` is known as a Bernoulli trial* Two key features: 1. Expectation: `\(n \cdot \pi\)` or number of subjects times probability, `\(\pi \textrm{ if } n=1\)`. 2. Variance is `\(n \cdot \pi \cdot (1 - \pi)\)` or `\(\pi(1-\pi) \textrm{ if } n=1\)` and standard deviation is `\(\sqrt{n \cdot \pi \cdot (1 - \pi)}\)` or `\(\sqrt{\pi(1-\pi)} \textrm{ if } n=1\)`. -- Now let's grab some data and frame a question. ```r load(url("https://github.com/robertwwalker/DADMStuff/raw/master/Big-Week-6-RSpace.RData")) ``` .footnote[<sup>1</sup> **If this does not hold, we will seek the homogenous groups in which it does.**] --- ## Berkeley Admissions What is the probability of being admitted to UC Berkeley Graduate School?<sup>2</sup> -- I have three variables: Admitted or not, Gender, and Department applied to. <span style="color: yellow;">Admission is the key.</span> ```r *library(janitor) UCBTab <- data.frame(UCBAdmissions) %>% reshape::untable(., .$Freq) %>% select(-Freq) UCBTab %>% tabyl(Admit) # Could also use UCBTab %$% table(Admit) ``` ``` ## Admit n percent ## Admitted 1755 0.3877596 ## Rejected 2771 0.6122404 ``` The proportion is denoted as `\(\hat{p}\)`. We can also do this with *table*. .footnote[<sup>2</sup> **We probably think this depends on a bunch of applicant specific features but we will put that to the side for now.**] --- ```r UCBTab %>% DT::datatable(.,fillContainer = FALSE, options = list(pageLength = 8), rownames=FALSE) ```
--- ## A Visual ```r UCBTab %>% ggplot(.) + aes(x = Gender, fill = Admit) + geom_bar(color="darkgreen") + theme_xaringan() + scale_xaringan_fill_discrete() ``` <img src="index_files/figure-html/unnamed-chunk-5-1.jpeg" width="504" style="display: block; margin: auto;" /> --- # The Idea Suppose we want to know `\(\pi(Admit)\)` -- the probability of being admitted -- with 90% probability. We want to take the data that we saw and use it to **infer** the likely values of `\(\pi(Admit)\)`. -- ### Three interchangeable methods 1. Resampling/Simulation -- 2. Exact binomial computation -- 3. A normal approximation --- ## Method 1: Resampling **Suppose I have 4526 chips with 1755 <span style="color: green;">green</span> and 2771 <span style="color: red;">red</span>.** -- I toss them all on the floor -- and pick them up, one at a time, -- record the value (<span style="color: green;">green</span>/<span style="color: red;">red</span>), -- put the chip back, *[NB: I put it back to avoid getting exactly the same `sample` every time.]* -- and repeat 4526 times. -- **Each count of <span style="color: green;">green chips</span> as a proportion of 4526 total chips constitutes an estimate of the probability of Admit.** -- I wrote a little program to do just this -- `ResampleProps`. ``` remotes::install_github("robertwwalker/ResampleProps") ``` --- class: inverse # Resampling Result ```r library(ResampleProps) RSMP <- ResampleProp(UCBTab$Admit, k = 10000, tab.col = 1) %>% data.frame(Pi.Admit=., Gender=as.character("All"), frameN = 1) quantile(RSMP$Pi.Admit, probs = c(0.05,0.95)) ``` ``` ## 5% 95% ## 0.3758285 0.3996907 ``` What is our estimate of `\(\pi\)` with 90% confidence? *The probability of admission ranges from 0.376 to 0.4.* --- # A Plot ```r library(ggthemes) RSMP %>% ggplot(., aes(x=Pi.Admit)) + geom_density(outline.type = "upper", color="maroon") +labs(x=expression(pi)) + theme_minimal() + theme_xaringan() ``` <img src="index_files/figure-html/unnamed-chunk-7-1.jpeg" width="504" style="display: block; margin: auto;" /> --- ## Another Way: Exact binomial That last procedure is correct but it is overkill. 1. With probability of 0.05, how small could `\(\pi\)` be to have gotten 1755 of 4526 or more? 2. With probability 0.95, how big could `\(\pi\)` be to have gotten fewer than 1755 of 4526? ```r *UCBTab %$% table(Admit) ``` ``` ## Admit ## Admitted Rejected ## 1755 2771 ``` -- **`binom.test` does exactly this.** --- class: inverse ## binom.test() ```r UCBTab %$% table(Admit) %>% binom.test(conf.level = 0.9) ``` ``` ## ## Exact binomial test ## ## data: . ## number of successes = 1755, number of trials = 4526, p-value < 2.2e-16 ## alternative hypothesis: true probability of success is not equal to 0.5 ## 90 percent confidence interval: ## 0.3757924 0.3998340 ## sample estimates: ## probability of success ## 0.3877596 ``` ```r Plot.Me <- binom.test(1755, 1755+2771, conf.level=0.9)$conf.int %>% data.frame() ``` **With 90% probability, now often referred to as 90% confidence to avoid using the word probability twice, the probability of being admitted ranges between 0.3758 and 0.3998.** --- class: inverse ## Illustrating the Binomial ```r UCBTab %$% table(Admit) %>% binom.test(conf.level = 0.9) ``` ``` ## ## Exact binomial test ## ## data: . ## number of successes = 1755, number of trials = 4526, p-value < 2.2e-16 ## alternative hypothesis: true probability of success is not equal to 0.5 ## 90 percent confidence interval: ## 0.3757924 0.3998340 ## sample estimates: ## probability of success ## 0.3877596 ``` ```r c(pbinom(1755, 1755+2771, 0.3998340), pbinom(1754, 1755+2771, 0.3757924)) ``` ``` ## [1] 0.05000037 0.95000004 ``` --- ```r Binomial.Search <- data.frame(x=seq(0.33,0.43, by=0.001)) %>% mutate(Too.Low = pbinom(1755, 1755+2771, x), Too.High = 1-pbinom(1754, 1755+2771, x)) Binomial.Search %>% pivot_longer(cols=c(Too.Low,Too.High)) %>% ggplot(., aes(x=x, y=value, color=name)) + geom_line() + geom_hline(aes(yintercept=0.05)) + geom_hline(aes(yintercept=0.95)) + geom_vline(data=Plot.Me, aes(xintercept=.), linetype=3) + labs(title="Using the Binomial to Search", color="Greater/Lesser", x=expression(pi), y="Probability") + theme_xaringan() + scale_xaringan_color_discrete() ``` <img src="index_files/figure-html/unnamed-chunk-11-1.jpeg" width="720" style="display: block; margin: auto;" /> --- ## A Third Way: The normal approximation As long as `\(n\)` and `\(\pi\)` are sufficiently large, we can approximate this with a normal distribution. **This will also prove handy for a related reason.** As long as `\(n*\pi\)` > 10, we can write that a standard normal `\(z\)` describes the distribution of `\(\pi\)`, given `\(n\)` -- the sample size and `\(\hat{p}\)` -- the proportion of yes's/true's in the sample. $$ Pr(\pi) = \hat{p} \pm z \cdot \left( \sqrt{\frac{\hat{p}*(1-\hat{p})}{n}} \right) $$ `\(R\)` implements this in `prop.test`. By default, R implements a modified version that corrects for discreteness/continuity. To get the above formula exactly, `prop.test(table, correct=FALSE)`. --- class: inverse ### A first look at hypotheses $$ z = \frac{\hat{p} - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} $$ With some proportion calculated from data `\(\hat{p}\)` and some claim about `\(\pi\)` -- hereafter called an hypothesis -- we can use `\(z\)` to calculate/evaluate the claim. These claims are *hypothetical* values of `\(\pi\)`. They can be evaluated against a specific alternative considered in three ways: - two-sided, that is `\(\pi = value\)` against not equal. - `\(\pi \geq value\)` against something smaller. - `\(\pi \leq value\)` against something greater. In the first case, either much bigger or much smaller values could be evidence against equal. The second and third cases are always about smaller and bigger; they are one-sided. Just as `\(z\)` has metric standard deviations now referred to as standard errors of the proportion, the z here will measure where the data fall with respect to the hypothetical `\(\pi\)`. *With z sufficiently large, and in the proper direction, our hypothetical `\(\pi\)` becomes unlikely given the evidence and should be dismissed.* --- class: inverse ## Three hypothesis tests ```r ( Z.ALL <- (0.3877596 - 0.5)/(sqrt(0.5^2 / 4526)) ) # Calculate z ``` ``` ## [1] -15.10207 ``` *The estimate is 15.1 standard errors below the hypothesized value.* .pull-left[ ## Two Sided `\(\pi = 0.5\)` against `\(\pi \neq 0.5\)`. ```r pnorm(-abs(Z.ALL)) + (1-pnorm(abs(Z.ALL))) ``` ``` ## [1] 7.846435e-52 ``` Any `\(|z| > 1.64\)` -- probability 0.05 above and below -- is sufficient to dispense with the hypothesis. ] .pull-right[ ## One-Sided `\(\pi \leq 0.5\)` against `\(\pi \gt 0.5\)`. With 90% confidence, `\(z > 1.28\)` is sufficient. `\(\pi \geq 0.5\)` against `\(\pi \lt 0.5\)`. With 90% confidence, `\(z < -1.28\)` is sufficient. ] --- class: inverse # The normal approximation ```r prop.test(table(UCBTab$Admit), conf.level = 0.9, correct=FALSE) ``` ``` ## ## 1-sample proportions test without continuity correction ## ## data: table(UCBTab$Admit), null probability 0.5 ## X-squared = 228.07, df = 1, p-value < 2.2e-16 ## alternative hypothesis: true p is not equal to 0.5 ## 90 percent confidence interval: ## 0.3759173 0.3997361 ## sample estimates: ## p ## 0.3877596 ``` `\(R\)` reports the result as `\(z^2\)` not `\(z\)` which is `\(\chi^2\)` not normal; we can obtain the `\(z\)` by taking a square root: $$\sqrt{228.08} \approx 15.1 $$ This approximation yields an estimate of `\(\pi\)`, with 90% confidence, that ranges between 0.376 and 0.4. --- # A Cool Consequence of Normal The formal equation defines: $$ z = \frac{\hat{p} - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} $$ -- Some language: 1. Margin of error is `\(\hat{p} - \pi\)`. [MOE] 2. Confidence: the probability coverage given z [two-sided]. 3. We need a guess at the true `\(\pi\)` because variance/std. dev. depend on `\(\pi\)`. 0.5 is common because it maximizes the variance; we will have enough no matter what the true value. -- Algebra allows us to solve for `\(n\)`. $$ n = \frac{z^2 \cdot \pi(1-\pi)}{(\hat{p} - \pi)^2} $$ --- class:inverse ### Example Estimate the probability of supporting an initiative to within 0.03 with 95% confidence. Assume that the probability is 0.5 [it maximizes the variance and renders a conservative estimate -- an upper bound on the sample size] ```r Sample.Size <- function(MOE, B.pi=0.5, Conf.lev=0.95) { My.n <- ((qnorm((1-Conf.lev)/2)^2)*(B.pi*(1-B.pi))) / MOE^2 Necessary.sample <- ceiling(My.n) return(Necessary.sample) } Sample.Size(MOE=0.03, B.pi=0.5, Conf.lev = 0.95) ``` ``` ## [1] 1068 ``` **I need 1068 people to estimate support to within plus or minus 0.03 with 95% confidence.** --- class: inverse ### In the Real World <img src="./y1.png" width="900px" height="300px" /> [A real poll](https://navigatorresearch.org/wp-content/uploads/2020/04/Navigator-Daily-Tracker-Topline-F04.13.20.pdf). They do not have quite enough for a 3% margin of error. But 1006 is enough for a 3.1 percent margin of error... ```r Sample.Size(MOE=0.031, B.pi=0.5, Conf.lev = 0.95) ``` ``` ## [1] 1000 ``` --- # One More Oddity What is our estimate of `\(\pi\)` with 90% confidence? **The probability of admission ranges from 0.3758285 to 0.3996907.** If we wish to express this using the normal approximation, a central interval is the observed proportion plus/minus `\(z\)` times the *standard error of the proportion* -- $$ SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$ NB: The sample value replaces our assumed true value because `\(\pi\)` is to be solved for. For 90%, the probability of admissions ranges from ```r 0.3877596 + qnorm(c(0.05,0.95))*(sqrt(0.3877596*(1-0.3877596)/4526)) ``` ``` ## [1] 0.3758468 0.3996724 ``` --- # Back to the Story **Does the probability of admission depend on whether the applicant is Male or Female?** --- # Analysis by Group ```r library(janitor); UCBTab %>% tabyl(Gender,Admit) %>% adorn_totals("col") ``` ``` ## Gender Admitted Rejected Total ## Male 1198 1493 2691 ## Female 557 1278 1835 ``` ```r UCBTab %>% tabyl(Gender,Admit) %>% adorn_percentages("row") ``` ``` ## Gender Admitted Rejected ## Male 0.4451877 0.5548123 ## Female 0.3035422 0.6964578 ``` ```r UCBTF <- UCBTab %>% filter(Gender=="Female") UCBTF.Pi <- ResampleProp(UCBTF$Admit, k = 10000) %>% data.frame(Pi.Admit=., Gender=as.character("Female"), frameN=2) # Estimates for females UCBTM <- UCBTab %>% filter(Gender=="Male") UCBTM.Pi <- ResampleProp(UCBTM$Admit, k = 10000) %>% data.frame(Pi.Admit=., Gender = as.character("Male"), frameN = 2) # Estimates for males ``` --- class:inverse # Binomial Result: Males ```r UCBTM %$% table(Admit) %>% binom.test(., conf.level = 0.9) ``` ``` ## ## Exact binomial test ## ## data: . ## number of successes = 1198, number of trials = 2691, p-value = ## 1.403e-08 ## alternative hypothesis: true probability of success is not equal to 0.5 ## 90 percent confidence interval: ## 0.4292930 0.4611706 ## sample estimates: ## probability of success ## 0.4451877 ``` The probability of being admitted, conditional on being Male, ranges from 0.43 to 0.46 with 90% confidence. --- ## The Thought Experiment: Male ```r UCBTM.Pi %>% ggplot(., aes(x=Pi.Admit)) + geom_density(color="maroon") + labs(x=expression(pi)) + theme_xaringan() ``` <img src="index_files/figure-html/BMPlot-1.jpeg" width="504" style="display: block; margin: auto;" /> --- class: inverse ## Binomial Result: Females ```r UCBTF %$% table(Admit) %>% binom.test(., conf.level = 0.9) ``` ``` ## ## Exact binomial test ## ## data: . ## number of successes = 557, number of trials = 1835, p-value < 2.2e-16 ## alternative hypothesis: true probability of success is not equal to 0.5 ## 90 percent confidence interval: ## 0.2858562 0.3216944 ## sample estimates: ## probability of success ## 0.3035422 ``` The probability of being of Admitted, given a Female, ranges from 0.286 to 0.322 with 90% confidence. --- ## The Thought Experiment: Female ```r UCBTF.Pi %>% ggplot(., aes(x=Pi.Admit)) + geom_density(color="maroon") + labs(x=expression(pi)) + theme_xaringan() ``` <img src="index_files/figure-html/BFPlot-1.jpeg" width="504" style="display: block; margin: auto;" /> --- # Succinctly .left-column[ Female: from 0.286 to 0.322 Male: from 0.43 to 0.46 All: from 0.3758 to 0.3998 ] .right-column[ <img src="index_files/figure-html/patch1-1.jpeg" width="648" /> ] --- class: inverse ## A Key Visual ```r UCB.Pi <- bind_rows(UCBTF.Pi, UCBTM.Pi) UCB.Pi %>% ggplot(., aes(x=Gender, y=Pi.Admit, fill=Gender)) + geom_violin() + geom_label(aes(x=1.5, y=0.375), label="Too small to be male \n Too large to be female?", size=14, fill="black", color="white", inherit.aes = FALSE) + guides(size=FALSE) + theme_xaringan_inverse() + labs(x="") ``` <img src="index_files/figure-html/Yellow-1.jpeg" width="648" style="display: block; margin: auto;" /> --- class: inverse ```r UCB.Pi <- bind_rows(UCBTF.Pi, UCBTM.Pi, RSMP) UCB.Pi %>% ggplot(., aes(x=Pi.Admit, fill=Gender)) + geom_density(alpha=0.5) + theme_minimal() + scale_fill_viridis_d() + transition_states(frameN, state_length = 40, transition_length = 20) + enter_fade() + exit_fade() -> SaveAnim anim_save(SaveAnim, renderer = gifski_renderer(height=500, width=1000), filename=here("static/xaringan/testest","Anim1.gif")) ``` .center[  ] --- # Can We Measure the Difference? .pull-left[ How much more likely are Males to be admitted when compared to Females? **We can take the difference in our estimates for Male and Female.** ```r UCB.Diff <- UCB.Pi %>% filter(frameN != 1) %>% select(Pi.Admit, Gender) %>% mutate(obs = rep(seq(1:10000),2)) %>% pivot_wider(., id_cols = obs, values_from=Pi.Admit, names_from=Gender) %>% mutate(Diff = Male - Female) quantile(UCB.Diff$Diff, probs=c(0.05,0.95)) ``` ``` ## 5% 95% ## 0.1176662 0.1651037 ``` ```r ggplot(UCB.Diff) + aes(x=Diff) + geom_density() + labs(title="Difference in Probability of Admission", subtitle="[Male - Female]") + theme_xaringan() ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-20-1.jpeg" width="504" /> ] --- class: inverse ## Or Use the Normal Approximation `$$[\hat{p}_{M} - \hat{p}_{F}] \pm z*\sqrt{ \underbrace{\frac{\hat{p}_{M}(1-\hat{p}_{M})}{n_{M}}}_{Males} + \underbrace{\frac{\hat{p}_{F}(1-\hat{p}_{F})}{n_{F}}}_{Females}}$$` ```r # prop.test(table(UCBTab$Gender,UCBTab$Admit), conf.level=0.9, correct=FALSE) UCBTab %$% table(Gender, Admit) %>% prop.test(conf.level = 0.9, correct=FALSE) ``` ``` ## ## 2-sample test for equality of proportions without continuity ## correction ## ## data: . ## X-squared = 92.205, df = 1, p-value < 2.2e-16 ## alternative hypothesis: two.sided ## 90 percent confidence interval: ## 0.1179805 0.1653103 ## sample estimates: ## prop 1 prop 2 ## 0.4451877 0.3035422 ``` With 90% confidence, a Female is 0.118 to 0.165 [0.1416] less likely to be admitted. --- class: inverse ### The Standard Error of the Difference Following something akin to the FOIL method, we can show that the variance of a difference in two random variables is the sum of the variances minus twice the covariance between them. $$Var(X - Y) = Var(X) + Var(Y) - 2*Cov(X,Y) $$ If we assume [or recognize it is generally unknownable] that the male and female samples are independent, the covariance is zero, and the variance of the difference is simply the sum of the variance for male and female, respectively, and **zero covariance**. `$$SE(\hat{p}_{M} - \hat{p}_{F}) = \sqrt{ \underbrace{\frac{\hat{p}_{M}(1-\hat{p}_{M})}{n_{M}}}_{Males} + \underbrace{\frac{\hat{p}_{F}(1-\hat{p}_{F})}{n_{F}}}_{Females}}$$` --- # The chi-square and the test of independence The sum of `\(k\)` squared standard normal (z or Normal(0,1)) variates has a `\(\chi^2\)` distribution with `\(k\)` degrees of freedom. Two sample tests of proportions can be reported using the chi-square metric as well. `\(R\)`'s `prop.test` does this by default. --- # Inference for Data Of two forms: 1. Binary/Qualitative (Proportions) 2. **Quantitative (Means)** -- We will now turn to the latter. --- ## Inference for Quantitative Data **What is the average level of expenditure by Ethnicity?** Why do we focus on the average? Under general conditions, the average has a known probability distribution even if the distribution of the underlying data is unknown. This is a result of the central limit theorem and a generic property of averaging. We can derive this distribution, it is known as `\(t\)` and has one parameter: degrees of freedom. --- ## The `t` distribution Results from the combination of a normal random variable in the numerator and a `\(\chi^2\)` random variable in the denominator. --- ## Expenditures First, a look at the data on expenditures. ```r load(url("https://github.com/robertwwalker/DADMStuff/raw/master/DiscriminationCADSS.RData")) ## Cutting Discrimination$Age into Discrimination$Age.Cohort Discrimination$Age.Cohort <- cut(Discrimination$Age, include.lowest=FALSE, right=TRUE, breaks=c(0, 5, 12, 17, 21, 50, 95)) WH <- Discrimination %>% filter(Ethnicity %in% c("White not Hispanic","Hispanic")) summary(WH$Expenditures) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 222 2877 6690 18101 37916 68890 ``` --- class: inverse ```r ( BasePlot <- ggplot(WH, aes(x=Expenditures)) + geom_density(color="red") + theme_minimal() + labs(y="") ) ``` <img src="index_files/figure-html/unnamed-chunk-22-1.jpeg" width="504" /> --- ## Method 1: Resampling **Suppose I have 777 chips that each contain an expenditure** -- I toss them all on the floor -- and pick them up, one at a time, -- record the value, -- put the chip back, *[NB: I put it back to avoid getting exactly the same sample every time.]* -- and repeat this to create a sample of n=777. -- Use random sampling to sample the mean/average 10000 times. -- **Each average of 777 chips/expenditures constitutes an estimate of the average expenditure.** ```r WH.Means <- data.frame(Mean.Expenditure=ResampleMean(WH$Expenditures)) ``` --- class: inverse ### A Visual ```r (ggplot(WH, aes(x=Expenditures)) + geom_density() + theme_xaringan_inverse()) / (ggplot(WH.Means, aes(x=Mean.Expenditure)) + geom_density() + theme_xaringan_inverse()) ``` <img src="index_files/figure-html/unnamed-chunk-25-1.jpeg" width="720" style="display: block; margin: auto;" /> --- ## Average Expenditures: A 90% Interval What is the middle 90% of the sampled averages? ```r ( ResQ <- quantile(WH.Means$Mean.Expenditure, probs=c(0.05,0.95)) ) ``` ``` ## 5% 95% ## 16976.61 19225.67 ``` ```r ResQD <- data.frame(ResQ=ResQ, Height2=5e-5) ``` --- ## A Cowplot without Proper Scaling <img src="index_files/figure-html/unnamed-chunk-27-1.jpeg" width="504" /> --- ## A Decent Visualization ```r BasePlot + geom_area(data=ResQD, aes(x=ResQ, y=Height2), fill="blue", color="blue", alpha=0.2, outline.type = "full") ``` <img src="index_files/figure-html/unnamed-chunk-28-1.jpeg" width="504" style="display: block; margin: auto;" /> --- class: inverse ## A Parametric Alternative Under general conditions, the sample and population mean can be described by the `\(t_{df}\)` distribution with a single parameter -- degrees of freedom [df] -- and metric standard errors of the mean: `\(\frac{s}{\sqrt{n}}\)` as follows: $$ t_{df} = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}} $$ Given any observed sample of data, only two features of the aforementioned equation remain unknown: `\(\mu\)` -- the true mean -- and `\(t_{df}\)`. Solving for `\(\mu\)` yields: $$ \mu = \overline{x} - t_{df} \cdot \left( \frac{s}{\sqrt{n}} \right) $$ --- class: inverse ## Putting It To Work ```r ( Summaries <- WH %>% summarise(Exp.Mean = mean(Expenditures), Exp.SD = sd(Expenditures), Count = n()) ) ``` ``` ## Exp.Mean Exp.SD Count ## 1 18100.86 19579.53 777 ``` ```r *WH %$% t.test(Expenditures, conf.level = 0.9) ``` ``` ## ## One Sample t-test ## ## data: Expenditures ## t = 25.77, df = 776, p-value < 2.2e-16 ## alternative hypothesis: true mean is not equal to 0 ## 90 percent confidence interval: ## 16944.12 19257.61 ## sample estimates: ## mean of x ## 18100.86 ``` With 90% confidence, average expenditures [for White not Hispanic and Hispanic] range between 1.6944\times 10^{4} and 1.9258\times 10^{4} dollars per DDS client. --- class: inverse ## Disparities in Expenditures? ```r ggplot(WH, aes(x=Ethnicity, y=Expenditures, fill=Ethnicity)) + geom_violin() + theme_xaringan_inverse() + scale_xaringan_fill_discrete() ``` <img src="index_files/figure-html/unnamed-chunk-30-1.jpeg" width="720" style="display: block; margin: auto;" /> --- ## Two Groups? .pull-left[ <small> ```r WH %>% filter(Ethnicity=="White not Hispanic") %$% t.test(Expenditures, conf.level=0.9) ``` ``` ## ## One Sample t-test ## ## data: Expenditures ## t = 24.003, df = 400, p-value < 2.2e-16 ## alternative hypothesis: true mean is not equal to 0 ## 90 percent confidence interval: ## 23001.17 26393.92 ## sample estimates: ## mean of x ## 24697.55 ``` </small> ] .pull-right[ <small> ```r WH %>% filter(Ethnicity=="Hispanic") %$% t.test(Expenditures, conf.level=0.9) ``` ``` ## ## One Sample t-test ## ## data: Expenditures ## t = 13.728, df = 375, p-value < 2.2e-16 ## alternative hypothesis: true mean is not equal to 0 ## 90 percent confidence interval: ## 9736.455 12394.683 ## sample estimates: ## mean of x ## 11065.57 ``` </small> ] --- ## Comparing Means There are two [generic] ways to compare means: 1. The resampling approach 2. t-tests --- ## Comparing Means by Resampling The method is similar to the approach for comparing proportions. We will: 1. Take a random sample of the same size as the original for each group *with replacement*. 2. Calculate the average for each group and the difference in the averages. 3. Repeat steps 1 and 2 `\(k\)` times for `\(k\)` samples. --- ## The Resampling Distribution of the Difference ```r WNHExp <- Discrimination %>% filter(Ethnicity %in% c("White not Hispanic")) HExp <- Discrimination %>% filter(Ethnicity %in% c("Hispanic")) Expenditure.Difference <- data.frame(Diff=ResampleProps::ResampleDiffMeans(WNHExp$Expenditures,HExp$Expenditures)) Expenditure.Difference %>% ggplot(., aes(x=Diff)) + geom_density() + theme_minimal() + labs(x="Average Difference in Average White and Hispanic Expenditures") ``` <img src="index_files/figure-html/unnamed-chunk-33-1.jpeg" width="504" /> ## The `t-test` For independent samples, the *Behrens-Fisher* problem dictates that comparisons of means have implications for the variance that must be reconciled. In short, we must assume that the variance and standard deviation of the samples are either the same [t] or different [Welch -- the default]. --- ```r Discrimination %>% filter(Ethnicity %in% c("White not Hispanic","Hispanic")) %>% t.test(Expenditures~Ethnicity, data=., conf.lev=0.9) ``` ``` ## ## Welch Two Sample t-test ## ## data: Expenditures by Ethnicity ## t = -10.429, df = 743.08, p-value < 2.2e-16 ## alternative hypothesis: true difference in means is not equal to 0 ## 90 percent confidence interval: ## -15784.59 -11479.37 ## sample estimates: ## mean in group Hispanic mean in group White not Hispanic ## 11065.57 24697.55 ``` --- ```r Discrimination %>% filter(Ethnicity %in% c("White not Hispanic","Hispanic")) %>% t.test(Expenditures~Ethnicity, data=., conf.lev=0.9, var.equal=TRUE) ``` ``` ## ## Two Sample t-test ## ## data: Expenditures by Ethnicity ## t = -10.339, df = 775, p-value < 2.2e-16 ## alternative hypothesis: true difference in means is not equal to 0 ## 90 percent confidence interval: ## -15803.25 -11460.71 ## sample estimates: ## mean in group Hispanic mean in group White not Hispanic ## 11065.57 24697.55 ``` --- ## Paired t-test Twinsberg. The paired t-test is a single sample average applied to the measured difference. The key is knowing who is matched with whom and believing that there is at least some variation in both outcomes that depends on the unit in question. Take the example of *Concrete*. --- Twelve distinct batches of concrete are subjected to an additive. Does the additive strengthen concrete? ```r Concrete ``` ``` ## Batch No.Add Additive Diff ## 1 1 4550 4600 -50 ## 2 2 4950 4900 50 ## 3 3 6250 6650 -400 ## 4 4 5700 5950 -250 ## 5 5 5350 5700 -350 ## 6 6 5300 5400 -100 ## 7 7 5150 5400 -250 ## 8 8 5800 5850 -50 ## 9 9 4900 4850 50 ## 10 10 6050 6450 -400 ## 11 11 5550 5850 -300 ## 12 12 5750 5600 150 ``` --- ## The Average Difference For each batch, we can calculate the difference between those with and without an additive. To do so, we need *mutate*. Let's plot it. ```r Concrete <- Concrete %>% mutate(Difference = Additive - No.Add) hist(Concrete$Difference) ``` <img src="index_files/figure-html/unnamed-chunk-37-1.jpeg" width="504" /> --- ## Resampling ```r Concrete.Diff <- data.frame(Sampled.Mean.Difference = ResampleMean(Concrete$Difference)) quantile(Concrete.Diff$Sampled.Mean.Difference, probs=c(0.025, 0.975)) ``` ``` ## 2.5% 97.5% ## 58.33333 254.16667 ``` --- ## Graphic ```r Concrete.Diff %>% ggplot() + aes(x=Sampled.Mean.Difference) + geom_density() ``` <img src="index_files/figure-html/unnamed-chunk-39-1.jpeg" width="504" /> --- ## t-test ```r t.test(Concrete$Difference) ``` ``` ## ## One Sample t-test ## ## data: Concrete$Difference ## t = 2.8793, df = 11, p-value = 0.01499 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## 37.29936 279.36730 ## sample estimates: ## mean of x ## 158.3333 ``` With 95% confidence, the additive increases the average break-weight of the concrete by 37.3 to 279.4 pounds. ---