ANOVA Python

HYPOTHESIS TESTING
ANOVA

In the last exercise, we saw that the probability of making a Type I error got dangerously high as we performed more t-tests.

When comparing more than two numerical datasets, one way to preserve a Type I error probability of 0.05 is to use ANOVA. ANOVA (Analysis of Variance) tests the null hypothesis that all of the samples come from populations with the same mean. If we reject the null hypothesis with ANOVA, we’re saying that at least one pair of populations (from which the samples were drawn) has different means; however, we cannot determine exactly which pair(s) differ.

We can use the SciPy function f_oneway (from scipy.stats) to perform ANOVA on multiple datasets. f_oneway takes in each dataset as a separate input and returns the F-statistic and the p-value. For example, if we were comparing scores on a video game between math majors, writing majors, and psychology majors, we could run an ANOVA test with this line:

fstat, pval = f_oneway(scores_mathematicians, scores_writers, scores_psychologists)

The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If we reject this null hypothesis (if we get a p-value less than 0.05), we can say that we are reasonably confident that at least one pair of populations is significantly different. ANOVA alone, however, cannot tell us which pair(s) of populations differ.

from scipy.stats import ttest_ind
from scipy.stats import f_oneway
import numpy as np

# Load the data for each of the three stores
a = np.genfromtxt("store_a.csv", delimiter=",")
b = np.genfromtxt("store_b.csv", delimiter=",")
c = np.genfromtxt("store_c.csv", delimiter=",")

# Run ANOVA on the three datasets and print the results
fstat, pval = f_oneway(a, b, c)
print(pval)
print(fstat)
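
To interpret the result, we can compare the p-value to the 0.05 significance threshold discussed above. The snippet below is a minimal sketch that reuses the pval computed in the exercise code:

# Compare the p-value to the 0.05 significance threshold
if pval < 0.05:
    print("At least one pair of store means is significantly different.")
else:
    print("We cannot conclude that the store means differ.")
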
Assumptions of T-Tests and ANOVA

Before we use one- or two-sample t-tests or ANOVA, we need to be sure that the following things are true:

1. The sample(s) should be normally distributed…ish

Data analysts in the real world often still perform t-tests or ANOVAs on data that are not normally distributed. This is usually not a problem if sample size is large, but it depends on how non-normal the data is. In general, the bigger the sample size, the safer you are!
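
One quick, informal way to check this assumption is to plot a histogram of each sample and eyeball whether it looks roughly bell-shaped. The sketch below assumes matplotlib is available and reuses the store_a.csv file from the exercise above:

import numpy as np
import matplotlib.pyplot as plt

# Reusing the store data from the ANOVA exercise above
a = np.genfromtxt("store_a.csv", delimiter=",")

# Plot a histogram to visually check whether the sample looks roughly normal
plt.hist(a, bins=20)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()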

2. The standard deviations of the samples should be equal

For ANOVA and 2-Sample T-Tests, using datasets with standard deviations that are significantly different from each other will often obscure the differences in group means. That said, there is also a way to run a 2-Sample T-Test without assuming equal standard deviations (for example, by setting the equal_var parameter in the scipy.stats.ttest_ind() function equal to False). Running the test in this way has some disadvantages (it essentially makes it harder to reject the null hypothesis even when there is a true difference between groups), so it’s important to check for equal standard deviations before running a test.
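
For reference, here is a minimal sketch of running that version of the test with scipy.stats.ttest_ind, using two of the store datasets loaded earlier:

# Welch's t-test: a 2-sample t-test that does not assume equal standard deviations
tstat, pval = ttest_ind(a, b, equal_var=False)
print(pval)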

To check this assumption, it is normally sufficient to divide the two standard deviations and see if the ratio is “close enough” to 1. “Close enough” may differ in different contexts but generally staying within 10% should suffice. This equates to a ratio between 0.9 and 1.1.
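
As a sketch, that ratio check might look like this with NumPy, again reusing two of the store datasets from earlier:

# Ratio of the two sample standard deviations (ddof=1 gives the sample standard deviation)
ratio = np.std(a, ddof=1) / np.std(b, ddof=1)
print(ratio)

# "Close enough" here means the ratio falls roughly between 0.9 and 1.1
print(0.9 < ratio < 1.1)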

3. The samples must be independent

When comparing two or more datasets, the values in one distribution should not affect the values in another distribution. In other words, knowing more about one distribution should not give you any information about any other distribution.

Here are some examples where the samples would not be independent:

  • the number of goals scored per soccer player before, during, and after undergoing a rigorous training regimen
  • a group of patients’ blood pressure levels before, during, and after the administration of a drug

It is important to understand your datasets before you begin conducting hypothesis tests on them so that you know you are choosing the right test.

