Knowing how to do a t-test in Python is useful for a data scientist who needs to draw conclusions about a population from a sample dataset.
The t-test was developed by the statistician William Sealy Gosset, who published under the pen name "Student". The test is therefore also called Student's t-test and is a major hypothesis testing technique. Collecting data about an entire population is difficult when you don't have the time or resources. Using a sample that is representative of the population and making inferences from it saves time and effort and, when conducted correctly, can be quite accurate and help a researcher in their analyses.
What are inferential statistics and hypothesis tests?
The t-test belongs to inferential statistics and, more specifically, to hypothesis testing. Let us briefly look at what these two terms mean.
Inferential statistics is the branch of statistics that uses a variety of analytical tools to make inferences about a population from a sample. The other branch is descriptive statistics. Inferential statistics helps to draw conclusions about the population, while descriptive statistics summarizes the features of the dataset the researcher is working on.
Hypothesis testing is a technique in inferential statistics with which an analyst or data scientist can test an assumption (a hypothesis) about a population parameter using just a sample. The methodology depends on the nature of the data and the reason for the analysis, i.e. there are different types of hypothesis tests. Hypothesis testing is used to assess whether the data support or contradict a hypothesis, using a smaller sample that is more accessible than the full population.
(Note: This article only covers how to do a t-test in Python and not the other types of hypothesis tests. We'll look into them in other articles to come! Stay tuned)
Read more about Inferential Statistics here: Inferential Statistics
What is a t-test?
A t-test is a type of hypothesis test that is typically used when a population parameter has to be estimated from a small sample, generally taken to be n < 30, and the population standard deviation is unknown. With a small sample, the test statistic follows a t-distribution, which resembles the standard normal distribution but has a lower peak and heavier tails. A t-statistic is then used to find out whether the hypothesis holds or not.
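To make the difference between the two distributions concrete, here is a minimal sketch that compares upper 5% critical values of the t-distribution with the standard normal value. It assumes SciPy is installed; the degrees of freedom in the loop are arbitrary choices for illustration.

```python
from scipy import stats

# Upper 5% critical value of the standard normal distribution
z_crit = stats.norm.ppf(0.95)

# The t critical value is larger for small samples (heavier tails)
# and approaches the normal value as the degrees of freedom grow.
for df in (5, 19, 29, 1000):
    t_crit = stats.t.ppf(0.95, df)
    print(f"df={df:>4}: t critical = {t_crit:.4f}  (normal = {z_crit:.4f})")
```

The smaller the sample, the further out the critical value sits, which is why a small-sample test needs stronger evidence to reject the null hypothesis.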
Read more about the t-test here: t-tests.
How to calculate the t-statistic?
The t-statistic is calculated using the following formula:
t = (x bar - mu) / (S / sqrt(n))
where
t is the required t-statistic or test statistic
x bar is the sample mean
mu is the population mean (the value assumed under the null hypothesis)
S/(sqrt(n)) is the standard error, where S is the standard deviation of the sample
and n is the sample size, which is less than 30 in a t-test.
It is important to know this notation before moving on to how to do a t-test in Python.
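The formula can be written as a small helper function. This is only a sketch: the function name `t_statistic`, the parameter name `mu_0`, and the example numbers are made up for illustration.

```python
import numpy as np

def t_statistic(sample, mu_0):
    """One-sample t-statistic: (x_bar - mu_0) / (S / sqrt(n))."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    x_bar = sample.mean()       # sample mean
    s = sample.std(ddof=1)      # sample standard deviation (divides by n - 1)
    return (x_bar - mu_0) / (s / np.sqrt(n))

example = [9.1, 10.4, 10.9, 9.7, 10.2]   # made-up numbers for illustration
print(t_statistic(example, mu_0=10))
```

A positive result means the sample mean lies above the hypothesized mean, a negative one means it lies below.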
Read more about the formula to calculate the t-statistic here: t-statistic formula
When to reject or fail to reject the null hypothesis?
Knowing when to reject the null hypothesis is the most important part of doing a t-test in Python or any other software. Strictly speaking, a null hypothesis is never "accepted"; we either reject it or fail to reject it.
- In a one-tailed t-test, reject the null hypothesis if the test statistic is greater than the critical value (for a right-tailed test) or smaller than the critical value (for a left-tailed test)
- In a one-tailed t-test, fail to reject the null hypothesis if the test statistic is smaller than the critical value (for a right-tailed test) or greater than the critical value (for a left-tailed test)
- In a two-tailed t-test, fail to reject the null hypothesis if the test statistic lies between the lower and upper critical values, and reject it if the test statistic falls outside them.
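These rules can be sketched as a small function. The name `reject_null` and its parameters are hypothetical, and the critical values come from `scipy.stats.t.ppf`, so SciPy is assumed to be installed.

```python
from scipy import stats

def reject_null(t_stat, df, alpha=0.05, tail="right"):
    """Return True if H0 is rejected. `tail` is 'right', 'left', or 'two'."""
    if tail == "right":
        return bool(t_stat > stats.t.ppf(1 - alpha, df))
    if tail == "left":
        return bool(t_stat < stats.t.ppf(alpha, df))
    if tail == "two":
        upper = stats.t.ppf(1 - alpha / 2, df)   # lower limit is -upper by symmetry
        return bool(abs(t_stat) > upper)
    raise ValueError("tail must be 'right', 'left', or 'two'")

print(reject_null(2.5, df=19, tail="right"))   # statistic beyond the right-tail critical value
print(reject_null(0.5, df=19, tail="right"))   # statistic inside the non-rejection region
```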
How to do t-test in Python? (explained with an example)
Conducting a t-test involves a few steps, and we'll replicate them using Python libraries such as NumPy, Pandas, and scipy.stats.
The steps involved in doing a t-test in Python are:
- Deciding on a null and alternate hypothesis
- Collecting the relevant data
- Determining the confidence level and level of significance
- Calculating sample mean, standard deviation, and population mean
- Calculating the test statistic or t-statistic
- Determining the critical t value using a t distribution
- Comparing the test statistic with the critical value
Suppose your null hypothesis is that the mean of the data is 10 and the alternate hypothesis is that the mean of the data is greater than 10. We can write this as:
H0 : µ = 10
Ha : µ > 10
This means we're conducting a one-tailed t-test on the right tail to see whether the population mean is, in fact, greater than 10.
Let us now generate the data using the NumPy library (N is 20 here, which is less than 30 and is therefore a small sample size, making the t-test the right choice of hypothesis test):
import numpy as np

N = 20
x = np.random.randn(N) + 10
y = np.random.randn(N) + 10
For the level of significance, we will use the conventional value of 0.05 (5%), which corresponds to a confidence level of 95%, because we're only looking into this scenario as an example to understand the t-test.
Calculating the sample mean, standard deviation, and population mean can be done using built-in functions in Python or the NumPy functions as follows:
import numpy as np

mean_x = x.mean()
var_x = x.var(ddof=1)  # ddof=1 divides by N - 1, giving the sample variance
SD_x = np.sqrt(var_x)  # sample standard deviation
mean_y = y.mean()      # stands in for the population mean in this example
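As a small illustration of what the `ddof` argument of NumPy's `var` controls: the sum of squared deviations is divided by `n - ddof`, so `ddof=1` gives the usual sample variance. The four data points below are made up for this sketch.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 6.0])   # made-up numbers, mean is 4

var_population = data.var(ddof=0)   # divides by n      -> 8 / 4
var_sample = data.var(ddof=1)       # divides by n - 1  -> 8 / 3

print(var_population, var_sample)
```

Passing a large value such as `ddof=19` for a sample of 20 would divide by 1 and badly inflate the variance, so `ddof=1` is the value to use for a sample standard deviation.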
Calculating the test statistic or t-statistic:
Using the formula mentioned above, we can replicate the formula in NumPy as follows to calculate the test statistic:
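import numpy as np

# t = (sample mean - population mean) / (standard deviation / sqrt(N))
t = (mean_x - mean_y) / (SD_x / np.sqrt(N))
In the original run, t = -0.94. Because the data are random and no seed is set, the value changes on every run.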
Note: Since here the number of elements in the sample is 20, the degrees of freedom in consideration will be N – 1 or 20 – 1 = 19.
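As a cross-check, a one-sample t-statistic against the hypothesized mean of 10 can also be obtained from `scipy.stats.ttest_1samp`. This is a minimal sketch, assuming SciPy 1.6 or later (for the `alternative` argument). The seed and the use of `default_rng` are choices made here for reproducibility and are not part of the original example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only
N = 20
x = rng.standard_normal(N) + 10     # sample centred near 10
mu_0 = 10                           # mean under the null hypothesis

# Manual calculation with the formula from this article
t_manual = (x.mean() - mu_0) / (x.std(ddof=1) / np.sqrt(N))

# Same test via SciPy; alternative="greater" makes it right-tailed
result = stats.ttest_1samp(x, popmean=mu_0, alternative="greater")

print(f"manual t = {t_manual:.4f}")
print(f"scipy  t = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
```

The two t values should agree, and a p-value above 0.05 leads to the same decision as a t-statistic below the critical value.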
To calculate the critical value, one can use a t-distribution table, which is available online, or the SciPy library. For a right-tailed test with 19 degrees of freedom and a 0.05 level of significance, the critical t value is 1.7291.
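With SciPy, the critical value can be looked up through the inverse CDF (`ppf`) of the t-distribution. A short sketch, assuming SciPy is installed:

```python
from scipy import stats

alpha = 0.05    # level of significance
df = 19         # degrees of freedom, N - 1

# Right-tailed test: the critical value cuts off the upper 5% of the distribution
t_crit = stats.t.ppf(1 - alpha, df)
print(round(t_crit, 4))   # 1.7291
```

For a left-tailed test the corresponding call would be `stats.t.ppf(alpha, df)`, which gives the same magnitude with a negative sign.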
Find a super-easy critical value calculator here: critical value calculator
Upon comparing the test statistic with the critical value, we observe that t is negative (-0.94) while the critical t value is about 1.73. Since t is not greater than the critical value in a right-tailed test, the null hypothesis cannot be rejected in this case.
However, this happened because both sets of data were generated with a mean of around 10, so the difference between them was small. Let us change the data so that the two means are far apart (one of them around 1) and see how it impacts the decision.
In this case, the reported t value is 16.43, which is far bigger than 1.73. We would therefore reject the null hypothesis and conclude that there is a statistically significant difference from our assumed population mean of around 10 (but of course, this is because we rigged our sample here).
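A rigged sample of this kind can be sketched as follows. The shift to a mean of 12 and the seed are hypothetical choices for this illustration, not the values used above; the point is only that a sample centred well above 10 produces a large positive t in a right-tailed test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)          # arbitrary seed, for reproducibility only
N = 20
mu_0 = 10                                # hypothesized mean under H0
shifted = rng.standard_normal(N) + 12    # hypothetical sample centred near 12

t_shifted = (shifted.mean() - mu_0) / (shifted.std(ddof=1) / np.sqrt(N))
t_crit = stats.t.ppf(0.95, N - 1)        # right-tailed, alpha = 0.05

print(f"t = {t_shifted:.2f}, critical value = {t_crit:.4f}")
print("reject H0" if t_shifted > t_crit else "fail to reject H0")
```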
Conducting a hypothesis test this way is fairly simple because the t-statistic is not a complicated formula to compute. The difficulty lies in understanding how the hypothesis test works, so a strong theoretical grasp of the concept is essential for anyone running the test in Python.
Knowing how to do a t-test in Python is important for researchers, scientists, and data scientists alike. Imagine you have to draw conclusions about whether employees arrive 30 minutes late every day at the company's offices in their cities. Assuming the company has 250,000+ employees, it would be very difficult to record and store every employee's data accurately. In a case like this, when the full dataset is very large and needs wrangling before any hypothesis test can be run, you can work with a small sample instead, and that is when you need to know how to do a t-test in Python.
For a data scientist, knowing only how to compute descriptive statistics is not enough. Descriptive statistics can be used to identify or diagnose a problem. To rectify that problem, statistically significant inferences need to be made about the situation, which can help the organization take the relevant steps to improve conditions as they stand.
Try conducting a t-test similar to the tutorial in this article and let us know in the comments below about your inferences from your own datasets! (Note: If the sample is fairly large, a Z-test is commonly used instead of a t-test)