An ANOVA (Analysis of Variance) test is used to analyze the relationship between categorical and continuous variables. It investigates whether the mean of a quantitative dependent variable changes across the levels of one or more categorical independent variables.
ANOVA's null hypothesis states that the mean of the dependent variable is the same at every level of the independent variable, whereas the alternative hypothesis states that at least one group mean differs.
aov(Dependent_variable~factor(Independent_Variable))
A one-way ANOVA test is performed using the mtcars dataset between the disp attribute, a continuous attribute, and the gear attribute, a categorical attribute.
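Note: The aov() function used for a one-way ANOVA is part of the stats package, which ships with base R, so no additional package is required. The dplyr package is loaded below but is not needed for the test itself.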
The mtcars data comes from the 1974 Motor Trend US magazine. It includes fuel consumption and aspects of car design for then-current car models.
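In mtcars, gear is stored as a numeric column, which is why it is wrapped in factor() before being used as a grouping variable. The following is a minimal sketch, using only base R, of what changes when factor() is omitted. The names fit_numeric, fit_factor, df_numeric, and df_factor are introduced here purely for illustration.

```r
# gear is numeric in mtcars, so aov() would otherwise treat it as a
# continuous predictor rather than as a set of groups.
is.numeric(mtcars$gear)
levels(factor(mtcars$gear))

# Without factor(): a single slope term, 1 degree of freedom.
fit_numeric <- aov(disp ~ gear, data = mtcars)

# With factor(): one term per group comparison, (levels - 1) degrees of freedom.
fit_factor <- aov(disp ~ factor(gear), data = mtcars)

# Pull the degrees of freedom of the first row of each ANOVA table.
df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]
df_factor  <- summary(fit_factor)[[1]][["Df"]][1]

cat("Df without factor():", df_numeric, "\n")
cat("Df with factor():   ", df_factor, "\n")
```

With three gear levels, the factor version uses 2 degrees of freedom, which is what a one-way ANOVA across three groups requires.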
library(dplyr)
boxplot(mtcars$disp ~ factor(mtcars$gear), xlab = "gear", ylab = "disp")
The box plot shows the distribution of displacement (median, quartiles, and range) at each level of gear. Here, the categorical variable is gear, on which the factor function is used, and the continuous variable is disp.
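A box plot draws medians and quartiles, whereas ANOVA compares group means, so it can help to print both alongside the plot. This is a small sketch using base R's tapply() and table(). The variable names gear_f, group_means, group_medians, and group_sizes are illustrative and not part of the original example.

```r
# Treat gear as a grouping variable.
gear_f <- factor(mtcars$gear)

# Mean and median displacement within each gear level.
group_means   <- tapply(mtcars$disp, gear_f, mean)
group_medians <- tapply(mtcars$disp, gear_f, median)

# Number of cars at each gear level (group sizes are unequal).
group_sizes <- table(gear_f)

print(round(group_means, 1))
print(group_medians)
print(group_sizes)
```

The group means printed here are the quantities that the ANOVA null hypothesis assumes to be equal.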
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)
The summary shows that gear has a highly significant effect on displacement (indicated by the stars next to the p-value). The p-value is less than 0.05, so we reject the null hypothesis: the data provide strong evidence that mean displacement differs across gear levels, meaning the two variables are related.
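Rather than reading the p-value off the printed table, it can also be extracted and compared against a significance level in code. The sketch below assumes a threshold of 0.05, as in the text. The names fit, aov_table, p_value, alpha, and reject_null are illustrative.

```r
# Fit the same one-way ANOVA, this time with the data argument.
fit <- aov(disp ~ factor(gear), data = mtcars)

# summary() returns a list; its first element is the ANOVA table.
aov_table <- summary(fit)[[1]]

# First row is the factor(gear) term, second row is the residuals.
p_value <- aov_table[["Pr(>F)"]][1]

# Significance level assumed for this sketch.
alpha <- 0.05
reject_null <- p_value < alpha

cat("p-value:", format(p_value, digits = 3), "\n")
cat("Reject null hypothesis at alpha = 0.05:", reject_null, "\n")
```

Note that rejecting the null hypothesis only indicates that at least one group mean differs. It does not say which gear levels differ from each other.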
The rest of the values in the output table describe the independent variable and the residuals:
The Df column displays the degrees of freedom for the independent variable (the number of levels minus 1) and for the residuals (the number of observations minus the number of levels).
The Sum Sq column displays the sum of squares: the variation between the group means and the overall mean for the independent variable, and the variation left within the groups for the residuals.
The Mean Sq column is the mean of the sum of squares, calculated by dividing the sum of squares by the degrees of freedom for each parameter.
The F value column is the test statistic from the F test. This is the mean square of each independent variable divided by the mean square of the residuals. The larger the F value, the more likely it is that the variation caused by the independent variable is real and not due to chance.
The Pr(>F) column is the p-value of the F statistic. This shows how likely it is that an F value at least as large as the one calculated from the test would have occurred if the null hypothesis of no difference among group means were true.
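To make the relationship between the columns concrete, the table can be rebuilt by hand from the raw data. The following is a sketch for the one-way case only, using base R. All variable names (ss_between, ss_within, and so on) are illustrative.

```r
# Reference fit whose table we want to reproduce.
fit <- aov(disp ~ factor(gear), data = mtcars)
tab <- summary(fit)[[1]]

y <- mtcars$disp
g <- factor(mtcars$gear)
n <- length(y)    # number of observations
k <- nlevels(g)   # number of groups

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Sum Sq: between-group and within-group (residual) variation.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((y - ave(y, g))^2)  # ave() gives each row its group mean

# Df: levels minus 1, and observations minus levels.
df_between <- k - 1
df_within  <- n - k

# Mean Sq = Sum Sq / Df.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F value = Mean Sq of the factor / Mean Sq of the residuals.
f_value <- ms_between / ms_within

# Pr(>F): upper tail of the F distribution.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Both comparisons should return TRUE.
print(all.equal(f_value, tab[["F value"]][1]))
print(all.equal(p_value, tab[["Pr(>F)"]][1]))
```

If the hand-computed values match the aov() table, each column has been interpreted correctly.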