In this assignment, I conducted a multivariate data analysis to determine how physical and biological characteristics contribute to physical performance. To do so, I carried out an exploratory analysis (t-tests, outlier detection) followed by a regression analysis in which I fitted linear models to inspect the relationships between variables. Lastly, I used a resampling method (10-fold cross-validation) to find the polynomial degree of best fit.
Sit-ups are one of the most popular exercises and strengthen the core (Childs, J. D. et al., 2009). This exercise has many added health benefits given that it engages the abdominal muscles, which help stabilize the core. Given the benefits of doing sit-ups, we sought a dataset with a wide range of physical attributes. One motivation for selecting this data is to understand the health benefits of performing such physical activity. Furthermore, we also want to observe predispositions for physical performance according to various physical traits and demographic attributes.
The data was collected from the Korea Sports Promotion Foundation. The raw data had already been filtered, given that it was extracted from Kaggle rather than from the foundation's website. The initial dataset displays different physical traits over a sample of 13,393 observations with 13 columns (each representing a variable). However, our initial exploratory analysis will comprise 8 variables: age, gender, body-fat percentage, diastolic blood pressure, systolic blood pressure, height, weight, and sit-up count. The diastolic variable, called the diastolic reading, measures the pressure in the arteries between heartbeats. The systolic blood pressure shows the pressure blood exerts against the artery walls during each heartbeat. Of the two, the systolic blood pressure is always the higher number.
Since we aim to perform a regression analysis, we now seek to attribute independent and dependent variables:
Since our data is health-related, we will select the sit-up count as our dependent variable as we seek to understand how physical and health-related factors might impact physical performance.
For the exploratory analysis, our initial numerical independent variables are the following: age, body-mass percentage, diastolic, systolic, height, and weight.
The chosen categorical independent variable for our analysis is gender, which is dichotomous since it only has two possible values: male or female.
To make the analysis easier to apply, we first rename our variables:
attach(bodyperformance)
gender <- bodyperformance$gender
situps <- bodyperformance$`sit-ups counts`
age <- bodyperformance$age
bod <- bodyperformance$`body fat_%`
dias <- bodyperformance$diastolic
syst <- bodyperformance$systolic
weight <- bodyperformance$weight_kg
height <- bodyperformance$height_cm
Subsequently, our first step is to analyze the dependent variable. As such, we inspect the summary along with the standard deviation, the box plot, and the histogram.
Upon analyzing boxplot 1 for the sit-up count, we notice that there are no outliers. Furthermore, given the significant size of the sample, the standard deviation of 14.27 shows that the values are relatively clustered around the mean. This is further displayed in histogram 1, which shows only a very slight skew to the left (the mean, 39.77, lies just below the median, 41) and a high frequency of observations around the mean.
library(ggplot2)
summary(situps)
## Min. 1st Qu. Median Mean 3rd Qu.
## 0.00 30.00 41.00 39.77 50.00
## Max.
## 80.00
sd(situps)
## [1] 14.27273
par(mfrow=c(2,2))
qplot(situps, geom = "histogram",
      binwidth = 5,
      main = "Histogram 1: Sit-ups",
      xlab = "Sit-ups",
      ylab = "Count",
      fill = I("blue"),
      col = I("red"),
      alpha = I(.2))
ggplot(bodyperformance, aes(x = situps, col = situps)) + geom_boxplot(fill = 'blue') +
labs(title = "Boxplot 1: Sit-ups")
Similarly to the sit-up count, we inspect the other numerical independent variables using the sapply() function to create the following boxplots (boxplot 2).
Upon inspecting the data for the quantitative independent variables, we look at each individual variable to detect outliers. We notice that, excluding age, all the other variables have outliers.
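The per-variable outlier screening described above can be sketched with sapply() and boxplot.stats(); the toy data frame below is illustrative only (for the real analysis, this would run over the bodyperformance columns):

```r
# Illustrative sketch (toy data): count boxplot outliers per numeric column.
df <- data.frame(
  age       = c(21, 25, 30, 36, 42, 48, 55, 64),
  weight_kg = c(58, 60, 65, 67, 70, 72, 75, 260)  # 260 kg is implausible
)
# boxplot.stats()$out returns the points outside the 1.5*IQR whiskers.
outlier_counts <- sapply(df, function(v) length(boxplot.stats(v)$out))
outlier_counts  # age: 0, weight_kg: 1
```

Variables flagged this way are then inspected individually, as done below, rather than removed automatically.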
We notice that the minimum height in the dataset is 125 cm. Since the corresponding weight of 34.4 kg is consistent with that height, we do not discard this outlier. Conversely, one observation shows a height of 193.8 cm. Nevertheless, this value does not contain any conflicting information, given that the weight (93.4 kg) and body-fat percentage (19.4%) are consistent with the height. Weight also has numerous outliers. One observation shows a 63-year-old female weighing 26 kg with a height of 153 cm and a body fat of 37%. Due to the incoherence of these values, we removed this observation. With regard to body-fat percentage, upon inspecting the outliers, no specific incoherence was discovered within the observations.
Lastly, we examine the diastolic and systolic columns. To ensure there were no incoherent values, we created a column showing the difference between the systolic and diastolic blood pressure. In total, 4 observations returned a negative difference. We removed all of them, as the systolic pressure is always greater than the diastolic pressure. We also removed an observation in which both the systolic and diastolic pressure were 0.
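This blood-pressure cleaning step can be sketched as follows; the toy data frame is illustrative only (the real filter runs on bodyperformance):

```r
# Illustrative sketch (toy data): drop rows where systolic does not strictly
# exceed diastolic; this removes negative differences and all-zero readings
# in a single pass.
df <- data.frame(systolic  = c(120, 70, 0, 140),
                 diastolic = c( 80, 90, 0,  85))
df$diff <- df$systolic - df$diastolic
clean <- subset(df, diff > 0)
nrow(clean)  # 2 valid rows remain
```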
## age height_cm weight_kg
## Min. 21.00000 125.0000 31.90000
## 1st Qu. 25.00000 162.4000 58.20000
## Median 32.00000 169.2000 67.40000
## Mean 36.77753 168.5611 67.44955
## 3rd Qu. 48.00000 174.8000 75.30000
## Max. 64.00000 193.8000 138.10000
## body fat_% diastolic systolic
## Min. 3.00000 6.00000 77.0000
## 1st Qu. 18.00000 71.00000 120.0000
## Median 22.80000 79.00000 130.0000
## Mean 23.23508 78.79992 130.2657
## 3rd Qu. 28.00000 86.00000 141.0000
## Max. 54.90000 126.00000 201.0000
## age height_cm weight_kg body fat_%
## 13.625526 8.426132 11.945704 7.241077
## diastolic systolic
## 10.696269 14.615821
Now that we have removed the outliers, our final sample is reduced to 13,386 observations. To determine correlations, we create a pairs figure showing the scatter plots (and therefore the linear relationships) between each pair of variables. Upon analyzing the plot, we can conclude that there is a moderate negative correlation between sit-up count and age (-0.54), a moderate negative correlation between sit-up count and body-fat percentage (-0.61), and a moderate positive correlation between sit-up count and height (0.50).
By looking at these scatterplots individually (scatterplots 1, 2, and 3), we can clearly observe the negative linear relationships that age and body-fat percentage have with sit-up count, as well as the positive linear relationship height has with sit-up count. The data shows that all the variables are significant except the diastolic variable. Indeed, the p-value for the diastolic variable (0.0623) is higher than 0.05, which indicates very low significance in relation to the sit-up count. Similarly, we notice a collinearity issue between height and weight, given that they are highly correlated (0.73). The latter can be observed in the pairs figure, where the scatter plot for height and weight displays a positive linear relationship.
Consequently, the diastolic and weight independent variables will not be used in the modelling process. Given that height and weight are collinear, weight is withdrawn because height has a stronger correlation with sit-up count (0.50 versus 0.29).
library(Hmisc)
library(GGally)
ggpairs(bodyperformance[,c(10,1,3,4,5,6,7)], title = 'Pair plot 1: Relationship amongst numerical variables, and display of their distribution', columns = 1:7, aes(alpha = 0.5),
lower = list(continuous = "smooth"))
rcorr(as.matrix(bodyperformance[,c(10,1,3,4,5,6,7)]))
## sit-ups counts age
## sit-ups counts 1.00 -0.54
## age -0.54 1.00
## height_cm 0.50 -0.29
## weight_kg 0.29 -0.10
## body fat_% -0.61 0.24
## diastolic 0.02 0.16
## systolic 0.06 0.21
## height_cm weight_kg
## sit-ups counts 0.50 0.29
## age -0.29 -0.10
## height_cm 1.00 0.73
## weight_kg 0.73 1.00
## body fat_% -0.52 -0.08
## diastolic 0.15 0.26
## systolic 0.21 0.34
## body fat_% diastolic systolic
## sit-ups counts -0.61 0.02 0.06
## age 0.24 0.16 0.21
## height_cm -0.52 0.15 0.21
## weight_kg -0.08 0.26 0.34
## body fat_% 1.00 0.05 -0.03
## diastolic 0.05 1.00 0.68
## systolic -0.03 0.68 1.00
##
## n= 13386
##
##
## P
## sit-ups counts age
## sit-ups counts 0.0000
## age 0.0000
## height_cm 0.0000 0.0000
## weight_kg 0.0000 0.0000
## body fat_% 0.0000 0.0000
## diastolic 0.0623 0.0000
## systolic 0.0000 0.0000
## height_cm weight_kg
## sit-ups counts 0.0000 0.0000
## age 0.0000 0.0000
## height_cm 0.0000
## weight_kg 0.0000
## body fat_% 0.0000 0.0000
## diastolic 0.0000 0.0000
## systolic 0.0000 0.0000
## body fat_% diastolic systolic
## sit-ups counts 0.0000 0.0623 0.0000
## age 0.0000 0.0000 0.0000
## height_cm 0.0000 0.0000 0.0000
## weight_kg 0.0000 0.0000 0.0000
## body fat_% 0.0000 0.0005
## diastolic 0.0000 0.0000
## systolic 0.0005 0.0000
par(mfrow = c(2, 2))
ggplot(bodyperformance, aes(x=age, y=situps)) +
geom_point() + ggtitle('Scatterplot 1: relationship between age and sit-ups') + geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
ggplot(bodyperformance, aes(x=height, y=situps)) +
geom_point() + ggtitle('Scatterplot 2: relationship between height and sit-ups') + geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
ggplot(bodyperformance, aes(x=`body fat_%`, y=situps)) +
geom_point()+ ggtitle('Scatterplot 3: relationship between body-fat percentage and sit-ups') + geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
Our first analysis regarding gender composition is to break down the number of males and females in our dataset. Based on the bar plot, we notice that the male population (8,463) is substantially larger than the female population (4,923).
table(gender)
## gender
## F M
## 4923 8463
par(mfrow=c(1,1))
ggplot(data = bodyperformance, aes(x = gender, fill = gender)) +
geom_bar() + ggtitle("Bar plot 1: gender count")
In parallel, we are now interested in comparing the two groups by calculating the average sit-up count by gender. To analyze the difference between means, we carry out a Welch two-sample t-test. As we can observe, the average number of sit-ups is significantly higher for males than for females: on average, males do approximately 44 sit-ups whereas females do approximately 30. The difference is significant since the p-value is below 0.001. In accordance with these results, the gender variable will be relevant to the modelling process.
t.test(`sit-ups counts` ~ gender)
##
## Welch Two Sample t-test
##
## data: sit-ups counts by gender
## t = -59.642, df = 8958.8, p-value <
## 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -14.50176 -13.57885
## sample estimates:
## mean in group F mean in group M
## 30.89376 44.93407
ggplot(data = bodyperformance, aes(x=gender, y=situps, fill=gender)) + geom_boxplot() + ggtitle("Boxplot 3: Sit-up by gender")
In the previous section, we removed all the irrelevant variables as well as the unusable outliers. In this section, we implement our modelling process along with a residual analysis. To do so, we test different models. In the first model, all the variables are significant, as their p-values are smaller than 0.05. The adjusted R-squared value, which shows how much of the variation in sit-up count the independent variables explain, is 0.57, indicating that the model accounts for a moderate share of the variance.
model1 <- lm(situps ~ age + height + syst + bod,
data = bodyperformance)
model1
##
## Call:
## lm(formula = situps ~ age + height + syst + bod, data = bodyperformance)
##
## Coefficients:
## (Intercept) age height
## 26.5833 -0.4431 0.2137
## syst bod
## 0.1042 -0.8661
summary(model1)
##
## Call:
## lm(formula = situps ~ age + height + syst + bod, data = bodyperformance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.919 -5.855 0.452 6.238 39.386
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 26.583311 2.155156 12.34
## age -0.443082 0.006538 -67.77
## height 0.213742 0.011988 17.83
## syst 0.104240 0.005950 17.52
## bod -0.866140 0.013195 -65.64
## Pr(>|t|)
## (Intercept) <2e-16 ***
## age <2e-16 ***
## height <2e-16 ***
## syst <2e-16 ***
## bod <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.383 on 13381 degrees of freedom
## Multiple R-squared: 0.5679, Adjusted R-squared: 0.5678
## F-statistic: 4397 on 4 and 13381 DF, p-value: < 2.2e-16
In order to find a better fit, we replicate the same model adding the gender variable. In this second model, the adjusted R-squared is 0.58, which is higher than that of the first model. However, the height variable now has a p-value (0.768) greater than the significance level of 0.05, indicating low significance and suggesting the variable should be discarded.
The higher R-squared value indicates that gender explains part of the variance in sit-up counts: the positive coefficient on genderM (6.37) means that males do more sit-ups on average, holding the other variables constant. Correspondingly, the loss of significance of height once gender is included suggests possible collinearity between height and gender.
model2 <- lm(situps ~ age + height + syst + bod
+ gender)
summary(model2)
##
## Call:
## lm(formula = situps ~ age + height + syst + bod + gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.492 -5.676 0.431 6.060 39.812
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 62.233816 2.604356 23.896
## age -0.476647 0.006567 -72.581
## height -0.004415 0.014999 -0.294
## syst 0.070668 0.006006 11.767
## bod -0.749711 0.013857 -54.102
## genderM 6.365752 0.271992 23.404
## Pr(>|t|)
## (Intercept) <2e-16 ***
## age <2e-16 ***
## height 0.768
## syst <2e-16 ***
## bod <2e-16 ***
## genderM <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.197 on 13380 degrees of freedom
## Multiple R-squared: 0.5849, Adjusted R-squared: 0.5848
## F-statistic: 3771 on 5 and 13380 DF, p-value: < 2.2e-16
We now remove the height variable and create a third model to find the new R-squared value. The results from the model 3 summary below validate our previous assumption: all the variables are significant and the R-squared value is the same as in the second model, even after the height variable was removed. Consequently, this model is a better fit. However, as with the 3 previous scatterplots analyzing the correlation between sit-up count and the independent numerical variables, scatterplot 4 shows little correlation between the systolic variable and sit-up count (0.06).
model3 <- lm(situps ~ age + syst + bod
+ gender)
summary(model3)
##
## Call:
## lm(formula = situps ~ age + syst + bod + gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.536 -5.683 0.435 6.061 39.819
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 61.501068 0.765349 80.36
## age -0.475989 0.006175 -77.09
## syst 0.070544 0.005990 11.78
## bod -0.749195 0.013746 -54.50
## genderM 6.315997 0.213082 29.64
## Pr(>|t|)
## (Intercept) <2e-16 ***
## age <2e-16 ***
## syst <2e-16 ***
## bod <2e-16 ***
## genderM <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.197 on 13381 degrees of freedom
## Multiple R-squared: 0.5849, Adjusted R-squared: 0.5848
## F-statistic: 4714 on 4 and 13381 DF, p-value: < 2.2e-16
ggplot(bodyperformance, aes(x=syst, y=situps)) +
geom_point() + ggtitle('Scatterplot 4: relationship between systolic blood pressure and sit-ups') + geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
To further inspect significance and the R-squared value, we create a fourth model with only age, gender, and body-fat percentage as independent variables. Since the R-squared value barely changes when the systolic variable is removed, we conclude that this simpler model is the best fitting one so far.
model4 <- lm(situps ~ age + bod + gender)
summary(model4)
##
## Call:
## lm(formula = situps ~ age + bod + gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.560 -5.756 0.431 6.062 40.640
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 69.047811 0.420554 164.18
## age -0.460925 0.006072 -75.92
## bod -0.727015 0.013686 -53.12
## genderM 7.222971 0.199695 36.17
## Pr(>|t|)
## (Intercept) <2e-16 ***
## age <2e-16 ***
## bod <2e-16 ***
## genderM <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.244 on 13382 degrees of freedom
## Multiple R-squared: 0.5806, Adjusted R-squared: 0.5805
## F-statistic: 6176 on 3 and 13382 DF, p-value: < 2.2e-16
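The "does dropping a predictor hurt the fit?" reasoning used above can be formalized with a nested-model F test via anova(), a step not in the original workflow; the sketch below is illustrative only and uses the built-in mtcars data rather than bodyperformance:

```r
# Illustrative sketch on mtcars: does hp add explanatory power beyond wt?
# anova() compares the nested models with an F test; a small p-value means
# the extra predictor significantly improves the fit.
full    <- lm(mpg ~ wt + hp, data = mtcars)
reduced <- lm(mpg ~ wt, data = mtcars)
pval <- anova(reduced, full)$`Pr(>F)`[2]
pval < 0.05  # TRUE in this toy example: hp is worth keeping
```

The same comparison could be run on our own nested models (e.g. model 3 versus model 4) to complement the R-squared comparison.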
Subsequently, the last step in our modelling process is the residual analysis. We plot the residuals for model 4 as well as a histogram to analyze their distribution. The histogram is close to a normal distribution, with most residuals clustered around 0 (the mean). The QQ plot shows a linear trend with a few deviations from normality, but we believe they do not materially affect the regression model. Overall, the residual analysis further supports our conclusion that this is the best fitting model.
library(gridExtra)
library(grid)
library(ggfortify)
par(mfrow=c(2,2))
fit <- model4
p <- autoplot(fit, label.size = 3)
gridExtra::grid.arrange(grobs = p@plots, top = "Residual plots 1: residual analysis of model 4")
An initial collinearity analysis was carried out in the exploratory and inferential analysis when we removed the independent variable with strong collinearity (i.e., weight). Nevertheless, we compute variance inflation factors (VIF) for the current model to ensure that collinearity is acceptable. As observed, the values are close to 1, which is an acceptable level of collinearity. As such, the collinearity analysis for model 4 is valid.
library(car)
vif(model4)
## age bod gender
## 1.072061 1.538411 1.452538
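As a reminder of what these numbers mean, a predictor's VIF equals 1 / (1 - R^2) from regressing that predictor on the other predictors; the simulated sketch below (not drawn from our dataset) illustrates this:

```r
# Illustrative sketch (simulated data): VIF of x1 computed by hand as
# 1 / (1 - R^2) from regressing x1 on the other predictor x2.
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.5)        # deliberately correlated with x1
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1 > 2  # TRUE: the built-in correlation inflates the variance
```

Values near 1, as in our model, mean each predictor carries mostly independent information.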
To further assess the robustness of our model, we use the k-fold cross-validation method. This method estimates prediction error (and related quantities such as bias and confidence intervals) for sample estimates, and is useful for assessing robustness and validating models.
Using 10-fold cross-validation, we assess our model with respect to the best polynomial fit of the independent variables retained in model 4, seeking to visualize whether polynomial terms fit the data better.
Based on Quantile plot 1, a polynomial of degree 2 is the best fit for body-fat percentage: while the error rate decreases slightly between the degree-2 and degree-3 fits, the improvement is too small to justify the extra term. Moreover, as shown in Quantile plot 2, the best polynomial fit for age is of degree 3, since the drop in error between degrees 2 and 3 is more substantial.
library(boot)
par(mfrow=c(2,2))
cv.error.10 <- numeric(10)
for (i in 1:10) {
  glm.fit <- glm(`sit-ups counts` ~ poly(`body fat_%`, i), data = bodyperformance)
  cv.error.10[i] <- cv.glm(bodyperformance, glm.fit, K = 10)$delta[1]
}
cv.error.10
## [1] 127.7367 126.9616 126.7471 126.7982
## [5] 126.5718 126.5499 126.5759 126.6880
## [9] 126.6217 126.5635
cv.df <- data.frame(degree = 1:10,
                    cv.error = cv.error.10)
qplot(data = cv.df, x = degree, y = cv.error, geom = "line",
      ylab = "10-fold CV error estimate",
      main = "Quantile plot 1: relationship between polynomial degrees and CV error") + geom_point()
for (i in 1:10) {
  glm.fit <- glm(`sit-ups counts` ~ poly(age, i), data = bodyperformance)
  cv.error.10[i] <- cv.glm(bodyperformance, glm.fit, K = 10)$delta[1]
}
cv.error.10
## [1] 143.3647 141.7271 140.2568 140.2172
## [5] 140.2141 140.1552 140.0155 140.0358
## [9] 140.0091 140.0739
cv.df2 <- data.frame(degree = 1:10,
                     cv.error = cv.error.10)
qplot(data = cv.df2, x = degree, y = cv.error, geom = "line",
      ylab = "10-fold CV error estimate",
      main = "Quantile plot 2: relationship between polynomial degrees and CV error") + geom_point()
Consequently, we create a 5th model using polynomials of best fit for our independent variables.
model5 <- lm(`sit-ups counts`~poly(`body fat_%`,2)+poly(age,3) + gender, data=bodyperformance)
summary(model5)
##
## Call:
## lm(formula = `sit-ups counts` ~ poly(`body fat_%`, 2) + poly(age,
## 3) + gender, data = bodyperformance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.765 -5.649 0.435 5.990 43.377
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 35.3602 0.1485
## poly(`body fat_%`, 2)1 -608.0931 11.3505
## poly(`body fat_%`, 2)2 -107.1607 9.1700
## poly(age, 3)1 -729.7208 9.4765
## poly(age, 3)2 -83.7032 9.1942
## poly(age, 3)3 61.2470 9.2013
## genderM 6.9756 0.1987
## t value Pr(>|t|)
## (Intercept) 238.194 < 2e-16 ***
## poly(`body fat_%`, 2)1 -53.574 < 2e-16 ***
## poly(`body fat_%`, 2)2 -11.686 < 2e-16 ***
## poly(age, 3)1 -77.003 < 2e-16 ***
## poly(age, 3)2 -9.104 < 2e-16 ***
## poly(age, 3)3 6.656 2.92e-11 ***
## genderM 35.105 < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.151 on 13379 degrees of freedom
## Multiple R-squared: 0.5891, Adjusted R-squared: 0.589
## F-statistic: 3198 on 6 and 13379 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
fit <- model5
p <- autoplot(fit, label.size = 3)
gridExtra::grid.arrange(grobs = p@plots, top = "Residual plots 2: Residual analysis of model 5")
vif(model5)
## GVIF Df
## poly(`body fat_%`, 2) 1.545171 2
## poly(age, 3) 1.094645 3
## gender 1.467718 1
## GVIF^(1/(2*Df))
## poly(`body fat_%`, 2) 1.114921
## poly(age, 3) 1.015186
## gender 1.211494
Compared to the model 4 summary, the model 5 summary shows a higher adjusted R-squared (0.589 versus 0.581), which indicates that the model including the polynomial terms explains slightly more of the variation. To further assess the robustness of model 5, we carry out a residual and multicollinearity analysis.
The residual analysis shows that the model is close to normality. Similarly to model 4, there are a few deviations, but overall the QQ plot still shows a broadly linear trend, and the residuals cluster around 0 (the mean). Regarding collinearity, the GVIF values are acceptable given that they are close to 1, meaning the predictors are largely independent of one another. Consequently, both the residual and collinearity analyses are valid for this model.
Overall, our resampling method helped find the best fitting model. Nevertheless, other aspects must be considered when selecting a model. For instance, the model 5 coefficients are difficult to interpret: since poly() fits orthogonal polynomials, the estimates apply to the orthogonal basis functions \(P_k\) rather than to raw powers of the variables:
\[sit.ups = 35.36 - 608.09\,P_1(body.fat) - 107.16\,P_2(body.fat) - 729.72\,P_1(age) - 83.70\,P_2(age) + 61.25\,P_3(age) + 6.98\,gender\]
Whereas model 4 coefficients are much easier to interpret:
\[sit.ups= -0.46*age - 0.73*body.fat + 7.22*gender + 69.05\]
Interpreting the model 4 coefficients: if we increase age by one unit while controlling for the other variables, the sit-up count decreases by 0.46 units; if we increase body fat by one unit while controlling for the other variables, the sit-up count decreases by 0.73 units. Gender has a strong impact on our dependent variable: shifting it from female (0) to male (1), while controlling for the other variables, increases the sit-up count by 7.22 units.
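As a worked example of the model 4 equation above, using the rounded coefficients from its summary (the helper function name is ours, introduced for illustration):

```r
# Hypothetical helper applying the rounded model 4 coefficients.
predict_situps <- function(age, body_fat, male) {
  69.05 - 0.46 * age - 0.73 * body_fat + 7.22 * male
}
predict_situps(30, 20, 1)  # about 47.9 sit-ups for a 30-year-old male, 20% body fat
```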
In this statistical report, we conducted a series of analyses to answer our research aim, which sought to determine whether individuals are predisposed to physical performance on account of physical and demographic data. To answer this question, using a multiple linear regression model, we filtered our numerical independent variables by removing unjustifiable outliers, removed the diastolic variable due to its low significance, and removed the weight variable due to its collinearity with height. We also analyzed our dichotomous categorical variable and concluded that gender explains a substantial share of the variability, with males performing more sit-ups on average.
Subsequently, we formulated our model with the remaining variables by testing significance and collinearity to find the best fit. As a result, the height variable was removed because it had very low significance once gender was included. The systolic variable was also removed because its coefficient was very small and its removal barely changed the R-squared. The robustness and adequacy of the model were further assessed using residual and multicollinearity analyses, both of which were valid.
Lastly, 10-fold cross-validation was used to test polynomial fits of different degrees. Despite finding a very slight improvement with polynomial terms, we determined that the linear model found previously was preferable since it is much easier to interpret.
To answer our research question, we conclude that body fat, gender, and age affect the number of sit-ups performed. Nevertheless, as mentioned in the discussion, future research is needed to explain the variability left unaccounted for by the model.