Sunday, October 29, 2023

Module 10 Assignment

9.1 

For this assignment, we are expected to use the "cystfibr" data set from the Introductory Statistics with R (ISwR) package, which contains information about patients with cystic fibrosis (ages 7-23) and their lung capacity. The goal of this assignment is to understand the relationship between the variables (age, weight, sex, height, bmp, fev1, rv, frc, tlc, and pemax).

The first thing I wanted to do was explore the data. To do this, I used the "str" and "summary" functions in R to better understand what the data looked like and how to approach it. The resulting products are as shown:
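The original screenshots are not reproduced here. As a minimal sketch of the exploration step, assuming the ISwR package is installed, the two calls would look roughly like this:

```r
# Assumes the ISwR package is installed: install.packages("ISwR")
library(ISwR)

str(cystfibr)      # variable names, types, and the first few values
summary(cystfibr)  # minimum, quartiles, mean, and maximum for each column
```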

From this we can see basic trends and qualities of the data.
The first test is to determine the coefficients of the model. Regressing pemax on age and height gives coefficients of 2.7178 (age) and 0.3397 (height), with an intercept of 17.8600.
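A fit of this kind can be sketched as follows. This assumes the model was an ordinary linear regression of pemax on age and height using lm; the object name fit is only illustrative.

```r
# Assumes the ISwR package is installed: install.packages("ISwR")
library(ISwR)

# Assumed model: pemax explained by age and height
fit <- lm(pemax ~ age + height, data = cystfibr)

# Intercept first, then one slope per predictor
coef(fit)
```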


The primary test I am interested in running concerns the relationship between pemax, height, and age in the data set. As shown, the sum of squares is listed as 231.1695, with the residuals at 84.19767. The degrees of freedom are 1.
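One way to obtain a table with sums of squares and degrees of freedom is to pass a fitted model to anova. This is only a sketch under the assumption that the same pemax ~ age + height model was used; the figures it prints may not match the ones quoted above.

```r
# Assumes the ISwR package is installed: install.packages("ISwR")
library(ISwR)

# Assumed model, refitted here so the snippet runs on its own
fit <- lm(pemax ~ age + height, data = cystfibr)

# One row per predictor plus a Residuals row; columns include Df and Sum Sq
tab <- anova(fit)
print(tab)
```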
9.2

As with the first exercise, I chose to use the str and summary functions to better understand the data and its structure. The ISwR::secher data set describes ultrasonographic measurements of babies taken before birth, together with their birth weights. It has 107 rows and 4 columns, and the summary/structure looks as follows:



From here I am a bit confused, as the model is not something I am familiar with in R. When inputting the model into R as a vector using the same variables, the next step becomes unclear. The regression line looks something like this; however, using the provided formula simply initializes it as a vector. Additionally, it throws an error, though this issue is likely due to improper syntax. Without using a logarithmic transformation, this is the resulting output:
It's apparent that this is an improper fit, as a negative intercept is an impossibility here. More practice and information are required to assess linear regression in statistical and graphical circumstances.
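For comparison, a log-transformed version of the model can be written directly in the lm formula instead of being built as a vector. The sketch below assumes the exercise intends birth weight (bwt) to be modelled from the biparietal diameter (bpd) and abdominal diameter (ad) on the log scale; that reading of the exercise is an assumption.

```r
# Assumes the ISwR package is installed: install.packages("ISwR")
library(ISwR)

# Transformations can be applied inside the formula itself
log_fit <- lm(log(bwt) ~ log(bpd) + log(ad), data = secher)
summary(log_fit)

# Fitted values are on the log scale; exp() returns them to the original
# units, so predicted weights can never be negative
pred_bwt <- exp(fitted(log_fit))
head(pred_bwt)
```

On the log scale a negative intercept is no longer a problem, because it corresponds to a positive multiplicative constant after back-transformation.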






Monday, October 16, 2023

Module #8 Assignment

 For this assignment, we are expected to run an ANOVA hypothesis test. 

1. Firstly, it is necessary to combine the individual response data into three separate vectors to identify the ratings of the high, moderate, and low stress groups.



 Once this is done, the next step is to bind these vectors into a data frame using the as.data.frame and cbind functions, so that the data are held in a single object. The stack function then reshapes the data into a more readable and accessible form before running the ANOVA function.



Using the oneway.test function provides information on the F value, the numerator df, the denominator df, and the p-value of the data under the assumption that the variances are equal (this requires var.equal = TRUE, since the function applies the Welch correction by default).
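The three steps above can be sketched as follows. The ratings in this example are hypothetical placeholder values, not the assignment's data, and the vector names are only illustrative.

```r
# Hypothetical ratings for illustration only (NOT the assignment's data)
high     <- c(9, 8, 10, 9, 7, 8)
moderate <- c(7, 8, 6, 7, 8, 6)
low      <- c(4, 5, 3, 4, 6, 2)

# Bind the vectors column-wise into one data frame
ratings <- as.data.frame(cbind(high, moderate, low))

# stack() produces two columns: values (the ratings) and ind (the group)
stacked <- stack(ratings)

# var.equal = TRUE gives the classical one-way ANOVA F test
result <- oneway.test(values ~ ind, data = stacked, var.equal = TRUE)
print(result)
```

With three groups of six observations, the numerator df is 2 and the denominator df is 15.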




2. The second question asks us to use the zelazo data set from the ISwR package. The data matrix is as follows:
I wasn't sure how to approach this question, so I opted to use the t-test to glean some information and determine some useful characteristics of the set. Since I am writing this after the due date, and for my own benefit, I used the answer key to guide me to the correct next step. Using the same process as in Question 1, I created a data frame from the zelazo data.
Next, I stacked the data, as before.
Then, I conducted a one-way t-test.

Although my numbers are different from those posted, I chalk this up to a misinput of the data or a mistake on my end. Again, this is being done for the purpose of practicing this process and becoming familiar with conducting these tests. 
When the ANOVA test is run, the results show that the null hypothesis (that there is no difference between babies that are trained and those that are not) cannot be rejected, as the p-value is greater than the significance level (0.2239 > 0.05).
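A possible source of differing numbers is the data frame step: zelazo is stored as a list of groups, and stack can be applied to that list directly, which avoids any recycling of values when groups differ in size. The following is a sketch of that route, not necessarily the one used in the answer key, and it may not reproduce the p-value quoted above.

```r
# Assumes the ISwR package is installed: install.packages("ISwR")
library(ISwR)

# stack() accepts a list and keeps each group's own length
walk <- stack(zelazo)
table(walk$ind)  # number of babies per group

# Classical one-way ANOVA across all groups
oneway.test(values ~ ind, data = walk, var.equal = TRUE)
```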




Sunday, October 8, 2023

Module #7 Assignment

 1. 

The data set for this question is as follows:

x <- c(16, 17, 13, 18, 12, 14, 19, 11, 11, 10)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

1.1 The input (x) is assumed to be fixed, while the output (y) responds to x and is therefore inherently random in contrast to the input. Under these circumstances, the relationship is assumed to be linear.

1.2

 Coefficients: intercept 19.206 and slope 3.269
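These coefficients can be checked with lm using the x and y vectors given above; the object name fit is only illustrative.

```r
x <- c(16, 17, 13, 18, 12, 14, 19, 11, 11, 10)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Simple linear regression of y on x
fit <- lm(y ~ x)

# First value is the intercept, second is the slope for x
coef(fit)
```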

2.

2.1. In this scenario, the variables are the waiting time between eruptions of a geyser (waiting) and the duration of each eruption (eruptions); the model relates eruption duration to waiting time.

2.2. eruption.lm = lm(eruptions ~ waiting, data=faithful)

coefficients: -1.874016 (intercept)    0.075628 (waiting)

2.3 4.1762 (the predicted eruption duration for a waiting time of 80 minutes)
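The fit and the prediction can be reproduced with the built-in faithful data set. The waiting time of 80 minutes used below is an assumption about what question 2.3 asked.

```r
# faithful ships with base R, so no extra packages are needed
eruption.lm <- lm(eruptions ~ waiting, data = faithful)
coef(eruption.lm)

# Predicted eruption duration for an assumed waiting time of 80 minutes
new_wait <- data.frame(waiting = 80)
pred <- predict(eruption.lm, newdata = new_wait)
pred
```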

3. The coefficients for the 'mtcars' data frame produce this in R; I used the 'head' function and limited the variables to just the first 5.
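The original output is not reproduced here. As a rough sketch, restricting mtcars to its first five variables and fitting a model could look like this; the choice of mpg as the response is an assumption, since the model used is not stated.

```r
# Keep only the first five variables of the built-in mtcars data frame
first5 <- mtcars[, 1:5]
head(first5)

# Assumed model: mpg explained by the other four retained variables
fit5 <- lm(mpg ~ ., data = first5)
coef(fit5)
```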



4.

Using the lm function, as well as the plot and abline functions, this is the data represented graphically.
The inputs for this in R look like:
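The original screenshot is not reproduced here. A minimal sketch of such inputs, assuming the x and y vectors from question 1 are the data being plotted, might be:

```r
# Assumption: the data are the x and y vectors from question 1
x <- c(16, 17, 13, 18, 12, 14, 19, 11, 11, 10)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

fit <- lm(y ~ x)

# Scatter plot of the raw data, then the fitted regression line on top
plot(x, y, main = "y against x with fitted line")
abline(fit, col = "red")
```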





Sunday, October 1, 2023

Module #6 Assignment

 A.

a. The mean is 11.8

b. Randomly selected 14 and 10

c. The mean for the sample is 12, since (14 + 10) / 2 = 12. The standard deviation of the sample is 2.8284271247462 (or 2.83 to 3sf).

d. In comparison, the sample mean (12) is close to the population mean (11.8). The standard deviation for the population is 2.8565713714171 (or 2.86 to 3sf), so the sample standard deviation is also close to that of the population.
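The sample statistics for the two values selected in (b) can be checked quickly in R. The population figures below are copied from the answers above rather than recomputed, because the population values themselves are not listed here.

```r
# The two randomly selected values from part (b)
sample_vals <- c(14, 10)

sample_mean <- mean(sample_vals)
sample_sd   <- sd(sample_vals)  # sd() divides by n - 1

# Population figures as reported in parts (a) and (d)
pop_mean <- 11.8
pop_sd   <- 2.8565713714171

# Differences between sample and population statistics
c(mean_diff = sample_mean - pop_mean, sd_diff = sample_sd - pop_sd)
```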

B.

n=100

p=0.95

1. Yes, the population has a normal distribution due to its proximity to 1, which suggests a higher confidence in the probability that the result is true.

2. As the value moves further from 1, the chances of the correlation being statistically significant decline drastically. Anything below 0.80 should likely be retested or judged untrustworthy as verifiable evidence. In my opinion, the cut-off for what statisticians and mathematicians should attempt to use as proof lies around 0.76.

B.ii.

A. 5

B. 100

C. pop = (xBar-µ)/(σ/sqrt(n))

6.165939194 (or 6.17 to 3sf). This suggests that the sample is very far off and does not represent the entire population properly. This is likely due to the limited sample size and the resulting variance. Despite the standard deviation being nearly unaffected, the other statistical qualities are negatively impacted by the small sample.
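The formula in part C can be wrapped in a small helper function. The function name z_stat and the input values below are hypothetical and chosen only to show the arithmetic; they are not the values from this question.

```r
# Standardised distance of a sample mean from the population mean:
# (xBar - mu) / (sigma / sqrt(n))
z_stat <- function(x_bar, mu, sigma, n) {
  (x_bar - mu) / (sigma / sqrt(n))
}

# Hypothetical inputs: (52 - 50) / (10 / sqrt(25)) = 2 / 2 = 1
z <- z_stat(x_bar = 52, mu = 50, sigma = 10, n = 25)
z
```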

I do not know which exercise the last question refers to. I checked all three textbooks I bought or have access to and am still confused about what it is asking.

Final Project

  The compiled file is attached to the submission for this assignment itself. Blogger does not allow word documents to be attached as far as...