18. End of Semester Project

For your end of semester project you will do the statistical analysis for a short scientific paper (using real data) and a statistical research report (using fake data).

The project (a PDF)
that you submit
should look like the sample end-of-semester project
shown below

End of Semester Project
Click here to download the above PDF.

Click here to download the above as a Microsoft Word Document you can edit.

A 9 minute video
to help you get started on your
end-of-semester project

I will email you the data you are to use in the end-of-semester project.

Part I. Scientific Paper

Most scientific papers follow the IMRAD structure (Introduction Methods Results And Discussion).

The introduction tells the reader the background, what your research goals are, and why what you are doing is worthwhile.

The methods section tells the reader precisely what you did with enough detail that they could reproduce your research.

The results section tells the reader the results of the experiments, or the outcomes of what you did in the methods section. In this section you avoid explaining or interpreting the results.

The discussion section tells the reader your interpretation of the results, it tells them why you think the results are the way they are and what they imply (especially regarding the goals of your research).

See https://en.wikipedia.org/wiki/IMRAD for more details about the IMRAD structure.

Step-by-step instructions for the Scientific Paper

Note. The scientific paper you are creating for your end-of-semester project contains real data and content. It is written and structured like an actual scientific article. If you expanded it, and did some more research, you could publish it.

In your scientific paper you will compare tick densities in two different counties in New York State.  The raw data on tick densities in NY is collected by the New York State Department of Health (NYS-DOH).  For your project, I will provide you with the NYS-DOH tick data in a form that is easy to work with.

1. Download the “End of Semester Project” word file:  End of Semester Project Word Doc

To create your end-of-semester project you will modify the  “End of Semester Project Word Doc” file. It is easy.

2. Change the author to your name and delete “Department of Mathematics”.

3. You don’t have to change anything in the abstract or the introduction. But, you should read the material in these sections so that you will understand what the paper is about and what goes in the different sections.

4. Skip to the results section, to the part where it says:

3.1  Westchester vs Dutchess Counties
(change that to the counties you are comparing for YOUR project)

5. Make a time series plot of the counties you are comparing.  Test to see if the tick density data from the two counties provide statistically significant evidence that the tick densities in the two counties are correlated.

Note.  A time series plot is when along the x axis you have time (like years).

The following R script contains the data (for Westchester and Dutchess counties) along with the code to make the time series plot and to do the correlation test.  For your project you need to replace the Westchester and Dutchess County tick data with the data for the counties I sent to you. In the R code make sure you change the county names to the names of the counties I sent you.

To download the following R script formatted for online R compilers click Click_Here_R1_Time_Series_R_Script

 
# Tick Data for Dutchess and Westchester and Counties
# From NYS-DOH
#
WestchesterTickDensity = c(30.00, 35.70, 38.30, 168.40, 
 17.70, 21.30, 19.10, 47.60, 
 43.10, 21.10, 11.00, 32.63);
DutchessTickDensity = c(40.0, 36.4, 59.8, 97.4, 
 68.5, 40.3, 49.4, 62.3, 
 63.1, 44.2, 71.8, 52.0);
Years = c(2008, 2009, 2010, 2011, 
 2012, 2013, 2014, 2015,
 2016, 2017, 2018, 2019);
#---------------------------------------------------
# What you need to change
#---------------------------------------------------
county_x_density = DutchessTickDensity
county_y_density = WestchesterTickDensity
#
county_x_name = "Dutchess" # don't forget quotes!
county_y_name = "Westchester" 
your_name = "Prof McCarthy"
#---------------------------------------------------
# No changes needed below this line
#---------------------------------------------------
# Time series plot
minTickDensity = min(county_x_density, county_y_density );
maxTickDensity = max(county_x_density, county_y_density );
#---------- county_x
plot(Years,county_x_density, 
 ylim = c(minTickDensity, maxTickDensity),
 pch = 16,
 cex = 1.5, 
 col = "red", 
 main = "Tick Density vs Year",
 xlab = "Year", 
 ylab = "Tick density ticks per 1000 m^2");
lines(Years,county_x_density, col = "red", lwd =3);
#---------- county_y
points(Years,county_y_density, cex = 1.5, pch = 16, col = "blue");
lines(Years,county_y_density, col = "blue", lwd =3);
#---------- Add legend and caption
legend("topright", 
 legend=c(county_y_name, county_x_name),
 col=c("blue", "red"), 
 pch = c(16, 16),
 lwd = c(3,3), 
 lty=c(1,1), 
 cex=1.5,
 box.lty=0);
title(sub = paste("Analysis by", your_name),cex.sub = 0.75, font.sub = 3, col.sub = "black", adj = 1)
#------------- correlation test
cor.test(county_x_density, county_y_density)
# End of R Script 

The following figure shows the output of the R script and how to report the results of the correlation test in the scientific paper.

Note. How to work with pictures in R and Word. If you are running R on your computer, if you click on the graph that R made, you can copy the graph or save it. Then, you can put it into Microsoft Word.  If you are running R online, if you click on the image output you can usually copy or save it it your computer and then put it in Word. Once the picture is in Word you can crop and adjust the picture if it is too big or has too much white space around it. Just click on the picture and then go to the menu item “Picture Format” and then crop it (or resize it). There are other ways to crop images as well. Use whatever method works best for you.

6. Test to see if the data provides statistically significant evidence that  mean tick density over the years 2008 to 2019 was less than 2011 “spike” in tick densities.

This R script conducts the test for Dutchess county.  You can easily modify it for the two counties you are doing for your project.

For your project you need to replace the Westchester and Dutchess County tick data with the data for the counties I sent to you. In the R code make sure you change the county names to the names of the counties I sent you.

You have to run this analysis twice, once for each county you are doing.

To download the following R script formatted for online R compilers click
Click_Here_R2_t_test_R_Script

# Tick Data for Dutchess and Westchester Counties
# From NYS-DOH
#
WestchesterTickDensity = c(30.00, 35.70, 38.30, 168.40, 
 17.70, 21.30, 19.10, 47.60, 
 43.10, 21.10, 11.00, 32.63);
DutchessTickDensity = c(40.0, 36.4, 59.8, 97.4, 
 68.5, 40.3, 49.4, 62.3, 
 63.1, 44.2, 71.8, 52.0);
Years = c(2008, 2009, 2010, 2011, 
 2012, 2013, 2014, 2015,
 2016, 2017, 2018, 2019);
#---------------------------------------------------
# What you need to change
#---------------------------------------------------
x = DutchessTickDensity
#---------------------------------------------------
# No changes needed below this line
#---------------------------------------------------
"Hypothesis test for claim: mean tick density 2008 - 2019 < 2011 spike in density";
mux = max(x); "spike density"; mux;
xbar = mean(x); "mean density M"; xbar;
Sx = sd(x); "standard deviation SD"; Sx;
n = length(x); 
t_statistic = (xbar - mux)/(Sx/sqrt(n));
"t-statistic"; round(t_statistic, 4);
"degrees of freedom for one sample t-test"; n-1; 
pvalue = pt(t_statistic, n - 1);
"p-value for claim: mean density < spike density"; pvalue;
# End of R Script 

The following figure shows you to report the results of the one sample t-test (hypothesis test1) for the claim:

mean tick density 2008 – 2019 < 2011 tick density spike

with respect to Dutchess county.  A similar  analysis was done for Westchester county.

7. Make a scatter plot of the two counties tick density data and perform a regression analysis.

In the scatter plot, each data point should correspond to the tick densities for the two counties that you are doing for each particular year.

In the following R script the data points are (x, y) = (Dutchess, Westchester) tick densities .

The following R script contains the data and the code to make the scatter plot and do the regression analysis. Remember to modify it for the two counties you are doing for your project.

To download the following R script formatted for online R compilers click
Click_Here_R3_Regression_ScatterPlot_R_Script

 
# Tick Data for Dutchess and Westchester Counties
# From NYS-DOH
#
WestchesterTickDensity = c(30.00, 35.70, 38.30, 168.40, 
 17.70, 21.30, 19.10, 47.60, 
 43.10, 21.10, 11.00, 32.63);
DutchessTickDensity = c(40.0, 36.4, 59.8, 97.4, 
 68.5, 40.3, 49.4, 62.3, 
 63.1, 44.2, 71.8, 52.0);
Years = c(2008, 2009, 2010, 2011, 
 2012, 2013, 2014, 2015,
 2016, 2017, 2018, 2019);
#---------------------------------------------------
# What you need to change
#---------------------------------------------------
county_x_density = DutchessTickDensity
county_y_density = WestchesterTickDensity
#
county_x_name = "Dutchess" # don't forget quotes!
county_y_name = "Westchester" 
your_name = "Prof McCarthy"
#---------------------------------------------------
# No changes needed below this line
#---------------------------------------------------
#-------- Scatter Plot
plot(county_x_density,county_y_density, 
 xlab = paste(county_x_name,"tick density in ticks per 1000 m^2"), 
 ylab = paste(county_y_name,"tick density in ticks per 1000 m^2"),
 main = paste(county_y_name,"vs",county_x_name,"\nData: Yearly tick densities 2008 - 2019"), 
 pch = 16);
#
title(sub = paste("Analysis by", your_name),cex.sub = 0.75, font.sub = 3, col.sub = "black", adj = 1)
#------------ do the regression analysis y ~ x
regression = lm(county_y_density ~ county_x_density); 
summary(regression)
#------------ plot the regression line from the regression analysis
abline(regression, lwd = 3) # plots regression line
#------------ 2011 data point
"2011 county_x tick density"; county_x_density[which(Years == 2011)];
"2011 county_y tick density"; county_y_density[which(Years == 2011)];
# End of R Script 

The following figure shows you the output of the above R script and how to report the results of the regression analysis in your paper.

8. Discussion. Make sure to change the counties in the “Discussion” section to the two counties you have been assigned to compare.

9. References.  The references don’t need to be changed.

Finishing the scientific paper part

Makes sure your finished scientific paper is formatted exactly like the sample. The scientific paper should fit on precisely two pages. The size of the plots should be the same (more or less) as in the sample.

Make sure to read each line in the paper to make sure you have changed the county names to the one’s you have been assigned to compare.

Part II.  Statistical Report

A statistical report is just a catch-phrase to cover reports and publications in which there are a lot of statistics. There is no special format.  However, for your end-of-semester project I want you to format it EXACTLY in the same way as the sample.

The sample statistical report is page 3 of the word document you downloaded for the scientific paper.

How to do the Statistical Report
Note the “statistical report” part of your end-of-semester project uses made-up, fictional data.

1.  The survey of Travel and Tourism students.

A survey is given to Travel and Tourism students. The purpose of the of the survey is to determine if Travel and Tourism students do more touristic things in NYC than other students. From extensive prior research we know quite accurately the percent of students, all students, not just Travel and Tourism majors, that will answer “yes” to the two survey questions. Those percentages are called the “expected” percentages.

The symbol $\rho$ is used to represent the true (but unknown) proportion of Travel and Tourism students who would answer yes. So, actually there should be two $\rho$’s $\rho_1$ and $\rho_2$, since there are two questions, but we won’t be so pedantic.

For each of the two questions we test the hypothesis:

$$(\text{claim}) \ \ H_A: \rho > \text{expected percent}$$

using the exact binomial test. See Unit 10 for a detailed explanation of the exact binomial test.

For the survey question about Pier 25 (Q1) we test:

$$(\text{claim}) \ \ H_A: \rho > 60\%$$

using the data 17 out of the 20 Travel and Tourism said yes to Q1.

Here is an R script to calculate the p-value for Q1 (about Pier 25).

To download the following R script formatted for online R compilers click
Click_Here_R4_Exact_Binomial_p_value_Test_R_Script

# exact binomial test for Ha: p < expected_p
n = 20
x = 17
p = .60;  #expected_p
"sample fraction"; x/n;
"pvalue"; sum(dbinom(x:n, n, p));
# End of R Script 

Here is how you are to report the results.

Make sure to change the text and ALL THE ENTRIES in the table in the statistical report to reflect the data you are given.

Important!! If the p-value < .05 put a “yes” in the “significant column. This is how you signal to your readers that the result was statistically significant. If you get this part wrong, it means you didn’t understand about p-values and you will get a low grade for the end-of-semester project.

Also, don’t forget to change the following:

2.  The histogram of the commute times.

In the sentence above the histogram change the sample size $n$ of the “commute times” survey to match the sample size $n$ of the data for your end-of-semester project:

and do the same in the caption below the histogram:

Here is an R script to create the frequency histogram of commute times. It also calculates the sample mean, sample standard deviation, and 95% CI for the population’s mean commuting time.

Remember to use the end-of-semester project data in the R script.

To download the following R script formatted for online R compilers click
Click_Here_R5_Histogram_R_Script

#----------------- data
commuteTimes = c(31, 23, 33, 50, 43, 44, 47, 
 33, 45, 36, 65, 70, 41, 50, 
 52, 45, 13, 10, 63, 48, 26, 
 50, 43, 34, 55, 52, 38, 59, 
 41, 46, 38, 79, 51, 35, 52, 
 64, 36, 50, 21, 68, 42, 39, 
 38, 52, 36, 49, 37, 10, 13)
#--------------------------------- histogram
hist(commuteTimes, 
 col = "cyan", 
 main = "Commute Times to BMCC", 
 xlab = "minutes");
 title(sub = "analysis by Prof McCarthy",cex.sub = 0.75, font.sub = 3, col.sub = "black", adj = 1)
#------------ mean and sd
"commute times mean M"; mean(commuteTimes);
"commute times standard deviation SD"; sd(commuteTimes);
# ---------- 95% CI
t.test(commuteTimes)
# End of R Script 

Here is how you are to report the results.

Make sure you change the R script so that the histogram, on the bottom, says that the analysis was done by your-name, not by Prof McCarthy.

Finishing the end-of-semester project

Make sure you check every line of the end-of-semester project to make sure you changed everything that needs to be changed. Be especially careful with the captions and the data tables.

When done, convert into a single PDF (which should be 3 pages long).
Upload to blackboard.

 

 

Footnotes

 

  1. If the p-value is < .05 then the sample provides statistically significant evidence that the claim is true (at the $\alpha = .05$ level of significance).