Analysis of Division 1 College Basketball Success Related to Spending
In this project I study NCAA Division I men's college basketball and the factors behind team success. I started with a data set covering each Division I men's basketball team, including its season statistics and its progress in the NCAA tournament. I then found a second data set containing financial information (revenue and expenses) for the sports programs at U.S. colleges. I restricted the financial data to 2013-2019 to match the first data set, cleaned it to keep only Division I men's basketball programs, and merged it with the first data set. I then analyzed how different factors, from how a team plays to how much a school spends on its program, relate to success. The question I hope to answer is “What improves Division I men's college basketball success?”
Here is a glimpse of the data:
Rows: 2,116
Columns: 28
$ TEAM <chr> "North Carolina", "Wisconsin", "Michigan", "Texas …
$ CONF <chr> "ACC", "B10", "B10", "B12", "WCC", "SEC", "B10", "…
$ G <dbl> 40, 40, 40, 38, 39, 40, 38, 39, 38, 39, 40, 40, 40…
$ W <dbl> 33, 36, 33, 31, 37, 29, 30, 35, 35, 33, 35, 36, 32…
$ ADJOE <dbl> 123.3, 129.1, 114.4, 115.2, 117.8, 117.2, 121.5, 1…
$ ADJDE <dbl> 94.9, 93.6, 90.4, 85.2, 86.3, 96.2, 93.7, 90.6, 89…
$ BARTHAG <dbl> 0.9531, 0.9758, 0.9375, 0.9696, 0.9728, 0.9062, 0.…
$ EFG_O <dbl> 52.6, 54.8, 53.9, 53.5, 56.6, 49.9, 54.6, 56.6, 55…
$ EFG_D <dbl> 48.1, 47.7, 47.7, 43.0, 41.1, 46.0, 48.0, 46.5, 44…
$ TOR <dbl> 15.4, 12.4, 14.0, 17.7, 16.2, 18.1, 14.6, 16.3, 14…
$ TORD <dbl> 18.2, 15.8, 19.5, 22.8, 17.1, 16.1, 18.7, 18.6, 17…
$ ORB <dbl> 40.7, 32.1, 25.5, 27.4, 30.0, 42.0, 32.5, 35.8, 30…
$ DRB <dbl> 30.0, 23.7, 24.9, 28.7, 26.2, 29.7, 29.4, 30.2, 25…
$ FTR <dbl> 32.3, 36.2, 30.7, 32.9, 39.0, 51.8, 28.4, 39.8, 29…
$ FTRD <dbl> 30.4, 22.4, 30.0, 36.6, 26.9, 36.8, 22.7, 23.9, 26…
$ `2P_O` <dbl> 53.9, 54.8, 54.7, 52.8, 56.3, 50.0, 53.4, 55.9, 52…
$ `2P_D` <dbl> 44.6, 44.7, 46.8, 41.9, 40.0, 44.9, 47.6, 46.3, 45…
$ `3P_O` <dbl> 32.7, 36.5, 35.2, 36.5, 38.2, 33.2, 37.9, 38.7, 39…
$ `3P_D` <dbl> 36.2, 37.5, 33.2, 29.7, 29.0, 32.2, 32.6, 31.4, 28…
$ ADJ_T <dbl> 71.7, 59.3, 65.9, 67.5, 71.5, 65.9, 64.8, 66.4, 60…
$ WAB <dbl> 8.6, 11.3, 6.9, 7.0, 7.7, 3.9, 6.2, 10.7, 11.1, 8.…
$ POSTSEASON <fct> 2ND, 2ND, 2ND, 2ND, 2ND, 2ND, 2ND, Champions, Cham…
$ SEED <dbl> 1, 1, 3, 3, 1, 8, 4, 1, 1, 1, 2, 1, 7, 1, 4, 3, 6,…
$ YEAR <dbl> 2016, 2015, 2018, 2019, 2017, 2014, 2013, 2015, 20…
$ state_cd <chr> "NC", "WI", "MI", "TX", "WA", "KY", "MI", "NC", "V…
$ classification_name <chr> "NCAA Division I-FBS", "NCAA Division I-A", "NCAA …
$ REVENUE_MENALL <dbl> 21342328, 21309114, 20027574, 12382899, 12131756, …
$ EXPENSE_MENALL <dbl> 8667111, 7473012, 9982947, 12338645, 8874752, 1619…
TEAM: The Division I college basketball school
CONF: The athletic conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, Horz = Horizon League, Ivy = Ivy League, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = Southeastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference)
G: Number of games played
W: Number of games won
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
BARTHAG: Power Rating (chance of beating an average Division I team); used in this analysis as the strength-of-schedule measure
EFG_O: Effective Field Goal Percentage Shot
EFG_D: Effective Field Goal Percentage Allowed
TOR: Turnover Percentage Allowed (Turnover Rate)
TORD: Turnover Percentage Committed (Steal Rate)
ORB: Offensive Rebound Rate
DRB: Offensive Rebound Rate Allowed
FTR: Free Throw Rate (how often the given team shoots free throws)
FTRD: Free Throw Rate Allowed
2P_O: Two-Point Shooting Percentage
2P_D: Two-Point Shooting Percentage Allowed
3P_O: Three-Point Shooting Percentage
3P_D: Three-Point Shooting Percentage Allowed
ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)
WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)
POSTSEASON: Round where the given team was eliminated or where its season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champions = Winner of the NCAA March Madness Tournament for that given year)
SEED: Seed in the NCAA March Madness Tournament
YEAR: Season
REVENUE_MENALL: Revenue of the school's men’s basketball program (shown as Average Revenue in the summaries)
EXPENSE_MENALL: Expenses of the school's men’s basketball program (shown as Average Expense in the summaries)
These conclusions come from comparing the summaries of team performance. An upset team is defined as a team seeded lower than 8th (seed number above 8) that reached at least the Sweet 16, or a team seeded lower than 5th that reached the Final Four. Teams that shoot better and have better efficiency ratings typically go further in the tournament. Teams that play at a slower tempo also tend to do better, and upset teams stand out by playing at a lower tempo than is typical for their seeding. Another important variable is how much a college spends at each level of the tournament: expenses and revenue both appear to be correlated with how far a team advances.
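The upset definition above can be written as a filter. The following is a minimal sketch on made-up rows, assuming the `SEED` and `POSTSEASON` columns described in the variable list; the helper vectors `sweet16_or_better` and `final_four_or_better` are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Hypothetical toy rows, not real tournament results.
toy <- data.frame(
  TEAM       = c("A", "B", "C", "D", "E"),
  SEED       = c(1, 11, 7, 12, 6),
  POSTSEASON = c("F4", "S16", "F4", "R64", "R32")
)

# Rounds that count as "at least the Sweet 16" and "at least the Final Four".
sweet16_or_better    <- c("S16", "E8", "F4", "2ND", "Champions")
final_four_or_better <- c("F4", "2ND", "Champions")

# Keep teams seeded below 8th that reached the Sweet 16,
# or teams seeded below 5th that reached the Final Four.
upsets <- toy %>%
  filter(
    (SEED > 8 & POSTSEASON %in% sweet16_or_better) |
      (SEED > 5 & POSTSEASON %in% final_four_or_better)
  )

upsets$TEAM
```

In this toy example only teams B (an 11 seed in the Sweet 16) and C (a 7 seed in the Final Four) qualify as upsets.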
Based on the first graph, higher spending appears to be correlated with a stronger strength of schedule. The second graph (Sweet 16 or higher) shows that the majority of teams that reach the Sweet 16 spend at least 5 million dollars. It also suggests that a stronger strength of schedule is correlated with advancing further in the tournament.
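One way to make the 5 million dollar observation checkable is to compute the share of Sweet 16 teams at or above that threshold. This is only a sketch: the expense figures are made up, and `share_at_least_5m` is a hypothetical name.

```r
# Hypothetical expenses (in dollars) for four Sweet 16 teams; not real data.
sweet16_expense <- c(4e6, 6e6, 8e6, 12e6)

# Share of these teams spending at least 5 million dollars.
share_at_least_5m <- mean(sweet16_expense >= 5e6)
share_at_least_5m
```

With the merged data, the same expression could be applied to the expense column of the Sweet 16 subset (`Sweet16andup$EXPENSE_MENALL` in the dashboard source).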
The two correlation figures show that the data for all team-seasons and the data grouped by team give different amounts of correlation. When the data are not grouped by team, there does not seem to be a strong correlation between wins and expenses in a single year. However, when each team is averaged over all its years, there is a strong correlation, suggesting that higher spending is associated with more wins over a longer period. Strength of schedule has an important correlation with wins in both correlation charts. This fits the idea that teams that play better opponents typically become better and win more games.
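The difference between the two correlation views can be illustrated on a small made-up example. The sketch below assumes columns named like those in the merged data (`TEAM`, `W`, `EXPENSE_MENALL`); all numbers are hypothetical. Averaging over seasons removes year-to-year noise, which is one reason a team-level correlation can look stronger than a season-level one.

```r
library(dplyr)

# Hypothetical toy data: three teams, three seasons each.
seasons <- data.frame(
  TEAM = rep(c("A", "B", "C"), each = 3),
  W = c(20, 28, 24, 18, 12, 15, 10, 6, 8),
  EXPENSE_MENALL = c(9e6, 8e6, 10e6, 5e6, 6e6, 4e6, 2e6, 3e6, 1e6)
)

# Season-level correlation: every team-season is one observation.
season_cor <- cor(seasons$W, seasons$EXPENSE_MENALL)

# Team-level correlation: average each team over its seasons first.
team_means <- seasons %>%
  group_by(TEAM) %>%
  summarize(Wins = mean(W), Expense = mean(EXPENSE_MENALL))
team_cor <- cor(team_means$Wins, team_means$Expense)

c(season = season_cor, team = team_cor)
```

In this toy data the team-level correlation is higher than the season-level one, which mirrors the pattern described above but does not prove that spending causes wins.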
There seems to be a positive correlation between wins and expenses in each conference. Conferences that spend more typically have much better win averages. This could be for a number of reasons: more spending could improve the coaching staff, and it could also pay for better facilities, training programs, and equipment.
Some conferences also have extreme outliers. In the SEC, Kentucky greatly outspends and outperforms the rest of the conference. Another college to look at is Indiana, which spends the most in the Big Ten at about 12 million dollars per year on average, yet many teams outperform it in wins; Iowa, which spends around 5 million dollars less, has had a similar amount of success. Teams that spend large amounts without matching success may be overspending in some area. For the amount of money put into these programs, there is likely some mismanagement or misguided allocation of resources that keeps the high spending from turning into more wins.
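A simple way to flag teams whose spending is not matched by wins is to compute expense per win. The sketch below uses made-up teams and figures; `expense_per_win` is a hypothetical measure that is not part of the original analysis.

```r
library(dplyr)

# Hypothetical conference summary; teams and numbers are illustrative only.
teams <- data.frame(
  TEAM = c("Team X", "Team Y", "Team Z"),
  ave_wins = c(20, 20, 25),
  ave_expense = c(12e6, 7e6, 9e6)
)

# Dollars spent per win, with the least efficient spender listed first.
ranked <- teams %>%
  mutate(expense_per_win = ave_expense / ave_wins) %>%
  arrange(desc(expense_per_win))

ranked
```

Here Team X spends the most per win even though Team Y wins just as often, which is the kind of gap described for Indiana and Iowa.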
If colleges are grouped by state, one can see how much different states spend on their college basketball programs and how that relates to wins. In some large states, higher spending seems to be somewhat correlated with the lack of an NBA team in the state, as in Kentucky and Washington. Because these states do not have NBA teams, consumers there may be more likely to spend on their college basketball teams. That revenue could lead to more spending on the programs and more successful teams over the years.
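Grouping by state depends on turning the two-letter `state_cd` codes into full state names. A minimal sketch using R's built-in `state.abb` and `state.name` vectors, which cover only the 50 states, so a code such as "DC" maps to `NA`:

```r
# Example codes; "DC" is included to show what happens to a non-state code.
state_cd <- c("KY", "WA", "DC", "NC")

# Look up each code's position in state.abb and take the matching name.
state_name <- state.name[match(state_cd, state.abb)]
state_name
```

Rows that end up with `NA` here would be dropped by a later `na.omit()`, so schools outside the 50 states do not appear in the state-level results.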
In conclusion, the success of college basketball teams tends to be linked to attributes such as high offensive efficiency, low defensive efficiency (fewer points allowed), low tempo, and better shooting percentages, all of which increase a team's chance of winning. Looking further at how a team gets better shooting percentages or becomes more efficient, it seems that colleges that spend more on their programs get a stronger strength of schedule and improve outside factors (coaching, facilities, recruiting, etc.). The main limitation is that this study covers the period before NIL deals were allowed in the NCAA. If money had this much of an impact before college players could be paid, what will be the effect of college spending after this change? Will it widen the divide between schools that can afford to spend millions and those that cannot?
I used data from two sources and merged them. The first data set is from Kaggle and was compiled by Andrew Sundberg. The second comes from the yearly reports of EADA (Equity in Athletics Data Analysis), which record each year's college spending on sports. I used the EADA data sets from 2013-2019.
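Merging the two sources depends on the school names matching exactly, so it helps to check which teams fail to match after cleaning. The sketch below is a simplified version of that idea under stated assumptions: the institution names, wins, and expenses are illustrative rather than taken from the real files, and only a few of the cleaning rules are reproduced. `anti_join()` lists the basketball rows that found no financial match; here "UCLA" is left unmatched on purpose because its cleaned name ("Cal Los Angeles") is not recoded.

```r
library(dplyr)
library(stringr)

# Hypothetical basketball rows (wins are made up).
cbb <- data.frame(
  TEAM = c("Michigan", "Ohio St.", "UCLA"),
  YEAR = c(2015, 2015, 2015),
  W = c(16, 24, 22)
)

# Hypothetical financial rows with long institution names.
finance <- data.frame(
  institution_name = c(
    "University of Michigan-Ann Arbor",
    "Ohio State University-Main Campus",
    "University of California-Los Angeles"
  ),
  Year = c(2015, 2015, 2015),
  EXPENSE_MENALL = c(8e6, 7e6, 6e6)
)

# Shorten institution names toward the basketball data's naming style.
remove <- c("University ", " University", "of ", "-Main Campus")
finance <- finance %>%
  mutate(
    Team = institution_name %>%
      str_remove_all(paste(remove, collapse = "|")) %>%
      str_replace_all("State", "St.") %>%
      str_replace_all("California", "Cal") %>%
      str_replace_all("-", " ") %>%
      recode("Michigan Ann Arbor" = "Michigan")
  )

# Merge, then list the basketball rows that found no financial match.
merged <- left_join(cbb, finance, by = c("TEAM" = "Team", "YEAR" = "Year"))
unmatched <- anti_join(cbb, finance, by = c("TEAM" = "Team", "YEAR" = "Year"))
unmatched$TEAM
```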
---
title: "NCAA D1 Basketball DashBoard"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: materia
      primary: "#7DF9FF"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
<style type="text/css">
.chart-title{ /* chart_title */
font-size: 20px;
}
body{
/* Normal*/
font-size: 16px;
}
</style>
```{r setup, include=FALSE}
library(flexdashboard)
```
Basic Information
===
Column {data-width=600}
---
***Analysis of Division 1 College Basketball Success Related to Spending***
### Introduction
In this project I study NCAA Division I men's college basketball and the factors behind team success. I started with a data set covering each Division I men's basketball team, including its season statistics and its progress in the NCAA tournament. I then found a second data set containing financial information (revenue and expenses) for the sports programs at U.S. colleges. I restricted the financial data to 2013-2019 to match the first data set, cleaned it to keep only Division I men's basketball programs, and merged it with the first data set. I then analyzed how different factors, from how a team plays to how much a school spends on its program, relate to success. The question I hope to answer is "What improves Division I men's college basketball success?"
Here is a glimpse of the data:
```{r load_data}
setwd("C:/Math209")
library(pacman)
p_load(tidyverse, maps, viridis,usmap,plotly,DT,dplyr,corrgram)
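# List the yearly EADA files; the Year assignment below assumes they sort in year order, with 2013 first.
allfiles <- list.files("C:/Math209/Math209Final", pattern = "\\.csv$", full.names = TRUE)
final_data <- data.frame()
for (i in seq_along(allfiles)) {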
temp_data <- read_csv(allfiles[i])
temp_data$Year <- rep(2012+i, nrow(temp_data))
temp_data <- temp_data %>% filter(Sports == "Basketball")
temp_data<-temp_data %>% select(institution_name,state_cd,classification_name,REVENUE_MENALL,EXPENSE_MENALL,Year)
final_data <- rbind.data.frame(final_data, temp_data)
}
write_csv(final_data, "School_finance1.csv")
CB<-read_csv("C:/Math209/cbb.csv")
School_finance<-read_csv("C:/Math209/School_finance1.csv")
# Keep only the Division I classifications.
d1_classes <- c("NCAA Division I without football", "NCAA Division I-A", "NCAA Division I-AA", "NCAA Division I-AAA", "NCAA Division I-FBS", "NCAA Division I-FCS")
D1_finance <- filter(School_finance, classification_name %in% d1_classes)
CB2015<-filter(CB,YEAR==2015)
D1_finance2015<-filter(D1_finance,Year==2015)
remove <- c("University ", " University" ,"of ", "the ", "The ", " the ", "-Main Campus", " at")
D1_finance$Team <- str_remove_all(D1_finance$institution_name, paste(remove, collapse = "|"))
D1_finance$Team <- str_replace_all(D1_finance$Team,"State", "St." )
D1_finance$Team <- str_replace_all(D1_finance$Team,"Pennsylvania", "Penn" )
D1_finance$Team <- str_replace_all(D1_finance$Team,"California", "Cal" )
D1_finance$Team <- str_replace_all(D1_finance$Team,"-", " " )
CB$POSTSEASON<-as.factor(CB$POSTSEASON)
D1_finance$Team<-D1_finance$Team %>% recode(
'Michigan Ann Arbor'= "Michigan",
'North Carolina Chapel Hill'="North Carolina",
'Wisconsin Madison'="Wisconsin",
'Cal Los Angeles'="UCLA",
'at Buffalo'="Buffalo",
'Wisconsin Green Bay'="Green Bay",
'Wisconsin Milwaukee'="Milwaukee",
'Miami Oxford'="Miami OH",
'Bowling Green St.'="Bowling Green",
'Miami'="Miami FL",
'Cal Polytechnic St. San Luis Obispo'="Cal Poly",
'Alabama A & M'="Alabama A&M" ,
'California St.-Sacramento'="Sacramento St." ,
'Cal St. Bakersfield' ="Cal St. Bakersfield",
'Cal St. Northridge' ="Cal St. Northridge",
'Davidson College'="Davidson",
'Texas A & M College Station'="Texas A&M",
'College Charleston'="College of Charleston",
'Colorado Boulder'= "Colorado",
'Colorado St. Fort Collins'= "Colorado St.",
'Detroit Mercy'="Detroit",
'St Bonaventure'="St. Bonaventure",
'Pittsburgh Pittsburgh Campus'="Pittsburgh",
'Central Florida'="UCF",
'Southern Cal'="USC",
'Illinois Urbana Champaign'="Illinois",
'Middle Tennessee St.'="Middle Tennessee",
'Lafayette College'="Lafayette",
'Pittsburgh ' = "Pittsburgh",
'Louisiana St. and Agricultural & Mechanical College'="LSU",
'North Carolina Wilmington'="UNC Wilmington",
'North Carolina Greensboro'="UNC Greensboro",
'Prairie View A & M'="Prairie View A&M",
"Saint Mary's College Cal" = "Saint Mary's",
'Cal Irvine' ="UC Irvine" ,
'Cal Davis' ="UC Davis" ,
'Cal Berkeley' = "California",
'Akron Main Campus'="Akron",
"Rutgers New Brunswick"="Rutgers",
'Texas Christian'="TCU",
'Nebraska Lincoln'="Nebraska",
'Tennessee Chattanooga'="Chattanooga",
'Minnesota Twin Cities'="Minnesota",
'College Holy Cross'="Holy Cross",
'Indiana Bloomington'="Indiana",
'Missouri Columbia'="Missouri",
'Hawaii Manoa'="Hawaii",
'Kent St. Kent'="Kent St.",
'Providence College'="Providence",
'Virginia Polytechnic Institute and St.'="Virginia Tech",
'Fairleigh Dickinson Metropolitan Campus'="Fairleigh Dickinson",
"St John's New York"="St. John's",
"Manhattan College"="Manhattan",
"Virginia Commonwealth"="VCU",
"Nevada Las Vegas"="UNLV",
"Alabama Birmingham"="UAB",
"Cal St. Fresno" ="Fresno St.",
"Maryland College Park"="Maryland",
"Brigham Young Provo"="BYU",
"Southern and A & M College"="Southern",
"Nevada Reno"="Nevada",
"Stephen F Austin St." ="Stephen F. Austin",
"Austin Peay St."= "Austin Peay",
"Maryland Baltimore County"="UMBC",
"Wofford College"="Wofford",
"Iona College"="Iona",
"Massachusetts Amherst" ="Massachusetts",
"Texas Austin"="Texas",
"Southern Methodist"="SMU",
"Northwestern St. Louisiana"="Northwestern St.",
"SUNY Albany" ="Albany",
"Oklahoma Norman Campus"="Oklahoma",
"North Carolina Asheville"="UNC Asheville",
"North Carolina A & T St."="North Carolina A&T",
"North Carolina St. Raleigh"="North Carolina St.",
"South Carolina Columbia"="South Carolina",
"Washington Seattle Campus"="Washington"
)
CBfinal<-CB %>%
left_join(D1_finance, by = c("TEAM" = "Team", "YEAR"="Year"))
CBfinal$POSTSEASON<-factor(CBfinal$POSTSEASON, levels=c("Champions","2ND","F4","E8","S16", "R32","R64","R68"))
CBfinal<-filter(CBfinal, REVENUE_MENALL>=0)
# Drop the long institution name (column 25) before printing the glimpse.
CBfinal3<-select(CBfinal, -institution_name)
glimpse(CBfinal3)
options(scipen = 999)
```
Column {data-width=400}
---
### Variable Introduction
TEAM: The Division I college basketball school
CONF: The athletic conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, Horz = Horizon League, Ivy = Ivy League, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = Southeastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference)
G: Number of games played
W: Number of games won
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
BARTHAG: Power Rating (chance of beating an average Division I team); used in this analysis as the strength-of-schedule measure
EFG_O: Effective Field Goal Percentage Shot
EFG_D: Effective Field Goal Percentage Allowed
TOR: Turnover Percentage Allowed (Turnover Rate)
TORD: Turnover Percentage Committed (Steal Rate)
ORB: Offensive Rebound Rate
DRB: Offensive Rebound Rate Allowed
FTR: Free Throw Rate (how often the given team shoots free throws)
FTRD: Free Throw Rate Allowed
2P_O: Two-Point Shooting Percentage
2P_D: Two-Point Shooting Percentage Allowed
3P_O: Three-Point Shooting Percentage
3P_D: Three-Point Shooting Percentage Allowed
ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)
WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)
POSTSEASON: Round where the given team was eliminated or where its season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champions = Winner of the NCAA March Madness Tournament for that given year)
SEED: Seed in the NCAA March Madness Tournament
YEAR: Season
REVENUE_MENALL: Revenue of the school's men's basketball program (shown as Average Revenue in the summaries)
EXPENSE_MENALL: Expenses of the school's men's basketball program (shown as Average Expense in the summaries)
Team Statistical Analysis
===
Column {.tabset data-width=800}
---
### Team Summary
```{r teamsummary}
round_df <- function(df, digits) {
nums <- vapply(df, is.numeric, FUN.VALUE = logical(1))
df[,nums] <- round(df[,nums], digits = digits)
(df)
}
df_Team<-group_by(CBfinal,TEAM)
Team_summary<-summarize(df_Team,
Reb= mean(ORB),
ADJOE= mean(ADJOE),
ADJDE= mean(ADJDE),
tempo=mean(ADJ_T),
`3%`=mean(`3P_O`),
`2%`=mean(`2P_O`),
str_schd=mean(BARTHAG),
Turnovers=mean(TOR),
Steal=mean(TORD),
Revenue=mean(REVENUE_MENALL),
Expense=mean(EXPENSE_MENALL),
Wins=mean(W)
)
Team_summary<-round_df(Team_summary, digits=2)
# Labels follow the column order of the summary above (3-point % comes before 2-point %).
DT::datatable(Team_summary, colnames = c("College","Average rebounds", "Average Offensive Efficiency", "Average Defensive Efficiency", "Average Tempo", "Average Shooting % 3 pointers","Average Shooting % 2 pointers", "Strength of Schedule","Turnover Rate","Steal rate", "Average Revenue","Average Expense","Average Wins"))
```
### Postseason Summary
```{r postsummary}
CBfinal$POSTSEASON<-as.factor(CBfinal$POSTSEASON)
df_postseason<-group_by(CBfinal,POSTSEASON)
Postseason_summary<-summarize(df_postseason,
ave_rebounds = mean(ORB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_turnovers=mean(TOR),
ave_steal=mean(TORD),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_wins=mean(W)
)
Postseason_summary<-round_df(Postseason_summary, digits=2)
# Labels follow the column order of the summary above (3-point % comes before 2-point %).
DT::datatable(Postseason_summary, colnames = c("Postseason Finish","Average rebounds", "Average Offensive Efficiency", "Average Defensive Efficiency", "Average Tempo", "Average Shooting % 3 pointers","Average Shooting % 2 pointers", "Strength of Schedule","Turnover Rate","Steal Rate","Average Revenue","Average Expense","Average Wins"),options = list(dom = 't'))
```
### Sweet 16 or Better Summary
```{r summary 2}
Sweet16andup<-filter(CBfinal,!POSTSEASON=="R32",!POSTSEASON=="R64",!POSTSEASON=="R68")
Sweet16_summary<- Sweet16andup %>% summarize(
averebounds= mean(ORB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_turnovers=mean(TOR),
ave_steal=mean(TORD),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_wins=mean(W)
)
Sweet16_summary<-round(Sweet16_summary,2)
# Labels follow the column order of the summary above (3-point % comes before 2-point %).
DT::datatable(Sweet16_summary, colnames = c("Average rebounds", "Average Offensive Efficiency", "Average Defensive Efficiency", "Average Tempo", "Average Shooting % 3 pointers","Average Shooting % 2 pointers", "Strength of Schedule","Turnover Rate","Steal rate", "Average Revenue","Average Expense","Average Wins"),options = list(dom = 't'))
```
### Upset Summary
```{r summary 3}
# Upset: seeded below 8th and reached at least the Sweet 16, or seeded below 5th and reached at least the Final Four.
upset<-filter(CBfinal,(SEED>8 & POSTSEASON %in% c("S16","E8","F4","2ND","Champions"))|(SEED>5 & POSTSEASON %in% c("F4","2ND","Champions")))
upset_summary<- upset %>% summarize(
averebounds= mean(ORB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_turnovers=mean(TOR),
ave_steal=mean(TORD),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_wins=mean(W)
)
upset_summary<-round(upset_summary,2)
# Labels follow the column order of the summary above (3-point % comes before 2-point %).
DT::datatable(upset_summary, colnames = c("Average rebounds", "Average Offensive Efficiency", "Average Defensive Efficiency", "Average Tempo", "Average Shooting % 3 pointers","Average Shooting % 2 pointers", "Strength of Schedule","Turnover Rate","Steal Rate", "Average Revenue","Average Expense","Average Wins"),options = list(dom = 't'))
```
Column {data-width=200}
---
### Explanation
These conclusions come from comparing the summaries of team performance. An upset team is defined as a team seeded lower than 8th (seed number above 8) that reached at least the Sweet 16, or a team seeded lower than 5th that reached the Final Four. Teams that shoot better and have better efficiency ratings typically go further in the tournament. Teams that play at a slower tempo also tend to do better, and upset teams stand out by playing at a lower tempo than is typical for their seeding. Another important variable is how much a college spends at each level of the tournament: expenses and revenue both appear to be correlated with how far a team advances.
Exploratory Data Analysis
===
Column {.tabset data-width=750}
---
### Spending on all Teams in Playoffs
```{r all teams}
CBPost<-filter(CBfinal, !POSTSEASON=="R68")
ggplot(CBPost,aes(x=BARTHAG,y=EXPENSE_MENALL))+geom_point()+labs(title = "Spending and Strength of Schedule effect on Postseason", x="Strength of Schedule",y="Expense ($)")
```
### Spending on Teams in Sweet 16 or higher
```{r sweet16}
ggplot(Sweet16andup,aes(x=BARTHAG,y=EXPENSE_MENALL))+geom_point()+labs(title = "Spending and Strength of Schedule effect on Postseason",subtitle = "For teams finishing in at least the Sweet 16", x="Strength of Schedule",y="Expense ($)")
```
### Correlation
```{r correlation graph}
CBfinal1<-select(CBfinal, c(4,5,6,7,10,11,12,16,18,20,28,29))
CBfinal1<-CBfinal1 %>% rename(
Str_Schd=BARTHAG,
Steal=TORD,
Revenue=REVENUE_MENALL,
Expense=EXPENSE_MENALL,
`2%`=`2P_O`,
`3%`=`3P_O`,
Tempo=ADJ_T,
Wins=W)
corrgram(CBfinal1,
order = TRUE, # If TRUE, PCA-based re-ordering
upper.panel = panel.pie, # Panel function above diagonal
lower.panel = panel.shade, # Panel function below diagonal
text.panel = panel.txt, # Panel function of the diagonal
main = "Correlation between College Basketball Statistics")
```
### Correlation Team
```{r correlation_team}
Team_summary1<-select(Team_summary,-1)
corrgram(Team_summary1,
order = TRUE, # If TRUE, PCA-based re-ordering
upper.panel = panel.pie, # Panel function above diagonal
lower.panel = panel.shade, # Panel function below diagonal
text.panel = panel.txt, # Panel function of the diagonal
main = "Correlation between different team statistics")
```
Column {data-width=250}
---
### Explanation
Based on the first graph, higher spending appears to be correlated with a stronger strength of schedule. The second graph (Sweet 16 or higher) shows that the majority of teams that reach the Sweet 16 spend at least 5 million dollars. It also suggests that a stronger strength of schedule is correlated with advancing further in the tournament.
The two correlation figures show that the data for all team-seasons and the data grouped by team give different amounts of correlation. When the data are not grouped by team, there does not seem to be a strong correlation between wins and expenses in a single year. However, when each team is averaged over all its years, there is a strong correlation, suggesting that higher spending is associated with more wins over a longer period. Strength of schedule has an important correlation with wins in both correlation charts. This fits the idea that teams that play better opponents typically become better and win more games.
Conference Effect
===
Column {.tabset data-width=800}
---
### Conferences Averaging Over 10 Wins
```{r conference}
CBfinal<-mutate(CBfinal,
netrevenue=REVENUE_MENALL-EXPENSE_MENALL
)
Conference<-group_by(CBfinal,CONF)
Conference<- Conference %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
Conference<-filter(Conference, ave_wins>10)
ggplot(Conference, aes(x=ave_wins,y=ave_expense, label=CONF))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "Conference Impact on Spending and Wins") + xlim(10,25)
```
### ACC
```{r ACC}
ACC<-filter(CBfinal,CONF=="ACC")
Team<-group_by(ACC,TEAM)
Team<- Team %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
ggplot(Team, aes(x=ave_wins,y=ave_expense, label=TEAM))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "ACC Impact on Spending and Wins")+ xlim(10,30)
```
### B10
```{r Big10}
B10<-filter(CBfinal,CONF=="B10")
Team<-group_by(B10,TEAM)
Team<- Team %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
ggplot(Team, aes(x=ave_wins,y=ave_expense, label=TEAM))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "B10 Impact on Spending and Wins")+ xlim(10,30)
```
### B12
```{r Big12}
B12<-filter(CBfinal,CONF=="B12")
Team<-group_by(B12,TEAM)
Team<- Team %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
ggplot(Team, aes(x=ave_wins,y=ave_expense, label=TEAM))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "B12 Impact on Spending and Wins")+ xlim(10,30)
```
### SEC
```{r SEC}
SEC<-filter(CBfinal,CONF=="SEC")
Team<-group_by(SEC,TEAM)
Team<- Team %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
ggplot(Team, aes(x=ave_wins,y=ave_expense, label=TEAM))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "SEC Impact on Spending and Wins")+ xlim(10,30)
```
### P12
```{r Pac12}
P12<-filter(CBfinal,CONF=="P12")
Team<-group_by(P12,TEAM)
Team<- Team %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
ggplot(Team, aes(x=ave_wins,y=ave_expense, label=TEAM))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "P12 Impact on Spending and Wins")+ xlim(10,30)
```
### Amer
```{r Amer}
Amer<-filter(CBfinal,CONF=="Amer")
Team<-group_by(Amer,TEAM)
Team<- Team %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
ggplot(Team, aes(x=ave_wins,y=ave_expense, label=TEAM))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "American Conference Impact on Spending and Wins")+ xlim(10,30)
```
### MWC
```{r MWC}
MWC<-filter(CBfinal,CONF=="MWC")
Team<-group_by(MWC,TEAM)
Team<- Team %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
ggplot(Team, aes(x=ave_wins,y=ave_expense, label=TEAM))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "Mountain West Conference Impact on Spending and Wins")+ xlim(10,30)
```
### A10
```{r A10}
A10<-filter(CBfinal,CONF=="A10")
Team<-group_by(A10,TEAM)
Team<- Team %>% summarize(
ave_wins=mean(W),
averebounds= mean(ORB)+mean(DRB),
averageADJOE= mean(ADJOE),
averageADJDE= mean(ADJDE),
averagetempo=mean(ADJ_T),
ave_shooting_3Percent=mean(`3P_O`),
ave_shooting_2Percent=mean(`2P_O`),
ave_strength_schedule=mean(BARTHAG),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=mean(netrevenue)
)
ggplot(Team, aes(x=ave_wins,y=ave_expense, label=TEAM))+geom_text(check_overlap = TRUE)+labs(y="Average Expenses ($)", x="Average Wins by team", title = "A10 Impact on Spending and Wins")+ xlim(10,30)
```
Column {data-width=200}
---
### Summary
There appears to be a positive correlation between wins and expenses in each conference, and conferences that spend more typically have much higher win averages. There are several possible reasons: higher spending can pay for a stronger coaching staff as well as better facilities, training programs, and equipment.
Some conferences also have extreme outliers. In the SEC, Kentucky both outspends and outperforms the rest of the conference by a wide margin. Indiana is another team worth a look: it spends the most in the Big Ten, about $12 million per year on average, yet many teams in the conference win more games, and Iowa, which spends around $5 million less, has had a similar amount of success. When a team spends heavily without a matching gain in wins, it is probably overspending in some area, which suggests that resources are being mismanaged or misallocated.
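The correlation described above could be checked numerically rather than read off the plots. The block below is a minimal sketch, not the analysis used in this dashboard: `team_toy` is a hypothetical stand-in for the per-team summaries (`ave_wins`, `ave_expense`) combined across conferences, and every value in it is made up for illustration.

```r
# Hypothetical per-team averages; the team labels and all numbers are made up.
team_toy <- data.frame(
  CONF        = c("SEC", "SEC", "SEC", "SEC", "B10", "B10", "B10", "B10"),
  TEAM        = c("Team A", "Team B", "Team C", "Team D",
                  "Team E", "Team F", "Team G", "Team H"),
  ave_wins    = c(30, 22, 18, 15, 25, 24, 20, 14),
  ave_expense = c(18e6, 9e6, 8e6, 6e6, 12e6, 7e6, 9e6, 5e6)
)

# Split the teams by conference, then compute the Pearson correlation
# between average wins and average expenses within each conference.
cor_by_conf <- sapply(
  split(team_toy, team_toy$CONF),
  function(d) cor(d$ave_wins, d$ave_expense)
)
cor_by_conf
```

A value near 1 would support the pattern seen in the plots, while a value near 0 would suggest that spending and wins are only loosely related in that conference. With only a handful of teams per conference, any such estimate is noisy.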
State Effect
===
Column{.tabset data-width=700}
---
### Expenses
```{r Expenses map}
Mapdata<-CBfinal %>%
mutate(state_name=state.name[match(state_cd, state.abb)])
Mapdata<-na.omit(Mapdata)
State_data<-group_by(Mapdata,state_name)
State_data<-State_data %>% summarize(
ave_Wins=mean(W),
ave_revenue=mean(REVENUE_MENALL),
ave_expense=mean(EXPENSE_MENALL),
ave_net_revenue=ave_revenue-ave_expense,
)
State_data$ave_expense<-round(State_data$ave_expense,2)
State_data$ave_revenue<-round(State_data$ave_revenue,2)
US_map <- us_map("state") %>%
filter(full != "District of Columbia")
CB1_map <- State_data %>% left_join(US_map, by = c("state_name"="full"))
CB1_map$group <- as.numeric(CB1_map$group)
region_label <- CB1_map %>%
group_by(abbr) %>%
summarise(x = mean(x), y = mean(y))
g1 <- ggplot(CB1_map, aes(x = x, y = y)) +
geom_polygon(aes(group=group, fill = ave_expense,
text = paste0(state_name, ":\n $",
round(ave_expense,2) ," average expense per college")), colour = "white") +
geom_text(aes(label = abbr),
data = region_label, fontface = "bold") + labs(fill="Average Expenses",title="Average Expenses for D1 College Basketball")+
scale_fill_viridis_c(option = "B") +
theme_void()
ggplotly(g1, tooltip = "text")
```
### Revenue
```{r revenuemap}
g3 <- ggplot(CB1_map, aes(x = x, y = y)) +
geom_polygon(aes(group=group, fill = ave_revenue,
text = paste0(state_name, ":\n $",
round(ave_revenue,2) ," average revenue per college")), colour = "white") +
geom_text(aes(label = abbr),
data = region_label, fontface = "bold") + labs(fill="Average Revenue", title="Average Revenue for D1 College Basketball")+
scale_fill_viridis_c(option = "B") +
theme_void()
ggplotly(g3, tooltip = "text")
```
### Wins
```{r winsmap}
g2 <- ggplot(CB1_map, aes(x = x, y = y)) +
geom_polygon(aes(group=group, fill = ave_Wins,
text = paste0(state_name, ":\n",
round(ave_Wins,2) ," average wins per college")), colour = "white") +
geom_text(aes(label = abbr),
data = region_label, fontface = "bold") + labs(fill="Average Wins", title="Average Wins for D1 College Basketball")+
scale_fill_viridis_c(option = "B") +
theme_void()
ggplotly(g2, tooltip = "text")
```
Column {data-width=250}
---
### Summary
When colleges are grouped by state, one can see how much spending on college basketball programs differs from state to state and how it relates to wins. In some large states, high spending seems to coincide with the lack of an NBA team, as in Kentucky and Washington. Because these states have no NBA team, fans there may be more likely to spend their money on college basketball, which raises program revenue. Higher revenue could in turn fund more spending on the program and lead to more successful teams over the years.
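One way to probe the NBA observation is to compare average expenses between states with and without an NBA team. The block below is a minimal sketch under stated assumptions: `state_toy` is a hypothetical stand-in for `State_data`, the `has_nba_team` flag is a column that would have to be added by hand, and the state labels and dollar amounts are made up.

```r
# Hypothetical state-level summary; labels, flags, and amounts are made up.
state_toy <- data.frame(
  state_name   = c("State A", "State B", "State C", "State D", "State E", "State F"),
  has_nba_team = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  ave_expense  = c(9.0e6, 7.5e6, 3.0e6, 6.0e6, 5.5e6, 4.0e6)
)

# Mean of the state-level average expenses within each group.
nba_compare <- aggregate(ave_expense ~ has_nba_team, data = state_toy, FUN = mean)
nba_compare
```

A gap between the two group means would be consistent with the pattern described here, but it would not show that the absence of an NBA team causes the higher spending.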
Conclusion
===
Column {data-width=600}
---
### Conclusion
In conclusion, the success of a college basketball team tends to be linked to attributes such as high adjusted offensive efficiency, low adjusted defensive efficiency (fewer points allowed), low tempo, and better shooting percentages, all of which increase a team's chance of winning. Looking further at how a team achieves better shooting percentages or higher efficiency, it seems that colleges that spend more on their programs get a better strength of schedule and benefit from other outside factors (coaching, facilities, recruiting, etc.). The main limitation of this study, and also its motivation, is that the data come from before NIL deals were permitted in the NCAA. If money had this much of an impact before college players were allowed to be paid, what will the effect of college spending be after this change? Will it widen the divide between schools that can afford to spend millions and those that cannot?
### References
I used data from two sources and merged them. The first is the college basketball data set by Andrew Sundberg on [kaggle](https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset). The second is the set of yearly reports from [EADA](https://ope.ed.gov/athletics/#/datafile/list) (Equity in Athletics Data Analysis), which records how much each college spends on sports each year. I used the reports from 2013-2019.
Column {data-width=400}
---
### About the Author
My name is Aidan Bramer.
I am a junior pursuing a B.S. in Applied Mathematics in Economics at the University of Dayton.
I also have minors in Data Analytics and Data Science and AI.
Connect with me on [LinkedIn](https://www.linkedin.com/in/aidan-bramer-0652b8246?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3Bch9GIwx8SJOQzkvK8KXBSA%3D%3D).