Olympic Gymnastics Prediction
Using statistical methods to get medal probabilities
Aidan Bramer
Department of Mathematics
University of Dayton
Background
This study looks at female gymnasts who may compete at the Olympics.
Goal: predict each athlete's medal probabilities in their event
- Find the athletes who will participate in each event
- Test different ways to predict athletes' scores
- Simulate athletes' scores
- Observe each athlete's medal probability
Data Structure
| Row ID | Last name | First name | Gender | Country | Date | Competition | Round | Location | Apparatus | Rank | D score | E score | Penalty | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6772 | Makovits | Mirtill | w | HUN | 11-14 Aug 2022 | 2022 Senior European Championships | qual | Munich, Germany | BB | 65 | 4.5 | 6.900 | NA | 11.400 |
| 3665 | Gobadze | Ani | w | GEO | 11-16 Apr 2023 | 2023 10th Senior European Championships | qual | Antalya, Turkey | UB | 110 | 3.2 | 6.266 | NA | 9.466 |
| 5403 | Kinsella | Alice | w | ENG | 29 Jul-2 Aug, 2023 | BIRMINGHAM 2022 Commonwealth Games | final | Birmingham, England | FX | 1 | 5.6 | 7.766 | NA | 13.366 |
| 8796 | Petrova | Marija | w | LAT | 11-16 Apr 2023 | 2023 10th Senior European Championships | qual | Antalya, Turkey | VT1 | 100 | 3.4 | 8.500 | NA | 11.900 |
The full dataset has 5008 rows and 16 columns.
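One way to picture a row of this dataset is as a small record type. The sketch below is illustrative only: the field names are hypothetical (the slide shows values, not column names), only a subset of the columns is modeled, and it assumes the total score equals difficulty plus execution minus any penalty, with NA treated as zero. The four sample rows shown above are consistent with that assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreRecord:
    # Hypothetical field names; the slide lists values without headers.
    last_name: str
    first_name: str
    country: str
    round_name: str
    apparatus: str
    rank: int
    d_score: float
    e_score: float
    penalty: Optional[float]  # None stands in for NA
    score: float

    def recomputed_score(self) -> float:
        # Assumption: total = D score + E score - penalty (NA counted as 0).
        return self.d_score + self.e_score - (self.penalty or 0.0)


# The four sample rows from the slide.
records = [
    ScoreRecord("Makovits", "Mirtill", "HUN", "qual", "BB", 65, 4.5, 6.900, None, 11.400),
    ScoreRecord("Gobadze", "Ani", "GEO", "qual", "UB", 110, 3.2, 6.266, None, 9.466),
    ScoreRecord("Kinsella", "Alice", "ENG", "final", "FX", 1, 5.6, 7.766, None, 13.366),
    ScoreRecord("Petrova", "Marija", "LAT", "qual", "VT1", 100, 3.4, 8.500, None, 11.900),
]

for r in records:
    consistent = math.isclose(r.recomputed_score(), r.score, abs_tol=1e-9)
    print(r.last_name, r.apparatus, r.score, "consistent" if consistent else "mismatch")
```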
Balance Beam
Because of time constraints, and because the process repeats for every apparatus, we focus on one event.
- Find the top 8 athletes who qualified for the Olympics
- Predict possible scores for these athletes
| Row ID | Last name | First name | Score |
|---|---|---|---|
| 396 | Andrade | Rebeca | 12.733 |
| 1189 | Black | Elsabeth | 13.566 |
| 1201 | Blakely | Skye | 13.300 |
| 1265 | Boyer | Marine | 13.300 |
| 5692 | Kovacs | Zsofia | 12.733 |
| 7487 | Miyata | Shoko | 13.533 |
| 8445 | Ou | Yushan | 13.000 |
| 12036 | Watanabe | Hazuki | 13.600 |
History for these 8 Athletes
Looking at each athlete's score history shows how many data points are available.
Every athlete who qualified for the Olympics has at least 2 scores, from the two Olympic qualifying rounds.
| First name | Last name | Number of scores |
|---|---|---|
| Elsabeth | Black | 4 |
| Hazuki | Watanabe | 3 |
| Marine | Boyer | 9 |
| Rebeca | Andrade | 4 |
| Shoko | Miyata | 7 |
| Skye | Blakely | 2 |
| Yushan | Ou | 7 |
| Zsofia | Kovacs | 7 |
Model Building
- Each athlete has a different number of scores, which is a problem for a regression approach
- A regression would first need summary statistics for each athlete
- Those summaries are biased for athletes with few data points, because their only scores are the Olympic qualifying scores
- It is hard to justify using all athletes' scores to predict only the best athletes
- The assumption: if these are the 8 best athletes, they should have similar athletic performance
Basics of Kernel Density Estimator
For KDE, the formula is:
\[
f_h(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)
\]
Where:
\(x_i\) are the data points.
\(K\) is the kernel (e.g., Gaussian).
\(h\) is the bandwidth, controlling the smoothness.
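The formula can be evaluated directly. The following is a minimal sketch, not the code used in this study: it assumes a Gaussian kernel, a hypothetical bandwidth of 0.3, and four balance beam scores from the table earlier in the slides as purely illustrative data. The function names are made up for the example.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Standard normal density, one common choice for K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x: float, data: list, h: float, kernel=gaussian_kernel) -> float:
    """Evaluate f_h(x) = 1/(n h) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)


# Illustrative inputs only: four scores and a hypothetical bandwidth.
scores = [12.733, 13.000, 13.300, 13.566]
h = 0.3

# Riemann-sum check that the estimate integrates to roughly 1.
step = 0.01
grid = [10.0 + i * step for i in range(601)]  # covers 10.0 to 16.0
area = sum(kde(x, scores, h) for x in grid) * step

print(f"density at 13.3: {kde(13.3, scores, h):.3f}")
print(f"approximate area under the curve: {area:.4f}")
```

A larger \(h\) spreads each data point's contribution more widely and gives a smoother curve; a smaller \(h\) produces a separate bump around each score.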
Weighted KDE
Weighted KDE allows different weights for different data points.
The formula is:
\[
f_h(x) = \frac{1}{nh} \sum_{i=1}^n w_i K\left(\frac{x - x_i}{h}\right)
\]
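Where \(w_i\) is the weight for data point \(x_i\). With the \(\frac{1}{nh}\) factor, the weights must sum to \(n\) for the estimate to integrate to 1; equal weights \(w_i = 1\) recover the ordinary KDE.
- The weights, kernel, and bandwidth all have to be selected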
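A weighted KDE can be sketched in a few lines. Everything specific here is an assumption for illustration: the scores, the weights (newer scores counted more heavily), the bandwidth, and the Gaussian kernel are hypothetical choices, not the ones used in this study. The weights are rescaled to sum to \(n\) so that the formula with the \(\frac{1}{nh}\) factor still yields a density.

```python
import math


def gaussian_kernel(u: float) -> float:
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def weighted_kde(x, data, weights, h, kernel=gaussian_kernel):
    """f_h(x) = 1/(n h) * sum_i w_i K((x - x_i) / h), weights rescaled to sum to n."""
    n = len(data)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    return sum(w * kernel((x - xi) / h) for w, xi in zip(scaled, data)) / (n * h)


# Hypothetical inputs: three scores (oldest first) with increasing weights.
scores = [13.1, 13.4, 13.6]
weights = [1.0, 2.0, 3.0]
h = 0.3

# Riemann-sum check that the weighted estimate still integrates to about 1.
step = 0.01
grid = [10.0 + i * step for i in range(701)]  # covers 10.0 to 17.0
area = sum(weighted_kde(x, scores, weights, h) for x in grid) * step
print(f"approximate area under the curve: {area:.4f}")

# With equal weights the estimate reduces to the unweighted KDE.
equal = weighted_kde(13.4, scores, [1.0, 1.0, 1.0], h)
plain = sum(gaussian_kernel((13.4 - xi) / h) for xi in scores) / (len(scores) * h)
print(f"equal weights: {equal:.6f}, unweighted: {plain:.6f}")
```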
Weighted Kernel Density Estimator Approach
Gaussian Kernel: The Default
The Gaussian (or Normal) kernel has the form:
\[
K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} u^2}
\]
where \(u\) represents the standardized distance:
\[
u = \frac{x - x_i}{h}
\]
Characteristics:
- Infinitely long tails
- Smooth, bell-shaped curve
- Influences the KDE even at considerable distances
Epanechnikov Kernel: An Alternative
The Epanechnikov kernel has the form:
\[
K(u) = \frac{3}{4}(1 - u^2) \quad \text{for } |u| \leq 1
\]
Characteristics:
- Compact support (zero outside the range [-1, 1])
- Parabolic shape
- Considered “optimal” in a mean integrated squared error sense
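The kernel and its compact support are easy to check numerically. This is a small sketch with a hypothetical function name; it evaluates the formula above, returns zero outside \([-1, 1]\), and uses a midpoint rule to confirm that the kernel integrates to about 1.

```python
def epanechnikov_kernel(u: float) -> float:
    """K(u) = 3/4 * (1 - u^2) for |u| <= 1, and 0 otherwise."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


# Midpoint-rule integral over the support [-1, 1].
n_bins = 2000
width = 2.0 / n_bins
area = sum(epanechnikov_kernel(-1.0 + (i + 0.5) * width) for i in range(n_bins)) * width

print(f"K(0)   = {epanechnikov_kernel(0.0)}")
print(f"K(1.5) = {epanechnikov_kernel(1.5)}")
print(f"approximate integral = {area:.6f}")
```

Because \(K(u) = 0\) whenever \(|u| > 1\), a data point \(x_i\) contributes nothing to the density at any \(x\) farther than one bandwidth \(h\) away.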
Why Epanechnikov?
- Compact Support: The Epanechnikov kernel ensures only nearby points contribute to the density, which is beneficial for small datasets. Distant points don’t influence the KDE.
- Locality with Weights: Weights can be effectively accounted for within the local neighborhood of each data point without being overshadowed by distant points.
- Less Smoothing: The Gaussian kernel might overly smooth small datasets, potentially hiding important data features. Epanechnikov captures local structures better.
Visual Comparison
Visualizing Sample Athletes' Weighted KDE
Simulation Study
13.78529 | 13.29861 | 13.51149 | 13.11144 | 13.82491 | 13.36353 | 13.28248 | 13.59079
13.41941 | 13.54073 | 13.65676 | 13.05226 | 13.37813 | 14.12054 | 12.20335 | 13.42937
13.82295 | 13.44389 | 12.72593 | 12.69181 | 13.64189 | 13.42259 | 13.49186 | 13.73606
13.48398 | 13.31476 | 13.78052 | 13.83771 | 13.85183 | 13.55681 | 13.67440 | 13.43475
12.53700 | 13.44389 | 13.97960 | 13.67093 | 13.90566 | 13.99168 | 12.45568 | 13.24105
13.51088 | 13.74519 | 13.66214 | 12.35288 | 13.52885 | 13.34205 | 13.48649 | 12.89670
13.34946 | 13.58378 | 12.44076 | 12.82093 | 12.15081 | 13.60513 | 13.15363 | 13.11730
13.77991 | 13.48693 | 12.22016 | 13.68169 | 13.49117 | 13.50849 | 13.23416 | 13.62845
13.02663 | 13.92275 | 13.47383 | 12.89624 | 12.78600 | 13.68029 | 13.00867 | 12.93974
13.34408 | 13.14258 | 13.95269 | 13.21903 | 13.70110 | 13.44943 | 13.77104 | 13.57464
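Simulated scores like these can be drawn from a weighted Epanechnikov KDE in two steps: pick one observed score with probability proportional to its weight, then add kernel noise scaled by the bandwidth. The sketch below is one possible way to do this, not necessarily the procedure used in this study. The scores, weights, bandwidth, seed, and function names are all hypothetical, and the kernel noise is drawn by simple rejection sampling.

```python
import random


def sample_epanechnikov(rng: random.Random) -> float:
    """Draw u with density 3/4 * (1 - u^2) on [-1, 1] by rejection sampling."""
    while True:
        u = rng.uniform(-1.0, 1.0)
        # Accept with probability proportional to the kernel height at u.
        if rng.random() < 1.0 - u * u:
            return u


def sample_weighted_kde(data, weights, h, rng):
    """One draw from the weighted KDE: weighted choice of a center plus kernel noise."""
    center = rng.choices(data, weights=weights, k=1)[0]
    return center + h * sample_epanechnikov(rng)


# Hypothetical inputs for one athlete.
scores = [13.1, 13.4, 13.6]
weights = [1.0, 2.0, 3.0]
h = 0.2
rng = random.Random(2024)  # fixed seed so the run is repeatable

draws = [sample_weighted_kde(scores, weights, h, rng) for _ in range(1000)]
print(f"min {min(draws):.3f}, max {max(draws):.3f}, mean {sum(draws) / len(draws):.3f}")
```

With a compact kernel, every simulated score stays within one bandwidth of an observed score, so the simulation cannot produce values far outside an athlete's history.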
Medal Counts
| Athlete | Gold | Silver | Bronze |
|---|---|---|---|
| Elsabeth Black | 0.120 | 0.122 | 0.134 |
| Hazuki Watanabe | 0.155 | 0.156 | 0.136 |
| Marine Boyer | 0.137 | 0.135 | 0.121 |
| Rebeca Andrade | 0.060 | 0.081 | 0.100 |
| Shoko Miyata | 0.167 | 0.118 | 0.152 |
| Skye Blakely | 0.142 | 0.146 | 0.110 |
| Yushan Ou | 0.131 | 0.132 | 0.135 |
| Zsofia Kovacs | 0.088 | 0.110 | 0.112 |
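Proportions of this kind can be obtained by ranking the athletes within each simulated competition and counting how often each one finishes first, second, or third. The sketch below assumes that reading of the table and uses made-up athletes and normally distributed scores purely as stand-in data; the function name and inputs are hypothetical.

```python
import math
import random


def medal_probabilities(simulated: dict) -> dict:
    """Map each athlete to (P(1st), P(2nd), P(3rd)) across simulations.

    `simulated` maps athlete name -> list of simulated scores, one per
    simulated competition; all lists must have the same length.
    """
    n_sims = len(next(iter(simulated.values())))
    counts = {name: [0, 0, 0] for name in simulated}
    for s in range(n_sims):
        # Highest score first; ties keep dictionary order.
        ranking = sorted(simulated, key=lambda name: simulated[name][s], reverse=True)
        for place, name in enumerate(ranking[:3]):
            counts[name][place] += 1
    return {name: tuple(c / n_sims for c in counts[name]) for name in counts}


# Stand-in data: four hypothetical athletes with made-up score distributions.
rng = random.Random(7)
simulated = {
    "Athlete A": [rng.gauss(13.5, 0.4) for _ in range(5000)],
    "Athlete B": [rng.gauss(13.4, 0.4) for _ in range(5000)],
    "Athlete C": [rng.gauss(13.2, 0.4) for _ in range(5000)],
    "Athlete D": [rng.gauss(13.0, 0.4) for _ in range(5000)],
}

probs = medal_probabilities(simulated)
for name, (first, second, third) in probs.items():
    print(f"{name}: {first:.3f} {second:.3f} {third:.3f}")
```

Under this counting scheme each medal column sums to 1 across athletes, which matches the columns of the table above.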
Other Apparatus Results
Uneven Bars
| Athlete | Gold | Silver | Bronze |
|---|---|---|---|
| Elisabeth Seitz | 0.123 | 0.126 | 0.149 |
| Naomi Visser | 0.050 | 0.094 | 0.104 |
| Nina Derwael | 0.166 | 0.171 | 0.132 |
| Rebeca Andrade | 0.051 | 0.074 | 0.089 |
| Rui Luo | 0.114 | 0.130 | 0.131 |
| Sanna Veerman | 0.105 | 0.110 | 0.128 |
| Shilese Jones | 0.183 | 0.127 | 0.126 |
| Xiaoyuan Wei | 0.208 | 0.168 | 0.141 |
Floor Routine
| Athlete | Gold | Silver | Bronze |
|---|---|---|---|
| Jade Carey | 0.154 | 0.133 | 0.143 |
| Jennifer Gadirova | 0.061 | 0.099 | 0.102 |
| Jessica Gadirova | 0.216 | 0.161 | 0.131 |
| Jordan Chiles | 0.151 | 0.160 | 0.114 |
| Martina Maggio | 0.090 | 0.112 | 0.115 |
| Naomi Visser | 0.106 | 0.117 | 0.133 |
| Rebeca Andrade | 0.158 | 0.140 | 0.147 |
| Shoko Miyata | 0.064 | 0.078 | 0.115 |
Challenges and Limitations
- Lack of data for certain athletes
- The goal was only to find a way of sampling future event results to estimate the chance of each medal
- The resulting proportions depend on my assumption that these 8 athletes have a similar history
Sources
Chu, Chi-Yang, Daniel J. Henderson, and Christopher F. Parmeter. “On discrete Epanechnikov kernel functions.” Computational statistics & data analysis 116 (2017): 79-105.
Soh, Youngsung, et al. “Performance evaluation of various functions for kernel density estimation.” Open J Appl Sci 3.1 (2013): 58-64.
NBC Olympics
Thank you
Special thanks to my advisor, Dr. Tessa Chen.
Questions?