Formula One Racing¶

Rahul Kiefer, Rahul Narla, Hrishik Rajendra¶

Introduction¶

Formula One (also known as Formula 1 or F1) is the highest class of international auto racing for single-seater racing cars sanctioned by the Fédération Internationale de l'Automobile (FIA). Two World Championships are contested over the course of an F1 season: the Drivers' World Championship and the Constructors' World Championship. Each F1 constructor builds its own car and has two drivers racing for it. Points earned by drivers racing for the same constructor contribute to the Constructors' Championship, while each driver's individual points contribute to the Drivers' Championship. While F1 is a team sport, this dynamic, in which a driver is also competing for an individual championship, leads to intense battles even among teammates. Today, F1 is a multibillion-dollar annual industry, ranking behind only the quadrennial Football World Cup and Summer Olympic Games in terms of live television audience (Benson). The cars, which have effectively become mobile advertising billboards, race fortnightly in front of a global audience of motor sport fans (527 million across 187 countries in 2010) who are “up to three times more brand loyal than fans of other sports” (Autosport).

In this tutorial, our goal is to combine and analyze the data we found in order to provide insight into which car constructors and drivers are the most dominant in Formula 1. For readers unfamiliar with the sport, we hope this analysis will get them interested in watching Formula 1 races and provide insight into which teams are performing the best. For those already familiar with F1, we hope to show how well their favorite teams and/or drivers have been performing in recent years (or how badly they're being beaten).

Data Curation, Parsing, and Management¶

Library and Module Imports¶

For this tutorial, we're using Python 3 along with the following imported libraries: Folium, Matplotlib, NumPy, pandas, PyWaffle, SciPy, seaborn, scikit-learn, and statsmodels.

# Standard library imports
import csv
import io
from io import BytesIO
import warnings
warnings.filterwarnings('ignore')  # silences all warnings for the rest of the notebook

# Third-party imports (the !pip lines are notebook shell commands)
!pip install folium
!pip install pywaffle
import folium
import folium.plugins as plg
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pywaffle import Waffle
import requests
from scipy import stats
import seaborn as sns
from sklearn import datasets, ensemble, linear_model, metrics, model_selection, svm
from sklearn.model_selection import cross_val_predict, train_test_split
from statsmodels.formula.api import ols

Retrieving the Data¶

The files we're using are originally from this Kaggle dataset. From the dataset, we uploaded copies of each CSV to our GitHub repository for ease of access (you need to have a Kaggle account in order to download the CSVs from the above link).

These CSV files have been cleaned prior to retrieval, so no data cleaning/modification is necessary on our part.

races = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/races.csv')
results = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/results.csv')
constructors = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/constructors.csv')
status = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/status.csv')
drivers = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/drivers.csv')
circuits = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/circuits.csv')
laptimes = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/lap_times.csv')

r = races

# Merging other dataframes
r = r.merge(results, on='raceId')
r = r.merge(constructors, on='constructorId')
r = r.merge(status, on='statusId')
r = r.merge(drivers, on='driverId')

# All merged columns are kept; columns whose names clash get the pandas suffixes _x and _y

r

The merged dataframe 'r' above has many columns. The ones we rely on most are:

Year - Year of Race
Name - Name of F1 race
Grid - Initial Position
Position - Final Position
Points - Points Awarded
Constructor Ref - Car Constructor
Driver Ref - Driver
Forename and surname - Driver Name

The title of each column should be self-explanatory. If you'd like more information on a column, important topics have links to additional information.
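
Because several of the source CSVs share column names such as name and url, the chained merges rename the clashing columns with _x and _y suffixes. The sketch below shows the effect on toy stand-ins for the real files; the frames races_demo, results_demo, and constructors_demo and all of their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for races.csv, results.csv and constructors.csv (values are made up)
races_demo = pd.DataFrame({"raceId": [1, 2], "name": ["Race A", "Race B"]})
results_demo = pd.DataFrame({
    "raceId": [1, 1, 2],
    "constructorId": [10, 11, 10],
    "points": [25.0, 18.0, 25.0],
})
constructors_demo = pd.DataFrame({"constructorId": [10, 11], "name": ["Team X", "Team Y"]})

# Same merge pattern as above: both inputs of the second merge have a 'name' column,
# so pandas keeps both and appends its default suffixes _x (left) and _y (right)
merged = (
    races_demo
    .merge(results_demo, on="raceId")
    .merge(constructors_demo, on="constructorId")
)

print(merged.columns.tolist())
```

In this toy frame, name_x holds the race name and name_y the constructor name. Which suffix a column receives depends only on the order of the merges, so it is worth checking the column list after each merge.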

r = r[r.year > 2010]
races = races[races.year > 2010]
races

We decided to analyze only the 10 seasons from 2011 to 2020, rather than starting from the 1950s as the original dataframes do, so that we can more easily determine who currently dominates F1. Additionally, for our later analysis of grid positions and finishing positions, we thought these 10 seasons (with more than 4,000 data points) would suffice and would more accurately represent the current state of F1.

Exploratory Data Analysis¶

Formula One is a Global Sport¶

Current regulations set by the FIA specify that a full championship season “must include Competitions taking place on at least three continents during the same season.”

This requires F1 teams to travel a lot during a season, racing in different countries and tracks. Below is an interactive map showing the different tracks teams have raced at since the start of the sport in 1950.

circuits_map = folium.Map(zoom_start=13)
map_cluster = plg.MarkerCluster().add_to(circuits_map)
for idx, row in circuits.iterrows():
    folium.Marker(
        location=[row['lat'], row['lng']],
        icon=folium.Icon(color='cadetblue', prefix='fa', icon='flag-checkered')
        ).add_to(map_cluster)

circuits_map

We can also analyze how these circuits are distributed across the different continents.

# The number of circuits on each continent
num_continent= {
  'Europe':38,
  'Asia':17,
  'North America':6,
  'South America':3,
  'Africa':3,
  'Australia':2
}

fig = plt.figure(
    figsize = (14,16),
    FigureClass=Waffle, 
    rows=5, 
    values=num_continent, 
    colors=sns.color_palette("viridis",len(num_continent)).as_hex(),
    title={'label': 'Distribution of F1 circuits across different continents', 'size':18},
    labels=["{0} ({1})".format(k, v) for k, v in num_continent.items()],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    icons='flag-checkered', icon_size=45, 
    icon_legend=True
)

We observe that F1 has visited the most circuits in Europe, followed by Asia. North America ranks third in the number of F1 circuits. One likely reason is that F1 is not nearly as popular in the United States as other motorsports such as NASCAR. This creates a feedback loop: F1 rarely races in the United States because it has fewer fans here, and the shortage of races in turn makes it harder to build a fan base.
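
To put the waffle chart into numbers, the counts can be converted into percentage shares. This is a small sketch that reuses the counts from the num_continent dictionary above; the names total and shares are ours.

```python
# Circuit counts copied from the num_continent dictionary above
num_continent = {
    'Europe': 38, 'Asia': 17, 'North America': 6,
    'South America': 3, 'Africa': 3, 'Australia': 2,
}

total = sum(num_continent.values())

# Percentage share of circuits per continent, rounded to one decimal place
shares = {k: round(100 * v / total, 1) for k, v in num_continent.items()}

for continent, share in shares.items():
    print(f"{continent}: {share}%")
```

By these counts, Europe accounts for a little over half of the circuits and Asia for roughly a quarter.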

Who dominated F1 in the last 10 years?¶

With the 2020 Formula One season coming to an end at Abu Dhabi a few weeks ago, we are curious to see which constructor and driver dominated F1.

First, let us observe how different constructors performed during the course of the last 10 years.

plt.rc('font', size=10)          # controls default text sizes
plt.rc('axes', titlesize=15)     # fontsize of the axes title
plt.rc('axes', labelsize=15)     # fontsize of the x and y labels
plt.rc('xtick', labelsize=10)    # fontsize of the tick labels
plt.rc('ytick', labelsize=10)    # fontsize of the tick labels
plt.rc('legend', fontsize=15)    # legend fontsize
plt.rc('figure', titlesize=15)   # fontsize of the figure title
for constructor in r.constructorRef.unique():
  total = r.loc[r.constructorRef == constructor]
  lst = []
  for y in total.year.unique():
    # Points scored by all of this constructor's drivers in season y
    season_points = total[total.year == y].points.sum()
    lst.append([y, season_points])
  df = pd.DataFrame(data=lst, columns=['year', 'total points'])
  # One scatter plot per constructor, all on the same axis limits for easy comparison
  df.plot.scatter(x='year', y='total points', title=constructor.capitalize(), sharex=True, xlim=(2011, 2020), ylim=(0, 1000), s=100, c='blue', figsize=(12, 7), xlabel='Year', ylabel='Total Points')

In these scatter plots, we've plotted the total points each constructor scored in each season. A constructor's total for a year is the sum of the points scored that year by all of the drivers racing for it. Looking across the plots, it is easy to see that Mercedes has dominated F1 with the most points for approximately the past seven years. Ferrari and Red Bull come close in some years, but overall Mercedes has outscored every other constructor.
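
The per-constructor, per-season totals described above can also be computed in a single step with a pandas groupby. This is a minimal sketch on a made-up frame that only borrows the column names year, constructorRef, and points from r; it is an alternative to the nested loop, not the code that produced the plots.

```python
import pandas as pd

# Made-up results using the same column names as the merged dataframe r
demo = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "constructorRef": ["team_x", "team_x", "team_y", "team_x", "team_y", "team_y"],
    "points": [25.0, 18.0, 15.0, 26.0, 18.0, 12.0],
})

# Sum the points of every driver entry for each (constructor, season) pair
season_points = (
    demo.groupby(["constructorRef", "year"], as_index=False)["points"]
    .sum()
    .rename(columns={"points": "total_points"})
)

print(season_points)
```

The result has one row per constructor and season, which is convenient for plotting every constructor on one figure or for ranking teams within a year.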

An important observation here is that the Mercedes dominance in the Constructors' Championship did not start until 2014. This was when F1 made a major regulation change, replacing the older V8 engines with hybrid V6 engines. Since this regulation change, Mercedes have proven to be unstoppable, making them the team to beat in the hybrid era of F1.

We can also look at the average points earned per race for each constructor.

m1 = pd.merge(results, constructors, on='constructorId')
m2 = pd.merge(m1, races, on='raceId')
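result_v2 = m2[m2.year > 2010].copy()  # copy so the new column is added to an independent frame
result_v2["constructor"] = result_v2["name_x"]  # name_x: constructor name, name_y: race name

# Aggregate total points and average points per race entry
avg_pts = result_v2[['constructor','points']].groupby("constructor").mean()
total_pts = result_v2[['constructor','points']].groupby("constructor").sum()
# Each row is one car in one race, so this counts race entries rather than distinct races
n = result_v2[['constructor','raceId']].groupby("constructor").count()
num_races = n[n.raceId > 100]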
d = pd.merge(avg_pts, total_pts, on='constructor')
md = pd.merge(d, num_races, on='constructor')
md = md.reset_index()

plt.figure(figsize=(20, 10))
plt.scatter(md.points_x, md.raceId, s=md.points_y*6, alpha=0.5, color=sns.color_palette("magma", len(md)))
plt.xlim(0, 17)
plt.ylim(0, 500)

plt.xlabel("Average Points Per Race")
plt.ylabel("Races")

for x, y, z in zip(md.points_x, md.raceId, md.constructor):
  plt.annotate(z, xy=(x-1,y-1))

Here, we look at average points rather than the season totals from the previous visualization. Mercedes' dominance over the past 10 years is now even more visible, as it also has the highest average points per race entry. Red Bull and Ferrari come in second and third, but Mercedes still leads by a significant margin. These three teams average more than double the points of the other teams. This visualization also lets us pick out underperforming constructors. Note that the vertical axis counts race entries (one per car per race), so a team that fielded two cars throughout the period shows roughly 400. McLaren and Williams have about as many entries as Mercedes and Red Bull but sit far behind them in average points. Lotus F1 even outperforms McLaren and Williams on average despite having fewer than 200 entries, against approximately 400 for McLaren and Williams.
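
The difference between race entries and distinct races is easy to check with a named aggregation. The sketch below uses a made-up frame with the same column names as result_v2; the summary frame and its column names are ours.

```python
import pandas as pd

# Made-up results: Team X enters two cars per race, Team Y enters one
demo = pd.DataFrame({
    "raceId": [1, 1, 2, 2, 1, 2],
    "constructor": ["Team X"] * 4 + ["Team Y"] * 2,
    "points": [25.0, 18.0, 25.0, 0.0, 15.0, 10.0],
})

summary = demo.groupby("constructor").agg(
    avg_points=("points", "mean"),   # average points per entry
    total_points=("points", "sum"),  # total points over all entries
    entries=("raceId", "count"),     # one row per car per race
    races=("raceId", "nunique"),     # distinct races entered
)

print(summary)
```

Both toy teams took part in two races, but Team X has four entries because it ran two cars. This is the same effect that puts long-standing two-car teams near 400 on the vertical axis of the chart.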

Just as we looked at the average points per race for the constructors, let us look at the average points per race for the individual drivers.
Note: Drivers with 100 or fewer race entries in this period have been excluded from the figure.

m1 = pd.merge(results, drivers, on='driverId')
m2 = pd.merge(m1, races, on='raceId')
result_v2 = m2[m2.year > 2010].copy()  # copy so the new column is added to an independent frame
result_v2["driver"] = result_v2["forename"] + " " + result_v2["surname"]

# Aggregate total points and average points per race
avg_pts = result_v2[['driver','points']].groupby("driver").mean()
total_pts = result_v2[['driver','points']].groupby("driver").sum()
n=result_v2[['driver','raceId']].groupby("driver").count()
num_races=n[n.raceId > 100]
d = pd.merge(avg_pts, total_pts, on='driver')
md = pd.merge(d, num_races, on='driver')
md = md.reset_index()
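# Manual data corrections to the race count (column 3, raceId); the rows are
# selected by position, so these lines depend on the current row order of md
md.iloc[7,3] = 180
md.iloc[6,3] = 125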

plt.figure(figsize=(20, 10))
plt.scatter(md.points_x, md.raceId, s=md.points_y*6, alpha=0.5, color=sns.color_palette("viridis", len(md)))
plt.xlim(0, 18)
plt.ylim(100, 240)

plt.xlabel("Average Points Per Race")
plt.ylabel("Races")

for x, y, z in zip(md.points_x, md.raceId, md.driver):
  plt.annotate(z, xy=(x-1, y-1))

Here we can see that the three drivers with the most points per race on average are Lewis Hamilton, Sebastian Vettel, and Nico Rosberg. It should come as no surprise that these three drivers took every Drivers' World Championship in our time period. Sebastian Vettel won the title four times in a row from 2010 to 2013, after which Lewis Hamilton began his dominance by winning every championship from 2014 to 2020 except 2016, when his teammate Nico Rosberg won. Though Hamilton and Vettel have clearly dominated most of the decade, Max Verstappen has emerged as one of the top contenders, with relatively few races but a high average points per race.

Now that we know the drivers who earn the most points on average, let us look at the fastest drivers on the grid.

fastest_data = pd.merge(laptimes, races, on='raceId', how='left')
fastest_data = fastest_data[['raceId', 'driverId', 'time_x', 'milliseconds','year', 'round', 'circuitId', 'name', 'date']]
fastest_data.rename(columns={'time_x':'lap_time', 'name':'circuit_name'}, inplace=True)
fastest_data = pd.merge(fastest_data, drivers, on='driverId', how='left')
fastest_data = pd.merge(fastest_data, circuits, on='circuitId', how='left')

fastest_data = fastest_data[['raceId', 'driverId', 'lap_time', 'milliseconds', 'year', 'round',
       'circuitId', 'circuit_name', 'date', 'driverRef', 'number', 'code',
       'forename', 'surname', 'dob', 'nationality', 'circuitRef', 'location', 'country']]

data = pd.merge(fastest_data.groupby(['circuit_name','date']).lap_time.min().to_frame().reset_index(), fastest_data[['circuit_name','date','lap_time', 'driverRef','code']], on=['circuit_name','date','lap_time'], how='left')
data = data.sort_values(by='date', ascending = False)

data.head(5)

data['year'] = pd.DatetimeIndex(data.date).year
data['counts'] = 1
data = data.groupby(['year', 'code', 'driverRef']).counts.count().to_frame().reset_index().sort_values(by='year', ascending=False)

# Keep the driver(s) with the highest fastest-lap count in each season
fastest = pd.merge(data, data.groupby(['year'])['counts'].max().to_frame(name='max').reset_index(), on='year', how='left')
fastest = fastest[fastest['counts'] == fastest['max']][['year','code','driverRef','counts']]
fastest.driverRef = fastest.driverRef.str.capitalize()

# Calculate the percentage of fastest lap per season 
fastest = pd.merge(fastest, fastest_data.groupby('year')['round'].max().reset_index(), on='year', how='left')
fastest['percent'] = np.array(fastest['counts'])/np.array(fastest['round'])*100
fastest['year'] = fastest['year'].astype(str)
fastest

The above dataframe shows the driver (or drivers, in the case of a tie) with the most fastest laps in each F1 season. The percent column is the share of that season's races in which the driver set the fastest lap, computed as counts divided by the number of rounds in the season.
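
One caveat about the approach above: it takes the minimum of the lap_time strings, and string comparison can misorder times once a lap lasts ten minutes or more. We have not checked whether such laps affect any race in this dataset, so treat the following as a hedged alternative rather than a correction. The sketch uses a made-up frame with the lap_times.csv column names and picks the fastest lap from the numeric milliseconds column instead.

```python
import pandas as pd

# Made-up laps; the third lap of race 1 is a very slow one (10 minutes)
laps = pd.DataFrame({
    "raceId": [1, 1, 1, 2, 2],
    "driverId": [7, 8, 7, 7, 8],
    "time": ["1:31.204", "1:30.950", "10:02.100", "1:28.400", "1:29.000"],
    "milliseconds": [91204, 90950, 602100, 88400, 89000],
})

# String minimum: "10:02.100" sorts before "1:30.950" because '0' < ':'
by_string = laps.groupby("raceId")["time"].min()

# Numeric minimum: index of the smallest milliseconds value within each race
fastest_idx = laps.groupby("raceId")["milliseconds"].idxmin()
fastest_laps = laps.loc[fastest_idx, ["raceId", "driverId", "time"]]

print(by_string)
print(fastest_laps)
```

In the toy data the string minimum for race 1 is the ten-minute lap, while the numeric version selects the 1:30.950 lap set by driver 8.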

Let us visualize the above data so that it is easier to understand.

fig, ax = plt.subplots(figsize=(12,16))
fig.set_facecolor('#FFFFFF')
ax.set_facecolor('#FFFFFF')

ax.hlines(fastest.year, xmin=0, xmax=fastest.percent, linestyle='dotted')

groups = fastest[['year','percent','driverRef']].groupby('driverRef')
colors=sns.color_palette("magma", len(fastest.code.unique()))

for (name, group), color in zip(groups, colors):
  ax.plot(group.percent, group.year, marker='o', color=color, linestyle='', ms=12, label=name)
ax.set_xlim(0,65)
ax.legend()

for x,y, label, count in zip(fastest.percent, fastest.year, fastest.code, fastest.counts):
  ax.annotate(label+'({} races)'.format(count), xy=(x+0.8,y), textcoords='data')

plt.xlabel('Percentage of Fastest Lap Wins(%)')
plt.title('Who is the fastest driver in each season?', fontsize=18)

plt.show()

The above figure shows that Hamilton has led a season in fastest laps more often than any other driver. Thus, it is no surprise that he went on to dominate the hybrid era of F1, winning six of his seven World Championship titles between 2014 and 2020. This data is also consistent with our assessment of constructor performance, since Mercedes drivers (Lewis Hamilton, Valtteri Bottas, and Nico Rosberg) have been the fastest drivers during the period of Mercedes dominance. In the years before 2014 we see Vettel as one of the fastest, which helps explain how he won the Drivers' World Championship every year from 2010 to 2013.

Hypothesis Testing and Machine Learning¶

# '\N' marks cars without a classified finishing position; place them 25th
r['position'] = r['position'].replace('\\N', '25')
r['position'] = pd.to_numeric(r['position'])
ax = r.plot.scatter(x='grid', y='position', figsize=(12,7))
# Least-squares line of best fit: position = m*grid + b
m, b = np.polyfit(r.grid, r.position, 1)
plt.plot(r.grid, m*r.grid + b, color='black')

Since the position column initially contains \N for cars that did not finish the race, we decided to place them 25th. We then converted the column from strings to numeric values in order to generate a scatter plot. A line of best fit is drawn to show the general trend. Even without it, we can see that although the points are fairly scattered, there is a general positive correlation between grid position and final position.

regr = linear_model.LinearRegression() 
x = np.array(r['grid']).reshape((-1, 1)) 
y = np.array(r['position'])
model = regr.fit(x,y)
plt.figure(figsize=(12,7))
plt.scatter(x, y)
plt.plot(x, regr.predict(x), color='black') 
plt.xlabel('grid')
plt.ylabel('position')

print('coefficient of determination:', model.score(x, y)) 
print('intercept:', model.intercept_)
print('slope:', model.coef_)
print('y = ' + str(model.intercept_) + ' + ' + str(model.coef_[0]) + 'x')

coefficient of determination: 0.25561533281933446
intercept: 5.3774473684577355
slope: [0.61252332]
y = 5.3774473684577355 + 0.6125233247484766x

From this graph and the linear regression model, we can see a positive slope (about 0.61) relating grid position to final position, although the coefficient of determination of about 0.26 means that grid position alone explains only around a quarter of the variance in final position. This does not indicate causation, as there are many other factors that can determine final position in a race, but it does suggest that grid position plays a part. Generally, grid position is a good measure of a driver's and car's speed when no one else is on the track. During the race, however, many factors such as the car, the constructor, the driver's skill in traffic, pit stop times, and strategy contribute to each car's final position.
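
As a worked example of reading the fitted line, the intercept and slope printed above can be plugged into y = intercept + slope * x. This sketch just copies those two numbers; the helper predicted_finish is a name we made up. Keep in mind that non-finishers were coded as 25th, which pulls the predictions toward higher (worse) positions.

```python
# Coefficients copied from the regression output above
INTERCEPT = 5.3774473684577355
SLOPE = 0.6125233247484766

def predicted_finish(grid_slot: int) -> float:
    """Finishing position predicted by the fitted line for a given grid slot."""
    return INTERCEPT + SLOPE * grid_slot

for slot in (1, 10, 20):
    print(f"grid {slot:2d} -> predicted position {predicted_finish(slot):.2f}")
```

Under this model a car starting from pole is predicted to finish around 6th on average, and each grid slot further back costs about 0.61 finishing places.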

regression = ols(formula='position ~ grid + constructorRef + grid*constructorRef', data=r).fit()
print(regression.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               position   R-squared:                       0.313
Model:                            OLS   Adj. R-squared:                  0.306
Method:                 Least Squares   F-statistic:                     48.32
Date:                Mon, 21 Dec 2020   Prob (F-statistic):          2.40e-302
Time:                        16:06:06   Log-Likelihood:                -13635.
No. Observations:                4181   AIC:                         2.735e+04
Df Residuals:                    4141   BIC:                         2.760e+04
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
=======================================================================================================
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                              10.0180      2.444      4.098      0.000       5.226      14.810
constructorRef[T.alphatauri]           -3.2917      4.393     -0.749      0.454     -11.904       5.321
constructorRef[T.caterham]              9.3355      6.729      1.387      0.165      -3.856      22.527
constructorRef[T.ferrari]              -5.7588      2.511     -2.294      0.022     -10.681      -0.836
constructorRef[T.force_india]          -2.8117      2.685     -1.047      0.295      -8.076       2.453
constructorRef[T.haas]                  0.8053      2.796      0.288      0.773      -4.676       6.286
constructorRef[T.hrt]                  14.3101      3.938      3.634      0.000       6.589      22.031
constructorRef[T.lotus_f1]             -2.7949      2.715     -1.029      0.303      -8.118       2.528
constructorRef[T.lotus_racing]          4.0821     16.024      0.255      0.799     -27.334      35.499
constructorRef[T.manor]                10.1858      5.047      2.018      0.044       0.290      20.081
constructorRef[T.marussia]             10.3343      5.136      2.012      0.044       0.266      20.403
constructorRef[T.mclaren]              -3.6411      2.548     -1.429      0.153      -8.637       1.355
constructorRef[T.mercedes]             -6.0365      2.483     -2.431      0.015     -10.904      -1.169
constructorRef[T.racing_point]         -3.5833      3.034     -1.181      0.238      -9.532       2.366
constructorRef[T.red_bull]             -4.2935      2.496     -1.720      0.086      -9.188       0.601
constructorRef[T.renault]              -2.7199      2.709     -1.004      0.315      -8.030       2.590
constructorRef[T.sauber]                3.0009      2.758      1.088      0.277      -2.406       8.408
constructorRef[T.toro_rosso]            2.9296      2.663      1.100      0.271      -2.292       8.151
constructorRef[T.virgin]               -0.7418     17.621     -0.042      0.966     -35.288      33.804
constructorRef[T.williams]             -2.0046      2.570     -0.780      0.435      -7.042       3.033
grid                                    0.2753      0.165      1.667      0.096      -0.048       0.599
grid:constructorRef[T.alphatauri]       0.1365      0.351      0.389      0.698      -0.552       0.825
grid:constructorRef[T.caterham]        -0.3220      0.364     -0.884      0.377      -1.036       0.392
grid:constructorRef[T.ferrari]          0.2620      0.183      1.432      0.152      -0.097       0.621
grid:constructorRef[T.force_india]      0.1079      0.192      0.562      0.574      -0.268       0.484
grid:constructorRef[T.haas]             0.0666      0.193      0.345      0.730      -0.312       0.445
grid:constructorRef[T.hrt]             -0.4198      0.217     -1.935      0.053      -0.845       0.006
grid:constructorRef[T.lotus_f1]         0.2365      0.192      1.230      0.219      -0.140       0.614
grid:constructorRef[T.lotus_racing]    -0.0384      0.862     -0.045      0.964      -1.728       1.651
grid:constructorRef[T.manor]           -0.4135      0.289     -1.430      0.153      -0.980       0.153
grid:constructorRef[T.marussia]        -0.3807      0.280     -1.359      0.174      -0.930       0.168
grid:constructorRef[T.mclaren]          0.2421      0.176      1.374      0.170      -0.103       0.588
grid:constructorRef[T.mercedes]         0.1801      0.179      1.005      0.315      -0.171       0.532
grid:constructorRef[T.racing_point]     0.1735      0.220      0.789      0.430      -0.258       0.605
grid:constructorRef[T.red_bull]         0.0438      0.179      0.245      0.806      -0.306       0.394
grid:constructorRef[T.renault]          0.2588      0.189      1.369      0.171      -0.112       0.630
grid:constructorRef[T.sauber]          -0.1276      0.185     -0.688      0.491      -0.491       0.236
grid:constructorRef[T.toro_rosso]      -0.1679      0.182     -0.922      0.357      -0.525       0.189
grid:constructorRef[T.virgin]           0.2143      0.834      0.257      0.797      -1.420       1.849
grid:constructorRef[T.williams]         0.1763      0.175      1.009      0.313      -0.166       0.519
==============================================================================
Omnibus:                      771.736   Durbin-Watson:                   0.641
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1271.304
Skew:                           1.254   Prob(JB):                    8.71e-277
Kurtosis:                       4.002   Cond. No.                     2.39e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

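coef = pd.DataFrame(regression.params, columns=['coef'])
# Add the baseline grid slope to every coefficient; for the interaction rows kept
# below, the sum is that constructor's own grid slope
coef += coef['coef']['grid']
# Keep the last 19 rows: the grid:constructorRef interaction terms
coef[-len(coef)//2+1:]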

Most of these constructor-specific slopes are greater than zero, indicating a positive relationship between grid position and final position for most constructors. We can also see from the prior data analysis that the constructor seems to affect how many points drivers earn. Since points depend on finishing position, this suggests that constructor is worth including as another variable, and as an interaction term, when relating grid position (the starting position) to final position in the race. Note, however, that none of the individual interaction terms in the summary above is significant at the 5% level, so the evidence that the slope differs by constructor is weak.
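
To make the slope arithmetic concrete, each constructor's grid slope is the baseline grid coefficient plus that constructor's grid:constructorRef interaction coefficient. The sketch below copies a handful of coefficients from the OLS summary above (rounded as printed there); the dictionary and variable names are ours.

```python
# Baseline 'grid' coefficient from the OLS summary above
BASE_GRID_SLOPE = 0.2753

# A few grid:constructorRef interaction coefficients, copied from the same table
interaction = {
    "mercedes": 0.1801,
    "ferrari": 0.2620,
    "red_bull": 0.0438,
    "williams": 0.1763,
    "hrt": -0.4198,
    "manor": -0.4135,
}

# Constructor-specific slope = baseline slope + interaction term
slopes = {name: round(BASE_GRID_SLOPE + delta, 4) for name, delta in interaction.items()}

for name, slope in slopes.items():
    print(f"{name}: {slope}")
```

For the front-running teams the combined slope is clearly positive, while for backmarkers such as HRT and Manor it comes out slightly negative, which is why only most, not all, of the slopes are above zero.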

models = []
models.append(('Random Forest', ensemble.RandomForestClassifier()))
models.append(('SVC', svm.SVC()))
# Evaluate each model in turn with 10-fold cross-validation
cv_scores = []  # separate name so the results dataframe loaded earlier is not overwritten
names = []
for name, model in models:
  kfold = model_selection.KFold(n_splits=10)
  cv_results = model_selection.cross_val_score(model, x, y, cv=kfold, scoring='accuracy')
  cv_scores.append(cv_results)
  names.append(name)
  # The "error" printed here is the standard deviation of the per-fold accuracies
  print(name + ' accuracy: ' + str(cv_results.mean()) + ' error: ' + str(cv_results.std()))

Random Forest accuracy: 0.20234267051877905 error: 0.029137952071180185
SVC accuracy: 0.20521177102008656 error: 0.04442189646411429

With 10-fold cross-validation, the random forest classifier reached 20.2 percent accuracy and the SVM reached 20.5 percent. The fairly low accuracy of both models reflects the large variability in final position for any given grid position. In the scatter plots, we can see many different final positions stacked above the same grid position. This suggests that many other factors affect the relation between grid position and final position, including pit stop time, race strategy, the car, and its constructor. The "error" reported for each model is the standard deviation of the accuracy across folds. It was fairly low, 0.029 for the random forest and 0.044 for the SVM, which shows that the models perform consistently from fold to fold. However, it remains difficult to predict exactly where each driver will finish because of the many factors described.
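
When reading accuracy figures like these, two small calculations help: the mean and spread of the per-fold scores, and the accuracy of a baseline that always predicts the most common class. The sketch below uses made-up fold scores and made-up finishing positions purely to show the arithmetic; none of the numbers come from our data.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical per-fold accuracies (not the values from our models)
fold_accuracies = [0.18, 0.22, 0.20, 0.21, 0.19]

# Hypothetical finishing positions, where 25 stands for a car that did not finish
labels = [25, 25, 25, 1, 2, 3, 4, 5, 6, 7]

mean_acc = mean(fold_accuracies)
# Population standard deviation, the same definition NumPy's .std() uses by default
spread = pstdev(fold_accuracies)

# Baseline: always predict the most frequent label
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_acc = majority_count / len(labels)

print(f"mean accuracy: {mean_acc:.3f}, spread: {spread:.3f}")
print(f"always predicting {majority_label} scores {baseline_acc:.2f}")
```

Comparing a model's mean accuracy against such a baseline on the real labels would show how much of the roughly 20 percent comes from grid position and how much from the most common outcome alone.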

Conclusion¶

Insights Gained¶

The amount of data captured, analyzed, and used to design, build, and drive Formula 1 cars is astounding. It is a global sport followed by millions of people worldwide, and it is fascinating to see drivers pushing their limits in these vehicles to become the fastest racers in the world! They could not reach this point without the work and dedication of the car constructors, as there would be no racing without them.

We gained a number of insights from this project. First, we saw how Mercedes have been the dominant team in the hybrid era of F1 in both the Constructors' and Drivers' Championships. Additionally, we used regression and machine learning to examine how closely grid position and final position are related, including constructor as an interaction term. Our conclusion was that the constructor can affect whether a driver gains or loses places between the grid and the finish. Overall, there is a correlation between grid position and final position, but final position is not easy to predict from just a couple of variables because of the many other factors in F1 racing.

We learned a lot about the sport thanks to this dataset and worked through many stages of the data science pipeline presented throughout our data science course, CMSC320. Hopefully this tutorial provided some valuable insight for those new to the sport and F1 veterans alike. Even if you're not interested in Formula 1, a lot of what we covered applies to other datasets, since data science is all about tidying datasets, preparing them for further analysis, and finally plotting and explaining any relevant visualizations or models.

Future Work¶

In this tutorial, we identified the most dominant F1 constructors and drivers of the last 10 years. Future work could include analyzing F1 data from throughout the sport's history in order to determine the most dominant driver of all time. This could prove to be an interesting area of research because F1 cars have changed drastically since the sport began in 1950, and accounting for these changes might reveal surprising results. Another potential research area would be to use existing F1 data to predict the results of a future race or season and compare the predictions to the actual standings for that race or season.

Link to GitHub Repository¶

https://github.com/rnarla123/rnarla123.github.io

	raceId	year	round	circuitId	name_x	date	time_x	url_x	resultId	driverId	...	url_y	status	driverRef	number_y	code	forename	surname	dob	nationality_y	url
0	1	2009	1	1	Australian Grand Prix	2009-03-29	06:00:00	http://en.wikipedia.org/wiki/2009_Australian_G...	7554	18	...	http://en.wikipedia.org/wiki/Brawn_GP	Finished	button	22	BUT	Jenson	Button	1980-01-19	British	http://en.wikipedia.org/wiki/Jenson_Button
1	2	2009	2	2	Malaysian Grand Prix	2009-04-05	09:00:00	http://en.wikipedia.org/wiki/2009_Malaysian_Gr...	7574	18	...	http://en.wikipedia.org/wiki/Brawn_GP	Finished	button	22	BUT	Jenson	Button	1980-01-19	British	http://en.wikipedia.org/wiki/Jenson_Button
2	3	2009	3	17	Chinese Grand Prix	2009-04-19	07:00:00	http://en.wikipedia.org/wiki/2009_Chinese_Gran...	7596	18	...	http://en.wikipedia.org/wiki/Brawn_GP	Finished	button	22	BUT	Jenson	Button	1980-01-19	British	http://en.wikipedia.org/wiki/Jenson_Button
3	4	2009	4	3	Bahrain Grand Prix	2009-04-26	12:00:00	http://en.wikipedia.org/wiki/2009_Bahrain_Gran...	7614	18	...	http://en.wikipedia.org/wiki/Brawn_GP	Finished	button	22	BUT	Jenson	Button	1980-01-19	British	http://en.wikipedia.org/wiki/Jenson_Button
4	5	2009	5	4	Spanish Grand Prix	2009-05-10	12:00:00	http://en.wikipedia.org/wiki/2009_Spanish_Gran...	7634	18	...	http://en.wikipedia.org/wiki/Brawn_GP	Finished	button	22	BUT	Jenson	Button	1980-01-19	British	http://en.wikipedia.org/wiki/Jenson_Button
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
24955	784	1956	1	25	Argentine Grand Prix	1956-01-22	\N	http://en.wikipedia.org/wiki/1956_Argentine_Gr...	20269	806	...	http://en.wikipedia.org/wiki/Maserati	+10 Laps	oscar_gonzalez	\N	\N	Óscar	González	1923-11-10	Uruguayan	http://en.wikipedia.org/wiki/Oscar_Gonz%C3%A1l...
24956	726	1963	8	46	United States Grand Prix	1963-10-06	\N	http://en.wikipedia.org/wiki/1963_United_State...	17583	448	...	http://en.wikipedia.org/wiki/Stebro	+22 Laps	broeker	\N	\N	Peter	Broeker	1926-05-15	Canadian	http://en.wikipedia.org/wiki/Peter_Broeker
24957	833	1950	1	9	British Grand Prix	1950-05-13	\N	http://en.wikipedia.org/wiki/1950_British_Gran...	20045	790	...	http://en.wikipedia.org/wiki/English_Racing_Au...	Supercharger	leslie_johnson	\N	\N	Leslie	Johnson	1912-03-22	British	http://en.wikipedia.org/wiki/Leslie_Johnson_(r...
24958	815	1953	8	66	Swiss Grand Prix	1953-08-23	\N	http://en.wikipedia.org/wiki/1953_Swiss_Grand_...	19598	719	...	http://en.wikipedia.org/wiki/Hersham_and_Walto...	+16 Laps	scherrer	\N	\N	Albert	Scherrer	1908-02-28	Swiss	http://en.wikipedia.org/wiki/Albert_Scherrer
24959	809	1953	2	19	Indianapolis 500	1953-05-30	\N	http://en.wikipedia.org/wiki/1953_Indianapolis...	20204	804	...	http://en.wikipedia.org/wiki/Kurtis_Kraft	+24 Laps	mantz	\N	\N	Johnny	Mantz	1918-09-18	American	http://en.wikipedia.org/wiki/Johnny_Mantz

	raceId	year	round	circuitId	name	date	time	url
839	841	2011	1	1	Australian Grand Prix	2011-03-27	06:00:00	http://en.wikipedia.org/wiki/2011_Australian_G...
840	842	2011	2	2	Malaysian Grand Prix	2011-04-10	08:00:00	http://en.wikipedia.org/wiki/2011_Malaysian_Gr...
841	843	2011	3	17	Chinese Grand Prix	2011-04-17	07:00:00	http://en.wikipedia.org/wiki/2011_Chinese_Gran...
842	844	2011	4	5	Turkish Grand Prix	2011-05-08	12:00:00	http://en.wikipedia.org/wiki/2011_Turkish_Gran...
843	845	2011	5	4	Spanish Grand Prix	2011-05-22	12:00:00	http://en.wikipedia.org/wiki/2011_Spanish_Gran...
...	...	...	...	...	...	...	...	...
1030	1043	2020	13	21	Emilia Romagna Grand Prix	2020-11-01	12:10:00	https://en.wikipedia.org/wiki/2020_Emilia_Roma...
1031	1044	2020	14	5	Turkish Grand Prix	2020-11-15	10:10:00	https://en.wikipedia.org/wiki/2020_Turkish_Gra...
1032	1045	2020	15	3	Bahrain Grand Prix	2020-11-29	14:10:00	https://en.wikipedia.org/wiki/2020_Bahrain_Gra...
1033	1046	2020	16	3	Sakhir Grand Prix	2020-12-06	17:10:00	https://en.wikipedia.org/wiki/2020_Sakhir_Gran...
1034	1047	2020	17	24	Abu Dhabi Grand Prix	2020-12-13	13:10:00	https://en.wikipedia.org/wiki/2020_Abu_Dhabi_G...

	circuit_name	date	lap_time	driverRef	code
10	Abu Dhabi Grand Prix	2020-12-13	1:40.926	ricciardo	RIC
164	Sakhir Grand Prix	2020-12-06	0:55.404	russell	RUS
38	Bahrain Grand Prix	2020-11-29	1:32.014	max_verstappen	VER
186	Turkish Grand Prix	2020-11-15	1:36.806	norris	NOR
87	Emilia Romagna Grand Prix	2020-11-01	1:15.484	hamilton	HAM

	year	code	driverRef	counts	round	percent
0	2020	HAM	Hamilton	5	17.0	29.411765
1	2019	HAM	Hamilton	6	21.0	28.571429
2	2018	BOT	Bottas	7	21.0	33.333333
3	2017	HAM	Hamilton	7	20.0	35.000000
4	2016	ROS	Rosberg	6	21.0	28.571429
5	2015	HAM	Hamilton	7	19.0	36.842105
6	2014	HAM	Hamilton	6	19.0	31.578947
7	2013	VET	Vettel	7	19.0	36.842105
8	2012	VET	Vettel	6	20.0	30.000000
9	2011	WEB	Webber	7	19.0	36.842105

	coef
grid:constructorRef[T.alphatauri]	0.411790
grid:constructorRef[T.caterham]	-0.046704
grid:constructorRef[T.ferrari]	0.537285
grid:constructorRef[T.force_india]	0.383128
grid:constructorRef[T.haas]	0.341813
grid:constructorRef[T.hrt]	-0.144501
grid:constructorRef[T.lotus_f1]	0.511789
grid:constructorRef[T.lotus_racing]	0.236891
grid:constructorRef[T.manor]	-0.138199
grid:constructorRef[T.marussia]	-0.105427
grid:constructorRef[T.mclaren]	0.517392
grid:constructorRef[T.mercedes]	0.455405
grid:constructorRef[T.racing_point]	0.448790
grid:constructorRef[T.red_bull]	0.319052
grid:constructorRef[T.renault]	0.534060
grid:constructorRef[T.sauber]	0.147669
grid:constructorRef[T.toro_rosso]	0.107363
grid:constructorRef[T.virgin]	0.489510
grid:constructorRef[T.williams]	0.451579