Formula One Racing

Rahul Kiefer, Rahul Narla, Hrishik Rajendra


Introduction


Formula One (also known as Formula 1 or F1) is the highest class of international auto racing for single-seater racing cars sanctioned by the Fédération Internationale de l'Automobile (FIA). Two World Championships are contested over the course of an F1 season: the Drivers' World Championship and the Constructors' World Championship. Each F1 constructor builds its own car and fields two drivers. Points earned by drivers racing for the same constructor count toward the Constructors' Championship, while each driver's individual points count toward the Drivers' Championship. Although F1 is a team sport, each driver is also competing for an individual title, which leads to intense battles even between teammates. Today, F1 is a multibillion-dollar annual industry, ranking behind only the quadrennial Football World Cup and Summer Olympic Games in terms of live television audience (Benson). The cars, which have effectively become mobile advertising billboards, race fortnightly in front of a global audience of motor sport fans (527 million across 187 countries in 2010) who are “up to three times more brand loyal than fans of other sports” (Autosport).

In this tutorial, our goal is to combine and analyze the data we found in order to provide insight into which car constructors and drivers are the most dominant in Formula 1. For readers unfamiliar with the sport, we hope this analysis will get them interested in watching Formula 1 races and provide insight into which teams are performing the best. For those already familiar with F1, we hope to show how well their favorite teams and/or drivers have been performing in recent years (or how badly they're being beaten).


Data Curation, Parsing, and Management

Library and Module Imports

For this tutorial, we're using Python 3 along with the following imported libraries: Folium, Matplotlib, NumPy, pandas, PyWaffle, SciPy, seaborn, scikit-learn, and statsmodels.

In [1]:
# Standard library imports
import csv
import io
from io import BytesIO
import warnings
warnings.filterwarnings('ignore')

# Third-party imports (folium and pywaffle are installed first in case they are missing)
!pip install folium
!pip install pywaffle
import folium
import folium.plugins as plg
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pywaffle import Waffle
import requests
from scipy import stats
import seaborn as sns
from sklearn import datasets, ensemble, linear_model, metrics, model_selection, svm
from sklearn.model_selection import cross_val_predict, train_test_split
from statsmodels.formula.api import ols

Retrieving the Data

The files we're using are originally from this Kaggle dataset. From the dataset, we uploaded copies of each CSV to our GitHub repository for ease of access (you need to have a Kaggle account in order to download the CSVs from the above link).

These CSV files have been cleaned prior to retrieval, so no data cleaning/modification is necessary on our part.
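One caveat worth checking: the dataframe previews further down show \N in some columns (for example, the race time of older races), which appears to be the dataset's placeholder for missing values. The following is a minimal sketch, assuming that convention and using a made-up two-row sample rather than the real files, of how pd.read_csv can convert those placeholders into proper missing values at load time. The name sample_csv is ours and purely illustrative.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout of races.csv; not real data.
sample_csv = io.StringIO(
    "raceId,year,name,time\n"
    "1,2009,Australian Grand Prix,06:00:00\n"
    "833,1950,British Grand Prix,\\N\n"
)

# na_values tells pandas to treat the literal string \N as missing (NaN).
sample = pd.read_csv(sample_csv, na_values=["\\N"])

# One flag per row: True where the time was a \N placeholder.
print(sample["time"].isna().tolist())
```

We do not do this in the cells below, so the \N strings stay in the dataframes and are handled where they matter.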

In [2]:
races = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/races.csv')
results = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/results.csv')
constructors = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/constructors.csv')
status = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/status.csv')
drivers = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/drivers.csv')
circuits = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/circuits.csv')
laptimes = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/lap_times.csv')
In [3]:
r = races

# Merging other dataframes
r = r.merge(results, on='raceId')
r = r.merge(constructors, on='constructorId')
r = r.merge(status, on='statusId')
r = r.merge(drivers, on='driverId')

# Placeholder for dropping unused columns (the list is empty, so all 38 columns are kept)
r = r.drop(columns=[])

r
Out[3]:
raceId year round circuitId name_x date time_x url_x resultId driverId ... url_y status driverRef number_y code forename surname dob nationality_y url
0 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... 7554 18 ... http://en.wikipedia.org/wiki/Brawn_GP Finished button 22 BUT Jenson Button 1980-01-19 British http://en.wikipedia.org/wiki/Jenson_Button
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... 7574 18 ... http://en.wikipedia.org/wiki/Brawn_GP Finished button 22 BUT Jenson Button 1980-01-19 British http://en.wikipedia.org/wiki/Jenson_Button
2 3 2009 3 17 Chinese Grand Prix 2009-04-19 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... 7596 18 ... http://en.wikipedia.org/wiki/Brawn_GP Finished button 22 BUT Jenson Button 1980-01-19 British http://en.wikipedia.org/wiki/Jenson_Button
3 4 2009 4 3 Bahrain Grand Prix 2009-04-26 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran... 7614 18 ... http://en.wikipedia.org/wiki/Brawn_GP Finished button 22 BUT Jenson Button 1980-01-19 British http://en.wikipedia.org/wiki/Jenson_Button
4 5 2009 5 4 Spanish Grand Prix 2009-05-10 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran... 7634 18 ... http://en.wikipedia.org/wiki/Brawn_GP Finished button 22 BUT Jenson Button 1980-01-19 British http://en.wikipedia.org/wiki/Jenson_Button
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24955 784 1956 1 25 Argentine Grand Prix 1956-01-22 \N http://en.wikipedia.org/wiki/1956_Argentine_Gr... 20269 806 ... http://en.wikipedia.org/wiki/Maserati +10 Laps oscar_gonzalez \N \N Óscar González 1923-11-10 Uruguayan http://en.wikipedia.org/wiki/Oscar_Gonz%C3%A1l...
24956 726 1963 8 46 United States Grand Prix 1963-10-06 \N http://en.wikipedia.org/wiki/1963_United_State... 17583 448 ... http://en.wikipedia.org/wiki/Stebro +22 Laps broeker \N \N Peter Broeker 1926-05-15 Canadian http://en.wikipedia.org/wiki/Peter_Broeker
24957 833 1950 1 9 British Grand Prix 1950-05-13 \N http://en.wikipedia.org/wiki/1950_British_Gran... 20045 790 ... http://en.wikipedia.org/wiki/English_Racing_Au... Supercharger leslie_johnson \N \N Leslie Johnson 1912-03-22 British http://en.wikipedia.org/wiki/Leslie_Johnson_(r...
24958 815 1953 8 66 Swiss Grand Prix 1953-08-23 \N http://en.wikipedia.org/wiki/1953_Swiss_Grand_... 19598 719 ... http://en.wikipedia.org/wiki/Hersham_and_Walto... +16 Laps scherrer \N \N Albert Scherrer 1908-02-28 Swiss http://en.wikipedia.org/wiki/Albert_Scherrer
24959 809 1953 2 19 Indianapolis 500 1953-05-30 \N http://en.wikipedia.org/wiki/1953_Indianapolis... 20204 804 ... http://en.wikipedia.org/wiki/Kurtis_Kraft +24 Laps mantz \N \N Johnny Mantz 1918-09-18 American http://en.wikipedia.org/wiki/Johnny_Mantz

24960 rows × 38 columns

There are 38 columns in the above dataframe 'r'. The ones most relevant to our analysis are:

  • Year - Year of Race
  • Name - Name of F1 race
  • Grid - Initial Position
  • Position - Final Position
  • Points - Points Awarded
  • Constructor Ref - Car Constructor
  • Driver Ref - Driver
  • Forename and surname - Driver Name

The column titles should be self-explanatory. If you'd like more information about a column, the important topics link to additional information.
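The _x and _y endings on names such as name_x and url_y in the preview above come from the merges: when two dataframes share a column name that is not the join key, pandas appends suffixes to tell them apart. The sketch below reproduces this with made-up toy frames (races_toy, results_toy, and constructors_toy are hypothetical names, not part of our dataset) and shows how the suffixes parameter of merge can produce more descriptive names.

```python
import pandas as pd

# Toy frames with made-up values; races and constructors both carry a 'name' column.
races_toy = pd.DataFrame({"raceId": [1, 2],
                          "name": ["Australian Grand Prix", "Malaysian Grand Prix"]})
results_toy = pd.DataFrame({"raceId": [1, 2],
                            "constructorId": [10, 10],
                            "points": [10.0, 8.0]})
constructors_toy = pd.DataFrame({"constructorId": [10], "name": ["Brawn"]})

# Default behavior: the clashing 'name' columns become name_x (left) and name_y (right).
merged = races_toy.merge(results_toy, on="raceId").merge(constructors_toy, on="constructorId")
print(list(merged.columns))

# Explicit suffixes make the origin of each column obvious.
named = races_toy.merge(results_toy, on="raceId").merge(
    constructors_toy, on="constructorId", suffixes=("_race", "_constructor"))
print(list(named.columns))
```

In our merged dataframe, the same mechanism is why the race name ends up in name_x and the constructor name in name_y.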

In [4]:
r = r[r.year > 2010]
races = races[races.year > 2010]
races
Out[4]:
raceId year round circuitId name date time url
839 841 2011 1 1 Australian Grand Prix 2011-03-27 06:00:00 http://en.wikipedia.org/wiki/2011_Australian_G...
840 842 2011 2 2 Malaysian Grand Prix 2011-04-10 08:00:00 http://en.wikipedia.org/wiki/2011_Malaysian_Gr...
841 843 2011 3 17 Chinese Grand Prix 2011-04-17 07:00:00 http://en.wikipedia.org/wiki/2011_Chinese_Gran...
842 844 2011 4 5 Turkish Grand Prix 2011-05-08 12:00:00 http://en.wikipedia.org/wiki/2011_Turkish_Gran...
843 845 2011 5 4 Spanish Grand Prix 2011-05-22 12:00:00 http://en.wikipedia.org/wiki/2011_Spanish_Gran...
... ... ... ... ... ... ... ... ...
1030 1043 2020 13 21 Emilia Romagna Grand Prix 2020-11-01 12:10:00 https://en.wikipedia.org/wiki/2020_Emilia_Roma...
1031 1044 2020 14 5 Turkish Grand Prix 2020-11-15 10:10:00 https://en.wikipedia.org/wiki/2020_Turkish_Gra...
1032 1045 2020 15 3 Bahrain Grand Prix 2020-11-29 14:10:00 https://en.wikipedia.org/wiki/2020_Bahrain_Gra...
1033 1046 2020 16 3 Sakhir Grand Prix 2020-12-06 17:10:00 https://en.wikipedia.org/wiki/2020_Sakhir_Gran...
1034 1047 2020 17 24 Abu Dhabi Grand Prix 2020-12-13 13:10:00 https://en.wikipedia.org/wiki/2020_Abu_Dhabi_G...

196 rows × 8 columns

We decided to analyze only the past 10 seasons of F1 data (2011 to 2020), rather than starting from the 1950s as the original dataframes do, so that we can more easily determine who currently dominates F1. For the later analysis of the relationship between grid position and finishing position, we also felt that these 10 seasons (with more than 400 data points) would suffice and would more accurately represent the current state of F1.


Exploratory Data Analysis

Formula One is a Global Sport

Current regulations set by the FIA specify that a full championship season “must include Competitions taking place on at least three continents during the same season.”

This requires F1 teams to travel extensively during a season, racing in different countries and on different tracks. Below is an interactive map showing the tracks teams have raced at since the start of the sport in 1950.

In [5]:
circuits_map = folium.Map(zoom_start=13)
map_cluster = plg.MarkerCluster().add_to(circuits_map)
for idx, row in circuits.iterrows():
    folium.Marker(
        location=[row['lat'], row['lng']],
        icon=folium.Icon(color='cadetblue', prefix='fa', icon='flag-checkered')
        ).add_to(map_cluster)

circuits_map

We can also analyze the distribution of races throughout the different continents.

In [6]:
# The number of races in each continent
num_continent= {
  'Europe':38,
  'Asia':17,
  'North America':6,
  'South America':3,
  'Africa':3,
  'Australia':2
}
In [7]:
fig = plt.figure(
    figsize = (14,16),
    FigureClass=Waffle, 
    rows=5, 
    values=num_continent, 
    colors=sns.color_palette("viridis",len(num_continent)).as_hex(),
    title={'label': 'Distribution of F1 circuits across different continents', 'size':18},
    labels=["{0} ({1})".format(k, v) for k, v in num_continent.items()],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    icons='flag-checkered', icon_size=45, 
    icon_legend=True
)

We observe that F1 has visited the most circuits in Europe, followed by Asia. North America ranks third in terms of the number of F1 circuits. Unfortunately, F1 racing is not nearly as popular in the United States as other motorsports such as NASCAR. This creates a feedback loop: F1 rarely races in the United States because of the lack of fans here, and the lack of races in the United States in turn makes it harder to build a fan base.
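To put numbers on that comparison, the sketch below converts the circuit counts into percentage shares. The counts are copied from the dictionary defined above; the names circuit_counts and shares are ours.

```python
# Circuit counts per continent, copied from the num_continent dictionary above.
circuit_counts = {
    "Europe": 38,
    "Asia": 17,
    "North America": 6,
    "South America": 3,
    "Africa": 3,
    "Australia": 2,
}

total = sum(circuit_counts.values())

# Share of all circuits located on each continent, in percent (one decimal place).
shares = {name: round(100 * count / total, 1) for name, count in circuit_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

By this count, Europe hosts a little over half of all circuits, while North America accounts for less than a tenth.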

Who dominated F1 in the last 10 years?

With the 2020 Formula One season coming to an end at Abu Dhabi a few weeks ago, we are curious to see which constructor and driver dominated F1.

First, let us observe how different constructors performed during the course of the last 10 years.

In [8]:
plt.rc('font', size=10)          # controls default text sizes
plt.rc('axes', titlesize=15)     # fontsize of the axes title
plt.rc('axes', labelsize=15)     # fontsize of the x and y labels
plt.rc('xtick', labelsize=10)    # fontsize of the tick labels
plt.rc('ytick', labelsize=10)    # fontsize of the tick labels
plt.rc('legend', fontsize=15)    # legend fontsize
plt.rc('figure', titlesize=15)   # fontsize of the figure title
for constructor in r.constructorRef.unique():
  total = r.loc[r.constructorRef == constructor]
  # Sum the points of every driver entered by this constructor, season by season
  lst = []
  for y in total.year.unique():
    season_points = total[total.year == y].points.sum()
    lst.append([y, season_points])
  df = pd.DataFrame(data=lst, columns=['year', 'total points'])
  df.plot.scatter(x='year', y='total points', title=constructor.capitalize(), sharex=True, xlim=(2011, 2020), ylim=(0, 1000), s=100, c='blue', figsize=(12, 7), xlabel='Year', ylabel='Total Points')

In these scatter plots we've mapped out the total points for each constructor in each season. A constructor's total for a season is the sum of the points scored by every driver who raced for it that year. Looking across the plots, it is easy to see that Mercedes has dominated F1 for approximately the past seven years. Ferrari and Red Bull come close in some years, but overall Mercedes has scored more points than every other constructor.
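The same season totals can also be computed without explicit loops. The following is a sketch on made-up rows (the toy dataframe and its values are hypothetical, chosen only to mimic two drivers per constructor per race), using groupby and unstack to get one row per constructor and one column per season.

```python
import pandas as pd

# Made-up result rows: two drivers per constructor, one race per season.
toy = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "constructorRef": ["mercedes", "mercedes", "ferrari", "ferrari"] * 2,
    "points": [25, 18, 15, 12, 25, 18, 0, 10],
})

# Sum points per constructor and season, then pivot seasons into columns.
season_totals = toy.groupby(["constructorRef", "year"])["points"].sum().unstack("year")
print(season_totals)
```

Each cell is the combined score of both drivers, which is how the Constructors' Championship tally described in the introduction works.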

An important observation here is that Mercedes' dominance of the Constructors' Championship did not start until 2014. That was the year F1 made a major regulation change, replacing the older V8 engines used by constructors with hybrid V6 engines. Since this regulation change, Mercedes have proven to be unstoppable, making them the team to beat in the hybrid era of F1.

We can also look at the average points earned per race for each constructor.

In [9]:
m1 = pd.merge(results, constructors, on='constructorId')
m2 = pd.merge(m1, races, on='raceId')
result_v2 = m2[m2.year > 2010].copy()  # copy so the new column is added to an independent dataframe
result_v2["constructor"] = result_v2["name_x"]

# Aggregate total points and average points per race
avg_pts = result_v2[['constructor','points']].groupby("constructor").mean()
total_pts = result_v2[['constructor','points']].groupby("constructor").sum()
n = result_v2[['constructor','raceId']].groupby("constructor").count()
num_races = n[n.raceId > 100]
d = pd.merge(avg_pts, total_pts, on='constructor')
md = pd.merge(d, num_races, on='constructor')
md = md.reset_index()

plt.figure(figsize=(20, 10))
plt.scatter(md.points_x, md.raceId, s=md.points_y*6, alpha=0.5, color=sns.color_palette("magma", len(md)))
plt.xlim(0, 17)
plt.ylim(0, 500)

plt.xlabel("Average Points Per Race")
plt.ylabel("Races")

for x, y, z in zip(md.points_x, md.raceId, md.constructor):
  plt.annotate(z, xy=(x-1,y-1)) 

Here, we look at average points instead of the total points from the previous visualization. It is now even clearer that Mercedes has dominated the past 10 years, as it also has the highest average points per race. Red Bull and Ferrari come in second and third, but Mercedes still leads by a significant margin. The average points earned by these three teams per race are more than double those earned by the other teams. This visualization also lets us pick out underperforming constructors. Note that the "Races" axis counts rows in the results table, so every car a constructor enters in a race counts once. McLaren and Williams have about as many entries (approximately 400) as Mercedes and Red Bull, yet they sit far behind them in average points per race. Lotus F1 even outperforms McLaren and Williams despite having fewer than 200 entries.
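The three aggregates behind this figure (average points, total points, and number of entries) can also be produced in a single groupby call. The sketch below uses made-up data and hypothetical column names for the outputs; it additionally counts distinct races with nunique to show how entries and races differ when a constructor runs two cars.

```python
import pandas as pd

# Made-up results: constructor A enters two cars per race, constructor B enters one.
toy = pd.DataFrame({
    "constructor": ["A", "A", "A", "A", "B", "B"],
    "raceId": [1, 1, 2, 2, 1, 2],
    "points": [25, 18, 18, 15, 0, 1],
})

# Named aggregation: each keyword becomes an output column.
summary = toy.groupby("constructor").agg(
    avg_points=("points", "mean"),    # average points per entry
    total_points=("points", "sum"),   # total points scored
    entries=("raceId", "count"),      # one per car per race
    races=("raceId", "nunique"),      # distinct races entered
).reset_index()
print(summary)

# Keep only constructors above an entry threshold, as the cell above does with 100.
frequent = summary[summary["entries"] > 3]
```

Here constructor A has four entries but only two races, which is the same two-to-one ratio we would expect for a full-time two-car team in the real data.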

Just as we looked at the average points per race for the constructors, let us now look at the average points per race for the individual drivers.
Note: Drivers with 100 or fewer races in this period have been excluded from the figure.

In [10]:
m1 = pd.merge(results, drivers, on='driverId')
m2 = pd.merge(m1, races, on='raceId')
result_v2 = m2[m2.year > 2010].copy()  # copy so the new column is added to an independent dataframe
result_v2["driver"] = result_v2["forename"] + " " + result_v2["surname"]

# Aggregate total points and average points per race
avg_pts = result_v2[['driver','points']].groupby("driver").mean()
total_pts = result_v2[['driver','points']].groupby("driver").sum()
n=result_v2[['driver','raceId']].groupby("driver").count()
num_races=n[n.raceId > 100]
d = pd.merge(avg_pts, total_pts, on='driver')
md = pd.merge(d, num_races, on='driver')
md = md.reset_index()
md.iloc[7,3] = 180  #data correction
md.iloc[6,3] = 125  #data correction

plt.figure(figsize=(20, 10))
plt.scatter(md.points_x, md.raceId, s=md.points_y*6, alpha=0.5, color=sns.color_palette("viridis", len(md)))
plt.xlim(0, 18)
plt.ylim(100, 240)

plt.xlabel("Average Points Per Race")
plt.ylabel("Races")

for x, y, z in zip(md.points_x, md.raceId, md.driver):
  plt.annotate(z, xy=(x-1, y-1)) 

Here we can see that the three drivers with the highest average points per race are Lewis Hamilton, Sebastian Vettel, and Nico Rosberg. It should therefore come as no surprise that these three drivers won every Drivers' World Championship in our time period. Sebastian Vettel won four consecutive titles from 2010 to 2013, after which Lewis Hamilton began his dominance, winning every championship from 2014 to 2020 except 2016, when his teammate Nico Rosberg took the title. Though Hamilton and Vettel have clearly dominated most of the decade, Max Verstappen has emerged as one of the top contenders, with relatively few races but a high average points per race.

Now that we know the drivers who earn the most points on average, let us look at the fastest drivers on the grid.

In [11]:
# Attach race metadata (year, round, circuit) to every recorded lap
fastest_data = pd.merge(laptimes, races, on='raceId', how='left')
fastest_data = fastest_data[['raceId', 'driverId', 'time_x', 'milliseconds','year', 'round', 'circuitId', 'name', 'date']]
fastest_data.rename(columns={'time_x':'lap_time', 'name':'circuit_name'}, inplace=True)
fastest_data = pd.merge(fastest_data, drivers, on='driverId', how='left')
fastest_data = pd.merge(fastest_data, circuits, on='circuitId', how='left')

fastest_data = fastest_data[['raceId', 'driverId', 'lap_time', 'milliseconds', 'year', 'round',
       'circuitId', 'circuit_name', 'date', 'driverRef', 'number', 'code',
       'forename', 'surname', 'dob', 'nationality', 'circuitRef', 'location', 'country']]

data = pd.merge(fastest_data.groupby(['circuit_name','date']).lap_time.min().to_frame().reset_index(), fastest_data[['circuit_name','date','lap_time', 'driverRef','code']], on=['circuit_name','date','lap_time'], how='left')
data = data.sort_values(by='date', ascending = False)

data.head(5)
Out[11]:
circuit_name date lap_time driverRef code
10 Abu Dhabi Grand Prix 2020-12-13 1:40.926 ricciardo RIC
164 Sakhir Grand Prix 2020-12-06 0:55.404 russell RUS
38 Bahrain Grand Prix 2020-11-29 1:32.014 max_verstappen VER
186 Turkish Grand Prix 2020-11-15 1:36.806 norris NOR
87 Emilia Romagna Grand Prix 2020-11-01 1:15.484 hamilton HAM
In [12]:
data['year'] = pd.DatetimeIndex(data.date).year
data['counts'] = 1
data = data.groupby(['year', 'code', 'driverRef']).counts.count().to_frame().reset_index().sort_values(by='year', ascending=False)

# fastest = data.loc[data.groupby(['year'])['occ'].idxmax()]
fastest = pd.merge(data, data.groupby(['year'])['counts'].max().to_frame(name='max').reset_index(), on='year', how='left')
fastest = fastest[fastest['counts'] == fastest['max']][['year','code','driverRef','counts']]
fastest.driverRef = fastest.driverRef.str.capitalize()

# Calculate the percentage of fastest lap per season 
fastest = pd.merge(fastest, fastest_data.groupby('year')['round'].max().reset_index(), on='year', how='left')
fastest['percent'] = np.array(fastest['counts'])/np.array(fastest['round'])*100
fastest['year'] = fastest['year'].astype(str)
fastest
Out[12]:
year code driverRef counts round percent
0 2020 HAM Hamilton 5 17.0 29.411765
1 2019 HAM Hamilton 6 21.0 28.571429
2 2018 BOT Bottas 7 21.0 33.333333
3 2017 HAM Hamilton 7 20.0 35.000000
4 2016 ROS Rosberg 6 21.0 28.571429
5 2015 HAM Hamilton 7 19.0 36.842105
6 2014 HAM Hamilton 6 19.0 31.578947
7 2013 VET Vettel 7 19.0 36.842105
8 2012 VET Vettel 6 20.0 30.000000
9 2011 WEB Webber 7 19.0 36.842105

The above dataframe shows the driver who set the most fastest laps in each F1 season. The percent column is the share of that season's races in which the driver set the fastest lap, that is, counts divided by the number of rounds in the season.
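The logic of the two cells above can be summarized on a tiny example. The sketch below uses a made-up three-race season (the fastest_laps dataframe and its values are hypothetical) and follows the same steps: count fastest laps per driver, keep the driver or drivers with the highest count in each season, and divide by the number of rounds.

```python
import pandas as pd

# Made-up fastest-lap holders for a three-race season.
fastest_laps = pd.DataFrame({
    "year": [2020, 2020, 2020],
    "round": [1, 2, 3],
    "code": ["HAM", "VER", "HAM"],
})

# Number of fastest laps per driver and season.
counts = fastest_laps.groupby(["year", "code"]).size().rename("counts").reset_index()

# Season length, taken as the highest round number in that year.
season_len = fastest_laps.groupby("year")["round"].max().rename("races").reset_index()

top = counts.merge(season_len, on="year")
# Keep every driver who matches the season maximum (ties are all retained).
top = top[top["counts"] == top.groupby("year")["counts"].transform("max")].copy()
top["percent"] = top["counts"] / top["races"] * 100
print(top)
```

With two fastest laps out of three races, the toy season leader ends up at roughly 67 percent, computed the same way as the percent column in our table.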

Let us visualize the above data so that it is easier to understand.

In [13]:
fig, ax = plt.subplots(figsize=(12,16))
fig.set_facecolor('#FFFFFF')
ax.set_facecolor('#FFFFFF')

ax.hlines(fastest.year, xmin=0, xmax=fastest.percent, linestyle='dotted')

groups = fastest[['year','percent','driverRef']].groupby('driverRef')
colors=sns.color_palette("magma", len(fastest.code.unique()))

for (name, group), color in zip(groups, colors):
  ax.plot(group.percent, group.year, marker='o', color=color, linestyle='', ms=12, label=name)
ax.set_xlim(0,65)
ax.legend()

for x,y, label, count in zip(fastest.percent, fastest.year, fastest.code, fastest.counts):
  ax.annotate(label+'({} races)'.format(count), xy=(x+0.8,y), textcoords='data')
  #ax.annotate('(%s, %s)' % xy, xy=xy, textcoords='data')

plt.xlabel('Percentage of Fastest Lap Wins(%)')
plt.title('Who is the fastest driver in each season?', fontsize=18)

plt.show()

The above figure shows that Hamilton has been the driver with the most fastest laps in a season more often than anyone else. It is therefore no surprise that he went on to dominate the hybrid era of F1, winning six of his seven World Championship titles in that period. This data is also consistent with our assessment of constructor performance: Mercedes drivers (Lewis Hamilton, Valtteri Bottas, and Nico Rosberg) have been the fastest drivers throughout the period of Mercedes dominance. In the years before 2014, Vettel was among the fastest, which helps explain how he won the Drivers' World Championship from 2010 to 2013.


Hypothesis Testing and Machine Learning

In [14]:
r['position'] = r['position'].replace('\\N', '25')
r['position'] = pd.to_numeric(r['position'])
ax = r.plot.scatter(x='grid', y='position', figsize=(12,7))
# Fit a degree-1 polynomial: m is the slope, b is the intercept
m, b = np.polyfit(r.grid, r.position, 1)
plt.plot(r.grid, m*r.grid + b, color='black')

Since the position column initially contains \N for cars that did not finish the race, we decided to place them 25th. We then converted the column from strings to numeric values so that we could generate a scatter plot. A line of best fit is drawn to show the general trend. Even without it, we can see that, although the points are fairly scattered, there is a general positive correlation between grid position and final position.
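For readers unfamiliar with np.polyfit, the sketch below shows how a degree-1 fit yields the line of best fit. The data is made up and perfectly linear so that the recovered slope and intercept are easy to verify; grid and finish here are toy arrays, not our dataframe columns.

```python
import numpy as np

# Made-up, perfectly linear data: finish = 2 * grid + 1.
grid = np.array([1, 2, 3, 4, 5], dtype=float)
finish = 2.0 * grid + 1.0

# A degree-1 fit returns the coefficients highest power first: slope, then intercept.
m, b = np.polyfit(grid, finish, 1)

# The fitted line evaluated at each grid value.
fitted = m * grid + b
print(m, b)
```

The line to plot is always m * x + b; on real, noisy data the fitted values will not coincide with the observations as they do here.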

In [15]:
regr = linear_model.LinearRegression() 
x = np.array(r['grid']).reshape((-1, 1)) 
y = np.array(r['position'])
model = regr.fit(x,y)
plt.figure(figsize=(12,7))
plt.scatter(x, y)
plt.plot(x, regr.predict(x), color='black') 
plt.xlabel('grid')
plt.ylabel('position')

print('coefficient of determination:', model.score(x, y)) 
print('intercept:', model.intercept_)
print('slope:', model.coef_)
print('y = ' + str(model.intercept_) + ' + ' + str(model.coef_[0]) + 'x')
coefficient of determination: 0.25561533281933446
intercept: 5.3774473684577355
slope: [0.61252332]
y = 5.3774473684577355 + 0.6125233247484766x

From this graph and the linear regression model, we can clearly see a positive slope relating grid position to the final position attained after the race. Although this does not indicate causation, as there are many other factors that can determine final position in a race, it does suggest that grid position plays a role. Generally, the initial grid placement is a good measure of a driver's and car's speed when no one else is on the track. During the race, however, regardless of starting position, many factors such as the car, the constructor, the driver's skill in traffic, pit stop times, and strategy contribute to each car's final position.
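To make the fitted equation concrete, the sketch below plugs a few grid slots into it. The intercept and slope are copied from the printed output above; the helper name predicted_finish is ours and purely illustrative.

```python
# Coefficients copied from the regression output above.
intercept = 5.3774473684577355
slope = 0.6125233247484766

def predicted_finish(grid_slot):
    """Predicted finishing position for a given starting grid slot."""
    return intercept + slope * grid_slot

for grid_slot in (1, 10, 20):
    print(grid_slot, round(predicted_finish(grid_slot), 2))
```

The model predicts roughly sixth place for the pole-sitter and between 17th and 18th for a car starting 20th. These predictions are pulled toward the middle partly because non-finishers are coded as 25th, and the coefficient of determination of about 0.26 is a reminder that grid position alone explains only a modest share of the variation.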

In [16]:
regression = ols(formula='position ~ grid + constructorRef + grid*constructorRef', data=r).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               position   R-squared:                       0.313
Model:                            OLS   Adj. R-squared:                  0.306
Method:                 Least Squares   F-statistic:                     48.32
Date:                Mon, 21 Dec 2020   Prob (F-statistic):          2.40e-302
Time:                        16:06:06   Log-Likelihood:                -13635.
No. Observations:                4181   AIC:                         2.735e+04
Df Residuals:                    4141   BIC:                         2.760e+04
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
=======================================================================================================
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                              10.0180      2.444      4.098      0.000       5.226      14.810
constructorRef[T.alphatauri]           -3.2917      4.393     -0.749      0.454     -11.904       5.321
constructorRef[T.caterham]              9.3355      6.729      1.387      0.165      -3.856      22.527
constructorRef[T.ferrari]              -5.7588      2.511     -2.294      0.022     -10.681      -0.836
constructorRef[T.force_india]          -2.8117      2.685     -1.047      0.295      -8.076       2.453
constructorRef[T.haas]                  0.8053      2.796      0.288      0.773      -4.676       6.286
constructorRef[T.hrt]                  14.3101      3.938      3.634      0.000       6.589      22.031
constructorRef[T.lotus_f1]             -2.7949      2.715     -1.029      0.303      -8.118       2.528
constructorRef[T.lotus_racing]          4.0821     16.024      0.255      0.799     -27.334      35.499
constructorRef[T.manor]                10.1858      5.047      2.018      0.044       0.290      20.081
constructorRef[T.marussia]             10.3343      5.136      2.012      0.044       0.266      20.403
constructorRef[T.mclaren]              -3.6411      2.548     -1.429      0.153      -8.637       1.355
constructorRef[T.mercedes]             -6.0365      2.483     -2.431      0.015     -10.904      -1.169
constructorRef[T.racing_point]         -3.5833      3.034     -1.181      0.238      -9.532       2.366
constructorRef[T.red_bull]             -4.2935      2.496     -1.720      0.086      -9.188       0.601
constructorRef[T.renault]              -2.7199      2.709     -1.004      0.315      -8.030       2.590
constructorRef[T.sauber]                3.0009      2.758      1.088      0.277      -2.406       8.408
constructorRef[T.toro_rosso]            2.9296      2.663      1.100      0.271      -2.292       8.151
constructorRef[T.virgin]               -0.7418     17.621     -0.042      0.966     -35.288      33.804
constructorRef[T.williams]             -2.0046      2.570     -0.780      0.435      -7.042       3.033
grid                                    0.2753      0.165      1.667      0.096      -0.048       0.599
grid:constructorRef[T.alphatauri]       0.1365      0.351      0.389      0.698      -0.552       0.825
grid:constructorRef[T.caterham]        -0.3220      0.364     -0.884      0.377      -1.036       0.392
grid:constructorRef[T.ferrari]          0.2620      0.183      1.432      0.152      -0.097       0.621
grid:constructorRef[T.force_india]      0.1079      0.192      0.562      0.574      -0.268       0.484
grid:constructorRef[T.haas]             0.0666      0.193      0.345      0.730      -0.312       0.445
grid:constructorRef[T.hrt]             -0.4198      0.217     -1.935      0.053      -0.845       0.006
grid:constructorRef[T.lotus_f1]         0.2365      0.192      1.230      0.219      -0.140       0.614
grid:constructorRef[T.lotus_racing]    -0.0384      0.862     -0.045      0.964      -1.728       1.651
grid:constructorRef[T.manor]           -0.4135      0.289     -1.430      0.153      -0.980       0.153
grid:constructorRef[T.marussia]        -0.3807      0.280     -1.359      0.174      -0.930       0.168
grid:constructorRef[T.mclaren]          0.2421      0.176      1.374      0.170      -0.103       0.588
grid:constructorRef[T.mercedes]         0.1801      0.179      1.005      0.315      -0.171       0.532
grid:constructorRef[T.racing_point]     0.1735      0.220      0.789      0.430      -0.258       0.605
grid:constructorRef[T.red_bull]         0.0438      0.179      0.245      0.806      -0.306       0.394
grid:constructorRef[T.renault]          0.2588      0.189      1.369      0.171      -0.112       0.630
grid:constructorRef[T.sauber]          -0.1276      0.185     -0.688      0.491      -0.491       0.236
grid:constructorRef[T.toro_rosso]      -0.1679      0.182     -0.922      0.357      -0.525       0.189
grid:constructorRef[T.virgin]           0.2143      0.834      0.257      0.797      -1.420       1.849
grid:constructorRef[T.williams]         0.1763      0.175      1.009      0.313      -0.166       0.519
==============================================================================
Omnibus:                      771.736   Durbin-Watson:                   0.641
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1271.304
Skew:                           1.254   Prob(JB):                    8.71e-277
Kurtosis:                       4.002   Cond. No.                     2.39e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [17]:
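# Collect the fitted parameters in a one-column DataFrame
coef = pd.DataFrame(regression.params, columns=['coef'])
# Add the base 'grid' slope so each interaction row becomes that constructor's total slope
coef += coef['coef']['grid']
# Show only the trailing grid:constructorRef interaction rows
coef[-len(coef)//2+1:]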
Out[17]:
coef
grid:constructorRef[T.alphatauri] 0.411790
grid:constructorRef[T.caterham] -0.046704
grid:constructorRef[T.ferrari] 0.537285
grid:constructorRef[T.force_india] 0.383128
grid:constructorRef[T.haas] 0.341813
grid:constructorRef[T.hrt] -0.144501
grid:constructorRef[T.lotus_f1] 0.511789
grid:constructorRef[T.lotus_racing] 0.236891
grid:constructorRef[T.manor] -0.138199
grid:constructorRef[T.marussia] -0.105427
grid:constructorRef[T.mclaren] 0.517392
grid:constructorRef[T.mercedes] 0.455405
grid:constructorRef[T.racing_point] 0.448790
grid:constructorRef[T.red_bull] 0.319052
grid:constructorRef[T.renault] 0.534060
grid:constructorRef[T.sauber] 0.147669
grid:constructorRef[T.toro_rosso] 0.107363
grid:constructorRef[T.virgin] 0.489510
grid:constructorRef[T.williams] 0.451579

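Each value above is the slope of final position against grid position for one constructor: the base grid coefficient plus that constructor's interaction coefficient. Most of these slopes are greater than zero, indicating a positive relationship between grid position and final position for most constructors. Because the slopes differ from one constructor to the next, this also suggests that, in most cases, constructor can be included as an interaction term. Note, however, that many of the individual interaction terms in the summary have p-values above 0.05, so the differences between constructors should be read as suggestive rather than statistically significant. We can see from the prior data analysis that constructor seems to have an effect on how many points drivers earn. Since points correlate with placement, constructor could be included as another variable when discussing grid placement (preliminary/initial placement) and final placement in the race.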
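To make the "base slope plus interaction" idea concrete, here is a minimal sketch in plain Python. It assumes a model with a separate intercept and grid slope for every constructor, which the coefficient names above suggest. In such a model, each constructor's interaction coefficient is its own slope minus the slope of the reference constructor. The team names and results below are hypothetical and are not taken from our dataset.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical (grid, final position) pairs for two made-up teams
results = {
    "team_a": [(1, 2), (3, 3), (5, 6), (8, 7), (10, 11)],
    "team_b": [(2, 4), (4, 5), (6, 5), (9, 7), (12, 8)],
}

# Fit one slope per team
slopes = {team: ols_slope(*zip(*pairs)) for team, pairs in results.items()}

# Treat team_a as the reference level: its slope plays the role of the 'grid' coefficient
base_slope = slopes["team_a"]

# The interaction coefficient for team_b is the difference from the reference slope
interaction_b = slopes["team_b"] - base_slope

# Adding the base slope back recovers team_b's total slope,
# which is what the `coef += coef['coef']['grid']` step does for every constructor
total_slope_b = base_slope + interaction_b

print(f"base slope (team_a): {base_slope:.3f}")
print(f"interaction (team_b): {interaction_b:.3f}")
print(f"total slope (team_b): {total_slope_b:.3f}")
```

A negative interaction coefficient therefore does not mean a negative relationship. It only means that team's slope is flatter than the reference team's slope.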

In [18]:
# Candidate classifiers, each with scikit-learn's default hyperparameters
models = []
models.append(('Random Forest', ensemble.RandomForestClassifier()))
models.append(('SVC', svm.SVC()))
# Evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, x, y, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    # Report the mean accuracy and its standard deviation across the 10 folds
    print(name + ' accuracy: ' + str(cv_results.mean()) + ' error: ' + str(cv_results.std()))
Random Forest accuracy: 0.20234267051877905 error: 0.029137952071180185
SVC accuracy: 0.20521177102008656 error: 0.04442189646411429

With 10-fold cross-validation, the random forest classifier reached 20.2 percent accuracy and the SVM classifier reached 20.5 percent. The fairly low accuracy in both models is due to the large variability in final positions for any given grid position. Looking at the scatter plots, we can see many points stacked on top of one another for the same grid position. This suggests that many other factors influence the relation between grid position and final position, including pit stop time, the specific strategies racers use, the type of car, and the constructor of the car. The value printed as "error" is the standard deviation of the accuracy across the 10 folds, and it was fairly low for both models: 0.029 for the random forest model and 0.044 for the SVM model. Both models therefore perform consistently from fold to fold, which is in line with a real, if weak, relation between grid and final position. However, it is difficult to determine exactly where each racer would finish due to the plethora of factors described.
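For readers who want to see what the cross-validation loop is doing, the following is a minimal sketch of k-fold evaluation written in plain Python. It is not the scikit-learn implementation. The stand-in "model" is a simple lookup table that predicts the most common final position seen for each grid position, and the race results are hypothetical. The helper names `kfold_indices` and `fit_lookup` are our own for this illustration.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def fit_lookup(pairs):
    """Map each grid position to its most common final position in the training pairs."""
    by_grid = defaultdict(Counter)
    for grid, final in pairs:
        by_grid[grid][final] += 1
    return {grid: counts.most_common(1)[0][0] for grid, counts in by_grid.items()}


# Hypothetical (grid, final position) pairs, not taken from the real dataset
races = [(1, 1), (2, 3), (3, 2), (4, 4), (1, 2), (2, 2), (3, 5), (4, 3),
         (1, 1), (2, 1), (3, 3), (4, 6), (1, 3), (2, 2), (3, 3), (4, 4),
         (1, 1), (2, 4), (3, 2), (4, 4)]

scores = []
for train_idx, test_idx in kfold_indices(len(races), n_splits=5):
    model = fit_lookup([races[i] for i in train_idx])
    # Fall back to "finishes where they started" for grid positions unseen in training
    hits = sum(model.get(races[i][0], races[i][0]) == races[i][1] for i in test_idx)
    scores.append(hits / len(test_idx))

# Mean accuracy and population standard deviation across folds
print(f"accuracy: {mean(scores):.3f}  fold-to-fold std: {pstdev(scores):.3f}")
```

The spread reported here measures how much accuracy varies between folds, not how far individual predictions are from the true finishing positions.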


Conclusion

Insights Gained

The amount of data being captured, analyzed, and used to design, build, and drive Formula 1 cars is astounding. It is a global sport followed by millions of people worldwide, and it is fascinating to see drivers pushing their limits in these vehicles to become the fastest racers in the world! They of course could not reach this point without the work and dedication of the car constructors, as there would be no racing without them. We learned a lot about the sport thanks to this data dump and worked through many stages of the data science pipeline presented to us throughout our data science course, CMSC320. Hopefully this tutorial provided some valuable insight for those new to the sport and F1 veterans alike. Even if you're not interested in Formula 1, a lot of what we covered applies to other datasets, since data science is all about tidying datasets, preparing them for further analysis, and finally plotting and explaining any relevant visualizations or models.

We were able to gain a number of insights from this project. First, we saw how Mercedes have been the dominant team in the hybrid era of F1 in both the constructor's and driver's championships. Additionally, we used a machine learning approach to determine how closely related grid position and final position are, and we included constructor as an interaction term. The conclusion drawn from this was that the constructor can affect whether the change between grid position and final position is positive or negative. Overall, there is a correlation between grid position and final position, but the final position is not easy to predict from just a couple of variables because of the plethora of other factors in F1 racing.

Future Work

In this tutorial, we identified the most dominant F1 drivers in the last 10 years of races. Future work could include analyzing F1 data from throughout the sport's history in order to determine the most dominant driver of all time. This could prove to be an interesting area of research because F1 cars have changed drastically since the sport first came into being in 1950, and accounting for these changes might reveal surprising results. Another potential research area would be to use existing F1 data to predict the results of a future race or season and compare the predictions to the actual standings for that race or season.