Formula One (also known as Formula 1 or F1) is the highest class of international auto racing for single-seater racing cars sanctioned by the Fédération Internationale de l'Automobile (FIA). Two World Championships are contested over the course of each F1 season: the Drivers' World Championship and the Constructors' World Championship. Each F1 constructor builds its own car and fields two drivers. Points earned by both drivers of a constructor count toward the Constructors' Championship, while each driver's individual points count toward the Drivers' Championship. F1 is a team sport, but because every driver is also competing for an individual championship, intense battles often break out even between teammates. Today, F1 is a multibillion-dollar annual industry, ranking behind only the quadrennial Football World Cup and Summer Olympic Games in terms of live television audience (Benson). The cars, which have effectively become mobile advertising billboards, race fortnightly in front of a global audience of motor sport fans (527 million across 187 countries in 2010) who are “up to three times more brand loyal than fans of other sports” (Autosport).
In this tutorial, our goal is to combine and analyze the data we found in order to provide insight into which car constructors and drivers are the most dominant in Formula 1. For readers unfamiliar with the sport, we hope this analysis will get them interested in watching Formula 1 races and provide insight into which teams are performing the best. For those already familiar with F1, we hope to show how well their favorite teams and/or drivers have been performing in recent years (or how badly they're being beaten).
For this tutorial, we're using Python 3 along with the following imported libraries: Folium, Matplotlib, NumPy, pandas, PyWaffle, SciPy, seaborn, scikit-learn, and statsmodels.
# Install third-party packages that may be missing from the environment
!pip install folium
!pip install pywaffle
# Standard library imports
import csv
import io
from io import BytesIO
import warnings
# Third-party imports
import folium
import folium.plugins as plg
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pywaffle import Waffle
import requests
from scipy import stats
import seaborn as sns
from sklearn import datasets, ensemble, linear_model, metrics, model_selection, svm
from sklearn.model_selection import cross_val_predict, train_test_split
from statsmodels.formula.api import ols
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
The files we're using are originally from this Kaggle dataset. From the dataset, we uploaded copies of each CSV to our GitHub repository for ease of access (you need to have a Kaggle account in order to download the CSVs from the above link).
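These CSV files were cleaned prior to retrieval, so very little data cleaning is necessary on our part. The main exception is the \N placeholder used for missing values, which we handle at the point where it matters.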
races = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/races.csv')
results = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/results.csv')
constructors = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/constructors.csv')
status = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/status.csv')
drivers = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/drivers.csv')
circuits = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/circuits.csv')
laptimes = pd.read_csv('https://raw.githubusercontent.com/rnarla123/rnarla123.github.io/master/archive/lap_times.csv')
r = races
# Merging other dataframes
r = r.merge(results, on='raceId')
r = r.merge(constructors, on='constructorId')
r = r.merge(status, on='statusId')
r = r.merge(drivers, on='driverId')
# Placeholder for deleting unused columns (the list is currently empty, so nothing is removed)
r = r.drop(columns=[])
r
There are X columns in the above dataframe 'r'. We have:
The titles of each column should be self-explanatory. If you'd like more information about a column, important topics have links to additional information.
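Because several of the source tables share column names, chained merges add suffixes to the overlapping columns, which is where names such as name_x come from later in this tutorial. The sketch below illustrates this on small made-up tables (races_toy, results_toy, and constructors_toy are hypothetical stand-ins, not the real CSVs) and passes explicit suffixes, which is one possible way to keep the merged columns readable.

```python
import pandas as pd

# Toy stand-ins for races.csv, results.csv and constructors.csv (illustrative values only)
races_toy = pd.DataFrame({"raceId": [1, 2], "name": ["Race A", "Race B"]})
results_toy = pd.DataFrame(
    {"raceId": [1, 2], "constructorId": [10, 20], "points": [25.0, 18.0]}
)
constructors_toy = pd.DataFrame({"constructorId": [10, 20], "name": ["Team X", "Team Y"]})

# Both races_toy and constructors_toy have a "name" column, so the second
# merge must disambiguate them; explicit suffixes replace the default _x/_y.
merged = (
    races_toy
    .merge(results_toy, on="raceId")
    .merge(constructors_toy, on="constructorId", suffixes=("_race", "_constructor"))
)

print(merged.shape)          # (rows, columns) of the merged frame
print(list(merged.columns))  # column names, including the suffixed ones
```

Printing the shape and column list of the real dataframe r in the same way is a quick check on what the merge produced.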
r = r[r.year > 2010]
races = races[races.year > 2010]
races
We decided to specifically analyze the past 10 years of F1 data, rather than starting from the 1950s as the original dataframes do, to more easily determine current domination in the F1 scene. Additionally, for further analysis between grid placements and position placements, we thought the past 10 years (with more than 400 data points) would suffice and would more accurately represent the current state of F1.
Current regulations set by the FIA specify that a full championship season “must include Competitions taking place on at least three continents during the same season.”
This requires F1 teams to travel a lot during a season, racing in different countries and tracks. Below is an interactive map showing the different tracks teams have raced at since the start of the sport in 1950.
circuits_map = folium.Map(zoom_start=13)
map_cluster = plg.MarkerCluster().add_to(circuits_map)
# Add one clustered marker per circuit
for idx, row in circuits.iterrows():
    folium.Marker(
        location=[row['lat'], row['lng']],
        icon=folium.Icon(color='cadetblue', prefix='fa', icon='flag-checkered')
    ).add_to(map_cluster)
circuits_map
We can also analyze the distribution of races throughout the different continents.
# The number of races in each continent
num_continent = {
    'Europe': 38,
    'Asia': 17,
    'North America': 6,
    'South America': 3,
    'Africa': 3,
    'Australia': 2
}
fig = plt.figure(
    figsize=(14, 16),
    FigureClass=Waffle,
    rows=5,
    values=num_continent,
    colors=sns.color_palette("viridis", len(num_continent)).as_hex(),
    title={'label': 'Distribution of F1 circuits across different continents', 'size': 18},
    labels=["{0} ({1})".format(k, v) for k, v in num_continent.items()],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    icons='flag-checkered', icon_size=45,
    icon_legend=True
)
We observe that F1 has visited more circuits in Europe than on any other continent, followed by Asia. North America ranks third in terms of the number of F1 circuits. One likely reason is that F1 racing is not nearly as popular in the United States as other motorsports such as NASCAR. This can create a feedback loop: F1 rarely races in the United States because it has relatively few fans here, and the scarcity of races in turn limits the growth of that fan base.
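To put numbers on this distribution, the counts from the waffle chart can be converted into percentage shares. The short sketch below reuses the same hand-entered counts; the names total_circuits and shares are introduced here for illustration only.

```python
# Circuit counts per continent, copied from the waffle chart above
num_continent = {
    "Europe": 38,
    "Asia": 17,
    "North America": 6,
    "South America": 3,
    "Africa": 3,
    "Australia": 2,
}

total_circuits = sum(num_continent.values())

# Percentage share of circuits per continent, rounded to one decimal place
shares = {
    continent: round(100 * count / total_circuits, 1)
    for continent, count in num_continent.items()
}

for continent, share in shares.items():
    print(f"{continent}: {share}%")
```

With these counts, Europe alone accounts for a little over half of all circuits.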
With the 2020 Formula One season coming to an end at Abu Dhabi a few weeks ago, we are curious to see which constructor and driver dominated F1.
First, let us observe how different constructors performed during the course of the last 10 years.
plt.rc('font', size=10) # controls default text sizes
plt.rc('axes', titlesize=15) # fontsize of the axes title
plt.rc('axes', labelsize=15) # fontsize of the x and y labels
plt.rc('xtick', labelsize=10) # fontsize of the tick labels
plt.rc('ytick', labelsize=10) # fontsize of the tick labels
plt.rc('legend', fontsize=15) # legend fontsize
plt.rc('figure', titlesize=15) # fontsize of the figure title
# Draw one scatter plot of season totals per constructor
for constructor in r.constructorRef.unique():
    total = r.loc[r.constructorRef == constructor]
    lst = []
    for y in total.year.unique():
        # Sum the points scored by all of this constructor's cars in season y
        season_points = total[total.year == y].points.sum()
        lst.append([y, season_points])
    df = pd.DataFrame(data=lst, columns=['year', 'total points'])
    df.plot.scatter(x='year', y='total points', title=constructor.capitalize(), sharex=True, xlim=(2011, 2020), ylim=(0, 1000), s=100, c='blue', figsize=(12, 7), xlabel='Year', ylabel='Total Points')
In these scatter plots, we've mapped out the total points for each constructor in each season. A constructor's total for a year is the sum of the points scored by every driver who raced for it that year. Looking across the plots, it is easy to see that Mercedes has dominated F1 for approximately the past seven years, scoring the most points. Ferrari and Red Bull come close in some years, but overall Mercedes has scored more points than any other constructor.
An important observation here is that Mercedes' dominance in the Constructors' Championship did not start until 2014. That was when F1 introduced a major regulation change, replacing the older V8 engines with hybrid V6 engines. Since this regulation change, Mercedes have proven to be unstoppable, making them the team to beat in the hybrid era of F1.
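The per-season totals described above can also be computed in a single grouped aggregation rather than with nested loops. The following is a minimal sketch on a made-up table (toy and its values are hypothetical, not taken from the dataset) that has the same constructorRef, year, and points columns as r.

```python
import pandas as pd

# Toy results: one row per car per race (illustrative values only)
toy = pd.DataFrame({
    "constructorRef": ["mercedes", "mercedes", "ferrari", "ferrari", "mercedes", "ferrari"],
    "year": [2019, 2019, 2019, 2019, 2020, 2020],
    "points": [25.0, 18.0, 15.0, 12.0, 25.0, 10.0],
})

# Sum points over every row that shares a constructor and a season
season_points = (
    toy.groupby(["constructorRef", "year"], as_index=False)["points"]
    .sum()
    .rename(columns={"points": "total_points"})
)

print(season_points)
```

The result has one row per constructor and season, which can then be plotted per constructor or pivoted into a year-by-constructor table.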
We can also look at the average points earned per race for each constructor.
m1 = pd.merge(results, constructors, on='constructorId')
m2 = pd.merge(m1, races, on='raceId')
# Copy the filtered rows so that adding a column does not write to a view of m2
result_v2 = m2[m2.year > 2010].copy()
result_v2["constructor"] = result_v2["name_x"]
# Aggregate total points and average points per race
avg_pts = result_v2[['constructor','points']].groupby("constructor").mean()
total_pts = result_v2[['constructor','points']].groupby("constructor").sum()
n = result_v2[['constructor','raceId']].groupby("constructor").count()
num_races = n[n.raceId > 100]
d = pd.merge(avg_pts, total_pts, on='constructor')
md = pd.merge(d, num_races, on='constructor')
md = md.reset_index()
plt.figure(figsize=(20, 10))
plt.scatter(md.points_x, md.raceId, s=md.points_y*6, alpha=0.5, color=sns.color_palette("magma", len(md)))
plt.xlim(0, 17)
plt.ylim(0, 500)
plt.xlabel("Average Points Per Race")
plt.ylabel("Races")
# Label each bubble with the constructor name
for x, y, z in zip(md.points_x, md.raceId, md.constructor):
    plt.annotate(z, xy=(x-1, y-1))
Here, we look at average points rather than the total points shown in the previous visualization. It is now even clearer that Mercedes has dominated the past 10 years, as it also has the highest average points per race. Red Bull and Ferrari come in second and third, but Mercedes still leads by a significant margin. The average number of points earned per race by these three teams is more than double that of the other teams. This visualization also lets us pick out underperforming constructors. Note that the vertical axis counts race entries (one row per car per race), so a constructor that fielded two cars in every race of the period appears with roughly 400. Despite having about as many entries as Mercedes and Red Bull, McLaren and Williams are far behind them in average points per race. Lotus F1 even outperforms McLaren and Williams with under 200 entries, while McLaren and Williams have approximately 400.
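The three separate groupby calls and two merges used for this figure can be expressed as one named aggregation. The sketch below runs on a made-up table (toy, Team X, and Team Y are hypothetical) and also shows the difference between counting result rows and counting distinct races, which matters when reading the vertical axis.

```python
import pandas as pd

# Toy results: Team X fields two cars per race, Team Y only one (illustrative values)
toy = pd.DataFrame({
    "constructor": ["Team X", "Team X", "Team X", "Team X", "Team Y", "Team Y"],
    "raceId": [1, 1, 2, 2, 1, 2],
    "points": [25.0, 18.0, 15.0, 12.0, 10.0, 0.0],
})

summary = toy.groupby("constructor").agg(
    avg_points=("points", "mean"),   # mean over result rows (one per car per race)
    total_points=("points", "sum"),
    entries=("raceId", "count"),     # number of result rows
    races=("raceId", "nunique"),     # number of distinct races entered
)

print(summary)
```

Here Team X has four entries but only two races, so a filter on the row count is not the same as a filter on the number of races.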
Similar to how we looked at the average points per race for the constructors, let us look at the average points per race for the individual drivers.
Note: Drivers with 100 or fewer race entries in this period have been excluded from the figure.
m1 = pd.merge(results, drivers, on='driverId')
m2 = pd.merge(m1, races, on='raceId')
# Copy the filtered rows so that adding a column does not write to a view of m2
result_v2 = m2[m2.year > 2010].copy()
result_v2["driver"] = result_v2["forename"] + " " + result_v2["surname"]
# Aggregate total points and average points per race
avg_pts = result_v2[['driver','points']].groupby("driver").mean()
total_pts = result_v2[['driver','points']].groupby("driver").sum()
n=result_v2[['driver','raceId']].groupby("driver").count()
num_races=n[n.raceId > 100]
d = pd.merge(avg_pts, total_pts, on='driver')
md = pd.merge(d, num_races, on='driver')
md = md.reset_index()
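# Manual corrections to the race counts (column 3, raceId) for the drivers in rows 7 and 6.
# These positional indices depend on the current row order of md.
md.iloc[7, 3] = 180
md.iloc[6, 3] = 125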
plt.figure(figsize=(20, 10))
plt.scatter(md.points_x, md.raceId, s=md.points_y*6, alpha=0.5, color=sns.color_palette("viridis", len(md)))
plt.xlim(0, 18)
plt.ylim(100, 240)
plt.xlabel("Average Points Per Race")
plt.ylabel("Races")
# Label each bubble with the driver name
for x, y, z in zip(md.points_x, md.raceId, md.driver):
    plt.annotate(z, xy=(x-1, y-1))
Here we can see that the three drivers with the most points per race on average are Lewis Hamilton, Sebastian Vettel, and Nico Rosberg. It should therefore come as no surprise that these three drivers won every Drivers' World Championship in our time period. Sebastian Vettel won the title four times in a row from 2010 to 2013, after which Lewis Hamilton began his dominance, winning every championship from 2014 to 2020 except 2016, when his teammate Nico Rosberg took the title. Although Hamilton and Vettel clearly dominated most of the decade, Max Verstappen has emerged as one of the top contenders, with relatively few races but a high average of points per race.
Now that we know the drivers who earn the most points on average, let us look at the fastest drivers on the grid.
fastest_data = pd.merge(laptimes, races, on='raceId', how='left')
fastest_data.columns
fastest_data = fastest_data[['raceId', 'driverId', 'time_x', 'milliseconds','year', 'round', 'circuitId', 'name', 'date']]
fastest_data.rename(columns={'time_x':'lap_time', 'name':'circuit_name'}, inplace=True)
fastest_data = pd.merge(fastest_data, drivers, on='driverId', how='left')
fastest_data = pd.merge(fastest_data, circuits, on='circuitId', how='left')
fastest_data = fastest_data[['raceId', 'driverId', 'lap_time', 'milliseconds', 'year', 'round',
'circuitId', 'circuit_name', 'date', 'driverRef', 'number', 'code',
'forename', 'surname', 'dob', 'nationality', 'circuitRef', 'location', 'country']]
data = pd.merge(fastest_data.groupby(['circuit_name','date']).lap_time.min().to_frame().reset_index(), fastest_data[['circuit_name','date','lap_time', 'driverRef','code']], on=['circuit_name','date','lap_time'], how='left')
data = data.sort_values(by='date', ascending = False)
data.head(5)
data['year'] = pd.DatetimeIndex(data.date).year
data['counts'] = 1
data = data.groupby(['year', 'code', 'driverRef']).counts.count().to_frame().reset_index().sort_values(by='year', ascending=False)
# fastest = data.loc[data.groupby(['year'])['occ'].idxmax()]
fastest = pd.merge(data, data.groupby(['year'])['counts'].max().to_frame(name='max').reset_index(), on='year', how='left')
fastest = fastest[fastest['counts'] == fastest['max']][['year','code','driverRef','counts']]
fastest.driverRef = fastest.driverRef.str.capitalize()
# Calculate the percentage of fastest lap per season
fastest = pd.merge(fastest, fastest_data.groupby('year')['round'].max().reset_index(), on='year', how='left')
fastest['percent'] = np.array(fastest['counts'])/np.array(fastest['round'])*100
fastest['year'] = fastest['year'].astype(str)
fastest
The above dataframe shows the driver (or drivers, in the case of a tie) who set the most fastest laps in each F1 season. The percent column is the share of that season's races in which the driver set the fastest lap, that is, counts divided by the number of rounds in the season.
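One caveat about the computation above: the fastest lap is found by taking the minimum of the lap time strings, and strings are compared character by character. This works while every lap is under ten minutes, but a lap of ten minutes or more would sort ahead of a faster one. The sketch below, on made-up lap data (the laps table and its values are hypothetical), shows the problem and one possible alternative that compares the numeric milliseconds column instead.

```python
import pandas as pd

# Toy lap times for two races; race 2 contains one very slow lap (illustrative values)
laps = pd.DataFrame({
    "raceId": [1, 1, 1, 2, 2, 2],
    "driverId": [1, 2, 1, 1, 2, 2],
    "time": ["1:31.000", "1:30.500", "1:30.800", "10:01.000", "1:25.000", "1:24.900"],
    "milliseconds": [91000, 90500, 90800, 601000, 85000, 84900],
})

# String comparison ranks "10:01.000" before "1:24.900" because "0" sorts before ":"
string_min_race_2 = min(laps.loc[laps["raceId"] == 2, "time"])

# Numeric comparison: take the row with the smallest milliseconds value in each race
fastest_idx = laps.groupby("raceId")["milliseconds"].idxmin()
fastest_laps = laps.loc[fastest_idx, ["raceId", "driverId", "time"]]

# Count how many races each driver was fastest in
fastest_counts = fastest_laps["driverId"].value_counts()

print(string_min_race_2)
print(fastest_laps)
print(fastest_counts)
```

Whether the real dataset contains such long laps within the selected years is not something this sketch checks; it only shows why the numeric column is the safer key.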
Let us visualize the above data so that it is easier to understand.
fig, ax = plt.subplots(figsize=(12,16))
fig.set_facecolor('#FFFFFF')
ax.set_facecolor('#FFFFFF')
ax.hlines(fastest.year, xmin=0, xmax=fastest.percent, linestyle='dotted')
groups = fastest[['year','percent','driverRef']].groupby('driverRef')
colors = sns.color_palette("magma", len(fastest.code.unique()))
# Plot one marker series per driver so that each driver gets a legend entry
for (name, group), color in zip(groups, colors):
    ax.plot(group.percent, group.year, marker='o', color=color, linestyle='', ms=12, label=name)
ax.set_xlim(0,65)
ax.legend()
# Label each point with the driver code and the number of fastest laps
for x, y, label, count in zip(fastest.percent, fastest.year, fastest.code, fastest.counts):
    ax.annotate(label+'({} races)'.format(count), xy=(x+0.8,y), textcoords='data')
plt.xlabel('Percentage of Fastest Lap Wins(%)')
plt.title('Who is the fastest driver in each season?', fontsize=18)
plt.show()
The above figure shows that Hamilton has led a season in fastest laps more often than any other driver. It is thus no surprise that he went on to dominate the hybrid era of F1, in which he won six of his seven World Championship titles. This data is also consistent with our assessment of constructor performance, since Mercedes drivers (Lewis Hamilton, Valtteri Bottas, and Nico Rosberg) have been the fastest drivers during the period of Mercedes dominance. In the years before 2014 we see Vettel as one of the fastest, which helps explain how he was able to win the Drivers' World Championship from 2010 to 2013.
r['position'] = r['position'].replace('\\N', '25')
r['position'] = pd.to_numeric(r['position'])
ax = r.plot.scatter(x='grid', y='position', figsize=(12,7))
m, b = np.polyfit(r.grid, r.position, 1)
# Draw the fitted line y = m*x + b over the scatter plot
plt.plot(r.grid, m*r.grid + b, color='black')
Since the position column initially contains \N for cars that did not finish the race, we decided to place them 25th. We then converted the column from strings to numeric values in order to generate a scatter plot. A line of best fit is drawn to show the general trend. Even without it, we can see that, although the points are fairly scattered, there is a general positive correlation between grid position and final position.
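Placing every non-finisher 25th is a modelling choice, and it can influence the fitted line. The sketch below uses a tiny made-up sample (the grid and raw_position values are hypothetical) to compare a fit that includes the placeholder with a fit on finishers only. It uses pd.to_numeric with errors="coerce" as one alternative way to convert the \N marker, which is an assumption of this sketch rather than the method used above.

```python
import numpy as np
import pandas as pd

# Toy data: five cars, the last of which did not finish (illustrative values)
grid = np.array([1, 2, 3, 4, 5], dtype=float)
raw_position = pd.Series(["1", "2", "3", "4", "\\N"])

# Convert to numbers; anything unparseable (here the \N marker) becomes NaN
finished = pd.to_numeric(raw_position, errors="coerce")

# Variant 1: treat non-finishers as 25th, as in the tutorial
position_all = finished.fillna(25).to_numpy()
m_all, b_all = np.polyfit(grid, position_all, 1)

# Variant 2: fit only the cars that finished
mask = finished.notna().to_numpy()
m_fin, b_fin = np.polyfit(grid[mask], position_all[mask], 1)

print("with DNF placed 25th: slope", m_all, "intercept", b_all)
print("finishers only:       slope", m_fin, "intercept", b_fin)
```

In this toy sample a single retirement from the back of the grid changes the slope considerably, so it can be worth checking how sensitive the real fit is to the placeholder value.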
regr = linear_model.LinearRegression()
x = np.array(r['grid']).reshape((-1, 1))
y = np.array(r['position'])
model = regr.fit(x,y)
plt.figure(figsize=(12,7))
plt.scatter(x, y)
plt.plot(x, regr.predict(x), color='black')
plt.xlabel('grid')
plt.ylabel('position')
print('coefficient of determination:', model.score(x, y))
print('intercept:', model.intercept_)
print('slope:', model.coef_)
print('y = ' + str(model.intercept_) + ' + ' + str(model.coef_[0]) + 'x')
From this graph, using the linear regression model, we can clearly see a positive slope relating grid position to final position. Although this does not indicate causation, as there are many other factors that can determine the final position in a race, it does indicate that grid position plays a part. Generally, the grid position is a good measure of a driver's and car's speed when no one else is on the track. During the race, however, many factors such as the car, the constructor, the driver's skill in traffic, pit stop times, and strategy can contribute to the final position of each car.
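The coefficient of determination printed above measures the share of the variance in final position that the fitted line explains: R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations from the mean. The sketch below computes it by hand with NumPy on a small made-up sample (the x and y values are hypothetical) to show what the score represents.

```python
import numpy as np

# Toy grid and final positions (illustrative values only)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 4, 3, 5], dtype=float)

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

ss_res = np.sum((y - predicted) ** 2)   # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r_squared = 1 - ss_res / ss_tot

# For a simple linear regression, R^2 equals the squared Pearson correlation
pearson_r = np.corrcoef(x, y)[0, 1]

print("slope:", slope, "intercept:", intercept)
print("R^2:", r_squared, "r^2:", pearson_r ** 2)
```

A value near 1 would mean grid position almost determines the result, while a value near 0 would mean it explains very little.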
regression = ols(formula='position ~ grid + constructorRef + grid*constructorRef', data=r).fit()
print(regression.summary())
coef = pd.DataFrame(regression.params, columns=['coef'])
coef += coef['coef']['grid']
coef[-len(coef)//2+1:]
Most of these slopes are greater than zero, indicating that, for most constructors, a worse grid position is associated with a worse final position. Each value shown is the baseline grid slope plus that constructor's interaction coefficient, so differences between the values reflect how the relationship between grid and final position varies by constructor. This suggests that, in most cases, constructor is worth including as an interaction term. We can see from the prior data analysis that constructor seems to have an effect on how many points drivers earn. Since points correlate with placement, constructor could be included as another variable when discussing grid placement (the initial placement) and final placement in the race.
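To see why the baseline grid slope is added to each interaction coefficient, it helps to fit a small interaction model by hand. The sketch below uses NumPy least squares on made-up, noise-free data for two hypothetical constructors, A (the baseline) and B; it is an illustration of the idea and not the statsmodels fit used above.

```python
import numpy as np

# Toy data: constructor A follows position = 1 + 0.5 * grid,
# constructor B follows position = 2 + 1.0 * grid (illustrative, noise-free)
grid = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
is_b = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
position = np.where(is_b == 1, 2 + 1.0 * grid, 1 + 0.5 * grid)

# Design matrix columns: intercept, constructor dummy, grid, grid x constructor
X = np.column_stack([np.ones_like(grid), is_b, grid, grid * is_b])
beta, *_ = np.linalg.lstsq(X, position, rcond=None)
intercept, b_offset, grid_slope, interaction = beta

# The slope for the baseline constructor is the grid coefficient itself;
# the slope for B is the grid coefficient plus the interaction coefficient.
slope_a = grid_slope
slope_b = grid_slope + interaction

print("slope for A:", slope_a)
print("slope for B:", slope_b)
```

The interaction coefficient on its own is only the difference from the baseline, which is why it must be combined with the grid coefficient before it can be read as a slope.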
models = []
models.append(('Random Forest', ensemble.RandomForestClassifier()))
models.append(('SVC', svm.SVC()))
# Evaluate each model in turn with 10-fold cross-validation
cv_scores = []  # named cv_scores so the results dataframe loaded earlier is not overwritten
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, x, y, cv=kfold, scoring='accuracy')
    cv_scores.append(cv_results)
    names.append(name)
    # The reported "error" is the standard deviation of the accuracy across folds
    print(name + ' accuracy: ' + str(cv_results.mean()) + ' error: ' + str(cv_results.std()))
With 10-fold cross-validation, the random forest classifier reached 20.2 percent accuracy and the SVM reached 20.5 percent. The fairly low accuracy of both models reflects the large variability in final position for any given grid position. In the scatter plots, many different final positions are stacked above the same grid position. This suggests that many other factors influence the relationship between grid position and final position, including pit stop times, race strategy, the car, and its constructor. The value reported as error is the standard deviation of the accuracy across folds: 0.031 for the random forest and 0.044 for the SVM. These small values indicate that both models perform consistently from fold to fold, which is in line with a real but weak relationship between grid and final position. Predicting exactly where each racer will finish remains difficult because of the many factors described above.
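An accuracy of around 20 percent is easier to interpret next to a simple baseline. One possible baseline, sketched below on made-up numbers (the grid and position arrays are hypothetical), predicts that every car finishes exactly where it started and also reports two more forgiving measures: the share of predictions within one place and the mean absolute error in places.

```python
import numpy as np

# Toy grid and final positions; 25 marks a car that did not finish (illustrative values)
grid = np.array([1, 2, 3, 4, 5, 6])
position = np.array([1, 3, 2, 4, 25, 6])

# Baseline prediction: each car finishes in its grid position
predicted = grid
abs_error = np.abs(predicted - position)

exact_accuracy = np.mean(abs_error == 0)   # share of exactly correct predictions
within_one = np.mean(abs_error <= 1)       # share correct to within one place
mean_abs_error = abs_error.mean()          # average miss, in places

print("exact accuracy:", exact_accuracy)
print("within one place:", within_one)
print("mean absolute error:", mean_abs_error)
```

Exact-match accuracy treats a prediction that is one place off the same as one that is ten places off, so a distance-based measure may describe the usefulness of grid position more fairly.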
The amount of data captured, analyzed, and used to design, build, and drive Formula 1 cars is astounding. F1 is a global sport followed by millions of people worldwide, and it is fascinating to see drivers pushing their limits in these vehicles to become the fastest racers in the world! Of course, they could not reach this point without the work and dedication of the car constructors, as there would be no racing without them. We learned a lot about the sport thanks to this dataset and worked through many stages of the data science pipeline presented in our data science course, CMSC320. Hopefully this tutorial provided some valuable insight for those new to the sport and F1 veterans alike. Even if you're not interested in Formula 1, much of what we covered applies to other datasets, since data science is all about tidying datasets, preparing them for further analysis, and finally plotting and explaining any relevant visualizations or models.
We gained a number of insights from this project. First, we saw that Mercedes has been the dominant team of the hybrid era of F1 in both the Constructors' and Drivers' Championships. Additionally, we used a machine learning approach to determine how closely grid position and final position are related, and we included constructor as an interaction term. The conclusion we drew was that the constructor can affect whether a car gains or loses places between the grid and the finish. Overall, there is a correlation between grid position and final position, but the final position is not easy to predict from just a couple of variables because of the many other factors in F1 racing.
In this tutorial, we identified the most dominant F1 drivers in the last 10 years of races. Future work could include analyzing F1 data from the sport's entire history in order to determine the most dominant driver of all time. This could prove to be an interesting area of research because F1 cars have changed drastically since the sport first came into being in 1950, and accounting for these changes might reveal surprising results. Another potential research area would be to use existing F1 data to predict the results of a future race or season and compare the predictions to the actual standings for that race or season.