Section 1: Introduction to the Data

FM StatGuy
10 min read · Apr 27, 2021
A truncated view of the data the project has acquired

To introduce you to this pet project of mine, I want to begin by explaining why I’ve wanted to do this project. The Football Manager series caught my eye a few years ago because of my budding interest in the sport and my prospective career path in data science. It drew me close with the empirical nature of the game. I was fascinated by how the game created this incredibly immersive environment by feeding numbers into a match engine.

As I began playing, I was constantly learning more and more about the game. More often than not, new ‘rules of thumb’ would replace old heuristics and I would move on managing. As I became familiar with the game, this swapping of heuristics began to frustrate me somewhat; the empirical nature of Football Manager wasn’t as simple as I thought it could be. I began to watch YouTubers and read forums to sharpen my managerial skills. I loved learning about trends that the community picked up on within the noise of the Football Manager data. A lot of these experiments were extremely well done and absolutely fascinated me, but I felt that none of them ever captured the full breadth of what we could extract from the game.

Enter Zealand, leader of an ‘Elite Online Gaming Community’, who helped me get this project off the ground. I felt that this could become one of his community projects and give the community some insight into the ‘black box’ we call Football Manager. Some of these projects have been really fun to follow, especially the NewGAN Manager tool (check it out if you haven’t). I approached him to begin a data crowdsourcing initiative within his community. Each participant would take my ‘Start File’, simulate one year, and send me an export of all the players’ data: about 112,000 players per sim. The rest is history, and as of today I have around 1.3 million lines of ‘usable’ data and counting. I think this data has the potential to fill in some pretty large gaps in our understanding of the game.

In the next few months, I will be exploring trends I find in this dataset in chapters. I will be attempting to squeeze all of the useful nuggets of Football Manager Wisdom I can from this. Today, I want to take a tour of our dataset so we can familiarize ourselves with the data. I would love this project to become collaborative, so this introduction is going to be a great way for everyone to build a solid foundation of knowledge regarding our dataset.

The ‘Kitchen Sink’ Player Search View

A screenshot of the so-called ‘Kitchen Sink Stat Page’

To begin this project, I had an idea of what I wanted, and that was pretty much everything. I began to add variables into a ‘player search’ view with reckless abandon. As I approached 120 variables in my player search view, the text file exports of the data began to fail immediately or take ages to complete. As this was a crowdsourcing effort, I wanted to keep everything manageable and repeatable to minimize any potential for unwanted variation attributable to user error. I settled on a view that contained 117 variables.

The variables included in the data

I think this is a pretty comprehensive view of how the players are interacting with the game; however, I am expecting a number of hiccups in the data. The first is variation in minutes played. With a dataset this large, there are bound to be missing stats from players who simply did not play, were injured for a significant amount of time, etc. This would almost certainly skew our data when looking at match engine stats.

I’ve done some visualizations of the relationships between our variables to show a baseline of the relationships that exist in the dataset as a whole (before any steps are taken to further tease out patterns). The visuals you will see below are heatmaps of the correlation coefficients between each pair of numeric variables within the dataset. These visuals are going to prove extremely useful. With these heatmaps, I can quickly get a glimpse of relationships between variables and then move on to trying to uncover solid trends.

Heatmap 1: The full dataset

Correlation heatmap of the entire dataset

Please, please, please do not try and squint to read all of these variables. I promise you will be able to see my point just by giving these visuals a glance. In later sections, I will pare this down to only the variables relevant to the conversation. For the ultra curious, feel free to get out the magnifying glass, but for now just roll with me.

As I mentioned, this plots the correlation coefficients between all of our numerical variables. For reference, a correlation coefficient of 1 means that there is a perfect positive linear relationship between two variables, and -1 means that there is a perfect negative linear relationship. A coefficient of 0 means there is no linear relationship (the variables could still be related in a non-linear way). As you can see, there are some interesting patterns in this visual. I am particularly intrigued by the triangle of red squares in the bottom right…
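To make the three reference values concrete, here is a minimal sketch of the Pearson correlation coefficient written out by hand. The numbers and variable names are made up for illustration and are not taken from the dataset; in practice a data library would compute the whole matrix for you.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and the two sums of squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, purely illustrative.
finishing = [8, 10, 12, 14, 16]
goals_up = [2, 4, 6, 8, 10]    # rises in lockstep with finishing
goals_down = [10, 8, 6, 4, 2]  # falls in lockstep with finishing
u_shaped = [4, 1, 0, 1, 4]     # clearly related to finishing, but not linearly

r_up = pearson(finishing, goals_up)      # perfect positive linear relationship
r_down = pearson(finishing, goals_down)  # perfect negative linear relationship
r_u = pearson(finishing, u_shaped)       # no linear relationship
```

The U-shaped case is the one to keep in mind when reading the heatmaps: a pale square only rules out a straight-line relationship, not every kind of relationship.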

To get back to my point about the data being obscured by a variety of factors, I’d invite you to compare the heatmap of the entire dataset (above) with the heatmap when every player who did not hit the 900 minute threshold is removed (below).

Heatmap 2: Players below 900 minutes removed

Correlation heatmap of the dataset when players with <900 minutes are removed

As you can see, a decent amount of the color has shifted. Why is that? When a large chunk of your values equal zero (these being match engine stats for players who never played), any real relationships will be distorted. In the future, the number of minutes played will be a huge factor, and I will most likely convert any stats I analyze into a ‘stat per 90 minutes’ measure. While minutes played is a big obscuring factor, there are lots of little things to be accounted for too. This is my challenge: to find the tiny details that obscure how the player attributes are truly affecting the match engine and the game as a whole.

Let me take this one step further. I’ve produced one more heatmap for us to consider. This time, I have taken out both players under 900 minutes of game time and goalkeepers. Why? It’s a little obvious, but the goalkeeper is a fundamentally different type of player on the pitch. Have you spotted the vertical lines of red-hot positive correlation near the bottom of the heatmap that then snap into the blue range after 5 boxes? Those boxes are the goalkeeper stats, and they are another confounding factor when analyzing the rest of the players on the pitch.
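As a rough sketch of the two steps described here (dropping players under the 900-minute threshold, then expressing a stat per 90 minutes), something like the following would do. The field names and rows are hypothetical and do not reflect the real export's column names.

```python
MIN_MINUTES = 900  # threshold used for the filtered heatmap

# Hypothetical rows; field names are assumptions, not the export's real columns.
players = [
    {"name": "A", "minutes": 2700, "goals": 15},
    {"name": "B", "minutes": 450, "goals": 3},
    {"name": "C", "minutes": 0, "goals": 0},
    {"name": "D", "minutes": 1800, "goals": 4},
]


def per_90(total, minutes):
    """Scale a season total to a rate per 90 minutes played."""
    return total * 90 / minutes


# Keep only players who reached the threshold, and attach the per-90 rate.
# Filtering first also avoids dividing by zero for players who never played.
qualified = [
    {**p, "goals_per_90": per_90(p["goals"], p["minutes"])}
    for p in players
    if p["minutes"] >= MIN_MINUTES
]
```

Doing the filter before the division matters: a per-90 figure built on a handful of minutes is noisy, and one built on zero minutes is undefined.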

Compare the last heatmap with this one.

Heatmap 3: Goalkeepers and Players below 900 minutes removed

Correlation heatmap of the dataset when goalkeepers and players with <900 minutes are removed

Very different results. The most robust patterns still appear; however, we can now see much more clearly how the variables are related. This will only become clearer as I progress through this data and put more thought into each relationship.

Let’s move on to how some core player information is distributed.

Distributions of the Basics

Now I want to look at how the players’ information is distributed. Looking at the distributions will give us a good idea of the landscape of our data. The variables I want to hit on specifically are Age, Current Ability and Potential Ability, Best Position, and Personality.

Age

Histogram of the Age of Players, n = 1,181,296

There is nothing particularly exciting about what we have here. The data is shaped like you might think; players slowly drifting into retirement, a spike with the newest youth intake. Keep in mind that the bins of this histogram are every 2 ages, so 15 and 16 are together here (the left edge is inclusive, the right edge is not). The large spike in the 15–16 age bracket would be a consequence of the youth intake I would assume. I would be interested to see what more years of data would look like with the youth intakes.
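For anyone unsure how the half-open bins work, here is a small sketch of the binning rule described above. It assumes the first bin starts at age 15, which is inferred from 15 and 16 sharing a bin; the sample ages are made up.

```python
from collections import Counter


def age_bin(age, start=15, width=2):
    """Return the (left, right) edges of the half-open bin [left, right) holding age.

    start=15 is an assumption based on 15 and 16 sharing the first bin.
    """
    left = start + ((age - start) // width) * width
    return left, left + width


# Hypothetical ages, purely illustrative.
ages = [15, 16, 16, 17, 18, 19, 24, 25, 33]
bin_counts = Counter(age_bin(a) for a in ages)
```

With this rule a 17-year-old falls into the 17 to 19 bin rather than the 15 to 17 bin, because the right edge of each bin is excluded.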

I had to look up who is still kicking around at age 55, and it actually happens to be two guys!

Congrats to Zvonko Jager and Juha Ansamaa for being the oldest guys in the set. Both are GKs, from Kovinar in the Slovenian Lower League and FC Ylivieska in the Finnish Lower Division respectively.

Current and Potential Ability

I have a feeling that this topic is going to be the most anticipated topic in this article. First, we have our histograms.

Histograms of current and potential ability, n = 1,181,296

There are a couple of interesting things here. The first thing I spot is the shape. It looks roughly gaussian, particularly potential ability, but I will get into that in a second.

The next thing I spot is the skew. Current ability has a somewhat positive skew, meaning the extreme high values make the right tail long. The maximum current ability in the data is 192, and there are 13,534 players with a current ability greater than 150. At the moment, I have 13 datasets from the community in this group, so there are on average 1,041.08 players with a CA greater than 150 in each game. This makes sense, as the exceptional players in the right tail are spread thinly across the 140–200 Current Ability range.

I plan on having a more in-depth section on the differences between nations later on, but I wanted to give a brief synopsis of the most notable footballing nations. Here is a violin plot describing the distributions of current ability and potential ability for France, Germany, England, Portugal, Spain, the Netherlands, Italy, Argentina, and Brazil. To be clear, this is nationality rather than the country the players play in.
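The per-game average quoted above is just the 13,534 players divided across the 13 community datasets. The sketch below reproduces that arithmetic and adds a hand-rolled moment-based skewness function to show what a positive skew means; the toy ability values are invented and only there to demonstrate the sign.

```python
players_above_150 = 13_534  # players with CA > 150, from the article
community_datasets = 13     # number of community datasets, from the article

avg_per_game = round(players_above_150 / community_datasets, 2)


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical ability values with a few exceptional players on the right.
toy_ca = [60, 70, 70, 80, 80, 80, 90, 100, 150, 190]
right_tailed = skewness(toy_ca)                      # long right tail
left_tailed = skewness([250 - v for v in toy_ca])    # mirrored: long left tail
```

A positive result means a handful of very high values stretch the right tail, which is the shape described for current ability; mirroring the values flips the sign.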

While this is pretty self-explanatory, I would like to point out that Italy and England both have leagues loaded at a lower level than most other countries. Glory to the South American wonderkid lovers out there.

This is becoming a pattern, but researching the dynamics of the youth intake is a segment I plan to work on later. I have a save going 2 years into the future with practically every vanilla league loaded so we can take a look at youth intakes in greater detail.

Best Position

This is a variable on which I will base a decent amount of my match engine analysis. It may be obvious, but segmenting our data by position will allow us to find key patterns for each position. The heatmaps I showed earlier already show some correlation between variables; however, I expect these patterns to become clearer as we segment by position. Below I have the breakdown of players by their best position.

n = 1,181,296

Pretty standard results. Center mids are the most populous position, center backs are second, and strikers are a close third. Wingbacks are heavily under-represented, likely because fullback or wide midfielder is the best position for your typical wingback.

Personality

Ok, let’s throw personality in our countplot.

n = 1,181,296

Woah. I did not expect ‘Balanced’ to be so overwhelming. Super interesting, but let’s take out ‘Balanced’ so we can see the distribution of the other personalities better.

n = 471,788

Other than ‘Balanced’, we have some relative favorites, namely ‘Fairly Determined’, ‘Fairly Professional’, and ‘Unambitious’.
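The two sample sizes quoted with the personality plots also tell us roughly how dominant ‘Balanced’ is. The sketch below works that out, assuming the only rows dropped between the two plots were the ‘Balanced’ ones, and shows the same filter-and-count step on a made-up sample of labels.

```python
from collections import Counter

total_rows = 1_181_296      # n for the full personality plot, from the article
non_balanced_rows = 471_788  # n once 'Balanced' is removed, from the article

# Assumption: the difference between the two counts is entirely 'Balanced' players.
balanced_rows = total_rows - non_balanced_rows
balanced_share = balanced_rows / total_rows

# Hypothetical labels, purely illustrative.
sample = [
    "Balanced", "Balanced", "Fairly Determined", "Balanced",
    "Unambitious", "Fairly Professional", "Balanced", "Fairly Determined",
]
counts = Counter(sample)
# Drop the dominant category so the remaining ones are easier to compare.
others = Counter({label: n for label, n in counts.items() if label != "Balanced"})
```

Under that assumption, about six in ten players in the dataset are ‘Balanced’, which is why the first plot is so hard to read.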

Continuing my analysis of Personality, I am pretty sure that personality will differ by age. It only makes sense that as players mature they will develop their personalities. We see that in-game as we play, but this is likely also represented in the initial database as well. Let’s take a look.

While you can see there is some difference between the groups, the called-out group of personalities is almost exclusively found among the youngest players. This is interesting, but I assume that this could be a product of youth intakes. This would mean that there is a cohort of personalities that are more frequently assigned to young players from intakes. This topic will be further explored in the section on youth intakes.

Wrap Up

This article sets us up for some juicier analysis of how the game’s patterns, trends, and match engine dynamics intertwine with player attributes. I have listed some topics I want to cover below; comment with what you want to see from my analysis or suggest new topics!

Youth Intakes: Trends in CA, PA, and personalities, as well as how facilities and the Head of Youth Development affect the quality of youth

Position Analysis: What stats affect a position’s average rating? What attributes play into these key stats? I’m planning on doing this for the majority of positions.

Build Up Analysis: What stats dictate how your team gets into scoring positions?

Winning Stats: What stats are best correlated with wins? By position?

Cohort Analysis: What attributes are present in players that stand out in cohorts made up of players in the same positions, plus or minus a range of current ability?

For those of you wanting to get your hands on this dataset, don’t you worry. It’s coming soon, I just have to finish up some cleaning with it before hosting a download.

FM StatGuy

Prospective Data Science student with a love for Football Manager