Nerding Out: Variable Design (AKA Feature Engineering)
These plots not only describe what has happened so far, they also provide clues on how to design variables. When building a model at such an aggregate level, you have few data points. The variables you use should therefore be few but highly representative, to avoid what is dramatically called the curse of dimensionality. For inbound and internal transfers, the summer and winter seasons clearly follow a similar pattern. This means that we can reliably model these cases with the previous year’s values and the previous season’s values (i.e. the summer season for the winter transfers). For outbound transfers, however, those values would be misleading because the behavior is different. Furthermore, we suspected that the number of free agents at the end of the previous season would also have an impact on transfers.
Ultimately, we ended up with a representative set of features to model the time series: an autoregressive component (i.e. the value for the same season in the previous year), the previous season’s outbound revenue, and the number of free agents at the end of the previous season.
Notice that we only ever use information about previous seasons when building our features. This time dependency must always be preserved; otherwise the resulting model cannot be deployed in real life. Especially with time series, it is rather easy to build features that accidentally include information from future points in time.
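To make the "previous seasons only" rule concrete, here is a minimal sketch of how the three features could be built. The field names, the `build_lagged_features` helper, and all numbers are hypothetical and made up for illustration; this is not our actual pipeline. It assumes one row per transfer window in chronological order, with two windows (summer and winter) per year, so that "same season last year" is two rows back.

```python
from typing import Dict, List

# Hypothetical toy data: one row per transfer window, oldest first.
# All values are invented for illustration only.
windows: List[Dict] = [
    {"window": "2017-summer", "spend": 100.0, "outbound_revenue": 10.0, "free_agents": 50},
    {"window": "2018-winter", "spend": 30.0, "outbound_revenue": 4.0, "free_agents": 20},
    {"window": "2018-summer", "spend": 110.0, "outbound_revenue": 12.0, "free_agents": 55},
    {"window": "2019-winter", "spend": 35.0, "outbound_revenue": 5.0, "free_agents": 22},
    {"window": "2019-summer", "spend": 120.0, "outbound_revenue": 11.0, "free_agents": 60},
]


def build_lagged_features(rows: List[Dict]) -> List[Dict]:
    """Build one feature row per window using only earlier windows."""
    features = []
    for i, row in enumerate(rows):
        if i < 2:
            # Not enough history for the "same season last year" lag.
            continue
        same_season_last_year = rows[i - 2]  # two windows back
        previous_window = rows[i - 1]        # one window back
        features.append({
            "window": row["window"],
            "target": row["spend"],
            "same_season_last_year": same_season_last_year["spend"],
            "prev_outbound_revenue": previous_window["outbound_revenue"],
            "prev_free_agents": previous_window["free_agents"],
        })
    return features


features = build_lagged_features(windows)
```

Because every feature is read from `rows[i - 1]` or `rows[i - 2]`, nothing from the window being predicted (or from later windows) can leak into its inputs. The price is that the first two windows have no complete feature row and are dropped.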
Nerding Out: Model Selection
For the sake of thoroughness, we tried multiple models to represent the transfer patterns: an autoregressive linear model, Random Forest, and Prophet. As a best practice, every model is trained on a training set, its hyperparameters are optimized on a validation set, and its performance is then tested on a third, completely separate hold-out set that the model has not seen before.
In the case of time series, this means dividing the time horizon into chunks and using those chunks to perform the train-validate-test process. Under all circumstances, we want to preserve the temporal dependency structure: the data used for validation or testing must come after the data used for training. The largest test error we observed when predicting transfer spend was 15%, for summer 2020. After seeing this shock, the models adjusted their predictions to account for the impact of COVID. All in all, these models are fairly accurate.
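One common way to chunk a time horizon while keeping the temporal order is an expanding-window split, sketched below. This is an illustration under assumptions, not necessarily the exact scheme we used: the `chronological_splits` helper and its parameters are hypothetical, each validation chunk is a single time step, and the most recent points are reserved as the hold-out test set.

```python
from typing import List, Tuple

Fold = Tuple[List[int], List[int]]


def chronological_splits(
    n_points: int, min_train: int, n_test: int
) -> Tuple[List[Fold], List[int]]:
    """Return expanding-window (train, validation) folds plus a final test set.

    Indices refer to observations sorted from oldest to newest.
    """
    if min_train < 1 or n_test < 1 or min_train + n_test >= n_points:
        raise ValueError("need at least one validation point between train and test")

    test_start = n_points - n_test
    folds = []
    for k in range(min_train, test_start):
        train_idx = list(range(k))  # everything strictly before step k
        val_idx = [k]               # validate on the next step only
        folds.append((train_idx, val_idx))

    test_idx = list(range(test_start, n_points))  # most recent points, never tuned on
    return folds, test_idx


folds, test_idx = chronological_splits(n_points=10, min_train=4, n_test=2)
for train_idx, val_idx in folds:
    print(f"train on 0..{train_idx[-1]}, validate on {val_idx[0]}")
print(f"final test on {test_idx}")
```

Every validation index is later than every training index in its fold, and the test indices are later than all of them, so hyperparameters are never tuned on information from the future.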
*Inbound transfers are those in which a UEFA club acquires a player from outside the UEFA region, outbound transfers are those in which a UEFA club sells a player to a club outside the region, and internal transfers are between UEFA clubs.