Statistics: a libero in sports

Looking back at the 2014 World Cup, the dramatic match between Brazil and Germany immediately comes to mind. Within minutes, Brazil conceded three goals. Some fans might wonder if Germany had had a winning streak or a “hot hand” that brought them an enormous amount of luck. For a long time, the uncontested view among scientists was that a hot hand was nothing but a probabilistic coincidence. Economists and psychologists argued that the idea of a winning streak was due to the human predisposition to detect patterns in randomness.

However, leading scientists from Berkeley have cast doubt on this alleged cognitive bias. Assuming that a coin is tossed a hundred times, they looked at how many trials it takes until the expected proportion of successes following a streak converges to the true probability of success. They repeated the experiment for streaks of different lengths. They found that the longer the streak of consecutive successes, the longer it took the expected proportion to converge to the true probability. This means that a bias does exist. However, that bias is not cognitive; it is a selection bias that arises from the sequential nature of the data.
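The selection bias described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the researchers' actual procedure: the function names, the streak length of three, the number of simulated sequences and the fixed seed are all chosen here for illustration.

```python
import random


def proportion_after_streak(flips, k):
    """Share of heads among the flips that directly follow k heads in a row.

    Returns None if the sequence contains no flip preceded by such a streak.
    """
    followers = [flips[i] for i in range(k, len(flips)) if all(flips[i - k:i])]
    if not followers:
        return None
    return sum(followers) / len(followers)


def expected_proportion(n_flips=100, k=3, n_sequences=5000, seed=0):
    """Average the per-sequence proportion over many fair-coin sequences.

    n_sequences and seed are illustrative assumptions, not values from the study.
    Sequences without any streak of length k are skipped, which is exactly
    where the selection effect enters.
    """
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_sequences):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        p = proportion_after_streak(flips, k)
        if p is not None:
            proportions.append(p)
    return sum(proportions) / len(proportions)
```

Calling `expected_proportion(k=1)` and `expected_proportion(k=3)` should return averages below 0.5 even though the coin is fair, with the gap widening for the longer streak.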

MIT researchers found that basketball players who have performed well, whether expected to or not, tend to take more difficult shots. Moreover, “hot” players are much more likely to take the team’s next shot and thus are not choosing shots independently. This challenges the common view that shot selection is independent of whether a player perceives himself as hot or cold. Thus, the hot hand might not be a cognitive bias for the audience, but it definitely is for the players. A player who performed well at an earlier stage of the game and exceeded his own expectations is willing to take more risks. Therefore, he shoots from significantly further away, takes on tighter defence, and attempts more challenging shots.


In the case of the 2014 World Cup, the question of a hot hand is difficult to answer because football is a low-scoring game and therefore provides less data than sports like basketball and baseball. Additionally, while baseball and basketball are rather democratic sports, meaning everyone theoretically has an equal chance to score, football has a relatively high number of players in different positions and thus with different probabilities of scoring. Whereas statistics in score-based sports like baseball have captured public interest through movies such as Moneyball, featuring Brad Pitt, the quantitative aspects of races such as NASCAR or horse racing have been of greater academic interest.

Horse races are particularly interesting due to their abundance of data. It is a common phenomenon for gamblers to underbet favourites and overbet longshots in order to receive a higher reward in case the longshot wins. This favourite-longshot bias, however, has been shown to vary across cultures. Whether the bias appears depends on the average pool size, meaning the total amount of money bet. In the Western world, horse betting is more of a pastime with a relatively small betting pool, whereas in Asia, notably in Hong Kong, betting is a business to be taken seriously. Analysis by researchers from Berkeley showed that the bias in favour of longshots is more prominent in Western countries, where bettors stake relatively low amounts (an average pool size of $218,000 at the Yonkers race in the United States), than in Asia (an average pool size of $1.1 million at the Happy Valley race in Hong Kong). As the average pool size is much higher in Hong Kong, bettors there assess their bets more carefully and attempt to predict the outcome of races more accurately.
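To see why overbetting longshots is costly, it helps to look at how a pooled (pari-mutuel) bet pays out: winners share the pool in proportion to their stakes. The sketch below is illustrative only. The function name, the pool figures, the "true" win probabilities and the default of no track take are hypothetical assumptions, not data from the study.

```python
def expected_return_per_unit(true_win_prob, stake_on_horse, pool, takeout=0.0):
    """Expected payout for one unit staked on a horse in a pooled bet.

    stake_on_horse is the total money bet on that horse, pool the total
    money bet on all horses, takeout the share kept by the track
    (assumed to be zero here for simplicity).
    """
    payout_per_unit = (1.0 - takeout) * pool / stake_on_horse
    return true_win_prob * payout_per_unit


# Hypothetical race with a pool of 1,000 units.
# The favourite wins half the time but attracts only 40% of the money.
favourite = expected_return_per_unit(0.50, stake_on_horse=400, pool=1000)
# The longshot wins 5% of the time but attracts 10% of the money.
longshot = expected_return_per_unit(0.05, stake_on_horse=100, pool=1000)
```

In this made-up race the underbet favourite returns more than one unit per unit staked on average, while the overbet longshot returns only half a unit.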

Another classic statistical tool applied to horse races is Markov chain Monte Carlo (MCMC) simulation. A Markov chain is a stochastic process in which the future depends not on the past but only on the present. MCMC can be thought of as carrying out many experiments, each time altering the variables in a model and observing the response. The goal of MCMC is to draw samples from some probability distribution without needing to know its exact height at any point. MCMC achieves this by “wandering around” on that distribution such that the amount of time spent in each location is proportional to the height of the distribution. For example, suppose there are eight horses and one wants to predict which one is going to win. The individual winning probability of each horse is calculated sequentially based on the prior odds assigned to the animal. The probabilities are then accumulated so that horse number eight has the value 100%. A value is then drawn at random between 0% and 100%. If the drawn value is higher than the first horse’s cumulative value, one moves up the line until a horse’s cumulative value is strictly higher than the drawn value; that horse is recorded as the winner of the draw. In a sufficiently large sample, the proportion of draws assigned to each horse will reflect that horse’s winning probability. Thus, even if the true probability is unknown, it is still possible to achieve an accurate model.

Horse                    1       2       3       4       5       6       7       8
Cumulative probability   13.04%  30.42%  40.85%  48.87%  69.73%  72.62%  79.14%  100%
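The drawing procedure for the eight horses can be sketched in a few lines. Taken on its own, this step is inverse-transform sampling from a discrete distribution (plain Monte Carlo) rather than a full Markov chain. The cumulative values come from the table above; the function names, the number of draws and the seed are assumptions made for illustration.

```python
import bisect
import random

# Cumulative winning probabilities for horses 1 to 8, taken from the table.
CUMULATIVE = [0.1304, 0.3042, 0.4085, 0.4887, 0.6973, 0.7262, 0.7914, 1.0]


def draw_winner(rng):
    """Return the number (1 to 8) of the first horse whose cumulative
    probability is strictly higher than a uniform draw from [0, 1)."""
    u = rng.random()
    return bisect.bisect_right(CUMULATIVE, u) + 1


def simulate(n_draws=100_000, seed=0):
    """Share of simulated races won by each horse (index 0 is horse 1)."""
    rng = random.Random(seed)
    wins = [0] * len(CUMULATIVE)
    for _ in range(n_draws):
        wins[draw_winner(rng) - 1] += 1
    return [w / n_draws for w in wins]
```

With enough draws, the shares returned by `simulate()` should approach the individual probabilities implied by the table, for example about 13% for horse 1 and about 21% for horse 8.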

Because the outcome of a horse race usually depends heavily on prior knowledge and cannot be assumed to be independent of the past, the Bayesian approach, which takes prior beliefs into account, has enjoyed significant popularity. Different factors, including the days since the last run, the time of the year, or the characteristics of the running ground, affect the probability of winning. A Bayesian statistician not only distinguishes between winning and losing, but also takes these other factors into consideration. Using the example of two seasons, spring and autumn, one can distinguish four cases: a win or a loss in spring, and a win or a loss in autumn. This allows the bookmaker to make a more informed decision about the horse’s performance based on the season, and he can thus decide which season would be the best to bet on the horse. This model can be extended to the multivariate case, so that the bookmaker can assess all participating horses based on their performance in different seasons. This and other approaches can lay the foundation for more elaborate machine learning methods, whose discussion would be a horse of a different colour.
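The four cases above (win or loss, in spring or in autumn) can be turned into a simple estimate per season. The article does not specify a model, so the sketch below assumes a Beta prior updated with win and loss counts, one common Bayesian choice. The function name, the prior values and the race record are hypothetical.

```python
def posterior_win_probability(wins, losses, prior_wins=1.0, prior_losses=1.0):
    """Posterior mean of the win probability under an assumed Beta prior.

    prior_wins and prior_losses encode the prior belief as pseudo-counts;
    1 and 1 correspond to a uniform prior.
    """
    return (prior_wins + wins) / (prior_wins + prior_losses + wins + losses)


# Hypothetical record of one horse: (wins, losses) per season.
record = {"spring": (6, 4), "autumn": (2, 8)}

estimates = {
    season: posterior_win_probability(wins, losses)
    for season, (wins, losses) in record.items()
}

# Season with the highest estimated win probability.
best_season = max(estimates, key=estimates.get)
```

For this invented record the estimate is higher in spring than in autumn, so spring would be the season in which to back the horse. A stronger prior (larger pseudo-counts) would pull both estimates closer together.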

Statistics can also be used to predict the winners of international championships such as the Olympic Games. A team of data miners attempted to identify the factors that determine success at the 2014 Winter Games in Sochi. Unsurprisingly, they found that the geography of the participating country is notably important for the Winter Games. About 90 percent of countries have never won a single Winter Olympics medal, including Middle Eastern, South American, African, and Caribbean countries. Additionally, the researchers used GDP per capita as an explanatory variable, since nations whose people are affluent can afford to spend time pursuing excellence in sports. Moreover, history has shown that the nation hosting the games often overperforms. Italy and Canada overperformed in 2006 and 2010 with five and 14 gold medals, respectively, when the Winter Games were hosted in Torino and Vancouver. The trend continued at the 2014 Winter Games in Sochi, with Russia winning eleven gold medals. However, South Korea’s performance in 2018 remained at an average level.

The reasons for the overperformance at home could be twofold. On the one hand, the host may allocate more money to winning in order to increase its prestige while the world’s spotlight is on it. On the other hand, hosting could have the same effect as the home advantage in any other sport, such as football. Past analysis has also shown that countries with a socialist background generally overperform. One reason is that in a command economy it is easier to direct funding to the training of athletes. Another is that in an authoritarian system, the elites value medals as a demonstration of power and thus push harder to achieve their prestige-bringing goals.

GDP also reveals another characteristic of the winning scheme: high-income countries diversify more across sports, while low-income states usually focus on a few sports as a safe bet for medals, such as Ethiopia in athletics. For the Winter Games, this phenomenon is not as common, as most low-income countries have limited opportunities to practise the more resource-intensive winter sports.

In conclusion, whether in ball sports, races, or a combination of both in the form of the Olympics, statistics has found its way into the world of sports. With the World Cup in mind, we should not forget: the passion for the numbers in the field of football should not overshadow the passion for the jersey numbers on the football field.

by Jacqueline Seufert


