How many rounds on top are necessary to be a soccer champion?

Introduction

Like almost everyone here in Brazil, I have a passion for football (or soccer) and because of that my father and I are always alert to the national championship.
So, a few days ago my father gave me a challenge, he challenged me to understand what is needed to be a champion in the brazilian national soccer league (knowed here as “Brasileirão”) since the competition change its format to a point based championship in 2003.

Thus I’ve accepted the challenge and here are the results:

Data

Firstly we need to get the data!!

I haven’t found a well structured data base to work with, so I had to work with the information provided by this site and I had to do a function to scrap the data.

Packages used

library(XML)      #Package to do web scraping
library(dplyr)    #Manipulation
library(magrittr) #Manipulation
library(tidyr)    #Manipulation
library(scales)   #Formatting
library(ggplot2)  #Plots
library(pROC)     #ROC curves

That is the function to do the web scraping. Note that the site has information only until 2015.

# URL base to all years
url1="futpedia.globo.com/campeonato
/campeonato-brasileiro/"

#Empty data frame to store the information
games= data_frame()

for (i in 2003:2015) { ## years
  url2 = paste0(url1,i)
  
  readHTMLTable(url2,as.data.frame = T)%>%
    data.table::rbindlist()%>%
    setNames(c("Position","Team","Points","Games","Victories","Ties",
               "Losses","Goals scored","Goals conceded","Goal difference",
               "Performance"))%>%
    mutate(Year = i)%>%
    bind_rows(games) -> games
}

If you don’t want to scrap the data (because it can take some time and sometimes doesn’t work so fine) you can get it from here Data!.

Tip: Be aware with the encoding! (use “latin-1”)

Moving on, we need some extra information, like which team was the champion in the season and how many games (rounds) are in each season, because the number of teams has changed in 2005 and changed again in 2006.

# Champions data
champ_dt = data_frame(Year = 2003:2018,
                      Champion = c("Cruzeiro","Santos","Corinthians","São Paulo",
                                   "São Paulo","São Paulo","Flamengo","Fluminense",
                                   "Corinthians","Fluminense","Cruzeiro","Cruzeiro",
                                   "Corinthians","Palmeiras","Corinthias","Palmeiras"))

games %<>%  # Tracking the round (game) number
  mutate(Round = c(rep(rep(1:38,each =20),10),
                   rep(1:42,each =22),
                   rep(1:46,each =24),rep(1:46,each =24)))%>%
  
  # Finding the champion in each season
  inner_join(champ_dt) %>%
  # Is this team the champion of this year?
  mutate(Champ = if_else(Team == Champion, 1,0),
         # Total number of games in the season
         N_round = case_when(Year < 2005 ~ 46,
                             Year == 2005 ~ 42,
                             Year > 2005 ~38),
         # In this round, this team is in the first place?
         First_place = if_else(Position ==1,1,0 ),
         # In this round, this team is in the first place
         # and is the champion?
         TP = if_else(First_place == 1 & Champ == 1,
                      1,0),
         # In this round, this team is in the first place
         # and isn't the champion?
         TN = if_else(First_place == 1 & Champ == 1,
                      1,0))

Let’s take a look in the data

head(games)

##   Position         Team Points Games Victories Ties Losses Goals scored
## 1        1        Sport      3     1         1    0      0            4
## 2        2 Athletico-PR      3     1         1    0      0            3
## 3        3  Chapecoense      3     1         1    0      0            2
## 4        3    São Paulo      3     1         1    0      0            2
## 5        5  Corinthians      3     1         1    0      0            1
## 6        6   Fluminense      3     1         1    0      0            1
##   Goals conceded Goal difference Performance Year Round    Champion Champ
## 1              1               3         100 2015     1 Corinthians     0
## 2              0               3         100 2015     1 Corinthians     0
## 3              1               1         100 2015     1 Corinthians     0
## 4              1               1         100 2015     1 Corinthians     0
## 5              0               1         100 2015     1 Corinthians     1
## 6              0               1         100 2015     1 Corinthians     0
##   N_round First_place TP TN
## 1      38           1  0  0
## 2      38           0  0  0
## 3      38           0  0  0
## 4      38           0  0  0
## 5      38           0  0  0
## 6      38           0  0  0

All seems ok, let’s continue…

Analysing the initial rounds.

So, now we can analyse the data. Initially, we can try to find the “champion profile” at the 5th and 10th round.

5th round:

# Making a plot
games %>%
  filter(Round == 5)%>%
  select(Points, Victories, Ties, Losses,
         `Goals scored`, `Goals conceded`,
         Position,Champ)%>%
  mutate_all(.funs = as.numeric)%>%
  mutate(Champ = factor(Champ, labels= c("Not champion","Champion")))%>%
  gather(key="Variable",value="value",-c(Champ))%>%
  group_by(Champ,Variable)%>%
  mutate(tot = n())%>%
  group_by(Champ,Variable,value)%>%
  summarise(freq = n(), tot=mean(tot))%>%
  mutate(perc = freq/tot)%>%
  ggplot(aes(value,perc,fill=Champ))+
  geom_col(width = 1,col=1,alpha=.3)+
  facet_wrap(~Variable,scales = "free")+
  theme_light()+
  scale_fill_discrete("")+
  ylab("Frequency (%)")+
  scale_y_continuous(labels = scales::percent)

Note that are negative points, this happened because of punishments gave by CBF for some teams.

# Making a table
games %>%
  filter(Round == 5)%>%
  select(Points, Victories, Ties, Losses,
         `Goals scored`, `Goals conceded`,
         Position,Champ,Performance)%>%
  mutate_all(.funs = as.numeric)%>%
  mutate(Champ = factor(Champ, labels= c("Not champion","Champion")))%>%
  gather(key="Variable",value="value",-c(Champ))%>%
  group_by(Variable,Champ)%>%
  summarise(Min= min(value),
            Mean = mean(value) %>% round(digits = 2),
            Max = max(value),
            SD = sd(value) %>% round(digits = 2))%>%
  knitr::kable()

Variable	Champ	Min	Mean	Max	SD
Goals conceded	Not champion	1	6.59	17.0	2.95
Goals conceded	Champion	2	6.00	12.0	2.83
Goals scored	Not champion	1	6.49	16.0	2.71
Goals scored	Champion	3	7.92	17.0	3.59
Losses	Not champion	0	1.77	5.0	1.10
Losses	Champion	0	1.38	3.0	0.96
Performance	Not champion	0	44.44	100.0	19.67
Performance	Champion	40	55.91	86.7	13.75
Points	Not champion	-13	6.57	15.0	3.26
Points	Champion	6	8.38	13.0	2.06
Position	Not champion	1	11.11	24.0	6.11
Position	Champion	1	7.23	15.0	4.40
Ties	Not champion	0	1.52	4.0	0.96
Ties	Champion	0	1.23	3.0	1.01
Victories	Not champion	0	1.72	5.0	1.07
Victories	Champion	1	2.38	4.0	0.77

Here we can see that, at the 5th round, all the champions have woned at least one game, and drawn three games at the most, with the minimum performance (points made / possible points) of 40%, but none of the champions had a performance score of 100% at the 5th round.

The minimum number of goals scored by the champions was 3 and the maximum number of goals conceded was 12.

10th round:

# Making a plot
games %>%
  filter(Round == 10)%>%
  select(Points, Victories, Ties, Losses,
         `Goals scored`, `Goals conceded`,
         Position,Champ)%>%
  mutate_all(.funs = as.numeric)%>%
  mutate(Champ = factor(Champ, labels= c("Not champion","Champion")))%>%
  gather(key="Variable",value="value",-c(Champ))%>%
  mutate(y = if_else(Champ=="Champion", value/13,value/ 270 ))%>%
  ggplot(aes(value,fill=Champ))+
  #geom_point()+
  geom_histogram(aes(y=..density..),binwidth = 1,col=1,alpha=.3)+
  facet_wrap( ~Variable,scales = "free")+
  theme_light()+
  scale_fill_discrete("")+
  ylab("Frequency (%)")+
  scale_y_continuous(labels = scales::percent)+
  scale_x_continuous(breaks = seq(0,30,by=4))

Note that are negative points again, this happened because of the same reason.

# Making a table
games %>%
  filter(Round == 10)%>%
  select(Points, Victories, Ties, Losses,
         `Goals scored`, `Goals conceded`,
         Position,Champ,Performance)%>%
  mutate_all(.funs = as.numeric)%>%
  mutate(Champ = factor(Champ, labels= c("Not champion","Champion")))%>%
  gather(key="Variable",value="value",-c(Champ))%>%
  group_by(Variable,Champ)%>%
  summarise(Min= min(value),
            Mean = mean(value) %>% round(digits = 2),
            Max = max(value),
            SD = sd(value)%>% round(digits = 2))%>%
  knitr::kable()

Variable	Champ	Min	Mean	Max	SD
Goals conceded	Not champion	4.0	13.30	26.0	3.89
Goals conceded	Champion	3.0	10.15	19.0	5.19
Goals scored	Not champion	3.0	12.95	23.0	3.95
Goals scored	Champion	8.0	16.92	27.0	5.04
Losses	Not champion	0.0	3.66	8.0	1.62
Losses	Champion	0.0	2.23	5.0	1.42
Performance	Not champion	6.7	44.42	83.3	14.16
Performance	Champion	43.3	62.82	86.7	12.61
Points	Not champion	-6.0	13.23	25.0	4.48
Points	Champion	13.0	18.85	26.0	3.78
Position	Not champion	1.0	11.27	24.0	6.02
Position	Champion	1.0	4.31	11.0	3.35
Ties	Not champion	0.0	2.84	7.0	1.31
Ties	Champion	1.0	2.23	5.0	1.30
Victories	Not champion	0.0	3.49	8.0	1.51
Victories	Champion	3.0	5.54	8.0	1.39

In the 10th round we can see that the champions scored at least 8 goals, conceded at least 3 goals and lose 5 games at the most. We can see also that all the champions have woned at least 3 games, and drawn between one and five games, with the minimum performance (points made / possible points) of 43.3%.

How many rounds on top make a champion?

So, to answer my dad’s question about the football champions I thought that discovery the numbers of rounds in the first place would be of great value.

Thus now we need to count the number of rounds in first place of each team in each season (year).

games %>%
  group_by(Year,Team)%>%
  #Counting the number of rounds on top
  summarise(Rounds_on_top = sum(First_place))%>%
  inner_join(champ_dt)%>%
  #Was this team the champion of the season?
  mutate(Won = (Team==Champion),
         Won = as.numeric(Won),
         N_round = case_when(Year < 2005 ~ 46,
                             Year == 2005 ~ 42,
                             Year > 2005 ~38)) -> first_place_df

We can now calcule the proportions of rounds needed.

first_place_df%>%
  mutate(Proportion = Rounds_on_top/N_round,
         Won = factor(Won,labels = c("Not champion",
                                 "Champion")))%>%
  ggplot(aes(Proportion))+
  geom_histogram(bins = 5,col=1,alpha=.5)+
  facet_wrap(~Won,scales = "free_y")+
  theme_light()

first_place_df%>%
  mutate(Proportion = Rounds_on_top/N_round,
         Won = factor(Won,labels = c("Not champion",
                                 "Champion")))%>%
  group_by(Won)%>%
  summarise(min_n=min(Rounds_on_top) %>% round(digits = 2),
            mean_n = mean(Rounds_on_top) %>% round(digits = 2),
            max_n=max(Rounds_on_top) %>% round(digits = 2),
            min_prop=min(Proportion) %>% percent(),
            mean_prop = mean(Proportion) %>% percent(),
            max_prop=max(Proportion) %>% percent())%>%
  knitr::kable()

Won	min_n	mean_n	max_n	min_prop	mean_prop	max_prop
Not champion	0	0.89	17	0%	2.27%	44.7%
Champion	2	22.38	38	5.26%	56.4%	86.8%

So, we can note that the minimum number of rounds on top was 2 (Flamengo, 2009), while the maximum number was 38 (Cruzeiro, 2003) and the average was about 22 games. Looking on the proportions, we can see that the champions often stays on the top a bit more than half of the championship (56,4%).

Estimating the number of rounds

To estimate the number of rounds, we can consider a logist regression with “champion” (binary) as the response variable and the proportion of rounds on first as the predictor variable.

first_place_df%<>%
  mutate(Proportion = Rounds_on_top/N_round)

#fitting the model

fit=glm(Won ~ Proportion,
        family = binomial(link = "logit"),
        data=first_place_df)

summary(fit)

## 
## Call:
## glm(formula = Won ~ Proportion, family = binomial(link = "logit"), 
##     data = first_place_df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.30508  -0.08386  -0.08386  -0.08386   3.14843  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.6485     0.9264  -6.097 1.08e-09 ***
## Proportion   13.2860     2.5762   5.157 2.51e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 104.234  on 269  degrees of freedom
## Residual deviance:  28.442  on 268  degrees of freedom
## AIC: 32.442
## 
## Number of Fisher Scoring iterations: 8

Obviously the proportion of rounds in first is a important variable to explain the win. Let’s take a look in the ROC curve.

(ROC = roc(Won ~ Proportion,data=first_place_df,percent=T))

## 
## Call:
## roc.formula(formula = Won ~ Proportion, data = first_place_df,     percent = T)
## 
## Data: Proportion in 257 controls (Won 0) < 13 cases (Won 1).
## Area under the curve: 98.59%

ggroc(ROC)+
  theme_light()

We got 98.59% of Area under the curve (which is excellent). Now we can look on the true positive ratio and true negative ratio to decide a cut point.

data_frame(Proportion = ROC$thresholds,
           Sens = ROC$sensitivities %>% round(digits = 2),
           Speci = ROC$specificities %>% round(digits = 2))%>%
  mutate(Proportion = percent(Proportion))%>%
  knitr::kable()

Proportion	Sens	Speci
-Inf	100.00	0.00
1.1%	100.00	77.04
2.3%	100.00	78.21
2.5%	100.00	79.38
3.5%	100.00	84.44
4.6%	100.00	85.21
5.0%	100.00	85.60
5.9%	92.31	89.49
7.2%	92.31	90.66
8.3%	92.31	92.61
9.1%	92.31	93.00
10.0%	92.31	93.39
10.7%	92.31	94.16
12.0%	92.31	94.55
14.5%	92.31	94.94
17.1%	84.62	95.72
18.7%	84.62	96.50
20.1%	84.62	96.89
21.4%	84.62	97.28
24.0%	84.62	97.67
27.6%	84.62	98.05
30.3%	84.62	98.44
35.5%	84.62	98.83
41.5%	84.62	99.22
44.1%	76.92	99.22
51.3%	69.23	100.00
59.2%	53.85	100.00
61.2%	46.15	100.00
66.5%	38.46	100.00
72.4%	23.08	100.00
78.1%	15.38	100.00
84.7%	7.69	100.00
Inf	0.00	100.00

Seems like 41,5% is an good cut point because we got 99,22% of specifity (probability of classify the true champion as champion) without losing so much sensibility.

So let’s take a look on the model results…

data_frame(Proportion=seq(0,1,by=.05),
           prob=predict(fit,
                        newdata = data_frame(Proportion=seq(0,1,by=.05)),
                        type = "response")
)%>%
  mutate(prob = percent(prob))%>%
  knitr::kable()

Proportion	prob
0.00	0.4%
0.05	0.7%
0.10	1.3%
0.15	2.5%
0.20	4.8%
0.25	8.9%
0.30	15.9%
0.35	26.9%
0.40	41.7%
0.45	58.2%
0.50	73.0%
0.55	84.0%
0.60	91.1%
0.65	95.2%
0.70	97.5%
0.75	98.7%
0.80	99.3%
0.85	99.6%
0.90	99.8%
0.95	99.9%
1.00	100.0%

Finnaly I can answer to my dad what is the estimated probability of a team be the champion based on how many rounds it was in the first place!

Arthur Rocha