Arthur Rocha
Statistician.
Data scientist.
R & data visualization freak.
'It is not a question of if but how'
Solving the Birthday Paradox
May 24, 2019
4 minutes read
Introduction

If you have ever taken a probability course, you have probably had to solve the birthday paradox (also called the birthday problem), or have at least heard of it. The birthday paradox asks for the probability \(p(n)\) that at least 2 people in a room of n < 365 people were born on the same day. To calculate it, we need to assume that there are 365 possible days and that each day is equally likely to be a birthday.

Thinking about the complementary probability \(p^c(n)\) (the probability that no two people were born on the same day), after some inspiration you get: \(p^c(n) = 1 \times (1 - \frac{1}{365}) \times (1 - \frac{2}{365}) \times \ldots \times (1 - \frac{n-1}{365}) = \frac{365!}{365^n (365-n)!}\)

With more inspiration and some free time, you can see that it can be approximated, using a Taylor expansion, by: \(p^c(n) \approx e^{-n(n-1) / (2 \times 365)}\)

And then: \(p(n) = 1 - p^c(n) \approx 1 - e^{-n(n-1) / (2 \times 365)}\)

But what if you don't want to do all the math by hand? Well, you can always (not always, but you get the point) simulate! So let's do that in R!...
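As a rough illustration of the three routes mentioned in this excerpt (exact product, Taylor approximation, simulation), here is a minimal base R sketch. The function names `p_exact`, `p_approx` and `p_sim`, the number of repetitions and the seed are assumptions made for this example; it is not the code from the full post, which may be organised differently.

```r
# Exact probability that at least two of n people share a birthday,
# assuming 365 equally likely days.
p_exact <- function(n) {
  1 - prod(1 - (0:(n - 1)) / 365)
}

# Taylor-based approximation: 1 - exp(-n(n-1) / (2 * 365)).
p_approx <- function(n) {
  1 - exp(-n * (n - 1) / (2 * 365))
}

# Monte Carlo estimate: draw n birthdays `reps` times and record
# how often at least one day is repeated.
p_sim <- function(n, reps = 10000) {
  hits <- replicate(reps, {
    days <- sample.int(365, n, replace = TRUE)
    anyDuplicated(days) > 0
  })
  mean(hits)
}

set.seed(2019)  # arbitrary seed, only for reproducibility
n <- 23
c(exact = p_exact(n), approx = p_approx(n), simulated = p_sim(n))
```

With n = 23 the exact value is just above one half, and the simulated estimate should land close to it, with the usual Monte Carlo noise that shrinks as `reps` grows.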
Cool stuff
R
statistics
probability
parallel-computing
simulation
Using Xkcd to Make Density Plots
Apr 24, 2019
One minute read
So, I've always been interested in the "xkcd type" of figures, only because IT'S FUN. My goal here is to make density plots of the most common probability functions using the xkcd style. First of all, we will need to load these packages:

    library(xkcd)
    library(dplyr)
    library(ggplot2)

Now, we write a function using the ggplot2 syntax to make the plot. Note that we need to calculate the axis ranges and apply "jitter" to x and y to create a hand-drawn effect in the line.

    xkcd_density = function(x, y){
      # Compute the axis ranges
      xrange = range(x)
      yrange = range(y)
      # Plot line type + jitter
      qplot(jitter(x), jitter(y), geom = "blank") +
        geom_line(linetype = 1, size = 1.5) +
        theme_xkcd() +
        xkcdaxis(xrange, yrange) +
        xlab("") +
        ylab("")
    }

Plotting the standard normal (0,1) density.

    x = seq(-3, 3, by = .1)
    y = dnorm(x, mean = 0, sd = 1)
    xkcd_density(x, y)

Plotting the chi-square (df = 3) density.

    x = c(0, seq(0.1, 10, by = .1))
    y = c(0, dchisq(x[-1], df = 3))
    xkcd_density(x, y)

There you have it! A nice looking (sort of) xkcd density plot!...
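To see what the jitter step contributes on its own, the sketch below applies base R's `jitter()` to the same standard normal grid without any plotting. With an explicit positive `amount`, `jitter()` adds uniform noise in the interval from -amount to +amount; the `amount` values and the seed here are assumptions chosen for illustration, whereas the function above relies on the defaults.

```r
# Same grid and density as the standard normal example.
x <- seq(-3, 3, by = 0.1)
y <- dnorm(x, mean = 0, sd = 1)

set.seed(42)  # arbitrary seed, only for reproducibility

# Illustrative amounts (assumptions, not the post's settings).
x_wobbly <- jitter(x, amount = 0.02)
y_wobbly <- jitter(y, amount = 0.005)

# Largest displacement on each axis; bounded by the chosen `amount`.
max(abs(x_wobbly - x))
max(abs(y_wobbly - y))
```

Larger `amount` values give a shakier line; values that are too large start to distort the shape of the density.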
How to
R
xkcd
density
statistics
How many rounds on top are necessary to be a soccer champion?
Apr 24, 2019
10 minutes read
Introduction

Like almost everyone here in Brazil, I have a passion for football (or soccer), and because of that my father and I always keep an eye on the national championship. A few days ago my father challenged me to understand what it takes to be a champion in the Brazilian national soccer league (known here as "Brasileirão") since the competition changed its format to a points-based championship in 2003. I accepted the challenge, and here are the results:

Data

First, we need to get the data! I haven't found a well-structured database to work with, so I had to work with the information provided by this site, and I had to write a function to scrape the data.

Packages used

    library(XML)      # Web scraping
    library(dplyr)    # Manipulation
    library(magrittr) # Manipulation
    library(tidyr)    # Manipulation
    library(scales)   # Formatting
    library(ggplot2)  # Plots
    library(pROC)     # ROC curves

This is the function that does the web scraping. Note that the site only has information up to 2015.

    # Base URL for all years
    url1 = "futpedia.globo.com/campeonato/campeonato-brasileiro/"
    # Empty data frame to store the information
    games = data_frame()
    for (i in 2003:2015) { ## years
      url2 = paste0(url1, i)
      readHTMLTable(url2,as.......
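The loop in this excerpt builds one URL per season by appending the year to a base address. Since `paste0()` is vectorised, the same list of addresses can be produced in one call, as in the sketch below. The names `base_url`, `years` and `season_urls` are hypothetical, and the base address is copied from the excerpt on the assumption that the space inside it is a line-wrap artifact. No pages are downloaded here.

```r
# Base address as quoted in the excerpt (stray space removed by assumption).
base_url <- "futpedia.globo.com/campeonato/campeonato-brasileiro/"

# Seasons covered by the site according to the post.
years <- 2003:2015

# paste0() is vectorised, so this yields one URL per season.
season_urls <- paste0(base_url, years)

length(season_urls)
head(season_urls, 2)
```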
Cool stuff
R
soccer
brazil
statistics
ROC
logistic-regression
How (and why) I made this blog
Apr 14, 2019
One minute read
It's been a long (kind of) time since I started to formulate the idea of a blog to share my statistical/analytical stuff, more precisely since I was in the second year of my bachelor's degree (2016). Since then I've learned a lot (in multiple fields), so I'm finally able to make this thing happen! As a statistician, I was molded in the R/S-PLUS world, so the most convenient way (that I've found) to make this site was the blogdown package for R, which is a really great tool for building blogs like this one. It is based on the Hugo framework and integrates nicely with the RStudio IDE. For this blog I'm using the AllinOne theme by Orianna with some modifications...
How to
R
blogdown
hugo