Monday, September 16, 2013

Using analytics to plan a (sports) trip

My latest foray into sports-data analytics has been a bit different from usual.
The goal was to plan a trip around Europe to watch hockey games.

The EuroHockey website has schedules for possibly every hockey league in the world (yes, despite the name, not only European ones), so I grabbed data about matches in the top continental leagues scheduled between September and December.

Then I opened my friend R.

As a first pass, I calculated the distances between each pair of games scheduled on consecutive days and filtered the list down to pairs within reasonable travel time of each other (so, sorry Admiral, I'm not gonna come to Vladivostok).

Then I recursively combined the pairs to obtain longer trips and added some subjective scores to rank the calculated trips to my liking.
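The pairing-and-filtering step can be sketched roughly as below. Everything here is a stand-in for illustration: the `games` columns (`date`, `city`, `lat`, `lon`) and the 800 km cutoff are hypothetical, not the actual data or threshold I used.

```r
#great-circle distance in km between two points (haversine formula)
haversine <- function(lat1, lon1, lat2, lon2, R = 6371) {
  toRad <- pi / 180
  dlat <- (lat2 - lat1) * toRad
  dlon <- (lon2 - lon1) * toRad
  a <- sin(dlat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(pmin(1, a)))
}

#pair each game with every game scheduled on the following day,
#keeping only the pairs within a reasonable travel distance
pairGames <- function(games, maxKm = 800) {
  ij <- expand.grid(from = seq_len(nrow(games)), to = seq_len(nrow(games)))
  ij <- subset(ij, games$date[to] == games$date[from] + 1)
  km <- haversine(games$lat[ij$from], games$lon[ij$from],
                  games$lat[ij$to], games$lon[ij$to])
  data.frame(from = games$city[ij$from], to = games$city[ij$to], km = km)[km <= maxKm, ]
}
```

The surviving pairs can then be chained recursively (a pair ending where another begins) to build the longer trips.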

The main parameters I used to rank my trips were:
  • short travel distances between cities;
  • preference for trips featuring games from multiple leagues;
  • a ranking of the leagues (e.g. priority to KHL games);
  • starting and ending points of the trip as close to home as possible (so that I can go just by train/bus).
After letting R do its job, some interesting solutions came up, but in the end I stuck with the one mapped below.
Clicking on the icons will show games (and make sure to zoom on Prague, as I'll be there both for KHL and Extraliga).


[Embedded map: Max EHT 2013 games; view in a larger map]
Thus, starting from December 1, I'll travel across 5 European countries to catch 8 games in 8 days, featuring 14 teams from 5 leagues, in 7 different cities.

Maybe I'll blog about it. Or maybe not.

Tuesday, February 12, 2013

Baseball effect on hockey

In case you missed it, there is a website named Open Source Sports which hosts databases for various sports, including Lahman's database for baseball.

Since on this blog, besides baseball, I have dealt with basketball, soccer, and football, it's time for a short post on hockey. (The fact that the Hockey DB is one of those currently available at OSS helps too.)

This post is based on a single query on the Master table, the one containing players' bio info.
It reports something that I believe is widely known, so consider it just a warm-up on the hockey database.

Percentage of left-shooting skaters by country

Slovakia       80
Sweden         78
Finland        76
Russia         75
Czech Republic 68
Canada         63
USA            55

The above numbers are for countries with at least 40 skaters in the database.
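For reference, here's roughly what that one-off computation looks like once the Master table is loaded into a data frame. The column names (`birthCountry`, `shootCatch`) are my recollection of the Hockey DB Master table, so check them against your copy.

```r
#percentage of left-shooting skaters by birth country,
#restricted to countries with at least 40 skaters in the table
leftShotPct <- function(master, minSkaters = 40) {
  m <- subset(master, shootCatch %in% c("L", "R"))
  tab <- table(m$birthCountry, m$shootCatch)
  tab <- tab[rowSums(tab) >= minSkaters, , drop = FALSE]
  sort(round(100 * tab[, "L"] / rowSums(tab)), decreasing = TRUE)
}
```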

So, while for the Europeans the percentage of right-shooting skaters (20-32%) is only somewhat above the share of left-handed people in the general population (which should be on the order of 15%), the North Americans lean way more to the right (37% of the Canadians and 45% of the Americans in the database shoot right).

One likely explanation is that North Americans grow up playing baseball, where the right-handed batting stance is on the same side as a right-shooting hockey player's. The USA being more extreme than Canada would support this.

An American friend of mine who coached Team Sweden (baseball) said everyone over there seemed to bat left-handed, which would support the case from the other direction.

Wednesday, January 16, 2013

Icing the kicker?

What do I do when there's no baseball around?
Sometimes I do like Rogers Hornsby and just stare out of the window, waiting for Spring. Other times I watch hockey (KHL so far this year) or even football.

Last weekend I happened to watch the Seahawks @ Falcons game and couldn't help but notice the timeout called by Seattle's head coach just before Atlanta kicked the decisive field goal.

After I understood that it was done just to disrupt the kicker's concentration (I'm not a football expert), I decided I could have a statistical look at the issue (as stats is something I know better).

The data

I found the following sources for play-by-play NFL data, both going back to 2002:
  • http://www.advancednflstats.com/2010/04/play-by-play-data.html
  • http://www.armchairanalysis.com/nfl-play-by-play-data.php

but neither had explicit information on when timeouts were called.
However, the play-by-play at the latter link is somewhat parsed and more ready to use, so I went with that.

Preparation

In order to identify when a timeout was called by the defensive team before a field goal, I looked at the timeouts remaining on field goal plays and on the plays immediately preceding them. When there was a difference, I classified the play as an "icing the kicker" attempt. Note that in some instances the difference in timeouts left might have been due to a lost challenge on the previous play.
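In code the idea looks something like the sketch below; `pbp` and its column names (`gid`, `play`, `type`, `def_to_left`) are made-up stand-ins for the Armchair Analysis fields, not the actual ones.

```r
#flag field goal plays on which the defense has fewer timeouts left
#than it had on the immediately preceding play of the same game
flagIcing <- function(pbp) {
  pbp <- pbp[order(pbp$gid, pbp$play), ]
  prev_to <- c(NA, head(pbp$def_to_left, -1))
  same_game <- c(FALSE, head(pbp$gid, -1) == tail(pbp$gid, -1))
  pbp$iced <- pbp$type == "FG" & same_game &
    !is.na(prev_to) & pbp$def_to_left < prev_to
  pbp
}
```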

I wanted to use the stadium as one of the predictors of field goal success, but the data I used named stadiums in many different ways (typos included), so I decided to use the home-team/season combination instead. Note that this means the games played at Wembley are treated no differently from those played at home by the Dolphins, the Saints, the Bucs or the 49ers.

Variables tested

I threw the following variables into my model (for those interested, a multilevel multivariable logistic regression):
  • the identity of the kicker;
  • distance (modeled linearly: it's a lazy choice, but not completely off the charts);
  • wind speed (no direction, as I would have needed to know the orientation of the field);
  • temperature;
  • being at home (for the kicker);
  • the "icing the kicker" dummy variable.


Results

Here's what I got. 
  • A 13% success reduction every 5 added yards of distance.
  • A 3% success reduction every additional 5mph of wind.
  • Around 1% success increase every 5 degrees (F) of temperature.
  • No effect for being at home.
All of the above seem to make sense. Here also are the best and worst kickers according to the model.

Best:
  1. Stover, Matt
  2. Gould, Robbie
  3. Kasay, John
  4. Akers, David
  5. Graham, Shayne
Worst:
  1. Peterson, Todd
  2. Hall, John
  3. Christie, Steve
  4. Gramatica, Martin
  5. Tynes, Lawrence
And here I need the help of knowledgeable NFL fans, to know whether the two lists pass the sniff test (though, for what I know, the first name out there seems OK).

 The "icing"

Finally, what about the "icing the kicker"?
Though the point estimate would hint at a possible effect (-3%), the variability is a bit large (from -8% to +1%).

For now I would dismiss the idea that it has any influence on the outcome, but some further analysis could be in order to look at the effect on particular kickers.

But, hey, the baseball season is approaching, so maybe someone else should look at this...

Tuesday, September 25, 2012

More on parsing data in R

Ask and you shall receive.

I pointed out to the great guys at MCFC Analytics and Opta that having the tables they provided in the appendix as Excel files (or whatever manageable format) would be a great time saver.
And they promptly provided me with what I asked for.

I'm not sure about the policy on posting that Excel file here, thus I won't do that.
However I'm pretty sure you can obtain it from them.

Below, an updated version of my code.

A few notes.

  1. I converted the .xlsx file Opta sent me to an .xls file, because I had some problems with the XLSX R package.
  2. The code can be run without the Excel file: I noted the code you should skip in that case.
  3. The code now parses some match info, like the teams, the players (as suggested in a comment), the scoring order. It does not grab everything in the F7 dataset, but it should be easily modifiable (or just ask in the comments).
  4. I wrote some code to add info to the events data set, like which team is involved and the score at that moment. Due to my familiarity with US sports, it's away first... I know in soc... ahem... football it should be the other way around.
Enjoy!

library(XML)
library(plyr)
library(reshape)
library(gdata)
 
f7 <- "c:/download/mcfc/Bolton_ManCityF7.xml" #file path & name (f7)
f24 <- "c:/download/mcfc/Bolton_ManCityF24.xml" #(f24)
#in case you have event and qualifier descriptions in the xls file... (otherwise comment out the following 2 lines)
evNames <- read.xls("c:/download/mcfc/Event Definitions - Excel file.xls", sheet=1, as.is=T)
quNames <- read.xls("c:/download/mcfc/Event Definitions - Excel file.xls", sheet=2, as.is=T)
 
#utility function
grabAll <- function(XML.parsed, field){
  parse.field <- xpathSApply(XML.parsed, paste("//", field, "[@*]", sep=""))
  results <- t(sapply(parse.field, function(x) xmlAttrs(x)))
  if(typeof(results)=="list"){
    do.call(rbind.fill, lapply(lapply(results, t), data.frame, stringsAsFactors=F))
  } else {
    as.data.frame(results, stringsAsFactors=F)
  }
}
 
#team parsing
gameParse <- xmlInternalTreeParse(f7)
teamParse <- xpathSApply(gameParse, "//TeamData")
teamParse2 <- xpathSApply(gameParse, "//Team/Name")
 
teamInfo <- data.frame(
  team_id = sapply(teamParse, function(x) xmlGetAttr(node=x, "TeamRef"))
  , team_side = sapply(teamParse, function(x) xmlGetAttr(node=x, "Side"))
  , team_name = sapply(teamParse2, function(x) xmlValue(x))
  , stringsAsFactors=F
)
 
#players parsing
playerParse <- xpathSApply(gameParse, "//Team/Player")
lineupParse <- xpathSApply(gameParse, "//Team")
 
NPlayers <- sapply(lineupParse, function(x) sum(names(xmlChildren(x)) == "Player"))
 
playerInfo <- data.frame(
  player_id = sapply(playerParse, function(x) xmlGetAttr(node=x, "uID"))
  , team_id = c(rep(teamInfo$team_id[1], NPlayers[1]), rep(teamInfo$team_id[2], NPlayers[2]))
  , position = sapply(playerParse, function(x) xmlGetAttr(node=x, "Position"))
  , first_name = sapply(playerParse, function(x) xmlValue(xmlChildren(xmlChildren(x)$PersonName)$First))
  , last_name = sapply(playerParse, function(x) xmlValue(xmlChildren(xmlChildren(x)$PersonName)$Last))
)
 
#scoring order
goalInfo <- grabAll(teamParse[[1]], "Goal")
 
goalInfo$TimeStamp <- as.POSIXct(goalInfo$TimeStamp, format="%Y%m%dT%H%M%S")
 
scoringOrderInfo <- goalInfo[order(goalInfo$TimeStamp), c("TimeStamp", "uID")]
scoringOrderInfo$team_id <- substr(gsub("g", "t", scoringOrderInfo$uID), 1, 3)
scoringOrderInfo <- merge(scoringOrderInfo, teamInfo)
scoringOrderInfo$Away <- 0
scoringOrderInfo$Home <- 0
for(i in 1:dim(scoringOrderInfo)[1]){
  dt <- subset(scoringOrderInfo, TimeStamp <= scoringOrderInfo$TimeStamp[i])
  #count goals by side so far (robust even when only one side has scored)
  scoringOrderInfo[i, c("Away", "Home")] <- c(sum(dt$team_side == "Away"), sum(dt$team_side == "Home"))
}
scoringOrderInfo$Score <- paste(scoringOrderInfo$Away, scoringOrderInfo$Home, sep="-")
scoringOrderInfo <- scoringOrderInfo[order(scoringOrderInfo$TimeStamp),]
 
 
#Play-by-Play Parsing
pbpParse <- xmlInternalTreeParse(f24)
eventInfo <- grabAll(pbpParse, "Event")
eventParse <- xpathSApply(pbpParse, "//Event")
NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
QInfo <- grabAll(pbpParse, "Q")
EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=F)
QInfo <- cbind(EventsExpanded, QInfo)
names(QInfo)[c(1,3)] <- c("Eid", "Qid")
QInfo$value <- ifelse(is.na(QInfo$value), -1, QInfo$value)
Qual <- cast(QInfo, Eid ~ qualifier_id)
 
#comment the following loop if you have commented the xls files loading at the beginning
for(i in names(Qual)[-1]){
  txt <- quNames[which(quNames$id==as.integer(i)), "name"]
  txt <- gsub('[[:space:]]+$', '', txt)
  lbl <- tolower(gsub("-", "_", gsub(" ", "_", txt, fixed=T), fixed=T))
  names(Qual)[which(names(Qual)==i)] <- lbl
}
 
#final data set
events <- merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=T, suffixes=c("", "Q"))
 
#adjustment of variables
events$TimeStamp <- as.POSIXct(events$timestamp, format="%Y-%m-%dT%H:%M:%S")
events$x <- as.double(events$x)
events$y <- as.double(events$y)
events$Score <- cut(events$TimeStamp, c(min(events$TimeStamp), scoringOrderInfo$TimeStamp, max(events$TimeStamp)+1), c("0-0", scoringOrderInfo$Score))
events$team_id <- paste("t", events$team_id, sep="")
events <- merge(events, teamInfo)

Sunday, September 23, 2012

The seconds before shooting, visualized

I took the chance to look into the "Advanced" data set to learn something about the ggplot2 package (I have been doing my visualizations with lattice until now).

So here's what I have done.
I looked at where the ball was up till 20 seconds before a shot was attempted.
Click on the picture below for an enlarged version.

I don't have much to comment on this chart, except that I suppose it could be useful to have this kind of visualization (or animated heatmaps) for a full season of data. That would allow one to see where the action begins for a particular team (or for teams playing against a particular opponent).

Below the same chart, with lines added for plays leading to a goal.



Friday, September 14, 2012

R code for managing the F24 dataset

Many times I have benefited from the work of great guys, who were so kind to share the results of their labor.

Particularly, I would not be a top baseball data analyst if not for Kyle Wilkomm's code at Baseball On a Stick.

I haven't had much time to do some analysis on the newly released advanced data set by MCFC Analytics and Opta, so I thought it could be a good idea to share the R code I wrote for massaging the provided XML file a bit.

So, for the R users who are a bit stuck with the file, here's some help (I hope).

The code that follows only works on the "big file", the one containing the play-by-play.

The resulting dataset contains 1673 events, with 121 variables.

Some more notes after the code.


library(XML)
library(plyr)
library(reshape)
 
fnm <- "c:/download/mcfc/Bolton_ManCityF24.xml" #file path & name
 
#utility function
grabAll <- function(XML.parsed, field){
  parse.field <- xpathSApply(XML.parsed, paste("//", field, "[@*]", sep=""))
  results <- t(sapply(parse.field, function(x) xmlAttrs(x)))
  if(typeof(results)=="list"){
    do.call(rbind.fill, lapply(lapply(results, t), data.frame, stringsAsFactors=F))
  } else {
    as.data.frame(results, stringsAsFactors=F)
  }
}
 
#XML Parsing
pbpParse <- xmlInternalTreeParse(fnm)
eventInfo <- grabAll(pbpParse, "Event")
eventParse <- xpathSApply(pbpParse, "//Event")
NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
QInfo <- grabAll(pbpParse, "Q")
EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=F)
QInfo <- cbind(EventsExpanded, QInfo)
names(QInfo)[c(1,3)] <- c("Eid", "Qid")
QInfo$value <- ifelse(is.na(QInfo$value),-1, QInfo$value)
Qual <- cast(QInfo, Eid ~ qualifier_id)
 
#final data set
events <- merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=T)


Variables coming from the qualifiers are named with numbers, and I know this is not good practice.
If someone has turned the tables in the PDF file provided by Opta into a spreadsheet (or the good guys at Opta are willing to share those tables in an easier-to-manage format), please share, as I have done with the code ;-). That way, it would be easy to give columns meaningful (and good-practice) names.

Let me know if you find this useful, or if you have any comments.

Thursday, August 30, 2012

Passing efficiency

I have been working some more on passing data.
This time I have looked at success percentage.

Considering every single pass recorded by Opta, EPL players succeed in reaching the target teammate 77% of the time.
Obviously not all the passes are created equal, as the success rate is 83% on short passes compared to 55% on long ones.
Similarly, teams are OK with letting their opponents move the ball in the opponents' own defensive third (93%; that's also partly due to teams taking fewer risks when in front of their own net), but they make it harder as the opponents become more of a threat (84% in the middle third and 66% in the final third).

Having data only at an aggregate level (the Lite dataset), one cannot go into deeper detail, such as combining pass length and position.

My goal here is to find factors related to the percentage of success.
While building my model I was interested in evaluating the effect of the passing player and team, the opposing team and the field of play.
I used a multilevel mixed model, as I do for baseball when evaluating the simultaneous effect of several players on an outcome (so that when I evaluate the ability of a catcher at preventing steals, I take into account his pitchers and the runners attempting steals against them).
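A minimal sketch of such a model, again with lme4 and invented column names (`succ` and `att` would be each row's successful and attempted passes in the aggregated Lite data):

```r
library(lme4)

#mixed model on aggregated counts: fixed effects for the pass zone
#and for playing at home, random intercepts for the passing team,
#the opposing team, and the venue (identified by its home team)
fitPassModel <- function(passes) {
  glmer(cbind(succ, att - succ) ~ zone + at_home +
          (1 | team) + (1 | opponent) + (1 | stadium),
        data = passes, family = binomial)
}
#ranef(fit)$team then gives each team's passing effect
#(on the log-odds scale)
```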

Here is the first table: percentage of success, relative to average. Zero means an average performance, so Manchester United has a pass success rate six percentage points higher than the average team.

team           %relative to average
Arsenal                         3.3
Aston Villa                    -4.0
Blackburn Rovers               -4.4
Bolton Wanderers               -5.4
Chelsea                         4.0
Everton                        -0.8
Fulham                          2.6
Liverpool                       2.7
Manchester City                 5.5
Manchester United               6.0
Newcastle United               -2.2
Norwich City                   -3.4
Queens Park Rangers            -2.5
Stoke City                     -6.0
Sunderland                     -2.8
Swansea City                    3.3
Tottenham Hotspur               4.6
West Bromwich Albion           -1.4
Wigan Athletic                  1.4
Wolverhampton Wanderers        -0.5

Again, as could have been anticipated, the top teams are the best at completing passes. As it was for passing attempts, Swansea City passes better than its position in the standings would indicate, while the opposite can be said of Newcastle United.

Let's look at the other side of the mirror.

opponent       %relative to average
Arsenal                        -2.7
Aston Villa                    -0.5
Blackburn Rovers                2.5
Bolton Wanderers               -0.8
Chelsea                        -0.3
Everton                        -0.8
Fulham                          3.5
Liverpool                      -0.6
Manchester City                -1.1
Manchester United              -0.8
Newcastle United               -0.5
Norwich City                   -0.4
Queens Park Rangers             0.6
Stoke City                     -1.1
Sunderland                     -0.1
Swansea City                    0.6
Tottenham Hotspur              -0.6
West Bromwich Albion            1.6
Wigan Athletic                  1.3
Wolverhampton Wanderers         0.2

Here the variation is much smaller, suggesting (not surprisingly) that the team (and the player) attempting the pass has more impact on its success than the opposing team.

Fulham comes out again as a beast of its own. For some reason, if you are an average team at completing passes, when you face Fulham you suddenly become Arsenal!
I would like to hear from people who watch many EPL matches whether Fulham has a peculiar way of defending, as they allow both more passes and more successful ones than expected.

And here are the park effects.

stadium        %relative to average
Arsenal                         0.4
Aston Villa                     0.3
Blackburn Rovers               -0.6
Bolton Wanderers               -1.0
Chelsea                         0.9
Everton                         0.5
Fulham                         -0.4
Liverpool                      -0.7
Manchester City                 1.5
Manchester United               1.4
Newcastle United               -0.3
Norwich City                    1.2
Queens Park Rangers            -1.6
Stoke City                     -2.3
Sunderland                     -0.1
Swansea City                    0.3
Tottenham Hotspur               0.7
West Bromwich Albion            0.5
Wigan Athletic                 -0.2
Wolverhampton Wanderers        -0.5

The baseball analyst in me made me throw the stadium variable into the model, but soccer pitches are of roughly fixed dimensions and certainly fixed shape, thus they shouldn't play much of a role. (Note that I referred to each stadium by the home team's name; sorry for not having bothered to attach the name of the venue.)

However small the park effect, I noted that the top teams play on pitches that make completing passes easier (OK, the highest value is just +1.5%!).
This makes sense to me, as top clubs likely have better resources to take care of the field and, being stuffed with talented players, are more interested in keeping a fair surface of play.

I turn once more to EPL connoisseurs, asking them to share any info on the field of play of Stoke City, which appears as the most good-pass-preventing venue.

Note that the stadium values are not influenced by the team calling it home. Thus the +1.4% of Old Trafford means that having accounted for the fact that the best passing team plays there, an increase of 1.4% in passing success is attributable to the venue itself.

I threw a couple more variables into the model.

One is the zone where the pass was made. A pass in the middle third of the field is 8% less likely to reach the intended teammate than a pass in the defensive third. Going into the offensive zone, the probability of success drops even more (-25% compared to the defensive third).

The second one is an indicator of whether the team attempting the pass is playing at home. In such cases the chances of a successful pass increase by about 1%. This also makes sense, as players are certainly more familiar with the grounds they play on half of the time.

OK. You may say "Wow, you told us that good teams are better at making passes, that playing at home is better and that once you get into the final third you have a tougher time—were advanced analyses needed for such obvious things?"

I agree, they weren't so much. However, when you want to evaluate individual passing ability, you have to make sure you remove other factors (such as the ones presented here) from the equation, so that when you show the passing rate of a player, you have taken into account the fact that he plays (for example) for a good team, on an uneven field, and makes his passes mostly in the offensive zone.

Thus I'm happy that the model so far has spit out "obvious" results, because I'll be more comfortable when I look at individual ratings.