Tuesday, September 25, 2012

More on parsing data in R

Ask and you shall receive.

I pointed to the great guys at MCFC Analytics and Opta that having the table they provided in the appendix as Excel files (or whatever manageable format) would be a great time saver.
And they promptly provided me with what I asked for.

I'm not sure about the policy about putting that Excel file here, thus I won't do that.
However I'm pretty sure you can obtain it from them.

Below, an updated version of my code.

A few notes.

  1. I converted the .xlsx file Opta sent me to an .xls file, because I had some problem with the XLSX R package.
  2. The code can be run without the Excel file: I noted the code you should skip in that case.
  3. The code now parses some match info, like the teams, the players (as suggested in a comment), the scoring order. It does not grab everything in the F7 dataset, but it should be easily modifiable (or just ask in the comments).
  4. I wrote some code to add info to the events data set, like what team is involved and the score at that moment. Due to my familiarity with US sport it's away first... I know in soc... ahem... football should be the other way around.
Enjoy!

library(XML)
library(plyr)
library(reshape)
library(gdata)
 
f7 <- "c:/download/mcfc/Bolton_ManCityF7.xml" #file path & name (f7)
f24 <- "c:/download/mcfc/Bolton_ManCityF24.xml" #(f24)
#in case you have event and qualifier descriptions in xls file... (otherwhise comment the following 2 lines)
evNames <- read.xls("c:/download/mcfc/Event Definitions - Excel file.xls", sheet=1, as.is=T)
quNames <- read.xls("c:/download/mcfc/Event Definitions - Excel file.xls", sheet=2, as.is=T)
 
#utility function
grabAll <- function(XML.parsed, field){
  parse.field <- xpathSApply(XML.parsed, paste("//", field, "[@*]", sep=""))
  results <- t(sapply(parse.field, function(x) xmlAttrs(x)))
  if(typeof(results)=="list"){
    do.call(rbind.fill, lapply(lapply(results, t), data.frame, stringsAsFactors=F))
  } else {
    as.data.frame(results, stringsAsFactors=F)
  }
}
 
#team parsing
gameParse <- xmlInternalTreeParse(f7)
teamParse <- xpathSApply(gameParse, "//TeamData")
teamParse2 <- xpathSApply(gameParse, "//Team/Name")
 
teamInfo <- data.frame(
  team_id = sapply(teamParse, function(x) xmlGetAttr(node=x, "TeamRef"))
  , team_side = sapply(teamParse, function(x) xmlGetAttr(node=x, "Side"))
  , team_name = sapply(teamParse2, function(x) xmlValue(x))
  , stringsAsFactors=F
)
 
#players parsing
playerParse <- xpathSApply(gameParse, "//Team/Player")
lineupParse <- xpathSApply(gameParse, "//Team")
 
NPlayers <- sapply(lineupParse, function(x) sum(names(xmlChildren(x)) == "Player"))
 
playerInfo <- data.frame(
  player_id = sapply(playerParse, function(x) xmlGetAttr(node=x, "uID"))
  , team_id = c(rep(teamInfo$team_id[1], NPlayers[1]), rep(teamInfo$team_id[2], NPlayers[2]))
  , position = sapply(playerParse, function(x) xmlGetAttr(node=x, "Position"))
  , first_name = sapply(playerParse, function(x) xmlValue(xmlChildren(xmlChildren(x)$PersonName)$First))
  , last_name = sapply(playerParse, function(x) xmlValue(xmlChildren(xmlChildren(x)$PersonName)$Last))
)
 
#scoring order
goalInfo <- grabAll(teamParse[[1]], "Goal")
 
goalInfo$TimeStamp <- as.POSIXct(goalInfo$TimeStamp, format="%Y%m%dT%H%M%S")
 
scoringOrderInfo <- goalInfo[order(goalInfo$TimeStamp), c("TimeStamp", "uID")]
scoringOrderInfo$team_id <- substr(gsub("g", "t", scoringOrderInfo$uID), 1, 3)
scoringOrderInfo <- merge(scoringOrderInfo, teamInfo)
scoringOrderInfo$Away <- 0
scoringOrderInfo$Home <- 0
for(i in 1: dim(scoringOrderInfo)[1]){
  dt <- subset(scoringOrderInfo, TimeStamp <= scoringOrderInfo$TimeStamp[i])
  scoringOrderInfo[i,c("Away", "Home")] <- table(dt$team_side)
}
scoringOrderInfo$Score <- paste(scoringOrderInfo$Away, scoringOrderInfo$Home, sep="-")
scoringOrderInfo <- scoringOrderInfo[order(scoringOrderInfo$TimeStamp),]
 
 
#Play-by-Play Parsing
pbpParse <- xmlInternalTreeParse(f24)
eventInfo <- grabAll(pbpParse, "Event")
eventParse <- xpathSApply(pbpParse, "//Event")
NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
QInfo <- grabAll(pbpParse, "Q")
EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=F)
QInfo <- cbind(EventsExpanded, QInfo)
names(QInfo)[c(1,3)] <- c("Eid", "Qid")
QInfo$value <- ifelse(is.na(QInfo$value), -1, QInfo$value)
Qual <- cast(QInfo, Eid ~ qualifier_id)
 
#comment the following loop if you have commented the xls files loading at the beginning
for(i in names(Qual)[-1]){
  txt <- quNames[which(quNames$id==as.integer(i)), "name"]
  txt <- gsub('[[:space:]]+$', '', txt)
  lbl <- tolower(gsub("-", "_", gsub(" ", "_", txt, fixed=T), fixed=T))
  names(Qual)[which(names(Qual)==i)] <- lbl
}
 
#final data set
events <- merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=T, suffixes=c("", "Q"))
 
#adjustment of variables
events$TimeStamp <- as.POSIXct(events$timestamp, format="%Y-%m-%dT%H:%M:%S")
events$x <- as.double(events$x)
events$y <- as.double(events$y)
events$Score <- cut(events$TimeStamp, c(min(events$TimeStamp), scoringOrderInfo$TimeStamp, max(events$TimeStamp)+1), c("0-0", scoringOrderInfo$Score))
events$team_id <- paste("t", events$team_id, sep="")
events <- merge(events, teamInfo)
Created by Pretty R at inside-R.org

Sunday, September 23, 2012

The seconds before shoting, visualized

I took the chance of looking into the "Advanced" data set for learning something of the ggplot2 package (I have been doing my visualizations with lattice until now.)

So here's what I have done.
I looked at where the ball was up till 20 seconds before a shot was attempted.
Click on the picture below for an enlarged version.

I don't have much to comment on this chart, except that I suppose it could be useful having this kind of visualization (or animated heatmaps) for a full season of data. That would allow to see where the action begins for a particular team (or for teams playing against a particular opponent.)

Below the same chart, with lines added for plays leading to a goal.



Friday, September 14, 2012

R code for managing the F24 dataset

Many times I have benefited from the work of great guys, who were so kind to share the results of their labor.

Particularly, I would not be a top baseball data analyst if not for Kyle Wilkomm's code at Baseball On a Stick.

I haven't had much time to do some analysis on the newly released advanced data set by MCFC Analytics and Opta, so I thought it could be a good idea to share the R code I wrote for massaging the provided XML file a bit.

So, for the R users who are a bit stuck with the file, here's some help (I hope).

The code that follows only works on the "big file", the one containing the play-by-play.

The resulting dataset contains 1673 events, with 121 variables.

Some more notes after the code.


library(XML)
library(plyr)
library(reshape)
 
fnm <- "c:/download/mcfc/Bolton_ManCityF24.xml" #file path & name
 
#utility function
grabAll <- function(XML.parsed, field){
  parse.field <- xpathSApply(XML.parsed, paste("//", field, "[@*]", sep=""))
  results <- t(sapply(parse.field, function(x) xmlAttrs(x)))
  if(typeof(results)=="list"){
    do.call(rbind.fill, lapply(lapply(results, t), data.frame, stringsAsFactors=F))
  } else {
    as.data.frame(results, stringsAsFactors=F)
  }
}
 
#XML Parsing
pbpParse <- xmlInternalTreeParse(fnm)
eventInfo <- grabAll(pbpParse, "Event")
eventParse <- xpathSApply(pbpParse, "//Event")
NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
QInfo <- grabAll(pbpParse, "Q")
EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=F)
QInfo <- cbind(EventsExpanded, QInfo)
names(QInfo)[c(1,3)] <- c("Eid", "Qid")
QInfo$value <- ifelse(is.na(QInfo$value),-1, QInfo$value)
Qual <- cast(QInfo, Eid ~ qualifier_id)
 
#final data set
events <- merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=T)
Created by Pretty R at inside-R.org


Variables coming from the qualifiers are named with numbers, and I know this is not a good practice.
If someone has turned the tables in the pdf file provided by Opta into a spreadsheet (or the good guys at Opta are willing to share those table in an easier to manage format), please share—as I have done with the code ;-). In that way, it would be easy to give columns meaningful (and good-practicesy) names.

Let me know if you find this useful, or if you have any comments.