Friday, September 14, 2012

R code for managing the F24 dataset

Many times I have benefited from the work of great guys who were kind enough to share the results of their labor.

Particularly, I would not be a top baseball data analyst if not for Kyle Wilkomm's code at Baseball On a Stick.

I haven't had much time to analyze the advanced data set newly released by MCFC Analytics and Opta, so I thought it would be a good idea to share the R code I wrote to massage the provided XML file a bit.

So, for the R users who are a bit stuck with the file, here's some help (I hope).

The code that follows only works on the "big file", the one containing the play-by-play.

The resulting dataset contains 1673 events, with 121 variables.

Some more notes after the code.


library(XML)     #XML parsing and XPath queries
library(plyr)    #rbind.fill
library(reshape) #cast
 
fnm <- "c:/download/mcfc/Bolton_ManCityF24.xml" #file path & name
 
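#utility function: collect the attributes of every node named `field` into a data frame
grabAll <- function(XML.parsed, field){
  #select every node named `field` that carries at least one attribute
  parse.field <- xpathSApply(XML.parsed, paste("//", field, "[@*]", sep=""))
  results <- t(sapply(parse.field, function(x) xmlAttrs(x)))
  if(typeof(results)=="list"){
    #nodes differ in their attribute sets: build one-row data frames and let rbind.fill pad the gaps with NA
    do.call(rbind.fill, lapply(lapply(results, t), data.frame, stringsAsFactors=FALSE))
  } else {
    #all nodes share the same attributes: sapply already returned a character matrix
    as.data.frame(results, stringsAsFactors=FALSE)
  }
}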
 
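#XML Parsing
pbpParse <- xmlInternalTreeParse(fnm)
eventInfo <- grabAll(pbpParse, "Event") #one row per event
eventParse <- xpathSApply(pbpParse, "//Event")
#number of Q (qualifier) children of each event
NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
QInfo <- grabAll(pbpParse, "Q") #one row per qualifier
#repeat the first two event columns once per qualifier, so the rows line up with QInfo
EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=FALSE)
QInfo <- cbind(EventsExpanded, QInfo)
names(QInfo)[c(1,3)] <- c("Eid", "Qid")
#qualifiers without a value get -1, so after casting they differ from absent qualifiers (NA)
QInfo$value <- ifelse(is.na(QInfo$value), -1, QInfo$value)
#long to wide: one row per event, one column per qualifier_id
Qual <- cast(QInfo, Eid ~ qualifier_id)
 
#final data set
events <- merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=TRUE)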
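
To see what the last reshaping steps do without loading the full file, here is a minimal sketch on made-up data. QToy is a hypothetical stand-in for QInfo; the event ids, qualifier ids and values are invented for illustration and do not come from the F24 file.

```r
library(reshape)

# Hypothetical stand-in for QInfo: one row per qualifier ("long" form).
# All ids and values below are made up for illustration.
QToy <- data.frame(
  Eid          = c("e1", "e1", "e2"),
  qualifier_id = c("140", "141", "140"),
  value        = c("52.3", "48.1", NA),   # the e2 qualifier has no value attribute
  stringsAsFactors = FALSE
)

# Same trick as in the post: flag value-less qualifiers with -1
QToy$value <- ifelse(is.na(QToy$value), "-1", QToy$value)

# Long to wide: one row per event, one column per qualifier id
wide <- cast(QToy, Eid ~ qualifier_id)
print(wide)
```

In the result, event e2 should carry -1 under 140 (qualifier present, no value) and NA under 141 (qualifier absent), which seems to be the reason for recoding the missing values before casting.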


Variables coming from the qualifiers are named with numbers, which I know is not good practice.
If someone has turned the tables in the PDF file provided by Opta into a spreadsheet (or the good guys at Opta are willing to share those tables in an easier-to-manage format), please share, as I have done with the code ;-). That way, it would be easy to give the columns meaningful (and good-practicesy) names.
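
Once such a table exists, the renaming could look roughly like the sketch below. Everything in it is an assumption: lookup is a hypothetical two-column table (in practice it might be read from a spreadsheet export with read.csv), the names in it are placeholders rather than Opta's official labels, and eventsToy is a made-up stand-in for the events data frame.

```r
# Hypothetical lookup table: qualifier id -> readable name.
# The names are placeholders, NOT Opta's official labels.
lookup <- data.frame(
  qualifier_id = c("140", "141"),
  name         = c("q_example_a", "q_example_b"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the merged `events` data frame
eventsToy <- data.frame(id = c("e1", "e2"), stringsAsFactors = FALSE)
eventsToy[["140"]] <- c("52.3", "-1")
eventsToy[["141"]] <- c("48.1", NA)
eventsToy[["999"]] <- c(NA, "-1")   # an id that is missing from the lookup

# Rename only the columns found in the lookup; leave the others untouched
idx   <- match(names(eventsToy), lookup$qualifier_id)
found <- !is.na(idx)
names(eventsToy)[found] <- make.names(lookup$name[idx[found]], unique = TRUE)

print(names(eventsToy))
```

Columns whose id is not in the lookup keep their numeric name, so a partial table can be applied safely and extended later.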

Let me know if you find this useful, or if you have any comments.

2 comments:

  1. Hey! I am going through the F7 and I want to try to make a table of names from it. I tried altering your code to deal with that file, but since it's not of the form "First=..." (it's ...) I'm really not sure how to get at the info there. Any help you can give would be greatly appreciated!

  2. Go to the new post and you should find what you need.
