Particularly, I would not be a top baseball data analyst if not for Kyle Wilkomm's code at Baseball On a Stick.
I haven't had much time to analyze the newly released advanced data set by MCFC Analytics and Opta, so I thought it would be a good idea to share the R code I wrote for massaging the provided XML file a bit.
So, for the R users who are a bit stuck with the file, here's some help (I hope).
The code that follows only works on the "big file", the one containing the play-by-play.
The resulting dataset contains 1673 events, with 121 variables.
Some more notes after the code.
library(XML)
library(plyr)
library(reshape)

fnm <- "c:/download/mcfc/Bolton_ManCityF24.xml" #file path & name

#utility function
grabAll <- function(XML.parsed, field){
  parse.field <- xpathSApply(XML.parsed, paste("//", field, "[@*]", sep=""))
  results <- t(sapply(parse.field, function(x) xmlAttrs(x)))
  if(typeof(results)=="list"){
    do.call(rbind.fill, lapply(lapply(results, t), data.frame, stringsAsFactors=FALSE))
  } else {
    as.data.frame(results, stringsAsFactors=FALSE)
  }
}

#XML Parsing
pbpParse <- xmlInternalTreeParse(fnm)
eventInfo <- grabAll(pbpParse, "Event")
eventParse <- xpathSApply(pbpParse, "//Event")
NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
QInfo <- grabAll(pbpParse, "Q")
EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=FALSE)
QInfo <- cbind(EventsExpanded, QInfo)
names(QInfo)[c(1,3)] <- c("Eid", "Qid")
QInfo$value <- ifelse(is.na(QInfo$value), -1, QInfo$value)
Qual <- cast(QInfo, Eid ~ qualifier_id)

#final data set
events <- merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=TRUE)
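If the qualifier step looks a bit opaque, here is a toy version of the long-to-wide reshaping, written with base R only so it runs without the XML file. The data frame `toy`, its ids and its values are all made up for illustration, and I am using `stats::reshape` here as a stand-in for `cast`, so take it as a sketch of the idea rather than the exact same operation.

```r
# Toy long-format qualifiers: one row per (event, qualifier) pair.
# Ids and values are invented for illustration only.
toy <- data.frame(
  Eid = c("e1", "e1", "e2"),
  qualifier_id = c("56", "140", "140"),
  value = c("Back", "35.2", "71.0"),
  stringsAsFactors = FALSE
)

# Wide format: one row per event, one column per qualifier id.
wide <- stats::reshape(toy, idvar = "Eid", timevar = "qualifier_id",
                       direction = "wide")

# reshape() prefixes the new columns with "value."; strip the prefix so
# the columns are named by the bare qualifier id, as in the code above.
names(wide) <- sub("^value\\.", "", names(wide))

print(wide)
```

An event that lacks a given qualifier ends up with NA in that column. As far as I can tell, that is why the code above recodes missing qualifier values to -1 before reshaping: a qualifier that is present but carries no value stays distinguishable from one that is absent.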
Variables coming from the qualifiers are named with numbers, and I know this is not a good practice.
If someone has turned the tables in the pdf file provided by Opta into a spreadsheet (or the good guys at Opta are willing to share those tables in an easier-to-manage format), please share, as I have done with the code ;-). That way, it would be easy to give the columns meaningful (and good-practicesy) names.
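Once such a table exists, the renaming itself could look roughly like the sketch below. Everything in it is hypothetical: `events_toy` is a small stand-in for the real data set, and the `lookup` table with its `qual_a` / `qual_b` names is a placeholder for whatever the Opta documentation actually says.

```r
# Stand-in for the final data set, with qualifier columns named by number.
# Values are invented for illustration only.
events_toy <- data.frame(
  id = c("e1", "e2"),
  `56` = c("Back", NA),
  `140` = c("35.2", "71.0"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: qualifier id -> readable column name.
# The real names would have to come from the Opta pdf.
lookup <- data.frame(
  qualifier_id = c("56", "140"),
  name = c("qual_a", "qual_b"),
  stringsAsFactors = FALSE
)

# Rename only the columns that have an entry in the lookup table;
# everything else (e.g. "id") keeps its current name.
idx <- match(names(events_toy), lookup$qualifier_id)
found <- !is.na(idx)
names(events_toy)[found] <- lookup$name[idx[found]]

print(names(events_toy))
```

Using `match` means columns without an entry in the lookup table are simply left alone, so a partial table is still useful.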
Let me know if you find this useful, or if you have any comments.
Hey! I am going through the F7 and I want to try to make a table of names from it. I tried altering your code to deal with that file, but since it's not of the form "First=..." (it's ...) I'm really not sure how to get at the info there. Any help you can give would be greatly appreciated!
Go to the new post and you should find what you need.