Monte Carlo or Bust
Monte Carlo simulation is basically a computer model that repeatedly plays out an uncertain situation. For my MLB project, that means playing out many innings and averaging the number of runs scored per inning. You can read more about the method here: http://en.wikipedia.org/wiki/Monte_Carlo_method
To assist with this we can use the RAND() function in MS Excel, which generates a random number that is greater than or equal to 0 and less than 1. We can simulate an inning for a player by assigning an outcome of a HOME RUN to a random number less than 0.5 and an outcome of an OUT to a random number of 0.5 or above. We can then run 5,000 simulated innings and work out the average runs per inning.
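The coin-flip example above can be sketched outside Excel too. Here is a minimal Python equivalent, where `random.random()` plays the part of RAND(); with these toy probabilities an inning averages out to about three home runs before the third out arrives:

```python
import random

random.seed(1)

def simulate_inning():
    """Toy inning: each batter either homers (random number < 0.5)
    or makes an out, mirroring the RAND() example."""
    runs = outs = 0
    while outs < 3:
        if random.random() < 0.5:
            runs += 1   # HOME RUN
        else:
            outs += 1   # OUT
    return runs

n = 5000
avg_runs = sum(simulate_inning() for _ in range(n)) / n
```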
In the model we need to work out all the events that can occur at each plate appearance through an inning and assign probabilities to these events, as well as to all the different types of play. So, for example, since the year 2000, 1.86% of (At Bats + Sacrifice Hits + Sacrifice Flies) have resulted in ERRORS. The calculation for OUTS (in play) is ((AB + SH + SF) – hits – errors – strikeouts), and so on. For the ‘types of play’, i.e. short or long singles, doubles, fly balls etc., I have just used the estimates from the Mathletics book. We can input this data into a spreadsheet.
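The OUTS (in play) calculation is simple enough to express as a one-line helper (the function name is mine, and the figures are Joey Votto's 2012 numbers listed below):

```python
def outs_in_play(ab_sh_sf, hits, errors, strikeouts):
    """OUTS (in play) = (AB + SH + SF) - hits - errors - strikeouts."""
    return ab_sh_sf - hits - errors - strikeouts

# Joey Votto 2012: 376 - 126 - 5 - 85 = 160
votto_outs = outs_in_play(376, 126, 5, 85)
```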
The next step is to input some relevant player data. Using Joey Votto’s 2012 data (for consistency) we input all his relevant figures into the spreadsheet, which basically consist of the following (bear in mind he missed a good spell on the sidelines following knee surgery):
Plate Appearances = 475
At Bats + Sacrifice Hits + Sacrifice Flies = 376
Errors = 5 (0.01053)
Outs (in play) = 160 (0.33684)
Strikeouts = 85 (0.17895)
Walks = 94 (0.19789)
HBP = 5 (0.01053)
Singles = 68 (0.14316)
Doubles = 44 (0.09263)
Triples = 0 (0)
Home Runs = 14 (0.02947)
In brackets are the probabilities of each event, calculated as (frequency of event / total plate appearances).
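These counts and probabilities are easy to tabulate in code as a sanity check; a small Python sketch (the dictionary and variable names are mine), confirming the event counts cover all 475 plate appearances so the probabilities sum to 1:

```python
# Event counts for Joey Votto, 2012 (from the list above)
counts = {
    "errors": 5, "outs_in_play": 160, "strikeouts": 85, "walks": 94,
    "hbp": 5, "singles": 68, "doubles": 44, "triples": 0, "home_runs": 14,
}
plate_appearances = 475

# Probability of each event = frequency / total plate appearances
probs = {event: n / plate_appearances for event, n in counts.items()}
```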
Monte Carlo Simulation
I am using a 15-day trial of the @RISK Excel add-in but will have to come up with a better (lower-cost) solution for the rest of the year, as I’m not prepared to outlay a grand for an Excel plug-in during a testing phase. If anyone knows of any decent free/cheap add-ins for running Monte Carlo simulations, or alternative R solutions, then please let me know. I have been looking into SIMTOOLS.XLA from the University of Chicago, but if anyone knows anything better then please tweet me @formbet.
In the meantime I am going with the @RISK 15-day trial. When running the simulation it basically “plays out” an inning thousands of times, generating an ‘event’ for each plate appearance based on the above probabilities, in this case for Joey Votto.
Again I won’t go into the full detail of this from the Mathletics book, which also provides a spreadsheet that can be used. Suffice to say that playing out 5,000 iterations with @RISK gives a resultant MEAN figure of 1.185 runs per inning. Multiplying this figure by 8.91 innings gives a model estimate of 10.56 Runs Created Per Game for Joey Votto, which is extremely accurate considering his actual “Runs Created Per Game” in 2012 was 10.61.
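The @RISK spreadsheet itself isn't reproduced here, but the idea can be sketched in Python. This is a deliberately simplified version of the Mathletics model: it uses Votto's event probabilities for every batter, treats errors like singles, and makes crude base-running assumptions (a single scores runners from 2nd and 3rd, a double scores everyone past 1st, a walk only forces runners along), so the resulting mean won't exactly match the 1.185 from @RISK:

```python
import random

random.seed(7)

# Joey Votto 2012 event probabilities from the table above;
# errors are treated like singles, walks and HBP are combined
probs = {
    "out":      (160 + 85) / 475,   # outs in play + strikeouts
    "walk":     (94 + 5) / 475,     # walks + hit by pitch
    "single":   (68 + 5) / 475,     # singles + errors
    "double":   44 / 475,
    "triple":   0 / 475,
    "home_run": 14 / 475,
}
events, weights = zip(*probs.items())

def simulate_inning():
    """Play out one inning with simplified base running."""
    bases = [False, False, False]  # 1st, 2nd, 3rd occupied?
    runs = outs = 0
    while outs < 3:
        ev = random.choices(events, weights)[0]
        if ev == "out":
            outs += 1
        elif ev == "walk":
            if all(bases):
                runs += 1                  # bases loaded, run forced in
            elif bases[0] and bases[1]:
                bases[2] = True            # runner forced to 3rd
            elif bases[0]:
                bases[1] = True            # runner forced to 2nd
            bases[0] = True                # batter takes 1st
        elif ev == "single":
            runs += bases[1] + bases[2]    # 2nd and 3rd score
            bases = [True, bases[0], False]
        elif ev == "double":
            runs += bases[1] + bases[2]
            bases = [False, True, bases[0]]  # 1st goes to 3rd
        elif ev == "triple":
            runs += sum(bases)
            bases = [False, False, True]
        else:  # home run clears the bases
            runs += 1 + sum(bases)
            bases = [False, False, False]
    return runs

n = 5000
mean_runs = sum(simulate_inning() for _ in range(n)) / n
```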
Now, we already determined using Linear Weights that Votto added 67.17 runs to an ‘average MLB team’ based on the 2000-2010 data, but what we really want to know is how many actual wins Votto added to the Cincinnati Reds in 2012.
Firstly we take the Cincinnati Reds stats below WITHOUT Votto in the team and the resultant probabilities. Playing this out 5,000 times using @RISK we get an average of 3.69 Runs Per Game, so over 162 games that equates to 597.86 runs over the season without Votto in the squad.
We can then work out how many wins Votto added compared to an ‘average’ Reds hitter by using the “Predicted Win %” formula from earlier on.
The Reds scored 669 and gave up 588 runs in 2012 so their scoring ratio was 1.138 and the Predicted Win% formula estimates that they ‘should’ have won 91 games with Votto in the team (56.4% of 162 games).
Using the previous figures WITHOUT Votto we get a scoring ratio of 1.017 (597.86/588). This gives a predicted winning percentage of 50.8%, or 82 games. So we can estimate that with Votto in the squad rather than an ‘average’ batter, the Cincinnati Reds gained an extra 9 wins over the course of the season.
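The “Predicted Win %” figures above are consistent with the Pythagorean expectation with exponent 2, i.e. ratio² / (ratio² + 1); a quick check in Python reproduces the 91-win and 82-win estimates:

```python
def predicted_win_pct(runs_scored, runs_allowed):
    """Predicted Win % from the scoring ratio R: R^2 / (R^2 + 1)."""
    r = runs_scored / runs_allowed
    return r ** 2 / (r ** 2 + 1)

games = 162
wins_with_votto = predicted_win_pct(669, 588) * games        # about 91
wins_without_votto = predicted_win_pct(597.86, 588) * games  # about 82
```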
So it seems that Monte Carlo simulation could be the best way to determine the Linear Weightings, although I would like to investigate alternatives using some free Monte Carlo add-ins and also look into using R, which is my usual weapon of choice.
Bill James advocates comparing a player to an ‘average MLB player’, which is something I will look to do at some point. Rather than delve into that just now, and before I get started on the pitchers, I will take a few weeks to take on board all the methods of evaluating batters while they’re still fresh in my mind. I will try to make some predictions and decide on what kind of software and automation I want to use to get all the data out of the database and into the format I want for ease of use, as at the moment it’s a little haphazard. This data is obviously historical, from the Lahman DB, but ideally during the 2013 season I will need to find a source that can give me the same information on a daily or weekly basis. So it’s a few weeks of planning, research, SQL and testing various batting models. Once I am happy with all that I can get stuck into the pitching data, which is another huge piece of what is a very complicated puzzle.
Will be back in mid-Feb sometime with an update.
OPS…I did it again
Without going into too much detail, the “Runs Created” and “Linear Weights” models give very similar predictions for the number of runs a player is responsible for in a game, so I will stick with the easier-to-calculate “Runs Created” for each player, especially as it can be computed automatically from the database.
However, On-Base Plus Slugging (OPS) is what I want to touch on next. The formula is simply OPS = OBP + SLG, where OBP = (H + BB + HBP) / (AB + BB + HBP + SF) and SLG = Total Bases / AB.
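As a worked example, here is OPS = OBP + SLG in Python using Joey Votto's 2012 figures (AB = 374 and SF = 2 are my inferences from the AB + SH + SF = 376 total used elsewhere on this blog):

```python
def on_base_pct(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slugging_pct(singles, doubles, triples, hr, ab):
    """SLG = total bases / at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab

# Joey Votto, 2012
obp = on_base_pct(h=126, bb=94, hbp=5, ab=374, sf=2)
slg = slugging_pct(singles=68, doubles=44, triples=0, hr=14, ab=374)
ops = obp + slg
```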
So again this is ‘fairly’ easy to calculate and extract from the Lahman DB. The previously calculated “Linear Weights” model had an R-squared of 91.1%, with only a 2.7% difference on the model data and a 2.4% difference on the fresh sample data from 2011+2012.
The OPS model has an R-squared of 90.8%, with a 2.4% difference between estimated and actual runs on the 2000-2010 data and a 2.6% difference on the 2011+2012 data, so there is hardly anything in it really. As OPS is probably slightly easier to understand and calculate, with just the two variables in the model, it seems the way to go, since most of the grunt work can be done at the database level.
In the Mathletics book it’s stated that OBP is roughly twice as important as SLG, but from my 2000-2010 model I estimate it’s about 1.68 times as important. It could be that power hitters/sluggers are having a bit more influence than they used to, but I’m thinking out loud and probably need to bring all the data up to date to 2012.
“Runs Over Average (ROA)” can be used to estimate how many runs a certain player could add to a team. In the example given in the Mathletics book, adding a 2004 Ichiro Suzuki to an ‘average’ team would add an extra 59.11 runs.
The way I decided to practise this was to take the Linear Weights and averages for Singles, Doubles, Triples, Home Runs, Walks + Hit By Pitch and Stolen Bases and compare them to the individual stats of Joey Votto, whom I had calculated to have the highest “Runs Created Per Game” figure in the earlier analysis.
Now, bear in mind this is an average team as calculated between 2000 and 2010. To work out the real worth of a player’s addition to a team it may be best to have individual team ‘models’ and base the calculations on the most recent data, as teams change over time, but it’s close enough to give a bit of practice and trigger some ideas for the future in estimating a player’s potential worth to an actual team rather than an average one.
Next up I’ll be looking at evaluating teams and hitters using Monte Carlo Simulation.
Weight a minute…it’s Linear!
An alternative approach to the “Runs Created” formula is to use Linear Weights, or Multiple Linear Regression if you prefer. It basically involves assigning a set of weights to variables to arrive at “Predicted Runs” for a season. I’m familiar with this approach from horse racing but have not really used the Regression tool in Excel before.
As Excel is the regression weapon of choice in the Mathletics book (Chapter 3), I thought it would be good to analyse the team data from the year 2000 through to 2010, apply the resulting formula to that same dataset, and then also run it against the 2011 and 2012 seasons to see how accurately the weightings computed “Predicted Runs” on a fresh sample of data.
For the independent variables we are going to use the following stats for each team:
BB+HBP (Bases On Balls + Hit By Pitch)
1B (Singles)
2B (Doubles)
3B (Triples)
HR (Home Runs)
SB (Stolen Bases)
CS (Caught Stealing)
Running this against the 2000-2010 data, we arrive at the following formula, with the best set of Linear Weights preceded by the constant (Intercept), to predict runs scored in a season:
Predicted Runs1 = -535.25 + 0.58(singles) + 0.76(2B) + 1.25(3B) + 1.48(HR) + 0.33(BB+HBP) + 0.09(SB) + 0.08(CS)
This suggests that each single creates 0.58 runs, each double 0.76 runs, and so on.
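Excel's Regression tool isn't the only way to get these weights; the same ordinary-least-squares fit can be reproduced with numpy. The team data below is synthetic (generated from the fitted weights plus noise, with made-up stat ranges) purely to show the mechanics of recovering the coefficients:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 330  # roughly 11 seasons x 30 teams

# Synthetic per-team season totals (uniform ranges, illustration only)
singles = rng.integers(850, 1100, n)
doubles = rng.integers(230, 330, n)
triples = rng.integers(15, 45, n)
hr      = rng.integers(100, 250, n)
bb_hbp  = rng.integers(450, 700, n)
sb      = rng.integers(40, 160, n)

# Design matrix with an intercept column, as in Excel's Regression tool
X = np.column_stack([np.ones(n), singles, doubles, triples, hr, bb_hbp, sb])

# Generate "actual" runs from the fitted weights plus noise
true_w = np.array([-531.76, 0.59, 0.76, 1.26, 1.48, 0.33, 0.11])
runs = X @ true_w + rng.normal(0, 20, n)

# Ordinary least squares recovers weights close to true_w
coef, *_ = np.linalg.lstsq(X, runs, rcond=None)
```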
In the Mathletics book, Stolen Bases and Caught Stealing are dropped as they didn’t add anything to the model. From my analysis of the p-values there is a 57% chance that Caught Stealing is not needed to predict runs scored, but only a 7% chance that Stolen Bases is not, compared to 93% and 43% respectively from the Mathletics data for 2000-2006.
This suggests to me that between 2007 and 2010 Caught Stealing remained unhelpful, but stealing bases became more prevalent/valuable than it used to be, so I will leave Stolen Bases in for now and drop Caught Stealing, which gives a revised model to apply to the 2011-2012 data:
Predicted Runs2 = -531.76 + 0.59(singles) + 0.76(2B) + 1.26(3B) + 1.48(HR) + 0.33(BB+HBP) + 0.11(SB)
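Applied in code, the revised model is just a linear function of the six counting stats (the function name and the stat line below are made up for illustration):

```python
def predicted_runs(singles, doubles, triples, hr, bb_hbp, sb):
    """Revised 2000-2010 linear-weights model (Caught Stealing dropped)."""
    return (-531.76 + 0.59 * singles + 0.76 * doubles + 1.26 * triples
            + 1.48 * hr + 0.33 * bb_hbp + 0.11 * sb)

# A made-up team-season stat line, purely for illustration
runs = predicted_runs(singles=980, doubles=280, triples=30, hr=170,
                      bb_hbp=590, sb=90)
```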
When applied to the same 10-year dataset there was only a 2.7% difference between the runs the model predicted and the actual runs scored at team level, which is an improvement on the 3.3% using the “Runs Created” formula from the previous blog post.
When I applied this to the 2011 and 2012 datasets I expected the difference to be a little higher, but it was actually only 2.4%. Obviously the smaller sample size plays its part, but it’s very encouraging nonetheless. I am not sure yet how this will have any predictive value, but I’m sure I will think of something during the course of future exercises. Meanwhile, answers and suggestions on a postcard/tweet please!
Next up though I will perform a similar analysis on the individual player batting data.