Run Expectancy/Empirical LWTS

145 views
Skip to first unread message

John Choiniere

unread,
Nov 30, 2013, 1:19:49 PM11/30/13
to baseball-sq...@googlegroups.com
One of the first ways I tested myself when I was starting to learn how to use MySQL was attempting to figure out linear weights values empirically for each individual year (since 1974).  As such, I've got queries that should work for building run expectancy/LWTS tables.  Now, I'll always be more interested in figuring this stuff out for myself, so if someone else were posting them I wouldn't bother downloading/using them, but not everyone necessarily feels that way, so I thought I might offer - would anyone be interested in having these?  If so, I can take a little time to format/comment them and post them as attachments.  Let me know.

jonath...@gmail.com

unread,
Nov 30, 2013, 1:26:02 PM11/30/13
to baseball-sq...@googlegroups.com
I'd definitely be very interested in seeing those queries. I like to do things myself too but I also like to learn from other people's queries. Generating empirical run expectancy is on my list of queries to write.

John Choiniere

unread,
Dec 12, 2013, 1:39:18 PM12/12/13
to baseball-sq...@googlegroups.com
So I dug back through my old queries and decided I had done this really inefficiently.  I rewrote the procedure to be fewer steps.

Step One - Run Expectancy:
create table run_expectancy_retro_pbp as
select
substring(game_id,4,4) AS yearID,
outs_ct AS outs,
concat( if(base1_run_id <> "","1","_"),
if(base2_run_id <> "","2","_"),
if(base3_run_id <> "","3","_")) AS base_state,
count(*) AS count,
sum(fate_runs_ct) + sum(event_runs_ct) AS inn_runs,
(sum(fate_runs_ct) + sum(event_runs_ct))/count(*) AS run_expectancy
from
events_regseason
where
substring(game_ID,4,4)>1973
AND ((inn_ct < 9) OR (inn_ct = 9 AND bat_team_id = away_team_id))
group by 
yearID,outs,base_state
order by yearID asc, outs asc, find_in_set(base_state,'___,1__,_2_,__3,12_,1_3,_23,123')


Step Two - Linear Weights by year and retrosheet event:

select
substring(a.game_ID,4,4) AS year,
a.event_cd AS event,
count(a.event_cd) AS count,
sum(b.run_expectancy)/count(a.event_cd) AS re_pre,
sum(c.run_expectancy + a.event_runs_ct)/count(a.event_cd) AS re_post,
(sum(c.run_expectancy + a.event_runs_ct)/count(a.event_cd))-(sum(b.run_expectancy)/count(a.event_cd)) AS re_total

from
events_regseason a
inner join run_expectancy_retro_pbp b
on (substring(a.game_ID,4,4) = b.yearID 
AND a.outs_ct = b.outs 
AND concat(
if(a.base1_run_id <> "","1","_"),
if(a.base2_run_id <> "","2","_"),
if(a.base3_run_id <> "","3","_")) = b.base_state)
inner join run_expectancy_retro_pbp c
on (substring(a.game_ID,4,4) = c.yearID 
AND (a.outs_ct+a.event_outs_ct) = c.outs 
AND concat(
if((a.bat_dest_id = 1 OR a.run1_dest_id = 1),"1","_"),
if((a.bat_dest_id = 2 OR a.run1_dest_id = 2 OR a.run2_dest_id = 2),"2","_"),
if((a.bat_dest_id = 3 OR a.run1_dest_id = 3 OR a.run2_dest_id = 3 OR a.run3_dest_id = 3),"3","_")) = c.base_state)
group by year, event

As per usual, let me know if you see any mistakes.  This relies on Retrosheet PBP as processed by the Chadwick tool including extra fields, and will not work when used with bevent-processed files.  The second query takes ~1.5 hrs to run on my machine, fyi.  Note that the LWTS query relies on a table created by the RE queries; there might be a way around that, but I wanted both to exist anyway.

-John

Colin Wyers

unread,
Dec 13, 2013, 11:44:29 AM12/13/13
to baseball-sq...@googlegroups.com
A few notes to improve the speed:

* You're referring to years a lot, you're probably going to see greatly improved performance if you have a year column in the database. This should do the trick:

ALTER TABLE events_regseason ADD year_id SMALLINT(4) UNSIGNED AFTER game_id;
UPDATE events_regseason SET year_id = substring(game_id,4,4);

I haven't run this code myself, so I would suggest making a backup before testing it. The reason I suggest SMALLINT rather than YEAR is because MySQL's YEAR type has a lower bound of 1901, there are box score event files from 1971, 1872 and 1874, and this maintains the ability to load files from those years.

Also, once you have a year column in there, I highly suggest partitioning by year. This tutorial is a bit out of date (particularly as regards Chadwick), but the code I used to partition the database is in the downloadable ZIP:


Partitioning makes disk reads/memory usage much more efficient for large tables.

* Instead of using the concat statement to group by base state, use START_BASES_CD and END_BASES_CD, part of the cwevent extended field numbers. For best performance, join on integers whenever you can, as opposed to character fields. There's two reasons for this:

1) Integers are going to be smaller than character-based fields -- in this case, if you're properly storing START_BASES_CD as a TINYINT, it's one third the size of the character-based field.
2) Overhead. MySQL is designed to be able to store text in multiple languages, which often use different character sets (and the same language can be represented by multiple character sets) and different ways of sorting those character sets, known as "collations" in MySQL. In order to compare two strings, MySQL has to load information about the character set and collations used. Integers are universal -- there's only one way of storing integers in any part of the world.

Both your GROUP BY and JOIN operations will be great sped up by using integer values in place of characters for base state information.

3) Use indexes. If you put the word EXPLAIN before SELECT in your query, instead of running the query, MySQL shows you how it plans to execute that query. This will tell you what indexes are being used to run a query. I can provide more details on proper indexing if people here are interested.

4) You have two versions of the run_expectancy_retro_pbp table being joined to your events table. You could achieve some rather substantial speed increases by only joining one table to it. How? First, create a transition matrix:

SELECT DISTINCT outs_ct AS start_outs_ct, outs_ct+event_outs_ct AS end_outs_ct

Colin Wyers

unread,
Dec 13, 2013, 11:52:52 AM12/13/13
to baseball-sq...@googlegroups.com
Whoops, hit Post on accident (that'll teach me to try and tab-indent when writing to Google Groups). Transition matrix:

SELECT DISTINCT outs_ct AS start_outs_ct, outs_ct+event_outs_ct AS end_outs_ct
    , start_bases_cd, end_bases_cd
FROM events_regseason
WHERE year_id = 2013;

It doesn't really matter what year you use, it just runs a lot faster on only one year (especially if you've partitioned the table) and the transition matrix is the same for all years. Then, you join two copies of your run expectancy matricies to that, and create a table that has all possible combinations of starting and ending run expectancy, by year. Then, instead of joining two copies of the run expectancy table to the events table, join your combined table.

Also, out of the box, MySQL has rather poor default settings for memory usage. To get a much better set of configuration settings, you can use the Configuration Wizard here:


To find the best settings for your computer. (There's also a pretty cool query analyzer there, too.) By default, MySQL on Windows stores its configuration settings in

C:\Program Files\MySQL\MySQL Server 5.5\my.ini

Where 5.5 is replaced by whatever your version number is (and if you're using a 32-bit version of MySQL on 64-bit Windows... well, stop doing that, but instead of "Program Files" it'll be in "Program Files (x86)"). If you've installed MySQL through something other than the official installer, like XAMPP or something, it'll be elsewhere. Linux users, it'll be in /etc/my.cnf. Mac users, I have no earthly idea. Simply replace the contents of that file with the output of the Percona tool (after making a backup) and restart your MySQL server.

jonath...@gmail.com

unread,
Dec 14, 2013, 8:41:57 PM12/14/13
to baseball-sq...@googlegroups.com
Thanks for sharing the script John, I'm gonna create the table tonight.

Colin, thanks for the tips. adding the year_id column dramatically improved the speed of a script I'm using to find natural cycles. It went from 2+ hours to 20 minutes (still getting unexpected data though). 

John Choiniere

unread,
Jan 17, 2014, 1:57:31 AM1/17/14
to baseball-sq...@googlegroups.com
I don't know if anyone used by RE creator, but I realized today it didn't include any three-out states.  So, you should add those in (I did it manually) and then re-run the LWTS query if you're using this.
Reply all
Reply to author
Forward
0 new messages