Programmer? Statistician?


jmblackmer

Oct 23, 2012, 11:38:19 AM
to study-hal...@googlegroups.com
Hey all,

I'm just trying to get a feel for everyone's area of expertise, to help organize efforts. Are you a programmer? A statistician? Someone just dabbling in both? What is your language or program of choice (specifically for working with the data)? What data beyond the provided CSV files is useful, and where do you get it?

I'm a programmer and I generally work in either C# or PHP. I've been putting together a C# library to parse the CSV files, which I plan to make available for others once I finish it. I've also extended it into a program that scrapes Covers.com historical data and stores it in a CSV. I guess I'm trying to gauge whether the library would be used before I worry about hosting it online. I'm also trying to figure out two things: whether there are other languages or programs that people use most often with this data, so that we can create libraries or scripts for those as well, and whether there are data sets from specific sites that we should write scrapers for and compile into CSVs to pair with the base data set.

Ed Feng

Oct 23, 2012, 11:43:11 AM
to jmblackmer, study-hal...@googlegroups.com

Parsing Covers.com into a CSV file would be incredible.

I'm a Python person, although math is my strong point.

I think it would be best to develop some data formats. Brett has a nice one for play-by-play logs. With a format in place, someone can go scrape a new site in their own language, and someone else can write a separate script that cross-checks all the data files from different sources.
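A rough picture of what such a cross-check script might look like, as a minimal Python sketch. It assumes each source has already been scraped into rows keyed by a shared game code; the column names (`game_code`, `home_score`, `away_score`) and the sample rows are made up for illustration and are not an agreed format.

```python
import csv
import io

# Hypothetical sample data: two sources reporting the same games.
SOURCE_A = """game_code,home_score,away_score
1001,31,17
1002,24,21
1003,10,13
"""

SOURCE_B = """game_code,home_score,away_score
1001,31,17
1002,24,20
1004,45,7
"""


def load_scores(text):
    """Return {game_code: (home_score, away_score)} from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["game_code"]: (int(row["home_score"]), int(row["away_score"]))
        for row in reader
    }


def cross_check(a, b):
    """Report games whose scores disagree and games only one source has."""
    mismatched = sorted(code for code in a.keys() & b.keys() if a[code] != b[code])
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return mismatched, only_a, only_b


mismatched, only_a, only_b = cross_check(load_scores(SOURCE_A), load_scores(SOURCE_B))
print("score mismatches:", mismatched)
print("missing from B:", only_a)
print("missing from A:", only_b)
```

Because the check only depends on the shared key and column names, it would not matter which language produced each file.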

Thanks for starting this thread.

Ed


--
The Power Rank.  Will your team win?

Travis Fossett

Oct 23, 2012, 11:44:02 AM
to jmblackmer, study-hal...@googlegroups.com
I'm a developer that primarily uses Microsoft products, C# with MSSQL back-ends. I've written a screen-scraper to pull box scores, play-by-play, and drive charts from ESPN, though their play-by-play can miss a play or repeat a drive at times.


Chris Treadaway

Oct 23, 2012, 11:48:36 AM
to Travis Fossett, jmblackmer, study-hal...@googlegroups.com
I'm a statistician armed with a megacomputer and licenses of all sorts of fun tools. ;)

My bias is to want as much raw, elemental data as possible -- but also give me a way to request "rollups".  
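To make the raw-data-plus-rollups idea concrete, here is a minimal Python sketch. It assumes play-level rows of the form (game code, team, yards gained); that layout and the sample values are invented for illustration. The same raw rows are summed at whatever level is requested.

```python
from collections import defaultdict

# Hypothetical play-level ("elemental") rows: (game_code, team, yards_gained).
PLAYS = [
    ("1001", "Alabama", 7),
    ("1001", "Alabama", -2),
    ("1001", "Auburn", 12),
    ("1002", "Alabama", 4),
    ("1002", "Baylor", 30),
]


def rollup(plays, key_fields):
    """Sum yards over the chosen key fields ("game", "team", or both)."""
    index = {"game": 0, "team": 1}
    totals = defaultdict(int)
    for play in plays:
        key = tuple(play[index[field]] for field in key_fields)
        totals[key] += play[2]
    return dict(totals)


by_team = rollup(PLAYS, ["team"])
by_game_team = rollup(PLAYS, ["game", "team"])
print(by_team)
print(by_game_team)
```

Keeping only the raw rows on disk and computing rollups on request means no aggregate ever drifts out of sync with the underlying data.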

Chris

--
Chris Treadaway
CEO, Notice Technologies
blog - http://treadaway.typepad.com
e-mail - ch...@noticetechnologies.com

Huck L Berry

Oct 23, 2012, 11:52:08 AM
to Chris Treadaway, Travis Fossett, jmblackmer, study-hal...@googlegroups.com
I'm an engineer.  I dabble and tinker until it works.  Got my first break into programming when Peter Wolfe sent me an Excel file with a macro that runs the late David Rothman's ratings.

I'll leave the programming to the experts.


David Fobare

Oct 23, 2012, 11:58:19 AM
to study-hal...@googlegroups.com
Delphi guy here. I've been parsing Covers for a long time. I've also written code for Yahoo, NFL.com, ESPN, DonBest and others.


Chris Treadaway

Oct 23, 2012, 12:02:54 PM
to David Fobare, study-hal...@googlegroups.com
FYI, on the parsing front, there is a great tool called Mozenda for those of us who are not programmers. With a little manipulation, it grabs pseudo-structured data pretty reliably.

Chris

Jonathan Hodges

Oct 23, 2012, 12:05:57 PM
to Chris Treadaway, David Fobare, study-hal...@googlegroups.com
Glad to see so many programmers here, as I don't have much experience there. Most of my background is in stats (I'm a quality engineer in my other life), but I'd be willing to contribute to data collection/organization. I know that I'll have more time to contribute come the end of the season in January.

--
Jonathan Hodges
Contributor, HailToPurple
Web: http://www.hailtopurple.com/jhodges/
Twitter: @hailtopurple
Facebook: https://www.facebook.com/hailtopurple
Email: j-ho...@alumni.northwestern.edu




Bill Connelly

Oct 23, 2012, 12:07:33 PM
to study-hal...@googlegroups.com
Seriously ... I may be asking for favors at some point ...  :-)




jake...@gmail.com

Oct 23, 2012, 12:10:52 PM
to Bill Connelly, study-hal...@googlegroups.com
I'm a mathematician/statistician with access to fun stuff as well. I primarily do my work in Microsoft/VBA, plus some SQL for parsing data.

If we're all going to be sharing data and files, a standard format would be nice for all the programmers, I imagine.

Matthew Smith

Oct 23, 2012, 12:15:25 PM
to study-hal...@googlegroups.com, Bill Connelly, jake...@gmail.com
I think that makes a lot of sense. My background is also more stats-oriented than programming-oriented. I do all of my football modelling work in Excel, so I can parse plenty of different table formats, but it's nice to have standard notation. As an FYI, my naming/numbering convention for 1-A teams (which I try to use when possible) is:

1 Air Force
2 Akron
3 Alabama
4 Arizona
5 Arizona State
6 Arkansas
7 Arkansas State
8 Army
9 Auburn
10 Ball State
11 Baylor
12 Boise State
13 Boston College
14 Bowling Green State
15 Buffalo
16 Brigham Young
17 California
18 Central Florida
19 Central Michigan
20 Cincinnati
21 Clemson
22 Colorado
23 Colorado State
24 Connecticut
25 Duke
26 East Carolina
27 Eastern Michigan
28 Florida
29 Florida State
30 Fresno State
31 Georgia
32 Georgia Tech
33 Hawaii
34 Houston
35 Idaho
36 Illinois
37 Indiana
38 Iowa
39 Iowa State
40 Kansas
41 Kansas State
42 Kent
43 Kentucky
44 Louisiana-Lafayette
45 Louisiana-Monroe
46 Louisiana Tech
47 Louisville
48 Louisiana State
49 Marshall
50 Maryland
51 Memphis
52 Miami (Florida)
53 Miami (Ohio)
54 Michigan
55 Michigan State
56 Middle Tennessee State
57 Minnesota
58 Mississippi
59 Mississippi State
60 Missouri
61 Navy
62 North Carolina State
63 Nebraska
64 Nevada
65 New Mexico
66 New Mexico State
67 North Texas
68 Northern Illinois
69 Northwestern
70 Notre Dame
71 Ohio
72 Ohio State
73 Oklahoma
74 Oklahoma State
75 Oregon
76 Oregon State
77 Penn State
78 Pittsburgh
79 Purdue
80 Rice
81 Rutgers
82 San Diego State
83 San Jose State
84 Southern Methodist
85 South Carolina
86 South Florida
87 Southern Mississippi
88 Stanford
89 Syracuse
90 Texas Christian
91 Temple
92 Tennessee
93 Texas
94 Texas A&M
95 Texas Tech
96 Toledo
97 Troy State
98 Tulane
99 Tulsa
100 Alabama-Birmingham
101 UCLA
102 North Carolina
103 Nevada-Las Vegas
104 Southern California
105 Utah
106 Utah State
107 Texas-El Paso
108 Vanderbilt
109 Virginia
110 Virginia Tech
111 Wake Forest
112 Washington
113 Washington State
114 West Virginia
115 Western Michigan
116 Wisconsin
117 Wyoming
118 Florida Atlantic
119 Florida International
120 Western Kentucky
121 UMass
122 South Alabama
123 Texas State
124 UTSA

Basically alphabetical (except UNC/UNLV use the U), and then I tacked the 1-A newcomers from the last decade or so onto the end.
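A numbering convention like the one above is easy to turn into a lookup table. The following is a minimal Python sketch, assuming the list is kept as plain "number name" lines; it uses a short excerpt of the list, and the function name is just an illustration.

```python
# A short excerpt of the numbered team list, in the same "number name" layout.
TEAM_LIST = """\
1 Air Force
2 Akron
3 Alabama
52 Miami (Florida)
53 Miami (Ohio)
94 Texas A&M
124 UTSA
"""


def parse_team_list(text):
    """Split each line on the first space into (number, name)."""
    by_number = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        number, name = line.split(" ", 1)
        by_number[int(number)] = name
    return by_number


by_number = parse_team_list(TEAM_LIST)
# Reverse lookup, case-insensitive so "akron" and "Akron" both resolve.
by_name = {name.lower(): number for number, name in by_number.items()}

print(by_number[53])
print(by_name["texas a&m"])
```

Splitting on the first space only is what keeps multi-word names such as "Miami (Ohio)" intact.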

jake...@gmail.com

Oct 23, 2012, 12:16:17 PM
to study-hal...@googlegroups.com
Also, I'd like to second that Mozenda is a good program for parsing. I use R for my analyses. Does anyone else out there use R or SAS for stat analysis?

Jake T.

Matt Mills

Oct 23, 2012, 12:16:29 PM
to study-hal...@googlegroups.com
I'm studying Industrial Engineering and have taken two Python courses, so basically nothing, haha. I'll definitely be trying to learn from all you guys though.

-Matt

Marty

Oct 23, 2012, 12:17:36 PM
to study-hal...@googlegroups.com
I do everything in Java, and use SQL extensively as all of my data is in a MySQL database.  Like I said in my intro, I'm not statistically inclined at all.

David Fobare

Oct 23, 2012, 12:18:26 PM
to study-hal...@googlegroups.com
My retirement fund wishes I used R. It also wishes I had professional Python skills.


Kalon Jelen

Oct 23, 2012, 12:23:24 PM
to David Fobare, study-hal...@googlegroups.com
I'm a programmer that works with C#, Java and SQL. No serious stats experience other than what I used at Amazon. 

One thought comes to mind: it might be more useful to either have this data in a database or have the data in a cube of some kind. That way the data can be fairly easily manipulated while still being standardized. 
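As a small illustration of the database idea, here is a minimal Python sketch that loads CSV rows into an in-memory SQLite table and lets the database do the aggregation. The table layout and sample games are hypothetical; the team codes 3, 9, and 11 are borrowed from Matthew's list (Alabama, Auburn, Baylor) purely as placeholders.

```python
import csv
import io
import sqlite3

# Hypothetical game data; the real CSV layout may differ.
GAME_CSV = """game_code,home_team_code,away_team_code,home_score,away_score
1001,3,9,31,17
1002,3,11,24,21
1003,9,11,10,13
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE game (
           game_code INTEGER PRIMARY KEY,
           home_team_code INTEGER,
           away_team_code INTEGER,
           home_score INTEGER,
           away_score INTEGER
       )"""
)

rows = [
    (
        int(r["game_code"]),
        int(r["home_team_code"]),
        int(r["away_team_code"]),
        int(r["home_score"]),
        int(r["away_score"]),
    )
    for r in csv.DictReader(io.StringIO(GAME_CSV))
]
conn.executemany("INSERT INTO game VALUES (?, ?, ?, ?, ?)", rows)

# Average home margin per home team, computed by the database.
query = """
    SELECT home_team_code, AVG(home_score - away_score) AS avg_margin
    FROM game
    GROUP BY home_team_code
    ORDER BY home_team_code
"""
margins = conn.execute(query).fetchall()
print(margins)
```

The CSV files stay the interchange format here; the database is just a convenient place to slice them once loaded.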


jmblackmer

Oct 23, 2012, 12:39:51 PM
to study-hal...@googlegroups.com
Ugh. I'm still getting used to Google Groups. This will be the 3rd time I've typed this, so I'll be brief this time. Attached are my initial copies of the parsed Covers data in CSV format. They will change as I do more testing and work them into my library.

I see a lot of .Net/C#/VB, Java, SQL, Python, R, and Mozenda, and several statisticians/mathematicians. It looks like my .Net library will be useful for some, so I'll try to get that posted this week.

I second getting a SQL database available for use in addition to the CSVs. Does anyone use SQL to analyze the data or does everyone just use it for storage? My original ranking system was in SQL, but I moved it into C#. 

Attachments: covers-game.csv, covers-team.csv

jmblackmer

Oct 23, 2012, 12:54:35 PM
to study-hal...@googlegroups.com
I'd also like to suggest that we try to keep data sets in separate files/tables even if they are adding data for the same item. For example, covers-game.csv contains data for games from covers.com. Not everyone will need that data, so it makes sense to keep it in a separate file/table from game.csv and people can just load the data that they need.

If someone wants to load additional data from ESPN, then I think it should be espn-game.csv.
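The one-file-per-source convention can be sketched in Python roughly as follows. This is only an illustration: the helper names, the column names, and the sample values are hypothetical, and the demo writes throwaway files to a temporary directory rather than touching real data.

```python
import csv
import tempfile
from pathlib import Path


def dataset_path(data_dir, item, source=None):
    """Base data is '<item>.csv'; source-specific data is '<source>-<item>.csv'."""
    name = f"{source}-{item}.csv" if source else f"{item}.csv"
    return Path(data_dir) / name


def load_item(data_dir, item, sources=()):
    """Load the base file plus only the requested source files, keyed by source."""
    loaded = {}
    for source in (None, *sources):
        path = dataset_path(data_dir, item, source)
        with path.open(newline="") as handle:
            loaded[source or "base"] = list(csv.DictReader(handle))
    return loaded


# Demo with throwaway files; the columns are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    dataset_path(tmp, "game").write_text("Code,Date\n1001,2012-10-20\n")
    dataset_path(tmp, "game", "covers").write_text("Code,Line\n1001,-7.5\n")
    dataset_path(tmp, "game", "espn").write_text("Code,Attendance\n1001,101821\n")

    data = load_item(tmp, "game", sources=["covers"])
    print(sorted(data))
    print(data["covers"][0]["Line"])
```

Here espn-game.csv exists on disk but is never read, which is the point of keeping sources in separate files.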

The files should always start with the Item Code as the first column. I have the Covers numbers listed as Id because that's what I'm used to from a DB perspective, but I can (and probably should) change that to Code.

I'm also thinking that any items that are referenced should have their own files. covers-team.csv only exists to map the code used by the NCAA to the one used by Covers. I then plan on using the Covers Team Id/Code instead of Team Code in covers-game.csv. That way, if we want to add data to covers-team.csv later, we don't have to change any CSV layouts. I suppose we could always use the Codes in the files, but if we do that, then I think we should renumber them and have an ncaa-team.csv file that links to the NCAA code used to reference them.
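A minimal Python sketch of how a mapping file could be used to translate source-specific team codes back to the shared ones. The column names (`Code`, `NcaaTeamCode`, `HomeTeamCode`, `AwayTeamCode`, `Line`) and every code value below are hypothetical; the attached files may be laid out differently.

```python
import csv
import io

# Hypothetical layouts and codes; the attached files may use different columns.
COVERS_TEAM = """Code,NcaaTeamCode
c8,8
c9,37
"""

COVERS_GAME = """Code,HomeTeamCode,AwayTeamCode,Line
g1,c8,c9,-3.5
g2,c9,c7,6.0
"""

# Covers team code -> NCAA team code.
covers_to_ncaa = {
    row["Code"]: row["NcaaTeamCode"]
    for row in csv.DictReader(io.StringIO(COVERS_TEAM))
}

translated = []
unmapped = set()
for game in csv.DictReader(io.StringIO(COVERS_GAME)):
    home = covers_to_ncaa.get(game["HomeTeamCode"])
    away = covers_to_ncaa.get(game["AwayTeamCode"])
    if home is None or away is None:
        # Record missing mappings so that skipped games are visible.
        unmapped.update(
            code
            for code in (game["HomeTeamCode"], game["AwayTeamCode"])
            if code not in covers_to_ncaa
        )
        continue
    translated.append((game["Code"], home, away, float(game["Line"])))

print(translated)
print(sorted(unmapped))
```

Since the game file only carries the Covers code, adding columns to the team mapping later would not change the game file's layout.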

Any thoughts on these? Any other ideas for conventions to use?

David Fobare

Oct 23, 2012, 1:22:35 PM
to study-hal...@googlegroups.com


We need mappings for all of the potential scraping sources: NCAA, ESPN, Covers, etc. I do this now in my own work, but it's a royal pain to maintain. It gets messier still when some pages you want to scrape only deal in names, not team numbers. I maintain 2 sets of college team names for ESPN.
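One possible shape for such a mapping, as a minimal Python sketch: an alias table keyed by (source, name as scraped) that resolves to a single shared team number. The numbers follow Matthew's list earlier in the thread (52 = Miami (Florida), 53 = Miami (Ohio)); the site spellings are invented examples, not the sites' actual names.

```python
# Hypothetical alias table: (source, name as it appears on that site) -> shared number.
# The site spellings below are invented for illustration.
ALIASES = {
    ("espn", "miami (fl)"): 52,
    ("espn", "miami"): 52,
    ("espn", "miami (oh)"): 53,
    ("covers", "miami florida"): 52,
    ("covers", "miami ohio"): 53,
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def team_number(source, name):
    """Resolve a scraped name to the shared number, failing loudly if unknown."""
    key = (source, normalize(name))
    if key not in ALIASES:
        raise KeyError(f"no mapping for {name!r} from {source}")
    return ALIASES[key]


print(team_number("espn", "Miami  (FL)"))
print(team_number("covers", "Miami Ohio"))
```

Allowing several aliases per source covers the case of one site using two different spellings for the same team, and raising on an unknown name flags new spellings as soon as a scraper meets them.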



DSMok1

Oct 23, 2012, 1:34:49 PM
to study-hal...@googlegroups.com
I am a structural engineer.  I am a Microsoft Excel whiz (where most of my research has been done) and know some R as well, though I rarely use it.  I use Tableau Public for visualizations, which I highly recommend (quick example: http://public.tableausoftware.com/views/ASPMvsRAPM2/ASPMvs_RAPM )

I'm pretty well-versed in statistics theory as well, mostly learned at THE BOOK blog in baseball.

Dave Hudson (@okc_dave)

Oct 23, 2012, 5:20:35 PM
to study-hal...@googlegroups.com
I'm a financial analyst mostly working on fixed income (bond) portfolios. I probably should have learned programming along the way but I never did. I am the resident Excel expert among my colleagues.

john.c....@gmail.com

Oct 23, 2012, 9:31:52 PM
to study-hal...@googlegroups.com
I studied finance in college. Through my current job, I've gotten pretty good with Excel and Access. I also did well in my statistics courses in school, and have considered taking some sort of class to keep up. I'm available to help in any way I can (data collection, organization, scripts, etc.).

