Programmer? Statistician?

jmblackmer

Oct 23, 2012, 11:38:19 AM

to study-hal...@googlegroups.com

Hey all,

I'm just trying to get a feel for everyone's area of expertise, to help organize our efforts. Are you a programmer? A statistician? Someone just dabbling in both? What is your language or program of choice (specifically for working with the data)? What data beyond the provided CSV files is useful, and where do you get it?

I'm a programmer and I generally work in either C# or PHP. I've been putting together a C# library to parse the CSV files, which I plan to make available for others once I finish it. I've also extended it into a program that scrapes historical data from Covers.com and stores it in a CSV. I'm trying to gauge whether the library would be used before I worry about hosting it online. I'm also trying to figure out which other languages or programs people use most often with this data, so that we can create libraries or scripts for those as well, and whether there are data sets from specific sites that we should write scrapers for and compile into CSVs to pair with the base data set.

Ed Feng

Oct 23, 2012, 11:43:11 AM

to jmblackmer, study-hal...@googlegroups.com

Parsing covers.com into a CSV file would be incredible.

I'm a Python person, although math is my strong point.

I think it would be best to develop some data formats. Brett has a nice one for play-by-play logs. With a format in place, someone can go scrape a new site in their own language. Someone else can write a separate script that cross-checks the data files from different sources.
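
A minimal sketch of such a cross-check in Python, assuming every source has been exported to the same CSV layout. The column names (game_code, home_score, away_score) and the sample rows are made up for illustration; no format has been agreed on yet.

```python
import csv
import io


def load_games(handle):
    """Read a CSV in the (hypothetical) shared format into {game_code: (home, away)}."""
    reader = csv.DictReader(handle)
    return {
        row["game_code"]: (int(row["home_score"]), int(row["away_score"]))
        for row in reader
    }


def cross_check(source_a, source_b):
    """Return codes whose scores disagree, and codes present in only one source."""
    mismatched = sorted(
        code for code in source_a.keys() & source_b.keys()
        if source_a[code] != source_b[code]
    )
    missing = sorted(source_a.keys() ^ source_b.keys())
    return mismatched, missing


# Made-up rows standing in for files scraped from two different sites.
site_a = io.StringIO("game_code,home_score,away_score\nG1,24,17\nG2,31,10\nG3,7,3\n")
site_b = io.StringIO("game_code,home_score,away_score\nG1,24,17\nG2,30,10\n")

mismatched, missing = cross_check(load_games(site_a), load_games(site_b))
print(mismatched)  # games where the two sources disagree
print(missing)     # games that appear in only one source
```

Because both loaders produce the same dictionary shape, the scrapers themselves could be written in any language as long as they emit the shared columns.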

Thanks for starting this thread.

Ed

--
The Power Rank. Will your team win?


Travis Fossett

Oct 23, 2012, 11:44:02 AM

to jmblackmer, study-hal...@googlegroups.com

I'm a developer that primarily uses Microsoft products, C# with MSSQL back-ends. I've written a screen-scraper to pull box scores, play-by-play, and drive charts from ESPN, though their play-by-play can miss a play or repeat a drive at times.

Chris Treadaway

Oct 23, 2012, 11:48:36 AM

to Travis Fossett, jmblackmer, study-hal...@googlegroups.com

I'm a statistician armed with a megacomputer and licenses of all sorts of fun tools. ;)

My bias is to want as much raw, elemental data as possible -- but also give me a way to request "rollups".

Chris

Huck L Berry

Oct 23, 2012, 11:52:08 AM

to Chris Treadaway, Travis Fossett, jmblackmer, study-hal...@googlegroups.com

I'm an engineer. I dabble and tinker until it works. Got my first break into programming when Peter Wolfe sent me an Excel file with a macro that runs the late David Rothman's ratings.

I'll leave the programming to the experts.

David Fobare

Oct 23, 2012, 11:58:19 AM

to study-hal...@googlegroups.com

Delphi guy here. I've been parsing Covers for a long time. I've also written code for Yahoo, NFL.com, ESPN, DonBest and others.

Chris Treadaway

Oct 23, 2012, 12:02:54 PM

to David Fobare, study-hal...@googlegroups.com

FYI, on the parsing front, there is a great tool called Mozenda for those of us who are not programmers. With a little manipulation, it grabs pseudo-structured data pretty reliably.

Chris

Jonathan Hodges

Oct 23, 2012, 12:05:57 PM

to Chris Treadaway, David Fobare, study-hal...@googlegroups.com

Glad to see so many programmers here, as I don't have much experience there. Most of my background is in stats (I'm a quality engineer in my other life), but I'd be willing to contribute to data collection/organization. I'll have more time to contribute once the season ends in January.

--
Jonathan Hodges
Contributor, HailToPurple
Web: http://www.hailtopurple.com/jhodges/
Twitter: @hailtopurple
Facebook: https://www.facebook.com/hailtopurple
Email: j-ho...@alumni.northwestern.edu

Bill Connelly

Oct 23, 2012, 12:07:33 PM

to study-hal...@googlegroups.com

Seriously ... I may be asking for favors at some point ... :-)

jake...@gmail.com

Oct 23, 2012, 12:10:52 PM

to Bill Connelly, study-hal...@googlegroups.com

I'm a mathematician/statistician with access to fun stuff as well. I primarily do my work in Microsoft/VBA, with some SQL for parsing data.

If we're all going to be sharing data and files, a standard format would be nice for all the programmers, I imagine.


Matthew Smith

Oct 23, 2012, 12:15:25 PM

to study-hal...@googlegroups.com, Bill Connelly, jake...@gmail.com

I think that makes a lot of sense. My background is also more stats-oriented than programming-oriented. I do all of my football modelling work in Excel, so I can parse plenty of different table formats, but it's nice to have standard notation. As an FYI, my naming/numbering convention for 1-A teams (which I try to use when possible) is:

1	Air Force
2	Akron
3	Alabama
4	Arizona
5	Arizona State
6	Arkansas
7	Arkansas State
8	Army
9	Auburn
10	Ball State
11	Baylor
12	Boise State
13	Boston College
14	Bowling Green State
15	Buffalo
16	Brigham Young
17	California
18	Central Florida
19	Central Michigan
20	Cincinnati
21	Clemson
22	Colorado
23	Colorado State
24	Connecticut
25	Duke
26	East Carolina
27	Eastern Michigan
28	Florida
29	Florida State
30	Fresno State
31	Georgia
32	Georgia Tech
33	Hawaii
34	Houston
35	Idaho
36	Illinois
37	Indiana
38	Iowa
39	Iowa State
40	Kansas
41	Kansas State
42	Kent
43	Kentucky
44	Louisiana-Lafayette
45	Louisiana-Monroe
46	Louisiana Tech
47	Louisville
48	Louisiana State
49	Marshall
50	Maryland
51	Memphis
52	Miami (Florida)
53	Miami (Ohio)
54	Michigan
55	Michigan State
56	Middle Tennessee State
57	Minnesota
58	Mississippi
59	Mississippi State
60	Missouri
61	Navy
62	North Carolina State
63	Nebraska
64	Nevada
65	New Mexico
66	New Mexico State
67	North Texas
68	Northern Illinois
69	Northwestern
70	Notre Dame
71	Ohio
72	Ohio State
73	Oklahoma
74	Oklahoma State
75	Oregon
76	Oregon State
77	Penn State
78	Pittsburgh
79	Purdue
80	Rice
81	Rutgers
82	San Diego State
83	San Jose State
84	Southern Methodist
85	South Carolina
86	South Florida
87	Southern Mississippi
88	Stanford
89	Syracuse
90	Texas Christian
91	Temple
92	Tennessee
93	Texas
94	Texas A&M
95	Texas Tech
96	Toledo
97	Troy State
98	Tulane
99	Tulsa
100	Alabama-Birmingham
101	UCLA
102	North Carolina
103	Nevada-Las Vegas
104	Southern California
105	Utah
106	Utah State
107	Texas-El Paso
108	Vanderbilt
109	Virginia
110	Virginia Tech
111	Wake Forest
112	Washington
113	Washington State
114	West Virginia
115	Western Michigan
116	Wisconsin
117	Wyoming
118	Florida Atlantic
119	Florida International
120	Western Kentucky
121	UMass
122	South Alabama
123	Texas State
124	UTSA

Basically alphabetical (except UNC/UNLV use the U), and then I tacked on the 1-A newcomers over the last decade or so onto the end.
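
For anyone wanting to use this numbering in a script, here is a rough Python sketch that turns the list into lookup tables. It assumes the list has been saved as tab-separated text (number, tab, name); the function name is a placeholder and only a handful of rows are included here.

```python
import csv
import io

# A few rows copied from the list above; the full list has 124 entries.
SAMPLE = (
    "1\tAir Force\n"
    "2\tAkron\n"
    "3\tAlabama\n"
    "102\tNorth Carolina\n"
    "103\tNevada-Las Vegas\n"
)


def load_team_codes(handle):
    """Parse 'number<TAB>name' rows into code->name and name->code dictionaries."""
    code_to_name = {}
    for number, name in csv.reader(handle, delimiter="\t"):
        code_to_name[int(number)] = name.strip()
    name_to_code = {name: code for code, name in code_to_name.items()}
    return code_to_name, name_to_code


code_to_name, name_to_code = load_team_codes(io.StringIO(SAMPLE))
print(code_to_name[3])
print(name_to_code["North Carolina"])
```

Looking teams up by number rather than by position in the list keeps the tacked-on newcomers (118 and up) from mattering to downstream code.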

jake...@gmail.com

Oct 23, 2012, 12:16:17 PM

to study-hal...@googlegroups.com

Also, I'd like to second that Mozenda is a good program for parsing. I use R for my analyses. Does anyone else out there use R or SAS for stat analysis?

Jake T.


Matt Mills

Oct 23, 2012, 12:16:29 PM

to study-hal...@googlegroups.com

I'm studying Industrial Engineering and have taken two Python courses, so nothing, haha. I'll definitely be trying to learn from all of you, though.

-Matt

Marty

Oct 23, 2012, 12:17:36 PM

to study-hal...@googlegroups.com

I do everything in Java, and use SQL extensively as all of my data is in a MySQL database. Like I said in my intro, I'm not statistically inclined at all.

David Fobare

Oct 23, 2012, 12:18:26 PM

to study-hal...@googlegroups.com

My retirement fund wishes I used R. It also wishes I had professional Python skills.

Kalon Jelen

Oct 23, 2012, 12:23:24 PM

to David Fobare, study-hal...@googlegroups.com

I'm a programmer that works with C#, Java and SQL. No serious stats experience other than what I used at Amazon.

One thought comes to mind: it might be more useful to either have this data in a database or have the data in a cube of some kind. That way the data can be fairly easily manipulated while still being standardized.
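
A minimal sketch of the database idea in Python, using the standard-library sqlite3 module with an in-memory database. The table layout, column names, and rows are all made up for illustration; they are not the layout of the group's CSV files.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per team per game, with points scored.
conn.execute(
    "CREATE TABLE game (game_code INTEGER PRIMARY KEY, team_code INTEGER, points INTEGER)"
)
conn.executemany(
    "INSERT INTO game VALUES (?, ?, ?)",
    [(1, 3, 42), (2, 3, 35), (3, 28, 17)],
)

# The raw rows stay available, while a GROUP BY query serves as a simple rollup.
rollup = conn.execute(
    "SELECT team_code, COUNT(*), AVG(points) "
    "FROM game GROUP BY team_code ORDER BY team_code"
).fetchall()
print(rollup)
conn.close()
```

The same raw table answers both kinds of request: row-level queries for people who want elemental data, and aggregate queries for people who want summaries.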

jmblackmer

Oct 23, 2012, 12:39:51 PM

to study-hal...@googlegroups.com

Ugh. I'm still getting used to google groups. This will be the 3rd time I've typed this so I'll be brief this time. Attached are my initial copies of the covers parsed data in CSV format. They will change as I do more testing and work them into my library.

I see a lot of .NET/C#/VB, Java, SQL, Python, R, and Mozenda, and several statisticians/mathematicians. It looks like my .NET library will be useful for some, so I'll try to get that posted this week.

I second getting a SQL database available for use in addition to the CSVs. Does anyone use SQL to analyze the data or does everyone just use it for storage? My original ranking system was in SQL, but I moved it into C#.


covers-game.csv

covers-team.csv

jmblackmer

Oct 23, 2012, 12:54:35 PM

to study-hal...@googlegroups.com

I'd also like to suggest that we try to keep data sets in separate files/tables even if they are adding data for the same item. For example, covers-game.csv contains data for games from covers.com. Not everyone will need that data, so it makes sense to keep it in a separate file/table from game.csv and people can just load the data that they need.

If someone wants to load additional data from ESPN, then I think it should be espn-game.csv.

The files should always start with the Item Code as the first column. I have the Covers numbers listed as Id because that's what I'm used to from a DB perspective, but I can (and probably should) change that to Code.

I'm also thinking that any items that are referenced should have their own files. covers-team.csv only exists to map the code used by the NCAA to that of Covers. I then plan on using the Covers Team Id/Code instead of Team Code in covers-game.csv. That way, if we want to add data to covers-team.csv later, we don't have to change any CSV layouts. I suppose we could always use the Codes in the files, but if we do that, then I think we should renumber them and have an ncaa-team.csv file that links to the NCAA code used to reference them.
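
A rough Python sketch of how a consumer might load these files under the conventions above: each file is indexed by its first column, and a source-specific file is attached only when it is needed. The column names and rows below are placeholders, not the real layouts of game.csv or covers-game.csv.

```python
import csv
import io


def load_by_code(handle):
    """Index a CSV by its first column, which the convention reserves for the item code."""
    reader = csv.reader(handle)
    header = next(reader)
    return {row[0]: dict(zip(header[1:], row[1:])) for row in reader}


def attach(base, extra, prefix):
    """Copy columns from an optional source file onto the matching base rows."""
    merged = {}
    for code, fields in base.items():
        combined = dict(fields)
        for column, value in extra.get(code, {}).items():
            combined[prefix + column] = value
        merged[code] = combined
    return merged


# Made-up stand-ins for game.csv and covers-game.csv.
game_csv = io.StringIO(
    "Game Code,Date,Home Team Code\n1001,2012-10-20,3\n1002,2012-10-20,28\n"
)
covers_game_csv = io.StringIO("Game Code,Line,Total\n1001,-21.5,48\n")

games = attach(load_by_code(game_csv), load_by_code(covers_game_csv), "covers_")
print(games["1001"])
print(games["1002"])
```

Prefixing the attached columns with the source name means an espn-game.csv could be layered on the same way without column names colliding.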

Any thoughts on these? Any other ideas for conventions to use?

David Fobare

Oct 23, 2012, 1:22:35 PM

to study-hal...@googlegroups.com

We need mappings for all of the potential scraping sources: NCAA, ESPN, Covers, etc. I do this now in my own work, but it's a royal pain to maintain. It gets messier still when some pages you want to scrape deal only in names, not team numbers. I maintain 2 sets of college team names for ESPN.
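
One way such a mapping could be kept, sketched in Python: a single table with one row per (source, key) pair, so a source can carry several spellings for the same team. The file layout and the spellings are invented for illustration; the team codes 52 and 53 borrow the Miami (Florida) and Miami (Ohio) numbers from the list posted earlier in the thread.

```python
import csv
import io

# Hypothetical mapping file: several keys per source may point at one team code.
MAPPING_CSV = """source,source_key,team_code
espn,Miami (FL),52
espn,Miami,52
covers,Miami-Florida,52
espn,Miami (OH),53
"""


def build_lookup(handle):
    """Return {(source, normalized key): team_code} from the mapping rows."""
    lookup = {}
    for row in csv.DictReader(handle):
        key = (row["source"], row["source_key"].strip().lower())
        lookup[key] = int(row["team_code"])
    return lookup


def resolve(lookup, source, raw_key):
    """Translate a scraped name or id to the shared team code, or None if unmapped."""
    return lookup.get((source, raw_key.strip().lower()))


lookup = build_lookup(io.StringIO(MAPPING_CSV))
print(resolve(lookup, "espn", "miami (fl)"))
print(resolve(lookup, "covers", "Unknown U"))
```

Returning None for unmapped names, rather than guessing, makes it easy to log every new spelling a scraper runs into and add it to the table by hand.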


DSMok1

Oct 23, 2012, 1:34:49 PM

to study-hal...@googlegroups.com

I am a structural engineer. I am a Microsoft Excel whiz (where most of my research has been done) and know some R as well, though I rarely use it. I use Tableau Public for visualizations, which I highly recommend (quick example: http://public.tableausoftware.com/views/ASPMvsRAPM2/ASPMvs_RAPM )

I'm pretty well-versed in statistics theory as well, mostly learned at THE BOOK blog in baseball.

Dave Hudson (@okc_dave)

Oct 23, 2012, 5:20:35 PM

to study-hal...@googlegroups.com

john.c....@gmail.com

Oct 23, 2012, 9:31:52 PM

to study-hal...@googlegroups.com

I studied finance in college. Through my current job, I've gotten pretty good with Excel and Access. I also did well in my statistics courses in school, and have considered taking some sort of class to keep up. I'm available to help in any way I can (data collection, organization, scripts, etc).
