The NCAA collects player statistics for many different sports, but getting at that data is not always easy, since there is no open API. Recently, I wanted the season-long statistics for all NCAA baseball players in Divisions I through III so the data could be used for other purposes, so I wrote a screen scraper to get it. It’s a very simple script that crawls the site, scrapes the information, and outputs it as CSV (comma-separated values), which can then be imported into an Excel spreadsheet or a database. Highlights:
- Can scrape one school’s worth of baseball player data, if you know the NCAA’s numeric identifier for the school.
- Can scrape men’s baseball player data for every NCAA school for which data is available. (Not all schools have data listed.)
- It’s written in PHP.
- It’s written for men’s baseball, but it could be made more generic by changing how it parses the HTML pages and generates the CSV.
- Adapting it to pull another sport’s data would probably take under an hour, likely closer to half an hour.
- It does not include the NCAA’s player IDs in the output, but that would be easy to add.
- Because the NCAA.org site requires an established Java session, each page fetch actually takes two requests: the first retrieves a jsessionid, and the second requests the data with that jsessionid attached. This means the script makes twice as many requests as should be needed. (This could certainly be improved, for example by reusing one jsessionid across requests, but since the script is not intended to be run many times, it is probably sufficient.)
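
To make the two-request pattern and the CSV output from the list above concrete, here is a minimal sketch. It is written in Python rather than PHP and is not the script's actual code. The function names, the `example.invalid` URLs, and the way the jsessionid is found (as `jsessionid=...` somewhere in the first response) and attached (as a `;jsessionid=` path parameter) are all assumptions for illustration; the real site may hand out the session differently, for example as a cookie.

```python
import csv
import io
import re

# Assumption: the session id appears as "jsessionid=<token>" in the first response.
JSESSIONID_RE = re.compile(r"jsessionid=([A-Za-z0-9.\-_!]+)", re.IGNORECASE)


def extract_jsessionid(text):
    """Return the first jsessionid found in text, or None if there is none."""
    match = JSESSIONID_RE.search(text)
    return match.group(1) if match else None


def attach_jsessionid(url, session_id):
    """Add ;jsessionid=... after the path and before any query string."""
    path, sep, query = url.partition("?")
    return f"{path};jsessionid={session_id}{sep}{query}"


def fetch_with_session(fetch, entry_url, data_url):
    """Two requests per page: one to obtain a session id, one for the data.

    `fetch` is any callable that takes a URL and returns the response body
    as a string (hypothetical; injected so the sketch runs without a network).
    """
    session_id = extract_jsessionid(fetch(entry_url))
    if session_id is None:
        raise RuntimeError("no jsessionid found in the first response")
    return fetch(attach_jsessionid(data_url, session_id))


def rows_to_csv(header, rows):
    """Serialize a header and a list of rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

In a real run, `fetch` would wrap an HTTP client call, and the rows would come from parsing the returned HTML. Calling `fetch_with_session` once per page reproduces the doubled request count described above; fetching the jsessionid once and passing it to `attach_jsessionid` for every later URL would be one way to avoid that.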
Considering I wrote it in four to five hours, in a language I haven’t written serious code in for several years, and only needed it for a one-time scrape, I think it came out OK. I don’t expect to need this script again. The code is available here: