Screen Scraping NCAA.org Player Statistics

The NCAA collects player statistics for many different sports, but getting at that data isn’t always easy. It’s not like they have an open API. Recently, I wanted the season-long statistics for all NCAA baseball players, divisions 1-3, so that the data could be used for other purposes. I wrote a screen scraper to do it. It’s a very simple script that crawls the site, scrapes the information, and outputs it as CSV (comma-separated values) so it can be imported into an Excel spreadsheet or a database. Highlights:

  • Can scrape one school’s worth of baseball player data, if you know the NCAA’s numeric identifier for the school.
  • Can scrape all men’s baseball player data for every NCAA school for which data is available. (Not all schools have data listed.)

Implementation details:

  • It’s written in PHP.
  • It’s written for men’s baseball, but it could be made more generic by changing how the HTML pages are parsed and how the CSV is generated.
  • It could be changed to pull another sport’s data in under an hour (probably closer to half an hour).
  • It does not include the NCAA’s player IDs in the output, but that would be easy to add.
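
To give a feel for the parse-then-write-CSV step, here is a minimal sketch. It is in Python rather than the script's PHP, and it is not the code from the repository. The sample table, its column names, and the helper names (`TableRowParser`, `table_html_to_csv`) are made up for illustration; the real NCAA pages will have different markup.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so each cell is a clean value.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_html_to_csv(html):
    """Turn the table rows found in an HTML string into CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical markup, not copied from the NCAA site.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>AB</th><th>H</th></tr>
  <tr><td><a href="#">Smith, John</a></td><td>120</td><td>41</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_html_to_csv(SAMPLE_HTML), end="")
```

Under these assumptions, switching sports would mostly mean changing which table is read and which columns are written, since the CSV writer handles quoting of values such as names that contain commas.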

My challenges:

  • I hadn’t written anything serious in PHP in a while, since my day job consists mostly of Java, JavaScript, C#, and VB.Net. I chose PHP only because I suspected that the person who was going to use the script would have an easier time with it than with another language.
  • Because the NCAA.org site requires an established Java session, each page fetch actually takes 2 requests. The first retrieves a jsessionid, and the second requests the data with the jsessionid attached. This means the script makes twice as many requests as should be needed. (This certainly could be improved, but considering the script is not intended to be run many times, it is probably sufficient.)
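
The two-request pattern can be sketched roughly as follows. Again this is Python, not the PHP from the repository, and several things are assumptions: the function names are hypothetical, and I am guessing that the id can be read from the first response's final URL or its `Set-Cookie` header. The only part I would call standard is the placement of `;jsessionid=...` after the path and before the query string, which is how Java servlet containers rewrite URLs.

```python
import re
import urllib.request

# A session id is matched as a run of characters up to a common delimiter.
_JSESSIONID = re.compile(r"jsessionid=([^;?&#\s\"']+)", re.IGNORECASE)


def extract_jsessionid(text):
    """Return the first jsessionid found in a URL or header string, else None."""
    match = _JSESSIONID.search(text)
    return match.group(1) if match else None


def attach_jsessionid(url, session_id):
    """Insert ;jsessionid=... after the path and before any query string."""
    path, sep, query = url.partition("?")
    return f"{path};jsessionid={session_id}{sep}{query}"


def fetch_with_session(url):
    """Two-request fetch: the first obtains a session, the second gets the data."""
    with urllib.request.urlopen(url) as first:
        # Assumption: the id shows up in the final URL or the cookie header.
        haystack = first.url + " " + (first.headers.get("Set-Cookie") or "")
    session_id = extract_jsessionid(haystack)
    if session_id is None:
        raise RuntimeError("no jsessionid in the first response")
    with urllib.request.urlopen(attach_jsessionid(url, session_id)) as second:
        return second.read().decode("utf-8", errors="replace")
```

For example, `attach_jsessionid("http://example.com/stats?org=1", "XYZ789")` gives `http://example.com/stats;jsessionid=XYZ789?org=1` (the URL and the `org` parameter are placeholders). If the site accepts the same session across pages, fetching the id once and reusing it would remove most of the doubled requests, though I have not verified that it does.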

Considering I wrote it in 4-5 hours, in a language I hadn’t written serious code in for several years, and considering I only needed it for a one-time scrape, I think it came out OK. I don’t expect to need this script again. The code is available here:

https://github.com/squdgy/ncaa_stats_scraper

