April | 2011 | Matthew Ericson

Making information graphics these days often requires scraping data from web sites, and Ruby is currently my go-to language for most scraping tasks. Building a web scraper usually involves a lot of trial and error, and I don’t want to pound the same site with HTTP requests again and again as I tweak and debug code.

So I wrap HTTP requests in a tiny class that saves each response to the file system. If I request the same URL again under the same key, it loads the cached data instead of making another HTTP request:

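require 'net/http'
require 'uri'

# Fetches URLs over HTTP and caches each response body on disk under a caller-supplied key.
class HTTPCacher
  def initialize( base_dir )
    @base_dir = base_dir
  end

  # Returns the body for url, reading it from the cache if a file for key already exists.
  def get( url, key )
    cached_path = File.join( @base_dir, key )
    if File.exist?( cached_path )
      puts "Getting file #{key} from cache"
      return File.read( cached_path )
    else
      puts "Getting file #{key} from URL #{url}"
      resp = Net::HTTP.get_response( URI.parse( url ) )
      data = resp.body

      # Write the body verbatim; puts would append a newline the response may not have had.
      File.open( cached_path, 'w' ) do |f|
        f.write data
      end

      return data
    end
  end
end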

Usage is pretty simple. Create a new HTTPCacher object, pointing it at a directory that already exists:

getter = HTTPCacher.new( '/path/to/data/dir/here' )

and then make a get request, passing two parameters: the URL, and the key that you want to cache the response under. Any further request with that cache key will load the file straight from the file system.

data = getter.get( 'http://otter.topsy.com/search.json?q=ipad&window=auto', 'ipad.html' )

Note that keeping your keys unique across URLs is entirely up to you. The cache is looked up by key alone, so if you request two different URLs with the same key, the second request will return the data cached for the first URL.
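One way to sidestep that problem is to derive the key from the URL itself. The following is only a sketch, not part of the class above: cache_key_for is a hypothetical helper name, and hashing the URL with SHA-256 is my assumption about a reasonable way to get keys that are, in practice, unique per URL.

```ruby
require 'digest'
require 'uri'

# Hypothetical helper (not part of HTTPCacher): build a cache key from the
# URL so that different URLs get different keys.
def cache_key_for( url )
  # Keep the extension from the URL path (e.g. ".json") so cached files stay
  # recognizable; File.extname returns an empty string when there is none.
  ext = File.extname( URI.parse( url ).path )
  Digest::SHA256.hexdigest( url ) + ext
end
```

With the getter object from the usage example, the call would become getter.get( url, cache_key_for( url ) ). The trade-off is that the file names are no longer human-readable, which makes it harder to delete a single cached response by hand.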
