I wanted to see whether a very simple web scraper that does nothing but count words could still surface the main topic of a web page. I wrote a small Ruby script using Mechanize, and it suggests that you can still glean the topic just by counting words. It’s something of a toy, but it’s fun to see which words show up. For example, on CNN.com today the top words include: hostage, jobs, Obama, and crisis.

require 'mechanize'

agent = Mechanize.new
page = agent.get("http://www.cnn.com")

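# Tally each word; Hash.new(0) makes missing keys start at zero
counts = Hash.new(0)
words = page.search('body').text.split
words.each do |word|
  # Skip short words (3 characters or fewer) to filter out most filler
  counts[word] += 1 if word.length > 3
end
# Print words and counts, most frequent first
puts counts.sort_by(&:last).reverse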

It’s also available as a gist on GitHub.
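
If you want to play with the counting idea without fetching a page, here is a rough sketch of the same logic as a standalone method. The name top_words and its min_length and limit parameters are my own inventions for illustration, not part of the script above. It also lowercases the text and strips punctuation, which the original script does not do, so "Crisis" and "crisis." end up in the same bucket.

```ruby
# Hypothetical helper, not part of the original script: counts words in a
# plain string so the idea can be tried without fetching a page.
def top_words(text, min_length: 4, limit: 10)
  counts = Hash.new(0)
  # Lowercase first so "Crisis" and "crisis" are counted together, then
  # pull out runs of letters (apostrophes allowed) to drop punctuation.
  text.downcase.scan(/[a-z']+/).each do |word|
    counts[word] += 1 if word.length >= min_length
  end
  # Highest counts first; ties are broken alphabetically for stable output.
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

sample = "Jobs report: jobs grew as the crisis eased. " \
         "Crisis talks continue; jobs remain the focus."
top_words(sample, limit: 3).each { |word, count| puts "#{word}: #{count}" }
```

To run it against a real page, you could pass in the same body text the script above extracts, for example top_words(page.search('body').text).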