I wanted to see whether a very simple web scraper that just counts words could still surface the main topic of a web page. I wrote a little script in Ruby with Mechanize, and it suggests that you can still glean the topic just by counting words. It’s something of a toy, but it’s fun to see which words show up. For example, on CNN.com today the top words include hostage, jobs, Obama, and crisis.

    require 'mechanize'

    agent = Mechanize.new
    page = agent.get("http://www.cnn.com")

    # Count every word in the page body that is longer than three characters.
    hash = Hash.new(0)
    words = page.search('body').text.split(' ')
    words.each do |word|
      hash[word] += 1 if word.length > 3
    end

    # Print the words and their counts, most frequent first.
    puts hash.sort_by(&:last).reverse
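The same counting idea can be tried without hitting the network. Here is a minimal sketch that works on a plain string. The `top_words` helper, its `min_length` and `limit` parameters, and the sample text are all made up for illustration and are not part of the original script. Unlike the script, it also lowercases words and drops punctuation, so that "Jobs" and "jobs!" count as the same word.

```ruby
# Hypothetical helper (not in the original script): count words in a string.
def top_words(text, min_length: 4, limit: 5)
  counts = Hash.new(0)
  # Lowercase and keep only runs of letters and apostrophes, so that
  # "Crisis," and "crisis" are counted as the same word.
  text.downcase.scan(/[a-z']+/).each do |word|
    counts[word] += 1 if word.length >= min_length
  end
  # Highest count first; ties are broken alphabetically for stable output.
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

# Made-up sample text standing in for a page body.
sample = "Jobs crisis deepens as jobs report lands; crisis talks continue. Jobs!"
top_words(sample, limit: 2).each { |word, count| puts "#{word}: #{count}" }
```

With Mechanize, you could presumably pass `page.search('body').text` in place of `sample`. The four-character minimum matches the `word.length > 3` check in the script and acts as a crude filter for short filler words like "the" and "and".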

It’s also available as a gist on GitHub.