Most Used Words in File
July 2, 2008 9:56 pm HTML, PHP
I recently saw a question posted in a forum about obtaining the most-used words in a file for the purpose of populating the content of a "keywords" meta tag. It got me thinking about how I would do it. I knew I could use the str_word_count() function to count words, but running that against any typical [X]HTML file would give a lot of junk you would not want in your keyword list.
One obvious step was to use the strip_tags() function so that only the actual content is examined, not tag and attribute names. A bit more involved was how to avoid common words. Then it occurred to me that MySQL's full-text search mechanism uses a list of "stopwords" to ignore in its matching process. Luckily enough, I found the default list of stopwords in their online manual, so I copied that into a text file, ran a few search/replace operations on it to get it into one word per line, did a little massaging, and then I had a stopword list ready to go:
a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently definitely described despite did didn't different do does doesn't doing don't done down downwards during each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except far few fifth first five followed following follows for former formerly forth four from further furthermore get gets getting given gives go goes going gone got gotten greetings had hadn't happens hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn't it it'd it'll it's its itself just keep keeps kept know knows known last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere obviously of off often oh ok okay old on once one 
ones only onto or other others otherwise ought our ours ourselves out outside over overall own particular particularly per perhaps placed please plus possible presumably probably provides que quite qv rather rd re really reasonably regarding regardless regards relatively respectively right said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two un under unfortunately unless unlikely until unto up upon us use used useful uses using usually value various very via viz vs want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would would wouldn't yes yet you you'd you'll you're you've your yours yourself yourselves zero
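As a small illustration of the strip_tags() step described above, here is a minimal sketch comparing what str_word_count() returns with and without the tags removed. The markup in $html is made up for the example and does not come from any real page.

```php
<?php
// Illustrative markup only; not taken from any real page.
$html = '<div class="intro"><a href="about.html">About PHP</a></div>';

// Without strip_tags(), tag and attribute names are counted as words.
$raw = str_word_count(strtolower($html), 1);

// With strip_tags(), only the visible text is left to count.
$clean = str_word_count(strtolower(strip_tags($html)), 1);

print_r($raw);   // includes "div", "class", "href", ...
print_r($clean); // only the words from the link text
```

Note that strip_tags() removes tags without inserting whitespace, so words in adjacent elements can run together if the markup has no space or newline between them.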
Now all I needed to do was load that file into an array, read the target file and get all its words into another array, then let the array_diff() function do the work of removing the stopwords from it. All that remained then was a little sorting and slicing of the array to get the desired number of most-used words:
<?php
/**
 * Get most-used words from a file.
 */
class Keywords
{
    /**
     * Words that are not to be included in keyword list
     * @var array
     */
    protected $stopwords = array();

    /**
     * Import file of stopwords
     * File should be one word per line, all lower-case
     * @param string $stopwordFile path to file
     * @return bool
     */
    public function setStopwords($stopwordFile)
    {
        if(($stopwords = @file($stopwordFile)) === false)
        {
            user_error("Unable to read stopword file '$stopwordFile'");
            return false;
        }
        // trim line endings; array_map() is used because create_function() was removed in PHP 8
        $stopwords = array_map('trim', $stopwords);
        $this->stopwords = array_filter($stopwords); // drop empty lines
        return true;
    }

    /**
     * get array of most used words in file
     * @param string $file file to get keywords from
     * @param int $numWords number of keywords to get
     * @param bool $stripTags whether to remove [X]HTML tags before counting
     * @return array|false false if the file could not be read
     */
    public function getKeywords($file, $numWords, $stripTags=true)
    {
        $text = @file_get_contents($file);
        if($text === false) // strict check, so an empty file is not treated as a read error
        {
            user_error("Unable to read file '$file'");
            return false;
        }
        $text = strtolower($text);
        if($stripTags)
        {
            $text = strip_tags($text);
        }
        $words = str_word_count($text, 1); // 1 = array of all words
        $words = array_diff($words, $this->stopwords); // remove stopwords
        $words = array_count_values($words); // word => number of occurrences
        arsort($words); // most frequent first
        return array_slice(array_keys($words), 0, (int)$numWords);
    }
}
Here's a sample usage...
// sample usage:
$test = new Keywords();
$test->setStopwords('stopwords.txt');
$result = $test->getKeywords('http://www.charles-reace.com/', 7);
echo implode(",", $result);
...and the output:
site,web,php,interactive,email,charles,mysql
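If the text is already in memory (say, an article body loaded from a database), the same steps can be applied to a string directly. The following is only a sketch under that assumption: keywordsFromString() is a hypothetical helper, not part of the Keywords class, and the sample text and short stopword list are invented for the example.

```php
<?php
/**
 * Hypothetical helper (not part of the Keywords class):
 * return the most-used non-stopword words in a string.
 */
function keywordsFromString($text, array $stopwords, $numWords, $stripTags = true)
{
    $text = strtolower($text);
    if ($stripTags) {
        $text = strip_tags($text);
    }
    $words  = str_word_count($text, 1);        // list of every word
    $words  = array_diff($words, $stopwords);  // drop the stopwords
    $counts = array_count_values($words);      // word => occurrences
    arsort($counts);                           // highest count first
    return array_slice(array_keys($counts), 0, (int) $numWords);
}

// Made-up sample text and a deliberately tiny stopword list.
$html = '<p>PHP is fun. PHP and MySQL work well together; '
      . 'MySQL stores the data and PHP renders it.</p>';
$stopwords = array('is', 'and', 'the', 'it', 'well', 'together');

echo implode(",", keywordsFromString($html, $stopwords, 2)); // php,mysql
```

Here "php" appears three times and "mysql" twice, so they are the top two. The order of words with equal counts depends on how arsort() handles ties, so it should not be relied on.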
Personally, I would not use this technique to directly populate a meta keyword tag as I think that should require some thought and selectivity, but it could be useful as a way to examine the keyword density within a page and help you make editorial/SEO decisions.
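For the keyword-density idea, a rough sketch might look like the following. It assumes "density" simply means a word's occurrences divided by the total word count of the page, expressed as a percentage; that definition, the keywordDensity() helper name, and the sample markup are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: percentage of all words in $text taken up by each
 * non-stopword word, highest first. Density is assumed here to be
 * occurrences / total word count * 100.
 */
function keywordDensity($text, array $stopwords = array())
{
    $words = str_word_count(strtolower(strip_tags($text)), 1);
    $total = count($words); // stopwords still count toward the total
    if ($total === 0) {
        return array();
    }
    $counts = array_count_values(array_diff($words, $stopwords));
    arsort($counts);
    $density = array();
    foreach ($counts as $word => $count) {
        $density[$word] = round($count / $total * 100, 1);
    }
    return $density;
}

// The space between the two elements matters: strip_tags() removes tags
// without inserting whitespace, so "tips" and "PHP" would otherwise fuse.
$density = keywordDensity(
    '<h1>PHP tips</h1> <p>PHP is great for web work.</p>',
    array('is', 'for')
);
print_r($density);
```

In this sample there are eight words in total, so "php" (two occurrences) comes out at 25 percent and every other remaining word at 12.5 percent.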

Znupi :
Date: July 16, 2008 @ 14:53
Pretty nice :-) one thing though: shouldn’t the Keywords::setStopwords function return true at the end?
Also, I think it would be better if it worked on a string and not a file, so you could take the body of an article from MySQL and get its keywords, for example. Anyway, that’s a really small and easy change to make :-)
cwreace :
Date: July 16, 2008 @ 17:09
Oops, yeah that should be “true” instead of “false”. I’m off to fix it now.