Most Used Words in File
July 2, 2008
I recently saw a question posted in a forum about obtaining the most-used words in a file in order to populate the content of a "keywords" meta tag. It got me thinking about how I would do it. I knew I could use the str_word_count() function to count words, but running that against a typical [X]HTML file would return a lot of junk, such as tag and attribute names, that you would not want in your keyword list.
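To make the problem concrete, here is a minimal sketch of what happens when str_word_count() is run directly against markup. The sample $html string is made up for illustration, and this is only a demonstration of the issue, not the approach the rest of the post settles on.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<div class="post"><p>Keywords describe the page content.</p></div>';

// Format 1 makes str_word_count() return an array of the words it finds.
// Lowercasing first keeps "Keywords" and "keywords" from being counted apart.
$words = str_word_count(strtolower($html), 1);

// Tally how often each word occurs, then sort with the most frequent first.
$counts = array_count_values($words);
arsort($counts);

print_r($counts);
```

In this sample the tag names "div" and "p" each occur twice, so they outrank every real word in the text, and the attribute name "class" is counted as a word too.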
One obvious step was to use the strip_tags() function so that only the actual content is examined, not tag and attribute names. Avoiding common words was a bit more involved. Then it occurred to me that MySQL's full-text search mechanism uses a list of "stopwords" that it ignores in its matching process. Luckily, I found the default list of stopwords in the MySQL online manual, so I copied it into a text file, ran a few search/replace operations to get it down to one word per line, did a little massaging, and then I had a stopword list ready to go:
