Most Used Words in File

HTML, PHP

I recently saw a forum question about finding the most-used words in a file in order to populate a "keywords" meta tag. It got me thinking about how I would do it. I knew I could use the str_word_count() function to count words, but running that against a typical [X]HTML file would return a lot of junk you would not want in your keyword list.

One obvious step was to use the strip_tags() function so that only the actual content is examined, not tag and attribute names. Avoiding common words was a bit more involved. Then it occurred to me that MySQL's full-text search mechanism uses a list of "stopwords" that it ignores in its matching process. Luckily, I found the default list of stopwords in the MySQL online manual, so I copied it into a text file, ran a few search/replace operations on it to get it down to one word per line, did a little massaging, and then I had a stopword list ready to go.
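
Here is a minimal sketch of how those pieces could fit together. It is only an illustration under a few assumptions: the function name mostUsedWords(), the stopwords.txt filename, and the short inline stopword array are all hypothetical, and the real list would be the one-word-per-line file described above. It strips tags, splits the text into words with str_word_count(), drops the stopwords, and tallies what is left with array_count_values().

```php
<?php
// Hypothetical helper: returns up to $limit of the most frequent
// non-stopwords in an HTML string, as word => count.
function mostUsedWords(string $html, array $stopwords, int $limit = 10): array
{
    // Pad each "<" with a space so words in adjacent elements do not
    // run together once the tags are stripped.
    $text = strip_tags(str_replace('<', ' <', $html));

    // Decode entities such as &amp; so their names are not counted as words.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Format 1 makes str_word_count() return the words as an array.
    $words = str_word_count(strtolower($text), 1);

    // Flip the stopword list so membership becomes a key lookup.
    $stop = array_flip(array_map('strtolower', $stopwords));
    $words = array_filter($words, function ($word) use ($stop) {
        return !isset($stop[$word]);
    });

    // Tally the remaining words and sort by count, highest first.
    $counts = array_count_values($words);
    arsort($counts);

    return array_slice($counts, 0, $limit, true);
}

// With a one-word-per-line file (filename is an assumption), the list
// could be loaded like this:
// $stopwords = file('stopwords.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$stopwords = ['the', 'and', 'a'];

$sampleHtml = '<html><body><p>The cat and the hat.</p><p>A cat sat; the cat ran.</p></body></html>';
$top = mostUsedWords($sampleHtml, $stopwords, 2);
// "cat" comes first with a count of 3; "the", "and" and "a" are excluded.
```

The keys of the returned array could then be joined with implode(', ', array_keys($top)) to build the content of the meta tag. Note that strip_tags() removes the tags but keeps the text between them, so the contents of script and style elements would still be counted unless they are removed beforehand.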
