Filtering MS Word Text

HTML, PHP No Comments

A common annoyance when dealing with user-supplied content is the way MS Word uses some non-standard character encodings (at least non-standard in terms of the web). Among others, these include the directional (a.k.a. "smart") quotes. The problem occurs when you output text that contains those characters as a result of a user copying and pasting text from a Word document. Typically they are not interpreted by the browser and the font being used, resulting in the dreaded place-holder characters (question marks, boxes, etc.).

If you serve your PHP pages with the UTF-8 character encoding, the following PHP function can help deal with this. It is inspired by this comment on Chris Shiflett's blog. It simply uses the str_replace() function to replace some known problem characters with character entities that should display more reliably in UTF-8 output.

<?php
function filterText($text)
{
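   // Characters to look for. The chr() values are single-byte codes, so the
   // input is assumed to be Windows-1252 or Mac Roman text, not UTF-8.
   $search = array (
      '&',
      '<',
      '>',
      '"',
      chr(212), // Mac Roman: left single quote
      chr(213), // Mac Roman: right single quote
      chr(210), // Mac Roman: left double quote
      chr(211), // Mac Roman: right double quote
      chr(209), // Mac Roman: em dash
      chr(208), // Mac Roman: en dash
      chr(201), // Mac Roman: ellipsis
      chr(145), // Windows-1252: left single quote
      chr(146), // Windows-1252: right single quote
      chr(147), // Windows-1252: left double quote
      chr(148), // Windows-1252: right double quote
      chr(151), // Windows-1252: em dash
      chr(150), // Windows-1252: en dash
      chr(133)  // Windows-1252: ellipsis
   );
   // Replacements, in the same order as $search.
   $replace = array (
      '&amp;',
      '&lt;',
      '&gt;',
      '&quot;',
      '&#8216;',
      '&#8217;',
      '&#8220;',
      '&#8221;',
      '&#8212;', // em dash
      '&#8211;', // en dash
      '&#8230;',
      '&#8216;',
      '&#8217;',
      '&#8220;',
      '&#8221;',
      '&#8212;', // em dash
      '&#8211;', // en dash
      '&#8230;'
   );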
   return str_replace($search, $replace, $text);
}

// USAGE ($text holds the user-supplied content):
header('Content-Type: text/html; charset=UTF-8');
echo filterText($text);
?>
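As an alternative sketch, and not the function above: if the pasted text is known to arrive as Windows-1252 (an assumption; content from a Mac or from a UTF-8 form will differ), PHP can re-encode the whole string to UTF-8 and then escape it, rather than mapping individual bytes. The helper name cp1252ToUtf8Html() is hypothetical, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper, not part of the original function.
// Assumes $text is Windows-1252 encoded and mbstring is available.
function cp1252ToUtf8Html(string $text): string
{
    // Re-encode the single-byte input as UTF-8.
    $utf8 = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');

    // Escape &, <, > and " so the result is safe to place in HTML.
    return htmlspecialchars($utf8, ENT_COMPAT, 'UTF-8');
}

// chr(147) and chr(148) are the Windows-1252 curly double quotes.
// Outputs: “Tom &amp; Jerry”
echo cp1252ToUtf8Html(chr(147) . 'Tom & Jerry' . chr(148));
```

With this approach the curly quotes are sent as real UTF-8 characters instead of numeric entities, so it only makes sense when the page really is served as UTF-8.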

Tabbed Output with Tidy

HTML, PHP 3 Comments

In response to this thread at WebDeveloper.com, I came up with the idea of using PHP's Tidy functions to format the HTML output from a script. The basic idea was to capture all the output by using ob_start() to buffer it and then ob_get_clean() to save it to a variable. Then just run it through the tidy_repair_string() function with a couple of configuration settings to indent it.


<?php
ob_start(); // buffer everything that is output below
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>test</title>
</head>
<body>
<h1>Test</h1>
<ul>
<li>This is a test.</li>
<li>It is only a test.</li>
</ul>
</body>
</html>
<?php
$text = ob_get_clean(); // grab the buffered HTML and stop buffering
$config = array(
    'indent' => true,
    'indent-spaces' => 4
);
$text = tidy_repair_string($text, $config);
echo $text;

Read the rest...
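A variation on the same idea, offered only as a sketch: ob_start() also accepts a callback, so the Tidy step can be attached once at the top of a script and applied automatically when the buffer is flushed. The function name tidyIndent() is hypothetical, and the fallback to unformatted output when the Tidy extension is missing is my own assumption about what would be reasonable.

```php
<?php
// Hypothetical output-buffer callback: indents the page with Tidy if possible.
function tidyIndent(string $html): string
{
    // Without the Tidy extension, pass the output through untouched.
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }

    $config = array(
        'indent'        => true,
        'indent-spaces' => 4,
    );
    $result = tidy_repair_string($html, $config, 'utf8');

    // tidy_repair_string() returns false on failure; keep the original then.
    return $result === false ? $html : $result;
}

// Everything echoed from here on goes through tidyIndent()
// when the buffer is flushed at the end of the script.
ob_start('tidyIndent');

echo '<html><head><title>test</title></head><body>';
echo '<ul><li>This is a test.</li><li>It is only a test.</li></ul>';
echo '</body></html>';
```

The trade-off is that the whole page is held in memory until the script ends, which is the same cost as the ob_get_clean() version.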

Most Used Words in File

HTML, PHP 2 Comments

I recently saw a question posted in a forum about obtaining the most-used words in a file for the purpose of populating the content of a "keywords" meta tag. It got me thinking about how I would do it. I knew I could use the str_word_count() function to count words, but running that against any typical [X]HTML file would give a lot of junk you would not want in your keyword list.

One obvious step was to use the strip_tags() function as a means to only look at actual content and not grab tag and property names. A bit more involved was how to avoid common words. Then it occurred to me that MySQL's full-text search mechanism utilizes a list of "stopwords" to ignore in its matching process. Luckily enough, I found the default list of stopwords in their online manual, so I copied that into a text file, ran a few search/replace operations on it to get it into one word per line, did a little massaging, and then I had a stopword list ready to go:

Read the rest...
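The rest of that post is not reproduced here, so the following is only a rough sketch of the approach described above and not the original code. It strips the tags, splits the text into words with str_word_count(), drops stopwords, and ranks what is left by frequency. The function name topKeywords(), the three-character minimum, and the tiny inline stopword list are all hypothetical; a real list would be loaded from the one-word-per-line text file.

```php
<?php
// Hypothetical helper: returns the most frequent non-stopwords in an HTML string.
function topKeywords(string $html, array $stopwords, int $limit = 10): array
{
    // Put a space before each tag so words in adjacent elements do not merge,
    // then remove the tags so element and attribute names are not counted.
    $text = strtolower(strip_tags(str_replace('<', ' <', $html)));

    // Format 1 makes str_word_count() return the words as an array.
    $words = str_word_count($text, 1);

    // Discard stopwords and very short words.
    $words = array_filter($words, function ($word) use ($stopwords) {
        return strlen($word) > 2 && !in_array($word, $stopwords, true);
    });

    // Count occurrences and sort with the highest count first.
    $counts = array_count_values($words);
    arsort($counts);

    return array_slice(array_keys($counts), 0, $limit);
}

// Stand-in stopword list for the example only.
$stopwords = array('and', 'the', 'more');
$html = '<h1>PHP tips</h1><p>PHP and the web: more PHP tips.</p>';

// Outputs: php, tips, web
echo implode(', ', topKeywords($html, $stopwords, 3));
```

The resulting string could then be dropped into the content attribute of the "keywords" meta tag.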

Clearing the Floats

CSS, HTML No Comments

No, this has nothing to do with the aftermath of the Macy's Thanksgiving Day Parade. This is about that annoying thing with floated HTML elements within a container element, whereby it seems that said container has no inkling of the fact that it should expand enough vertically to contain any and all such floated elements.

I used to address this via the common Band-Aid of inserting a non-floated element just before the closing tag of the container, and giving that new element the clear: both; CSS property:

<div id='container'>
  <p class='float'>This element has a "float:left;" or "float:right;" style property.</p>
  <div class='clear'></div> <!-- "clear" class has the "clear:both;" style -->
</div>

But lo and behold, I recently discovered a simpler solution that does not require any extraneous, empty HTML mark-up, thanks to this article at quirksmode.com. Now I can get rid of that semantically useless <div class='clear'></div> line in the above HTML, and instead just make sure the style for the "container" div includes:

#container {
  width: 100%;
  overflow: auto;
}