Filtering MS Word Text

5:04 pm HTML, PHP

A common annoyance when dealing with user-supplied content is the way MS Word uses characters whose encodings are non-standard, at least in terms of the web. Among others, these include the directional (a.k.a. "smart") quotes. The problem occurs when a user copies and pastes text from a Word document and you then output the text with those characters still in it. Typically the browser cannot interpret them in the page's character encoding, and the result is the dreaded placeholder characters (question marks, boxes, etc.).
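To make the problem concrete, here is a small sketch of what such pasted text can look like at the byte level. The sample string is made up, it assumes the text arrived in the Windows-1252 code page, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical sample: "hi" wrapped in Windows-1252 smart double quotes.
$pasted = chr(147) . 'hi' . chr(148);

// The raw bytes are 0x93, 'h', 'i', 0x94.
echo bin2hex($pasted), "\n";                     // 93686994

// On their own, 0x93 and 0x94 are not valid UTF-8 sequences,
// so a browser reading the page as UTF-8 shows placeholders.
var_dump(mb_check_encoding($pasted, 'UTF-8'));   // bool(false)

// Read as Windows-1252, the same bytes are the curly quotes U+201C and U+201D.
$utf8 = mb_convert_encoding($pasted, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8), "\n";                       // e2809c6869e2809d
```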

For PHP pages that are output in the UTF-8 character encoding, I came up with the following PHP function to help deal with this. It is inspired by a comment on Chris Shiflett's blog. It simply uses the str_replace() function to replace some known problem characters with numeric character entities, which should work better when outputting UTF-8 content.

<?php
function filterText($text)
{
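   // Order matters: '&' must come first so that the ampersands in the
   // entities added by later replacements are not escaped a second time.
   $search = array (
      '&',
      '<',
      '>',
      '"',
      // Mac OS Roman bytes
      chr(212), // left single quote
      chr(213), // right single quote
      chr(210), // left double quote
      chr(211), // right double quote
      chr(208), // en dash
      chr(209), // em dash
      chr(201), // ellipsis
      // Windows-1252 bytes
      chr(145), // left single quote
      chr(146), // right single quote
      chr(147), // left double quote
      chr(148), // right double quote
      chr(150), // en dash
      chr(151), // em dash
      chr(133)  // ellipsis
   );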
   $replace = array (
      '&amp;',
      '&lt;',
      '&gt;',
      '&quot;',
      '&#8216;',
      '&#8217;',
      '&#8220;',
      '&#8221;',
      '&#8211;',
      '&#8212;',
      '&#8230;',
      '&#8216;',
      '&#8217;',
      '&#8220;',
      '&#8221;',
      '&#8211;',
      '&#8212;',
      '&#8230;'
   );
   return str_replace($search, $replace, $text);
}

// USAGE:
header('Content-Type: text/html; charset=UTF-8');
echo filterText($text); // $text holds the user-supplied string
?>
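
One caveat with a byte-for-byte replacement is that it assumes the input is in a single-byte encoding. If the text is already UTF-8, bytes such as 208 and 209 are the lead bytes of Cyrillic letters, and replacing them would corrupt the text. As an alternative, here is a minimal sketch that converts the input to UTF-8 instead of emitting entities. The helper name wordTextToHtml() is hypothetical, it requires the mbstring extension, and it assumes that any input that is not valid UTF-8 arrived as Windows-1252.

```php
<?php
// Hypothetical helper, not the filterText() function from this post.
function wordTextToHtml($text)
{
    // Assumption: anything that is not valid UTF-8 arrived as Windows-1252.
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
    }

    // Escape &, <, > and both quote characters for HTML output.
    // The curly quotes stay in the string as real UTF-8 characters.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Windows-1252 input: bytes 147 and 148 are the smart double quotes.
echo wordTextToHtml(chr(147) . 'Q&A' . chr(148)), "\n";

// Input that is already UTF-8 is only escaped, not converted again.
echo wordTextToHtml("\xE2\x80\x9CQ&A\xE2\x80\x9D"), "\n";
```

Both calls print the same UTF-8 text, with the curly quotes intact and the ampersand escaped as &amp;. The trade-off is that this relies on guessing the source encoding, so it will misread input that arrived in some other single-byte code page.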