Category: HTML

2008-10-15

Filtering MS Word Text

by Charles - Categories: HTML, PHP

A common annoyance when dealing with user-supplied content is the way MS Word uses characters that fall outside plain ASCII and are stored as platform-specific bytes (non-standard in terms of the web, at least). Among others, these include the directional (a.k.a. “smart”) quotes, dashes, and the ellipsis. The problem occurs when a user copies and pastes text from a Word document and you later output that text under a different character encoding. The browser then cannot interpret those bytes, resulting in the dreaded placeholder characters (question marks, boxes, etc.).

I came up with the following PHP function to help deal with this when your PHP pages are output as UTF-8. It is inspired by this comment on Chris Shiflett’s blog. It simply uses the str_replace() function to replace some known problem characters with numeric character entities, which should work better when outputting UTF-8 content.

<?php
function filterText($text)
{
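   // Raw single bytes to look for. '&' must stay first so that the
   // entities inserted by later replacements are not escaped again.
   // This assumes $text is in a legacy single-byte encoding: in text that
   // is already UTF-8, some of these bytes occur inside multi-byte
   // characters and would be corrupted.
   $search = array (
      '&',
      '<',
      '>',
      '"',
      chr(212), // Mac OS Roman: left single quote
      chr(213), // Mac OS Roman: right single quote
      chr(210), // Mac OS Roman: left double quote
      chr(211), // Mac OS Roman: right double quote
      chr(208), // Mac OS Roman: en dash
      chr(209), // Mac OS Roman: em dash
      chr(201), // Mac OS Roman: ellipsis
      chr(145), // Windows-1252: left single quote
      chr(146), // Windows-1252: right single quote
      chr(147), // Windows-1252: left double quote
      chr(148), // Windows-1252: right double quote
      chr(150), // Windows-1252: en dash
      chr(151), // Windows-1252: em dash
      chr(133)  // Windows-1252: ellipsis
   );
   // Entities in the same order as $search.
   $replace = array (
      '&amp;',
      '&lt;',
      '&gt;',
      '&quot;',
      '&#8216;',
      '&#8217;',
      '&#8220;',
      '&#8221;',
      '&#8211;',
      '&#8212;',
      '&#8230;',
      '&#8216;',
      '&#8217;',
      '&#8220;',
      '&#8221;',
      '&#8211;',
      '&#8212;',
      '&#8230;'
   );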
   return str_replace($search, $replace, $text);
}

// USAGE:
header('Content-Type: text/html; charset="UTF-8"');
echo filterText($test);
?>
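
One caveat with byte-level replacement is that it is only safe on text that is not already UTF-8. The following is a minimal sketch of a guard, not part of the original function: the name filterTextSafely() is hypothetical, and it assumes any input that is not valid UTF-8 came from Windows-1252. It only handles the Word punctuation bytes listed in its map.

```php
<?php
// Hypothetical helper (not from the original post): only treat the input
// as legacy single-byte text when it is NOT already valid UTF-8.
function filterTextSafely(string $text): string
{
    // preg_match() with the /u modifier returns 1 only for valid UTF-8.
    if (preg_match('//u', $text) === 1) {
        // Valid UTF-8: escape markup characters and leave the rest alone.
        return htmlspecialchars($text, ENT_COMPAT, 'UTF-8');
    }

    // Assumption: the input is Windows-1252. These are the punctuation
    // bytes Word commonly produces, mapped to numeric entities.
    $map = array(
        "\x85" => '&#8230;', // ellipsis
        "\x91" => '&#8216;', // left single quote
        "\x92" => '&#8217;', // right single quote
        "\x93" => '&#8220;', // left double quote
        "\x94" => '&#8221;', // right double quote
        "\x96" => '&#8211;', // en dash
        "\x97" => '&#8212;', // em dash
    );

    // Escape &, <, > and " first (byte-wise), then swap in the entities.
    $escaped = htmlspecialchars($text, ENT_COMPAT, 'ISO-8859-1');
    return strtr($escaped, $map);
}
```

With this guard, a string that already contains a UTF-8 em dash (bytes E2 80 94) passes through untouched, whereas a plain byte replacement would turn its final byte, chr(148), into a right-quote entity. Other high bytes in legacy input, such as accented letters, are left as they are here; a full conversion with iconv() or mb_convert_encoding() would be needed to cover them.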

2008-09-05

Tabbed Output with Tidy

by Charles - Categories: HTML, PHP

In response to this thread at WebDeveloper.com, I came up with the idea of using PHP’s Tidy functions to format the HTML output from a script. The basic idea was to capture all the output by using ob_start() to buffer the output and then ob_get_clean() to save it to a variable. Then just run it through the tidy_repair_string() function with a couple configuration settings to indent it.


<?php
ob_start(); // start buffering everything output below
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>test</title>
</head>
<body>
<h1>Test</h1>
<ul>
<li>This is a test.</li>
<li>It is only a test.</li>
</ul>
</body>
</html>
<?php
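// Grab the buffered output and stop buffering.
$text = ob_get_clean();
$config = array(
    'indent'        => true, // indent block-level elements
    'indent-spaces' => 4     // spaces per indent level
);
// Requires PHP's tidy extension.
$text = tidy_repair_string($text, $config);
echo $text;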

But (more…)
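
The rest of this post is cut off here. As a rough sketch of the buffering idea described above, the Tidy step can be wrapped in a small function that falls back to the untouched markup when the tidy extension is not loaded or the repair fails. The name indentHtml() and the 'utf8' encoding argument are assumptions for illustration, not something the original code uses.

```php
<?php
// Hypothetical helper: indent an HTML string with Tidy when it is available.
function indentHtml(string $html, int $spaces = 4): string
{
    // Without the tidy extension, hand the markup back unchanged.
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }

    $config = array(
        'indent'        => true,    // indent block-level elements
        'indent-spaces' => $spaces, // spaces per indent level
    );

    // Assumption: the page is UTF-8. tidy_repair_string() may return false.
    $tidied = tidy_repair_string($html, $config, 'utf8');
    return is_string($tidied) ? $tidied : $html;
}
```

To apply it to a whole page, one option is to register it as an output callback at the top of the script, for example ob_start(function ($buffer) { return indentHtml($buffer); });, so the buffered output is indented when the buffer is flushed.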

2008-07-02

Most Used Words in File

by Charles - Categories: HTML, PHP

I recently saw a question posted in a forum about obtaining the most-used words in a file for the purpose of populating the content of a “keywords” meta tag. It got me thinking about how I would do it. I knew I could use the str_word_count() function to count words, but running that against any typical [X]HTML file would give a lot of junk you would not want in your keyword list.

One obvious step was to use the strip_tags() function as a means to only look at actual content and not grab tag and attribute names. A bit more involved was how to avoid common words. Then it occurred to me that MySQL’s full-text search mechanism utilizes a list of “stopwords” to ignore in its matching process. Luckily enough, I found the default list of stopwords in their online manual, so I copied that into a text file, ran a few search/replace operations on it to get it into one word per line, did a little massaging, and then I had a stopword list ready to go:

(more…)
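
The post is truncated at this point, so the original code is not shown. The following is only a minimal sketch of the approach described above, assuming the stopwords have already been loaded into an array. The function name topKeywords(), the three-letter minimum, and the short stopword list in the usage line are hypothetical.

```php
<?php
// Hypothetical helper, not the author's code: return the most frequent
// non-stopword words in an HTML string as word => count pairs.
function topKeywords(string $html, array $stopwords, int $limit = 10): array
{
    // strip_tags() removes tags without leaving a gap, so pad each tag
    // with a space first to keep adjacent words from running together.
    $text = strtolower(strip_tags(str_replace('<', ' <', $html)));

    // Format 1 makes str_word_count() return the words as an array.
    $words = str_word_count($text, 1);

    // Flip the list so each stopword check is a single isset().
    $stop = array_flip(array_map('strtolower', $stopwords));

    $counts = array();
    foreach ($words as $word) {
        // Skip very short words (an arbitrary cut-off) and stopwords.
        if (strlen($word) < 3 || isset($stop[$word])) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts); // highest counts first
    return array_slice($counts, 0, $limit, true);
}
```

A call such as implode(', ', array_keys(topKeywords($html, file('stopwords.txt', FILE_IGNORE_NEW_LINES)))) would then give a comma-separated string for the meta tag, where stopwords.txt stands in for the one-word-per-line file mentioned above.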

2008-06-30

Clearing the Floats

by Charles - Categories: CSS, HTML

No, this has nothing to do with the aftermath of the Macy’s Thanksgiving Day Parade. This is about that annoying thing with floated HTML elements within a container element, whereby it seems that said container has no inkling of the fact that it should expand enough vertically to contain any and all such floated elements.

I used to address this via the common Band-Aid of inserting a non-floated element just before the closing tag of the container, and giving that new element the clear: both; CSS property:

<div id='container'>
  <p class='float'>This element has a "float:left;" or "float:right;" style property.</p>
  <div class='clear'></div> <!-- "clear" class has the "clear:both;" style -->
</div>

But lo and behold, I recently discovered a simpler solution that does not require any extraneous, empty HTML mark-up, thanks to this article at quirksmode.com. Now I can get rid of that semantically useless <div class='clear'></div> line in the above HTML, and instead just make sure the style for the “container” div includes:

#container {
  width: 100%;
  overflow: auto;
}
© 2012 PHP Musings. All rights reserved.