Replacing HTML entities with their numeric equivalent

Posted: 2011/01/18 in PHP

Problem: The output of PHP’s htmlentities() is seldom valid XML.  While building KML maps for Google Maps, I ran into names that use valid UTF-8 characters that are not very HTML friendly (ex. the ‘n’ with a tilde in daño; not sure if that shows up in your browser or not, but the ‘n’ should have a squiggly line over it). Google Maps states the .kml file is invalid because htmlentities() turns that ñ into ‘&ntilde;’, a named entity that XML expects to have been declared beforehand (“Fatal Error 26: Entity ‘ntilde’ not defined”).
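To make the failure concrete, here is a minimal sketch of the problem, assuming the SimpleXML extension is available (the <name> element is only a stand-in for a KML fragment, not taken from my actual files). It encodes the name with htmlentities(), then shows that an XML parser rejects the named entity but accepts the numeric one.

```php
<?php
// "daño" written with explicit UTF-8 bytes so the file's own encoding does not matter
$name = "da\xC3\xB1o";

// htmlentities() produces the named entity: "da&ntilde;o"
$named = htmlentities($name, ENT_NOQUOTES, "UTF-8");

// Collect parse errors quietly instead of emitting PHP warnings
libxml_use_internal_errors(true);

// XML only predefines amp, lt, gt, quot and apos, so &ntilde; is undeclared here
$bad = simplexml_load_string("<name>" . $named . "</name>");
var_dump($bad === false);

// A numeric character reference needs no declaration and parses fine
$good = simplexml_load_string("<name>da&#241;o</name>");
var_dump((string) $good === $name);
```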

Solution: Create a mapping table that replaces each named HTML entity with its numeric equivalent, computed with the ord() function.

This solution has some ‘issues,’ though.  The solution I have is very U.S.-English centric (cf. the ‘A’ of ASCII), since ord() only looks at a single byte.  It is also prone to becoming obsolete as “today’s” standards change.  On the positive side, this solution is based on PHP’s libraries and not my own invention (which is prone to errors, obscurity, and a lot of work like this guy did; I am grateful to this Matt Robinson, as his post helped me figure some of this out).

However, my ‘solution’ comes mainly from Michael Krenz at this article (dated 2006!!!).  For my use, I needed not only the ‘regular’ entities numerically encoded but also curly quotes (both single and double).  I did not go through the effort of using a preg_() function which, according to the Matt Robinson article, may be a better approach.  Instead, I use the simpler and faster str_replace() function, which fits my current needs just fine (a developer should never state that, right!?!).

Basically, I take one array of HTML entities and another array of numeric equivalents and use the one to replace the other.
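As a tiny illustration of that idea, here is a sketch with two hand-written arrays (the sample entities and the input string are made up for the example; the real code below builds the arrays from PHP’s own table). str_replace() swaps each element of the first array for the element at the same position in the second.

```php
<?php
// Named entities to look for, and their numeric equivalents in the same order
$named   = array('&ntilde;', '&eacute;');
$numeric = array('&#241;',   '&#233;');

// Each occurrence of $named[i] is replaced by $numeric[i]
$result = str_replace($named, $numeric, 'da&ntilde;o, caf&eacute;');

echo $result, "\n"; // da&#241;o, caf&#233;
```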


/*
 * This code is free software; you can use it, redistribute it, and/or modify as you wish.
 * This code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY - implied or otherwise.
 * This code is distributed "as is."  All risk and cost are assumed by the user of the code and not the creator thereof.
 * If you want to give attribution to the original creator of the original code, his name is David Malouf and he is, probably, available at EmailTheDavid@gmail.com
 */

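// Create array where the keys (named entities) are replaced by their values (numeric entities)
// Note: ord() reads a single byte, so this assumes the table holds single-byte
//  (ISO-8859-1) characters, which was the default charset before PHP 5.4
$replace = array();
$search = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES);
foreach ($search as $key=>$value) {
  $replace[$value] = "&#" . ord($key) . ";";
}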
// Add curly quotes (left-single, right-single, left-double, right-double)
//  if these are not turning out correctly (either getting converted or WordPress is being stupid),
//  the keys are (remove space): ampersand l s q u o semicolon (then same pattern r s q u o, ld, and rd)
//   the values are: ampersand pound 8216, 8217, 8220, and 8221, each followed by a semicolon
$replace['&lsquo;'] = '&#8216;';
$replace['&rsquo;'] = '&#8217;';
$replace['&ldquo;'] = '&#8220;';
$replace['&rdquo;'] = '&#8221;';

// hand in an array of strings to be numeric-ized -- maintain keys if there are any
$fixedArray = array();
foreach ($stringsToFix as $key=>$value) {
   // encode the current string (no double-encoding of existing entities)
   $cleanedValue = iconv("","UTF-8",htmlentities($value,ENT_NOQUOTES,"UTF-8",false));
   // swap every named entity for its numeric equivalent
   $fixedValue = str_replace(array_keys($replace), $replace, $cleanedValue);
   $fixedArray[$key] = $fixedValue;
}

// output is $fixedArray where all the values have had their HTML entities converted
//  to their numeric equivalent

Addendum:
On 2011 Jan 19, Matt Robinson (see above) pointed out that, previous to PHP 5.3.4, the table returned by get_html_translation_table() does NOT have enough entries!  Per his suggestion, I ran a count() on the function’s result and got 99 rows.  There are a lot more than 99 HTML entities that need to be accounted for (more like 250, per Matt’s email!!).  That is why Matt wrote the enormous function!  He did point out that PHP 5.3.4 changes get_html_translation_table() to allow for a third argument: charset_hint.  In theory, this should provide a lot more entities to be ‘converted.’  This should also remove the need to add all the $replace[xxxx]=xxxx; lines!!
I told you Matt was a smart guy!!
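
For anyone on a newer PHP, here is a rough sketch of what that could look like. It is not Matt’s function and I have not run it against my KML files; it assumes PHP 7.2+ with the mbstring extension (for mb_ord()), and the function name named_to_numeric_entities() is made up for this example. Passing 'UTF-8' as the third argument returns the fuller table, and mb_ord() replaces ord() so that multi-byte characters get their real code point.

```php
<?php
// Hypothetical helper: encode a UTF-8 string, then swap named entities for numeric ones.
// Assumes PHP 7.2+ with mbstring; not part of the original post's code.
function named_to_numeric_entities($text)
{
    static $map = null;

    if ($map === null) {
        $map = array();
        // Third argument (available since PHP 5.3.4) selects the UTF-8 table
        $table = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'UTF-8');
        foreach ($table as $char => $entity) {
            // mb_ord() returns the full Unicode code point, unlike ord()
            $map[$entity] = '&#' . mb_ord($char, 'UTF-8') . ';';
        }
    }

    // Same encoding step as above: no double-encoding of existing entities
    $encoded = htmlentities($text, ENT_NOQUOTES, 'UTF-8', false);

    // strtr() with an array replaces every key in a single pass
    return strtr($encoded, $map);
}

echo named_to_numeric_entities("da\xC3\xB1o"), "\n"; // da&#241;o
```

Since the UTF-8 table already contains the curly quotes, the extra $replace lines are not needed in this version.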

Comments
  1. David, nice to see an old post of mine is still useful! 🙂 Didn’t expect anyone to dig this up…

  2. Leona says:

    I believe everything published made a bunch of sense.
    However, what about this? what if you were to create
    a awesome headline? I mean, I don’t wish to tell you how to run your website, however what if you added a post title that grabbed folk’s attention?
    I mean Replacing HTML entities with their numeric equivalent | Dissection by David blog is kinda vanilla.
    You ought to peek at Yahoo’s front page and watch how they write article titles to get people to click. You might try adding a video or a picture or two to get readers excited about everything you’ve written.
    In my opinion, it could make your website a little livelier.

    • Leona,

      You are very correct – the titles in this blog are very boring. At the risk of coming across as one who is trying to ‘work’ SEO, I’ll explain my title method.

      Basically, this blog is a problem-solving resource. I title the posts such that they will be easy for me/others to find given the problem they address. I try to title them in such a way that would make it easy for me to find the solution if I type only one or two words in the search-box.

      Based on the analytics of this site, it seems that most people come to this while searching for a solution to the problem the posts address. For example, no small number of visits to this post came from someone(s) searching “html entities replace numeric” (roughly).

      Thus, I try to make the titles easy to find using keywords. Because I need these posts (these posts are for me as much as anyone – I never remember the details of these posts!) and I need to find them easily.

      Hope that gives you (and anyone else reading this reply) the rationale for this blog (and its super-lame titles!)

      David

  3. Susanne says:

    great put up, very informative. I wonder why the other specialists of this sector do not
    notice this. You must proceed your writing. I’m confident, you have a huge readers’ base already!

  4. Valencia says:

    Way cool! Some extremely valid points! I appreciate you writing this post and also the
    rest of the website is also really good.

  5. I love it when individuals get together and share thoughts.
    Great blog, stick with it!
