PHP: Convert Numeric Character Reference to UTF8


Some functions to convert numeric character references (NCRs), such as &#234;, to UTF-8:

Method 01

<?php
function detectUTF8($string)  
{  
        return preg_match('%(?:  
        [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte  
        |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs  
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte  
        |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates  
        |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3  
        |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15  
        |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16  
        )+%xs', $string);  
}  

function encoding($string){  
    if (detectUTF8($string)) {  
        return $string;         
    } else {  
        $string = html_entity_decode($string,ENT_QUOTES,"UTF-8"); 
        return $string; 
    }
}
?>

Example:

<?php
....
echo  encoding("La Fen&#234;tre de Soleil");
?> 

Result:

La Fenêtre de Soleil
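One caveat with encoding(): it returns the input untouched as soon as detectUTF8() finds any multi-byte sequence, so a string that mixes real UTF-8 characters with references keeps its references. Below is a minimal sketch that decodes unconditionally and covers both decimal and hexadecimal references. decodeAlways() is a hypothetical name, not part of the code above, and html_entity_decode() will also decode named entities such as &amp;amp;, which may or may not be what you want:

```php
<?php
// Hypothetical helper (not from the original post): decode every reference,
// even when the input already contains multi-byte UTF-8 characters.
function decodeAlways($string)
{
    return html_entity_decode($string, ENT_QUOTES, 'UTF-8');
}

// "\xC3\xAA" is the UTF-8 byte sequence for ê, written out so the example
// does not depend on the encoding of this source file.
echo decodeAlways("Fen\xC3\xAAtre, f&#234;te, m&#xEA;me"), "\n";
// prints: Fenêtre, fête, même
```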

Alternative methods:

Method 02

function ncr_utf8_2($string)
{
    $_utf8 = create_function('$data',
        'if ($data < 128) return chr($data);if ($data < 2048) return chr(($data >> 6) + 192) . chr(($data & 63) + 128); if ($data < 65536) return chr(($data >> 12) + 224) . chr((($data >> 6) & 63) + 128) . chr(($data & 63) + 128); if ($data < 2097152)return chr(($data >> 18) + 240) . chr((($data >> 12) & 63) + 128) . chr((($data >> 6) & 63) + 128) . chr(($data & 63) + 128); return "";');
    $string = preg_replace('/&#x([0-9a-f]+);/ei', '$_utf8(hexdec("\\1"))', $string);
    $string = preg_replace('/&#([0-9]+);/e', '$_utf8(\\1)', $string);
    if (!isset($tbl)) {
        $tbl = array();
        foreach (get_html_translation_table(HTML_ENTITIES) as $val => $key)
            $tbl[$key] = utf8_encode($val);
    }
    return strtr($string, $tbl);
}

Update 08/23/2018:
If you see this error:

Deprecated: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in ...php on line 8

Use this version instead (the /e modifier was removed in PHP 7.0):

function ncr_utf8_2($string)
{
    // A closure is used instead of create_function(), which was removed in PHP 8.0.
    // It encodes a Unicode code point as a UTF-8 byte sequence.
    $_utf8 = function ($data) {
        if ($data < 128) return chr($data);
        if ($data < 2048) return chr(($data >> 6) + 192) . chr(($data & 63) + 128);
        if ($data < 65536) return chr(($data >> 12) + 224) . chr((($data >> 6) & 63) + 128) . chr(($data & 63) + 128);
        if ($data < 2097152) return chr(($data >> 18) + 240) . chr((($data >> 12) & 63) + 128) . chr((($data >> 6) & 63) + 128) . chr(($data & 63) + 128);
        return "";
    };
    $string = preg_replace_callback('/&#x([0-9a-f]+);/i', function($ms) use($_utf8){return $_utf8(hexdec($ms[1]));}, $string);
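    // The callback must return the replacement; without "return" every decimal reference is replaced by an empty string.
    $string = preg_replace_callback('/&#([0-9]+);/', function($ms) use($_utf8){return $_utf8((int) $ms[1]);}, $string);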
    if (!isset($tbl)) {
        $tbl = array();
        foreach (get_html_translation_table(HTML_ENTITIES) as $val => $key)
            $tbl[$key] = ($val);
    }
    return strtr($string, $tbl);
}
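
The one-line helper inside ncr_utf8_2() is hard to read. The sketch below spells out the same bit arithmetic as a standalone function. codepointToUtf8() is a name chosen here for illustration only, and bin2hex() is used just to make the resulting bytes visible:

```php
<?php
// Illustrative rewrite of the helper used in Method 02 (hypothetical name).
function codepointToUtf8($cp)
{
    $cp = (int) $cp;
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x200000) {        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return chr(0xF0 | ($cp >> 18))
             . chr(0x80 | (($cp >> 12) & 0x3F))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return "";                   // out of range, same fallback as Method 02
}

echo bin2hex(codepointToUtf8(234)), "\n";    // c3aa     (ê)
echo bin2hex(codepointToUtf8(8364)), "\n";   // e282ac   (€)
echo bin2hex(codepointToUtf8(128512)), "\n"; // f09f9880 (U+1F600)
```

Like the original helper, this sketch does not reject surrogates (U+D800 to U+DFFF) or values above U+10FFFF.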

Method 03

function ncr_utf8_3($string)
{
    $_utf8 = create_function('$data',
        'if ($data > 127){ $i = 5; while (($i--) > 0){ if ($data != ($a = $data % ($p = pow(64, $i)))) { $ret = chr(base_convert(str_pad(str_repeat(1, $i + 1), 8, "0"), 2, 10) + (($data - $a) / $p)); for ($i; $i > 0; $i--) $ret .= chr(128 + ((($data % pow(64, $i)) - ($data % ($p = pow(64, $i - 1)))) / $p)); break;}}} else $ret = "&#$data;"; return $ret;');
    return preg_replace("/\\&\\#([0-9]{3,10})\\;/e", '$_utf8("\\1")', $string);
}

Example:

echo ncr_utf8_3("Peut &#234;tre");

Result:

Peut être

Update 08/23/2018:
If you see this error:

Deprecated: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in ...php on line 8

Use this version instead (the /e modifier was removed in PHP 7.0):

function ncr_utf8_3($string)
{
    $_utf8 = function ($data) { // closure: create_function() was removed in PHP 8.0
        if ($data <= 127) return "&#$data;"; // ASCII references are left as-is
        $i = 5;
        while (($i--) > 0) {
            if ($data != ($a = $data % ($p = pow(64, $i)))) { // highest power of 64 that fits
                $ret = chr(base_convert(str_pad(str_repeat(1, $i + 1), 8, "0"), 2, 10) + (($data - $a) / $p)); // lead byte
                for ($i; $i > 0; $i--) $ret .= chr(128 + ((($data % pow(64, $i)) - ($data % ($p = pow(64, $i - 1)))) / $p)); // continuation bytes
                break;
            }
        }
        return $ret;
    };
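    // The callback must return the replacement; without "return" every match is replaced by an empty string.
    return preg_replace_callback("/\\&\\#([0-9]{3,10})\\;/", function($ms) use($_utf8){return $_utf8($ms[1]);}, $string);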
}
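
Method 03 only matches decimal references of three or more digits, and its helper picks the byte count from powers of 64, so code points 2048-4095 and 65536-262143 come out one byte short. As a possible alternative, here is a minimal sketch that matches decimal and hexadecimal references with preg_replace_callback() and hands each match to html_entity_decode(), leaving named entities such as &amp;amp; untouched. decodeNumericRefs() is a hypothetical name, not part of the original code:

```php
<?php
// Hypothetical helper: decode only numeric character references.
function decodeNumericRefs($string)
{
    return preg_replace_callback(
        '/&#(?:x[0-9a-f]+|[0-9]+);/i',
        function ($ms) {
            // $ms[0] is the whole reference, e.g. "&#234;" or "&#xEA;"
            return html_entity_decode($ms[0], ENT_QUOTES, 'UTF-8');
        },
        $string
    );
}

echo decodeNumericRefs("Peut &#234;tre &amp; &#x20AC;"), "\n";
// prints: Peut être &amp; €
```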
