PHP DOMDocument: Error while parsing HTML tags in JavaScript string


The following example shows the error clearly:

<?php
error_reporting(0);
$html = new DOMDocument();
 
$strPage = <<<HTML
<html>
<head>
<title>Demo Error - Tutorialspots.com</title>
<script type="text/javascript">
var strJS = "<b>This is bold.</b><br /><br />This should not be bold. Where did my closing tag go to?";
</script>
</head>
<body>
<script type="text/javascript">
document.write(strJS);
</script>
</body>
</html>
HTML;

$html->loadHTML($strPage);
echo $html->saveHTML();

Online demo: http://demo.tutorialspots.com/html/domdocument.php

Result:

<html><head><title>Demo Error - Tutorialspots.com</title><script type="text/javascript">
var strJS = "<b>This is bold.<br /><br />This should not be bold. Where did my closing tag go to?";
</script></head><body>
<script type="text/javascript">
document.write(strJS);
</script></body><html></html></html>

The correct result should be: http://demo.tutorialspots.com/html/domdocument.html
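The example above calls error_reporting(0), which also hides anything libxml has to say about the script content. As a rough sketch (not part of the original demo), the standard functions libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() can collect those messages instead. The shortened markup below is a made-up sample, and which messages appear, if any, depends on the installed libxml2 version.

```php
<?php
// Collect libxml parser messages instead of silencing them with error_reporting(0).
libxml_use_internal_errors(true);

// Shortened, hypothetical sample page with HTML tags inside a JavaScript string.
$sample = '<html><head><script type="text/javascript">'
        . 'var s = "<b>bold</b> plain";'
        . '</script></head><body></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($sample);

// The list may be empty; its contents depend on the libxml2 version in use.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();

echo $doc->saveHTML();
```

If a message about an unexpected closing tag shows up, it points at the same problem the result above illustrates.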

How to fix this error?

We must replace each script tag with a harmless placeholder before parsing, then restore the original tags afterwards. Here is our solution using preg_replace_callback:

<?php
error_reporting(0);
$html = new DOMDocument();
  
$strPage = <<<HTML
<html>
<head>
<title>Demo Error - Tutorialspots.com</title>
<script type="text/javascript">
var strJS = "<b>This is bold.</b><br /><br />This should not be bold. Where did my closing tag go to?";
</script>
</head>
<body>
<script type="text/javascript">
document.write(strJS);
</script>
</body>
</html>
HTML;
 
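$mm = array(); //original script tags, indexed by placeholder number
$count=0; //number of script tags stored in $mm so far
$md5 = md5($strPage); //marker that is very unlikely to occur in the page itself
 
//empty tags such as <script src="..."></script> are left unchanged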
$strPage = preg_replace_callback('@<script.*>(.*)</script>@Uis', function($matches)use(&$mm,&$count,$md5){  
    if($matches[1]==''){
        return $matches[0];
    }   
    $mm[$count] = $matches[0];     
    return '<script>'.$md5.($count++).$md5.'</script>';
}, $strPage);
 
$html->loadHTML($strPage);
$strPage = $html->saveHTML();
 
$strPage = preg_replace_callback('/<script>'.$md5.'(\d+)'.$md5.'<\/script>/', function($matches)use($mm){     
    return $mm[$matches[1]];
}, $strPage);
 
echo $strPage;

Online demo: http://demo.tutorialspots.com/html/domdocumentfixed.php
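The same steps can be wrapped in a function so they are easy to reuse. This is only a sketch under a few assumptions: the function name roundTripPreservingScripts and the sample page are made up for illustration, and libxml_use_internal_errors() is used in place of error_reporting(0) to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: sends a page through DOMDocument while protecting inline scripts.
function roundTripPreservingScripts(string $page): string
{
    $stored = [];          // original <script>...</script> blocks, indexed by placeholder number
    $marker = md5($page);  // marker that is very unlikely to occur in the page itself

    // Step 1: swap every non-empty script block for a numbered placeholder.
    $page = preg_replace_callback(
        '@<script.*>(.*)</script>@Uis',
        function (array $m) use (&$stored, $marker) {
            if ($m[1] === '') {
                return $m[0]; // e.g. <script src="..."></script>: nothing to protect
            }
            $stored[] = $m[0];
            return '<script>' . $marker . (count($stored) - 1) . $marker . '</script>';
        },
        $page
    );

    // Step 2: let DOMDocument parse and serialize the now harmless markup.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($page);
    $page = $doc->saveHTML();
    libxml_clear_errors();

    // Step 3: put the original script blocks back in place of the placeholders.
    return preg_replace_callback(
        '@<script>' . $marker . '(\d+)' . $marker . '</script>@',
        function (array $m) use ($stored) {
            return $stored[(int) $m[1]];
        },
        $page
    );
}

// Made-up sample page for illustration.
$input = '<html><head><script type="text/javascript">var s = "<b>bold</b> plain";</script></head>'
       . '<body><p>Hello</p></body></html>';
$output = roundTripPreservingScripts($input);
echo $output;
```

Because the whole original tag is stored, attributes such as type="text/javascript" come back unchanged along with the script body.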

Explanation of the regular expression modifiers:
U: ungreedy (each quantifier matches as little as possible)
i: case insensitive (<SCRIPT> will be matched as well)
s: the . (dot) character also matches newlines (a script that spans several lines still matches)
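To see what the three modifiers change, here is a small sketch that runs the same pattern with and without them. The $snippet string is a made-up sample containing one upper-case, multi-line script and two lower-case scripts on a single line.

```php
<?php
// Made-up sample: one upper-case multi-line script, then two scripts on one line.
$snippet = "<SCRIPT>\nvar a = 1;\n</SCRIPT><script>var b = 2;</script><script>var c = 3;</script>";

// No modifiers: case sensitive, greedy, and . stops at newlines.
// The upper-case block is missed and the two lower-case blocks are swallowed as one match.
$plain = preg_match_all('@<script.*>(.*)</script>@', $snippet, $m1);

// With U, i and s: every script block is matched on its own.
$full = preg_match_all('@<script.*>(.*)</script>@Uis', $snippet, $m2);

echo $plain, "\n"; // 1
echo $full, "\n";  // 3
print_r($m2[1]);   // the three script bodies, including the multi-line one
```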

Bonus:
In some cases, an external script tag contains only whitespace:

<script src="app.js">  
</script>

We should replace it with:

<script src="app.js"></script>

Here’s the solution:

$strPage = preg_replace('@(<script.*>)\s+(</script>)@i',"$1$2",$strPage);
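As a quick check of that replacement, the sketch below applies the same pattern to two made-up inputs: a whitespace-only external script, which gets collapsed, and an inline script, which should stay as it is.

```php
<?php
// Made-up inputs for illustration.
$withWhitespace = "<script src=\"app.js\">  \n</script>";
$inline = "<script>\nvar a = 1;\n</script>";

// Same pattern as above; single quotes keep $1 and $2 as literal backreferences.
$pattern = '@(<script.*>)\s+(</script>)@i';

$fixed = preg_replace($pattern, '$1$2', $withWhitespace);
$untouched = preg_replace($pattern, '$1$2', $inline);

echo $fixed, "\n";                // <script src="app.js"></script>
var_dump($untouched === $inline); // bool(true)
```

The inline script is not affected because the whitespace after its opening tag is followed by code rather than by the closing tag.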

