Here is an example that shows the error clearly:
<?php
error_reporting(0);
$html = new DOMDocument();
$strPage = <<<HTML
<html>
<head>
<title>Demo Error - Tutorialspots.com</title>
<script type="text/javascript">
var strJS = "<b>This is bold.</b><br /><br />This should not be bold. Where did my closing tag go to?";
</script>
</head>
<body>
<script type="text/javascript">
document.write(strJS);
</script>
</body>
</html>
HTML;
$html->loadHTML($strPage);
echo $html->saveHTML();
Online demo: http://demo.tutorialspots.com/html/domdocument.php
Result:
<html><head><title>Demo Error - Tutorialspots.com</title><script type="text/javascript"> var strJS = "<b>This is bold.<br /><br />This should not be bold. Where did my closing tag go to?"; </script></head><body> <script type="text/javascript"> document.write(strJS); </script></body><html></html></html>
The correct result should be: http://demo.tutorialspots.com/html/domdocument.html
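Because error_reporting(0) hides the parser warnings, it can help to collect them explicitly while debugging. The following is only a sketch: the sample page is a shortened, hypothetical version of the demo, and which messages appear (if any) depends on the installed libxml2 version.

```php
<?php
// Hypothetical minimal page: a closing tag inside a JavaScript string.
$strPage = '<html><head><script>var s = "<b>bold</b>";</script></head><body></body></html>';

libxml_use_internal_errors(true);   // collect parser messages instead of emitting warnings
$html = new DOMDocument();
$html->loadHTML($strPage);

$errors = libxml_get_errors();      // array of LibXMLError objects
libxml_clear_errors();

foreach ($errors as $error) {
    // Each entry carries the message text and the line where it occurred.
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```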
How to fix this error?
The script tags must be replaced with placeholder content before parsing and restored afterwards. Here is our solution with preg_replace_callback:
<?php
error_reporting(0);
$html = new DOMDocument();
$strPage = <<<HTML
<html>
<head>
<title>Demo Error - Tutorialspots.com</title>
<script type="text/javascript">
var strJS = "<b>This is bold.</b><br /><br />This should not be bold. Where did my closing tag go to?";
</script>
</head>
<body>
<script type="text/javascript">
document.write(strJS);
</script>
</body>
</html>
HTML;

$mm = array();          // store script tags
$count = 0;             // counter of $mm
$md5 = md5($strPage);   // we want unique content

// we don't change <script src="..."></script>
$strPage = preg_replace_callback('@<script.*>(.*)</script>@Uis', function ($matches) use (&$mm, &$count, $md5) {
    if ($matches[1] == '') {
        return $matches[0];
    }
    $mm[$count] = $matches[0];
    return '<script>' . $md5 . ($count++) . $md5 . '</script>';
}, $strPage);

$html->loadHTML($strPage);
$strPage = $html->saveHTML();

$strPage = preg_replace_callback('/<script>' . $md5 . '(\d+)' . $md5 . '<\/script>/', function ($matches) use ($mm) {
    return $mm[$matches[1]];
}, $strPage);

echo $strPage;
Online demo: http://demo.tutorialspots.com/html/domdocumentfixed.php
Explanation of the regular expression modifiers:
U: ungreedy (each quantifier takes the shortest possible match)
i: case insensitive (<SCRIPT> will be matched as well)
s: the dot (.) also matches newlines, so a script that spans several lines still matches
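To see what the three modifiers change, the same pattern can be run with and without them. This is a small sketch; the $sample string is made up for illustration.

```php
<?php
// Hypothetical sample input: an upper-case tag whose body spans two lines,
// followed by a lower-case, single-line tag.
$sample = "<SCRIPT>\nvar a = 1;\n</SCRIPT><script>var b = 2;</script>";

// Without modifiers: case-sensitive, and the dot stops at newlines,
// so only the second tag matches.
preg_match_all('@<script.*>(.*)</script>@', $sample, $plain);

// With U, i and s: both tags match, each as its own shortest match.
preg_match_all('@<script.*>(.*)</script>@Uis', $sample, $full);

echo count($plain[0]), "\n"; // 1
echo count($full[0]), "\n";  // 2
```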
Bonus:
In some cases an external script tag contains only whitespace:
<script src="app.js"> </script>
It should be replaced with:
<script src="app.js"></script>
Here’s the solution:
$strPage = preg_replace('@(<script[^>]*>)\s+(</script>)@i', '$1$2', $strPage);
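A quick check of this cleanup on a made-up string follows. It is a sketch only: the sample markup is hypothetical, and the pattern here uses [^>]* for the attribute part so that the first group cannot run past the opening tag.

```php
<?php
// Hypothetical sample markup: one external script with stray whitespace
// in its body, and one inline script that must stay untouched.
$strPage = "<script src=\"app.js\"> \n </script><script> var a = 1; </script>";

// [^>]* keeps the first group inside the opening tag, so only script
// tags whose body is pure whitespace are collapsed.
$strPage = preg_replace('@(<script[^>]*>)\s+(</script>)@i', '$1$2', $strPage);

echo $strPage;
// <script src="app.js"></script><script> var a = 1; </script>
```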