
Fix sanitizer with libxml2 >= 2.12.0

With libxml2 >= 2.12.0, prepending <?xml encoding="UTF-8"> no longer forces the input to be parsed as UTF-8. Instead, non-ASCII content is treated as ISO-8859-1 and comes out garbled.

For example, <p>中文</p> becomes <p>&auml;&cedil;&shy;&aelig;&#150;&#135;</p> (should be <p>&#20013;&#25991;</p>).
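The garbled output above is consistent with each UTF-8 byte being read as a separate ISO-8859-1 character and then entity-encoded. The following is only an illustrative Python sketch of that byte-level effect, not the sanitizer's code or libxml2's serializer; the function names `misread_as_latin1` and `expected_entities` are hypothetical.

```python
import html.entities


def _entity(cp: int) -> str:
    """Return a named entity if HTML 4 defines one, else a decimal reference."""
    name = html.entities.codepoint2name.get(cp)
    return f"&{name};" if name else f"&#{cp};"


def misread_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes being decoded as ISO-8859-1, then entity-encoded."""
    wrong = text.encode("utf-8").decode("iso-8859-1")
    return "".join(ch if ord(ch) < 128 else _entity(ord(ch)) for ch in wrong)


def expected_entities(text: str) -> str:
    """Entity-encode the real code points, as a correct UTF-8 parse would."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


print(misread_as_latin1("<p>中文</p>"))
print(expected_entities("<p>中文</p>"))
```

The first call reproduces the broken string from the example and the second reproduces the expected one, which supports the reading that the parser is falling back to ISO-8859-1.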

Switching to another trick fixes the issue, and the new trick still works with older libxml2 (tested 2.11.5).

As a side note, DOMDocument::loadHTML uses HTMLParser in libxml2.
