The Developer Day | Staying Curious



Cleaning Invalid UTF-8 characters in PHP

I ran into an ugly issue: I had to discard invalid UTF-8 characters from a string before passing it to json_decode(), which otherwise fails to decode it. I first discovered that it's possible to ignore invalid UTF-8 characters using:

iconv("UTF-8", "UTF-8//IGNORE", $text)

However, it turns out this has been broken for ages, and using //IGNORE produces an E_NOTICE. Luckily, I found a comment that suggests a workaround:

ini_set('mbstring.substitute_character', "none");
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
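The same workaround can be wrapped in a small function so that the substitute character setting does not leak into the rest of the script. This is only a sketch: the name stripInvalidUtf8 is made up for illustration, and it uses mb_substitute_character() instead of ini_set(), which should have the same effect for this purpose.

```php
<?php
// Hypothetical helper; the name is not part of PHP or of the original workaround.
function stripInvalidUtf8(string $text): string
{
    // Remember the current substitute character so it can be restored afterwards.
    $previous = mb_substitute_character();

    // "none" tells mbstring to drop invalid sequences instead of replacing them.
    mb_substitute_character('none');

    try {
        // Converting UTF-8 to UTF-8 forces mbstring to re-validate every byte.
        return mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    } finally {
        // Put the previous setting back, even if something went wrong.
        mb_substitute_character($previous);
    }
}

// The byte 0xFF can never appear in valid UTF-8.
$clean = stripInvalidUtf8("abc\xFFdef");
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

Restoring the previous value matters if other code in the same request relies on the default "?" substitution.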

This, however, was not enough. The text also contained valid but non-printable UTF-8 characters (control characters, for example), and json_decode() was failing on those as well. To work around this I used:

$text = preg_replace('/[^\pL\pN\pP\pS\pZ\pM]/u', '', $text);

This keeps only letters, numbers, punctuation, symbols, separators and combining marks. It will remove newlines as well, which is fine for me. You can also try removing non-printable byte sequences.
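If the newlines need to survive, the character class can be extended. The following is a rough sketch, not a drop-in solution: stripNonPrintable and its $keepNewlines parameter are hypothetical names, and the input is assumed to be valid UTF-8 already, since preg_replace() with the /u modifier returns null on invalid input.

```php
<?php
// Hypothetical helper; the name and the $keepNewlines flag are illustrative only.
function stripNonPrintable(string $text, bool $keepNewlines = false): string
{
    // Same Unicode categories as the pattern above:
    // letters, numbers, punctuation, symbols, separators, marks.
    $allowed = '\pL\pN\pP\pS\pZ\pM';

    if ($keepNewlines) {
        // Optionally keep carriage returns, newlines and tabs.
        $allowed .= '\r\n\t';
    }

    $result = preg_replace('/[^' . $allowed . ']/u', '', $text);

    if ($result === null) {
        // With /u, preg_replace() returns null when the subject is not valid UTF-8.
        throw new InvalidArgumentException('preg_replace() failed; is the input valid UTF-8?');
    }

    return $result;
}

// A raw control character (0x01) inside a JSON string makes json_decode() fail.
$json = "{\"a\":\"x\x01y\"}";
var_dump(json_decode($json, true));
var_dump(json_decode(stripNonPrintable($json), true));
```

Note that tabs and newlines inside a JSON string literal are themselves invalid JSON, so keeping them only helps when they appear between tokens.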
