The Developer Day | Staying Curious

TAG | encoding

Mar/13

7

Cleaning Invalid UTF-8 characters in PHP

I ran into an ugly issue having to discard invalid UTF-8 characters from a string before I pass it to json_decode() as otherwise it fails decoding it. First I’ve discovered that it’s possible to ignore invalid UTF-8 characters using:

iconv(“UTF-8″, “UTF-8//IGNORE”, $text)

However turns out this has been broken for ages and using //IGNORE produces an E_NOTICE. Luckily I found a comment which suggests a workaround:

ini_set(‘mbstring.substitute_character’, “none”);
$text = mb_convert_encoding($text, ‘UTF-8′, ‘UTF-8′);

This however was not enough. Because I was getting some characters that were non printable UTF-8 characters json_decode was failing on them as well. To work around this I’ve used:

$text = preg_replace(‘/[^\pL\pN\pP\pS\pZ\pM]/u’, ”, $text);

This will remove new lines as well which is fine for me. You can also try a removing non-printable byte sequences.

, , , Hide

Feb/10

26

Zend Framework Escaping Entities

Zend Framework has a powerful input filtering component Zend_Filter_Input. It provides an interface to define multiple filters and validators and apply them to a collection of data. By default values returned from Zend_Filter_Input are escaped for safe HTML output. An example of such functionality:

$validators = array(
    'id' => array(
        'digits',
        'allowEmpty' => true,
    ),
    'name' => array(
        'presence' => 'required',
        new Zend_Validate_StringLength(5, 255),
    ),
    'date' => array(
        'presence' => 'required',
        new Zend_Validate_Date('d/m/Y'),
    ),
);
$filters = array(
    '*'  => 'StringTrim',
);
$input = new Zend_Filter_Input($filters, $validators, $data);
if ($input->isValid()) {
    print_r($input->getEscaped());
} else {
    print_r($input->getMessages());
}

The code above validates the $data array by checking that the id key consists only of digits or is empty, that name is present and it’s min length is 5 and max length is 255 and that date is present and it’s format is d/m/Y. The StringTrim validator removes any spaces in front and at the end of all the $data values. If the $data array is valid all escaped values are outputted and if not a generated list of error messages is presented.

Recently I’ve ran into a small problem while using Zend_Input_Filter. According to the best security practices all data coming from the user should be escaped before being outputted to the screen. This also means data coming from a database. This requires to escape every single output statement in every view. In my opinion this adds unwanted verbosity, performance overhead and a possibility to easily miss that something was not escaped in one of the views. That is why if there is a possibility I would prefer to trust the database and not worry about the data incoming from it. It may not be a very good option if data in the database is most often presented in other documents than HTML. Since all values would need to be decoded which defeats the whole purpose. Though this is rarely the case.

By default Zend_Filter_Input escapes values by using the htmlentities function. Which converts all applicable characters to HTML entities. This means that for example Danish language characters Æ Ø Å would be stored as Æ Ø Å Which means that it would take 7 bytes on average to store a single letter that otherwise could be stored using 1 - 3 bytes using UTF-8 encoding support in MySQL. This could also potentially ruin collation sorting.

To overcome this issue a very similar function htmlspecialchars could be used. It escapes only a few certain characters such as > < ” without escaping all international characters. The actual problem with Zend_Filter_Input is that it does not have an escape filter that uses htmlspecialchars. To solve the issue I’ve created a copy of HtmlEntities filter which uses htmlspecialchars function instead.

The filter can then be used like this:

$input = new Zend_Filter_Input($filters, $validators, $data);
$input->setDefaultEscapeFilter(new Company_Product_Filter_HtmlSpecialChars());

Zend_Filter_Input is a great tool to ease form validation and filtering. I would be very interested to hear from you how this problem could be solved.

, , , , , , Hide

Find it!

Theme Design by devolux.org