
This is where I (fkelly) propose that we write our filtering standards for 2.5 .. this is all draft level stuff now


Filtering of data is a fundamental aspect of any web-based system. It affects security, performance and the acceptability of the system to users. It underlies every other facet of the system. In generic *nuke systems (those based on PHPNUKE(tm)) as well as in RavenNuke™ specifically, filtering has traditionally been scattershot or "fractured" -- in other words, not based on any set of design principles or standards. The topic has been discussed ad nauseam in forum threads without arriving at any resolution. For RavenNuke™ 2.5 we aim to change that.

The purpose of this document is to lay out a consistent set of standards that all programmers working on RavenNuke™ software will use. The document will reference specific functions and programs that are contained (or will be contained) in RavenNuke™. It is not intended as a general reference or to be used outside of the RavenNuke™ context.

The Flow of things

Within RavenNuke™ (and content management systems generally) there is a relatively simple flow. Content for the system is stored in a MySQL database. The data in that database gets put in by HTML forms. The forms are presented to a user who fills them in. Optionally, there can be JavaScript client-side validation of the form before it is submitted. Upon submission, the form is processed by another PHP program. The "posted" data should be validated and prepared for the database. Database input or change statements are issued, the database is updated and the user is presented with another form or report.

Within this flow there are several points at which filtering or validation takes place. JavaScript is generally used for client-side validation. RavenNuke™ is moving in the direction of using jQuery-based validation classes as the foundation for JavaScript client-side validation. When the form is submitted, there is a layer of software that lies between the form and its processing program to ensure that the form comes from within the system, thus preventing cross-site request forgery. A PHP program receives the posted form and validates all data. Before any data is submitted to the database, a specific set of MySQL-related "steps" must be carried out to prevent SQL injection type attacks -- basically adding slashes (escape characters) before certain special characters.
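To make the "form comes from within the system" check concrete, here is a minimal sketch of a per-session form token. It is not the layer RavenNuke™ actually uses; the function names example_make_token and example_token_matches are hypothetical, and it assumes PHP 7 or later (for random_bytes).

```php
<?php
// Hypothetical helpers; NOT RavenNuke's actual CSRF layer.

// Create a random token to keep in the session and embed in a hidden form field.
function example_make_token()
{
	return bin2hex(random_bytes(16)); // 32 hexadecimal characters
}

// Compare the token posted with the form against the one kept in the session.
function example_token_matches($sessionToken, $postedToken)
{
	if (!is_string($sessionToken) || !is_string($postedToken) || $sessionToken === '') {
		return false;
	}
	return hash_equals($sessionToken, $postedToken); // constant-time comparison
}
```

The receiving program would reject the submission outright whenever example_token_matches returns false, before looking at any other field.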

Let's start with a Form

Ideally, filtering would be built into the design of a form and would essentially be "declarative". That is, the form designer would specify the type of each input element and/or its validation or edit criteria. If an input element is supposed to be an email address, then a standard email validation routine would be run both on the client side (JavaScript -- jQuery provides a validation routine) and on the server side when the form is submitted. If the element is a checkbox then the only values that can be in it after submission are "on" or the value you have set in its value attribute. In the case of text fields the designer needs to specify whether he wants to allow any html and, if so, which tags and attributes. In the case of textareas the settings for allowable html will determine which html features can be used ... and the receiving program will have to run the posted data through standard validation routines.
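As a rough illustration of what "declarative" could look like on the PHP side, the sketch below describes a form as an array of rules and checks submitted data against it. Nothing like this exists in RavenNuke™ today; $example_spec, example_check_form and the rule names are all made up for the example.

```php
<?php
// Hypothetical field specification: field name => rule.
$example_spec = array(
	'email'     => array('type' => 'email'),
	'subscribe' => array('type' => 'checkbox', 'value' => 'on'),
);

// Return the names of the fields in $data that violate $spec.
function example_check_form(array $spec, array $data)
{
	$errors = array();
	foreach ($spec as $name => $rule) {
		$value = isset($data[$name]) ? $data[$name] : null;
		if ($rule['type'] === 'checkbox') {
			// An unchecked box is simply absent from the submission.
			$ok = ($value === null || $value === $rule['value']);
		} elseif ($rule['type'] === 'email') {
			$ok = is_string($value) && filter_var($value, FILTER_VALIDATE_EMAIL) !== false;
		} else {
			$ok = false; // unknown rule types fail closed
		}
		if (!$ok) {
			$errors[] = $name;
		}
	}
	return $errors;
}
```

Calling example_check_form($example_spec, $_POST) would return an empty array for a clean submission, and a list of offending field names otherwise.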

In the RavenNuke™ context we do not have the libraries, frameworks or capabilities to implement such declarative filtering of forms. In addition, there is a large amount of legacy code which would essentially have to be rewritten to fit into such a framework. So such an approach is not practical in the short run. Instead, what we need to do is look at our forms one by one and refit them to implement the standards specified here. One issue (that I am not sure about) is whether we want to do JavaScript validation of all forms and fields on the client side before the form is ever submitted. Or perhaps set it as a standard that we do so whenever we go in to modify a legacy program? Whatever JavaScript validation we do should be, as in the current RNYA module, based on the jQuery validation library. We should also have a standard for presenting errors to users, including putting messages in a standard location on the form, and for whether we validate field by field as the user moves focus off a field, only upon a submission attempt, or some combination of the two (for fields that have dependencies and need to be validated together).

Upon submission

So, the user fills out the form and hits submit. Even though we have taken steps in RavenNuke™ 2.4 to stop cross-site request forgery, we still cannot trust that the form submitted is one from within our system. We can't be sure, even if we have exhaustive JavaScript validation, that the fields are properly validated. A forged form could contain a script command in a checkbox field, just for instance. What does our receiving PHP program need to do? To eliminate the possibility of PHP warnings and errors, we first need to check if the posted field is set. So we will have syntax like:

if (isset($_POST['field1'])) {
	// do some validation
}

Now I have a question. If we know what form we are receiving the POST data from and we know what fields are on it then should not the absence of one of those fields in the POST data be considered evidence of a security violation ... should we have some kind of way to pass this to Sentinel to ban the user?
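Whatever we decide to do about it, detecting a missing field is cheap. A minimal sketch follows; example_missing_fields is a hypothetical name, and what to do with the result (log it, reject the form, hand it to Sentinel) is deliberately left open.

```php
<?php
// Hypothetical helper: list the expected field names that are absent from the POST data.
function example_missing_fields(array $expected, array $post)
{
	return array_values(array_diff($expected, array_keys($post)));
}
```

One caveat: unchecked checkboxes are never sent by the browser, so they should not be in the list of expected fields, otherwise legitimate users would be flagged.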

But in any event, assuming the field is present, we then want to do as specific a filtering job as we can. In other words, if we know the values a field can have, we should check that it has one of them. If the field is a State field, it should have one of the 50 state values. In fact anything that comes from an option list should have one of those option values. A numeric field that is supposed to be an integer should be an integer. If the values have to be less than, say, 150 then they should be. By taking this approach to fields where we know the possible values we eliminate the need to send them through more extensive filtering libraries such as KSES or HTML Purifier. No?
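For fields with known values, the checks can be very small. The sketch below uses only built-in PHP functions; the helper names are hypothetical and are not part of RavenNuke™.

```php
<?php
// Hypothetical helpers illustrating whitelist-style checks.

// True only if $value is exactly one of the permitted option values.
function example_is_allowed_option($value, array $allowed)
{
	return is_string($value) && in_array($value, $allowed, true);
}

// Return the integer if $value is a whole number within [$min, $max], otherwise null.
function example_int_in_range($value, $min, $max)
{
	$options = array('options' => array('min_range' => $min, 'max_range' => $max));
	$result = filter_var($value, FILTER_VALIDATE_INT, $options);
	return ($result === false) ? null : $result;
}
```

A State field would be checked with example_is_allowed_option against the same list of values used to build the option list, and a "less than 150" field with example_int_in_range($value, 0, 149).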

If we are going to be doing these validations in a standard way throughout RavenNuke™, should we not have a library of common validation logic that could be called? In fact, should we not insist that it be called instead of doing "one-off" coding?

Likewise, there are some fields where there is pre-written logic that can be used to validate. These include email addresses, phone numbers, zip codes, and URLs. Again, we should have a standard approach to these and the logic should be totally consistent with what we use on the JavaScript side.
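PHP's built-in filter extension already covers some of these. The following is only a sketch: the helper names are hypothetical, the ZIP pattern assumes US five-digit or ZIP+4 formats, and restricting URLs to http and https is my assumption rather than an agreed standard.

```php
<?php
// Hypothetical helpers built on PHP's own validation filters.

function example_is_email($value)
{
	return is_string($value) && filter_var($value, FILTER_VALIDATE_EMAIL) !== false;
}

// Accept only absolute http or https URLs.
function example_is_web_url($value)
{
	if (!is_string($value) || filter_var($value, FILTER_VALIDATE_URL) === false) {
		return false;
	}
	$scheme = strtolower((string) parse_url($value, PHP_URL_SCHEME));
	return in_array($scheme, array('http', 'https'), true);
}

// US ZIP code: five digits, optionally followed by a hyphen and four digits.
function example_is_us_zip($value)
{
	return is_string($value) && preg_match('/\A\d{5}(-\d{4})?\z/', $value) === 1;
}
```

Whatever rules we settle on, the JavaScript side would need to accept and reject the same inputs, so the patterns should be kept deliberately simple.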

Finally we come to the more complex case of text fields and textarea input where we want to allow some HTML. In these cases we need to pass the field through a standard library. Currently we do it by passing the field to the check_html function in mainfile. This in turn sees if stripslashes is needed (more on that later) and passes the field to kses.php which in its turn "normalizes" and validates the html in the data and checks for security violations.

Note to self: gotta mention that NukeSentinel™ gets a shot at the form data before our receiving program ever sees it. And that NukeSentinel™ post logic filtering has to go :)


The function check_html from mainfile is so central and critical to our discussion that it gets its own topic. It is also relatively short so I will quote the RavenNuke™ 2.4 version in its entirety:

function check_html ($string, $allowed_html = '', $allowed_protocols = array('http', 'https', 'ftp', 'news', 'nntp', 'gopher', 'mailto'))
{
	$stop = FALSE;
	if(!function_exists('kses_no_null')) {
		// kses is loaded here (the include statement is missing from this copy)
	}
	if (get_magic_quotes_gpc() == 1) {
		$string = stripslashes($string);
	}
	$string = kses_no_null($string);
	$string = kses_js_entities($string);
	$string = kses_normalize_entities($string);
	$string = kses_hook($string);
	if (stripos_clone($allowed_html, 'nocheck') === true) {
		return $string;
	} else {
		if (stripos_clone($allowed_html, 'nohtml') === false) {
			global $AllowableHTML;
			$allowed_html = $AllowableHTML;
		} else {
			$allowed_html = array('<null>');
		}
		$allowed_html_fixed = kses_array_lc($allowed_html);
		return kses_split($string, $allowed_html_fixed, $allowed_protocols);
	}
}

The function check_html takes three parameters. The first, $string, is the string you are filtering. In most cases it will come from the $_POST data of a submitted form. The third parameter is specific to the kses program and sets the allowable protocols that can be passed through the kses filtering program. Generally you should not need to deal with this or change it. The second parameter essentially specifies the level of filtering that you want to apply.

There are three levels of filtering available. The most strict is specified by having 'nohtml' in the $allowed_html parameter. With nohtml in this parameter the $allowed_html array will be set to array('<null>'), which matches no real tag, and passed to the kses_array_lc and kses_split functions. These will strip any html out of the $string. So, for instance, you will not be able to bold text or change the font size, much less apply more advanced html attributes. It should be used for input items such as usernames, real names, and any other items where you, as form designer, do not want html to be entered.

The next level of strictness is specified by having any value other than 'nocheck' or 'nohtml' in the second parameter. By tradition most developers just pass '' (an empty string) in the second parameter ... in other words calling $string = check_html($string, '') but the way the code works anything in there except 'nocheck' and 'nohtml' will have the same effect. The program rnconfig.php contains an $AllowableHTML array, thus allowing the site administrator some flexibility in terms of which html he wants to allow. With any parameter other than 'nocheck' and 'nohtml' the $AllowableHTML array will be loaded into the $allowed_html array and passed to kses, where non-allowed html items will be stripped. The filtered items will be passed back in a revised $string.

And while a full discussion of kses is beyond the scope of this documentation (and also beyond the competence of its author) note that while two functions are called from check_html -- kses_array_lc and kses_split -- the majority of the work within kses is done by a kses_split2 function that is called from kses_split. Which means that if you want to truly understand what is going on in the innards of the filtering, you need to look at those functions within includes/kses.php.

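The 'nocheck' parameter was added to allow an intermediate level of filtering, somewhere between no filtering at all and the full kses treatment; it is the least strict of the three levels. Using it, $string is passed through four kses functions that do a minimum amount of formatting on the input and is then returned without being checked against any list of allowed html. Perhaps the best way to document this is to quote the comments for each of these functions: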

  1. kses_no_null() -- This function removes any NULL characters in $string.
  2. kses_js_entities() -- This function removes the HTML JavaScript entities found in early versions of Netscape 4.
  3. kses_normalize_entities() -- This function normalizes HTML entities. It will convert "AT&T" to the correct "AT&amp;T", "&#00058;" to "&#58;", "&#XYZZY;" to "&amp;#XYZZY;" and so on.
  4. kses_hook() -- doesn't do anything in our current implementation
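To summarize how the second parameter selects among the three levels, here is a small stand-alone sketch of the decision check_html makes. It does no filtering itself; example_filter_level is a hypothetical name, and it uses PHP's built-in stripos where check_html uses stripos_clone.

```php
<?php
// Hypothetical helper mirroring how check_html picks a filtering level.
// It only reports the level; it does not filter anything.
function example_filter_level($allowed_html = '')
{
	// 'nocheck' is tested first, exactly as in check_html.
	if (stripos($allowed_html, 'nocheck') !== false) {
		return 'nocheck';   // the four normalizing kses functions only
	}
	if (stripos($allowed_html, 'nohtml') !== false) {
		return 'nohtml';    // all html stripped
	}
	return 'allowable';     // html limited to the $AllowableHTML array
}
```

So example_filter_level('') and example_filter_level('anything else') both report 'allowable', which is why the traditional empty second parameter behaves the way it does.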