Filtering is one of the cornerstones of web application security. It is the process by which you prove the validity of data. By ensuring that all data is properly filtered on input, you can eliminate the risk that tainted (unfiltered) data is mistakenly trusted or misused in your application. The vast majority of security vulnerabilities in popular PHP applications can be traced to a failure to filter input.
When I refer to filtering input, I am really describing three different steps:
Identifying input
Filtering input
Distinguishing between filtered and tainted data
The first step is to identify input, because if you don’t know what it is, you can’t be sure to filter it. Input is any data that originates from a remote source. For example, anything sent by the client is input, although the client isn’t the only remote source of data; other examples include database servers and RSS feeds.
Data that originates from the client is easy to identify: PHP provides this data in superglobal arrays, such as $_GET and $_POST. Other input can be more difficult to identify. For example, $_SERVER contains many elements that can be manipulated by the client. It’s not always easy to determine which elements in $_SERVER constitute input, so a best practice is to consider this entire array to be input.
What you consider to be input is a matter of opinion in some cases. For example, session data is stored on the server, and you might not consider the session data store to be a remote source. If you take this stance, you can consider the session data store to be an integral part of your application. It is wise to be mindful of the fact that this ties the security of your application to the security of the session data store. This same perspective can be applied to a database because the database can be considered a part of the application as well.
Generally speaking, it is more secure to consider data from session data stores and databases to be input, and this is the approach that I recommend for any critical PHP application.
Once you have identified input, you’re ready to filter it. Filtering is a somewhat formal term that has many synonyms in common parlance: sanitizing, validating, cleaning, and scrubbing. Although some people differentiate slightly between these terms, they all refer to the same process: preventing invalid data from entering your application.
Various approaches are used to filter data, and some are more secure than others. The best approach is to treat filtering as an inspection process. Don’t correct invalid data in order to be accommodating; force your users to play by your rules. History has shown that attempts to correct invalid data often create vulnerabilities. For example, consider the following method intended to prevent file traversal (ascending the directory tree):
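A minimal sketch of such a correction attempt, assuming it consists of a single str_replace() pass that collapses every occurrence of `..` into a single `.` (the helper name collapse_dots() is hypothetical and used only for illustration):

```php
<?php
// Hypothetical helper illustrating a flawed "correction" approach.
function collapse_dots(string $filename): string
{
    // One pass only: each ".." becomes ".", which looks safe but is not.
    return str_replace('..', '.', $filename);
}

// Tainted input is "corrected" instead of being inspected and rejected.
$filename = collapse_dots($_POST['filename'] ?? '');
```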
Can you think of a value of $_POST['filename'] that causes $filename to be ../../etc/passwd? Consider the following:
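One such value is sketched here, under the assumption that the correction replaces each `..` with a single `.` in one str_replace() pass. Because the replacement is not repeated, every run of three dots loses only one of them:

```php
<?php
// Assumed single-pass correction: each ".." is replaced with ".".
$input = '.../.../etc/passwd';
$filename = str_replace('..', '.', $input);

// Each "..." shrinks to "..", so the "corrected" value is a traversal path.
var_dump($filename === '../../etc/passwd'); // bool(true)
```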
This particular error can be corrected by continuing to replace the string until it is no longer found:
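A sketch of that repeated replacement, again assuming the correction collapses `..` into `.`; the function name collapse_dots_repeatedly() is hypothetical:

```php
<?php
// Hypothetical helper: repeats the replacement until no ".." remains.
function collapse_dots_repeatedly(string $filename): string
{
    // Each pass shortens the string whenever ".." is present, so the loop ends.
    while (strpos($filename, '..') !== false) {
        $filename = str_replace('..', '.', $filename);
    }

    return $filename;
}
```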
Of course, the basename() function can replace this entire technique and is a safer way to achieve the desired goal. The important point is that any attempt to correct invalid data can potentially contain an error and allow invalid data to pass through. Inspection is a much safer alternative.
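As an illustration of inspection rather than correction, the following sketch accepts a filename only when basename() leaves it unchanged, and rejects everything else. The helper name inspect_filename() and the choice to return null for invalid input are assumptions made for this example:

```php
<?php
// Hypothetical helper: accepts a filename only if it has no directory component.
function inspect_filename(string $input): ?string
{
    // basename() returns "." and ".." unchanged, so reject them explicitly.
    if ($input === '' || $input === '.' || $input === '..') {
        return null;
    }

    // If basename() changes the value, the input named a directory path.
    // The input is rejected, not repaired.
    if (basename($input) !== $input) {
        return null;
    }

    return $input;
}
```

A caller would store the result in its filtered data only when the return value is not null.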
In addition to treating filtering as an inspection process, you want to use a whitelist approach whenever possible. This means that you want to assume the data that you’re inspecting to be invalid unless you can prove that it is valid. In other words, you want to err on the side of caution. Using this approach, a mistake results in your considering valid data to be invalid. Although undesirable (as any mistake is), this is a much safer alternative than considering invalid data to be valid. By mitigating the damage caused by a mistake, you increase the security of your applications. Although this idea is theoretical in nature, history has proven it to be a very worthwhile approach.
If you can accurately and reliably identify and filter input, your job is almost done. The last step is to employ a naming convention or some other practice that can help you to accurately and reliably distinguish between filtered and tainted data. I recommend a simple naming convention because this can be used in both procedural and object-oriented paradigms. The convention that I use is to store all filtered data in an array called $clean. This allows you to take two important steps that help to prevent the injection of tainted data:
Always initialize $clean to be an empty array.
Add logic to detect and prevent any variables from a remote source named clean.
In truth, only the initialization is crucial, but it’s good to adopt the habit of considering any variable named clean to be one thing: your array of filtered data. This step provides reasonable assurance that $clean contains only data that you knowingly store therein and leaves you with the responsibility of ensuring that you never store tainted data in $clean.
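The two steps might be sketched as follows. The helper name has_remote_clean(), the set of superglobals checked, and the 400 response are assumptions chosen for illustration, not a prescribed implementation:

```php
<?php
// Hypothetical helper: reports whether any remote source supplies
// a variable named "clean".
function has_remote_clean(array ...$sources): bool
{
    foreach ($sources as $source) {
        if (array_key_exists('clean', $source)) {
            return true;
        }
    }

    return false;
}

// Refuse the request outright rather than trying to repair it.
if (has_remote_clean($_GET, $_POST, $_COOKIE)) {
    http_response_code(400);
    exit('Invalid request.');
}

// Always initialize $clean to an empty array before storing anything in it.
$clean = array();
```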
In order to solidify these concepts, consider a simple HTML form that allows a user to select among three colors:
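A form of this kind might look like the following sketch. The action URL process.php is a hypothetical name; the field name color matches the one used in the processing logic:

```html
<form action="process.php" method="POST">
  Please select a color:
  <select name="color">
    <option value="red">red</option>
    <option value="green">green</option>
    <option value="blue">blue</option>
  </select>
  <input type="submit" />
</form>
```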
In the programming logic that processes this form, it is easy to make the mistake of assuming that only one of the three choices can be provided. As you will learn in the next section, the client can submit any data as the value of $_POST['color']. To properly filter this data, you can use a switch statement:
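A minimal sketch of this filtering step follows. The helper name filter_color() and the is_string() guard are assumptions added so the example is self-contained; the essential parts are the empty $clean array and the switch statement:

```php
<?php
// Hypothetical helper: returns an array holding only proven-valid data.
function filter_color(array $input): array
{
    // Always start from an empty array so nothing tainted can be present.
    $clean = array();

    // Treat a missing or non-string value as invalid.
    $color = (isset($input['color']) && is_string($input['color']))
        ? $input['color']
        : '';

    // Whitelist: only these three exact values are accepted.
    switch ($color) {
        case 'red':
        case 'green':
        case 'blue':
            $clean['color'] = $color;
            break;
    }

    return $clean;
}

$clean = filter_color($_POST);
```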
This example first initializes $clean to an empty array in order to be certain that it cannot contain tainted data. Once it is proven that the value of $_POST['color'] is one of red, green, or blue, it is stored in $clean['color']. Therefore, you can use $clean['color'] elsewhere in your code with reasonable assurance that it is valid. Of course, you could add a default case to this switch statement to take a particular action in the case of invalid data. One possibility is to display the form again while noting the error; just be careful not to output the tainted data in an attempt to be friendly.
While this particular approach is useful for filtering data against a known set of valid values, it does not help you filter data against a known set of valid characters. For example, you might want to assert that a username may contain only alphanumeric characters:
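One way to sketch this check uses the native ctype_alnum() function. The helper name filter_username() and the is_string() guard are assumptions added for illustration:

```php
<?php
// Hypothetical helper: returns an array holding only proven-valid data.
function filter_username(array $input): array
{
    // Always start from an empty array so nothing tainted can be present.
    $clean = array();

    // ctype_alnum() is true only for a non-empty string made up entirely
    // of letters and digits; anything else is left out of $clean.
    if (isset($input['username'])
        && is_string($input['username'])
        && ctype_alnum($input['username'])) {
        $clean['username'] = $input['username'];
    }

    return $clean;
}

$clean = filter_username($_POST);
```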
Although a regular expression can be used for this particular purpose, using a native PHP function such as ctype_alnum() is preferable. Native functions are less likely to contain errors than code that you write yourself, and an error in your filtering logic is almost certain to result in a security vulnerability.