Friday, January 13, 2012

"Sanitize Input"

When application security was still in its infancy, there were discussions on how to protect applications from newly discovered injection vulnerabilities. "Sanitize Input" was a popular solution that rolled off the tongue nicely and was not overly complicated to explain. It was also a very generic solution that would (hopefully) be part of a more complete approach.

As much as "Sanitizing Input" makes sense, so does writing your code in a way that allows you to handle failure safely. This way, when the unexpected does happen, an entire operation doesn't fall down, introduce a bug, or propagate unsafe data.

Question: When does this approach fail miserably?

Answer:  When it is the only approach you have.

The OWASP Top 10 categorizes XSS and SQL Injection separately, but in both cases the attacker is injecting data that is handled insecurely by application code. Seen this way, XSS is really just another form of injection. On that note, let’s discuss two manifestations of injection: SQL Injection and HTML Injection (XSS). I'd like to demonstrate other ways to think about or handle data beyond just "Sanitize Input". If you take away nothing more from this article, I'd like it to be that applications are unique, that there is a level of complexity to design choices and solutions, and that there are more options than "Sanitize Input" available.

SQL Injection: "Save your one-liners for the bar". Parametrization of database queries is a classic method for handling queries safely and, in many cases, more efficiently. From a security standpoint, parametrized queries help to solidify the boundary between user data and SQL statements. They ensure that data submitted by the user stays separate from the actual database query and won’t interfere with the SQL code and, ultimately, the database.

Example of lazy code....

$uname = $_GET['user_name'];

....and this is the classic example everyone shows (nothing new here) of a SQL Injection flaw, where the data ($uname) is included directly in the SQL statement.

$sql = "SELECT user_id FROM users WHERE username = '$uname'";

This programming flaw has destroyed the boundary between the SQL command and user-supplied input. Because the user data is now just part of the query string, it is no longer clear to the SQL server what part was supplied by the developer and what was supplied by the user. The whole query can fall apart if the user appends a quote character. This is not just a vulnerability; this is bad programming. Sure, it takes one line to write the query, but there is no further sanity checking here. The string is formed, sent to the server, and executed as SQL. How are parametrized queries different?

Parametrized queries separate the data from the query so that we as coders don’t miscommunicate our intentions to the database server. How does it work? The majority of the query is sent to the server MINUS the actual user-submitted data. So, the query is prepared (meaning sent to the server), a response comes back with a token (minus MySQL, as I understand it), and THEN the variable is sent to the server with the token and the SQL query executes. This means the expected query and the actual data that we've gathered from the user are separated prior to execution.

Let's provide a visualization:

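// Pass in the db credentials, the host the db is located on, and the database we'd like to connect to
$conn = new PDO("mysql:host=$dbhost;dbname=$dbname", $dbuser, $dbpass);

// Ask the MySQL driver for real server-side prepares (PDO emulates them by default)
$conn->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);

// Prepare the statement; the "?" marks where the user's data will go
$sql = "SELECT user_id FROM users WHERE username = ?";
$q = $conn->prepare($sql);

// Execute the query, passing in the variable data ($uname) from the user
$q->execute(array($uname));

// Fetch the user_id from the first matching row (false if there is none)
$user_id = $q->fetchColumn();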

As you can see, the $sql statement is prepared and the server knows exactly what it should look like. Next, the SQL statement is executed, passing in the variable value in place of the "?" (shown above). By specifying that question mark, you tell the db: this is my statement, but I don't know what the value will be... I'll give you that on the next call.
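To make the difference concrete, here is a minimal sketch that runs the same hostile value through both styles of query. It assumes the pdo_sqlite extension is available and uses an in-memory SQLite database as a stand-in for the MySQL server above; the table contents and the lazy_lookup() and safe_lookup() helpers are made up for illustration.

```php
<?php
// Assumption: pdo_sqlite is installed; SQLite stands in for MySQL here.
$conn = new PDO('sqlite::memory:');
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$conn->exec("CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT)");
$conn->exec("INSERT INTO users (username) VALUES ('ken')");
$conn->exec("INSERT INTO users (username) VALUES ('alice')");

// Hypothetical helper: the "lazy" way, user data pasted into the SQL string.
function lazy_lookup(PDO $conn, string $uname): array
{
    $sql = "SELECT user_id FROM users WHERE username = '$uname'";
    return $conn->query($sql)->fetchAll(PDO::FETCH_COLUMN);
}

// Hypothetical helper: the parametrized way, user data sent separately.
function safe_lookup(PDO $conn, string $uname): array
{
    $q = $conn->prepare("SELECT user_id FROM users WHERE username = ?");
    $q->execute([$uname]);
    return $q->fetchAll(PDO::FETCH_COLUMN);
}

$payload = "' OR '1'='1";

echo count(lazy_lookup($conn, $payload)), PHP_EOL; // 2: every row matches
echo count(safe_lookup($conn, $payload)), PHP_EOL; // 0: nobody has that username
echo count(safe_lookup($conn, "ken")), PHP_EOL;    // 1: normal lookups still work
```

In the lazy version the payload rewrites the WHERE clause; in the parametrized version it is only ever compared against the username column as a plain string.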

Let's examine XSS. Again, something I hear a lot is "Sanitize your input". Some people even go as far as "Whitelist" versus "Blacklist". Okay, great, but that is not extensible, and ultimately context matters. What do I mean? It is a very one-sided approach with a lot of assumptions. Let me draw a picture for you. The understanding, as of right now, is that the data comes in at one place and is potentially echoed in another. So the model looks something like this:

A typical example would be a registration form. You sign up with your First Name, Last Name, etc. Upon successful authentication to the application, you notice a little message at the top right.... 

So....."Welcome, Ken!", I wonder where that value came from? When we registered, our information was stored in the db, later extracted after login, and shown on the page. Now, we should be safe, right? Even if we had attempted to place JavaScript in the First Name value upon registration, it wouldn't have mattered.....We Sanitized!!!

Two months later, a user complains that they signed up with a misspelled username and would like the ability to change it. A new developer is assigned the task of adding the ability to edit your first and last name and does so. The new developer's assumption is that we are going to safely handle that data when it is rendered to the user. But we aren't. We sanitized the input and didn't bother with handling the data. Our model has changed from Input/Output to......

So with one additional point of input, our model gets (very slightly) more complicated. Now imagine adding multiple points of input and multiple points of output. Now split input into data entry (processed) and storage handling (stored in the db), and then do the same for output. While we are at it, let's throw in a web service that consumes the data as well. It becomes very easy to see how "Sanitize Input" doesn't scale, isn't a sure-fire solution, and really oversimplifies the problem for those who are looking to either receive or give an easy answer.
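As a rough sketch of what "handling the data" at the point of output might look like for the "Welcome, Ken!" banner: the value is encoded when it is rendered, so it doesn't matter whether it arrived through the registration form or the newer edit form. The e() helper and the $first_name value are hypothetical, and this only covers output into an HTML body; other contexts (attributes, JavaScript, URLs) need their own handling.

```php
<?php
// Hypothetical helper: encode a value for output into an HTML body context.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Pretend this came back from the db. It could have been stored by the
// registration form or by the edit form; the output code does not care.
$first_name = "<script>alert(1)</script>";

echo "Welcome, " . e($first_name) . "!", PHP_EOL;
// prints: Welcome, &lt;script&gt;alert(1)&lt;/script&gt;!
```

This is one layer, not a replacement for validating input; the point is that the output side no longer depends on every input path having sanitized correctly.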

In summary, please join me in the fight to stop the mindless regurgitation of old material.





Anonymous said...

I really enjoyed this post, thank you.

MaXe said...

Often, the following is enough to protect against XSS.

$string = "<script>alert('Hello Kitty');</script>"; // Our input string, this could be $_GET['input'] as well.

$sanitized_string = htmlentities($string, ENT_QUOTES); // Sanitize the string and encode both double- and single-quotes. (Without ENT_QUOTES, single-quotes are not encoded and if the string is within a tag attribute using single quotes to encapsulate data, the sanitization is broken. <img src='htmlentities($string)' /> is e.g. vulnerable to XSS, as ENT_QUOTES is not set.)

echo $sanitized_string; // Print the string.


Anonymous said...

Great post. It clearly identifies the problem. Do you have any suggestions for solutions that are more scalable and effective? I know that there's no solution that's going to fit all cases and prevent all attacks, but it would be great if you could provide some generalized solutions. Thanks!

cktricky said...

@Anonymous #1 - Thanks!

@MaXe - It's true that various languages have methods to encode HTML characters. Thanks for the code example.

Output encoding, in very basic applications where the context of the output being rendered doesn't preclude it, can in many cases limit HTML Injection. However, this isn't a solution for every application, language, framework, etc. For example, UI/UX folks could invoke a jQuery function that decodes HTML-encoded characters so that the text doesn't look funny when rendered to the user. It's a simple example, but I've seen this before, and it totally destroyed the output encoding technique. A combination of whitelisting the input AND output encoding, along with identification & review of data processing boundaries (where the decoding would have been caught), would have been a more complete solution.

I guess one of the points in this article is more about understanding boundaries/layers and realizing how simple || complex your applications are. I personally think it is true that the simpler the application, the simpler the solution. It's when your model begins to scale and expand that things can get tricky in a hurry.
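A tiny sketch of that failure mode, with PHP's html_entity_decode() standing in for the client-side decoding step (the real case was JavaScript, so treat this purely as an illustration; the $stored value is made up):

```php
<?php
// Hypothetical value pulled from storage.
$stored = "<img src=x onerror=alert(1)>";

// Server side: encode for output, as recommended.
$encoded = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

// Stand-in for a later layer that decodes entities "so it looks right".
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

var_dump($encoded !== $stored); // bool(true): the markup was neutralized
var_dump($decoded === $stored); // bool(true): and then it was restored
```

The encoding did its job; the boundary after it undid the work, which is why reviewing each processing boundary matters.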

@Anonymous #2 - Well I sort of talked about some solutions here. Specific recommendations are highly dependent on the framework, language, etc. There are a lot of resources out there to find specific technical solutions for "Injection" problems in the varying application technologies.

To provide deep analysis and explanation of how I feel Injection issues could be solved, even at a high level (which I think is what you are asking), would be a conversation too big for a blog post or comment. Unfortunately, I didn't mean this post to be primarily about that. This was more of a rant on the singular solution that everyone seems to think exists. In reality, if it were really that simple, Injection vulnerabilities would be solved by now.