Monday, May 27, 2013

Down the Rabbit-Hole with CASE and Order of Evaluation

An Aluminum Falcon?

Sometimes you have what looks like a simple problem, but after a few hours of immersion in some dark corner of SQL Server, you end up more confused than when you started. And not just about the original problem: you're now also confused about stuff you were pretty sure about. That's when, as a seasoned software professional, you stop, get a good night's sleep, and come back fresh the next day with a clear mind.

Or, you can consume a few hundred fluid ounces of Diet Mountain Dew, read absolutely everything you can find on the subject until the wee hours of the morning, write endless little test programs, and mutter to yourself maniacally for hours on end, determined to beat this beastie into submission, and meanwhile your wife slips into "Puttering Around the House Much More Loudly Than Necessary While Sighing Heavily" mode.

You can guess which option I chose.

Punch it, Chewie

The simple problem I thought I was having was that the query optimizer was not making use of a CHECK constraint to avoid having to read data from a table. The real code is too large to fit in the margins, so I've shown below the smallest script that could possibly fail. It creates two tables, each with one column. Each column has a CHECK constraint. Then, 100,000 rows are INSERT'd into each table. (I chose to use a lot of data just to make the differences in the number of "logical reads" stand out. And I've done the INSERTs in kind of a weird way, because it executes more quickly and also displays decently on a web page. If you really want to feel my pain on my production machine, change the "GO 10000" and "GO 10000000".)

Next, the SELECT statement uses a CASE to compare a variable to the values in each table. If a match is found, it performs an aggregation (specifically, MAX) on that table. This is made simpler by having table t1 contain only "1"s, and table t2 contain only "2"s: this is enforced by the CHECK constraints.

-- Step 1 of 3 - Set up. ------------------------------------------ 


INSERT t1 (c) VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)
GO 10000

INSERT t2 (c) VALUES (2), (2), (2), (2), (2), (2), (2), (2), (2), (2)
GO 10000

-- Step 2 of 3 - Execution. --------------------------------------- 

           WHEN EXISTS (SELECT *      FROM t2 WHERE c <= @LtOrEq)
           THEN        (SELECT MAX(c) FROM t2 WHERE c <= @LtOrEq)

           WHEN EXISTS (SELECT *      FROM t1 WHERE c <= @LtOrEq)
           THEN        (SELECT MAX(c) FROM t1 WHERE c <= @LtOrEq)


-- Step 3 of 3 - Cleanup. ----------------------------------------- 

Since @LtOrEq is set to "1", I would expect the optimizer to use t2.ck2 to avoid scanning the t2 table; in other words, skip the evaluation of the first WHEN/THEN pair entirely. Since the values cannot be anything but "2", I would have thought that a search for a value of "1" would be eliminated without even looking at the data. At least that how I thought it worked.

Put another way, as Remus Rusanu wrote, "SQL is a declarative language. You express in a query the desired result and the server is free to choose whatever means to deliver those results. As such the order of evaluation of SQL expressions is not determined and OR and AND evaluation short circuit does not occur. However for CASE the documentation actually states that the order of evaluation occurs in the order of declaration and the evaluation stops after the first condition is met."

But that's not what I was seeing. When you run the code above, as I did on several different servers, the output is:

-- Truncated the zero values for readability. 
Table 't1'.        Scan count 2, logical reads   2
Table 'Worktable'. Scan count 0, logical reads   0
Table 't2'.        Scan count 1, logical reads 345

Now, there's three things about this output I don't understand:

  1. The tables are listed in reverse order to what I would expect: t1 appears before t2. Books OnLine is very clear that WHEN statements are evaluated from first to last (my emphasis added in blue):    

    Searched CASE expression:
    • Evaluates, in the order specified, Boolean_expression for each WHEN clause.
    • Returns result_expression of the first Boolean_expression that evaluates to TRUE.
    • If no Boolean_expression evaluates to TRUE, the Database Engine returns the else_result_expression if an ELSE clause is specified, or a NULL value if no ELSE clause is specified.

    So either the WHEN statements aren't evaluated in the correct order, or SET STATISTICS IO is outputting them in an incorrect order. Which is it?

  2. Table t1 is scanned twice, which makes sense: once to see if any matching rows exist, and once to find the MAX. And it only needs to do a single logical read for each of them because all 100,000 32-byte integer values in t1.c1 will fit inside a single 8 KB page and... uh-oh. Why doesn't it take, um, 100000 * 32 / 8060 = 397.02 logical reads, since the table has to occupy at least that many pages?

  3. Table t2 is scanned once, which also makes sense: since its WHEN clause evaluates to FALSE, it shouldn't evaluate its THEN clause. So then why does it do 345 logical reads? How can the table with the check constraint in place to keep it from being read have more reads than the table that being scanned for MAX?

As an aside, the BOL entry for CASE is very clear about order of execution, but the language is muddled regarding short-circuiting. Evaluates, in the order specified, Boolean_expression for each WHEN clause. Returns result_expression of the first Boolean_expression that evaluates to TRUE. This could be read - logically correctly, but not what the authors meant (or so I thought) - as: Evaluates, in the order specified, Boolean_expression for each WHEN clause. Then, when all the WHEN clauses have been evaluated, returns result_expression of the first Boolean_expression that evaluated to TRUE. There's at least one posting on the web which makes this (incorrect) argument. The BOL page should be edited to make it clear that short-circuiting does indeed occur... but read on...

I used a third-party tool to try to understand what's going on, did some amateur art-work to highlight the two tables, and rearranged the tree to help me visualize it all.

Clearly the t2 table is getting scanned twice.

There is Another

What finally pointed me in the right direction was a blog post by Mladen Prajdić about short-circuiting.  After setting up a test table t1:

Run this statement:

       FROM t1
      WHERE id / 0 = 1
        AND id = 2

You'll get this error:

Msg 241, Level 16, State 1, Line 10
Conversion failed when converting datetime from character string.

The execution plan shows us how SQL Server parameterizes our select statement:

       FROM [t1]
      WHERE [id] / @1 = @2
        AND [id] = @3
        AND CONVERT([DATETIME], [val], 0) > GETDATE()

Conversion failed when converting datetime from character string

The most obvious condition to fail is of course the divide by zero. You'd think it would be the first to evaluate since it's obviously an illegal call and everything else would be discarded. However because of the statement parameterization this doesn't happen because at the condition evaluation time the values 0 and 1 aren't known. For SQL Server they are still valid parameters whose condition cost must be evaluated.

Aha! No wonder the t2 table's WHEN was evaluated: the values of the @LtOrEq variable couldn't be assumed at run-time. So if I change the variable to a literal "1":

           WHEN EXISTS (SELECT *      FROM t2 WHERE c <= 1  /* @LtOrEq */  )
           THEN        (SELECT MAX(c) FROM t2 WHERE c <= 1  /* @LtOrEq */  )

           WHEN EXISTS (SELECT *      FROM t1 WHERE c <= 1  /* @LtOrEq */  )
           THEN        (SELECT MAX(c) FROM t1 WHERE c <= 1  /* @LtOrEq */  )

... the results are:

-- Truncated the zero values for readability. 
Table 't1'. Scan count 2, logical reads 2

Much better. Table t2 isn't read from at all (and this is backed up by the query plan). The problem now is, how do I get this behavior and still use variables?

These aren't The Droids You're Looking For

But first I have to explain that this has nothing to do with "short-circuiting" of evaluation. What's that, you say? Imagine code like "IF a OR b". During execution, if "b" is not evaluated because "a" was evaluated as TRUE, that's short-circuiting. Some languages, like C++ and C#, guarantee that short-circuiting will occur. The T-SQL language - and this is where it gets confusing - can short-circuit if the engine feels like it. It's totally up to the query optimizer, which means you and I better not write code that depends on short-circuiting: it may work as we expect for years, and then one day some threshold is crossed, and it stops working as we - incorrectly - expected. There isn't even a way to force short-circuiting, and no, parentheses won't help.

What this is (partly) about, however, is the closely related concept of "order of evaluation". T-SQL doesn't make any promises about this, either, with one exception. Order of evaluation is guaranteed in the CASE statement: the WHEN clauses are guaranteed to be evaluated from top to bottom. This means that CASE guarantees short-circuiting.

Except when it doesn't.

I've Got a Bad Feeling About This...

As it turns out, there was a bug in the CASE statement that was causing the order of evaluation to not be done correctly (great discussion at Bart Duncan's SQL Blog ). This has been fixed.

But as it turns out, what I was seeing was not that bug: it was expected behavior. Microsoft states here, referring to BOL for SQL Server 2012 (my emphasis added in blue):

The following has been added to the Remarks section of the topic CASE (Transact-SQL) in Books Online.

The CASE statement evaluates its conditions sequentially and stops with the first condition whose condition is satisfied. In some situations, an expression is evaluated before a CASE statement receives the results of the expression as its input. Errors in evaluating these expressions are possible. Aggregate expressions that appear in WHEN arguments to a CASE statement are evaluated first, then provided to the CASE statement. For example, the following query produces a divide by zero error when producing the value of the MAX aggregate. This occurs prior to evaluating the CASE expression.

WITH Data (value) AS
     SELECT 1
          WHEN MIN(value) <= 0 THEN 0
          WHEN MAX(1/value) >= 100 THEN 1
     FROM Data

You should only depend on order of evaluation of the WHEN conditions for scalar expressions (including non-correlated sub-queries that return scalars), not for aggregate expressions.

But there's even more. It's not only in the WHEN clause that aggregate expressions cause divide-by-zero errors. If you change the code above to this:

WITH Data (value) AS
     SELECT 1
          WHEN MIN(value) <= 0 THEN 0
          WHEN 1          >= 1 THEN MAX(1/value)    -- Aggregate in THEN.
     FROM Data

... you'll still get a divide by zero error, this time caused by the evaluation of the second THEN clause!


And there you have it. The CASE statement guarantees order-of-evaluation, unless there are aggregates in the WHEN (or THEN) clauses. Is this a bug, or at least something that could be corrected by Microsoft? If you read the references below, some good arguments are made for not "fixing" this. Either way, at least now I know what's going on. I know this is a long and twisting post, so if you hung in here this far, thanks for reading!


Books OnLine: CASE (Transact-SQL) Read this first.

How SQL Server short-circuits WHERE condition evaluation by Mladen Prajdić is in my opinion the best post on the web about short-circuiting.

Understanding T-SQL Expression Short-Circuiting by Gianluca Sartori is brilliant and covers a lot of ground. A must-read. The interesting idea in this thread is that T-SQL always short-circuits, but because the boolean operators are commutative, the order (left-to-right, or right-to-left?) is indeterminate. Some very smart people here.

Don’t depend on expression short circuiting in T-SQL (not even with CASE) is very interesting. This is the article that is referenced in the Connect that ultimately led to the bug in CASE getting fixed.

Connect: Aggregates Don't Follow the Semantics Of CASE includes the explanation by Microsoft about how CASE doesn't short-circuit for aggregates, only scalar values.


Post a Comment