Thursday, May 30, 2013

Find All NOT NULL Columns Without a Default Constraint

Rarely, I'll run into a situation where it would be useful to know which columns that are specified as NOT NULL don't have a default constraint as well. For example, I've seen a table with two dozen columns, all but a few set as NOT NULL. This in itself isn't an issue, but on this occasion, for debugging purposes, we were constantly INSERTing a test row of data into the table. Having to specify a default value for each column got to be annoying after a while.

There are, of course, other ways to skin this cat, We could have created a SSMS template and used the old "Ctrl-Shift-M" key combo to type in just the values we needed. Or we could have created a script, or even a stored procedure to handle the default values for us. For various reasons, those solutions wouldn't work in our environment, so I decided to attack the problem at its root.

The code below looks at all non-empty tables and prints out T-SQL code to add the default constraint to each column it finds. Note that I've made an attempt to select a reasonable default value for all the common data types, but (a) the list is not complete, and (b) your idea of a "reasonable" default value may be very different from mine. So don't just copy and paste the output from this script and hit F5 - take a moment to read each line and be sure it does what you want; if not, you can easily change it!

-- For all tables in the current database that have rows, finds all "NOT NULL" 
-- columns that don't have a DEFAULT constraint, and emits the ALTER TABLE     
-- statements needed to add the DEFAULT constraints for them.                  


DECLARE @SchemaName      sysname         SET @SchemaName     = ''
DECLARE @TableName       sysname         SET @TableName      = ''
DECLARE @ColumnName      sysname         SET @ColumnName     = ''
DECLARE @ColumnType      sysname         SET @ColumnType     = ''
DECLARE @ConstraintName  sysname         SET @ConstraintName = ''
DECLARE @Sql             NVARCHAR(MAX)   SET @Sql            = ''

    SELECT SCHEMA_NAME(t.schema_id)    AS 'Schema Name'
         , OBJECT_NAME(c.object_id)    AS 'Table Name'
         ,                      AS 'Column Name'
         ,                     AS 'Column Type'
      FROM sys.columns   c
      JOIN sys.tables    t
        ON c.object_id = t.object_id
      JOIN sys.types     ty
        ON c.user_type_id = ty.user_type_id
     WHERE c.is_nullable                  = 0
       AND c.is_identity                  = 0
       AND c.is_computed                  = 0
       AND t.type_desc                    = 'USER_TABLE'
       AND t.is_ms_shipped                = 0
       AND ISNULL(c.default_object_id, 0) = 0
       AND 0 < (SELECT SUM(row_count)
                  FROM sys.dm_db_partition_stats
                 WHERE object_id = OBJECT_ID(   
                   AND (index_id = 0 OR index_id = 1))
  ORDER BY 'Schema Name'
         , 'Table Name'
         , 'Column Name'

OPEN cur

 INTO @SchemaName, @TableName, @ColumnName, @ColumnType
    PRINT 'USE ' + DB_NAME(0)
    PRINT ' ' 

    SET @ConstraintName = QUOTENAME('DF_'+ @TableName + '_' + @ColumnName)
    SET @Sql = ''
             + 'PRINT ''Processing: ' + QUOTENAME(@SchemaName) + '.' + @ConstraintName + ''''
             + ' ALTER TABLE ' + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@TableName) 
             + ' ADD CONSTRAINT ' + @ConstraintName 
             + ' DEFAULT ('
                CASE @ColumnType
                    WHEN 'char'             THEN ''''''
                    WHEN 'nchar'            THEN ''''''
                    WHEN 'varchar'          THEN ''''''
                    WHEN 'nvarchar'         THEN ''''''
                    WHEN 'sysname'          THEN ''''''
                    WHEN 'xml'              THEN ''''''
                    WHEN 'udtUserID'        THEN ''''''
                    WHEN 'text'             THEN ''
                    WHEN 'ntext'            THEN ''

                    WHEN 'bigint'           THEN '0'
                    WHEN 'int'              THEN '0'
                    WHEN 'smallint'         THEN '0'
                    WHEN 'tinyint'          THEN '0'
                    WHEN 'bit'              THEN '0'

                    WHEN 'real'             THEN '0.0'
                    WHEN 'money'            THEN '0.0'
                    WHEN 'smallmoney'       THEN '0.0'
                    WHEN 'float'            THEN '0.0'
                    WHEN 'decimal'          THEN '0.0'
                    WHEN 'numeric'          THEN '0.0'

                    WHEN 'image'            THEN '0x0'
                    WHEN 'binary'           THEN '0x0'
                    WHEN 'varbinary'        THEN '0x0'
                    WHEN 'uniqueidentifier' THEN '0x0'
                    WHEN 'sql_variant'      THEN '0'
                    WHEN 'hierarchyid'      THEN '''/'''
                    WHEN 'geometry'         THEN '0'
                    WHEN 'geography'        THEN '0'

                    WHEN 'datetime'         THEN 'GETDATE()'
                    WHEN 'date'             THEN 'GETDATE()'
                    WHEN 'time'             THEN 'GETDATE()'
                    WHEN 'datetime2'        THEN 'GETDATE()'
                    WHEN 'datetimeoffset'   THEN 'GETDATE()'
                    WHEN 'smalldatetime'    THEN 'GETDATE()'
                    WHEN 'timestamp'        THEN 'GETDATE()'
                    ELSE                         '-1'
             + ')' 
             + ' FOR ' + QUOTENAME(@ColumnName)

    PRINT @Sql    
     INTO @SchemaName, @TableName, @ColumnName, @ColumnType


Script to Calculate Columns' Cardinality and Selectivity

The concepts of column cardinality and selectivity come up a lot when you're designing indexes; this script displays both values for a given table in the current database.

There are many excellent articles on the web explaining these two concepts (see the References section below for my favorites), but briefly:

Cardinality is the number of distinct values in a column. The classic example is the "Employees" table: the "Sex" column has two possible values (M, or F) and so has a cardinality of either 1 (for an all-woman company, for example), or 2 (for a co-ed company). The "ZIP Code" column could have a cardinality anywhere from 1 (for a very small company) to about 43,000 (for a huge, nation-wide enterprise).

Selectivity for a column is the ratio of the number of distinct values (the cardinality) to the total number of values. For example, in our "Employees" table, imagine there are 5000 employees, but all of them live in only 10 states. The selectivity of this column would be: 10 / 5000 = 0.002, or 0.2%. One potentially confusing point is that although this is a very low percentage, it is referred to as high selectivity: this would potentially be a good column for an index. The lower the selectivity percentage, the higher the selectivity this represents, and the more useful the column might be as the first column in an index.

I find this script useful when I encounter a table containing data I am totally unfamiliar with: I could probably guess the columns' selectivity for a table named Customer fairly well, but a table named Thx1138 containing some kind of exotic (to me) data would be a bigger challenge without this script.

-- Calculates the row count, cardinality, and selectivity of each column in a  
-- user-specified table.                                                       


-- Specify the table here - this is the only place you need to modify. 
DECLARE @SchemaName sysname         SET @SchemaName = 'dbo'
DECLARE @TableName  sysname         SET @TableName  = 'MyTable'

-- Declare variables. 
DECLARE @CrLf       CHAR(2)         SET @CrLf       = CHAR(13) + CHAR(10)
DECLARE @Sql        NVARCHAR(MAX)   SET @Sql        = ''
DECLARE @ColumnName sysname         SET @ColumnName = ''

-- Show total number of rows in table. 
SET @Sql = 'SELECT COUNT(*) AS "Row Count for ''' + @TableName + '''" FROM ' + @SchemaName + '.' + @TableName
EXEC sp_executesql @Sql

-- Calculate selectivity as "cardinality / row count" for each column. 
    SELECT name
      FROM sys.columns
     WHERE object_id   = OBJECT_ID(@SchemaName + '.' + @TableName)
       AND is_identity = 0
OPEN cur

    RAISERROR('Processing column: %s', 10, 1, @ColumnName) WITH NOWAIT

    SET @Sql = 'SELECT ''' + QUOTENAME(@ColumnName) + '''   AS ''Column'' '                        + @CrLf
             + '     ,       COUNT(DISTINCT ' + QUOTENAME(@ColumnName) + ')   AS ''Cardinality'' ' + @CrLf
             + '     , LEFT((COUNT(DISTINCT ' + QUOTENAME(@ColumnName) + ') * 1.0) / '             + @CrLf
             + '             NULLIF(COUNT(*), 0), 6)   AS ''Selectivity'' '                        + @CrLf
             + '  FROM ' + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@TableName) + ' '              + @CrLf
             + ' WHERE ' + QUOTENAME(@ColumnName) + ' IS NOT NULL '                                + @CrLf
             + ''                                                                                  + @CrLf

    EXEC sp_executesql @Sql
    FETCH NEXT FROM cur INTO @ColumnName



Index columns, selectivity and equality predicates

Query Tuning Fundamentals: Density, Predicates, Selectivity, and Cardinality

Column Statistics Give the Optimizer an Edge

Tuesday, May 28, 2013

A quick script to display the total size each database, taking into account multiple MDF and LDF files.

-- Returns the total size of all databases' mdf and ldf files, and the grand   
-- total of the two.                                                           

-- Create the temp table. 
IF OBJECT_ID('tempdb..##t') IS NOT NULL
    DROP TABLE ##t

    [Database]   sysname, 
    [File Type]  NVARCHAR(50),
    [Size (MB)]  BIGINT

-- Populate the temp table with value for all the mdf and ldf files for all databases.  
SET @CrLf = CHAR(13) + CHAR(10)


SET @Sql = 'USE ? '                                                           + @CrLf
         + ''                                                                 + @CrLf
         + 'INSERT ##t '                                                      + @CrLf
         + '    SELECT ''?'' '                                                + @CrLf
         + '         , type_desc '                                            + @CrLf
         + '         , CAST(Size AS BIGINT) * (8 * 1024) / 1024 / 1024 '      + @CrLf
         + '      FROM sys.database_files '

EXEC sp_msforeachdb @Sql

-- Calculate the grand total (mdf files' sizes + ldf files' sizes; add to temp table. 
SET @Sql = 'USE ? '                                                           + @CrLf
         + ''                                                                 + @CrLf
         + 'INSERT ##t '                                                      + @CrLf
         + '    SELECT ''?'' '                                                + @CrLf
         + '         , '' - TOTAL - '' '                                      + @CrLf
         + '         , CAST(SUM(Size) AS BIGINT) * (8 * 1024) / 1024 / 1024 ' + @CrLf
         + '      FROM sys.database_files '
EXEC sp_msforeachdb @Sql

-- Display the totals for each database and file type. 
  SELECT [Database], [File Type], SUM([Size (MB)])   AS 'Size (MB)'
    FROM ##t
GROUP BY [Database], [File Type]
ORDER BY [File Type] DESC, [Size (MB)] DESC

IF OBJECT_ID('tempdb..##t') IS NOT NULL
    DROP TABLE ##t

Setting the Number of ERRORLOG files

For some reason, the default number of ERRORLOG files maintained by SQL Server defaults to six. This has always seemed like a poor choice to me, as it makes it impossible to investigate events that may have occurred many months ago. Fortunately, this is a simple Registry setting, and we can use the undocumented xp_instance_regwrite system stored procedure to set it to the value we want.

-- Set the number of ERRORLOG files to the maximum value of 99 instead of the  
-- default of 6. This is especially useful if you also set up a Job to cycle   
-- the ERRORLOG files every week or so, to keep them small enough to load      
-- comfortably.                                                                

EXEC xp_instance_regwrite 


See this SQLAdmin post at xp_instance_regwrite syntax for details on how to use that procedure.

Monday, May 27, 2013

Down the Rabbit-Hole with CASE and Order of Evaluation

An Aluminum Falcon?

Sometimes you have what looks like a simple problem, but after a few hours of immersion in some dark corner of SQL Server, you end up more confused than when you started. And not just about the original problem: you're now also confused about stuff you were pretty sure about. That's when, as a seasoned software professional, you stop, get a good night's sleep, and come back fresh the next day with a clear mind.

Or, you can consume a few hundred fluid ounces of Diet Mountain Dew, read absolutely everything you can find on the subject until the wee hours of the morning, write endless little test programs, and mutter to yourself maniacally for hours on end, determined to beat this beastie into submission, and meanwhile your wife slips into "Puttering Around the House Much More Loudly Than Necessary While Sighing Heavily" mode.

You can guess which option I chose.

Punch it, Chewie

The simple problem I thought I was having was that the query optimizer was not making use of a CHECK constraint to avoid having to read data from a table. The real code is too large to fit in the margins, so I've shown below the smallest script that could possibly fail. It creates two tables, each with one column. Each column has a CHECK constraint. Then, 100,000 rows are INSERT'd into each table. (I chose to use a lot of data just to make the differences in the number of "logical reads" stand out. And I've done the INSERTs in kind of a weird way, because it executes more quickly and also displays decently on a web page. If you really want to feel my pain on my production machine, change the "GO 10000" and "GO 10000000".)

Next, the SELECT statement uses a CASE to compare a variable to the values in each table. If a match is found, it performs an aggregation (specifically, MAX) on that table. This is made simpler by having table t1 contain only "1"s, and table t2 contain only "2"s: this is enforced by the CHECK constraints.

-- Step 1 of 3 - Set up. ------------------------------------------ 


INSERT t1 (c) VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)
GO 10000

INSERT t2 (c) VALUES (2), (2), (2), (2), (2), (2), (2), (2), (2), (2)
GO 10000

-- Step 2 of 3 - Execution. --------------------------------------- 

           WHEN EXISTS (SELECT *      FROM t2 WHERE c <= @LtOrEq)
           THEN        (SELECT MAX(c) FROM t2 WHERE c <= @LtOrEq)

           WHEN EXISTS (SELECT *      FROM t1 WHERE c <= @LtOrEq)
           THEN        (SELECT MAX(c) FROM t1 WHERE c <= @LtOrEq)


-- Step 3 of 3 - Cleanup. ----------------------------------------- 

Since @LtOrEq is set to "1", I would expect the optimizer to use t2.ck2 to avoid scanning the t2 table; in other words, skip the evaluation of the first WHEN/THEN pair entirely. Since the values cannot be anything but "2", I would have thought that a search for a value of "1" would be eliminated without even looking at the data. At least that how I thought it worked.

Put another way, as Remus Rusanu wrote, "SQL is a declarative language. You express in a query the desired result and the server is free to choose whatever means to deliver those results. As such the order of evaluation of SQL expressions is not determined and OR and AND evaluation short circuit does not occur. However for CASE the documentation actually states that the order of evaluation occurs in the order of declaration and the evaluation stops after the first condition is met."

But that's not what I was seeing. When you run the code above, as I did on several different servers, the output is:

-- Truncated the zero values for readability. 
Table 't1'.        Scan count 2, logical reads   2
Table 'Worktable'. Scan count 0, logical reads   0
Table 't2'.        Scan count 1, logical reads 345

Now, there's three things about this output I don't understand:

  1. The tables are listed in reverse order to what I would expect: t1 appears before t2. Books OnLine is very clear that WHEN statements are evaluated from first to last (my emphasis added in blue):    

    Searched CASE expression:
    • Evaluates, in the order specified, Boolean_expression for each WHEN clause.
    • Returns result_expression of the first Boolean_expression that evaluates to TRUE.
    • If no Boolean_expression evaluates to TRUE, the Database Engine returns the else_result_expression if an ELSE clause is specified, or a NULL value if no ELSE clause is specified.

    So either the WHEN statements aren't evaluated in the correct order, or SET STATISTICS IO is outputting them in an incorrect order. Which is it?

  2. Table t1 is scanned twice, which makes sense: once to see if any matching rows exist, and once to find the MAX. And it only needs to do a single logical read for each of them because all 100,000 32-byte integer values in t1.c1 will fit inside a single 8 KB page and... uh-oh. Why doesn't it take, um, 100000 * 32 / 8060 = 397.02 logical reads, since the table has to occupy at least that many pages?

  3. Table t2 is scanned once, which also makes sense: since its WHEN clause evaluates to FALSE, it shouldn't evaluate its THEN clause. So then why does it do 345 logical reads? How can the table with the check constraint in place to keep it from being read have more reads than the table that being scanned for MAX?

As an aside, the BOL entry for CASE is very clear about order of execution, but the language is muddled regarding short-circuiting. Evaluates, in the order specified, Boolean_expression for each WHEN clause. Returns result_expression of the first Boolean_expression that evaluates to TRUE. This could be read - logically correctly, but not what the authors meant (or so I thought) - as: Evaluates, in the order specified, Boolean_expression for each WHEN clause. Then, when all the WHEN clauses have been evaluated, returns result_expression of the first Boolean_expression that evaluated to TRUE. There's at least one posting on the web which makes this (incorrect) argument. The BOL page should be edited to make it clear that short-circuiting does indeed occur... but read on...

I used a third-party tool to try to understand what's going on, did some amateur art-work to highlight the two tables, and rearranged the tree to help me visualize it all.

Clearly the t2 table is getting scanned twice.

There is Another

What finally pointed me in the right direction was a blog post by Mladen Prajdić about short-circuiting.  After setting up a test table t1:

Run this statement:

       FROM t1
      WHERE id / 0 = 1
        AND id = 2

You'll get this error:

Msg 241, Level 16, State 1, Line 10
Conversion failed when converting datetime from character string.

The execution plan shows us how SQL Server parameterizes our select statement:

       FROM [t1]
      WHERE [id] / @1 = @2
        AND [id] = @3
        AND CONVERT([DATETIME], [val], 0) > GETDATE()

Conversion failed when converting datetime from character string

The most obvious condition to fail is of course the divide by zero. You'd think it would be the first to evaluate since it's obviously an illegal call and everything else would be discarded. However because of the statement parameterization this doesn't happen because at the condition evaluation time the values 0 and 1 aren't known. For SQL Server they are still valid parameters whose condition cost must be evaluated.

Aha! No wonder the t2 table's WHEN was evaluated: the values of the @LtOrEq variable couldn't be assumed at run-time. So if I change the variable to a literal "1":

           WHEN EXISTS (SELECT *      FROM t2 WHERE c <= 1  /* @LtOrEq */  )
           THEN        (SELECT MAX(c) FROM t2 WHERE c <= 1  /* @LtOrEq */  )

           WHEN EXISTS (SELECT *      FROM t1 WHERE c <= 1  /* @LtOrEq */  )
           THEN        (SELECT MAX(c) FROM t1 WHERE c <= 1  /* @LtOrEq */  )

... the results are:

-- Truncated the zero values for readability. 
Table 't1'. Scan count 2, logical reads 2

Much better. Table t2 isn't read from at all (and this is backed up by the query plan). The problem now is, how do I get this behavior and still use variables?

These aren't The Droids You're Looking For

But first I have to explain that this has nothing to do with "short-circuiting" of evaluation. What's that, you say? Imagine code like "IF a OR b". During execution, if "b" is not evaluated because "a" was evaluated as TRUE, that's short-circuiting. Some languages, like C++ and C#, guarantee that short-circuiting will occur. The T-SQL language - and this is where it gets confusing - can short-circuit if the engine feels like it. It's totally up to the query optimizer, which means you and I better not write code that depends on short-circuiting: it may work as we expect for years, and then one day some threshold is crossed, and it stops working as we - incorrectly - expected. There isn't even a way to force short-circuiting, and no, parentheses won't help.

What this is (partly) about, however, is the closely related concept of "order of evaluation". T-SQL doesn't make any promises about this, either, with one exception. Order of evaluation is guaranteed in the CASE statement: the WHEN clauses are guaranteed to be evaluated from top to bottom. This means that CASE guarantees short-circuiting.

Except when it doesn't.

I've Got a Bad Feeling About This...

As it turns out, there was a bug in the CASE statement that was causing the order of evaluation to not be done correctly (great discussion at Bart Duncan's SQL Blog ). This has been fixed.

But as it turns out, what I was seeing was not that bug: it was expected behavior. Microsoft states here, referring to BOL for SQL Server 2012 (my emphasis added in blue):

The following has been added to the Remarks section of the topic CASE (Transact-SQL) in Books Online.

The CASE statement evaluates its conditions sequentially and stops with the first condition whose condition is satisfied. In some situations, an expression is evaluated before a CASE statement receives the results of the expression as its input. Errors in evaluating these expressions are possible. Aggregate expressions that appear in WHEN arguments to a CASE statement are evaluated first, then provided to the CASE statement. For example, the following query produces a divide by zero error when producing the value of the MAX aggregate. This occurs prior to evaluating the CASE expression.

WITH Data (value) AS
     SELECT 1
          WHEN MIN(value) <= 0 THEN 0
          WHEN MAX(1/value) >= 100 THEN 1
     FROM Data

You should only depend on order of evaluation of the WHEN conditions for scalar expressions (including non-correlated sub-queries that return scalars), not for aggregate expressions.

But there's even more. It's not only in the WHEN clause that aggregate expressions cause divide-by-zero errors. If you change the code above to this:

WITH Data (value) AS
     SELECT 1
          WHEN MIN(value) <= 0 THEN 0
          WHEN 1          >= 1 THEN MAX(1/value)    -- Aggregate in THEN.
     FROM Data

... you'll still get a divide by zero error, this time caused by the evaluation of the second THEN clause!


And there you have it. The CASE statement guarantees order-of-evaluation, unless there are aggregates in the WHEN (or THEN) clauses. Is this a bug, or at least something that could be corrected by Microsoft? If you read the references below, some good arguments are made for not "fixing" this. Either way, at least now I know what's going on. I know this is a long and twisting post, so if you hung in here this far, thanks for reading!


Books OnLine: CASE (Transact-SQL) Read this first.

How SQL Server short-circuits WHERE condition evaluation by Mladen Prajdić is in my opinion the best post on the web about short-circuiting.

Understanding T-SQL Expression Short-Circuiting by Gianluca Sartori is brilliant and covers a lot of ground. A must-read. The interesting idea in this thread is that T-SQL always short-circuits, but because the boolean operators are commutative, the order (left-to-right, or right-to-left?) is indeterminate. Some very smart people here.

Don’t depend on expression short circuiting in T-SQL (not even with CASE) is very interesting. This is the article that is referenced in the Connect that ultimately led to the bug in CASE getting fixed.

Connect: Aggregates Don't Follow the Semantics Of CASE includes the explanation by Microsoft about how CASE doesn't short-circuit for aggregates, only scalar values.

xp_fileexist Fails Even if Mapped to a Local Folder

Apparently the undocumented xp_fileexist procedure really doesn't like mapped drives. I knew about the headaches involved with using this proc to read network drives (permissions issues are usually the problem), but I expected it to return correct results for a locally mapped drive.

To see this, map your C:\Windows folder to, for example the W: drive, and then run this script:

-- Demonstration that xp_fileexist really, 
-- really doesn't like mapped drives - 
-- even those mapped to a local path. 

-- Local disk folder succeeds: 
--    File_Exists = 1 
--    File_is_a_Directory = 0 
--    Parent_Directory_Exists = 1 
SET @FileName = 'C:\Windows\Notepad.exe'
EXEC xp_fileexist @FileName

-- Map a drive to  \\MYMACHINE\c$\Windows  as drive  W: 

-- Local mapped drive FAILS: 
--    File_Exists = 0 
--    File_is_a_Directory = 0 
--    Parent_Directory_Exists = 0 
SET @FileName = 'W:\Notepad.exe'
EXEC xp_fileexist @FileName

If there's a workaround for this, I'd be interested. (Yes, I could be the 1,000,000th person to write my own CLR to do this, but I'm hoping for a better way.)