Thursday, May 30, 2013

Script to Calculate Columns' Cardinality and Selectivity

The concepts of column cardinality and selectivity come up a lot when you're designing indexes; this script displays both values for a given table in the current database.

There are many excellent articles on the web explaining these two concepts (see the References section below for my favorites), but briefly:

Cardinality is the number of distinct values in a column. The classic example is the "Employees" table: the "Sex" column has two possible values (M, or F) and so has a cardinality of either 1 (for an all-woman company, for example), or 2 (for a co-ed company). The "ZIP Code" column could have a cardinality anywhere from 1 (for a very small company) to about 43,000 (for a huge, nation-wide enterprise).

Selectivity for a column is the ratio of the number of distinct values (the cardinality) to the total number of values. For example, in our "Employees" table, imagine there are 5000 employees, but all of them live in only 10 states. The selectivity of this column would be: 10 / 5000 = 0.002, or 0.2%. One potentially confusing point is that although this is a very low percentage, it is referred to as high selectivity: this would potentially be a good column for an index. The lower the selectivity percentage, the higher the selectivity this represents, and the more useful the column might be as the first column in an index.

I find this script useful when I encounter a table containing data I am totally unfamiliar with: I could probably guess the columns' selectivity for a table named Customer fairly well, but a table named Thx1138 containing some kind of exotic (to me) data would be a bigger challenge without this script.

-------------------------------------------------------------------------------
-- Calculates the row count, cardinality, and selectivity of each column in a  
-- user-specified table.                                                       
-------------------------------------------------------------------------------

SET NOCOUNT ON

-- Specify the table here - this is the only place you need to modify. 
DECLARE @SchemaName sysname         SET @SchemaName = 'dbo'
DECLARE @TableName  sysname         SET @TableName  = 'MyTable'

-- Declare variables. 
DECLARE @CrLf       CHAR(2)         SET @CrLf       = CHAR(13) + CHAR(10)
DECLARE @Sql        NVARCHAR(MAX)   SET @Sql        = ''
DECLARE @ColumnName sysname         SET @ColumnName = ''

-- Show total number of rows in table. 
SET @Sql = 'SELECT COUNT(*) AS "Row Count for ''' + @TableName + '''" FROM ' + @SchemaName + '.' + @TableName
EXEC sp_executesql @Sql

-- Calculate selectivity as "cardinality / row count" for each column. 
DECLARE cur CURSOR FOR
    SELECT name
      FROM sys.columns
     WHERE object_id   = OBJECT_ID(@SchemaName + '.' + @TableName)
       AND is_identity = 0
       
OPEN cur
FETCH NEXT FROM cur INTO @ColumnName

WHILE @@FETCH_STATUS = 0
BEGIN
    RAISERROR('Processing column: %s', 10, 1, @ColumnName) WITH NOWAIT

    SET @Sql = 'SELECT ''' + QUOTENAME(@ColumnName) + '''   AS ''Column'' '                        + @CrLf
             + '     ,       COUNT(DISTINCT ' + QUOTENAME(@ColumnName) + ')   AS ''Cardinality'' ' + @CrLf
             + '     , LEFT((COUNT(DISTINCT ' + QUOTENAME(@ColumnName) + ') * 1.0) / '             + @CrLf
             + '             NULLIF(COUNT(*), 0), 6)   AS ''Selectivity'' '                        + @CrLf
             + '  FROM ' + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@TableName) + ' '              + @CrLf
             + ' WHERE ' + QUOTENAME(@ColumnName) + ' IS NOT NULL '                                + @CrLf
             + ''                                                                                  + @CrLf

    EXEC sp_executesql @Sql
    FETCH NEXT FROM cur INTO @ColumnName
END

CLOSE cur
DEALLOCATE cur

References

Index columns, selectivity and equality predicates

Query Tuning Fundamentals: Density, Predicates, Selectivity, and Cardinality

Column Statistics Give the Optimizer an Edge


1 comments:

Jon said...

"The lower the selectivity percentage, the higher the selectivity this represents"

This doesn't make sense to me. Running your query on my table the cluster key has a 1.0 and my column with only 12 unique columns out of >200000 gets a much lower value 0.00006393327 but if I add an index on that column SQL will just ignore it.

Post a Comment