The concepts of column cardinality and selectivity come up a lot when you're designing indexes; this script displays both values for a given table in the current database.
There are many excellent articles on the web explaining these two concepts (see the References section below for my favorites), but briefly:
Cardinality is the number of distinct values in a column. The classic example is the "Employees" table: the "Sex" column has two possible values (M, or F) and so has a cardinality of either 1 (for an all-woman company, for example), or 2 (for a co-ed company). The "ZIP Code" column could have a cardinality anywhere from 1 (for a very small company) to about 43,000 (for a huge, nation-wide enterprise).
Selectivity for a column is the ratio of the number of distinct values (the cardinality) to the total number of values. For example, in our "Employees" table, imagine there are 5000 employees, but all of them live in only 10 states. The selectivity of this column would be: 10 / 5000 = 0.002, or 0.2%. One potentially confusing point is that although this is a very low percentage, it is referred to as high selectivity: this would potentially be a good column for an index. The lower the selectivity percentage, the higher the selectivity this represents, and the more useful the column might be as the first column in an index.
I find this script useful when I encounter a table containing data I am totally unfamiliar with: I could probably guess the columns' selectivity for a table named Customer fairly well, but a table named Thx1138 containing some kind of exotic (to me) data would be a bigger challenge without this script.
------------------------------------------------------------------------------- -- Calculates the row count, cardinality, and selectivity of each column in a -- user-specified table. ------------------------------------------------------------------------------- SET NOCOUNT ON -- Specify the table here - this is the only place you need to modify. DECLARE @SchemaName sysname SET @SchemaName = 'dbo' DECLARE @TableName sysname SET @TableName = 'MyTable' -- Declare variables. DECLARE @CrLf CHAR(2) SET @CrLf = CHAR(13) + CHAR(10) DECLARE @Sql NVARCHAR(MAX) SET @Sql = '' DECLARE @ColumnName sysname SET @ColumnName = '' -- Show total number of rows in table. SET @Sql = 'SELECT COUNT(*) AS "Row Count for ''' + @TableName + '''" FROM ' + @SchemaName + '.' + @TableName EXEC sp_executesql @Sql -- Calculate selectivity as "cardinality / row count" for each column. DECLARE cur CURSOR FOR SELECT name FROM sys.columns WHERE object_id = OBJECT_ID(@SchemaName + '.' + @TableName) AND is_identity = 0 OPEN cur FETCH NEXT FROM cur INTO @ColumnName WHILE @@FETCH_STATUS = 0 BEGIN RAISERROR('Processing column: %s', 10, 1, @ColumnName) WITH NOWAIT SET @Sql = 'SELECT ''' + QUOTENAME(@ColumnName) + ''' AS ''Column'' ' + @CrLf + ' , COUNT(DISTINCT ' + QUOTENAME(@ColumnName) + ') AS ''Cardinality'' ' + @CrLf + ' , LEFT((COUNT(DISTINCT ' + QUOTENAME(@ColumnName) + ') * 1.0) / ' + @CrLf + ' NULLIF(COUNT(*), 0), 6) AS ''Selectivity'' ' + @CrLf + ' FROM ' + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@TableName) + ' ' + @CrLf + ' WHERE ' + QUOTENAME(@ColumnName) + ' IS NOT NULL ' + @CrLf + '' + @CrLf EXEC sp_executesql @Sql FETCH NEXT FROM cur INTO @ColumnName END CLOSE cur DEALLOCATE cur
References
Index columns, selectivity and equality predicates
Query Tuning Fundamentals: Density, Predicates, Selectivity, and Cardinality
Column Statistics Give the Optimizer an Edge
1 comments:
"The lower the selectivity percentage, the higher the selectivity this represents"
This doesn't make sense to me. Running your query on my table the cluster key has a 1.0 and my column with only 12 unique columns out of >200000 gets a much lower value 0.00006393327 but if I add an index on that column SQL will just ignore it.
Post a Comment