Tag Archives: SQL

I-Tech SQL Server Script Series – Listing Clustered Indexes on All Tables

A good general rule for creating tables is to always have a clustered index. The clustered index defines how data is physically ordered on the disk – and that’s why there can only be one clustered index on a table (a table without a clustered index is a HEAP). Even though it’s a good rule to have a clustered index there are common scenarios when you won’t have one on a table. A typical example is a load or import table, but even then it may have a clustered index applied when the load is complete.

When investigating a database it is helpful to understand the clustered indexes. The chosen clustered index can have a large impact on performance. Understanding the clustered indexes is especially crucial since every non-clustered index on a table relies on the clustered index.

SQL Server stores object information, including indexes, in system views. Table information is available in sys.objects and the associated indexes are in sys.indexes. This script list all of the tables in the current database and their associated clustered index. If the table doesn’t have a clustered index it will have NULL for the IndexName. You can use SSMS to look at indexes individually, but listing them via SQL makes it easy to see all of them at once.

SQL Script

SELECT
	@@SERVERNAME		ServerName
	,DB_NAME()			DatabaseName
	,ss.name			SchemaName
	,so.name			TableName
	,si.name			IndexName
	,sc.name			ColumnName
	,st.name			ColumnType
	,CASE
		WHEN st.is_user_defined = 0 THEN st.name
		WHEN st.is_table_type	= 1	THEN AT.name
		WHEN st.is_user_defined = 1	THEN st2.name
	END ColumnBaseType
	,CASE 
		WHEN sc.max_length = -1 THEN 0
		WHEN st.name IN ('nchar','nvarchar') THEN sc.max_length	/ 2
		WHEN st2.name IN ('nchar','nvarchar') THEN sc.max_length	/ 2
		ELSE sc.max_length
	END	ColumnMaxLength
FROM sys.objects so
	INNER JOIN sys.schemas SS
		ON SO.schema_id = SS.schema_id
	LEFT JOIN sys.indexes si
		ON si.object_id = so.object_id
		AND si.type = 1	--clustered index
	LEFT JOIN sys.index_columns sic
		ON si.object_id = sic.object_id
		AND si.index_id = sic.index_id
	LEFT JOIN sys.columns sc
		ON sic.object_id = sc.object_id
		AND sic.column_id = sc.column_id
	LEFT JOIN sys.types st
		ON sc.user_type_id = st.user_type_id
	LEFT JOIN sys.types st2
		ON st.system_type_id		= st2.system_type_id 
		AND st2.system_type_id		= st2.user_type_id
		AND st.is_user_defined		= 1
	LEFT JOIN sys.assembly_types AT
		ON ST.system_type_id	= AT.system_type_id
		AND ST.user_type_id		= AT.user_type_id
WHERE so.type = 'u'
ORDER BY so.name
	,sic.Index_Column_ID

Script Download

The script is located here – INDEXES.Clustered.Indexes

It is also located on our script page – I-Tech SQL Server Scripts

Adding Limiting Criteria to a SQL OUTER JOIN

When developers first start using and writing SQL queries, JOIN syntax can be confusing. There are many explanations about the basic syntax available all over the web so this post doesn’t reiterate those. This post explains what happens when you start adding limiting criteria to your JOIN.

Even if the basic INNER and OUTER (LEFT and RIGHT) join syntax is understood, adding limiting criteria to the JOIN has tripped up more than one developer.

Examples

The following examples use the AdventureWorksLT sample database. It is currently available here – SQL Server Sample Databases.

USE AdventureWorksLT

 

Basic INNER JOIN

This is a basic INNER JOIN, so only matching rows are returned. If a customer doesn’t have any orders in the database it won’t be displayed. This returns 32 results.

SELECT *
FROM SalesLT.Customer C
    INNER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID

EXAMPLE.JOIN.OUTER.With.Criteria.Example 1

Basic OUTER JOIN

This is a basic OUTER JOIN, so all rows are returned from the primary table (Customer) and only the matching rows from the JOINed table. This returns 847 results.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID

EXAMPLE.JOIN.OUTER.With.Criteria.Example 2

OUTER JOIN With Criteria – try 1

So now you have your basic query created and you need to restrict the rows that are returned. If you get a requirement stating to exclude orders over $1,000 but you still want to see all of the Customer records you need to use an OUTER join. If you are new to SQL or you forget that the Total Due column is in the OUTER table you might put the exclusion criteria in the WHERE clause. This example returns 8 rows.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID
WHERE TotalDue < 1000

EXAMPLE.JOIN.OUTER.With.Criteria.Example 3

OUTER JOIN With Criteria – try 2

Putting the exclusion criteria in the WHERE clause was too restrictive. Another way I’ve seen developers approach this problem is to check for NULLs. Checking for NULL does take care of part of the problem, but you can see this only returns 823 rows, so some customers are still excluded.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID
WHERE (TotalDue < 1000
        OR TotalDue IS NULL)

EXAMPLE.JOIN.OUTER.With.Criteria.Example 4

OUTER JOIN With Criteria – Solution
The last option remaining is to put the exclusion criteria in the JOIN. This query matches the criteria – all of the customer records are returned and only orders with total due less than $1,000. This returns 847 results again – exactly what is expected.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID
        AND TotalDue < 1000

EXAMPLE.JOIN.OUTER.With.Criteria.Example 5

OUTER JOIN Equivalent to an INNER JOIN

The last example shows what happens when the query has criteria in WHERE clause checking for a column in the JOIN table – IS NOT NULL. If the check is on the JOIN key field (SOH.CustomerID) it will be equivalent to an INNER JOIN. If you check another field (TotalDue) it is equivalent to an INNER JOIN with the addition of a WHERE clause specifying that the field also cannot be NULL. In this example TotalDue is always populated so it is the same as an INNER JOIN. If TotalDue could be NULL and was null – it would be missing additional records. The example below returns 32 results.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID
WHERE TotalDue IS NOT NULL

EXAMPLE.JOIN.OUTER.With.Criteria.Example 6

You can see that the query is processed as an INNER JOIN by the query processor too – so not only is it logically equivalent it is identical to the query engine. Look at the join graphic in the middle in both examples. It is a Merge Join (Inner Join) in both.

INNER JOIN Query Plan (partial)
EXAMPLE.JOIN.OUTER.With.Criteria.Example 7 - INNER JOIN Query Plan

OUTER JOIN Query Plan (partial)
EXAMPLE.JOIN.OUTER.With.Criteria.Example 8 - LEFT JOIN Query Plan

Explanation

So why does an OUTER JOIN behave like an INNER JOIN sometimes and not others? When you put your limiting criteria inside of the OUTER JOIN syntax it is part of the JOIN. So if any records are limited it only impacts that JOIN and the table being JOINed. You only want orders placed in Iowa but you want to see all customers? No problem – put the criteria in the JOIN (AND State = ‘IA’). You only want to see orders placed in Iowa and the customers for those orders? No problem again – just put the criteria in the WHERE ( WHERE State = ‘IA).

If you remember that the criteria in the WHERE clause impacts ALL records it’s easy to know where to put the limiting clause.

Finding A Unique Address Key And Matching Addresses

We do a lot of work with addresses. A lot of addresses. One of the big problems we run into on our projects is determining if an address is unique. Another common issue is matching addresses from different sources. This may sound trivial, but there are a few things about US addresses that make this more difficult (or interesting if you really like addresses):

  • The USPS allows alternate city names for the same address
  • The USPS allows alternate street names for the same address
  • People input addresses in many different ways and they need to be standardized (Street vs. St., Avenue vs. Ave, Apartment vs. Unit vs. #, etc.) or you can’t do a string comparison

Terminology

First, some terminology. Different sources give different names to the same data elements. The following is a list of some common alternate names for address elements.

Field Alternate Names
House Number Street Number
Direction Pre-Directional
Street Name Street
Mode Street Suffix
Quadrant Post-Directional
Unit Type Secondary Designation
Unit Number Secondary NumberApartment Number
City City
State State
Zip5 ZipPostal Code
Zip4 Zip+4Zip Addon

If you have worked with SQL for any length of time you know that having a good key is critical for matching records from different tables or sources. Matching data based on addresses can get very messy – many fields in the key, multiple possible (valid) versions of what should be the same data, input errors and the data can change over time. To get around these issues the data needs to be cleaned and standardized.

Steps for Assigning a Unique Key to Your Existing Addresses

  1. Prepare addresses – scrub all addresses and only include addresses that successfully DPV in your solution. DPV (Delivery Point Validation) means that the address is deliverable by the USPS. This step also standardizes the data.
  2. Remove duplicates – you don’t want different records pointing to the same property or you can have a Cartesian join in your process or you may not match addresses that should match.
  3. Assign an artificial key to your address – this is easier to store and much easier to match on.
  4. Match disparate sources

Preparing Addresses

The goal of this step is to get the address into a standard format, parse the address into the component parts and validate that the address is an actual, deliverable address. The reason for validating the address should be fairly obvious – if an address doesn’t point to a property it isn’t useful for anything. Parsing the address and standardizing the components makes it possible to remove duplicates.

When verifying one or two addresses it’s relatively easy to go to the USPS website and verify your address. It checks the address, corrects the zip code and adds the zip+4. It won’t change the city to the preferred city – If you use an acceptable city it leaves it as-is. Although this isn’t useful for automating validation it is very useful for spot-checking individual addresses.

If you need to process hundreds, thousands, or – like some of our projects, millions of addresses you must automate your processing. The best, fastest and cheapest way to do this is purchase third-party address validation software. You want software that performs CASS and DPV certification. We have used address scrubbing software from a handful of vendors. They each have their advantages and disadvantages so look at their specifications carefully and choose the right one for your project. Prices vary quite a bit. We have been able to successfully scrub about 5 million rows an hour on a Windows machine busy doing other things using one of the “cheaper” versions, but some of the “Enterprise” versions have other capabilities or can be tied into an existing infrastructure or come from a vendor you may already use.

Whichever software you use to scrub you addresses, when this has been completed and the “clean” addresses are available you are ready for the next step.

Remove Duplicates

To remove duplicates from addresses you need to understand how USPS addresses work and how your address software works.

As mentioned at the top of this post, the USPS has different categories for cities and streets. Both have preferred versions and acceptable versions. For example, West Des Moines is a preferred city name but WDM is accepted by the USPS and your mail will get delivered if you use that version. If you try to do a string comparison between “West Des Moines” and “WDM” you know that it will never match. So what’s the solution? Easy – most address scrubbing software gives you the option of always returning the preferred city / street or returning the input city / street if they are on the acceptable list. Use the preferred version. The other way to get around this is to skip the city when creating your key. Use Zip+4 instead.

The following fields are needed for a natural unique address key.

  • House Number
  • Direction
  • Street Name
  • Mode
  • Quadrant
  • Unit Number
  • Zip 5
  • City OR Zip 4

Do a distinct query based on the above fields or use a ROW_NUMBER() OVER(PARTITION) to remove duplicates:

 

Example using DISTINCT

This is straight forward and easy to use. It gives you a distinct list of rows based on the columns specified.

--Example of getting a distinct address list
SELECT DISTINCT
    HouseNumber
    ,Direction
    ,StreetName
    ,Mode
    ,Quadrant
    ,UnitNumber
    ,Zip5
    ,Zip4
FROM dbo.SampleAddressTable

Example Using A CTE

This is a little more complicated that using DISTINCT, but it gives you more flexibility. You can include additional columns but they don’t have to be distinct. You can JOIN to other tables on the distinct list. You can perform additional checks before you use the data. Which one you use depends on your needs.

;
WITH ADDRESS_CTE AS (
    SELECT
        HouseNumber
        ,Direction
        ,StreetName
        ,Mode
        ,Quadrant
        ,UnitNumber
        ,Zip5
        ,Zip4
        ,ROW_NUMBER OVER(PARTITION BY
                            HouseNumber
                            ,Direction
                            ,StreetName
                            ,Mode
                            ,Quadrant
                            ,UnitNumber
                            ,Zip5
                            ,Zip4
                        ORDER BY
                            HouseNumber
                            ,Direction
                            ,StreetName
                            ,Mode
                            ,Quadrant
                            ,UnitNumber
                            ,Zip5
                            ,Zip4
                        ) SELECT_CRITERIA
    FROM dbo.SampleAddressTable
)

SELECT
    *
FROM ADDRESS_CTE
WHERE SELECT_CRITERIA = 1

Once duplicates are gone you have a good base set of addresses. These are the addresses that you will be matching to your other sources.

Assign A Key

The easiest way to use your freshly scrubbed and deduped data is to put it into a table. Then you can reference it from other tables easily. Although there is a natural key in the data (remember the fields listed above that make an address unique?) it is much easier to assign an integer key to the data. It makes your JOINs easier, it makes your indexes smaller (if you also use it as the CLUSTERED INDEX) and it is generally easier to use.

After you have created your key (and probably made it CLUSTERED) I would also make a UNIQUE INDEX on the natural key. This insures no duplicates get added accidentally and will help your JOINs when you add new addresses later.

Another important thing to note is that you need to use the parsed address for this to work correctly. If you use something like the “mailing address” or “address line 1” with the house number, street name, mode, etc. combined you will have a very difficult time getting rid of duplicates.

Match Sources

Now you can match addresses from other sources. Just run them through your scrubbing process and do a JOIN based on all of the address fields. You can then store the address key you created above to your new source. You can also add any new addresses you find when you scrub them to your address table.

If you have scrubbed, parsed and deduped your data this will be a relatively minor task.

Other Considerations

Re-scrubbing

Postal data has the possibility to change over time. Zip codes can change and so can street names or even city borders. it’s a good idea to re-scrub address data on a periodic basis and update fields to their latest preferred versions. You may consider deleting addresses that become invalid too.

Clean Input

Only putting valid addresses into your system of record is much easier than trying to figure out the correct address later. Tight controls on your input takes more time at the start of a project but gives much better data and almost no invalid addresses if they are verified up front.