SQL Server Queries: Writing queries for performance – Part 2

Writing an efficient query

An efficient query is the result of following a few simple rules and patterns. Part 1 of the series outlined general goals when creating a query. This part of the series gives concrete rules for writing efficient queries.

The Rules

  1. Use the minimum number of tables while meeting business requirements
  2. Only return the columns needed / specify the columns to be used
    1. Never use SELECT *
    2. This reduces network traffic and client processing, and ensures you won’t get unexpected columns back if the base tables are modified later
  3. Add a WHERE clause if possible
    1. Directly reduces I/O and network traffic
    2. Use sargable arguments (arguments that can effectively use an index)
      • =
      • >
      • <
      • <=
      • >=
      • BETWEEN
      • IN
      • LIKE ‘Text%’
    3. Avoid non-sargable arguments (a short sketch follows this list)
      • <>
      • NOT IN
      • NOT LIKE
      • LIKE ‘%text’
      • LIKE ‘%text%’
      • OR
      • Functions applied to columns in the WHERE clause or JOIN
  4. Use “SET NOCOUNT ON” for all stored procedures
  5. Use stored procedures or parameterized queries
    1. Allows your execution plan to be re-used
    2. Especially important when the query is executed multiple times
  6. Use consistent formatting
    1. Create guidelines and be sure your developers follow them
  7. Verify the query plan
    1. Indexes
    2. Seeks versus scans
    3. Implicit conversions
    4. Join types
  8. Avoid cursors / RBAR (row-by-agonizing-row processing)
  9. Avoid query hints and index hints (let the query engine do its job and adjust the plan as the data changes over time)
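
To make rules 2 through 5 concrete, here is a minimal sketch of a stored procedure that follows several of them at once: named columns instead of SELECT *, a sargable WHERE clause, SET NOCOUNT ON and a parameter instead of a hard-coded literal. The table and column names (dbo.Orders, OrderDate and so on) are hypothetical.

CREATE PROCEDURE dbo.GetRecentOrders
    @StartDate date
AS
BEGIN
    SET NOCOUNT ON;    --Rule 4: suppress the extra "rows affected" messages

    SELECT
        OrderID        --Rule 2: name the columns, never SELECT *
        ,CustomerID
        ,OrderDate
        ,TotalDue
    FROM dbo.Orders
    WHERE OrderDate >= @StartDate;    --Rule 3: sargable, an index on OrderDate can be used for a seek
END

--A non-sargable version of the same filter: the function on the column forces a scan
--WHERE YEAR(OrderDate) >= YEAR(@StartDate)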

 

SQL Server Queries: Writing queries for performance – Part 1

Troubleshooting query performance can be a frustrating task for developers. The goal of this series is to break down the steps to analyze a query and provide concrete actions that can be used to create performance-oriented queries.

When trying to improve the performance of a query you will almost always be working on some or all of the following high level tasks:

  • Minimize network traffic
  • Minimize physical I/O
  • Minimize logical I/O
  • Minimize CPU usage
  • Perform set-based operations

Remember – queries are optimized in relation to the current system and database. This means that performance begins before a single line of code is written and before a single byte of data is inserted. Disk setup, network setup, server setup, data model, indexes and other databases on the system all impact performance (this also means that your query impacts the other queries or even entire databases running on the same server).

Writing efficient queries also means less troubleshooting in the future.

I-Tech SQL Server Script Series – Listing Clustered Indexes on All Tables

A good general rule for creating tables is to always have a clustered index. The clustered index defines how data is physically ordered on the disk – that’s why there can only be one clustered index on a table (a table without a clustered index is a HEAP). Even though it’s a good rule, there are common scenarios where a table won’t have one. A typical example is a load or import table, and even then a clustered index may be applied once the load is complete.

When investigating a database it is helpful to understand the clustered indexes. The chosen clustered index can have a large impact on performance. Understanding the clustered indexes is especially crucial since every non-clustered index on a table relies on the clustered index.

SQL Server stores object information, including indexes, in system views. Table information is available in sys.objects and the associated indexes are in sys.indexes. This script lists all of the tables in the current database and their associated clustered index. If a table doesn’t have a clustered index, the IndexName will be NULL. You can use SSMS to look at indexes individually, but listing them via SQL makes it easy to see all of them at once.

SQL Script

SELECT
	@@SERVERNAME		ServerName
	,DB_NAME()			DatabaseName
	,ss.name			SchemaName
	,so.name			TableName
	,si.name			IndexName
	,sc.name			ColumnName
	,st.name			ColumnType
	,CASE
		WHEN st.is_user_defined = 0 THEN st.name
		WHEN st.is_table_type	= 1	THEN AT.name
		WHEN st.is_user_defined = 1	THEN st2.name
	END ColumnBaseType
	,CASE 
		WHEN sc.max_length = -1 THEN 0
		WHEN st.name IN ('nchar','nvarchar') THEN sc.max_length	/ 2
		WHEN st2.name IN ('nchar','nvarchar') THEN sc.max_length	/ 2
		ELSE sc.max_length
	END	ColumnMaxLength
FROM sys.objects so
	INNER JOIN sys.schemas SS
		ON SO.schema_id = SS.schema_id
	LEFT JOIN sys.indexes si
		ON si.object_id = so.object_id
		AND si.type = 1	--clustered index
	LEFT JOIN sys.index_columns sic
		ON si.object_id = sic.object_id
		AND si.index_id = sic.index_id
	LEFT JOIN sys.columns sc
		ON sic.object_id = sc.object_id
		AND sic.column_id = sc.column_id
	LEFT JOIN sys.types st
		ON sc.user_type_id = st.user_type_id
	LEFT JOIN sys.types st2
		ON st.system_type_id		= st2.system_type_id 
		AND st2.system_type_id		= st2.user_type_id
		AND st.is_user_defined		= 1
	LEFT JOIN sys.assembly_types AT
		ON ST.system_type_id	= AT.system_type_id
		AND ST.user_type_id		= AT.user_type_id
WHERE so.type = 'u'
ORDER BY so.name
	,sic.Index_Column_ID
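
If you are specifically looking for heaps, a small variation on the same system views returns only the tables with no clustered index. This is a minimal sketch, not part of the downloadable script.

--List heaps (tables without a clustered index) in the current database
SELECT
    ss.name     SchemaName
    ,so.name    TableName
FROM sys.objects so
    INNER JOIN sys.schemas ss
        ON so.schema_id = ss.schema_id
WHERE so.type = 'U'
    AND NOT EXISTS (SELECT 1
                    FROM sys.indexes si
                    WHERE si.object_id = so.object_id
                        AND si.type = 1)    --1 = clustered index
ORDER BY ss.name
    ,so.name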

Script Download

The script is located here – INDEXES.Clustered.Indexes

It is also located on our script page – I-Tech SQL Server Scripts

I-Tech SQL Server Script Series – Generating BCP Commands

BCP (bulk copy program) is a command-line tool provided with SQL Server for quickly loading and extracting data. BCP extracts or loads data to / from a single table (or from a query) and the result is very much like a table-level backup file. Fewer developers are aware of BCP now that DTS and later SSIS are available, but it is still very useful for moving data around quickly. It’s great for moving files between environments without making a backup of the whole database, and it’s also a good safety net before you make a major change to the data in a table. I use it frequently in development environments and I’ve received data loads from vendors as BCP files because of their ease of use and the speed of loading and extracting.

BCP is very flexible and fast, but since it is a command-line tool it has a minimal interface. This can be intimidating to new SQL developers, and it can also be tedious if you need to extract all the tables in a database. That’s where this script comes in – it generates the BCP statements that you can put into the command line. Just set the variables at the top of the script and it’s ready to go. Run the script from the database you are targeting. The generated commands create or load the BCP files in the current directory; you can easily add a specific directory to the script if that’s what you need.

SQL Script

--Configuration section of the script (declarations added here for completeness;
--the types are assumptions – see the downloadable script for the full version)
DECLARE @SERVER sysname, @EXTENSION varchar(10), @OPTIONS varchar(100), @SCHEMA sysname, @TRUSTED bit
DECLARE @USER sysname, @PASSWORD sysname, @TYPE varchar(10), @INCLUDEDATE bit

SELECT 
    @SERVER          = @@SERVERNAME
    ,@EXTENSION      = 'BCP'
    ,@OPTIONS        = '-n -k -E'
    ,@SCHEMA         = NULL
    ,@TRUSTED        = 1
    ,@USER           = ''
    ,@PASSWORD       = ''
    ,@TYPE           = 'BOTH'	--IN,OUT,BOTH
    ,@INCLUDEDATE    = 1

You can see the options listed above. The -n option saves / loads the file in native (binary) format; this limits how you can load the file, but you don’t need a separate format file. The script also uses -k (keep NULL values) and -E (keep identity values). Change any of the values as needed.
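
The downloadable script handles all of the options above. As a minimal sketch of the generation technique, the following builds a native-format OUT command for every user table in the current database (a trusted connection is assumed, matching the -T flag in the output below).

--Minimal sketch: generate one BCP OUT command per user table
SELECT 'bcp '
    + QUOTENAME(DB_NAME()) + '.' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
    + ' OUT '
    + DB_NAME() + '.' + s.name + '.' + t.name
    + '_' + CONVERT(char(8), GETDATE(), 112)    --yyyymmdd date suffix
    + '.BCP -n -k -E -T -S ' + @@SERVERNAME
FROM sys.tables t
    INNER JOIN sys.schemas s
        ON t.schema_id = s.schema_id
ORDER BY t.name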

Output

Running the script gives you something like this:

bcp [AdventureWorks2012].[Person].[Address] OUT AdventureWorks2012.Person.Address_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Person].[AddressType] OUT AdventureWorks2012.Person.AddressType_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[dbo].[AWBuildVersion] OUT AdventureWorks2012.dbo.AWBuildVersion_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Production].[BillOfMaterials] OUT AdventureWorks2012.Production.BillOfMaterials_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Person].[BusinessEntity] OUT AdventureWorks2012.Person.BusinessEntity_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Person].[BusinessEntityAddress] OUT AdventureWorks2012.Person.BusinessEntityAddress_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Person].[BusinessEntityContact] OUT AdventureWorks2012.Person.BusinessEntityContact_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Person].[ContactType] OUT AdventureWorks2012.Person.ContactType_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Person].[CountryRegion] OUT AdventureWorks2012.Person.CountryRegion_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Sales].[CountryRegionCurrency] OUT AdventureWorks2012.Sales.CountryRegionCurrency_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Sales].[CreditCard] OUT AdventureWorks2012.Sales.CreditCard_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Production].[Culture] OUT AdventureWorks2012.Production.Culture_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Sales].[Currency] OUT AdventureWorks2012.Sales.Currency_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Sales].[CurrencyRate] OUT AdventureWorks2012.Sales.CurrencyRate_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Sales].[Customer] OUT AdventureWorks2012.Sales.Customer_20140804.BCP -n -k -E -T  -S .
...
...
...
bcp [AdventureWorks2012].[Production].[UnitMeasure] OUT AdventureWorks2012.Production.UnitMeasure_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Purchasing].[Vendor] OUT AdventureWorks2012.Purchasing.Vendor_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Production].[WorkOrder] OUT AdventureWorks2012.Production.WorkOrder_20140804.BCP -n -k -E -T  -S .
bcp [AdventureWorks2012].[Production].[WorkOrderRouting] OUT AdventureWorks2012.Production.WorkOrderRouting_20140804.BCP -n -k -E -T  -S .

Results

Running this is very fast. Paste the commands into a command / batch file, or put them directly into a command prompt after changing to the target directory. Some sample output:

C:\Temp\AdventureWorks>bcp [AdventureWorks2012].[Production].[WorkOrderRouting]
OUT AdventureWorks2012.Production.WorkOrderRouting_20140804.BCP -n -k -E -T  -S
.

Starting copy...
1000 rows successfully bulk-copied to host-file. Total received: 1000
1000 rows successfully bulk-copied to host-file. Total received: 2000
1000 rows successfully bulk-copied to host-file. Total received: 3000
1000 rows successfully bulk-copied to host-file. Total received: 4000
1000 rows successfully bulk-copied to host-file. Total received: 5000
...
...
...
1000 rows successfully bulk-copied to host-file. Total received: 65000
1000 rows successfully bulk-copied to host-file. Total received: 66000
1000 rows successfully bulk-copied to host-file. Total received: 67000

67131 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total     : 1092   Average : (61475.27 rows per sec.)

Auto-generating the commands makes using BCP even easier. If you need to save space you can zip / gzip the files and get decent compression depending on your database. I’ve seen 5:1 or 10:1 compression, but this varies based on the type of data and your specific tables.

Script

The script is located here – ADMIN.BCP.Create.In.Out

It is also located on our script page – I-Tech SQL Server Scripts

SQL Server – Looking at Currently Running Processes and Queries

SQL Server has a number of ways to see what is happening on the server. If you just need a quick look you have a few options.

Activity Monitor

The Activity Monitor is a great way to get a quick overview. It has an Overview (graphs), Current processes, Resource Waits grouped by type, Data File I/O and Recent Expensive Queries. This is a good place to start if you are new to SQL Server.
Activity Monitor

Activity Monitor Processes

System Stored Procedures

Before the Activity Monitor was available there were a couple of system stored procedures that would give you the highlights – really just enough information to figure out where you needed to start digging for more information.

EXEC sp_who

sp_who
And the updated version:

EXEC sp_who2

sp_who2

These are both good at giving basic information – you get the SPID, status, login name, whether the command is being blocked and which SPID is blocking it, the database and the command type. You can see that sp_who2 gives a little more information: CPU time, disk I/O, last batch date and program name.

If you want to see the actual SQL command that is running you can use DBCC INPUTBUFFER. The SPID of the process you want to view is the parameter – 85 in the example below. You can see my test query was running in another window.

dbcc inputbuffer(85)
EventType      Parameters EventInfo
-------------- ---------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Language Event 0          
DECLARE @Count int = 0

WHILE @Count < 100
BEGIN
	SELECT COUNT(*)
	FROM sys.all_objects

	WAITFOR DELAY '00:00:05'
	SELECT @Count += 1
END

(1 row(s) affected)

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

System Views

There are system dynamic management views and system compatibility views that give even more detailed information. sys.sysprocesses returns the same processes that sp_who2 does, along with additional columns.

SELECT *
FROM sys.sysprocesses

sys.sysprocesses

There is a lot of information available in that view. Usually you don’t want everything in there, and there are a few additional items that would be useful, so I join in sys.dm_exec_requests. The following query shows everything, but I am adding it for cpu_time, total_elapsed_time, reads, writes, logical_reads, sql_handle and plan_handle.

SELECT *
FROM sys.sysprocesses SP
    LEFT JOIN sys.dm_exec_requests ER
        ON SP.spid = ER.session_id

The reason the sql_handle and the plan_handle are important is to get detailed information about the query and the query plan.

SELECT
    spid
    ,DB_NAME(SP.dbid)    DBName
    ,nt_username    NTUser
    ,sp.Loginame    LoginName
    ,Hostname
    ,nt_domain
    ,ER.status        QueryStatus
    ,ER.command        QueryType
    ,ER.blocking_session_id    BlockingSessionID
    ,program_name        ProgramName
    ,start_time
    ,SP.cpu                CPUTime
    ,SP.physical_io        DiskIO
    ,SP.memusage            MemoryUsage
    ,lastwaittype
    ,SUBSTRING(ST.text, (ER.statement_start_offset/2)+1, 
        ((CASE ER.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE ER.statement_end_offset
        END - ER.statement_start_offset)/2) + 1)    RunningQuery
    ,ST.text        ParentQuery
    ,CONVERT(xml,(SELECT 1 AS Tag,NULL AS Parent, text [CurrentProcesses!1!Query!cdata] FROM sys.dm_exec_sql_text(ER.sql_handle)  FOR XML EXPLICIT))    ParentQueryXML
    ,QP.query_plan
FROM sys.sysprocesses SP WITH(NOLOCK)
    LEFT JOIN sys.dm_exec_requests ER WITH(NOLOCK)
        ON ER.session_id = sp.spid
    OUTER APPLY sys.dm_exec_sql_text(ER.sql_handle) AS ST
    OUTER APPLY sys.dm_exec_query_plan(ER.plan_handle) AS QP
WHERE spid <> @@SPID --Exclude the session this query is running in.
ORDER BY spid

sys.sysprocesses with additions

This version has a few things in it that make troubleshooting easier. The text of the query is added, the query plan is there (if available) and the query text is also available as XML in case it is large.

Resources

This script can also be found on our script page or downloaded below. The downloadable version also excludes all system processes.

PROCESSES.Current.Runnable.Statements
I-Tech SQL Server Scripts

Adding Limiting Criteria to a SQL OUTER JOIN

When developers first start using and writing SQL queries, JOIN syntax can be confusing. There are many explanations of the basic syntax available all over the web, so this post doesn’t reiterate those. Instead, it explains what happens when you start adding limiting criteria to your JOIN.

Even if the basic INNER and OUTER (LEFT and RIGHT) join syntax is understood, adding limiting criteria to the JOIN has tripped up more than one developer.

Examples

The following examples use the AdventureWorksLT sample database. It is currently available here – SQL Server Sample Databases.

USE AdventureWorksLT

 

Basic INNER JOIN

This is a basic INNER JOIN, so only matching rows are returned. If a customer doesn’t have any orders in the database it won’t be displayed. This returns 32 results.

SELECT *
FROM SalesLT.Customer C
    INNER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID

EXAMPLE.JOIN.OUTER.With.Criteria.Example 1

Basic OUTER JOIN

This is a basic OUTER JOIN, so all rows are returned from the primary table (Customer) and only the matching rows from the JOINed table. This returns 847 results.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID

EXAMPLE.JOIN.OUTER.With.Criteria.Example 2

OUTER JOIN With Criteria – try 1

So now you have your basic query created and you need to restrict the rows that are returned. If you get a requirement to exclude orders over $1,000 but still show all of the customer records, you need an OUTER JOIN. If you are new to SQL, or you forget that the TotalDue column is in the OUTER table, you might put the exclusion criteria in the WHERE clause. This example returns 8 rows.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID
WHERE TotalDue < 1000

EXAMPLE.JOIN.OUTER.With.Criteria.Example 3

OUTER JOIN With Criteria – try 2

Putting the exclusion criteria in the WHERE clause was too restrictive. Another way I’ve seen developers approach this problem is to check for NULLs. Checking for NULL takes care of part of the problem, but you can see this only returns 823 rows, so some customers are still excluded.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID
WHERE (TotalDue < 1000
        OR TotalDue IS NULL)

EXAMPLE.JOIN.OUTER.With.Criteria.Example 4

OUTER JOIN With Criteria – Solution

The last option is to put the exclusion criteria in the JOIN. This query meets the requirement – all of the customer records are returned, along with only the orders with a total due of less than $1,000. It returns 847 results again – exactly what is expected.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID
        AND TotalDue < 1000

EXAMPLE.JOIN.OUTER.With.Criteria.Example 5

OUTER JOIN Equivalent to an INNER JOIN

The last example shows what happens when the WHERE clause checks a column from the JOINed table with IS NOT NULL. If the check is on the JOIN key (SOH.CustomerID), the query is equivalent to an INNER JOIN. If you check another column (TotalDue), it is equivalent to an INNER JOIN plus a WHERE clause specifying that the column cannot be NULL. In this example TotalDue is always populated, so it is the same as an INNER JOIN; if TotalDue could be (and was) NULL, additional records would be missing. The example below returns 32 results.

SELECT *
FROM SalesLT.Customer C
    LEFT OUTER JOIN SalesLT.SalesOrderHeader SOH
        ON C.CustomerID = SOH.CustomerID
WHERE TotalDue IS NOT NULL

EXAMPLE.JOIN.OUTER.With.Criteria.Example 6

You can see that the query is processed as an INNER JOIN by the query processor too – so it is not only logically equivalent, it is identical as far as the query engine is concerned. Look at the join operator in the middle of both plans: it is a Merge Join (Inner Join) in each.

INNER JOIN Query Plan (partial)
EXAMPLE.JOIN.OUTER.With.Criteria.Example 7 - INNER JOIN Query Plan

OUTER JOIN Query Plan (partial)
EXAMPLE.JOIN.OUTER.With.Criteria.Example 8 - LEFT JOIN Query Plan

Explanation

So why does an OUTER JOIN behave like an INNER JOIN sometimes and not others? When you put your limiting criteria inside the OUTER JOIN syntax it is part of the JOIN, so any records it filters out only impact that JOIN and the table being JOINed. You only want orders placed in Iowa but you want to see all customers? No problem – put the criteria in the JOIN (AND State = ‘IA’). You only want to see orders placed in Iowa and the customers for those orders? No problem again – just put the criteria in the WHERE clause (WHERE State = ‘IA’). Both variants are sketched below.

If you remember that the criteria in the WHERE clause impacts ALL records, it’s easy to know where to put the limiting clause.
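
Here are the two Iowa variants side by side. The Customer / Orders tables and the State column are hypothetical – AdventureWorksLT keeps state on a separate address table – so treat this as an illustration of the pattern rather than a runnable AdventureWorksLT query.

--All customers, with only their Iowa orders attached (criteria in the JOIN)
SELECT *
FROM dbo.Customer C
    LEFT OUTER JOIN dbo.Orders O
        ON C.CustomerID = O.CustomerID
        AND O.State = 'IA'

--Only Iowa orders and only the customers who placed them (criteria in the WHERE)
SELECT *
FROM dbo.Customer C
    LEFT OUTER JOIN dbo.Orders O
        ON C.CustomerID = O.CustomerID
WHERE O.State = 'IA'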

Changing SQL Server Database Owners

In a production SQL Server environment the database owner is usually carefully maintained and consistent. This isn’t always the case on development servers or on servers that you may inherit from other organizations. This script was written originally for a development environment with multiple database owners – one of the developers left the project and their user account needed to be removed.

If you just need to change a single database owner use the SQL command (change the database name and owner of course):

ALTER AUTHORIZATION ON DATABASE::[AdventureWorks2012] TO [sa]

If you need to change the owner on multiple databases or if you just want to see the current owners on the databases you can use the script referenced here – ADMIN.Database.Owners.Change.
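
As a minimal sketch of the approach (the downloadable script may differ), you can generate the statements from sys.databases and review them before running. The target owner [sa] is just an example.

--Generate ALTER AUTHORIZATION commands for databases not owned by sa
SELECT
    name                       DatabaseName
    ,SUSER_SNAME(owner_sid)    CurrentOwner
    ,'ALTER AUTHORIZATION ON DATABASE::' + QUOTENAME(name) + ' TO [sa]'    SqlCommand
FROM sys.databases
WHERE SUSER_SNAME(owner_sid) <> 'sa'
    AND database_id > 4    --skip the system databases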

Check out the SQL Server BOL entry on ALTER AUTHORIZATION to see the security requirements and other restrictions.

Be sure to check out our other SQL Server scripts here – I-Tech SQL Server Scripts.

Finding A Unique Address Key And Matching Addresses

We do a lot of work with addresses. A lot of addresses. One of the big problems we run into on our projects is determining if an address is unique. Another common issue is matching addresses from different sources. This may sound trivial, but there are a few things about US addresses that make this more difficult (or interesting if you really like addresses):

  • The USPS allows alternate city names for the same address
  • The USPS allows alternate street names for the same address
  • People input addresses in many different ways and they need to be standardized (Street vs. St., Avenue vs. Ave, Apartment vs. Unit vs. #, etc.) or you can’t do a string comparison

Terminology

First, some terminology. Different sources give different names to the same data elements. The following is a list of some common alternate names for address elements.

Field          Alternate Names
House Number   Street Number
Direction      Pre-Directional
Street Name    Street
Mode           Street Suffix
Quadrant       Post-Directional
Unit Type      Secondary Designation
Unit Number    Secondary Number, Apartment Number
City           City
State          State
Zip5           Zip, Postal Code
Zip4           Zip+4, Zip Addon

If you have worked with SQL for any length of time you know that having a good key is critical for matching records from different tables or sources. Matching data based on addresses can get very messy – many fields in the key, multiple possible (valid) versions of what should be the same data, input errors, and data that changes over time. To get around these issues the data needs to be cleaned and standardized.

Steps for Assigning a Unique Key to Your Existing Addresses

  1. Prepare addresses – scrub all addresses and only include addresses that successfully DPV in your solution. DPV (Delivery Point Validation) means that the address is deliverable by the USPS. This step also standardizes the data.
  2. Remove duplicates – you don’t want different records pointing to the same property, or you can end up with a Cartesian join in your process or fail to match addresses that should match.
  3. Assign an artificial key to your address – this is easier to store and much easier to match on.
  4. Match disparate sources

Preparing Addresses

The goal of this step is to get the address into a standard format, parse the address into the component parts and validate that the address is an actual, deliverable address. The reason for validating the address should be fairly obvious – if an address doesn’t point to a property it isn’t useful for anything. Parsing the address and standardizing the components makes it possible to remove duplicates.

When verifying one or two addresses it’s relatively easy to use the USPS website. It checks the address, corrects the zip code and adds the zip+4. It won’t change the city to the preferred city – if you use an acceptable city it leaves it as-is. Although this isn’t useful for automating validation, it is very useful for spot-checking individual addresses.

If you need to process hundreds, thousands, or – like some of our projects – millions of addresses, you must automate your processing. The best, fastest and cheapest way to do this is to purchase third-party address validation software. You want software that performs CASS and DPV certification. We have used address scrubbing software from a handful of vendors. They each have their advantages and disadvantages, so look at their specifications carefully and choose the right one for your project. Prices vary quite a bit. We have been able to scrub about 5 million rows an hour on a Windows machine busy doing other things using one of the “cheaper” versions, but some of the “Enterprise” versions have other capabilities, can be tied into an existing infrastructure or come from a vendor you may already use.

Whichever software you use to scrub your addresses, once the run is complete and the “clean” addresses are available you are ready for the next step.

Remove Duplicates

To remove duplicates from addresses you need to understand how USPS addresses work and how your address software works.

As mentioned at the top of this post, the USPS has different categories for cities and streets. Both have preferred versions and acceptable versions. For example, West Des Moines is a preferred city name but WDM is accepted by the USPS and your mail will get delivered if you use that version. If you try to do a string comparison between “West Des Moines” and “WDM” you know that it will never match. So what’s the solution? Easy – most address scrubbing software gives you the option of always returning the preferred city / street or returning the input city / street if they are on the acceptable list. Use the preferred version. The other way to get around this is to skip the city when creating your key. Use Zip+4 instead.

The following fields are needed for a natural unique address key.

  • House Number
  • Direction
  • Street Name
  • Mode
  • Quadrant
  • Unit Number
  • Zip 5
  • City OR Zip 4

Do a DISTINCT query based on the above fields, or use ROW_NUMBER() OVER (PARTITION BY ...) to remove duplicates:

 

Example using DISTINCT

This is straightforward and easy to use. It gives you a distinct list of rows based on the columns specified.

--Example of getting a distinct address list
SELECT DISTINCT
    HouseNumber
    ,Direction
    ,StreetName
    ,Mode
    ,Quadrant
    ,UnitNumber
    ,Zip5
    ,Zip4
FROM dbo.SampleAddressTable

Example Using A CTE

This is a little more complicated than using DISTINCT, but it gives you more flexibility. You can include additional columns that don’t have to be distinct. You can JOIN the distinct list to other tables. You can perform additional checks before you use the data. Which one you use depends on your needs.

;
WITH ADDRESS_CTE AS (
    SELECT
        HouseNumber
        ,Direction
        ,StreetName
        ,Mode
        ,Quadrant
        ,UnitNumber
        ,Zip5
        ,Zip4
        ,ROW_NUMBER() OVER (PARTITION BY
                            HouseNumber
                            ,Direction
                            ,StreetName
                            ,Mode
                            ,Quadrant
                            ,UnitNumber
                            ,Zip5
                            ,Zip4
                        ORDER BY
                            HouseNumber
                            ,Direction
                            ,StreetName
                            ,Mode
                            ,Quadrant
                            ,UnitNumber
                            ,Zip5
                            ,Zip4
                        ) SELECT_CRITERIA
    FROM dbo.SampleAddressTable
)

SELECT
    *
FROM ADDRESS_CTE
WHERE SELECT_CRITERIA = 1

Once duplicates are gone you have a good base set of addresses. These are the addresses that you will be matching to your other sources.

Assign A Key

The easiest way to use your freshly scrubbed and deduped data is to put it into a table. Then you can reference it from other tables easily. Although there is a natural key in the data (remember the fields listed above that make an address unique?) it is much easier to assign an integer key to the data. It makes your JOINs easier, it makes your indexes smaller (if you also use it as the CLUSTERED INDEX) and it is generally easier to use.

After you have created your key (and probably made it CLUSTERED), I would also create a UNIQUE INDEX on the natural key. This ensures no duplicates get added accidentally and will help your JOINs when you add new addresses later.

Another important thing to note is that you need to use the parsed address for this to work correctly. If you use something like the “mailing address” or “address line 1” with the house number, street name, mode, etc. combined you will have a very difficult time getting rid of duplicates.
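
Putting this together, here is a minimal sketch of the address table described above, using the parsed field names from earlier. The column sizes are assumptions – size them for your own data.

CREATE TABLE dbo.Address (
    AddressKey      int IDENTITY(1,1)  NOT NULL
    ,HouseNumber    varchar(10)        NOT NULL
    ,Direction      varchar(2)         NOT NULL
    ,StreetName     varchar(50)        NOT NULL
    ,Mode           varchar(10)        NOT NULL
    ,Quadrant       varchar(2)         NOT NULL
    ,UnitNumber     varchar(10)        NOT NULL
    ,Zip5           char(5)            NOT NULL
    ,Zip4           char(4)            NOT NULL
    ,CONSTRAINT PK_Address PRIMARY KEY CLUSTERED (AddressKey)
)

--The UNIQUE INDEX on the natural key prevents accidental duplicates
--and supports the JOINs used when matching new sources
CREATE UNIQUE INDEX UX_Address_NaturalKey
    ON dbo.Address (HouseNumber, Direction, StreetName, Mode,
                    Quadrant, UnitNumber, Zip5, Zip4)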

Match Sources

Now you can match addresses from other sources. Just run them through your scrubbing process and do a JOIN based on all of the address fields. You can then store the address key created above with your new source, and add any new addresses you find to your address table.

If you have scrubbed, parsed and deduped your data this will be a relatively minor task.
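
A minimal sketch of the matching step, assuming the scrubbed source lands in a staging table with the same parsed columns (dbo.ScrubbedSource and its SourceID column are hypothetical names):

--Attach the artificial address key to a freshly scrubbed source
SELECT
    S.SourceID
    ,A.AddressKey
FROM dbo.ScrubbedSource S
    INNER JOIN dbo.Address A
        ON S.HouseNumber = A.HouseNumber
        AND S.Direction = A.Direction
        AND S.StreetName = A.StreetName
        AND S.Mode = A.Mode
        AND S.Quadrant = A.Quadrant
        AND S.UnitNumber = A.UnitNumber
        AND S.Zip5 = A.Zip5
        AND S.Zip4 = A.Zip4

--Source rows with no match are new addresses: insert them into dbo.Address,
--then re-run the match so every row gets a key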

Other Considerations

Re-scrubbing

Postal data can change over time. Zip codes can change, and so can street names or even city borders. It’s a good idea to re-scrub address data periodically and update fields to their latest preferred versions. You may also want to delete addresses that become invalid.

Clean Input

Only putting valid addresses into your system of record is much easier than trying to figure out the correct address later. Tight controls on your input take more time at the start of a project, but they give much better data and almost no invalid addresses if everything is verified up front.