We do a lot of work with addresses. A lot of addresses. One of the big problems we run into on our projects is determining if an address is unique. Another common issue is matching addresses from different sources. This may sound trivial, but there are a few things about US addresses that make this more difficult (or interesting if you really like addresses):
- The USPS allows alternate city names for the same address
- The USPS allows alternate street names for the same address
- People input addresses in many different ways and they need to be standardized (Street vs. St., Avenue vs. Ave, Apartment vs. Unit vs. #, etc.) or you can’t do a string comparison
First, some terminology. Different sources give different names to the same data elements. The following is a list of some common alternate names for address elements.
| Element Name | Alternate Names |
|---|---|
| House Number | Street Number |
| Unit Type | Secondary Designation |
| Unit Number | Secondary Number, Apartment Number |
If you have worked with SQL for any length of time you know that having a good key is critical for matching records from different tables or sources. Matching data based on addresses can get very messy – many fields in the key, multiple possible (valid) versions of what should be the same data, input errors and the data can change over time. To get around these issues the data needs to be cleaned and standardized.
Steps for Assigning a Unique Key to Your Existing Addresses
- Prepare addresses – scrub all addresses and only include those that pass DPV (Delivery Point Validation), meaning the USPS considers the address deliverable. This step also standardizes the data.
- Remove duplicates – you don’t want different records pointing to the same property; duplicates can cause a Cartesian join in your process or keep addresses from matching when they should.
- Assign an artificial key to your address – this is easier to store and much easier to match on.
- Match disparate sources
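As a rough sketch, the four steps above might chain together like this. The `scrub_and_validate` function is a hypothetical placeholder for whatever third-party CASS/DPV software you use, not a real API:

```python
# Sketch of the four-step pipeline. scrub_and_validate() stands in for
# your third-party scrubbing software -- it is a hypothetical placeholder.

def scrub_and_validate(raw):
    """Placeholder: return a parsed, standardized address dict,
    or None if the address fails DPV."""
    parts = raw.upper().split(",")
    if len(parts) < 2:
        return None  # not deliverable / unparseable
    return {"house_number": parts[0].split()[0],
            "street_name": " ".join(parts[0].split()[1:]),
            "zip5": parts[-1].strip()}

def dedupe(addresses):
    """Step 2: collapse records that share the natural key."""
    seen = {}
    for addr in addresses:
        key = (addr["house_number"], addr["street_name"], addr["zip5"])
        seen.setdefault(key, addr)
    return list(seen.values())

def assign_keys(addresses):
    """Step 3: attach an artificial integer key."""
    return {i + 1: addr for i, addr in enumerate(addresses)}

raw_input = ["123 Main St, 50265", "123 MAIN ST, 50265", "bad address"]
clean = [a for a in (scrub_and_validate(r) for r in raw_input) if a]
keyed = assign_keys(dedupe(clean))
print(keyed)  # one surviving record, keyed 1
```

Real scrubbing software does the parsing, standardization and DPV checks; the point here is only the shape of the flow.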
Prepare Addresses
The goal of this step is to get the address into a standard format, parse it into its component parts and validate that it is an actual, deliverable address. The reason for validating should be fairly obvious – if an address doesn’t point to a property it isn’t useful for anything. Parsing the address and standardizing the components makes it possible to remove duplicates.
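To make the standardization idea concrete, here is a minimal sketch that maps a few suffix and unit spellings to their USPS Publication 28 abbreviations. The lookup tables are illustrative only; a real scrubbing product covers far more variants (and handles cases like `#4` that this sketch does not):

```python
# Minimal standardization sketch, loosely following USPS Publication 28
# abbreviations (Street -> ST, Avenue -> AVE, Apartment -> APT, ...).
# The tables below are illustrative, not exhaustive.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}
UNITS = {"APARTMENT": "APT", "UNIT": "APT"}

def standardize_token(token):
    token = token.upper().strip(".")  # "St." -> "ST"
    return SUFFIXES.get(token, UNITS.get(token, token))

def standardize(address):
    return " ".join(standardize_token(t) for t in address.split())

print(standardize("123 Main Street Apartment 4"))  # 123 MAIN ST APT 4
```

After both sides of a comparison have been run through the same standardization, a plain string comparison becomes meaningful.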
When verifying one or two addresses it’s relatively easy to go to the USPS website and verify them. It checks the address, corrects the zip code and adds the zip+4. It won’t change the city to the preferred city – if you use an acceptable city it leaves it as-is. Although this isn’t useful for automating validation, it is very useful for spot-checking individual addresses.
If you need to process hundreds, thousands, or – like some of our projects – millions of addresses, you must automate your processing. The best, fastest and cheapest way to do this is to purchase third-party address validation software. You want software that performs CASS and DPV certification. We have used address scrubbing software from a handful of vendors. They each have their advantages and disadvantages, so look at their specifications carefully and choose the right one for your project. Prices vary quite a bit. We have been able to successfully scrub about 5 million rows an hour on a Windows machine busy doing other things using one of the “cheaper” versions, but some of the “Enterprise” versions have other capabilities, can be tied into an existing infrastructure or come from a vendor you may already use.
Whichever software you use to scrub your addresses, once it has run and the “clean” addresses are available you are ready for the next step.
Remove Duplicates
To remove duplicates from addresses you need to understand how USPS addresses work and how your address software works.
As mentioned at the top of this post, the USPS has different categories for cities and streets. Both have preferred versions and acceptable versions. For example, West Des Moines is a preferred city name but WDM is accepted by the USPS and your mail will get delivered if you use that version. If you try to do a string comparison between “West Des Moines” and “WDM” you know that it will never match. So what’s the solution? Easy – most address scrubbing software gives you the option of always returning the preferred city / street or returning the input city / street if they are on the acceptable list. Use the preferred version. The other way to get around this is to skip the city when creating your key. Use Zip+4 instead.
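The "always return the preferred version" option amounts to a lookup from any acceptable name to the one preferred name. A tiny sketch, using the West Des Moines example from above (the dict is illustrative; real lookups come from USPS city/state data or your scrubbing software):

```python
# Illustrative mapping of USPS "acceptable" city names to the preferred
# name. A real table comes from USPS data or your scrubbing software.
PREFERRED_CITY = {"WDM": "WEST DES MOINES",
                  "WEST DES MOINES": "WEST DES MOINES"}

def preferred(city):
    city = city.upper()
    return PREFERRED_CITY.get(city, city)

# "WDM" and "West Des Moines" now compare equal:
print(preferred("WDM") == preferred("West Des Moines"))  # True
```

Normalizing both sides to the preferred name (or skipping city entirely in favor of Zip+4) is what makes the string comparison reliable.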
The following fields are needed for a natural unique address key.
- House Number
- Street Name
- Unit Number
- Zip 5
- City OR Zip 4
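In code, the natural key is just a tuple of those fields taken from an already-scrubbed, parsed record. A sketch (field names are illustrative; this version uses Zip+4 rather than city, which sidesteps the preferred/acceptable city issue):

```python
# Sketch of the natural key as a tuple of the fields listed above.
# The input dict is assumed to be already scrubbed and parsed.
def natural_key(addr):
    return (addr["house_number"], addr["street_name"],
            addr["unit_number"], addr["zip5"], addr["zip4"])

a = {"house_number": "123", "street_name": "MAIN ST",
     "unit_number": "4", "zip5": "50265", "zip4": "1234"}
b = dict(a)  # the same property arriving from another source
print(natural_key(a) == natural_key(b))  # True
```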
Do a distinct query based on the above fields, or use ROW_NUMBER() OVER (PARTITION BY ...) to remove duplicates:
Example using DISTINCT
This is straightforward and easy to use. It gives you a distinct list of rows based on the columns specified.
```sql
--Example of getting a distinct address list
SELECT DISTINCT
    HouseNumber,
    Direction,
    StreetName,
    Mode,
    Quadrant,
    UnitNumber,
    Zip5,
    Zip4
FROM dbo.SampleAddressTable;
```
Example Using A CTE
This is a little more complicated than using DISTINCT, but it gives you more flexibility. You can include additional columns that don’t have to be distinct. You can JOIN other tables to the distinct list. You can perform additional checks before you use the data. Which one you use depends on your needs.
```sql
;WITH ADDRESS_CTE AS
(
    SELECT
        HouseNumber,
        Direction,
        StreetName,
        Mode,
        Quadrant,
        UnitNumber,
        Zip5,
        Zip4,
        ROW_NUMBER() OVER (
            PARTITION BY HouseNumber, Direction, StreetName, Mode,
                         Quadrant, UnitNumber, Zip5, Zip4
            ORDER BY     HouseNumber, Direction, StreetName, Mode,
                         Quadrant, UnitNumber, Zip5, Zip4
        ) AS SELECT_CRITERIA
    FROM dbo.SampleAddressTable
)
SELECT *
FROM ADDRESS_CTE
WHERE SELECT_CRITERIA = 1;
```
Once duplicates are gone you have a good base set of addresses. These are the addresses that you will be matching to your other sources.
Assign A Key
The easiest way to use your freshly scrubbed and deduped data is to put it into a table. Then you can reference it from other tables easily. Although there is a natural key in the data (remember the fields listed above that make an address unique?) it is much easier to assign an integer key to the data. It makes your JOINs easier, it makes your indexes smaller (if you also use it as the CLUSTERED INDEX) and it is generally easier to use.
After you have created your key (and probably made it CLUSTERED) I would also create a UNIQUE INDEX on the natural key. This ensures no duplicates get added accidentally and will help your JOINs when you add new addresses later.
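A minimal sketch of that table layout, with the surrogate integer key plus the UNIQUE index on the natural key. SQLite (via Python) stands in for whatever RDBMS you use – the clustered-key syntax differs in T-SQL – and the table and column names are illustrative:

```python
import sqlite3

# Address table sketch: surrogate integer key plus a UNIQUE index on the
# natural key. SQLite stands in for your real RDBMS; names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Address (
        AddressKey  INTEGER PRIMARY KEY,   -- artificial key
        HouseNumber TEXT, StreetName TEXT,
        UnitNumber  TEXT, Zip5 TEXT, Zip4 TEXT
    )""")
con.execute("""
    CREATE UNIQUE INDEX UX_Address_Natural
        ON Address (HouseNumber, StreetName, UnitNumber, Zip5, Zip4)""")

con.execute("INSERT INTO Address (HouseNumber, StreetName, UnitNumber, Zip5, Zip4) "
            "VALUES ('123', 'MAIN ST', '4', '50265', '1234')")
try:
    # A second insert with the same natural key is rejected by the index.
    con.execute("INSERT INTO Address (HouseNumber, StreetName, UnitNumber, Zip5, Zip4) "
                "VALUES ('123', 'MAIN ST', '4', '50265', '1234')")
except sqlite3.IntegrityError:
    print("duplicate rejected")
```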
Another important thing to note is that you need to use the parsed address for this to work correctly. If you use something like the “mailing address” or “address line 1” with the house number, street name, mode, etc. combined you will have a very difficult time getting rid of duplicates.
Match Disparate Sources
Now you can match addresses from other sources. Just run them through your scrubbing process and JOIN on all of the address fields. You can then store the address key you created above with your new source, and add any new addresses you find to your address table as you scrub them.
If you have scrubbed, parsed and deduped your data this will be a relatively minor task.
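A sketch of that match: join the incoming (already-scrubbed) source to the keyed address table on the full natural key. SQLite again stands in for your RDBMS, and the natural key is shortened to three fields to keep the example small:

```python
import sqlite3

# Match a new source to the keyed address table by joining on the
# natural key (abbreviated to three fields here for brevity).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Address (AddressKey INTEGER PRIMARY KEY,
                          HouseNumber TEXT, StreetName TEXT, Zip5 TEXT);
    CREATE TABLE NewSource (SourceId INTEGER, HouseNumber TEXT,
                            StreetName TEXT, Zip5 TEXT);
    INSERT INTO Address VALUES (1, '123', 'MAIN ST', '50265');
    INSERT INTO NewSource VALUES (77, '123', 'MAIN ST', '50265');
""")
rows = con.execute("""
    SELECT n.SourceId, a.AddressKey
    FROM NewSource AS n
    JOIN Address   AS a
      ON  a.HouseNumber = n.HouseNumber
      AND a.StreetName  = n.StreetName
      AND a.Zip5        = n.Zip5
""").fetchall()
print(rows)  # [(77, 1)]
```

Unmatched source rows (a LEFT JOIN with a NULL AddressKey) are the new addresses to add to the table.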
Postal data can change over time. Zip codes can change, and so can street names or even city borders. It’s a good idea to re-scrub address data periodically and update fields to their latest preferred versions. You may also consider deleting addresses that become invalid.
Only putting valid addresses into your system of record is much easier than trying to figure out the correct address later. Tight controls on your input take more time at the start of a project, but they give you much better data and almost no invalid addresses if everything is verified up front.