Address Cleaning Using the TRANWRD Function

NESUG 2008

Coders' Corner

Address Cleaning Using the TRANWRD Function

Rena Jones, New York State Department of Health, Troy, NY Mike Zdeb, U@Albany School of Public Health, Rensselaer, NY

ABSTRACT SAS? data sets sometimes contain character variables that need alteration. Address databases used to generate mailing lists or for geocoding (assigning of a geographic coordinate location in terms of latitude and longitude based on street address) require standardized address information. The various portions of the address (numbers, street names, street types) should conform to a set of pre-specified rules.

SAS string functions are one way to manipulate the components of address data. SAS on-line help states that the TRANWRD function replaces or removes all occurrences of a given word (or a pattern of characters) within a character string. This paper will provide an overview of the TRANWRD function syntax and its utility for cleaning an address database either by itself or in conjunction with other string functions (COMPRESS, TRANSLATE). A SAS macro is described that uses the TRANWRD function to adjust for common inconsistencies and errors in address data. The macro provides one of a number of BASE SAS alternatives to functions available in the SAS Data Quality Server.

THE DATA Consider the following raw data from a table of addresses which we would eventually like to geocode in a GIS software package, or use to create a mailing list.

data mail; input address $25. ; datalines; 81 CANTOR LN 922 NEW_HAVEN PL 22 HARING/ ROAD ONE WILLOW RD 727 RAVEN TERR 35 PL ST. 42 W.26TH ST 15 S. SWAN AVE 129 NORTH BDWY ; run;

Assume some pre-set rules to obtain consistency among all addresses, such as requiring the full suffix for street names and removing all punctuation.

STANDARD OPTIONS Using a simple IF/THEN statement and a quoted text string, which only allows for replacement of an entire text string with another, would be inefficient if you have multiple records in which you want to make the same change:

if address = '8 CANTOR LANE' then address = '8 CANTOR LN';

An IF/THEN statement with the SUBSTR function would also work. However, it requires knowledge of the location of the text to be changed and thus allows only for correction of address text that starts in the same location across observations. This is also not useful for a standard set of changes:

if substr(address,10,4)= 'LANE' then substr(address,10) = 'LN';

NESUG 2008

Coders' Corner

TRANWRD The TRANWRD function is a search and replace function. TRANWRD does not rely on knowing the specific location of the text to replace, and will replace all occurrences of a text string using the following syntax: (source, target text, replacement text):

address = tranwrd(address,'ST.','STREET');

The list of adjustments using TRANWRD can be as comprehensive as the data set requires. Depending on the number of changes needed, one can write numerous lines of code. TRANWRD can make numerous changes with each observation in a data set.

COMPRESS AND TRANSLATE COMPRESS will remove specified characters from a string. By default, SAS removes blanks unless other characters are specified in the second part of the argument. Using COMPRESS prior to or in conjunction with making text replacements with TRANWRD may be useful.

address = COMPRESS(address,".");

1

81 CANTOR LN

2

922 NEW_HAVEN PL

3

22 HARING/ ROAD

4

ONE WILLOW RD

5

727 RAVEN TERR

6

35 PL ST

7

42 W26TH ST

8

15 S SWAN AVE

9

129 NORTH BDWY

The TRANSLATE function replaces characters in strings, but is less "sophisticated" than TRANWRD in that it replaces single characters, not patterns of characters (source, replacement text or list, target text or list).

address = TRANSLATE (address, '', '_ /');

1

81 CANTOR LN

2

922 NEW HAVEN PL

3

22 HARING ROAD

4

ONE WILLOW RD

5

727 RAVEN TERR

6

35 PL ST.

7

42 W.26TH ST

8

15 S. SWAN AVE

9

129 NORTH BDWY

A MACRO A convenient way to use the TRANWRD function to clean multiple address data sets is to put all of your common corrections and inside a SAS macro, such as the one below.

%macro correct(dataset); data temp; set &dataset;

***TRANSLATE AND COMPRESS TO REMOVE/REPLACE CHARACTERS; address = COMPRESS(address,'.'); address = translate(address,' ','_ /');

NESUG 2008

Coders' Corner

***TRANWRD LIST OF COMMON CONVERSIONS; address = TRANWRD (address,'ST','STREET'); address = TRANWRD (address,'LN','LANE'); address = TRANWRD (address,'RD','ROAD'); address = TRANWRD (address,'PL','PLACE'); address = TRANWRD (address,'TERR','TERRACE'); address = TRANWRD (address,'ONE','1'); address = TRANWRD (address,'BDWY','BROADWAY');

*****THIS LIST COULD GO ON AND ON...;

run; %mend;

****CALL THE MACRO using the mail dataset; %correct (mail);

CAVEATS First, don't assume that the changes will work for your entire data set.

When working with address data, a mass change can create more errors rather than making multiple simple corrections. Look what happens to the original addresses if we use the code:

address = TRANWRD (address,'PL','PLACE');

1

81 CANTOR LN

2

922 NEW_HAVEN PLACE

3

22 HARING/ ROAD

4

ONE WILLOW RD

5

727 RAVEN TERR

6

35 PLACE ST.

7

42 W.26TH ST

8

15 S. SWAN AVE

9

129 NORTH BDWY

In observation # 2 the change was appropriate. For observation # 6, assuming the original address was factually correct (that 35 PL Street does exist), the new address now has an incorrect street name. This is problematic for both a mailing list and for data that will be geocoded, where such a change could lead to an incorrect geographic assignment or may not get geocoded at all.

Next, TRANWRD does not search and replace words. TRANWRD does not consider word boundaries since it is scanning for strings. It will replace the target text string with your text of choice and leave any neighboring text:

address = tranwrd(address,'AVE','AVENUE');

1

81 CANTOR LN

2

922 NEW_HAVENUEN PL

3

22 HARING/ ROAD

4

ONE WILLOW RD

5

727 RAVENUEN TERR

6

35 PL ST.

7

42 W.26TH ST

8

15 S. SWAN AVENUE

9

129 NORTH BDWY

NESUG 2008

Coders' Corner

In observation #8, the desired change was made. However, notice that in observations #2 and #5, the string 'AVE' in the middle of street names has been changed to 'AVENUE'. One way to avoid such problems is to embed a blank space at the start or the end of the search string. That also requires that you add a space temporarily at the start and end of the variable value being changed:

address = left(tranwrd(cat(' ',address,' '),' AVE ',' AVENUE '));

Using the CAT function to add the extra blank spaces ensures that even if a search string occurs at the beginning or the end of an address, there will always be a set of blanks around that string, for example:

AVE OF THE AMERICAS

Without the leading blank added with the CAT function, the string ' AVE ' would not be found. The LEFT function removes the leading blank that is added at the beginning of the string with the CAT function.

Finally, as of version 9 of SAS, the PRX functions also offer text changing capabilities. They are both more flexible and more powerful than the TRANWRD function. However, they do require more study and practice and are not as easily understood as TRANWRD. Here is one example that allows you to change either 'AV' or 'AVE' to 'AVENUE' in one statement using PRXCHANGE:

address = left(prxchange('s/ AVE | AV / AVENUE /',-1,cat(' ',address,' ')));

The same precautions mentioned earlier with TRANWRD about searching for words rather than portions of words are taken here. The statement is similar the previous use of TRANWRD, but it searches for both 'AV' and 'AVE' and changes occurrences to 'AVENUE'.

SUMMARY The SAS TRANWRD function is a useful way to make pre-specified changes to address information in order to prepare a data set for mailing or for geocoding.

ACKNOWLEDGMENTS SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ? indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

REFERENCES SAS Online Documentation:

TRANWRD

PRX



CONTACT INFORMATION Rena Jones New York State Department of Health Center for Environmental Health 547 River Street, Room 200 Troy, NY 12180 (518) 402-7958

rrj01@health.state.ny.us

Mike Zdeb University at Albany, SUNY School of Public Health One University Place Rensselaer, NY 12144 (518) 402-6479

msz03@albany.edu

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download