Anybody who has worked with data knows that there is no limit to the creativity that users show when entering information into systems. Methodologies which rely on data being entered in a particular order and in a specific format will be of little use in the real world. Variation within one country and one language is wide enough – extend the problem internationally and the difficulties multiply.
What is required is Fuzzy matching.
There are numerous ways that a user might enter data so that it does not match what is found in a master data file. Common abbreviations might be used (Avenue, Ave., Av.); punctuation, accents or spacing might be different (St.-Helen’s, Saint Helens); data might be misspelled (Betws-y-Coed, Betsy Co-ed, Betz ee kowed); letters might be reversed (Lodnon, Beflast); words might be dropped (Adwick, Adwick Street, Adwick le Street; rue Marseilles, rue de Marseilles); other-language versions of place names might be used (København, Copenhagen, Kopenhagen, Copenhague); transliterated data might be used (El Iskandarîya, El Qâhira); acronyms or abbreviations might replace full versions (BBC, British Broadcasting Corporation); numbers might be written differently (12 Avenue, 12th Avenue, Twelfth Avenue) … the possibilities for diverse input seem endless.
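Some of this variation can be removed before any comparison takes place. The following is a minimal Python sketch, not a description of any particular product: the function name canonicalise and the tiny ABBREVIATIONS table are assumptions made for illustration, and a real system would need far larger, language-specific tables.

```python
import re
import unicodedata

# Hypothetical lookup table, for illustration only.
ABBREVIATIONS = {
    "ave": "avenue",
    "av": "avenue",
    "12th": "12",
    "twelfth": "12",
}

def canonicalise(text: str) -> str:
    """Reduce a string to a rough canonical form before matching."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Ignore case, and treat anything that is not a letter or digit as a separator.
    tokens = re.findall(r"[a-z0-9]+", stripped.casefold())
    # Replace known abbreviations and number words with one agreed form.
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)
```

With this sketch, "12th Ave." and "Twelfth Avenue" both reduce to the same string. Letters such as the "ø" in København have no accent to strip under this kind of normalisation, which is one reason why lookup tables of alternative place names are still needed.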
Fuzzy matching finds matches by measuring the degree of likeness between strings, and is used where an exact match can’t be made. Systems can be programmed intelligently to allow for spelling and phonetic differences, string and field transposition, language use, casing and the use of diacritical marks. Fuzzy matching can also find duplication within fields that cannot be standardized or easily validated, such as personal and business names.
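To make "degree of likeness" concrete, here is a minimal sketch using SequenceMatcher from Python's standard difflib module. It is only a stand-in: it is one of many possible similarity measures, and nothing here says which measure any given matching system actually uses. The function name similarity is an assumption for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a likeness score from 0.0 (nothing in common) to 1.0 (identical).

    Case is ignored; no other clean-up is attempted in this sketch.
    """
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

# Reversed letters still score highly; unrelated names do not.
for left, right in [("Lodnon", "London"), ("Beflast", "Belfast"), ("London", "Belfast")]:
    print(left, right, round(similarity(left, right), 2))
```

A measure like this handles typing slips reasonably well, but it knows nothing about phonetics, abbreviations or other-language names, which need separate handling.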
There is no single way that Fuzzy matching works – systems differ greatly. Systems exist, for example, that put every string in an address into a single string and then compare this with the next address. The user must decide what degree of similarity (e.g. 80%) must exist to accept two records as duplicates. This methodology is fast, but the matching or de-duplication it produces is inaccurate. Matching is always better when like is compared to like – compare, for example, the postal code with the postal code, the company with the company, the house number with the house number. Intelligence can then be built into these comparisons. Allow for the existence of typos in postal codes; compare address strings within fields (where strings are often transposed) but also between fields, as address strings are often written in different orders by different people and different cultures.
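The difference between the two approaches can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the field names, the sample records, the use of difflib as the similarity measure and the rule that every field must reach the threshold. A real system would weight fields differently and treat each one with its own logic.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Case-insensitive likeness score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def single_string_score(rec_a: dict, rec_b: dict) -> float:
    """Naive approach: join every field into one string and compare once."""
    return ratio(" ".join(rec_a.values()), " ".join(rec_b.values()))

def field_scores(rec_a: dict, rec_b: dict) -> dict:
    """Like-with-like approach: score each field against the same field."""
    return {field: ratio(rec_a[field], rec_b[field]) for field in rec_a}

def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Accept a pair only if every field reaches the threshold."""
    return all(score >= threshold for score in field_scores(rec_a, rec_b).values())

# Hypothetical records: b has a typo in the street, c is a different house.
a = {"house_number": "12", "street": "Adwick Street", "postal_code": "E1 2BB"}
b = {"house_number": "12", "street": "Adwick Stret", "postal_code": "E1 2BB"}
c = {"house_number": "21", "street": "Adwick Street", "postal_code": "E1 2BB"}

print(single_string_score(a, c), is_duplicate(a, c))
print(single_string_score(a, b), is_duplicate(a, b))
```

In this sketch the single-string score for a and c is well above 80%, because only two characters differ in a long string, yet the field-by-field check rejects the pair because the house numbers disagree. The typo in b's street name is still tolerated.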
The fuzzier the match that systems allow, the greater the chance that false positives will be found. A fine line exists between finding a correct match and finding an incorrect match. Fuzzy matching systems are much more refined and clever than can be explained in a simple blog post like this one. It takes skill and intelligence to design systems that can recognise “12 Ave Lodnon EI 288” as being the same as “Twelfth Avenue, London, E1 2BB” but not the same as “12, London Avenue, E12 8BB”.
The logic to match fuzzily exists in a great number of customer-facing utilities we use daily, and, though we tend to take them for granted, a great deal of clever data manipulation is going on behind the scenes.