Deduplication (Network Steve Forum)

Deduplication

Hi, I have been thinking of using Fuzzy Lookup and Grouping Transformation for the deduplication of our data. So far it seems to be a good option except one issue. We also have to dedupe names. In that case a person with name Dick can have a nickname of Richard. So, the process should show Dick and Richard as duplicates. Also there can be other complex name vs nicknames. Can fuzzy lookup find that ? Is there any way to do this inside SQL Server without me creating name vs nicknames table. Does SQL Server maintains that kind of dictionary ? Any suggestion is highly appreciated. Thanks,

May 1st, 2011 11:47am

No there isn't a nickname-check. It looks for string similarities.. If a persons name is Roberto, the names 'Robert' and 'Robberto' will have a higher similarity % then a nickname like 'Rob'. Here is a PDF with more details about the fuzzy logic: http://www.cs.washington.edu/education/courses/cse590q/04au/papers/Chaudhuri03.pdf Please mark the post as answered if it answers your question | My SSIS Blog: http://microsoft-ssis.blogspot.com

Free Windows Admin Tool Kit Click here and download it now

May 1st, 2011 2:00pm

May 1st, 2011 2:20pm

Thanks alot for the reply. So, there is no way of doing synonyms matching, names matching etc.. in SQL Server ? There are simple tools which can do those things. I was wondering why SQL Server does not provide those functionalities. It would be very useful during data cleansing. Those tools probably have some kind of table with matches... But if you compair more than just one column (add for example Lastname, residence, country, etc) it will reduce the problem: Roberto and Rob has a low similairity, but Roberto Johnson from Berlin in Germany will probably match Rob Johnson from Berlin in Germany. I personally use the Fuzzy Logic only as a help to reduce manually labour. I export the Fuzzy Grouping/deduplication result to an Excel/Csv and let the business choose: A) Manually check and correct all records yourself, but: - everything with a similarity and probability larger than 80% is most certainly a match - everything between 60 and 80% has to be checked for false positive matches - everything below 60% is most certainly NOT a match This means they only have to check around 20% of the file B) Check the file superficial and give me a percentage (like 77% and larger) that I can mark as a match.Please mark the post as answered if it answers your question | My SSIS Blog: http://microsoft-ssis.blogspot.com

Free Windows Admin Tool Kit Click here and download it now

May 1st, 2011 5:26pm

This topic is archived. No further replies will be accepted.