Computer Science Colloquium, 12/05/2013

DEC 05, 2013 | 4:15 PM

Details

WHERE:

The Graduate Center
365 Fifth Avenue

ROOM:

9205

WHEN:

December 05, 2013: 4:15 PM

ADMISSION:

Free

Description

Computer Science Colloquium
Thursday, December 5, 4:15pm, room 9205/06
 
Yuri Gurevich
Microsoft Research

Large-data deduplication problem


Imagine that you have a long list of items, say a hundred thousand of them. For example, the items may be client addresses. Some of the addresses are essentially duplicates, distinguished only by "St." vs. "Street", by "Bill" vs. "William", or by minor spelling errors. You don't want to miss any of your clients, and you don't want to annoy them by sending them multiple copies of your communications. How do you clean up your item list? The problem is ubiquitous and hard. We analyze the problem and describe a fast probabilistic algorithm for it.

The Colloquium is supported by generous contributions from Bloomberg, Information Builders, Inc., and Netlogic, Inc.