From the Desk of Anup Mathur, Chief, Center for Optimization and Data Science . . .
CENTER FOR OPTIMIZATION AND DATA SCIENCE SEMINAR SERIES*
Research and Methodology Directorate
"FIRLA: A Fast Incremental Record Linkage Algorithm"
Ahmed Soliman
University of Connecticut
August 31, 2022
3:05-4:30 p.m. ET
Virtual Meeting
Click here to join the meeting
Abstract:
Record linkage is an important problem studied in many domains, including biomedical informatics. The standard version of the problem is to cluster records from several datasets so that each cluster contains records pertaining to just one individual. Because the datasets are typically very large, existing record linkage algorithms take a long time to run, and faster algorithms are needed. The incremental version of the problem is to link previously clustered records with new records added to the input datasets. We present a novel algorithm that efficiently performs both standard and incremental record linkage. The algorithm relies on a set of techniques that significantly restrict the number of record pair comparisons and distance computations. It shows an average speed-up of 2.4x (up to 4x) over the state of the art on the standard linkage problem, without any drop in linkage performance. On average, it links records incrementally in just 33% of the time required to link them from scratch. Whenever more than two comparison attributes are used, which is common in practice, it achieves comparable or superior linkage performance and outperforms the state of the art in linking time. The proposed algorithm could be used in practical record linkage applications, especially when records are added over time and the linkage output needs to be updated frequently.
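For readers new to the topic, the two ideas in the abstract (restricting which record pairs are compared, and linking new records against existing clusters instead of re-linking from scratch) can be illustrated with a toy sketch. This is not FIRLA and makes no claim about how the speaker's algorithm works. The class name `IncrementalLinker`, the blocking key (first letter of the last name), the distance threshold, and the sample records are all assumptions chosen for illustration.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


class IncrementalLinker:
    """Toy linker (hypothetical): blocking plus union-find, one record at a time."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold       # assumed cutoff on summed edit distance
        self.records = []                # (first_name, last_name) tuples
        self.parent = []                 # union-find parent pointers
        self.blocks = defaultdict(list)  # blocking key -> record ids
        self.comparisons = 0             # record pairs actually compared

    def _find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def _block_key(self, record) -> str:
        # Assumed blocking rule: first letter of the last name.
        return record[1][:1].lower()

    def _distance(self, r, s) -> int:
        return sum(edit_distance(x.lower(), y.lower()) for x, y in zip(r, s))

    def add(self, record) -> int:
        """Link one new record against its own block only; return its id."""
        rid = len(self.records)
        self.records.append(record)
        self.parent.append(rid)
        key = self._block_key(record)
        for other in self.blocks[key]:
            self.comparisons += 1
            if self._distance(record, self.records[other]) <= self.threshold:
                self.parent[self._find(rid)] = self._find(other)
        self.blocks[key].append(rid)
        return rid

    def clusters(self):
        groups = defaultdict(list)
        for i in range(len(self.records)):
            groups[self._find(i)].append(i)
        return sorted(groups.values())


linker = IncrementalLinker(threshold=2)

# Initial ("standard") linkage over four records.
for rec in [("John", "Smith"), ("Jon", "Smith"),
            ("Mary", "Jones"), ("Maria", "Lopez")]:
    linker.add(rec)

# Incremental step: two records arrive later and are linked in place.
for rec in [("Jhon", "Smith"), ("Mary", "Jonas")]:
    linker.add(rec)

print(linker.clusters())
print(linker.comparisons)
```

In this toy example the six records are linked with 4 pairwise comparisons, where comparing every pair would take 15, and the two late arrivals are placed without revisiting pairs that were already compared. A blocking key this crude will miss true matches whose last names start with different letters, which is one reason practical methods are more elaborate.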
Bio of the Speaker:
Ahmed Soliman completed his B.S. in communications and electronics engineering (Computer and Software Engineering Branch) at Helwan University, Egypt, in 2004. He worked as a teaching assistant at Helwan University between 2004 and 2016. During this period, he received his M.S. in computer science and engineering from Helwan University in 2012. Ahmed is currently (2022) a Ph.D. student in the Computer Science and Engineering Department at the University of Connecticut. His research interests include big data analytics, clustering and record linkage algorithms, feature selection algorithms, and machine learning.
MS Teams meeting: Click here to join the meeting
Or call in (audio only):
+1 347-973-4395 (United States, New York City)
Meeting ID: 264 903 886 294
If you would like to be added to the distribution list and calendar invite for the “CODS Seminar Series Optional List,” please contact Josephine Bustos (Josephin...@census.gov) and Tom Loo (tom...@census.gov).
Please direct all requests for sign language interpreting services and captioning (also known as CART or Communication Access Real-Time Translation) to HRD.Inte...@census.gov. If you have questions concerning accommodations, please contact the Reasonable Accommodation Staff at HRD.Accom...@census.gov or 301-763-4060 (Voice).
* Scholars will present their research monthly and engage in collaborative discussions with Census Bureau researchers on entity resolution, time series software, data science algorithms, adaptive survey design, artificial intelligence methods, and research computing ecosystems.