A novel ensemble learning approach to unsupervised record linkage

Jurek, Anna; Hong, Jun; Chi, Yuan; Liu, Weiru

Authors

Anna Jurek

Jun Hong Jun.Hong@uwe.ac.uk
Professor in Artificial Intelligence

Yuan Chi

Weiru Liu



Abstract

Record linkage is the process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either a match or a non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available, so manual labelling is required to create a sufficiently large training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset, have been applied to record linkage. These techniques reduce the amount of manual labelling required. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. To further improve the automatic self-learning process, we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of the self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on four publicly available datasets that are commonly used in the record linkage community. Our experimental results show that the proposed approach has advantages over state-of-the-art semi-supervised and unsupervised record linkage techniques. On three of the four datasets it also achieves results comparable to those of the supervised approaches.
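The abstract does not give algorithmic details, so the following Python sketch only illustrates the general idea of field-weighted automatic seed selection combined with an ensemble of models built on different similarity measures. It is not the authors' method. The similarity functions, the thresholds `hi` and `lo`, the field weights, the nearest-centroid labelling step, and the majority vote are all assumptions made for illustration, and the diversity measure and contribution ratios mentioned in the abstract are omitted.

```python
from difflib import SequenceMatcher

# Three hypothetical "similarity measure schemes" (assumed, not from the paper).
def edit_sim(a, b):
    """Character-level similarity in [0, 1] computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_jaccard(a, b):
    """Jaccard overlap of whitespace-separated tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def exact_sim(a, b):
    """1.0 if the two values are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0

def weighted_score(vec, weights):
    """Field-weighted mean of the per-field similarities."""
    return sum(w * s for w, s in zip(weights, vec)) / sum(weights)

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def self_learn(vectors, weights, hi=0.8, lo=0.3):
    """Select seeds by weighted score, then label each pair by nearest seed centroid.

    This is a single-pass simplification; hi and lo are assumed thresholds.
    """
    scores = [weighted_score(v, weights) for v in vectors]
    match_seeds = [v for v, s in zip(vectors, scores) if s >= hi]
    non_seeds = [v for v, s in zip(vectors, scores) if s <= lo]
    if not match_seeds or not non_seeds:
        # Fall back to a plain midpoint threshold if either seed set is empty.
        return [s >= (hi + lo) / 2 for s in scores]
    cm, cn = centroid(match_seeds), centroid(non_seeds)
    return [sq_dist(v, cm) < sq_dist(v, cn) for v in vectors]

def ensemble_link(pairs, fields, schemes, weights):
    """Build one model per similarity scheme and combine them by strict majority vote."""
    votes = []
    for sim in schemes:
        vectors = [[sim(a[f], b[f]) for f in fields] for a, b in pairs]
        votes.append(self_learn(vectors, weights))
    return [2 * sum(v) > len(schemes) for v in zip(*votes)]

# Toy candidate pairs (made-up data for illustration only).
pairs = [
    ({"name": "Anna Jurek", "city": "Belfast"}, {"name": "Anna Jurek", "city": "Belfast"}),
    ({"name": "John Smith", "city": "London"}, {"name": "Maria Garcia", "city": "Madrid"}),
    ({"name": "Jon Smith", "city": "London"}, {"name": "John Smith", "city": "London"}),
]
links = ensemble_link(
    pairs,
    fields=["name", "city"],
    schemes=[edit_sim, token_jaccard, exact_sim],
    weights=[2.0, 1.0],  # assumed: name counts twice as much as city
)
print(links)
```

On this toy data the exact-match model rejects the third pair because of the "Jon"/"John" spelling difference, while the other two models accept it, so the majority vote still links the pair. A fuller treatment would iterate the self-learning step and filter the ensemble members, as the abstract describes.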

Citation

Jurek, A., Hong, J., Chi, Y., & Liu, W. (2017). A novel ensemble learning approach to unsupervised record linkage. Information Systems, 71, 40-54. https://doi.org/10.1016/j.is.2017.06.006

Journal Article Type Article
Acceptance Date Jun 27, 2017
Online Publication Date Jun 28, 2017
Publication Date Nov 1, 2017
Journal Information Systems
Print ISSN 0306-4379
Publisher Elsevier
Peer Reviewed Peer Reviewed
Volume 71
Pages 40-54
DOI https://doi.org/10.1016/j.is.2017.06.006
Keywords unsupervised record linkage, data matching, classification, ensemble learning
Public URL https://uwe-repository.worktribe.com/output/877934
Publisher URL https://doi.org/10.1016/j.is.2017.06.006
Related Public URLs http://www.sciencedirect.com/science/article/pii/S0306437916305063
