Research Repository

A novel ensemble learning approach to unsupervised record linkage

Jurek, Anna; Hong, Jun; Chi, Yuan; Liu, Weiru

Authors

Anna Jurek

Jun Hong (Jun.Hong@uwe.ac.uk), Professor in Artificial Intelligence

Yuan Chi

Weiru Liu



Abstract

Record linkage is the process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either match or non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available, so manual labelling is required to create a sufficiently sized training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset, have been applied to record linkage. These techniques reduce the amount of manual labelling required. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. To further improve the automatic self-learning process, we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on four publicly available datasets that are commonly used in the record linkage community. Our experimental results show that the proposed approach has advantages over state-of-the-art semi-supervised and unsupervised record linkage techniques. On three of the four datasets it also achieves results comparable to those of the supervised approaches.
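
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nearest-centroid classifier, the seed thresholds (`upper`, `lower`), the field weights and every function name are assumptions made for illustration, and the unsupervised diversity measure and contribution-ratio filtering are left out. Each record pair is assumed to be represented by a vector of per-field similarities in [0, 1], with one such set of vectors per similarity measure scheme.

```python
from statistics import mean


def weighted_score(sim_vector, weights):
    """Weighted average of per-field similarities (weights are illustrative)."""
    return sum(s * w for s, w in zip(sim_vector, weights)) / sum(weights)


def select_seeds(sim_vectors, weights, upper=0.8, lower=0.2):
    """Automatic seed selection: very similar pairs become match seeds (1),
    very dissimilar pairs become non-match seeds (0). Thresholds are assumed."""
    seeds = {}
    for i, vec in enumerate(sim_vectors):
        score = weighted_score(vec, weights)
        if score >= upper:
            seeds[i] = 1
        elif score <= lower:
            seeds[i] = 0
    return seeds


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [mean(column) for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_learn(sim_vectors, weights, upper=0.8, lower=0.2):
    """Self-learning with a nearest-centroid classifier (a stand-in model):
    repeatedly label the most confidently classified unlabelled pair,
    then retrain on the enlarged labelled set."""
    labels = select_seeds(sim_vectors, weights, upper, lower)
    if set(labels.values()) != {0, 1}:
        raise ValueError("seed selection must yield both classes")
    unlabelled = set(range(len(sim_vectors))) - set(labels)
    while unlabelled:
        c1 = centroid([sim_vectors[i] for i, y in labels.items() if y == 1])
        c0 = centroid([sim_vectors[i] for i, y in labels.items() if y == 0])

        def margin(i):
            # Larger gap between the two distances means higher confidence.
            return abs(sq_dist(sim_vectors[i], c0) - sq_dist(sim_vectors[i], c1))

        best = max(unlabelled, key=margin)
        closer_to_match = sq_dist(sim_vectors[best], c1) < sq_dist(sim_vectors[best], c0)
        labels[best] = 1 if closer_to_match else 0
        unlabelled.remove(best)
    return [labels[i] for i in range(len(sim_vectors))]


def ensemble_vote(schemes, weights):
    """Train one self-learning model per similarity scheme and combine
    their predictions by simple majority vote."""
    predictions = [self_learn(vectors, weights) for vectors in schemes]
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]


# Toy data: three hypothetical similarity schemes over the same four record pairs,
# each pair described by two field similarities.
schemes = [
    [[0.95, 0.9], [0.1, 0.05], [0.7, 0.8], [0.3, 0.2]],
    [[0.9, 1.0], [0.0, 0.1], [0.6, 0.9], [0.2, 0.4]],
    [[1.0, 0.85], [0.15, 0.1], [0.75, 0.6], [0.35, 0.1]],
]
print(ensemble_vote(schemes, weights=[2.0, 1.0]))
```

In this sketch the field weights only influence seed selection, which loosely mirrors the role the abstract gives to field weighting. A fuller version would also need some way of measuring diversity between models and of discarding weak ones, which the abstract mentions but does not specify in enough detail to reproduce here.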

Journal Article Type: Article
Acceptance Date: Jun 27, 2017
Online Publication Date: Jun 28, 2017
Publication Date: Nov 1, 2017
Deposit Date: Jul 10, 2017
Publicly Available Date: Jun 28, 2018
Journal: Information Systems
Print ISSN: 0306-4379
Publisher: Elsevier
Peer Reviewed: Peer Reviewed
Volume: 71
Pages: 40-54
DOI: https://doi.org/10.1016/j.is.2017.06.006
Keywords: unsupervised record linkage, data matching, classification, ensemble learning
Public URL: https://uwe-repository.worktribe.com/output/877934
Publisher URL: https://doi.org/10.1016/j.is.2017.06.006
Related Public URLs: http://www.sciencedirect.com/science/article/pii/S0306437916305063
Contract Date: Jul 10, 2017
