Anna Jurek
A novel ensemble learning approach to unsupervised record linkage
Jurek, Anna; Hong, Jun; Chi, Yuan; Liu, Weiru
Abstract
Record linkage is the process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either a match or a non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available, so manual labelling is required to create a sufficiently large training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset, have been applied to record linkage. These techniques reduce the amount of manual labelling needed. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. To further improve the automatic self-learning process, we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on four publicly available datasets which are commonly used in the record linkage community. Our experimental results show that the proposed approach has advantages over state-of-the-art semi-supervised and unsupervised record linkage techniques. On three of the four datasets it also achieves results comparable to those of the supervised approaches.
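The abstract gives no implementation details, so the following is only a minimal sketch of two of the ideas it names: field-weighted automatic seed selection, and combining several similarity measure schemes by voting. Everything concrete here is an assumption made for illustration and is not taken from the paper: the similarity functions `char_sim` and `token_sim`, the thresholds, the field weights, and the majority vote. In particular, a simple threshold on the weighted score stands in for a trained self-learning model, and the diversity measure and contribution ratios are not implemented.

```python
from difflib import SequenceMatcher


def char_sim(a, b):
    """Character-level similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sim(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def weighted_score(pair, scheme, weights):
    """Field-weighted mean similarity of one record pair.

    `scheme` holds one similarity function per field; `weights` holds one
    positive weight per field (both hypothetical choices).
    """
    r1, r2 = pair
    total = sum(w * sim(x, y) for w, sim, x, y in zip(weights, scheme, r1, r2))
    return total / sum(weights)


def select_seeds(pairs, scheme, weights, hi=0.9, lo=0.1):
    """Return indices of confident match and non-match seeds.

    Pairs scoring at least `hi` become match seeds and pairs scoring at most
    `lo` become non-match seeds; the rest stay unlabelled.
    """
    matches, non_matches = [], []
    for i, pair in enumerate(pairs):
        score = weighted_score(pair, scheme, weights)
        if score >= hi:
            matches.append(i)
        elif score <= lo:
            non_matches.append(i)
    return matches, non_matches


def ensemble_vote(pairs, schemes, weights, threshold=0.5):
    """Majority vote over schemes; True means the pair is predicted a match.

    Assumption: each scheme votes by thresholding its weighted score, which
    replaces the trained self-learning model described in the abstract.
    """
    decisions = []
    for pair in pairs:
        votes = sum(weighted_score(pair, s, weights) >= threshold for s in schemes)
        decisions.append(2 * votes > len(schemes))
    return decisions


if __name__ == "__main__":
    # Toy records with two fields (name, address); purely illustrative data.
    pairs = [
        (("John Smith", "12 High Street"), ("Jon Smith", "12 High St")),
        (("John Smith", "12 High Street"), ("Mary Jones", "4 Park Road")),
    ]
    schemes = [[char_sim, char_sim], [token_sim, token_sim], [char_sim, token_sim]]
    weights = [2.0, 1.0]  # hypothetical: name weighted above address
    print(select_seeds(pairs, schemes[0], weights))
    print(ensemble_vote(pairs, schemes, weights))
```

In a fuller version, the seeds from `select_seeds` would train a classifier per scheme, which would then be retrained iteratively on its own confident predictions; the sketch stops short of that loop.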
Citation
Jurek, A., Hong, J., Chi, Y., & Liu, W. (2017). A novel ensemble learning approach to unsupervised record linkage. Information Systems, 71, 40-54. https://doi.org/10.1016/j.is.2017.06.006
Journal Article Type | Article |
---|---|
Acceptance Date | Jun 27, 2017 |
Online Publication Date | Jun 28, 2017 |
Publication Date | Nov 1, 2017 |
Journal | Information Systems |
Print ISSN | 0306-4379 |
Publisher | Elsevier |
Peer Reviewed | Peer Reviewed |
Volume | 71 |
Pages | 40-54 |
DOI | https://doi.org/10.1016/j.is.2017.06.006 |
Keywords | unsupervised record linkage, data matching, classification, ensemble learning |
Public URL | https://uwe-repository.worktribe.com/output/877934 |
Publisher URL | https://doi.org/10.1016/j.is.2017.06.006 |
Related Public URLs | http://www.sciencedirect.com/science/article/pii/S0306437916305063 |
Files
InformationSystems-InPress-2017-06-28.pdf
(1.7 Mb)
PDF