Anna Jurek
A novel ensemble learning approach to unsupervised record linkage
Jurek, Anna; Hong, Jun; Chi, Yuan; Liu, Weiru
Abstract
Record linkage is the process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either a match or a non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available, so manual labelling is required to create a sufficiently large training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset, have been applied to record linkage. These techniques reduce the amount of manual labelling needed for the training dataset. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach, an ensemble of automatic self-learning models is generated with different similarity measure schemes. To further improve the automatic self-learning process, we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on four publicly available datasets that are commonly used in the record linkage community. Our experimental results show that the proposed approach has advantages over state-of-the-art semi-supervised and unsupervised record linkage techniques. On three of the four datasets it also achieves results comparable to those of the supervised approaches.
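The abstract mentions automatic seed selection with field weighting but does not give the procedure. The following is a minimal Python sketch of one common way such a step could work, not the authors' method: record pairs whose weighted similarity is very high are taken as match seeds, and pairs whose weighted similarity is very low as non-match seeds. The function names, the field weights, the two thresholds, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any similarity measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_similarity(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted mean of per-field similarities (the weights are hypothetical)."""
    total = sum(weights.values())
    return sum(
        w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items()
    ) / total


def select_seeds(pairs, weights, upper=0.9, lower=0.3):
    """Split candidate pairs into match seeds, non-match seeds, and the rest.

    The thresholds `upper` and `lower` are illustrative assumptions.
    """
    match_seeds, non_match_seeds, unlabelled = [], [], []
    for rec_a, rec_b in pairs:
        score = weighted_similarity(rec_a, rec_b, weights)
        if score >= upper:
            match_seeds.append((rec_a, rec_b))
        elif score <= lower:
            non_match_seeds.append((rec_a, rec_b))
        else:
            unlabelled.append((rec_a, rec_b))
    return match_seeds, non_match_seeds, unlabelled


# Toy candidate pairs (made-up records, not taken from the paper's datasets).
pairs = [
    ({"name": "Anna Smith", "city": "Belfast"},
     {"name": "Anna Smith", "city": "Belfast"}),
    ({"name": "Anna Smith", "city": "Belfast"},
     {"name": "Joe Woods", "city": "York"}),
    ({"name": "John Murphy", "city": "Dublin"},
     {"name": "Jon Murphy", "city": "Cork"}),
]
weights = {"name": 0.7, "city": 0.3}  # assumed: name is more discriminative
matches, non_matches, unlabelled = select_seeds(pairs, weights)
```

In a self-learning setup, the two seed sets would serve as the initial training data for a classifier, which is then retrained as it labels the remaining pairs. An ensemble along the lines the abstract describes would repeat this with different similarity measures in place of `field_similarity`.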
Journal Article Type | Article |
---|---|
Acceptance Date | Jun 27, 2017 |
Online Publication Date | Jun 28, 2017 |
Publication Date | Nov 1, 2017 |
Deposit Date | Jul 10, 2017 |
Publicly Available Date | Jun 28, 2018 |
Journal | Information Systems |
Print ISSN | 0306-4379 |
Publisher | Elsevier |
Peer Reviewed | Peer Reviewed |
Volume | 71 |
Pages | 40-54 |
DOI | https://doi.org/10.1016/j.is.2017.06.006 |
Keywords | unsupervised record linkage, data matching, classification, ensemble learning |
Public URL | https://uwe-repository.worktribe.com/output/877934 |
Publisher URL | https://doi.org/10.1016/j.is.2017.06.006 |
Related Public URLs | http://www.sciencedirect.com/science/article/pii/S0306437916305063 |
Contract Date | Jul 10, 2017 |
Files
InformationSystems-InPress-2017-06-28.pdf
(1.7 Mb)
PDF