Anna Jurek
A novel ensemble learning approach to unsupervised record linkage
Jurek, Anna; Hong, Jun; Chi, Yuan; Liu, Weiru
Abstract
Record linkage is a process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either match or non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available, hence manual labelling is required to create a sufficiently sized training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset, have been applied to record linkage. These techniques reduce the amount of manual labelling needed for the training dataset. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. In order to further improve the automatic self-learning process we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on four publicly available datasets which are commonly used in the record linkage community. Our experimental results show that our proposed approach has advantages over the state-of-the-art semi-supervised and unsupervised record linkage techniques. In three of the four datasets it also achieves results comparable to those of the supervised approaches.
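The general idea described in the abstract can be illustrated with a small toy example. The sketch below is not the authors' implementation: every function name, the thresholds `hi=0.8` and `lo=0.3`, the field weights, and the sample records are assumptions made for illustration. It shows automatic seed selection on field-weighted similarity scores, one simplified self-learning model per similarity measure (a single nearest-centroid step stands in for iterative classifier training), and a majority vote over the models. The unsupervised diversity measure and the contribution ratios mentioned in the abstract are not reproduced here.

```python
from difflib import SequenceMatcher


def seq_ratio(a: str, b: str) -> float:
    """Similarity based on difflib's matching-blocks ratio, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams, in [0, 1]."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens, in [0, 1]."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def score(pair, measure, weights):
    """Field-weighted mean similarity of one record pair."""
    left, right = pair
    sims = [measure(x.lower(), y.lower()) for x, y in zip(left, right)]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def self_learn(scores, hi=0.8, lo=0.3):
    """Label all pairs from automatically selected seeds (no manual labels).

    Pairs scoring >= hi become match seeds, pairs scoring <= lo become
    non-match seeds; every pair is then assigned to the nearer seed centroid.
    The thresholds are illustrative assumptions.
    """
    match_seeds = [s for s in scores if s >= hi]
    non_seeds = [s for s in scores if s <= lo]
    if not match_seeds or not non_seeds:
        # No usable seeds on one side: fall back to a midpoint threshold.
        return [s >= (hi + lo) / 2 for s in scores]
    c_match = sum(match_seeds) / len(match_seeds)
    c_non = sum(non_seeds) / len(non_seeds)
    return [abs(s - c_match) < abs(s - c_non) for s in scores]


def ensemble_link(pairs, weights,
                  measures=(seq_ratio, bigram_jaccard, token_jaccard)):
    """Majority vote over one self-learning model per similarity measure."""
    votes = [self_learn([score(p, m, weights) for p in pairs])
             for m in measures]
    return [2 * sum(v) > len(measures) for v in zip(*votes)]


# Hypothetical record pairs: (name, address) on each side.
pairs = [
    (("John Smith", "12 High Street"), ("John Smith", "12 High Street")),
    (("Jon Smith", "12 High St"), ("John Smith", "12 High Street")),
    (("Maria Garcia", "7 Lake Road"), ("Peter Oduya", "451 Elm Avenue")),
    (("Wei Zhang", "3 Mill Lane"), ("Olga Petrov", "98 Quay Side")),
]
weights = [2.0, 1.0]  # assumed: the name field counts twice as much as address

links = ensemble_link(pairs, weights)
print(links)  # one boolean per pair: True = predicted match
```

Using several similarity measures means that a pair missed by one measure (for example, token overlap penalises the abbreviation "St") can still be recovered by the others when the votes are combined.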
Citation
Jurek, A., Hong, J., Chi, Y., & Liu, W. (2017). A novel ensemble learning approach to unsupervised record linkage. Information Systems, 71, 40-54. https://doi.org/10.1016/j.is.2017.06.006
Journal Article Type | Article |
---|---|
Acceptance Date | Jun 27, 2017 |
Online Publication Date | Jun 28, 2017 |
Publication Date | Nov 1, 2017 |
Deposit Date | Jul 10, 2017 |
Publicly Available Date | Jun 28, 2018 |
Journal | Information Systems |
Print ISSN | 0306-4379 |
Publisher | Elsevier |
Peer Reviewed | Peer Reviewed |
Volume | 71 |
Pages | 40-54 |
DOI | https://doi.org/10.1016/j.is.2017.06.006 |
Keywords | unsupervised record linkage, data matching, classification, ensemble learning |
Public URL | https://uwe-repository.worktribe.com/output/877934 |
Publisher URL | https://doi.org/10.1016/j.is.2017.06.006 |
Related Public URLs | http://www.sciencedirect.com/science/article/pii/S0306437916305063 |
Files
InformationSystems-InPress-2017-06-28.pdf
(1.7 MB)
PDF