This chapter considers the confidentiality issues around linked data. It notes that the use and availability of secondary (administrative or social media) data, allied to powerful processing and machine learning techniques, means that in theory re-identification of confidential source data is likely in all types of releases.
In practice, there are barriers. Data linking is a complex and difficult process, and there are many things that could go wrong. However, this is less of a problem for a potential intruder, who is not concerned about re-identifying all data, but just enough to achieve his or her ends; the accuracy of the re-identification may not even be important if there is just a perception of poor confidentiality protection.
More importantly, focusing on data-centred models misleads us into asking "what can go wrong?" instead of "what will go wrong?". Aggregate statistics can be attacked to reveal hidden numbers of observations, but this does not necessarily disclose confidential information; aggregate statistics that could re-identify source data are not typically useful statistics. For the release of microdata, a user-centred perspective allows one to consider a range of non-statistical solutions which are both robust and fairly future-proof.
In summary, linked data does present a strong theoretical challenge to the protection of data, as statistical protection is outgunned by technology and software; but in practice a shift in focus to the evidence-based user-centred view shows that there are many directions for practical data protection to go.
Ritchie, F., & Smith, J. Confidentiality and linked data. In G. Roarson (Ed.), Privacy and Data Confidentiality Methods – a National Statistician’s Quality Review, 1-34. Office for National Statistics.