Pre-processing Optimisation applied to the Classical Integer Programming Model for Statistical Disclosure Control

A pre-processing optimisation is proposed that can be applied to the integer and mixed integer linear programming models that are used to solve the cell suppression problem in statistical disclosure control. In this paper we report our initial findings and acknowledge that there is much more work to be done. Early indications are that the pre-processing optimisation will considerably reduce the resources required by the solver, allowing either statistical tables to be protected more quickly or larger statistical tables to be protected. This pre-processing optimisation may be suitable for application to the τ-Argus Optimal Method used in protecting statistical tables.


Introduction
Many statistical tables are published with some of the table cells suppressed (left blank). This is done to prevent the disclosure of the individual respondents who contributed to the cell value. Cells that fail the primary rule are called primary, or sensitive, cells and must be protected by additional suppressed cells called secondary cells. Choosing which secondary cells to suppress is known in the literature as the cell suppression problem. The cell suppression problem involves choosing a set of secondary cells that removes the risk of disclosing the values of the primary cells whilst also minimising the information loss from the published statistical table.
The cell suppression problem is NP-hard when solved to optimality. Finding a feasible secondary suppression pattern is easy: for example, suppressing all cells gives a feasible pattern, but clearly not an optimal one. The difficulty arises when solving the problem optimally, because the number of possible solutions that need to be evaluated grows much faster than the size of the table to be protected. For a table with $n$ cells there are $2^n$ possible suppression patterns. The computational time required to find an optimal solution therefore grows rapidly with the table size, making optimal solutions for large tables difficult to find. Because the cell suppression problem is NP-hard, mixed integer programming (MIP) techniques can only find the optimal solution for small and medium sized statistical tables.
It is known that removing anything redundant from the mathematical program yields an efficiency gain. For example, redundant equations, variables and protection levels can be removed. Another pre-processing efficiency gain can be obtained by removing any table cells whose value is zero or whose values must be published, adjusting any marginal totals as necessary. This decreases the number of working variables and constraints that the solver requires to find a solution, which in turn allows larger statistical tables to be protected.
Linear programming models and local search algorithms are used on relaxed cell suppression problems to obtain near optimal solutions when integer programming models are infeasible. Some have moved away from trying to calculate the optimal solution and have instead employed heuristic techniques to find near optimal solutions quickly. Others have employed hybrid algorithms that combine linear programming and heuristic techniques [2] [8].
This paper presents further improvements that are obtained by considering the inferences an external attacker can make about a table. Section 2 presents definitions for the problem. Section 3 puts forward a conjecture for a pre-processing optimisation. Section 4 describes how the pre-processing optimisation can be implemented. Section 5 applies the pre-processing optimisation to the classical IP model for SDC. Section 6 describes our experimental setup. Section 7 contains our results. Section 8 contains our preliminary conclusions and Section 9 lists further research.

Definitions
The external attacker wishes to deduce the values of cells that have been suppressed in a published statistical table, in order to glean confidential information.
The assumption made in the literature is that the external attacker has only the knowledge which is provided in the published table, i.e. he is not aware which suppressed cells are primary and which are secondary, but he knows the number of suppressed cells in the table and their locations (the disclosure pattern). As each table has row and column totals, often referred to as marginals, the external attacker is able to calculate lower and upper bounds (the feasibility range) for each of the suppressed cells by solving a set of linear constraint equations [1] [6].
A statistical table with marginal totals can be represented as a set of cells $a_i$, $i = 1, \ldots, n$, satisfying $m$ linear constraint equations such that $Ma = 0$, where each $M_{ij}$ takes one of the values $-1$, $0$ or $+1$ (see [1] and [6] for details of the model). The statistical agency will define a set $P$ of primary cells whose publication will be suppressed in order to protect the confidentiality of the contributors to those cells. The statistical agency will provide lower and upper protection levels ($lpl$ and $upl$) for each cell in $P$ such that an external attacker must not be able to calculate $a_p$ within the range $lpl_p$ to $upl_p$. For $a_p$ to be safe we need $\underline{a}_p \le lpl_p$ and $\overline{a}_p \ge upl_p$, where $\underline{a}_p$ is the lower bound and $\overline{a}_p$ the upper bound of the feasible range that the external attacker can calculate for $a_p$ if only the primary cells $P$ have been suppressed [1]. Noting that some primary cells may occur alone in a marginal total, whereas others (e.g. those sharing rows or columns) may effectively protect each other, we define the following partition of the set of primary cells $P$.
An exposed primary cell in a statistical table with marginal totals is one whose value can be calculated, within a given lower and upper protection limit, by an external attacker when only the primary cells $P$ have been suppressed. That is to say, $p$ is a member of the set $E$ of exposed primary cells if $\underline{a}_p > lpl_p$ or $\overline{a}_p < upl_p$. $E \subseteq P$.
A not exposed primary cell in a statistical table with marginal totals is one whose value cannot be calculated, within a given lower and upper protection limit, by an external attacker when only the primary cells $P$ have been suppressed. That is to say, $p$ is a member of the set $N$ of not exposed primary cells if $\underline{a}_p \le lpl_p$ and $\overline{a}_p \ge upl_p$. $N \subseteq P$, $E \cup N = P$ and $E \cap N = \{\}$. Not exposed primary cells exist in a statistical table because of their locations in that table. Each not exposed primary cell receives sufficient protection from other primary cells in the table to prevent an external attacker from being able to calculate a feasible range of values within the given protection level.
Proposition 1: As not exposed primary cells are already sufficiently protected, they do not require secondary cells for their protection.
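The partition of $P$ into $E$ and $N$ can be illustrated with a minimal sketch. It assumes the attacker's feasibility range for each primary cell has already been computed (for example by a linear programming solver); the function name, the dictionary layout and the sample numbers are hypothetical and serve only to illustrate the definitions above.

```python
def partition_primaries(bounds, protection):
    """Split primary cells into exposed (E) and not exposed (N).

    bounds[p]     = (lower, upper): attacker's feasibility range for cell p
    protection[p] = (lpl, upl): protection levels supplied by the agency
    Both dictionaries are hypothetical inputs for this sketch.
    """
    exposed, not_exposed = set(), set()
    for p, (lower, upper) in bounds.items():
        lpl, upl = protection[p]
        # p is exposed if either attacker bound falls strictly inside
        # the range that the agency wants to keep uncertain.
        if lower > lpl or upper < upl:
            exposed.add(p)
        else:
            not_exposed.add(p)
    return exposed, not_exposed


# Illustrative data: cell 1 has value 10, cell 2 has value 20,
# both with protection levels of 10% either side of the value.
bounds = {1: (0, 50), 2: (19, 21)}
protection = {1: (9, 11), 2: (18, 22)}
E, N = partition_primaries(bounds, protection)
```

By construction the two returned sets are disjoint and their union is the set of primary cells supplied, mirroring $E \cup N = P$ and $E \cap N = \{\}$.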
An initially exposed primary cell is a primary cell that can be exposed by an external attacker, when only the primary cells $P$ have been suppressed, without requiring the exposure of any other primary cell. For example, there may be only one primary cell in a row or column. Let $L_p$ be the subset of the linear equations $M$ that contain the value $+1$ or $-1$ in the location for $a_p$, $L_p \subseteq M$. This subset $L_p$ contains only the linear equations that apply to $a_p$. Then $p$ is a member of the set $I$ of initially exposed primary cells if $\underline{a}_p > lpl_p$ or $\overline{a}_p < upl_p$ when the bounds $\underline{a}_p$ and $\overline{a}_p$ are calculated using only the equations in $L_p$. Conversely, $p$ is not a member of $I$ if $\underline{a}_p \le lpl_p$ and $\overline{a}_p \ge upl_p$.
A consequentially exposed primary cell is an exposed primary cell that is not an initially exposed primary cell. That is to say, $p$ is a member of the set $C$ of consequentially exposed primary cells if $p$ is a member of $E$ but not a member of $I$. $C \subset E$, $C \cup I = E$ and $C \cap I = \{\}$. Hence a consequentially exposed primary cell is only vulnerable to an external attacker when at least one other exposed primary cell has been exposed. When an external attacker has exposed a primary cell it was for one of two reasons: the cell was either initially or consequentially exposed. If $I = \{\}$ then both $C = \{\}$ and $E = \{\}$.

Conjecture
In order to make a published statistical table safe from an external attacker, only the protection of the initially exposed primary cells, $I$, needs to be considered when selecting secondary cells to suppress.
Our reasoning is as follows. To protect the primary cells in a published statistical table a set of secondary cells, $S$, must be suppressed along with the primary cells, $P$. When choosing $S$ to protect the primary cells, $P$, minimising the loss of information from the published statistical table is considered. Let $S_p$ be a set of secondary cells that protect $p$, $p \in P$. Let $L_{p \cup S_p}$ be the subset of the linear equations $M$ that contain the value $+1$ or $-1$ in the locations for $a_p$ and all $a_s$ where $s \in S_p$, $L_{p \cup S_p} \subseteq M$. This subset $L_{p \cup S_p}$ contains only the linear equations that apply to $a_p$ and all associated $a_s$. The set of secondary cells, $S_p$, is primarily chosen so that $\underline{a}_p \le lpl_p$ and $\overline{a}_p \ge upl_p$, where the bounds are calculated using the equations in $L_{p \cup S_p}$. From the definition of initially exposed primary cells we know that for $p \notin I$, $\underline{a}_p \le lpl_p$ and $\overline{a}_p \ge upl_p$ when $S_p = \{\}$. We also know that for $p \in I$, $\underline{a}_p > lpl_p$ or $\overline{a}_p < upl_p$ when $S_p = \{\}$. From the definition of $S_p$ we know that for $p \in I$, $\underline{a}_p \le lpl_p$ and $\overline{a}_p \ge upl_p$ when $S_p \ne \{\}$. Therefore the set of secondary suppressed cells, $S_p$, is only required to protect $a_p$ when $p \in I$; it is not required in the protection of $a_p$ when $p \notin I$. So the only time $S_p \ne \{\}$ is when $p \in I$, that is, $p \in I \Leftrightarrow S_p \ne \{\}$. Hence only the protection of the initially exposed primary cells, $I$, needs to be considered when selecting secondary cells to suppress in order to make a published statistical table safe from an external attacker.
A Corollary to this conjecture is that if I = {} then N = P and therefore the statistical table is already adequately protected.

Finding initially exposed primary cells without using a solver
We present here a method that provides a superset of the elements in P that contains all those in I.
For each element $p \in P$ let $J$ denote the set of linear constraint equations (equivalent to rows of $M$) in which $p$ participates, i.e. $\forall j \in J \cdot M_{pj} \ne 0$.
A necessary, but not sufficient, condition for us to establish that $p \in I$ is the existence of at least one marginal total in which the amount of "uncertainty" (and hence protection) provided by the absolute values of the other suppressed primary cells in that total is less than the required protection limits. Formally, for each $j \in J$ let $H_j$ be the set of primary cells in $j$; we require that, for at least one $j$, the protection provided by the other cells in $H_j$ falls short of either the lower or the upper protection requirement of $p$.
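The solver-free test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the protection requirement of $p$ is taken to be $a_p - lpl_p$ below and $upl_p - a_p$ above, the protection offered by an equation is taken to be the sum of the absolute values of the other primary cells in it, and the function name and data layout are hypothetical.

```python
def candidate_initially_exposed(values, primaries, equations, lpl, upl):
    """Return a superset of I using one marginal equation at a time.

    values[c]  : value of cell c
    primaries  : set of primary cell ids (P)
    equations  : list of sets, each holding the cell ids of one marginal total
    lpl, upl   : dictionaries of lower / upper protection levels per primary
    All names and structures are assumptions made for this sketch.
    """
    candidates = set()
    for p in primaries:
        need_lower = values[p] - lpl[p]   # assumed lower protection requirement
        need_upper = upl[p] - values[p]   # assumed upper protection requirement
        for cells in equations:
            if p not in cells:
                continue  # equation j does not involve p
            # Protection provided by the other primary cells in this total.
            slack = sum(abs(values[h]) for h in cells
                        if h in primaries and h != p)
            if slack < need_lower or slack < need_upper:
                candidates.add(p)
                break  # one insufficient equation is enough
    return candidates


# Tiny 2 x 2 interior table: rows {1, 2}, {3, 4}; columns {1, 3}, {2, 4}.
values = {1: 10, 2: 5, 3: 7, 4: 20}
primaries = {1, 2, 4}
equations = [{1, 2}, {3, 4}, {1, 3}, {2, 4}]
lpl = {1: 9.0, 2: 4.5, 4: 18.0}    # 10% below each value
upl = {1: 11.0, 2: 5.5, 4: 22.0}   # 10% above each value
superset_of_I = candidate_initially_exposed(values, primaries, equations, lpl, upl)
```

In this toy table, cell 1 is alone among the primaries in its column and cell 4 is alone in its row, so both are flagged, while cell 2 shares both its row and its column with another primary cell and is not.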

Example
Taking a 6 by 6 statistical table with marginal totals (Table 1) as an example, the process of finding I, C and N can be shown. In our example the statistical agency has defined P = {8, 12, 15, 16, 19, 20, 24, 27}. When the test for the fully exposed primary cells is applied, five primary cells are exposed, E = {16, 19, 20, 24, 27}, and therefore N = {8, 12, 15}. The values of cells 16, 20, 24 and 27 are calculated exactly, and the feasibility range of cell 19 is calculated within its lower and upper protection levels, which in this case are 10% of the cell's value.
By contrast, applying the test for initially exposed primary cells (Table 2), we find that I = {16, 19, 24}, and therefore N ∪ C = {8, 12, 15, 20, 27}. For this pre-processing optimisation to work it is not necessary (nor is it possible) to determine which cell is in C and which is in N.

Applying the Conjecture to the Classical IP Model for SDC
The cell suppression problem is the problem faced by statistical agencies when they release statistical tables: they must balance the risk of disclosing confidential information against the loss of information from the table caused by not publishing the suppressed cells [3] [7] [4] [8] [5].
Here we consider the case of a single external attacker who has no knowledge other than what is in the published table. It is usually assumed that the external attacker, prior to the attack, knows that the cell $a_i$ lies within the range from $lb_i$ to $ub_i$. If the external attacker has no knowledge other than that published in the table then $lb_i = 0$ and $ub_i = \infty$. Fischetti and Salazar-González [3], when they defined the classical model, introduced a weight $w_i$ for each cell $a_i$ to represent the information loss should the cell $a_i$ be suppressed. A variable $z_i$ was introduced for each $a_i$ to indicate whether or not $a_i$ has been suppressed ($z_i = 0$ means that $a_i$ is published and $z_i = 1$ means that $a_i$ is suppressed). Two tables were introduced that are consistent with $a = [a_1, \ldots, a_n]$; these tables, $f^p = [f^p_1, \ldots, f^p_n]$ and $g^p = [g^p_1, \ldots, g^p_n]$, are used to calculate the lower and upper feasible limits for $p \in P$. In the classical model the lower and upper bounds ($lb_i$ and $ub_i$) are translated into $LB_i$ and $UB_i$, where $LB_i = a_i - lb_i$ and $UB_i = ub_i - a_i$. Those cells that are suppressed and are members of $P$ are called primary suppressed cells, and those cells that are suppressed but are not members of $P$ are called secondary suppressed cells.

Classical Model
for $i = 1, \ldots, n$ and for all $p \in P$:
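For orientation, the classical model of [3] is commonly stated along the following lines. The block below is a sketch written in the notation of this section and should be read as an assumption about the intended form rather than a verbatim statement: it takes $f^p$ as the table attaining the lower limit and $g^p$ as the table attaining the upper limit (the order in which the text lists them), treats $lpl_p$ and $upl_p$ as the ends of the protected range as in the Definitions, and omits any further protection requirements that [3] may impose.

```latex
\begin{align*}
\min \quad & \sum_{i=1}^{n} w_i z_i \\
\text{subject to} \quad
  & M f^p = 0, \qquad M g^p = 0, \\
  & a_i - LB_i z_i \;\le\; f^p_i \;\le\; a_i + UB_i z_i, \\
  & a_i - LB_i z_i \;\le\; g^p_i \;\le\; a_i + UB_i z_i, \\
  & f^p_p \le lpl_p, \qquad g^p_p \ge upl_p, \\
  & z_i \in \{0, 1\},
\end{align*}
```

When $z_i = 0$ the bound constraints force $f^p_i = g^p_i = a_i$, so a published cell keeps its true value in every table the attacker can construct; when $z_i = 1$ the cell may range over $[lb_i, ub_i]$, since $a_i - LB_i = lb_i$ and $a_i + UB_i = ub_i$.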

Modified Classical Model
Applying the conjecture in this paper, we derived the modified classical model from the classical model of Fischetti and Salazar-González as follows: for $i = 1, \ldots, n$; $z_p = 1$ for all $p \in P$; and for all $p \in I$ (initially exposed primary cells):

Experimental Setup

Comparing the Classical and Modified Classical Models
A set of 20 2-dimensional non-hierarchical magnitude statistical tables with marginal totals (see Table 3) was generated for the purpose of comparing the classical and modified models [8]. These statistical tables with marginal totals were protected using a SAS/OR implementation of the classical model and a SAS/OR implementation of the modified (initially exposed primary cells only) classical model, using the same computer. These experiments were run at ONS on a Dell Optiplex GX270 with 2GB RAM. The SAS version used was SAS 9 with the SAS/OR Opt module. There are a variety of solvers in SAS; OptMILP was used. The selected secondary suppressed cells, the number of variables required, the number of constraints and the required cpu time were recorded for comparison. For each of the statistical tables the percentage change in performance between the two models was calculated. For each of these statistical tables the improvement in the number of variables, constraints and cpu time was plotted against the reduction in the number of primary cells needing to be considered, see Fig. 1.
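A percentage change of this kind is conventionally computed relative to the classical model's figure. The sketch below assumes that convention (the function name is hypothetical) and applies it to the figures reported for Table 1: 833 and 343 variables, 1824 and 689 constraints, 23.28 and 3.75 seconds of cpu time.

```python
def percentage_improvement(classical, modified):
    """Percentage reduction of `modified` relative to `classical`.

    Assumed definition: 100 * (classical - modified) / classical.
    """
    return 100.0 * (classical - modified) / classical


# Figures reported for Table 1 (classical model, modified model).
table1 = {
    "variables": (833, 343),
    "constraints": (1824, 689),
    "cpu_seconds": (23.28, 3.75),
}
improvements = {name: percentage_improvement(c, m)
                for name, (c, m) in table1.items()}
for name, pct in improvements.items():
    print(f"{name}: {pct:.1f}% improvement")
```

A positive result means that the modified model needed less of the resource; a negative result would indicate that it needed more.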

Estimating the Improvement for Different Table Sizes
A set of 3360 2-dimensional non-hierarchical statistical tables with marginal totals, with sizes ranging from 100 cells to 900,000 cells, was generated with random values. For each table size, 40 tables were generated; these tables had either 10% or 25% primary cells and either 10% or 20% of cells set to zero. For each of these tables the percentage reduction in the number of primary cells that need to be considered when using the modified classical model was plotted against the table size, see Fig. 2.

Results

Comparing the Classical and Modified Classical Models
Both models, classical and modified, selected the same secondary cells to suppress. The number of variables required, the number of constraints and the required cpu time for each model are recorded in Table 4. For every percentage reduction in the number of primary cells that need to be considered when using the modified classical model to protect a published statistical table there is an equal percentage improvement in the number of variables and constraints required to solve the associated linear programme. There is also a similar improvement in the required cpu time; however, the relationship is not as smooth as it is for the number of variables and constraints required, see Fig. 1. For those statistical tables where all of the primary cells are initially exposed, P = I, the modified classical model may require more cpu time than the classical model.
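The dependence of model size on the number of primary cells considered can be illustrated with a small counting sketch. The decomposition below is an assumption, not something stated in this paper: one $z$ variable per cell plus two attacker tables per considered primary cell, and, per considered primary cell, two sets of table equations, four sets of cell bounds and three protection-level constraints, plus one $z_p = 1$ constraint per primary cell. For a table of 49 cells and 14 equations (assuming Table 1 has a 6 by 6 interior plus marginals) it reproduces the figures reported for Table 1.

```python
def model_size(n_cells, n_equations, n_primaries, n_considered):
    """Estimate (variables, constraints) of the attacker-based IP model.

    n_considered is |P| for the classical model and |I| for the modified
    model. The counting scheme is an assumed decomposition, see lead-in.
    """
    # One z_i per cell, plus f^p and g^p tables for each considered primary.
    variables = n_cells * (1 + 2 * n_considered)
    # Per considered primary: M f = 0 and M g = 0, lower and upper bounds
    # on every f_i and g_i, and three protection-level constraints.
    per_primary = 2 * n_equations + 4 * n_cells + 3
    # One z_p = 1 constraint for every primary cell, considered or not.
    constraints = n_primaries + n_considered * per_primary
    return variables, constraints


classical = model_size(n_cells=49, n_equations=14, n_primaries=8, n_considered=8)
modified = model_size(n_cells=49, n_equations=14, n_primaries=8, n_considered=3)
```

Under this counting both quantities grow linearly in the number of primary cells considered, which is consistent with the near-proportional improvements reported above.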

Estimating the Improvement for Different Table Sizes
The reduction in the number of primary cells that needed to be considered when using the modified classical model was affected by some of the properties of the statistical tables being protected.The reduction was greater for larger tables, tables that were more square than long and tables that had a higher proportion of primary cells.This is explained by each factor increasing the probability that more than one primary cell would occupy the same row or column and hence provide some protection to each other.

Conclusions
This pre-processing optimisation has been shown to be very effective when applied to the classical IP SDC model developed by Fischetti and Salazar-González [3]. This optimisation works by reducing the resources that the solver requires to protect statistical tables, hence allowing statistical tables to be protected more quickly or allowing larger statistical tables to be protected. The classical IP SDC model has been implemented, as the Optimal Method, in the SDC tool τ-Argus [5] [9]. It may be the case that this pre-processing optimisation could be applied to the τ-Argus Optimal Method to enable it to handle larger tables.

Further Research
How the properties of the statistical tables affect the amount of improvement that this pre-processing optimisation provides requires further investigation. How hierarchical statistical tables affect the amount of improvement that this pre-processing optimisation provides also requires further investigation. This pre-processing optimisation should be applied to other SDC techniques to see if similar performance improvements can be obtained.

Fig. 1. Percentage Improvement in Number of Variables needed by SAS/OR, the Number of Constraints needed by SAS/OR and the CPU Time needed by SAS/OR by the Percentage Reduction in Primary Cells Considered.
Fig. 2.
If the external attacker is able to calculate $\underline{a}_p > lpl_p$ or $\overline{a}_p < upl_p$ then $a_p$ is unsafe ($a_p$ can be disclosed). It should be noted that we are considering the external attacker on tables which have not yet been protected by secondary suppressed cells, in order to gauge the level of disclosiveness of the tables for our pre-processing optimisation.

Table 1. Example of a 6 by 6 statistical table with marginal totals. There are 8 primary cells. Each primary cell has its cell number top left and number of contributors bottom right.

Table 2. Workings to find members of the superset of I. Any cell for which the sum of the other primary cells in either its row or its column is smaller than its protection range is a member of the superset of I.

Applying a SAS/OR implementation of the classical IP SDC model to the whole set of primary cells in Table 1, the set of secondary cells S = {37, 38, 40} was obtained. The solver required 833 variables, 1824 constraints and 23.28 seconds of cpu time to protect Table 1. Applying a SAS/OR implementation of the modified classical IP SDC model to only the initially exposed primary cells, I = {16, 19, 24}, in Table 1, the set of secondary cells S = {37, 38, 40} was also obtained. The solver required 343 variables, 689 constraints and 3.75 seconds of cpu time to protect Table 1.

Table 3. Range of statistical tables with marginal totals

Table 4. Comparison of the two models

Table 5. Percentage Reduction in Primary Cells Considered, the Number of Variables needed by SAS/OR, the Number of Constraints needed by SAS/OR and the CPU Time needed by SAS/OR