Face Recognition and Verification Using Photometric Stereo: The Photoface Database and a Comprehensive Evaluation

This paper presents a new database suitable for both 2-D and 3-D face recognition based on photometric stereo (PS): the Photoface database. The database was collected using a custom-made four-source PS device designed to enable data capture with minimal interaction from the subjects. The device, which automatically detects the presence of a subject using ultrasound, was placed at the entrance to a busy workplace and captured 1839 sessions of face images with natural pose and expression. The acquired data are therefore more realistic for everyday use than existing databases, making the database an invaluable test bed for state-of-the-art recognition algorithms. The paper also presents experiments with various face recognition and verification algorithms using the albedo, surface normals, and recovered depth maps. Finally, we have conducted experiments to demonstrate how different methods in the pipeline of PS (i.e., normal field computation and depth map reconstruction) affect recognition and verification performance. These experiments help to 1) demonstrate the usefulness of PS, and our device in particular, for minimal-interaction face recognition, and 2) highlight the optimal reconstruction and recognition algorithms for use with natural-expression PS data. The database can be downloaded from http://www.uwe.ac.uk/research/Photoface.


I. INTRODUCTION
There has been a plethora of face recognition algorithms published in the literature during the last few decades. During this time, a great many databases of face images, and later 3-D scans, have been collected as a means to test these algorithms. Each algorithm has its own strengths and limitations, and each database is designed to test a specific aspect of recognition. This paper is motivated by the desire to have an operational face recognition system where users do not need to interact with the data capture device and do not need to present a particular pose or expression. While we do not completely solve these immensely challenging problems in this paper, our aim is to contribute towards this endeavour by means of: 1) Construction of a novel 3-D face capture system, which automatically detects subjects casually passing the device and captures their facial geometry using photometric stereo (PS), a 3-D reconstruction method that uses multiple images with different light source directions to estimate shape [2]. To the best of our knowledge this is one of the first realistic commercial acquisition arrangements for the collection of 2-D/3-D facial samples using PS. 2) Collection of a 2-D and 3-D face database using this device, with natural expressions and poses. 3) Detailed experiments on a range of 3-D reconstruction and recognition/verification algorithms to demonstrate the suitability of our hardware to the problem of noninteractive face recognition. The experiments also show which reconstruction and recognition methods are best suited to natural pose/expression data from PS. This is one of the first detailed experimental studies to demonstrate how different methods in the pipeline of PS affect recognition/verification performance.

A. Face Databases
Face recognition researchers have been collecting databases of face images for several decades now [3, Chapter 13]. While some databases can be regarded as superior to others, each is designed to test different aspects of recognition and has its own strengths and weaknesses. One of the largest databases available is the FERET database [4]. This has a total of 1199 subjects with up to 20 poses, two expressions, and two light source directions. The FERET database was originally acquired using a 35 mm camera. Others, for example the widely used CMU PIE database [5] or the Harvard RL database [6], concentrate specifically on varying the capture conditions such as pose and illumination. Another popular database, collected for the purpose of face verification under well-controlled conditions, is the XM2VTS database [7].
The PIE database is one of the most extensively researched. This is due to the fact that the faces are captured under highly controlled conditions involving 13 cameras and 21 light sources. The Yale B database [8] offers similar advantages to the PIE database, with an even larger number of lighting conditions (64) and nine poses. However, the Yale B database includes just ten subjects. The original Yale database [9] was designed to consider facial expressions, with six types being imaged for 15 subjects. Finally, the extended Yale B database was published, which contains 28 subjects with nine different poses and 64 illumination conditions [10].
Even though the PIE [5], Yale [8], and extended Yale [10] databases provide facial samples taken under different illumination directions, they contain few subjects. More recently, the CMU Multi-PIE database [11] has been constructed with the aim of extending the image sets to include a larger number of subjects (337) and to capture faces taken in four different recording sessions. This database was recorded under controlled laboratory conditions, as with the others mentioned above.
A recently emerged trend in face recognition research has been to incorporate three-dimensional information into the recognition process. This has naturally led to the collection of databases with 3-D facial samples. (Another trend in face recognition is totally unconstrained face matching, suitable for face tagging experiments, for which the "Labelled Faces in the Wild" database has recently been collected [12]; that trend is outside the scope of this paper, since here we aim at describing a database suitable for an industrial setting.) This trend is due to the fact that changes in viewing conditions (e.g., lighting variation) adversely affect the 2-D appearance of a face image but not the 3-D appearance. The number of existing databases containing 3-D facial samples suitable for 3-D face recognition is constantly increasing. In what follows we briefly present the publicly available ones, categorizing them according to the degree of difficulty in obtaining 3-D face samples and their equipment cost.
1) 3-D Face Databases: One of the first publicly available datasets containing 3-D face samples for face verification was presented in [13], [14]. The database contains approximately 100 subjects. It was collected using structured light technology, a low-cost technology that is widely used nowadays (for example in Kinect, a registered trademark of Microsoft Corp.). However, despite the obvious advantage of low cost, commodity structured light technologies offer smaller spatial resolution, causing many captured models to have missing parts and holes. More recent publicly available databases that employ more sophisticated commercial structured light technology include the Bosphorus [15] and the University of York 3-D [16], [17] databases. The Bosphorus database, which can be used for both facial expression
and face recognition experiments, was captured using a commercial structured-light-based 3-D digitizer, the Inspeck Mega Capturor II 3-D. The face recognition version contains 47 subjects with 53 face scans per subject. The University of York 3-D face database consists of 350 subjects with different facial expressions (happiness and anger) as well as images with closed eyes and raised eyebrows. One of the most widely known datasets of 3-D facial samples is the FRGC database [18] and its extension, namely ND 2006 [19]. The FRGC database is a multipartition 2-D and 3-D database that contains a validation set consisting of 466 subjects and a total of 4007 images. The ND 2006 database consists of 888 subjects and multiple images per subject of various posed facial expressions (happiness, disgust, sadness, and surprise). Both databases were captured using a rather expensive camera, the Minolta Vivid 910 laser range scanner [20], which provides 3-D face samples of up to 112,000 vertices and requires the full cooperation of the subject. Another database that was collected using the same camera is the CASIA 3-D Face database [21], [22]. It consists of 123 subjects with 10 images per subject displaying different facial expressions (smile, laughter, anger, and surprise) and closed eyes.
A laser digitizer (the Minolta VI-700) was used to capture the GavabDB database [23]. It contains 61 subjects (all Caucasian) displaying different smiles (open/closed mouth) and a random facial expression chosen by each subject. Two publicly available databases that have mainly been used for 3-D/4-D facial expression recognition, but could also be used for face recognition, are the BU3D [24] and BU4D [25] databases. Both were captured using 3DMD acquisition setups, thereby combining passive and active stereo. However, these setups are of high cost and require the full collaboration of the subject. Both databases consist of approximately 100 subjects, but unfortunately they contain only one session each, which does not meet the requirements for face recognition. The Texas 3-D Face Recognition Database [26], [27] consists of 3-D models that were captured using an MU-2 stereo imaging system (similar to that used in BU3D) and contains 105 subjects.
The purpose of the new database described in this paper is to capture a large number of faces in 3-D in a more industrial setting. However, most existing 3-D capture devices (e.g., [20], [28]) are both financially and computationally expensive, which can be highly inhibiting for commercial application. By contrast, we use a four-source high-speed capture arrangement (which can now easily be deployed for less than GBP 2,000), which permits the use of PS methods [2] to recover the 3-D information with minimal computational expense. Furthermore, the device is significantly cheaper than most other 3-D capture mechanisms. A photograph of our device is shown in Fig. 1 and it is explained in more detail in Section III.
Another advantage of our capture mechanism is that PS methods provide a one-to-one relationship between the surface normals and the pixels of the acquired image. This means that each pixel corresponds to a facet with one normal. Moreover, methods that provide subpixel reconstructions using PS are available [29], allowing lower resolution capture devices to provide detailed estimates. It should be mentioned that this one-to-one relationship allows detailed reconstructions with applications in other areas such as object inspection for defects, surveillance, and recognition.
The aim here is to capture poses and expressions that are as natural as possible. For this reason, we placed our capture device near the entrance to a busy workplace and gave all of the volunteer subjects the sole instruction to "walk through the archway". The database therefore offers an ideal testbed for face recognition algorithms designed for real-world applications where minimal interaction is desired. As PS can be applied to the four images (one image per light source) to calculate the 3-D structure of the face, the database allows 2-D, 3-D, and hybrid algorithms to be evaluated. Our database consists of 1,839 capture sessions of 261 subjects.

B. Reconstruction and Recognition Methods for PS Data
In addition to describing the device and the database, we also present experiments on the Photoface database, applying face recognition and verification techniques to the albedo, depth, and surface normal images, all of which may be calculated from PS. For the albedo, we applied algorithms from two of the most popular families of face recognition from intensity images: 1) subspace methods such as Principal Component Analysis (PCA), Nonnegative Matrix Factorization (NMF), and Linear Discriminant Analysis (LDA) [30], [31], using the vectorized albedo images as feature vectors; 2) feature-based methods using Elastic Graph Matching (EGM) architectures [32]-[34]. For the depth map, we applied exactly the same methods as in the case of albedo images. Finally, for the normals we propose a novel method based on metric multidimensional scaling of a norm of angle differences.
We stress at this point that the experiments are intended to test the various algorithms specifically on PS data of natural pose and expression. The focus therefore is neither to compare various 2-D/3-D face recognition and verification methods against each other for general application, nor to demonstrate that fusion of 3-D and 2-D information increases recognition performance [35], [36]. A comparative study of 3-D face recognition/verification methods was recently published in [36], where a large inventory of 3-D recognition methods was implemented and tested on various representations of facial geometry (i.e., depth images and normal fields). Moreover, the authors of [36] applied various fusion strategies to the results of 3-D face recognition methods. Another recent study on the fusion of information from intensity and depth images was presented in [35].
The aims of the experiments conducted in this paper are:
• to demonstrate how different methods in the pipeline of PS (i.e., normal field computation and depth map reconstruction methods) affect recognition/verification performance;
• to verify that conclusions similar to those of [35] can be drawn for the modalities derived from PS methods. In particular, we would like to verify that (1) similar recognition performance can be obtained using a single 2-D intensity image or a single 3-D image (or normal field), (2) combined 2-D + 3-D face recognition performs better than using either 3-D or 2-D alone, and (3) fusing results from two 2-D or two 3-D images, using a fusion scheme similar to that used in multimodal 2-D + 3-D recognition, also improves performance over using a single 2-D image.
We applied three different PS methods to compute the normal field and the albedo image, and five different integration methods that compute the height map from the normal field. To the best of the authors' knowledge this is the first experiment on a real-world PS database which also explores the effect of using different methods in the processing pipeline.
The rest of the paper is organized as follows. In Section III, the device for image capture and the collected database are described. In Section IV, we describe the PS methods for the computation of the albedo and normal field. In Section V, we describe the methods for surface reconstruction. In Section VI, we describe the methods we tested for face recognition using albedo, depth maps, and normal fields. Experimental results are described in Section VII. Finally, conclusions are drawn in Section VIII.

A. The Device
The Photoface database was collected using the custom-made four-source PS device shown in Fig. 1. Unlike previous constructions, our aim was to capture the data using hardware that could easily be deployed in commercial settings. Individuals are told to walk through the archway towards the camera located on the back panel and exit through the side. This arrangement makes the device suitable for use at entrances to buildings, high security areas, airports, etc. The presence of an individual is detected by an ultrasound proximity sensor placed before the archway. This can be seen in Fig. 1 on the horizontal beam towards the left-hand side of the rig.
The hardware components used to create the system were as follows:
• Camera: Basler 504 kc with Camera Link interface, operating at 200 fps with 1 ms exposure time; placed at a distance of approximately 2 m from the head of the subject.
• Lens: 55 mm Sigma lens.
• Light sources: low-cost Jessops M100 flashguns, at a distance of approximately 75 cm from the head of the subject.
• Device trigger: Baumer highly directional ultrasound proximity switch, with a range of 70 cm.
• Hardware IO card (for interfacing the camera, frame grabber, and light sources): NI PCI-7811 DIO with Field Programmable Gate Array (FPGA).
• Frame grabber: NI PCIe-1429.
• Interfacing code: NI LabVIEW (the reconstruction and recognition algorithms were written in MATLAB).
The device also contains a monitor (as can be seen in Fig. 1) that provides instructions and could be used to indicate whether or not an individual was recognized in the case of a recognition scenario, or whether an identity claim was accepted or rejected in the case of a verification scenario.
The device captures one image of the face for each light source, in a total of approximately 20 ms. This rate of image capture, which we adopt for all our experiments, was regarded as adequate to reduce the interframe motion to below a few pixels. The only case in which the performance of the system can be expected to deteriorate significantly is when a person runs through the device. The required frame rate was determined through experimentation. Fig. 2 shows an example of four raw images of an individual. The resolution of the original captured image is 1280 × 1024 px, although the images in our database are automatically [37] cropped to the face itself (typically of the order of 600 × 800 px).
The capture device was placed at the entrance of a busy workplace for a period of four months. Volunteer employees casually passed through the booth at regular intervals throughout this period. No instructions were given, other than to ask them to walk through the archway towards the camera and monitor. Thus, the volunteers typically passed through the device on their way in and out of the building. This arrangement is of great importance for face recognition testing as: 1) It meant that the capture conditions were realistic for a real-world example. This is in contrast with existing face databases such as the widely used CMU-PIE database [11] or the FRGC database [18]. 2) The whole setup was noninvasive, thus being suitable for any recognition algorithms developed for immediate commercial use.

B. Statistics of the Database
The Photoface database was collected between February and June 2008. It consists of a total of 1,839 sessions of 261 subjects and a total of 7,356 images. (The database now contains 3,187 sessions of 453 subjects.) Some individuals used the device only once, while others walked through it more than 20 times. The majority of people in the database are male (227, compared to 34 female). Furthermore, the subjects are predominantly Caucasian (257 subjects). Due to the lack of supervision, most of the captured faces in the database display an expression (for example, more than 600 smiles and more than 200 surprise, open-mouth, or scream-like expressions were recorded).
Regarding repeat usage, 98 people walked through the device only once. For 126 of the 163 subjects that used the device more than once, the sessions were collected over a period of more than a week. For the majority of those (90 people), this interval was greater than one month. The number of subjects as a function of the number of recordings made by the device is depicted in Fig. 3.

IV. PHOTOMETRIC STEREO FOR SURFACE NORMAL COMPUTATION
This section presents the various PS methods that we adopt to estimate the surface normal at each pixel from the raw images. We first summarize the original method [2], [38, Section 5.4], which we later apply to three and four sources. This is followed by a ray tracing method and an optimization method, both of which aim to diminish complications such as shadows and highlights.

A. Standard Photometric Stereo
The original method for PS assumes Lambert's law and essentially derives a matrix equation for the expected intensity at each pixel for given light source directions, $\mathbf{l}$, and surface albedo, $\rho$. For a single light source, Lambert's law can be written for a pixel at location $(x, y)$ as
$$ i(x, y) = \rho(x, y)\, \mathbf{l}^{\top} \mathbf{n}(x, y) \tag{1} $$
where $i$ is the measured pixel intensity and $\mathbf{n}$ is the surface normal. This is extended to $m$ light sources using the following matrix equation:
$$ \begin{bmatrix} i_1(x, y) \\ \vdots \\ i_m(x, y) \end{bmatrix} = \rho(x, y) \begin{bmatrix} \mathbf{l}_1^{\top} \\ \vdots \\ \mathbf{l}_m^{\top} \end{bmatrix} \mathbf{n}(x, y) \tag{2} $$
where $i_k$ is the $k$th measured pixel intensity and $\mathbf{l}_k$ is the $k$th light source vector. Using the measured intensity values from the $m$ images and assuming that the light source vectors are known a priori, we can then solve this equation for the albedo and the surface normal components at each pixel. Note the assumption that the camera response is linear.
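As an illustration of how (2) is typically solved in practice, the following sketch recovers per-pixel albedo and unit normals from a stack of images by least squares. This is our own minimal illustration (hypothetical function and variable names), assuming known light source directions and a linear camera response; it is not the authors' implementation.

```python
import numpy as np

def photometric_stereo(images, lights):
    """Standard Lambertian photometric stereo.

    images : array of shape (m, H, W), one grayscale image per light source
    lights : array of shape (m, 3), unit light-source direction vectors
    Returns (albedo, normals) with shapes (H, W) and (H, W, 3).
    """
    m, H, W = images.shape
    I = images.reshape(m, -1)                               # stack intensities: (m, H*W)
    # Solve L g = i in the least-squares sense, where g = albedo * normal
    G, _, _, _ = np.linalg.lstsq(lights, I, rcond=None)     # (3, H*W)
    G = G.T.reshape(H, W, 3)
    albedo = np.linalg.norm(G, axis=2)                      # albedo = |g|
    normals = G / np.maximum(albedo[..., None], 1e-8)       # unit surface normals
    return albedo, normals

# Minimal usage with synthetic data (four sources, as in the Photoface device)
if __name__ == "__main__":
    lights = np.array([[ 0.5,  0.5, 1.0],
                       [-0.5,  0.5, 1.0],
                       [ 0.5, -0.5, 1.0],
                       [-0.5, -0.5, 1.0]])
    lights /= np.linalg.norm(lights, axis=1, keepdims=True)
    true_n = np.dstack([np.zeros((8, 8)), np.zeros((8, 8)), np.ones((8, 8))])
    imgs = np.stack([0.7 * (true_n @ l) for l in lights])   # Lambertian rendering, albedo 0.7
    albedo, normals = photometric_stereo(imgs, lights)
```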
We have mainly concentrated on a four-source version of the technique, although we have also compared our results to methods using three sources. For the latter, we omitted the upper-right source in Fig. 1 from the computation. The choice of which source to remove was somewhat arbitrary, although we should note that the choice does have an impact on the reconstructions, as some sources cause more shadows than others.

B. A Ray Tracing Method for Photometric Stereo
In many applications of PS, it is desirable to use all available light sources in the reconstruction in order to maximize robustness. However, where one or more sources do not illuminate the entire surface due to a self or cast shadow, it becomes disadvantageous to use all the sources. In the case of a face, this is most likely to happen around the nose and outer edges of the cheeks, as shown in Fig. 2. Argyriou and Petrou recently proposed a recursive algorithm for PS in the presence of self and cast shadows and highlights [39]. The algorithm works with as few as three light sources, but can be generalized to an arbitrary number of sources.
Initially, the problematic areas of the surface are identified, based on the fact that there exists a linear equation expressing the relationship between the illumination direction vectors. In the next step, standard PS is applied and, prior to integration, the problematic areas are removed. The self and cast shadows are then estimated from the reconstructed surface using ray tracing. Using the newly available information on shadows, PS is applied again using only the useful light sources (i.e., those not causing shadows or highlights). If erroneous areas still exist, they are regarded as highlights and the corresponding illumination sources are removed from the computation. In the final step, PS is applied using the shadow- and highlight-free sources.
This reconstruction method is therefore able to identify areas where some of the lighting directions result in unreliable data, providing the capability to adjust a reconstruction algorithm and improve its performance accordingly. The main advantage is that it does not rely on individual pixels to locate shadows, but is based on the entire surface, resulting in more reliable shadow maps.

C. Mitigation of Shadow Effects
Another recently proposed method to address shadow effects was presented by Hansen et al. [40]. This method assumes that no point on the face is shadowed by more than one source. The surface normal adopted for each point is then taken as a linear combination of the estimate obtained using all four sources and that obtained using the best combination of three sources.
More formally, the surface normal estimate is given by
$$ \mathbf{n} = (1 - \alpha)\, \mathbf{n}_4 + \alpha\, \mathbf{n}_3 \tag{3} $$
where $\alpha$ is a measure of the likelihood of a pixel being in shadow, $\mathbf{n}_4$ is the surface normal estimated from all four light sources, and $\mathbf{n}_3$ is the surface normal estimated from the optimal three light sources. The value of $\alpha$ varies between 0 and 1. For the case where $\alpha = 0$, the pixel in question is definitely not in shadow and so all four sources are used. For $\alpha = 1$, the pixel is deep in shadow, so only three sources are used. For intermediate values of $\alpha$, a mixture of $\mathbf{n}_4$ and $\mathbf{n}_3$ is used. The vector $\mathbf{n}_3$ is calculated using (2) with $m = 3$, choosing the three light sources that give the brightest intensities. The mixing factor $\alpha$ is determined based on the discrepancy between the measured intensity corresponding to the light source not used to calculate $\mathbf{n}_3$ and the value expected from Lambert's law and $\mathbf{n}_3$.
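The following is a rough sketch of the blending idea described above, written from the description rather than from the implementation in [40]; the function name and the particular discrepancy-to-mixing-factor mapping (a simple clipped linear map with scale sigma) are our own assumptions.

```python
import numpy as np

def shadow_aware_normal(intensities, lights, sigma=0.05):
    """Blend a four-source normal estimate with a best-three-source estimate.

    intensities : (4,) measured intensities at one pixel
    lights      : (4, 3) light-source direction vectors
    sigma       : scale controlling how quickly a discrepancy maps to 'in shadow'
    Returns the blended (unnormalized) albedo-scaled normal vector.
    """
    # Estimate using all four sources
    g4, _, _, _ = np.linalg.lstsq(lights, intensities, rcond=None)

    # Best three sources: keep the three brightest measurements
    order = np.argsort(intensities)
    keep, unused = order[1:], order[0]
    g3, _, _, _ = np.linalg.lstsq(lights[keep], intensities[keep], rcond=None)

    # Discrepancy between the unused measurement and its Lambertian prediction from g3
    predicted = lights[unused] @ g3
    alpha = np.clip(abs(intensities[unused] - predicted) / sigma, 0.0, 1.0)

    # alpha = 0: not in shadow, trust all four sources; alpha = 1: deep shadow, use three
    return (1.0 - alpha) * g4 + alpha * g3
```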

V. FROM SURFACE NORMALS TO SURFACES
In this section we review the problem of reconstructing a surface (often called a depth map) from the recovered field of surface normals. The surface is represented as a height function
$$ z = f(x, y). \tag{4} $$
Let us assume that the computed value of the unit normal at some point is $\mathbf{n} = [n_x, n_y, n_z]^{\top}$, as calculated by (2) for example. Then
$$ \frac{\partial f}{\partial x} = -\frac{n_x}{n_z}, \qquad \frac{\partial f}{\partial y} = -\frac{n_y}{n_z}. \tag{5} $$
To recover the depth map, we need to determine $f$ from the computed values of the unit normal. More formally, let us consider a rectangular grid of image pixels and let $(p, q)$ denote the given, generally nonintegrable, gradient field over this grid. Given the gradients, the goal is to obtain a surface $f$ whose gradient field is as close as possible to $(p, q)$.
During the past twenty years there has been a wealth of research concerning the recovery of a depth map from normals [41]-[45]. In this work we applied the following five algorithms:
• Frankot-Chellappa [41] (which is abbreviated as FC).
The FC algorithm [41] enforced integrability in Brooks and Horn's algorithm [46] in order to recover integrable surfaces. Integrable surfaces are those that obey
$$ \frac{\partial^2 f}{\partial x\, \partial y} = \frac{\partial^2 f}{\partial y\, \partial x}. \tag{6} $$
Surface slope (depth) estimates from their iterative scheme were expressed in terms of a linear combination of a finite set of orthogonal Fourier basis functions. More specifically, this algorithm reconstructs the surface by projecting the gradient field onto the set of integrable Fourier basis functions. Let $\tilde{P}(\omega_x, \omega_y)$ and $\tilde{Q}(\omega_x, \omega_y)$ denote the Fourier transforms of $p$ and $q$. Then, given $(p, q)$, the Fourier transform $\tilde{F}$ of the surface is obtained as
$$ \tilde{F}(\omega_x, \omega_y) = \frac{-j\,\omega_x \tilde{P}(\omega_x, \omega_y) - j\,\omega_y \tilde{Q}(\omega_x, \omega_y)}{\omega_x^2 + \omega_y^2}. \tag{7} $$
A small numerical sketch of this projection is given after this list.
• the variation of the Frankot-Chellappa method proposed in [42], which uses bases of the Discrete Cosine Transform instead of bases of the Fourier transform (abbreviated as DCTFC).
• the least-squares approach [43] (abbreviated as LS). Let $(f_x, f_y)$ denote the gradient field of $f$. The LS approach [43] minimizes the least-squares error function
$$ J(f) = \iint \left( (f_x - p)^2 + (f_y - q)^2 \right) dx\, dy. \tag{8} $$
The Euler-Lagrange equation gives the Poisson equation $\nabla^2 f = p_x + q_y$. We can write $\nabla f = (p, q) + (\epsilon_p, \epsilon_q)$, where $(\epsilon_p, \epsilon_q)$ denotes the correction gradient field which is added to the given nonintegrable field to make it integrable. Thus the Poisson solver minimizes (8), and that solution minimizes the norm of the correction gradient field. It is, however, well known that a least-squares solution does not perform well in the presence of outliers.
• the method proposed in [44] (abbreviated as AT). The AT algorithm uses a purely algebraic approach to enforce integrability on an estimated gradient field that has nonzero curl in the discrete domain. This approach corrects for the curl of the given nonintegrable gradient field by solving a linear system. Furthermore, the approach is noniterative and has the important property that errors due to nonzero curl do not propagate across the surface. The extra computational cost for this method concerns the computation of the minimal set of edges needed to construct a connected graph; the most significant computational expense corresponds to finding the minimum spanning tree of the image graph (for which standard algorithms are available). In [44], all edges in the graph corresponding to nonzero curl were first broken. The resulting graph was then reconnected by finding the set of links with minimum total weight. Two types of edge weights were used, one based on curl values and the other based on gradient magnitude; assigning gradient magnitudes as weights gives better results than using curl values.
• and finally the method proposed in [45] (abbreviated as ME). In [45], a generalized equation representing a continuum of surface reconstruction solutions for a given nonintegrable gradient field was proposed.
Common approaches such as the Poisson solver [43] and the Frankot-Chellappa algorithm [41] are special cases of that generalized equation. According to [45], a general solution can be obtained by minimizing a kth-order error functional whose terms involve all combinations of derivatives of the surface up to order k. Restricting to first-order derivatives, the error functional penalizes the discrepancy between the gradient of the surface and the given gradient field, and its Euler-Lagrange equation generalizes the Poisson equation. It was then shown that previous solutions such as the Poisson solver and the Frankot-Chellappa algorithm [41] can be derived from this generalized framework. Furthermore, new types of reconstructions using a progression of spatially varying anisotropic weights along the continuum were presented. A solution based on a general affine transformation of the gradients using diffusion tensors, near the other end of the continuum, was also proposed, producing reconstructions that better preserve features. Some of the surfaces reconstructed with the Agrawal et al. method are depicted in Figs. 4 and 5, in both frontal and profile view, with and without the albedo.
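For concreteness, a minimal numerical sketch of the Frankot-Chellappa projection in (7) is given below, assuming the gradient field p = -n_x/n_z, q = -n_y/n_z has already been derived from the normal field. The function name is hypothetical; the DCT variant [42] follows the same pattern with a cosine basis, and this sketch is an illustration rather than the implementation used in the paper.

```python
import numpy as np

def frankot_chellappa(p, q):
    """Integrate a (possibly non-integrable) gradient field (p, q) into a height map
    by projection onto the integrable Fourier basis functions, as in (7).

    p, q : (H, W) estimated partial derivatives dz/dx and dz/dy
    Returns z : (H, W) reconstructed height map (up to an additive constant).
    """
    H, W = p.shape
    wx = 2.0 * np.pi * np.fft.fftfreq(W)            # angular spatial frequencies along x
    wy = 2.0 * np.pi * np.fft.fftfreq(H)            # angular spatial frequencies along y
    WX, WY = np.meshgrid(wx, wy)

    P = np.fft.fft2(p)
    Q = np.fft.fft2(q)

    denom = WX ** 2 + WY ** 2
    denom[0, 0] = 1.0                               # avoid division by zero at the DC term
    Z = (-1j * WX * P - 1j * WY * Q) / denom
    Z[0, 0] = 0.0                                   # mean height is arbitrary

    return np.real(np.fft.ifft2(Z))
```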

VI. FACE RECOGNITION USING 2-D INTENSITY IMAGES, DEPTH MAPS AND NORMAL FIELDS
This section describes the different recognition algorithms we applied to the various modalities available to us from the Photoface database. Two different modalities are retrieved by the pipeline of PS: the albedo image and the 3-D facial geometry, represented by the depth map or the normal field.

A. Face Recognition/Verification Using Albedo Images
Two well-established families of techniques were applied in this paper. The first considers facial images as vectors and finds linear projections for dimensionality reduction and feature extraction [30], [31]. The other is based on elastic graph matching [34]. We performed our experiments under two distinct paradigms: in the first, we use only one sample for training and, in the second, we use two samples.
1) Subspace Methods Based on Linear Projections: This family of methods aims to extract features using linear projections and includes PCA (Eigenfaces) [47], Nonnegative Matrix Factorization (NMF) [48], Independent Component Analysis [49], etc. In our experiments NMF produced the best recognition and verification results. In subspace methods such as NMF, the facial images are lexicographically scanned so that the pixel values are reshaped into vectors. Let $N$ be the number of samples in the image database $\{\mathbf{x}_i\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is a database image. A linear transformation of the original $d$-dimensional space onto a subspace with $k$ dimensions is a matrix $\mathbf{W} \in \mathbb{R}^{d \times k}$. The new feature vectors are given by
$$ \mathbf{y}_i = \mathbf{W}^{\top} (\mathbf{x}_i - \bar{\mathbf{x}}), \quad i = 1, \dots, N \tag{10} $$
where $\bar{\mathbf{x}}$ is the mean image of all samples. Classification is then performed using a simple distance measure and a nearest neighbor classifier using the normalized correlation.
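A minimal sketch of this subspace pipeline is given below. For simplicity it learns the projection matrix with PCA rather than NMF (the paper's best-performing subspace method), since the projection in (10) and the normalized-correlation nearest-neighbour classification are the same regardless of how W is obtained; all names are hypothetical.

```python
import numpy as np

def pca_subspace(X, k):
    """Learn a k-dimensional linear subspace from vectorized training images.

    X : (N, d) matrix of N lexicographically scanned images
    Returns (W, mean) where W is (d, k).
    """
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return Vt[:k].T, mean

def project(W, mean, x):
    """Feature vector y = W^T (x - mean), as in (10)."""
    return W.T @ (x - mean)

def normalized_correlation(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def classify(W, mean, gallery, labels, probe):
    """Nearest-neighbour classification of a probe image using normalized correlation."""
    y = project(W, mean, probe)
    scores = [normalized_correlation(y, project(W, mean, g)) for g in gallery]
    return labels[int(np.argmax(scores))]
```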
For the second testing paradigm, where we have two samples per person for training and one for testing, we also applied discriminant methods for recognition. In particular, we use the well-known Linear Discriminant Analysis (LDA) method. We considered five different approaches: 1) Fisherfaces [30]; 2) PCA plus LDA [50], where we have two different spaces, the regular and the irregular; 3) an LDA method based on a slightly different criterion [51]; 4) the LDA method proposed in [52], which uses a regularization of the eigenspectrum; 5) NMF plus LDA [31], [53]. The interested reader may refer to [30], [31], [50]-[53] for details of the algorithms. In our experiments, most of the discriminant approaches resulted in very similar recognition rates. For the sake of compactness, therefore, we shall only detail results of the NMF+LDA approach.
2) Elastic (Bunch) Graph Matching: The second family of techniques that we applied is that of Elastic Graph Matching (EGM) or Elastic Bunch Graph Matching (EBGM). In the first step of the EGM algorithm, a suitable face representation based on a sparse graph is selected. A uniformly distributed rectangular graph is one of the simplest forms possible. For this case, only a face detection algorithm is required in order to find an initial approximation of the desired rectangular facial region. An example of such a reference graph is depicted in Fig. 6.
In the second step, the facial image region is analyzed and a set of local descriptors, called jets, is extracted at each graph node. Analysis is usually performed by building an information pyramid using scale-space techniques. In standard EGM, a 2-D Gabor-based filter bank is used for image analysis. The outputs of multiscale morphological dilation-erosion operations, or of morphological signal decomposition at several scales, are nonlinear alternatives to the Gabor filters for multiscale analysis. Both have been successfully used for facial image analysis [54]. Morphological feature vectors have the advantage of robustness to in-plane rotations. A jet based on Gabor filters that is robust to rotation and scaling was recently proposed in [55].
The third step involves matching the reference graph on the test face image in order to find the correspondences of the reference graph nodes on the test image. This is accomplished by minimizing a cost function that employs node jet similarities while simultaneously preserving the node neighborhood relationships.
Due to the matching procedure, EGM algorithms are relatively robust against facial image misalignment. Moreover, due to local node deformations, EGM algorithms are also relatively robust against the presence of facial expressions. In order to further boost the performance of EGM, subspace methods can be applied to extract features from the graphs, as in [32]-[34], [56]. We applied several unsupervised and supervised dimensionality reduction methods to the graphs to increase recognition speed (e.g., PCA, ICA, NMF), but for consistency and compactness we only report the results based on NMF and NMF plus LDA.

B. Face Recognition Using Depth Images and Normals
We perform face recognition on the surface normal estimates (as described in Section IV) and on depth images computed by integration of the normal field (as described in Section V). We mainly experimented with recognition/verification methods that process the depth maps in exactly the same manner as the intensity images. That is, we applied dimensionality reduction methods based on linear projections and EGM algorithms. For the latter, we used multiscale log-Gabor filters, which have been shown to be quite useful for the processing of depth images [57], in order to fill the jets of the graphs.
Finally, we applied a method [34] that operates on the intensity images but uses the 3-D information for more reliable matching. In particular, in [34], an EGM algorithm was proposed which exploits the information in the 3-D depth maps only in the matching process. Instead of using the simple distance on a 2-D grid, the nodes are mapped onto the 3-D surface. The method then uses geodesic distances between nodes as a similarity measure. For this paper, the geodesic distances between the points were calculated using the depth map derived from PS. Geodesic distances are robust against isometry mappings (or isometries) of the facial surfaces. Facial pose variations are isometries of the facial surfaces and, to an extent, so are facial expressions (as long as the mouth remains closed). That is, the geodesic distances before and after the development of an expression are considered to remain (approximately) identical [58]. An example of matching using the algorithm in [34] can be seen in Fig. 7, where the final node positions are shown.
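As a toy illustration of using geodesic rather than planar distances, the sketch below computes an approximate geodesic distance between two pixels of a depth map by running Dijkstra's algorithm on the pixel grid, with edges weighted by 3-D Euclidean length. This is our own simplified illustration, not the specific geodesic computation used in [34].

```python
import heapq
import numpy as np

def geodesic_distance(depth, start, goal):
    """Approximate geodesic distance between two pixels on a surface given as a depth map.

    The image grid is treated as a graph: each pixel is a 3-D point (x, y, depth[y, x])
    and edges connect 4-neighbours, weighted by Euclidean length. Dijkstra's algorithm
    then gives the shortest path length along the surface.
    """
    H, W = depth.shape
    dist = np.full((H, W), np.inf)
    dist[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if (y, x) == goal:
            return d
        if d > dist[y, x]:
            continue
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W:
                step = np.sqrt(dy * dy + dx * dx + (depth[ny, nx] - depth[y, x]) ** 2)
                if d + step < dist[ny, nx]:
                    dist[ny, nx] = d + step
                    heapq.heappush(heap, (d + step, (ny, nx)))
    return float(dist[goal])
```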

C. Face Recognition Using NormalFaces
The final modality for face recognition we consider in this paper is based on the orientation of the normals. The baseline method is straightforward to implement and is based on a novel representation of faces, the so-called "Normalfaces". For each image, using the computed components $n_x$, $n_y$, and $n_z$ of the normal field, we compute the orientation image given in (11), which contains the normal orientations at every pixel. Example normal orientation images are shown in Fig. 8. The orientations are measured within a fixed angular interval. For two orientation images we use the dissimilarity measure given in (12), which accumulates the angular differences over all pixels. This dissimilarity measure can be transformed into a kernel and then used to extract features using embedding, as described in [59]. The kernel can also be used for extracting discriminant features using kernel Fisherfaces [60]. Classification is performed using the normalized correlation in the new low-dimensional space.
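A small sketch of one plausible instantiation of this representation is given below. We assume, purely for illustration, that the orientation image stores the azimuth angle arctan2(n_y, n_x) of each normal and that the dissimilarity in (12) is the pixel-wise sum of wrapped angular differences; the exact forms used in the paper may differ.

```python
import numpy as np

def normal_face(normals):
    """Orientation image from a field of unit normals of shape (H, W, 3).

    Assumption for this sketch: the orientation at each pixel is the azimuth
    angle of the normal, measured in (-pi, pi].
    """
    return np.arctan2(normals[..., 1], normals[..., 0])

def orientation_dissimilarity(theta1, theta2):
    """Pixel-wise angular dissimilarity between two orientation images,
    wrapping differences so that angles cannot differ by more than pi."""
    d = np.abs(theta1 - theta2)
    d = np.minimum(d, 2.0 * np.pi - d)
    return float(d.sum())
```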

VII. EXPERIMENTAL RESULTS
The results presented in this section are divided into recognition and verification. For each case we present results for albedo, surface normals, depth, and fusion of 2-D and 3-D data. In particular, we present experiments using the Single Sample Single Modality (SSSM), Single Sample Multiple Modalities (SSMM), and Multiple Sample Single Modality (MSSM) [35] setups.

A. Recognition Experiments
For these experiments, we used a subset of 126 subjects whose sessions were recorded more than a week apart. For the majority of them (90 subjects), the interval was greater than one month. We tested using two setups:
• In the first (SSSM and SSMM), a very challenging experimental procedure was followed, exploiting only one gray-scale albedo image or surface normal map obtained from PS, or the depth image derived from the integration of the normal field. Similarly, one albedo image, normal map, or height map was used for testing. Most of the training and testing images display a different facial expression. This one-sample face recognition challenge is one of the most difficult scenarios in the field and has various applications [61]. Related one-sample experiments can be found in [4], [18], [35].
• In the second setup (MSSM), two samples were used for training and one for testing. In our database, we have 96 subjects with the required three or more samples. The test image for each of these was the same as that used in the one-sample experimental setup. This is in order to test whether or not recognition using two samples of the same modality is better than fusing information across different modalities. (We also report results using these 96 subjects for the one-sample setup.)
In order to further justify the use of photometric stereo, we conducted the following set of experiments. We trained supervised and unsupervised subspace methods (such as NMF, PCA, LDA, etc.) and Elastic Bunch Graphs using jets from all four images, with and without illumination compensation [62]-[64], and finally fused the results (with weighted and unweighted schemes). Furthermore, we trained unsupervised subspace learning methods per light source and then fused the scores per light source. In all cases, the recognition rate achieved was never higher than the best recognition rate achieved by a single albedo image. It is worth noting that the application of discriminant subspace learning techniques (i.e., LDA) resulted in much poorer performance than using a single albedo image (something that has been previously reported in the literature [65]). On the other hand, in the PS architecture we also have available, in addition to the albedo image, the facial shape information, which, as we show, when fused (even with very simple fusion rules) achieves much better performance than the use of a single albedo image. Similar conclusions are drawn for the verification experiment. Finally, we note that in the following the reported recognition rates are rounded to the nearest 0.5%.
1) Albedo Images: Four-source, three-source, optimal three-source (optimal according to [40]) and ray-trace-based PS methods were employed for albedo computation. These methods are abbreviated as 4L-PS [38], 3L-PS [38], 3L-OPT [40], and RAY-PS [39], respectively. The recognition results of the subspace methods are abbreviated as SUB (i.e., NMF), while EGM methods using subspaces for boosting the performance are abbreviated as EGM-SUB (the pipeline of the methods used for recognition using albedo and depth images can be seen in Fig. 9, along with their acronyms). Table I shows the recognition rates for each method in this one-sample setup. Note that the recognition rate is affected by the PS method applied, and noticeably better recognition performance is achieved by PS methods that use all four illuminants. The best recognition rate was 78% for the SUB methods and 82.5% for the EGM-SUB method (the improvement is mainly due to the alignment step in the EGM algorithm).
Fig. 9. Pipeline of the methods used for evaluation on the albedo and depth images, along with their acronyms. We use the acronym SUB when using unsupervised subspace methods, SUB-DISCR when using supervised methods, and EGM-SUB and EGM-DISCR when applying unsupervised/supervised subspace learning methods to EGM features.

For the case of the two-sample experiment, we used two different approaches:
• Firstly, we applied the unsupervised SUB and EGM methods for feature extraction using a decision fusion strategy similar to [35]. That is, we combined the matching scores for each person across the two samples of 2-D albedo images and ranked the subjects based on the combined scores. Scores from each modality are linearly normalized to the range [0, 100] before combining (we explored other score normalizations, such as [66], but did not observe any performance increase). We explored various confidence-weighted versions of the sum, product, and minimum rules; however, the simple sum rule provided the best overall performance.
• Secondly, we applied supervised SUB (i.e., NMF plus LDA) and EGM methods, abbreviated as SUB-DISCR and EGM-DISCR respectively (DISCRiminant). In this case, the second sample is used for learning discriminant projections and we classify using the minimum of the two distances.
The recognition rates for these two-sample experiments for all PS algorithms are summarized in Table II. As can be seen, the use of more than one sample increases the recognition performance. Moreover, the methods which use all four illuminants achieved better recognition rates than those using only three, as before. The best recognition rate was 95%.
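The following sketch illustrates the score normalization and sum-rule fusion described in the first approach above (hypothetical function names; a simplified illustration rather than the exact experimental code).

```python
import numpy as np

def normalize_scores(scores, lo=0.0, hi=100.0):
    """Linearly map a vector of matching scores to [lo, hi], as done before fusion."""
    s = np.asarray(scores, dtype=float)
    rng = s.max() - s.min()
    if rng == 0:
        return np.full_like(s, lo)
    return lo + (hi - lo) * (s - s.min()) / rng

def sum_rule_fusion(score_lists):
    """Combine per-subject score vectors from several samples or modalities by the sum
    rule and rank subjects by the fused score (highest first)."""
    fused = sum(normalize_scores(s) for s in score_lists)
    ranking = np.argsort(-fused)
    return fused, ranking
```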
It is important to note that, when using only the 96 subjects of the second experiment in the one-sample tests, the recognition rates were also about 78% and 82.5%. Note the major improvement of over 10% between the one-sample tests and the two-sample tests. As an extra verification of this, we repeated the one-sample test using only the 96 subjects of the two-sample test and obtained similar results.
2) Depth Images and Normal Orientations: As described in Section V, we applied five different methods for surface reconstruction from the normal field. The recognition rates for the one-sample experiment and for all reconstruction and PS methods are summarized in Table III for SUB methods and Table IV for EGM-SUB.
The best recognition result acquired was 74% for SUB and 79% for EGM-SUB. As can be seen, the PS and reconstruction methods greatly affect the recognition performance. More precisely, four-source PS methods always achieve better recognition results, which is consistent with the albedo image tests. Moreover, the depth maps produced by DCTFC consistently outperformed, in terms of PR, the depth maps produced by all other reconstruction methods. The best results for the albedo in the corresponding experimental setting were 78% for SUB and 82.5% for EGM-SUB. Hence, there is a difference of 4% and 3.5%, respectively. This is attributed to the error in reconstruction.
Experiments using two samples for training and one sample for testing are summarized in Tables V, VI, VII, and VIII for SUB, EGM-SUB, SUB-DISCR, and EGM-DISCR, respectively. The same fusion and classification methods as those used for the albedo were applied in these experiments. The experiments using the NormalFace approach for all tested PS methods are summarized in Table IX for one-sample training (1Tr.) and two-sample training (2Tr.) recognition. In the 1Tr. setting, we used the kernel framework for feature extraction [59] with the proposed distance in (12). For the 2Tr.-Discr setting, we used the same framework as in 1Tr. and then applied LDA to the produced features. For the two-sample fusion experiments, the same fusion strategy was applied as in the albedo experiment (2Tr.-Fuse).
A summary of the best results from all the modalities is given in Table X. From the summary we can deduce the following:
• in the one-sample experiment, the use of the albedo produces better results than the normals or the depth;
• in the two-sample experiment, all modalities have similar performance;
• there is no difference in performance between the depth and the normals.
We note that these conclusions refer only to the applied methods.
3) Fusion of 2-D and 3-D Data: Multimodal decision fusion was performed by combining the match scores for each person across the modalities of 2-D albedo and depth image, and ranking the subjects based on the combined scores, in a similar manner to the two-sample experiments above. The sum rule provided the best performance, as before. We performed fusion only on depth images derived from the DCTFC method and 4L-PS, as these gave the best results in the earlier experiments. Fusion of intensity and geometry information was conducted only on the subset of subjects with more than two samples available, in order to be directly comparable to the single-modality two-sample experiments. In Table XI we summarize the best results of the one-sample experiments (i.e., the SSSM and SSMM experimental setups) in order to highlight the advantage of fusing 2-D and 3-D information. As can be seen, the fusion of 2-D and 3-D significantly improves the recognition rate.
The recognition results from multimodal fusion (in both the SSMM and MSSM setups) using SUB methods and EGM-DISCR, which achieved the best results, are given in Table XII.

B. Verification Experiments
Face verification systems aim to determine whether or not an identity claim is valid. The performance of face verification systems is typically measured in terms of the False Rejection Rate (FRR) achieved at a fixed False Acceptance Rate (FAR). There is a trade-off between FAR and FRR: varying the decision threshold traces out a Receiver Operating Characteristic (ROC) curve, where FRR is plotted as a function of FAR. The performance of a verification system is often quoted at a particular operating point of the ROC curve where FAR = FRR. This operating point is called the Equal Error Rate (EER) and is a useful scalar figure of merit commonly adopted to quantify verification performance. In the verification experiments, we used the same matching score generators as in the recognition experiments.
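As a reference for how the EER is obtained from a set of client and impostor scores, the sketch below sweeps the decision threshold, computes FAR and FRR at each threshold, and reports the point where they are closest. This is a generic illustration with hypothetical names, not the evaluation code used in the paper.

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Find the operating point where FAR is (approximately) equal to FRR by sweeping
    the decision threshold over all observed scores.

    client_scores   : similarity scores for genuine (client) claims
    impostor_scores : similarity scores for impostor claims
    Higher scores mean a more confident acceptance.
    """
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
        frr = np.mean(client_scores < t)      # clients wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```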
The verification protocol used in this paper is similar to the FERET verification protocol [67]. The test (or client) set was defined by the same 126 persons as in the recognition experiments above. The first image per subject is used for training, while the second is used for testing client claims. The remaining 135 people in the database, with one image per person, are considered to be impostors.
In a second verification experiment, we used two images from each of the 96 subjects for training, while the third was used for testing client claims. Again, the latter 135 persons were used for impostor claims. Furthermore, it is worth noting that, as in the recognition experiments, the use of the four images captured under the different lights did not achieve better performance than using only one albedo image.
1) Face Verification Using Albedo, Depth, and Normalface Images: The EERs for albedo data calculated using the various PS methods for the one-sample experiment are summarized in Table XIII. The corresponding results for the two-sample experiments are summarized in Table XIV.
The EERs for the various PS and surface reconstruction methods for the one-sample experiment are summarized in Tables XV and XVI for SUB and EGM-SUB methods, respectively. The verification results for the two-sample experiment and for the various PS and surface reconstruction methods are given in Tables XVII, XVIII, XIX, and XX for SUB, EGM-SUB, SUB-DISCR, and EGM-DISCR, respectively.
The EERs for the various PS methods for the one-sample experiment in the case of Normalfaces are summarized in Table XX. The corresponding verification results for the two-sample experiments and for the various PS methods are summarized in Table XXI (for the verification experiments, the difference in EER between fusion and discriminant learning was negligible, hence for reasons of compactness we report only the discriminant learning results).
A summary of the best verification experiments across all the modalities above is shown in Table XXII.
2) Multimodal Fusion: Multimodal decision fusion was performed in exactly the same way as for the recognition experiments, that is, by combining the match scores for each person across the modalities of 2-D albedo image and depth map and ranking the subjects based on the combined scores. In Table XXIII we summarize the best results of the one-sample experiments.
From the above experiments, the following can be concluded:
• Four-source PS methods produce facial samples (albedo, normals) that achieve consistently better recognition and verification performance than three-source PS, regardless of the reconstruction method applied.
• The reconstruction methods greatly affect the recognition and verification performance. The method which consistently produces the best recognition/verification performance proved to be DCTFC [42].
Moreover, we have verified most of the following findings of [35]:
• In most cases, the best recognition and verification results using the recovered albedo, normals, and reconstructed depth maps are approximately the same (there is a difference in the one-sample experiment). In some cases, the recovered albedo produces better results.
• Fusion of the 2-D (albedo) and 3-D (depth or normal) information produces significantly better results than using either the albedo or the depth image alone.
• Fusion of multiple albedo images and/or reconstructed surfaces produces significantly better results than using only one albedo or depth image.
• Fusion of two albedo images (Single-Modality and Multiple-Sample, SMMS) in the same way that we fused the results of albedo and depth map (Multiple-Modality and Single-Sample, MMSS) gave approximately the same recognition and verification performance. The major advantage of MMSS over SMMS is that SMMS requires two recording sessions while MMSS requires only one.
Using the four images under different lights, with or without illumination normalization, and applying the same face recognition algorithms as those used for the albedo, depth, and normals, we have not observed better performance than using a single albedo image. The best recognition rate for any of the presented algorithms under the one-sample setup was 86%, and for the two-sample setup it was 96%. The corresponding verification experiments resulted in an EER of 6.4% for the one-sample experiment and 3.2% for the two-sample experiment. The verification experiments show that there is still much room for improvement.

Fig. 1 .
Fig. 1. Image capture device. One of the light sources and an ultrasound trigger are shown to the left. The camera is located at the back panel.

Fig. 3 .
Fig. 3. Number of subjects versus number of repetitions. Note that about 20 people walked through the device more than 20 times.

Fig. 4 .
Fig. 4. Reconstructed surface using the method of Agrawal [45]: frontal view, (a) with and (b) without the computed albedo; profile view, (c) with and (d) without the computed albedo.

Fig. 5 .
Fig. 5. Reconstructed surface using the method of Agrawal [45]: frontal view (a) with and (b) without the computed albedo; profile view, (c) with and (d) without the computed albedo.

Fig. 6 .
Fig. 6. (a) Rectangular grid as a reference graph; (b) a rectangular reference graph using 3-D geometry information; (c) a reference graph built around the node.

Fig. 10 .
Fig. 10. ROC curves for the single sample single modality, single sample multiple modalities fusion, and multiple samples single modality fusion.

TABLE I
SUMMARY OF THE BEST (ACCORDING TO THE DIMENSIONS KEPT) PERCENTAGE OF RECOGNITION (PR) FOR ALBEDO IMAGES FOR THE FIRST EXPERIMENTAL SETUP (ONE-SAMPLE)

TABLE II
SUMMARY OF THE BEST PR FOR ALBEDO IMAGES FOR THE SECOND EXPERIMENTAL SETUP (TWO-SAMPLE)

TABLE V
SUMMARY OF THE BEST PR FOR DEPTH IMAGES USING SUBSPACE ALGORITHMS AND FUSION FOR THE SECOND EXPERIMENTAL SETUP

TABLE VI
SUMMARY OF THE BEST PR FOR DEPTH IMAGES USING EGM ALGORITHMS AND FUSION FOR THE SECOND EXPERIMENTAL SETUP

TABLE VII
SUMMARY OF THE BEST PR FOR DEPTH IMAGES USING DISCRIMINANT SUBSPACE ALGORITHMS FOR THE SECOND EXPERIMENTAL SETUP

TABLE VIII
SUMMARY OF THE BEST PR FOR DEPTH IMAGES USING EGM ALGORITHMS AND DISCRIMINANT SUBSPACE METHODS FOR THE SECOND EXPERIMENTAL SETUP

As can also be observed, we can verify the finding of the first experiment, where the images produced by four-source PS methods and DCT-based FC surface reconstruction methods consistently outperform all other methods.

TABLE IX
SUMMARY OF THE BEST RECOGNITION RATE FOR NORMALFACE AND ALL THE TESTED METHODS FOR BOTH EXPERIMENTS

TABLE X
SUMMARY OF THE BEST PR FOR ALL THE CONDUCTED EXPERIMENTS ACROSS DIFFERENT MODALITIES

TABLE XI
SUMMARY OF THE BEST PR FOR SINGLE SAMPLE EXPERIMENTS

TABLE XII
SUMMARY OF THE BEST PR FOR FUSION ACROSS DIFFERENT SAMPLES AND MODALITIES

TABLE XIII
SUMMARY OF THE BEST EER FOR ALBEDO IMAGES FOR THE FIRST EXPERIMENTAL SETUP

TABLE XIV
SUMMARY OF THE BEST EER FOR ALBEDO IMAGES FOR THE SECOND EXPERIMENTAL SETUP

TABLE XV
SUMMARY OF THE BEST EER FOR DEPTH IMAGES USING EGM ALGORITHM FOR THE FIRST EXPERIMENTAL SETUP

TABLE XVI
SUMMARY OF THE BEST EER FOR DEPTH IMAGES USING SUBSPACE ALGORITHMS FOR THE FIRST EXPERIMENTAL SETUP

TABLE XVII
SUMMARY OF THE BEST EER FOR DEPTH IMAGES USING SUBSPACE ALGORITHMS AND FUSION FOR THE SECOND EXPERIMENTAL SETUP

TABLE XVIII
SUMMARY OF THE BEST EER FOR DEPTH IMAGES USING EGM AND FUSION FOR THE SECOND EXPERIMENTAL SETUP

TABLE XIX
SUMMARY OF THE BEST EER FOR DEPTH IMAGES USING DISCRIMINANT SUBSPACE METHODS AND FUSION FOR THE SECOND EXPERIMENTAL SETUP

TABLE XX
SUMMARY OF THE BEST EER FOR DEPTH IMAGES USING EGM AND DISCRIMINANT SUBSPACE METHODS FOR THE SECOND EXPERIMENTAL SETUP

TABLE XXI
SUMMARY OF THE BEST EER FOR NORMALFACE AND ALL THE TESTED METHODS FOR BOTH EXPERIMENTS

TABLE XXII
SUMMARY OF THE BEST PERCENTAGE OF EER FOR ALL THE CONDUCTED EXPERIMENTS ACROSS DIFFERENT MODALITIES

TABLE XXIII
SUMMARY OF THE BEST PERCENTAGE OF EER FOR THE SINGLE SAMPLE EXPERIMENT

TABLE XXIV
SUMMARY OF THE BEST PERCENTAGE OF EER FOR FUSION ACROSS DIFFERENT SAMPLES AND MODALITIES