Signer-Independent Sign Language Recognition with Adversarial Neural Networks

Sign Language Recognition (SLR) has become an appealing topic in modern societies because such technology can ideally be used to bridge the gap between deaf and hearing people. Although important steps have been made towards the development of real-world SLR systems, signer-independent SLR is still one of the bottleneck problems of this research field. In this regard, we propose a deep neural network along with an adversarial training objective, specifically designed to address the signer-independent problem. Concretely speaking, the proposed model consists of an encoder, mapping from input images to latent representations, and two classifiers operating on these underlying representations: (i) the sign-classifier, for predicting the class/sign labels, and (ii) the signer-classifier, for predicting their signer identities. During the learning stage, the encoder is simultaneously trained to help the sign-classifier as much as possible while trying to fool the signer-classifier. This adversarial training procedure allows learning signer-invariant latent representations that are in fact highly discriminative for sign recognition. Experimental results demonstrate the effectiveness of the proposed model and its capability of dealing with the large inter-signer variations.

I. INTRODUCTION
Sign Language Recognition (SLR) has become one of the most active research topics in the human-computer interaction field. Its main purpose is to automatically translate signs, from video or images, into the corresponding text or speech. Although recent SLR methods have demonstrated remarkable performance in signer-dependent scenarios, i.e., when training and test data come from the same signers, their recognition rates typically decrease significantly when the signer is new to the system. This performance drop results from the large inter-signer variability in the manual signing process of sign languages (see Fig. 1). However, a practical SLR system must operate in a signer-independent scenario, which means that the signer of the probe must not be seen during the training of the models. Therefore, signer-independent SLR has become one of the bottleneck problems for the development of real-world, practical SLR systems.

Fig. 1. Inter-signer variability: it is possible to observe not only phonological variations (i.e., different handshapes, palm orientations, and sign locations) but also a large physical variability (i.e., different hand sizes) when six signers perform the same sign.

[...] and competitive training scheme encourages the learned representations to be signer-invariant and highly discriminative for the sign classification task. To further constrain the latent representations to be signer-invariant, we introduce an additional training objective that operates on the hidden representations of the encoder network in order to enforce the latent distributions of different signers to be as similar as possible.
Although this adversarial training framework is similar to those initially introduced by Ganin et al. [3], in the context of domain adaptation, and then by Feutry et al. [2], to learn anonymized representations, our main contributions on top of these works are two-fold:
1) The application of the adversarial training concept to the signer-independent SLR problem;
2) A novel adversarial training objective that differs from those of Ganin et al. [3] and Feutry et al. [2] in two ways. First, our training objective is minimum if and only if the adversarial classifier, which in our case corresponds to the signer-classifier, produces a uniform distribution over the signer identities, meaning that our model is completely invariant to the signer identity of the training data. Second, we introduce an additional term to the adversarial training objective that further discourages the learned representations from retaining any signer-specific information, by explicitly imposing similarity in the latent distributions of different signers.

This paper is an extension of our conference paper [4]. The new contributions of this paper are summarized as follows:
1) The introduction of a transfer learning strategy in the proposed adversarial training objective, yielding an overall improvement in the sign recognition performance. Concretely speaking, instead of training all the network components from scratch, the encoder network is initialized with the first 10 layers of VGG-19 [5], pre-trained on ImageNet [6], and then fine-tuned to our specific task.
2) An extended experimental section to further demonstrate the effectiveness of the proposed model. Specifically, the experimental evaluation of the proposed model is extended to an additional SLR database. Moreover, we introduce a quantitative analysis of the produced latent representations and an analysis of the training behavior of the proposed model.

The remainder of the paper is organized as follows. Section II presents the related work. The proposed model, along with its adversarial training scheme, is fully described in Section III. Experimental results and conclusions are reported in Sections IV and V, respectively.

II. RELATED WORK
According to the amount of data required from the test signers, previous signer-independent SLR works can be roughly classified into two main groups, namely (i) signer adaptation approaches, where a previously trained model is adapted to a new test signer by using a small amount of signer specific data, and (ii) truly signer-independent methods, in which a generic model robust for new test signers is built without using data of those test signers.
Greatly inspired by speaker adaptation methods from speech recognition research, Von Agris et al. [7] proposed the combination of the eigenvoice (EV) approach [8] with maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) estimation to adapt trained Hidden Markov Models (HMMs) to new signers. More recently, Kim et al. [9] investigated the potential of different deep neural network adaptation strategies for the signer-independence problem. Yin et al. [10] proposed a weakly-supervised signer adaptation approach, in which the adaptation data from the new signer need not be labeled. Specifically, a generic metric is first learnt from the available labeled data of several different signers and then adapted to the new signer by considering clustering and manifold constraints along with the collected unlabeled data.

Although signer adaptation is a reasonable approach, in practice, collecting enough training data from each new signer to retrain and adapt the model may not be feasible. In this regard, several works focused on the development of truly signer-independent models that do not require any data from the new signers [11]-[17]. Most of them involved a huge feature engineering effort in order to build normalized hand-crafted feature descriptors robust to the large inter-signer variations. A major weakness across all the aforementioned methods is that representation and metric learning are not jointly performed.

Motivated by the inherent difficulty of designing hand-crafted features that are reliable under the large inter-signer variability, recent SLR systems are mostly based on deep neural networks [18]-[22]. It is well known that deep neural networks are remarkably good at learning reliable high-level feature representations from the data. However, in previous deep SLR methodologies, the learned representations are not explicitly constrained to be signer-invariant. Therefore, there is nothing to prevent the learned representations of different signers and the same class from being far apart in the representation space and, hence, signer invariance is not ensured.

This paper presents a novel adversarial training objective, based on representation learning and deep neural networks, specifically designed to address the signer-independent SLR problem. Different from the aforementioned methodologies, our model jointly learns the representation and the classifier from the data, while explicitly imposing signer invariance in the high-level representations for a robust and truly signer-invariant sign recognition.

III. PROPOSED METHOD
The ultimate goal of our model is to learn signer-invariant latent representations that preserve the relevant part of the information about the signs while discarding the signer-specific traits that may hamper the sign classification task. To accomplish this purpose, we introduce a deep neural network along with an adversarial training scheme that is able to learn feature representations that combine both sign discriminativeness and signer-invariance.
More specifically, let $\mathcal{D} = \{(X_i, y_i, s_i)\}_{i=1}^{N}$ denote a labeled dataset of $N$ samples, where $X_i$ represents the $i$-th colour image, and $y_i$ and $s_i$ denote the corresponding class (sign) label and signer identity, respectively. To induce the model to learn signer-invariant representations, the proposed model comprises three distinct sub-networks:
1) An encoder network, which aims at learning an encoding function $h(X; \theta_h)$, parameterized by $\theta_h$, that maps from an input image $X$ to a latent representation $h$;
2) A sign-classifier network, which operates on top of this underlying latent representation $h$ to learn our task-specific function $f(h; \theta_f)$, parameterized by $\theta_f$, that maps from $h$ to the predicted probabilities $p(y \mid h; \theta_f)$ of each sign class;
3) A signer-classifier network, with the purpose of learning a signer-specific function $g(h; \theta_g)$, parameterized by $\theta_g$, that maps the same hidden representation $h$ to the predicted probabilities $p(s \mid h; \theta_g)$ of each signer identity.

During the learning stage, the parameters of both classifiers are optimized in order to minimize their errors on their specific tasks on the training set. In addition, the parameters of the encoder network are optimized in order to minimize the loss of the sign-classifier network while forcing the signer-classifier to be a random guessing predictor. In the course of this adversarial training procedure, the learned latent representations $h$ are encouraged to be signer-invariant and highly discriminative for sign classification. To further discourage the latent representations from retaining any signer-specific traits, we introduce an additional training objective that enforces the latent distributions of different signers to be as similar as possible. The result is a truly signer-independent model robust to new test signers.

A. Architecture
As illustrated in Fig. 2, the architecture of the proposed model is composed of three main sub-networks or blocks, i.e., an encoder, a sign-classifier and a signer-classifier.
The encoder network attempts to learn a mapping from an input image $X$ to a latent representation $h$. It simply consists of a sequence of $L_e$ pairs of consecutive 3 × 3 convolutional layers with Rectified Linear Units (ReLUs) as nonlinearities. For downsampling, the last convolutional layer of each pair has a stride of 2. On top of that, there is a fully-connected layer, also with a ReLU, representing the desired signer-invariant latent representations $h$.
Taking the latent representations $h$ as input, the sign-classifier block is composed of a sequence of $L_s$ fully-connected layers, with ReLUs as the non-linear functions, for predicting the sign class $\hat{y} = \arg\max f(h; \theta_f)$. Therefore, the last fully-connected layer has a softmax activation function which outputs the probabilities for each sign class.
The signer-classifier network has exactly the same topology as the sign-classifier network. However, it maps the latent representations $h$ to the predicted signer identity $\hat{s} = \arg\max g(h; \theta_g)$. Therefore, the number of nodes of the output layer is defined according to the number of signers in the training set.
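To make the downsampling pattern of the encoder concrete, the following sketch computes the feature-map shape after each convolutional pair. It is only an illustration under stated assumptions: a 64 × 64 input and a padding of 1 are hypothetical choices (the text specifies neither), the function name `encoder_feature_shapes` is ours, and the default of 3 pairs starting at 32 filters follows the values reported in the implementation details of Section IV.

```python
def encoder_feature_shapes(height, width, num_pairs=3, base_filters=32):
    """Return (channels, height, width) after each convolutional pair.

    Assumes 3x3 kernels with padding 1 (an assumption, not stated in the
    text), so the stride-1 layer of each pair keeps the spatial size and
    the stride-2 layer roughly halves it.
    """
    kernel, padding, stride = 3, 1, 2
    shapes = []
    channels = base_filters
    for _ in range(num_pairs):
        # First conv of the pair: stride 1 with padding 1, size unchanged.
        # Second conv of the pair: stride 2, standard output-size formula.
        height = (height + 2 * padding - kernel) // stride + 1
        width = (width + 2 * padding - kernel) // stride + 1
        shapes.append((channels, height, width))
        channels *= 2  # the number of filters doubles after each pair
    return shapes


shapes = encoder_feature_shapes(64, 64)  # hypothetical 64x64 input
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)     # feature-map shape after each pair
print(flattened)  # input size of the fully-connected layer on top
```

Under these assumptions, each pair halves the spatial resolution while doubling the number of channels, so the fully-connected layer on top receives a flattened vector of size `flattened`.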

B. Adversarial Training
By definition, signer-invariant representations discard all signer-specific information and, as such, no function (i.e., classifier) exists that maps such representations into the correct signer identity. This naturally leads to an adversarial problem, in which: (i) a signer-classifier network $g(\cdot; \theta_g)$ receives latent representations $h = h(X; \theta_h)$ from an encoder network $h(\cdot; \theta_h)$ and tries to predict the signer identity $s$ corresponding to image $X$, and (ii) the encoder network tries to fool the signer-classifier network while still providing good representations for the sign-classifier network $f(\cdot; \theta_f)$, which in turn receives the same representations $h$ and aims to predict the sign label $y$ corresponding to image $X$.
Therefore, the signer-classifier network shall be trained to minimize the negative log-likelihood of correct signer predictions:

$$\min_{\theta_g} \; \mathcal{L}_{\text{signer}}(\theta_h, \theta_g) = -\frac{1}{N} \sum_{i=1}^{N} \log p(s_i \mid h(X_i; \theta_h); \theta_g). \tag{1}$$

From the perspective of the encoder, the predictions of the sign-classifier should be as accurate as possible and the predictions of the signer-classifier should be kept close to uniform, meaning that this latter classifier is not capable of doing better than randomly guessing the signer identity. Formally, this may be translated into the following constrained objective:

$$\min_{\theta_h, \theta_f} \; \mathcal{L}_{\text{sign}}(\theta_h, \theta_f) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid h(X_i; \theta_h); \theta_f) \tag{2}$$

$$\text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} D_{\mathrm{KL}}\big(\mathcal{U}(s) \,\|\, p(s \mid h(X_i; \theta_h); \theta_g)\big) \le \epsilon, \tag{3}$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence and $\mathcal{U}(s)$ denotes the discrete uniform distribution on the random variable $s$, defined over the set of identities $S$ in the training set. Here, $\epsilon \ge 0$ determines how far from uniform the signer-classifier predictions are allowed to be (as measured by the KL divergence). The choice of the uniform distribution implies the underlying assumption that the training set is balanced with respect to the number of examples per signer (which should be true for most practical datasets). When this is not the case, the empirical distribution of signer identities in the training set may be used instead.
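As a small illustration of the constraint described above, the sketch below measures how far a signer-classifier prediction is from the target distribution, using the uniform distribution for a balanced training set and the empirical signer frequencies otherwise. The helper names `target_distribution` and `kl_divergence`, as well as the toy probabilities, are hypothetical and serve only to illustrate the idea.

```python
import math
from collections import Counter


def target_distribution(signer_ids, num_signers, balanced=True):
    """Uniform distribution if the set is balanced, else empirical frequencies."""
    if balanced:
        return [1.0 / num_signers] * num_signers
    counts = Counter(signer_ids)
    return [counts[s] / len(signer_ids) for s in range(num_signers)]


def kl_divergence(q, p):
    """D_KL(q || p) for two discrete distributions given as lists."""
    return sum(q_s * math.log(q_s / p_s) for q_s, p_s in zip(q, p) if q_s > 0)


uniform = target_distribution([0, 1, 2, 3], num_signers=4)
empirical = target_distribution([0, 0, 1, 2], num_signers=3, balanced=False)

undecided = [0.25, 0.25, 0.25, 0.25]  # signer-classifier reduced to guessing
confident = [0.85, 0.05, 0.05, 0.05]  # signer-classifier identifies the signer

print(kl_divergence(uniform, undecided))  # 0.0: satisfied for any epsilon >= 0
print(kl_divergence(uniform, confident))  # strictly positive
print(empirical)                          # target for an imbalanced toy set
```

The divergence is zero only when the prediction matches the target distribution, which is why a small tolerance forces the encoder to produce representations from which the signer cannot be identified.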
The constraint inequality (3) may be rewritten as:

$$\mathcal{L}_{\text{adv}}(\theta_h, \theta_g) = -\frac{1}{N |S|} \sum_{i=1}^{N} \sum_{s \in S} \log p(s \mid h(X_i; \theta_h); \theta_g) \le \epsilon + \log |S|, \tag{4}$$

and the constrained optimization problem may be equivalently formulated as:

$$\min_{\theta_h, \theta_f} \; \mathcal{L}(\theta_h, \theta_f, \theta_g) = \mathcal{L}_{\text{sign}}(\theta_h, \theta_f) + \lambda \, \mathcal{L}_{\text{adv}}(\theta_h, \theta_g), \tag{5}$$

where $\lambda \ge 0$ depends on $\epsilon$ and $\mathcal{L}_{\text{adv}}$ plays the role of an adversarial loss with respect to the signer classification loss $\mathcal{L}_{\text{signer}}$. This objective and the structure of our model are similar to those used in [3], in the context of domain adaptation, and in [2], to learn anonymized representations for privacy purposes. However, the former uses the negative signer classification loss as the adversarial term (i.e., $\mathcal{L}_{\text{adv}} \leftarrow -\mathcal{L}_{\text{signer}}$), which is not lower bounded, leading to high gradients and difficult optimization.

International Journal of Machine Learning and Computing, Vol. 11, No. 2, March 2021

The latter addresses this problem by replacing this term with the absolute difference between the adversarial loss as defined in equation (4) and the signer classification loss (i.e., $\mathcal{L}_{\text{adv}} \leftarrow |\mathcal{L}_{\text{adv}} - \mathcal{L}_{\text{signer}}|$). This option has a nice information-theoretic interpretation as an empirical upper bound for the mutual information between the distribution of signer identities and the distribution of latent representations. Nonetheless, this loss vanishes for infinitely many (non-uniform) distributions. Our choice, besides being clearly lower bounded by the entropy of the uniform distribution, $\log |S|$, is minimum if and only if $p(s \mid h(X_i; \theta_h); \theta_g) \equiv \mathcal{U}(s), \forall i$, meaning that the signer-classifier block is completely agnostic with respect to the signer identities of the training samples.
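The difference between the three adversarial terms discussed above can be checked numerically. The sketch below is a toy illustration, assuming that the adversarial loss is the cross-entropy between the uniform distribution and the signer-classifier prediction (our reading of the rewritten constraint); the function names and the probability values are hypothetical.

```python
import math


def signer_loss(probs, signer_ids):
    """Mean negative log-likelihood of the correct signer."""
    return -sum(math.log(p[s]) for p, s in zip(probs, signer_ids)) / len(probs)


def adversarial_loss(probs):
    """Mean cross-entropy between the uniform distribution and the predictions."""
    return -sum(sum(math.log(p_s) for p_s in p) / len(p) for p in probs) / len(probs)


num_signers = 3
uniform = [[1.0 / num_signers] * num_signers]

# A non-uniform prediction whose probability for the true signer (index 0)
# equals the geometric mean of all three probabilities.
g = 0.25
root = math.sqrt(0.75 ** 2 - 4 * g ** 2)
skewed = [[g, (0.75 + root) / 2, (0.75 - root) / 2]]
true_signer = [0]

# Lower bound log|S| is attained by the uniform prediction (about 1.0986).
print(adversarial_loss(uniform), math.log(num_signers))
# Any non-uniform prediction gives a strictly larger adversarial loss.
print(adversarial_loss(skewed) > math.log(num_signers))
# The absolute-difference term is (numerically) zero for this non-uniform case.
print(abs(adversarial_loss(skewed) - signer_loss(skewed, true_signer)))
# The negated signer loss has no lower bound as p(true signer) goes to zero.
print(-signer_loss([[1e-12, 0.5, 0.5 - 1e-12]], true_signer))
```

In this toy case, the absolute-difference term cannot distinguish the skewed prediction from the uniform one, whereas the cross-entropy with the uniform distribution is minimal only for the uniform prediction.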

C. Signer-Transfer Training Objective
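To further encourage the latent representations $h$ to be signer-invariant, we introduce an additional term in objective (5), the so-called signer-transfer loss $\mathcal{L}_{\text{transfer}}$. The core idea of $\mathcal{L}_{\text{transfer}}$ is to enforce the latent distributions of different signers to be as similar as possible. In practice, this is achieved by minimizing the difference between the hidden representations of different signers, at each layer of the encoder network. To measure the signers' distribution difference at the $l$-th layer, $l \in \{1, 2, \ldots, L\}$, we compute a distance $d^{(l)}(s, t)$ between the hidden representations $h^{(l)}(\cdot; \theta_h)$ of two signers $s$ and $t$ at the output of that layer, as:

$$d^{(l)}(s, t) = \left\| \frac{1}{N_s} \sum_{i:\, s_i = s} h^{(l)}(X_i; \theta_h) - \frac{1}{N_t} \sum_{j:\, s_j = t} h^{(l)}(X_j; \theta_h) \right\|_2, \tag{6}$$

where $\| \cdot \|_2$ is the $\ell_2$ norm, and $N_s$ and $N_t$ denote the number of training examples of signers $s$ and $t$, respectively. Accordingly, the signer-transfer loss at the $l$-th layer is the sum of the pairwise distances between all signers, i.e.:

$$\mathcal{L}_{\text{transfer}}^{(l)}(\theta_h) = \sum_{s \neq t} d^{(l)}(s, t), \tag{7}$$

where the sum runs over all pairs of distinct signers in $S$. The overall signer-transfer loss $\mathcal{L}_{\text{transfer}}$ is then a weighted sum of the losses computed at each layer of the encoder network, such that:

$$\mathcal{L}_{\text{transfer}}(\theta_h) = \sum_{l=1}^{L} \beta^{(l)} \mathcal{L}_{\text{transfer}}^{(l)}(\theta_h), \tag{8}$$

where $\beta^{(l)}$ is a hyperparameter that controls the relative importance of the loss obtained at the $l$-th layer. By combining (5) and (8), the encoder and sign-classifier networks are trained to minimize the following loss function:

$$\min_{\theta_h, \theta_f} \; \mathcal{L}(\theta_h, \theta_f, \theta_g) = \mathcal{L}_{\text{sign}}(\theta_h, \theta_f) + \lambda \, \mathcal{L}_{\text{adv}}(\theta_h, \theta_g) + \gamma \, \mathcal{L}_{\text{transfer}}(\theta_h), \tag{9}$$

where $\gamma \ge 0$ is the weight that controls the relative importance of the signer-transfer term. Summing up, the adversarial training procedure alternates between training both the encoder and the sign-classifier to minimize objective (9) and training the signer-classifier to minimize objective (1).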
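The signer-transfer penalty described above can be sketched in a few lines. This is a minimal illustration, assuming that the distance between two signers is the (unsquared) Euclidean distance between their mean hidden representations and that each unordered pair of signers is counted once; counting ordered pairs would only rescale the loss by a factor of 2. All function names and the toy features are hypothetical.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def signer_distance(feats_s, feats_t):
    """Euclidean distance between the mean representations of two signers."""
    mu_s, mu_t = mean_vector(feats_s), mean_vector(feats_t)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))


def layer_transfer_loss(features, signer_ids):
    """Sum of the pairwise signer distances at a single layer."""
    by_signer = {}
    for h, s in zip(features, signer_ids):
        by_signer.setdefault(s, []).append(h)
    return sum(signer_distance(by_signer[s], by_signer[t])
               for s, t in combinations(sorted(by_signer), 2))


def transfer_loss(features_per_layer, signer_ids, layer_weights):
    """Weighted sum of the per-layer signer-transfer losses."""
    return sum(w * layer_transfer_loss(f, signer_ids)
               for f, w in zip(features_per_layer, layer_weights))


# Toy hidden representations of five samples from three signers.
features = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [1.0, 5.0], [4.0, 4.0]]
signer_ids = [0, 0, 1, 1, 2]

print(layer_transfer_loss(features, signer_ids))                    # 4 + 5 + 3
print(transfer_loss([features, features], signer_ids, [1.0, 1.0]))  # two layers
```

The loss is zero only when all signers share the same mean representation at the selected layers, which is the similarity that the penalty is meant to encourage.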

IV. EXPERIMENTAL EVALUATION
The experimental evaluation of the proposed model was performed using three publicly available SLR databases: the Jochen-Triesch database [23], the Microsoft Kinect and Leap Motion American sign language (MKLM) database [24], [25], and the Portuguese Sign Language and Expressiveness Recognition (SI-PSL) database [26].

Jochen-Triesch [23] is a dataset of 10 hand signs performed by 24 signers against three different types of backgrounds: uniform light, uniform dark and complex. Experiments on Jochen-Triesch were conducted using the standard evaluation protocol of this dataset [27], in which 8 signers are used for training and the remaining 16 signers are used for testing.

MKLM [24], [25] contains a total of 10 signs, each one repeated 10 times by 14 different signers. In this dataset, the performance of the models is assessed using 5 random signer-independent splits, each yielding a training set of 10 signers, a validation set of 2 signers and a test set of 2 signers.

The SI-PSL database contains 31 isolated signs, representing the alphabet and the cardinal numbers 0 to 9 of the Portuguese sign language. All the signs were performed three times by 11 native signers, in a free and natural signing environment, without any clothing restriction but with a slightly controlled uniform background. SI-PSL has a well-defined standard evaluation protocol, which consists of 6 signers for training, 1 signer for validation and the remaining 4 signers for testing.
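The signer-independent protocols above all partition the data by signer rather than by sample. The following sketch shows one way to build such a split; the function name `signer_independent_split`, the seed, and the synthetic signer labels (mimicking the MKLM layout of 14 signers with 10 signs repeated 10 times) are assumptions for illustration, not the splits used in the experiments.

```python
import random


def signer_independent_split(signer_ids, n_train, n_val, n_test, seed=0):
    """Split sample indices so that no signer appears in more than one subset."""
    signers = sorted(set(signer_ids))
    if n_train + n_val + n_test != len(signers):
        raise ValueError("subset sizes must add up to the number of signers")
    rng = random.Random(seed)
    rng.shuffle(signers)
    groups = {
        "train": set(signers[:n_train]),
        "val": set(signers[n_train:n_train + n_val]),
        "test": set(signers[n_train + n_val:]),
    }
    return {name: [i for i, s in enumerate(signer_ids) if s in group]
            for name, group in groups.items()}


# Synthetic labels: 14 signers, each with 10 signs x 10 repetitions.
signer_ids = [s for s in range(14) for _ in range(100)]
split = signer_independent_split(signer_ids, n_train=10, n_val=2, n_test=2)

print({name: len(indices) for name, indices in split.items()})
```

Repeating the call with different seeds gives several random signer-independent splits, in the spirit of the 5-split protocol described for MKLM.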

A. Implementation Details
In order to extract the manual signs from the noisy background of the images, the automatic hand detection algorithm of [28] is used as a pre-processing step. The images are then cropped, resized to the average sign size of the training set, and normalized to be in the range [−1, 1]. Throughout this section, the proposed model is compared with state-of-the-art methods for each dataset [15], [16], [24], [27], [28]. In addition, to further attest the robustness of the proposed model, two different baselines are also implemented:
1) (Baseline 1) A CNN trained from scratch with $\ell_2$ regularization. For a fair comparison, the architecture of the baseline CNN corresponds to the architecture of the encoder network followed by the sign-classifier network of the proposed model.
2) (Baseline 2) A CNN with the baseline 1 topology, but trained with the triplet loss [29]. Here, the triplet loss concept is explored in order to impose signer-independence in the representation space and, hence, build up a more robust baseline. The underlying idea is to minimize the distance between an anchor and a positive latent representation, $h_{a,i}$ and $h_{p,i}$, respectively, while maximizing the distance between the anchor $h_{a,i}$ and a negative representation $h_{n,i}$. It is important to note that while anchor and positive latent representations have to be from the same sign class, their signer identity may or may not differ. On the other hand, anchor and negative representations are from different sign classes, whereas their signer identity may also differ.

In order to train baseline 2 in an end-to-end fashion for sign classification, the overall loss function to be minimized is a trade-off between the triplet loss $\mathcal{L}_{\text{triplet}}$ and a classification loss $\mathcal{L}_{\text{sign}}$, such that:

$$\mathcal{L} = \mathcal{L}_{\text{sign}} + \rho \sum_{i} \max\big(0, \; \| h_{a,i} - h_{p,i} \|_2^2 - \| h_{a,i} - h_{n,i} \|_2^2 + \alpha \big),$$

where $\mathcal{L}_{\text{sign}}$ corresponds to the categorical cross-entropy as defined in equation (2). The second term denotes $\mathcal{L}_{\text{triplet}}$, where $y_{a,i} = y_{p,i}$ and $y_{a,i} \neq y_{n,i}$, and $\rho \ge 0$ is a hyperparameter controlling its relative importance. The margin enforced between positive and negative pairs was fixed as $\alpha = 1$. In addition, following [29], an online triplet generation strategy was adopted, in which the hardest positive/negative samples are selected within every mini-batch.
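The online triplet generation used for baseline 2 can be illustrated with a small batch-hard sketch: for each anchor, the farthest same-class sample and the closest different-class sample in the mini-batch form the triplet. This is an assumption-laden illustration rather than the exact implementation: it uses squared Euclidean distances, averages over anchors, and the function names and toy embeddings are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean hinge loss over anchors, using the hardest positive and negative."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [squared_distance(anchor, e)
                     for j, (e, l) in enumerate(zip(embeddings, labels))
                     if l == label and j != i]
        negatives = [squared_distance(anchor, e)
                     for e, l in zip(embeddings, labels) if l != label]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor in the mini-batch
        # Hardest positive: farthest sample of the same sign class.
        # Hardest negative: closest sample of a different sign class.
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy mini-batch: two sign classes, two samples each.
embeddings = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1, 1]

print(batch_hard_triplet_loss(embeddings, labels, margin=1.0))  # 5.0
```

In this toy batch, class 0 is already compact and contributes no loss, while the two spread-out samples of class 1 each violate the margin by 10, giving a mean of 5.0.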
All deep models were implemented in PyTorch and trained with the Adam optimization algorithm using a batch size of 32 samples. For reproducibility purposes, the source code as well as the weights of the trained models are publicly available online. The hyperparameters that are common to all the implemented models (i.e., learning rate and $\ell_2$ regularization weight) as well as some hyperparameters that are specific to the proposed model (i.e., $\lambda$ and $\gamma$) and to the implemented baseline 2 (i.e., $\rho$) were optimized by means of a grid search approach and cross-validation on the training set (see Table I for more details). The signer-transfer penalty $\mathcal{L}_{\text{transfer}}$ is applied to the last two layers of the encoder network with a relative weight of 1. Regarding the model's architecture, the number of pairs of consecutive convolutional layers $L_e$ was set to 3, which results in a total of 6 convolutional layers. The number of filters starts at 32 and is then doubled after each convolutional pair. The dense layer on top of the encoder network has 128 neurons. The number of dense layers of both classifiers, $L_s$, was set to 3, and the number of nodes of each hidden layer was set to 128.

[Table residue: [24] 89.71 (-); Ferreira et al. [28] 93.17]

B. Results and Discussion
Experiments on the Jochen-Triesch, MKLM, and SI-PSL databases are summarized in Tables II, III, and IV, respectively. The results on the Jochen-Triesch database are presented in terms of average classification accuracy on the overall test set as well as against each specific background type (i.e., uniform and complex). For the MKLM database, Table III depicts the average classification accuracy computed across all 5 test splits, as well as the minimum and maximum accuracy values achieved by each method. As the SI-PSL database is clearly the most challenging one and contains a large number of sign classes (i.e., 31), the results are presented in terms of top-1, top-3 and top-5 classification accuracy (see Table IV).

The most interesting observation is the superior performance of the proposed model. Specifically, the proposed model provides the best overall classification accuracy across all the SLR databases, clearly outperforming both implemented baselines and all the previous state-of-the-art models. In complex scenarios, as reported in Table II, the proposed model surpasses all the other methods by a large margin (i.e., 91.25% against 81.25%, 74.38% and 75.63%). In addition, by analyzing the standard deviation as well as the minimum and maximum accuracy values, it is possible to observe that the proposed model is the method with the lowest variability, yielding consistently high accuracy rates across all test splits of the MKLM dataset (see Table III). These results attest the robustness of the proposed model and its capability of better dealing with the large inter-signer variability that exists in the manual signing process of sign languages. Interestingly, the obtained results also reveal that the implemented baselines are in fact fairly strong models, both of them outperforming most of the state-of-the-art methods on both datasets. Finally, it is worth mentioning the superiority of the proposed model on the most challenging database (i.e., SI-PSL). As shown in Table IV, the proposed model outperformed both implemented baselines in all three classification metrics.
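For reference, the top-k accuracy used for SI-PSL counts a prediction as correct when the true sign is among the k highest-scoring classes. A minimal sketch follows; the function name `top_k_accuracy` and the toy scores are hypothetical and unrelated to the reported results.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy class probabilities for four samples and three sign classes.
scores = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.40, 0.35, 0.25],
]
labels = [0, 2, 2, 1]

print(top_k_accuracy(scores, labels, k=1))  # 0.5
print(top_k_accuracy(scores, labels, k=2))  # 1.0
```

With k = 1 this reduces to the standard classification accuracy, while larger k values credit the model whenever the correct sign is among its leading candidates.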

C. Transfer Learning
To further improve the performance of the proposed model, we introduce a transfer learning strategy in the proposed adversarial training objective. Transfer learning aims to extract knowledge from one or multiple source tasks (or domains) and then use this prior knowledge when learning a model for a new target task [30]. Transfer learning techniques are particularly useful when dealing with limited-size training sets, as is the case for most available SLR databases. In this work, we applied a conventional transfer learning strategy that can be summarized as follows:
• The encoder network is initialized with the first 10 layers of VGG-19 [5], pre-trained on the ImageNet [6] database;
• During the first training epochs (≈ 30), the optimization algorithm is defined so that only the parameters of both classifiers are updated;
• In the remaining training epochs, the encoder network is fine-tuned for our particular task, which means that all the model parameters are updated.

It is important to note that, for a fair comparison, we have also applied the same transfer learning strategy to both implemented baselines. The performance of the models with transfer learning is reported in the bottom blocks of Tables II, III, and IV. As can be observed, transfer learning has brought substantial gains for all the models. Moreover, the most important observation is that the proposed model remains the best method by a large margin.

D. Ablation Study
Table V depicts an ablation study of the proposed model, in which it is possible to assess the effect of each proposed training scheme. For this purpose, the proposed model was trained either (i) with just the adversarial procedure, without the signer-transfer loss $\mathcal{L}_{\text{transfer}}$, or (ii) with just the $\mathcal{L}_{\text{transfer}}$ penalty on the encoder network, without adversarial training. The results clearly demonstrate the complementary effect between the two training procedures, as their combination provides the best overall classification accuracy. Interestingly, each training scheme on its own outperforms both baselines and state-of-the-art methods.
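One way to read the ablation variants is as particular settings of the two weights in objective (9): dropping the signer-transfer term corresponds to a zero weight on it, and dropping adversarial training corresponds to a zero weight on the adversarial term. The sketch below illustrates this reading; the function name, the dictionary of variants, and all numeric values are placeholders, not the tuned hyperparameters or measured losses.

```python
def encoder_objective(l_sign, l_adv, l_transfer, lam, gamma):
    """Sign loss plus weighted adversarial and signer-transfer terms."""
    return l_sign + lam * l_adv + gamma * l_transfer


# Placeholder weights: 1.0 stands for "term enabled", 0.0 for "term disabled".
ABLATION_VARIANTS = {
    "adversarial only": {"lam": 1.0, "gamma": 0.0},
    "signer-transfer only": {"lam": 0.0, "gamma": 1.0},
    "full model": {"lam": 1.0, "gamma": 1.0},
}

# Hypothetical loss values for a single mini-batch.
l_sign, l_adv, l_transfer = 0.5, 1.2, 0.3

for name, weights in ABLATION_VARIANTS.items():
    print(name, encoder_objective(l_sign, l_adv, l_transfer, **weights))
```

Expressing the variants this way keeps a single training routine for all ablation runs, with only the two weights changing between them.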

E. Latent Space Visualization
To further demonstrate the effectiveness of the proposed model in promoting signer-invariant latent representation spaces, we have performed a visual inspection of the latent representations through the t-distributed stochastic neighbor embedding (t-SNE) [31] (see Fig. 3). These plots illustrate the better capability of the proposed model to impose signer-independence in the latent representations. The proposed model yields a latent representation space in which representations of different signers and the same class are close to each other and well mixed, while latent representations of different classes are kept far apart. By analyzing the t-SNE plot of baseline 1, it is possible to observe that the latent representations of different signers and the same class tend to be far apart in the latent space. In addition, there is some overlap between clusters of different classes. Although baseline 2 (CNN with the triplet loss) brought slight improvements over the standard baseline CNN, the proposed model achieved by far the best signer-invariance and class separability.

F. Cluster Analysis in the Latent Space
In order to obtain an objective quality assessment of the produced latent representations, we have evaluated how well the model is able to cluster the different sign classes (and thus ignore the signer identity) in the latent space. For this purpose, we use two cluster validation metrics: the average Silhouette coefficient [32] per cluster and Dunn's index [33] per cluster.
The Silhouette coefficient for an observation $i$ is computed as follows. Let $C_i$ be the cluster (sign class) associated with the observation $i$. The average intra-cluster distance $a_i$ and the minimum average inter-cluster distance $b_i$ for the observation $i$ are obtained as follows:

$$a_i = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j), \qquad b_i = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(i, j),$$

where $|C|$ denotes the number of observations in the cluster $C$ and $d(i, j)$ is the Euclidean distance between the observations $i$ and $j$. Then, the Silhouette index for the observation $i$ is defined as:

$$\mathrm{sil}_i = \frac{b_i - a_i}{\max(a_i, b_i)}.$$

Clearly, $-1 \le \mathrm{sil}_i \le 1$. Intuitively, clusters are desirably compact (small $a_i$) and well separated (large $b_i$), so a larger value of $\mathrm{sil}_i$ indicates better clustering. However, this metric is defined per observation. Hence, in order to have a global measure of clustering quality, we compute the average Silhouette coefficient for each cluster.
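The per-cluster Silhouette computation described above can be sketched as follows, assuming the standard definition in which an observation is excluded from its own intra-cluster average and singleton clusters receive a score of 0 by convention. The function names and the toy points are hypothetical, and at least two clusters are required.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_per_cluster(points, labels):
    """Average Silhouette coefficient of the observations in each cluster."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = {label: [] for label in clusters}
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores[label].append(0.0)  # convention for singleton clusters
            continue
        # Average distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # Smallest average distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores[label].append((b - a) / max(a, b))
    return {label: sum(v) / len(v) for label, v in scores.items()}


# Two compact, well-separated toy clusters in a 2-D latent space.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]

print(silhouette_per_cluster(points, labels))  # both values close to 0.9
```

In the setting of this section, `points` would be latent representations of test images and `labels` their sign classes, so high values indicate that samples group by sign regardless of the signer.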
Dunn's index follows a similar idea of measuring cluster compactness versus separation, but uses minimum and maximum distances instead of average distances, and is therefore more sensitive to extreme and occasional errors. Specifically, Dunn's index for a cluster $C$ is defined as the ratio between the minimum inter-cluster distance from $C$ to all other clusters (which measures cluster separation) and the maximum intra-cluster distance $\Delta_C$ for the cluster $C$ (which measures cluster compactness):

$$DI_C = \frac{\min_{C' \neq C} \; \min_{i \in C,\, j \in C'} d(i, j)}{\Delta_C}, \qquad \Delta_C = \max_{i, j \in C} d(i, j).$$

Again, according to this metric, larger values indicate better clustering. As anticipated by the analysis of the two-dimensional t-SNE projection in Fig. 3, the results confirm that the proposed model produces the most compact and separated sign clusters, when compared with the remaining models. This observation supports the signer-invariance property of the representations produced by the proposed adversarial training framework: when exposed to images obtained from new signers, our model does a better job of grouping them according to the respective sign class only, ignoring the signer identity.
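A per-cluster Dunn's index along the lines described above can be sketched as follows: the separation is the smallest distance from any member of the cluster to any member of another cluster, and the compactness is the largest distance between two members of the cluster. The function names and the toy points are hypothetical, and a cluster with zero diameter is assigned an infinite index by convention.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dunn_per_cluster(points, labels):
    """Ratio of minimum inter-cluster distance to cluster diameter, per cluster."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    result = {}
    for label, members in clusters.items():
        # Separation: closest point belonging to any other cluster.
        separation = min(
            euclidean(points[i], points[j])
            for i in members
            for other, others in clusters.items() if other != label
            for j in others
        )
        # Compactness: largest distance between two members of the cluster.
        diameter = max(euclidean(points[i], points[j])
                       for i in members for j in members)
        result[label] = separation / diameter if diameter > 0 else math.inf
    return result


# Two compact, well-separated toy clusters in a 2-D latent space.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]

print(dunn_per_cluster(points, labels))  # {0: 10.0, 1: 10.0}
```

Because the index relies on extreme distances, a single outlying sample can change it substantially, which is the sensitivity to occasional errors noted in the text.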
Fig. 3. Two-dimensional projection of the latent representation space using the t-distributed stochastic neighbor embedding (t-SNE) [31]: (a) CNN (baseline 1); (b) CNN with triplet loss (baseline 2); (c) proposed model. Markers • and + represent 2 different test signers, while the different colors denote the 10 sign classes.

G. Training Behavior of the Proposed Model
The evolution of the loss values along the training iterations is presented in Fig. 4. In Fig. 4a, one observes a small gap between training and validation sign classification losses, indicating that the model is being regularized properly. This regularization effect is promoted by the adopted adversarial training and signer-transfer objectives, whose loss functions are depicted in Fig. 4b and Fig. 4c.
The adversarial training dynamics in Fig. 4b are an immediate consequence of the min-max game played between the signer-classifier and the encoder networks. The former aims to minimize $\mathcal{L}_{\text{signer}}$, while the latter tries to maximize it (by minimizing the surrogate $\mathcal{L}_{\text{adv}}$). Note that, at the beginning of training, both losses are equal to $\log |S|$, which is the entropy of the uniform distribution over signer identities and is the minimum possible value of $\mathcal{L}_{\text{adv}}$. This results from the fact that the untrained signer-classifier is just a random predictor. As training progresses, this network starts learning to predict correct signer identities from the provided latent representations. Therefore, $\mathcal{L}_{\text{signer}}$ starts decreasing and, consequently, $\mathcal{L}_{\text{adv}}$ increases. The min-max game eventually leads to a point where both losses become stable and fairly close to their initial value, $\log |S|$. This implies that, at the end of training, the latent representations produced by the encoder network exhibit high signer-invariance, as desired.
The signer-transfer objective exhibits a smooth evolution along the training epochs, as shown in Fig. 4c. The exception is the first few training iterations, where the corresponding loss $\mathcal{L}_{\text{transfer}}$ increases rapidly, as the network weights depart from their initial values (which are close to zero). After this short period, the distributions of the latent representations of different signers start becoming closer and the loss decreases almost monotonically, until it eventually plateaus at a low value.

V. CONCLUSION
This paper presents a novel adversarial training objective, based on representation learning and deep neural networks, specifically designed to tackle the signer-independent SLR problem. The underlying idea is to learn signer-invariant latent representations that preserve as much information as possible about the signs, while discarding the signer-specific traits that are irrelevant for sign recognition. For this purpose, we introduce an adversarial training procedure for simultaneously training an encoder and a sign-classifier over the target sign variables, while preventing the latent representations of the encoder from being predictive of the signer identities. To further discourage the underlying representations from retaining any signer-specific information, we propose an additional training objective that enforces the latent distributions of different signers to be as similar as possible. Experimental results demonstrate the effectiveness of the proposed model on several SLR databases.