Feature Selection for Microarray Gene Expression Data using Simulated Annealing guided by the Multivariate Joint Entropy

In this work a new way to calculate the multivariate joint entropy is presented. This measure is the basis for a fast information-theoretic based evaluation of gene relevance in a Microarray Gene Expression data context. Its low complexity is based on the reuse of previous computations to calculate current feature relevance. The mu-TAFS algorithm --named as such to differentiate it from previous TAFS algorithms-- implements a simulated annealing technique specially designed for feature subset selection. The algorithm is applied to the maximization of gene subset relevance in several public-domain microarray data sets. The experimental results show a notoriously high classification performance and low size subsets formed by biologically meaningful genes.


I. INTRODUCTION
In cancer diagnosis, classification of the different tumor types is of great importance. An accurate prediction of different tumor types provides better treatment and toxicity minimization on patients. Traditional methods of tackling this situation are primarily based on morphological characteristics of tumorous tissue [1]. These conventional methods are reported to have several diagnosis limitations. In order to analyze the problem of cancer classification using gene expression data, more systematic approaches have been developed [2].
Pioneering work in cancer classification by gene expression using DNA microarray showed the possibility to help the diagnosis by means of Machine Learning or more generally Data Mining methods [3], which are now extensively used for this task [4]. However, in this setting gene expression data analysis entails a heavy computational consumption of resources, due to the extreme sparseness compared to standard data sets in classification tasks [5].
Typically, a gene expression data set may consist of dozens of observations but with thousands or even tens of thousands of genes. Classifying cancer types using this very high ratio between number of variables and number of observations is a delicate process. As a result, dimensionality reduction and in particular feature subset selection (FSS) techniques may be very useful. Finding small subsets of very relevant genes among a huge quantity could derive in much specific and efficient treatments.
This work addresses the problem of selecting a subset of features by using the TAFS (Thermodynamic Algorithms for Feature Selection) family of methods for the FSS problem. Given a suitable objective function, the algorithm makes uses of an special-purpose simulated annealing (SA) technique to find a good subset of features that maximizes the objective function. A distinctive characteristic of TAFS over other search algorithms for FSS is its probabilistic capability to accept momentarily worse solutions, which in the end may result in better hypotheses. Despite their powerful optimization capability, SA-based search algorithms usually lack execution speed, involving long convergence times. In consequence, they have been generally excluded as an option in FSS problems, let alone in highly complex domains such as microarray gene expression data. A few contributions using the classical SA algorithm for FSS are found in prostate protein mass spectrometry data [6], marketing applications [7], or parameter optimization in clustering gene expression analysis [8].
Our answer to these computational problems is twofold. First, we use a filter objective function for FSS (thus avoiding the development of a predictive model for every subset evaluation). Second, the objective function itself is evaluated very efficiently based in the reutilization of previous computations.
Specifically, a new way to calculate the multivariate joint entropy for categorical variables is presented that is both exact and very efficient. This measure is then used by a SA-based TAFS algorithm to search for small subsets of highly relevant genes in five public domain microarray data sets. Classification experiments yield some of the best results reported so far for these data sets and offer a drastic reduction in subset sizes.
The paper is organized as follows: section II briefly reviews the Simulated Annealing technique; section III reviews and motivates the previous Thermodynamic Algorithms for feature subset selection; section IV develops the information-theoretic measure for feature relevance and its efficient implementation; section V describes the data sets and the experimental settings; section VI presents the results and their interpretation. The paper ends with the conclusions and directions for future work.

II. SIMULATED ANNEALING
Simulated Annealing (SA) is a stochastic technique inspired on statistical mechanics for finding (near) globally optimal solutions to large optimization problems. SA is a weak method in that it needs almost no information about the structure of the search space. The algorithm works by assuming that some parts of the current solution belong to a potentially better one, and thus these parts should be retained by exploring neighbors of the current solution. Assuming the objective function is to be minimized, then SA would jump from hill to hill and hence escape or simply avoid sub-optimal solutions.
When a system S (considered as a set of possible states) is in thermal equilibrium (at a given temperature T ), the probability that it is in a certain state s, called P T (s), depends on T and on the energy E(s) of the state s. This probability follows a Boltzmann distribution: where k is the Boltzmann constant and Z acts as a normalization factor. Metropolis and his co-workers developed a stochastic relaxation method that works by simulating the behavior of a system at a given temperature T [9]. Being s the current state and s ′ a neighboring state, the probability of making a transition from s to s ′ is the ratio P T (s → s ′ ) between the probability of being in s and the probability of being in s ′ : where we have defined ∆E = E(s ′ ) − E(s). Therefore, the acceptance or rejection of s ′ as the new state depends on the difference of the energies of both states at temperature T . If P T (s ′ ) ≥ P T (s) then the "move" is always accepted. It P T (s ′ ) < P T (s) then it is accepted with probability P T (s, s ′ ) < 1 (this situation corresponds to a transition to a higher-energy state). Note that this probability depends upon the current temperature T and decreases as T does. In the end, there will be a value of T low enough (the freezing point), wherein these transitions will be very unlikely and the system will be considered frozen. In order to maximize the probability of finding states of minimal energy at every value of T , thermal equilibrium must be reached. To do this, according to Metropolis, an annealing schedule is designed to prevent the process from getting stuck at a local minimum. The SA algorithm introduced in [10] consists in using the Metropolis idea at each temperature T for a finite amount of time. In this algorithm T is first set at a initially high value, spending enough time at it so to approximate thermal equilibrium. Then a small decrement of T is performed and the process is iterated until the system is considered frozen.
If the cooling schedule is well designed, the final reached state may be considered a near-optimal solution. However, the whole process is inherently slow, mainly because of the thermal equilibrium requirement at every temperature T .

III. THERMODYNAMIC ALGORITHMS FOR FSS
In this section we review TAFS (Thermodynamic Algorithm for Feature Selection) and eTAFS, two algorithms for FSS that were originally designed for problems of moderate feature size (up to one hundred) [11]. If we consider FSS as a search of possible feature subsets of the full feature set X , then SA acts as a combinatorial optimization process [12]. In this sense, TAFS and eTAFS find a subset of features that optimize the value of a given objective function J : P(X ) → R. From now on, we assume that this function is to be maximized and that J ≥ 0 1 .
To this end, a special-purpose forward/backward mechanism is embedded into an SA algorithm, taking advantage of its most distinctive characteristic, the probabilistic acceptance of worse scenarios over a finite time. This characteristic is enhanced by the notion of an ǫ-improvement: a feature ǫ-improves a current solution if it has a higher value of the objective function or a value not worse than ǫ%. This mechanism is intended to account for noise in the evaluation of the objective function (caused either by the finiteness of the data set or introduced by the chosen resampling method).
The pseudo-code of TAFS is depicted in Algorithm 1. The algorithm consists of two major loops. The outer loop waits for the inner loop to finish and then updates T according to the chosen cooling schedule. When this loop reaches T min , the algorithm halts. It keeps track of the best solution found (which is not necessarily the current one).
The inner loop is the core of the algorithm and is composed of two interleaved procedures: Forward and Backward, that iterate until an equilibrium point is found. These procedures work independently of each other, but share information about the results of their respective search in the form of the current solution. Within them, FSS takes  place and the mechanism to escape from local minima starts working. These procedures iteratively add or remove features one at a time in such a way that an ǫ-improvement is accepted unconditionally, whereas a non ǫ-improvement is accepted probabilistically. The pseudo-code for Forward and Backward, and ǫ-improvement is outlined in Algorithms 2, 3 and 4. When Forward and Backward finish their respective tasks, TAFS checks if the current solution is the same as it was prior to their execution. If this is the case, then we consider that thermal equilibrium has been reached and T is adjusted, according to the cooling schedule. If it is not, another loop of Forward and Backward is carried out.

A. eTAFS: an Enhanced TAFS Algorithm
A modification to Algorithm 1 aimed at speeding up relaxation time is presented in this section. The algorithm -named eTAFS, see Algorithms 5 and 6-is endowed with a feature search window (of size l) in the backward step, as follows. In forward steps always the best feature is added (by looking all possible additions). In backward steps this search is limited to l tries at random (without replacement). The value of l is incremented by one at every thermalequilibrium point. This mechanism is an additional source of non-determinism and a bias towards adding a feature only when it is the best option available. On the contrary, to remove one, it suffices that its removal ǫ-improves the current solution. Another direct consequence is of course a considerable speed-up of the algorithm. Note that the design of eTAFS is such that it grows more and more deterministic, informed and costly as it converges towards the final configuration. l ← l + 1 Algorithm 6: eTAFS Backward procedure (Z, J Z are modified). Note that X 0 can be efficiently computed while in the for loop).

A. Entropy definitions
Entropy, a main concept in information theory [13], can be seen as an average of uncertainty in a random variable. If X is a discrete random variable with probability mass function p, its entropy is defined by 2 : being E[ ] the expectation operator. If a variable (X) is known and another one (Y ) is not, the conditional entropy of Y with respect to X: is the mutual entropy with respect to the corresponding conditional distribution: From these two definitions another concept is build, the mutual information (MI), which can be interpreted as a measure of the information that a random variable has or explains about another one.
The computation of the MI can be extended from the bivariate to the multivariate case of a number n ≥ 2 of variables, against another one: 2 All log are to base 2.
Conditional MI is expressed in the natural way, by conditioning in (4): The MI has been used with success as for feature selection in machine learning tasks. Currently there is no agreed-upon definition of the general multivariate mutual information I(X 1 ; . . . ; X n ). An existent proposal is the interaction information, described e.g. in [14] which, for the case of three variables X, Y, Z, is defined as I(X; Y ; Z) = I(X; Y |Z)−I(X; Y ). The extension to the multivariate case is in terms of the marginal entropies and is given by: This definition is impractical due to its exponential character. In the next section, the objective function J takes the form of an information-theoretic index of relevance based on the multivariate joint entropy, which has already been used elsewhere [15]. One of the contributions of this paper resides in a fast implementation of the calculation and its application to microarray gene expression data.

B. Incremental Multivariate Joint Entropy
For a random variable X, it is known that the joint entropy obeys the following property: This property says that joint entropy is always at least equal to the entropies of the original system: adding a new variable can never reduce the available uncertainty. If we rewrite (7) as an equation: then △ X (Y ) ≥ 0 represents the increment in entropy due to the addition of the variable Y to the system. In a feature selection setting, given Z a class variable, τ ⊂ X the current subset and H(τ ) its joint entropy, if a new feature X i ∈ X \τ is considered for possible inclusion in the current subset: It turns out that, to obtain the next calculation, it is computationally far more advantageous to store H(Z, τ ) and calculate the quantity △ Z,τ (X i ) than to compute the full joint entropy H(Z, τ ∪{X i }) directly. In order to obtain this value, and incremental procedure to calculate multivariate joint entropy has been developed, as described in the sequel. The incremental multivariate joint entropy (9) must be computed at every evaluation step involving a possible candidate feature X i to be included in the current subset τ . Throughout the process, τ is associated with its current Marginal Entropy Scheme (MES), a table storing the unique values contained in the data set for its forming features and its corresponding entropy value. An example of a MES table for two binary variables {X 1 , X 2 } is shown in Table I.
At the initial step (τ = ∅) the MES table for the addition of X 1 to ∅ is indicated in the left part of Table I. The two unique values and their entropies H(X 1 = 0) = 0.481 and H(X 1 = 1) = 0.515 are calculated. Let us suppose that a feature X 2 is to be evaluated w.r.t the current subset τ = {X 1 }. The MES table with its unique forming patterns are indicated in the right part of Table I. We can see that by introducing X 2 to the current subset τ , four partitions are generated for each unique value of X 1 : {00, 01, 10, 11}.
In the particular case of X 1 = 0, a change in its entropy contribution is produced by the action of X 2 by splitting it into two entropy values: H(X 1 = 0, X 2 = 0) = 0.488 and H(X 1 = 0, X 2 = 1) = 0.523, for a total entropy of H(X 1 = 0, X 2 ) = 1.011. The increment in entropy △ τ is obtained as the difference between the current MES (considering the addition of X 2 ) and the previous scheme (without it) -see Table II.  Finally, this last value is applied to eq. (9) to obtain the joint entropy H(X 1 , X 2 ) = H(X 1 ) + △ τ (X 2 ) = 0.996 + 0.954 = 1.950. The listings in Algorithms 7 and 8 show the pseudo-code to compute the procedure explained above. The notation D|τ stands for the restriction of the dataset D to the features in τ .
Initial entropy is evaluated in lines 2-5. This first step calculates starting joint entropy as well as its first MES (lines Hτ ← Hτ + △τ 14 Eτ ← E τ + // new MES 15 4-5), which will be taken as input to the next computation. Note that these two lines can be efficiently implemented as one function and using only one loop-cycle, with complexity θ(|D|), where |D| is the number of training instances.
In the else part of the if clause, the MES is calculated with the addition of X i to the current subset τ (named E τ + ). Taking into account that previous MES inherits the ordering sequence derived from a previous stage (because of lines 5 and 9), entropies generated by changes in the MES given by τ ∪ {X i } are summed (E τ − ) in groups (line 11) by the newly formed patterns, rendering a one-to-one correspondence between previous MES and current MES. Thus, the entropy contribution △ τ (X i ), showing the effect of adding X i to τ , is computed by the difference in both MESs (line 11), being finally added to the current entropy H τ (line 13). The implementation of lines 10-11 follows the same consideration as lines 4-5, and hence complexity is in the same order.
The incremental multivariate joint entropy is used to obtain an index of relevance (acting as the objective function) of a feature X i ∈ X to a class Z with respect to a subset τ ⊂ X \ X i and is defined by: Note the denominator acts as a normalization factor, such that J ∈ [0, 1], with J = 1 corresponding to the highest relevance. The reward of using this objective function by a TAFS-like algorithm consists in the possibility of testing it in highly complex domains such as microarray data sets. We name the combination of eTAFS and the objective function in eq. (10) as the µ-TAFS algorithm.

V. EXPERIMENTAL WORK
To compute the necessary entropies described in previous section, a discretization process is needed. This change of representation does not often result in a significant loss of accuracy (sometimes significantly improves it [16], [17]); it also offers reductions in learning time [18]. In this work, the CAIM algorithm was selected for two reasons: it is designed to work with supervised data, and does not require the user to define a specific number of intervals [19].

A. Data sets
Five public-domain microarray gene expression data sets are used to test and validate the approach proposed in this work: Colon Tumor: 62 observations of colon tissue, of which 40 are tumorous and 22 normal, 2,000 genes [20]. Leukemia: 72 bone marrow observations and 7,129 probes: 6,817 human genes and 312 control genes [3]. The goal is to tell acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL). Lung Cancer: distinction between malignant pleural mesothelioma and adenocarcinoma of lung [21]; 181 observations with 12,533 genes. Prostate Cancer: used in [22] to analyze differences in pathological features of prostate cancer and to identify genes that might anticipate its clinical behavior; 136 observations and 12,600 genes. Breast Cancer: 97 patients with primary invasive breast carcinoma; 12,600 genes analyzed [23].

B. Settings
Provided that the core nature of the µ-TAFS algorithm, and even many other algorithms -e.g. Genetic Algorithms, Neural Networks-, resides in their stochasticity, several runs have to be performed, in order to better asses the average behaviour of the methods.
The experimental design to test µ-TAFS algorithm measures performance by carrying out m = 100 different independent runs. In each run, µ-TAFS is executed on the corresponding dataset and returns the set of all those feature subsets reaching the best found performance function (maximum relevance, in this case). To overcome the existence of many solutions, the subset that offers the lowest mutual information (MI) among its elements -i.e. the less redundancy-is taken as the subset delivered in this run. After completing the m execution runs, the obtained subsets can be ordered from minimum to maximum MI value.
The µ-TAFS parameters are as follows: ǫ = 0.01, T 0 = 0.1 and T min = 0.0001. These settings were chosen after preliminary fine-tuning and are kept constant for all the problems [11]. The cooling function was chosen to be geometric α(t) = 0.9 t, following recommendations in the literature [12]. Data

A. µ-TAFS performance results
The evolution of µ-TAFS from a high temperature state to a frozen point is depicted in Fig. 1. Highly unstable -i.e. high temperature condition-readings are observed at the initial stages in each of the datasets. As soon as the algorithm becomes more relaxed due to eq. (1), worse solutions are avoided. The frozen condition is observed at the final stages of each execution, where J values consecutively reach the maximum possible value (J = 1) in all cases.
The running performance of µ-TAFS is summarized in Table III. The results show that µ-TAFS yields subsets of considerably low size and also low variability. Notorious readings correspond to Leukemia and Lung Cancer. It is conjectured that such sizes respond to the nature of the proposed information theoretic model on discretized data sets, in the sense that only a few genes significantly contribute to increase the index of relevance given by eq. (10). On the one hand, working with continuous features, the index would tend to vary smoothly -i.e. generating small increments; as a consequence, more features are added-deleted. On the other hand, discrete features variations are normalized by their discretization scheme, so small increments in the real-value are merged into a single discrete value. Therefore, mostly significant increments are truly reflected in its additiondeletion from the current subset. Computational demands when processing the smaller data sets (as for Colon tumor, with 2,000 features) are kept by µ-TAFS considerable low (5 to 10 minutes). The two more complex problems, Prostate and Breast Cancer require approximately 1.5 and 2 hours of total processing time. Unfortunately, there is scarcely any reporting on time consumption in recent scientific literature that would enable us to establish a reasonable comparison.

B. µ-TAFS accuracy results
Eight classifiers were evaluated by means of 10 times 10-fold Cross Validation (10x10 CV), a resampling method designed to handle small-sized data sets. The chosen classifiers are: the nearest-neighbor technique with Euclidean metric (kNN) and parameter k (number of neighbors running from 1 to 15), the Naïve Bayes classifier (NB), a Linear and Quadratic Discriminant classifier (LDC), Logistic Regression (LR), the Support Vector Machine with linear and quadratic kernel (lSVM and rSVM) and parameter Cregularization constant (with C = 2 k , k running from −7 to 7) and the Support Vector Machine with radial basis function kernel (rSVM) and parameter C and γ-smoothing in the kernel function (with γ = 2 k , k running from −7 to 7) 3 . The non parametric Wilcoxon signed-rank test 4 is used for the (null) hypothesis that the median of the differences between the errors of the winner classifiers per data set and another classifier's error is zero. The non-parametric Wilcoxon signed-rank test will be used for the (null) hypothesis that the median of differences between classifiers accuracies are zero, at the 95% level of significance.  The obtained solutions are displayed in Table IV. Among the eight classifiers used to test the solutions, only the final model is presented. Lung Cancer, Leukemia and Prostate Cancer reach remarkably high accuracies, while Colon Tumor and specially Breast Cancer show lower 10x10 CV readings. In all cases, the subset that delivers this performance is considerable small, having 7 genes or less (and only 3 genes in the Leukemia data set). Moreover, all Wilcoxon test p-values signal significant differences (p < 0.05) between the best method and all other methods in the corresponding data set, except for the lSVM vs. LR in Colon Tumor (p = 0.312).

C. Discussion of the results
It is a common practice to compare to similar works in the literature. Unfortunately, the methodological steps are in general very different, especially concerning resampling techniques, making an accurate comparison a delicate undertaking. Nonetheless, such a comparison is presented in Table  V. Seven references which are illustrative of recent work are indicated, including previous work from the authors. In this table the validation method, the best classifier and the best reported result are detailed (final accuracy and number of genes involved).
The Colon Tumor data set presents difficulties in classification, never reaching 90%. The solution delivered by µ-TAFS is comparable with the best known (that of BGS 3 [25]); however, it uses 5 genes against the 9 used by BGS 3 .   The other difficult problem seems to be Breast Cancer.
In this data set, µ-TAFS gives the best result among the references consulted, using also less genes and in front of solutions that employ a pure wrapper strategy. For the other three problems, µ-TAFS is also able to yield better solutions compared to other approaches, many of them using a much bigger gene subset.
Expression levels for each model in the five data sets are given in Fig. 2

D. Biological evidence in the solution subsets
The genes corresponding to the solutions displayed in Table IV are detailed in Table VI. In the following, known biological evidence is presented about the effect of gene expressions in each cancer disease. This evidence is assembled by examining recent relevant medical literature.
This gene encodes a member of the cysteine-rich protein (CSRP) family. It may be involved in regulatory processes important for development and cellular differentiation. Hypomethylation, a decrease in the epigenetic methylation of cytosine and adenosine residues in DNA, of CSRIP1 and other genes were confirmed in the cancer cells by bisulfite sequencing [32]. • H08393 COL11A2-collagen, type XI, alpha 2 (Homo sapiens). Two alpha chains of type XI collagen, a minor fibrillar collagen are encoded by this gene [33]. Upregulation of this gene in the mucosa stromal cells of both familial adenomatosis polyposis and sporadic colorectal cancer has been detected [34]. • T51849 EPHB1-Tyrosine-protein kinase receptor elk precursor. EphB1 is a member of receptor tyrosine kinases of the EphB subfamily and has been positively identified in the development, progress and prognosis of colorectal cancers [35]. • M19131 CALM2-calmodulin 2 (phosphorylase kinase, delta). Caml2 plays an important role in intracellular calcium signaling, which regulates a variety of cellular processes, such as cell proliferation and gene transcription [36]. Increased expression levels of this gene were found in anaplastic large cell lymphoma cell lines [37]. • T51288 ASS1-argininosuccinate synthase (human).
Arginine, a semi-essential amino acid in humans, is critical for the growth of human cancers as in primary ovarian, stomach and colorectal cancer, whose expression levels read high values [38].

Leukemia:
• AFFX-CREX-5 AT NOT IDENTIFIED.  • L09209 APLP2-amyloid beta (A4) precursor-like protein 2 (Homo sapiens). The function of this gene is not fully understood, but it conjectured that may play a role in the regulation of hemostasis [39]. This gene was reported as over-expressed by other scientific literature as in [40] • X95735 at ZYX-ZYXIN. It is involved in the spatial control of actin assembly and in the communication between the adhesive membrane and the cell nucleus [41]. This is a gene found in many cancer classification studies [3], [42], [43], and is highly correlated with acute myelogenous leukemia. Lung Cancer: • 37957 at ATG4-Autophagy related 4 homolog A. Autophagy is the process by which endogenous proteins and damaged organelles are destroyed intracellularly.
Autophagy is postulated to be essential for cell homeostasis and cell remodeling during differentiation, metamorphosis, non-apoptotic cell death, and aging [39].
It is activated during amino-acid deprivation and has been associated with neurodegenerative diseases, cancer, pathogen infections and myopathies [44].  [39]. No medical evidence was found in literature about its role in cancer.
Prostate Cancer: • 1230 g at MTMR11-myotubularin related protein 11. Experiments on patients with acute lymphoblastic leukemia and with Burkitt lymphoma, three putative oncogenes or tumor suppressor genes were found, one of them was the MTMR11 [47]. • 38322 at PAGE4-P antigen family, member 4 (prostate associated). This gene is strongly expressed in prostate and prostate cancer; and also expressed in other tissues as in testis, fallopian tube, uterus, placenta, as well as in testicular cancer and uterine cancer [39]. • 37639 at HPN-Hepsin. Hepsin is a cell surface serine protease and plays an essential role in cell growth and maintenance of cell morphology and it is highly related with prostate cancer, benign prostatic hyperplasia [39]. • 32909 at AQP5-aquaporin 5. Acting as a water channel protein, Aquaporins are a family of small integral membrane proteins linked to other proteins, whose role is the generation of saliva, tears and pulmonary secretions [39]. Experiments with cases of normal and epithelial ovarian tumors tissues suggest an important role of this gene in the tumorigenesis of the latter, and a possible relationship with the ascites formation of ovarian carcinoma [48]. • 660 at CYP24A1-cytochrome P450, family 24, subfamily A, polypeptide 1. This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids [39]. This gene has been reported as responsible for degradation of the active vitamin D metabolite 1,25-dihydroxyvitamin D3 which is known to be antimitotic in prostate cancer cells [49]. • 35998 at Hypothetical protein LOC284244 (LOC284244). No evidence found. • 34107 at PFKFB2-6-phosphofructo-2-kinase/fructose-2,6-biphosphatase 2. The protein encoded by this gene is involved in the synthesis and degradation of fructose-2,6-bisphosphate, a regulatory molecule that controls glycolysis in eukaryotes [39]. It has been suggested that the induction of de novo lipid synthesis -process that protects cancer cells from free radicals and chemotherapeutics-by androgen requires the upregulation of HK2 and PFKFB2 [50]. Breast Cancer: • AB014543 CLUAP1-clusterin associated protein 1 (Homo sapiens). This gene is highly expressed in osteosarcoma, ovarian, colon, and lung cancers [51]. • Contig57657 RC PAK1-p21 protein (Cdc42/Rac)activated kinase 1 (Homo sapiens). This gene encodes a family member of serine/threonine p21-activating kinases, known as PAK proteins, whose role is the regulation of cell motility and morphology [33]. Pak1 is directly related with the Etk/Bmx protein, acting this later as a control to the proliferation and tumorigenic growth of mammary epithelial cancer cells [52]. • NM 006191 PA2G4-Proliferation-associated 2G4, 38kDa (PA2G4). Also known as EBP1, this gene encodes an RNA-binding protein that is involved in growth regulation [39]. The EBP1 has been shown to be a transcriptional corepressor that inhibits the growth of human breast cancer cell lines [53]. • Contig14882 RC, Contig53822 RC, Con-tig53713 RC NOT IDENTIFIED.

VII. CONCLUSIONS
A new algorithm for feature selection using Simulated Annealing guided by the discrete multivariate joint entropy has been introduced and evaluated. Our experimental results concern the search for small subsets of highly relevant genes in five public domain Microarray Gene Expression data samples. The very promising results indicate that the algorithm offers a promising and general framework for feature selection in very high dimensional data sets.
The entropic relevance measure has shown to be a good candidate as the objective function to be optimized by the algorithm. The reported classification results are competitive to current standards in analyzing microarray gene expression data with a very affordable execution time. This last aspect should not be overlooked, since database size is constantly growing and the complexity of optimization scenarios (that make extensive use of resampling methods) is ever greater.

VIII. ACKNOWLEDGMENTS
Authors wish to thank to Spanish CICyT Project no. CGL2004-04702-C02-02, CONACyT and UABC for supporting this research from its beginning.