Abstract
Big data are indispensable for machine learning and complex data modeling, but computation with big data is expensive, requiring extensive computer memory and computing time. Methods are therefore sought that scale down the amount of raw data or the model size without discarding substantial information in the original data. Such approximation aims to reduce the required computing resources while maintaining high performance on the related machine learning tasks. In this research, we investigate the approximation issue in two machine learning areas: feature selection, which learns a small set of critical features in big data, and neural network sparsification, which determines a small set of pertinent connections between neurons in a neural network. An additional goal of this research is to reveal pertinent relationships across these two areas.
Information theory has great potential in machine learning, offering an alternative way to extract information from data and to approximate data models. In particular, mutual information-based methods have been developed for feature learning and for sparsifying neural networks, albeit with mixed results, and previous work has yet to establish connections across the two areas. We propose a mutual information-based framework that addresses the approximation issue in both subtopics and reveals pertinent relationships between them.
In this research, the proposed mutual information-based framework is tested on a large collection of microarray gene expression data from human cancers for disease classification. Microarray expression data, containing tens of thousands of genes, are ideal for evaluating methods for both feature learning and neural network sparsification.
In particular, the significant gene subset identified by our method reduces the number of genes required for classification ten- to hundred-fold while outperforming previous methods. Sparsifying neural networks with mutual information between neuron outputs lets us remove up to $90\%$ of unnecessary connections while maintaining or even improving performance. Our experiments reveal that the sparsified neural network ignores unimportant (irrelevant) genes and relies only on the significant or pseudo-significant genes identified through gene filtering in the first part of this research.
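The gene-filtering step described above scores each gene by its mutual information with the class label. The sketch below illustrates the general idea on synthetic data; the equal-width binning, bin count, and the synthetic "informative" versus "irrelevant" genes are illustrative assumptions, not the specific estimator or data used in this research.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Estimate I(X; Y) in bits between a continuous feature x
    (discretized into equal-width bins) and a discrete label y."""
    # Assign each value of x to one of `bins` equal-width bins.
    edges = np.histogram_bin_edges(x, bins=bins)
    x_binned = np.digitize(x, edges[1:-1])  # values in 0 .. bins-1
    classes = {c: i for i, c in enumerate(np.unique(y))}
    # Joint distribution over (binned feature, label).
    joint = np.zeros((bins, len(classes)))
    for xi, yi in zip(x_binned, y):
        joint[xi, classes[yi]] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of the feature
    py = joint.sum(axis=0, keepdims=True)  # marginal of the label
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                    # two disease classes
informative = y + 0.5 * rng.standard_normal(500)    # expression tracks the class
irrelevant = rng.standard_normal(500)               # expression ignores the class
mi_informative = mutual_information(informative, y)
mi_irrelevant = mutual_information(irrelevant, y)
```

Ranking genes by such a score and keeping only the top-scoring ones is one way to realize the ten- to hundred-fold reduction reported above: informative genes receive a high score, while irrelevant ones score near zero.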