The Fifth Statistics and Machine Learning Workshop

Session I: "Pattern Recognition and Variable Selection"

Organizer: 鄭順林, Department of Statistics, National Cheng Kung University

Chair: 盧鴻興, Institute of Statistics, National Chiao Tung University


Machine Learning Approaches for Multimedia Pattern Recognition






Abstract: In this talk, I will introduce machine learning approaches that we have developed for speech recognition, face recognition and information retrieval. I will first describe the problems that arise in multimedia information systems and present our ideas for resolving them. In speech recognition, we present a factor analysis (FA) approach that identifies the acoustic factors and builds FA streamed hidden Markov models (FASHMM). FASHMM realizes streamed Markov chains for a sequence of multivariate Gaussian mixture observations through state transitions of the partitioned FA vectors. In face recognition, adaptive pattern recognition is critical in nonstationary environments, where facial angles, expressions and background illumination vary. We present a recursive Bayesian algorithm that updates the hyperparameters of a normal-gamma distribution, which makes adaptation of the facial models possible. In information retrieval (IR), the aim is to retrieve the documents relevant to a user query. The ranking of the retrieved documents plays an important role in evaluating system performance. We present a discriminative training algorithm that optimizes the order of the ranked documents or, equivalently, minimizes the Bayes risk due to ranking errors. A Bayesian retrieval rule is established, and the best performance is achieved in terms of average precision. Finally, I will discuss ongoing work in my lab and suggest some potential directions toward high-impact work.





Stochastic Matching Pursuit for Bayesian Variable Selection


Ray-Bing Chen (1), Chi-Hsiang Chu (1), Te-You Lai (1) and Ying Nian Wu (2)


(1) Institute of Statistics, National University of Kaohsiung, Taiwan

(2) Department of Statistics, University of California, USA


Abstract: This article proposes a stochastic version of the matching pursuit algorithm for Bayesian variable selection in linear regression models. In the Bayesian formulation, the prior distribution of each coefficient is assumed to be a mixture of a point mass at 0 and a normal distribution with zero mean and a large variance. The proposed stochastic matching pursuit algorithm is designed to sample from the posterior distribution of the coefficients for the purpose of variable selection. It combines the efficiency of the matching pursuit algorithm with a rigorous Bayesian formulation that has well-defined prior distributions on the coefficients. The algorithm is a Metropolis scheme with a pair of reversible moves. One is the addition move, which adds a new variable to the existing set of selected variables; variables with larger correlations with the residuals are assigned higher probabilities of being added, in a fashion very similar to the original matching pursuit algorithm. The other is the deletion move, which deletes a variable from the existing set of selected variables. Several simulated and real examples, covering both the large-n, small-p case and the small-n, large-p case, are used to illustrate the proposed algorithm. These examples show that the algorithm is efficient in screening and selecting variables.
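The addition move described in this abstract can be illustrated with a small sketch. The abstract does not give the exact proposal distribution, so the weighting below (absolute inner product between each unselected column and the current residuals, scaled by the column norm and normalized to sum to one) is an assumption made only for illustration, and the function name addition_probabilities is hypothetical. The deletion move and the Metropolis acceptance step are not shown.

```python
import math

def addition_probabilities(X, residual, selected):
    """Return {column index: probability of being proposed for addition}.

    X is a list of rows, residual is the current residual vector, and
    selected is the set of column indices already in the model.
    Assumed weighting (not taken from the abstract):
    |x_j . residual| / ||x_j||, normalized to sum to 1.
    """
    scores = {}
    for j in range(len(X[0])):
        if j in selected:
            continue
        col = [row[j] for row in X]
        norm = math.sqrt(sum(c * c for c in col))
        if norm == 0.0:
            scores[j] = 0.0  # an all-zero column carries no information
            continue
        scores[j] = abs(sum(c * r for c, r in zip(col, residual))) / norm
    total = sum(scores.values())
    if total == 0.0:
        # residual is orthogonal to every candidate: fall back to uniform
        return {j: 1.0 / len(scores) for j in scores}
    return {j: s / total for j, s in scores.items()}

# Toy data: the response is exactly twice the first column.
X = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y = [2, 0, 2, 0]
# With no variable selected yet, the residual equals y.
probs = addition_probabilities(X, y, selected=set())
```

In this toy example the first column receives the largest proposal probability because it is the most strongly aligned with the residuals.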



Segmentation of cDNA Microarray Images by Kernel Density Estimation






Abstract: The segmentation of cDNA microarray spots is essential in analyzing the intensities of microarray images for biological and medical investigation. In this work, nonparametric methods based on kernel density estimation are applied to segment two-channel cDNA microarray images. This approach groups pixels into foreground and background. The segmentation performance of the model is tested and evaluated on 16 microarray data sets. In particular, spike genes with various contents are spotted on a microarray to examine and evaluate the accuracy of the segmentation results. A duplicated design is implemented to evaluate the accuracy of the model. The results of this study demonstrate that the method can cluster pixels and estimate spot statistics with high accuracy.
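As a rough illustration of how a kernel density estimate can separate pixels into two groups, the sketch below fits a one-dimensional Gaussian KDE to pixel intensities and thresholds at the density valley between the two highest modes. This is one simple scheme chosen for illustration, not necessarily the procedure used in the talk; the function names, the bandwidth and the toy intensities are all assumptions.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density

def valley_threshold(samples, bandwidth, grid_size=200):
    """Intensity at the lowest density point between the two highest modes."""
    density = gaussian_kde(samples, bandwidth)
    lo, hi = min(samples), max(samples)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [density(g) for g in grid]
    # Interior local maxima of the estimated density on the grid.
    peaks = [i for i in range(1, grid_size - 1)
             if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]
    if len(peaks) < 2:
        raise ValueError("density is not bimodal at this bandwidth")
    # Keep the two highest peaks, ordered from left to right.
    left, right = sorted(sorted(peaks, key=lambda i: dens[i])[-2:])
    valley = min(range(left, right + 1), key=lambda i: dens[i])
    return grid[valley]

# Toy intensities: six dim background pixels and four bright spot pixels.
pixels = [10, 11, 12, 9, 10, 11, 50, 52, 49, 51]
threshold = valley_threshold(pixels, bandwidth=3.0)
labels = ["foreground" if p > threshold else "background" for p in pixels]
```

The bandwidth controls how many modes the density shows, so in practice it would need to be chosen with care for each image.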



Regularized Double Nearest Proportion Feature Extraction for Hyperspectral Image Classification






Abstract: When classifying different landcover types in a hyperspectral image, especially when the sample size is small, a feature extraction method is usually needed to reduce the dimension and increase the classification accuracy. In this presentation, several existing feature extraction methods, such as Fisher’s linear discriminant analysis and nonparametric discriminant analysis, will be discussed first. Then a new feature extraction method is introduced by constructing new scatter matrices in Fisher’s criterion. Based on a proposed double nearest proportion structure, the overlap between class distributions can be taken into account through a weighted, generally full-rank between-class scatter matrix, so that the boundaries between distributions are emphasized even when overlap occurs. Moreover, based on this structure, a new regularized within-class scatter matrix, for which the shrinkage densities are analytically determined, is also proposed. The effect of a large dimension with few samples can therefore be reduced. The performance on several simulations and on the classification of real hyperspectral images will also be demonstrated in this presentation.




Session II: "Machine Learning Research at Institute of Information Science, Academia Sinica"

Organizer: 許鈞南, Institute of Information Science, Academia Sinica

Chair: 陳春賢, Department of Information Management, Chang Gung University


Periodic Step-size Adaptation for Single-pass On-line Learning






Abstract: It has previously been established that the second-order stochastic gradient descent (SGD) method can potentially achieve generalization performance as good as the empirical optimum in a single pass through the training examples. However, second-order SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents a new second-order SGD method, called Periodic Step-size Adaptation (PSA). We analyze the accuracy of the approximation and the convergence properties of PSA. Experimental results show that the single-pass performance of PSA is always very close to the empirical optimum for a wide variety of models and tasks.


Improving Local Learning by Exploring the Effects of Ranking






Abstract: Local learning for classification is useful in dealing with various computer vision problems. One key factor in making such approaches effective is finding good neighbors for the learning procedure. In this talk, I will describe a novel method that ranks neighbors by learning a local distance function and, at the same time, derives the local distance function by focusing on the high-ranked neighbors. The two considerations can be elegantly coupled through a well-defined objective function, motivated by a supervised ranking method called P-Norm Push. Although the local distance functions are learned independently, they can be reshaped together so that their values can be directly compared. We apply the proposed method to the Caltech-101 dataset and demonstrate that the use of proper neighbors can improve the performance of classification techniques based on nearest-neighbor selection.



A Bi-prototype Theory of Facial Attractiveness






Abstract: The attractiveness of human faces can be predicted with a high degree of accuracy if we represent the faces as feature vectors and compute their relative distances from two prototypes, namely, the average of attractive faces and the average of unattractive faces. Moreover, the degree of attractiveness, defined in terms of the relative distance, exhibits a high degree of correlation with the average rating scores given by human assessors. These findings motivate a bi-prototype theory that relates facial attractiveness to the averages of attractive and unattractive faces, rather than the average of all faces, as previously hypothesized by some researchers.
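The relative-distance idea in this abstract can be sketched in a few lines. The abstract does not say how the feature vectors are built or how "relative distance" is defined, so the Euclidean metric and the difference of the two distances used below are assumptions, and all names and toy vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_distance_score(face, attractive_proto, unattractive_proto):
    """Assumed score: distance to the unattractive prototype minus distance
    to the attractive prototype. Larger values mean the face lies closer
    to the attractive prototype."""
    return (euclidean(face, unattractive_proto)
            - euclidean(face, attractive_proto))

# Toy two-dimensional "feature vectors" for the two groups of faces.
attractive_proto = mean_vector([[1.0, 1.0], [3.0, 1.0]])        # [2.0, 1.0]
unattractive_proto = mean_vector([[-1.0, -1.0], [-3.0, -1.0]])  # [-2.0, -1.0]

score_near_attractive = relative_distance_score(
    [1.0, 1.0], attractive_proto, unattractive_proto)
score_at_unattractive = relative_distance_score(
    [-2.0, -1.0], attractive_proto, unattractive_proto)
```

A face is then predicted to be attractive when its score is positive, and the score itself can be compared with the average human rating.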




Session III: "Bioinformatics"

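Organizer: 黃貞瑛, Department of Computer Science and Information Engineering, Fu Jen Catholic University

Chair: 張源俊, Institute of Statistical Science, Academia Sinica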


Data-driven Computational Function Association Networks in Cancer Study






Abstract: We have implemented a computational analysis tool that constructs function association networks in order to analyze the enriched gene sets of microarray data in cancer and stem cell studies. The main goal of the functional network analysis is to study the most significant biological pathways and functions among the genes that are differentially expressed between leukemia stem cells and normal stem cells. The tool helps scientists understand the major biological and functional groups involved in a focused gene list. In the past, functional groups were displayed in table format; functional group association networks instead allow users to visualize the relationships among different functional groups and gene annotations. In addition to functional group association networks, we also implement another algorithm that constructs gene-function relationships and identifies important genes with similar functional annotations. These differentially expressed genes are potential biomarkers for cancers and other diseases. We generated a list of genes differentially expressed between leukemia stem cells and normal stem cells. From this gene list, we are able to construct functional group association networks and gene-function association networks.




Selection of Dominant and Dormant Marker Genes from Multi-class Microarray Expression Data






Abstract: Cancer is a complex disease that develops from the accumulation of multiple gene mutations. In different subtypes of a cancer, patients may carry diverse gene mutations even if they have similar symptoms. Many studies have therefore been designed to probe the mechanism of carcinogenesis by analyzing gene expression profiles from microarray data. In this study, we have developed a gene evaluation index, referred to as the "Gene Dominant/Dormant Index (GDI)", to select biomarkers for each subgroup of a cancer. Moreover, we have built a multiclass diagnostic system based on the selected biomarkers and machine learning techniques. In simulation results, we make a visual assessment of the specificity of the selected biomarkers for each group and demonstrate the outstanding prediction performance of our diagnostic system.








Combinatorial Patterns of Somatic Gene Mutations in Cancer






Abstract: Cancer is a complex process in which the abnormalities of many genes appear to be involved. The combinatorial patterns of gene mutations may reveal the functional relations between genes and pathways in tumorigenesis as well as identify targets for treatment. We examined the patterns of somatic mutations of cancers from COSMIC, a large-scale database curated by the Wellcome Trust Sanger Institute. The frequently mutated genes are well-known oncogenes and tumor suppressors that are involved in generic processes of cell cycle control, signal transduction and stress responses. These “signatures” of gene mutations are heterogeneous when cancers from different tissues are compared. Mutations in genes functioning in different pathways can occur in the same cancer (i.e., co-occur), whereas mutations in genes functioning in the same pathway rarely occur in the same sample. This observation supports the view of tumorigenesis as derived from a process like Darwinian evolution. However, certain combinatorial mutational patterns violate these simple rules and demonstrate tissue-specific variations. For instance, mutations of genes in the Ras and Wnt pathways tend to co-occur in cancers of the large intestine, but are mutually exclusive in cancers of the pancreas. The relationships between mutations in different samples of a cancer can also reveal the temporal order of mutational events. In addition, the observed mutational patterns suggest candidates for new co-sequencing targets that can either reveal novel patterns or validate the predictions deduced from existing patterns. These combinatorial mutational patterns provide guiding information for the ongoing cancer genome projects.





Session IV: "Theory, Software, and Applications of Statistical Machine Learning"

(English Session)

Organizer: 陳君厚 中央研究院統計科學研究所 (Chun-Houh Chen, Institute of Statistical Science, Academia Sinica)

Chair: 陳宏 國立台灣大學數學系 (Hung Chen, Department of Mathematics, National Taiwan University)


Recent Developments of Java-based Statistical Software: Jasp and Jasplot


Junji Nakano (1), Yoshikazu Yamamoto (2), Ikunori Kobayashi (2), Takeshi Fujiwara (3)


(1) The Institute of Statistical Mathematics, Japan

(2) Tokushima Bunri University, Japan

(3) Tokyo University of Information Science, Japan


Abstract: We are developing a statistical analysis system, Jasp, and a statistical graphics library, Jasplot, in the Java language. They use several advanced features of Java to realize an advanced statistical environment that offers an object-based statistical language, a user interface, extensibility, ease of use and distributed computing functions. In addition, we have developed a function for using programs written in the language of the statistical system XploRe, which has many useful program resources and recently became free software. We have also added new functions for network use of Jasplot.





Information Divergence Geometry and Its Application to Machine Learning


江口真透 (Shinto Eguchi)


日本-東京-統計數理研究所 (The Institute of Statistical Mathematics, Japan)


Abstract: Information divergence is known to be associated with a Riemannian metric and dual connections on the space of density functions. Any divergence leads to a specific statistical model and estimation method if one of the dual connections equals the mixture connection. As a typical example, the Kullback-Leibler (KL) divergence is associated with the exponential model and maximum likelihood. Similarly, any U-divergence is associated with a U-model and a U-method, which have applications to PCA, ICA, Gaussian mixtures and boosting, with the function U selected for the specific goal. The choice U = exp reduces to the case of the KL divergence. The kernel method for multivariate data analysis is now popular and is based on the theory of reproducing kernel Hilbert spaces (RKHS). The U-model and U-method can be extended to the kernel method on an RKHS.




A Boosting Method Focusing on the Partial Area under the ROC Curve


小森 理 (Osamu Komori)


日本-東京-統計數理研究所 (The Institute of Statistical Mathematics, Japan)


Abstract: The receiver operating characteristic (ROC) curve has attracted wide attention for its utility in the medical and biostatistical fields. Given a set of multiple markers obtained from some clinical test or examination, the area under the ROC curve (AUC) is calculated to measure their ability to discriminate between controls and cases. Although the AUC has some useful properties that are quite different from those of error rates or odds ratios, it has begun to be criticized for the discrepancy between the statistical and clinical evaluations of the markers. In this context, we develop a new statistical method that maximizes the partial area under the ROC curve (pAUC) based on a boosting technique. Among a predetermined set of weak classifiers that are basis functions of natural cubic splines, the algorithm selects the best one in the sense of the pAUC at every boosting step. The resultant score plots are useful for understanding how each marker is associated with the two groups (controls and cases). The performance of the partial AUCBoost is compared with that of other boosting methods, and its utility is illustrated using a real data set.
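The quantity being maximized, the pAUC, can be computed empirically as in the sketch below. It assumes that the partial area is taken over false positive rates from 0 up to a cutoff fpr_max and that the empirical ROC curve is integrated with the trapezoidal rule; the abstract fixes neither choice, and the function names are hypothetical. The boosting step that selects spline-based weak classifiers is not shown.

```python
def roc_points(scores, labels):
    """Empirical ROC points (FPR, TPR); labels are 1 for cases, 0 for controls."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process tied scores together so ties produce a single ROC point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def partial_auc(scores, labels, fpr_max):
    """Area under the empirical ROC curve for FPR in [0, fpr_max]."""
    pts = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= fpr_max:
            break
        if x1 > fpr_max:
            # Clip the segment at fpr_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy marker scores: three cases (1) and three controls (0).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
full_auc = partial_auc(scores, labels, fpr_max=1.0)
pauc_half = partial_auc(scores, labels, fpr_max=0.5)
```

Setting fpr_max to 1 recovers the ordinary AUC, which makes the sketch easy to check against the usual pairwise definition.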






Session V: "Boosting and Its Applications"

Organizer: 曹振海, Department of Applied Mathematics, National Dong Hwa University

Chair: 銀慶剛, Institute of Statistical Science, Academia Sinica


The Application of Spatial Information Knowledge Mining Technique






Abstract: Rice production has drawn great interest in applying satellite remote sensing classification techniques for accurate productivity estimation. Data mining techniques have been used to extract useful knowledge from large image databases through different approaches. Ancillary information can also be combined with machine learning methods such as CART and rough sets and applied to satellite images for accurate identification of land use and land cover. This research proposes a boosting technique to better handle the changes in characteristics caused by re-sampling during the image classification process. The results show better clustering and identification of rice production areas than traditional supervised or unsupervised image classification methods.




Applying Kernel Approaches to L2Boosting






Abstract: Recently, the successful L2Boosting algorithm for linear regression problems, which involves model selection criteria proposed by Bühlmann and Yu (2005), has received considerable attention. The algorithm can also be applied to data arising from a nonlinear model, but its performance in that context can be much improved. To this end, we propose a new method called K-L2Boosting, which combines L2Boosting with kernel techniques to handle continuous data with a complicated nonlinear structure. In addition, we extend the new method to classification problems by transforming the response variables in a way similar to that used by Friedman, Hastie and Tibshirani (2000). These new methods are illustrated with one simulation study and three real data examples. Comparisons between K-L2Boosting and other existing methods are also presented.
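For readers unfamiliar with the baseline, plain L2Boosting for linear regression can be sketched as repeated least-squares fitting of the current residuals. The version below uses a componentwise linear base learner and a shrinkage factor nu, which is a common textbook form and an assumption here; it is not the K-L2Boosting method of the talk, it omits the model selection criteria for choosing the number of steps, and it assumes the predictor columns are centered.

```python
def l2boost(X, y, n_steps=200, nu=0.1):
    """Componentwise L2Boosting: at each step, regress the residuals on the
    single best column and add a shrunken copy of that fit to the model."""
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coef = [0.0] * p
    resid = [yi - intercept for yi in y]
    cols = [[row[j] for row in X] for j in range(p)]
    for _ in range(n_steps):
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j, col in enumerate(cols):
            ss = sum(c * c for c in col)
            if ss == 0.0:
                continue
            b = sum(c * r for c, r in zip(col, resid)) / ss
            gain = b * b * ss  # reduction in the residual sum of squares
            if gain > best_gain:
                best_j, best_b, best_gain = j, b, gain
        if best_j is None:
            break  # no column reduces the residuals any further
        coef[best_j] += nu * best_b
        resid = [r - nu * best_b * c for r, c in zip(resid, cols[best_j])]
    return intercept, coef

# Toy data with centered, mutually orthogonal columns: y = 1 + 3 * x0.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.0, -2.0, 1.0, 4.0, 7.0]
intercept, coef = l2boost(X, y)
```

With a small shrinkage factor the fit approaches the least-squares solution gradually, which is why the stopping step acts as the tuning parameter in practice.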



Boosting for Ordinal Responses






Abstract: Classification of ordinal responses arises in many important applications. However, theoretical studies and practical implementations of effective multi-class classifiers for ordinal responses are relatively few. Many multi-class classifiers are essentially combinations or votings of binary or nominal classifiers. The ordering information is often not fully exploited because of the limitations of these classifiers. Inspired by Friedman, Hastie and Tibshirani (2000), Chen and Tsao (2009) proposed AHCBoost (Adjustable Hyperbolic Cosine Boost) specifically for multi-class problems, with no reduction to binary problems. Along this line, we investigate variants of AHCBoost for ordinal responses with different configurations of tuning parameters and modifications of the loss functions. Results based on simulations and benchmark data sets will be reported.






Session VI: "Efficient/Robust Kernel Methods and Applications"

(English Session)

Organizer: 陳素雲 中央研究院統計科學研究所 (Su-Yun Huang, Institute of Statistical Science, Academia Sinica)

Chair: 吳鐵肩 國立成功大學統計系 (Tiee-Jian Wu, Department of Statistics, National Cheng Kung University)


Design of Efficient Methodology for Target Region Estimation in Computer Experiments


洪英超 (Ying-Chao Hung)


國立中央大學統計研究所 (Graduate Institute of Statistics, National Central University)


Abstract: In this study, we develop an efficient methodology for estimating the input region of computer experiments in which a pre-specified target output value can be met. The proposed methodology is sequential and uses the ideas of online support vector regression (OL-SVR) and uniform design (UD). Specifically, the goal is to establish an adequate model that predicts the boundary of the target input region well while minimizing the required number of experimental trials. The efficiency and accuracy of the proposed methodology are illustrated on some real examples.




Online Robust Kernel Principal Components Analysis


黃信雄 (Hsin-Hsiung Huang)


中央研究院統計科學研究所 (Institute of Statistical Science, Academia Sinica)


Abstract: Kernel principal component analysis (KPCA) has recently been studied in image coding. When new data join the original large group of data, recalculating the principal components takes a lot of time. Furthermore, the online algorithm still works when the data dimensionality is too large for the principal components to be computed in batch. When a common learning rate is applied in this algorithm, we can show the asymptotic stability of the principal components. We conduct image reconstruction experiments to compare the gradient descent and batch methods. We find that the online robust KPCA saves computing time compared with batch mode. The effectiveness of the robust KPCA in reducing the influence of outliers will be demonstrated by image coding examples.



A Kernel-weight-based Regularized Least Squares Support Vector Machine for Gene Selection


蕭朱杏 (Chuhsing Kate Hsiao)


國立台灣大學公共衛生學系暨流行病學研究所 (Department of Public Health, College of Public Health, National Taiwan University)


Abstract: Selection of influential genes with microarray data often faces the difficulties of a large number of genes and a relatively small group of subjects. In addition to the curse of dimensionality, many gene selection methods weight the contribution from each individual subject equally. This equal-contribution assumption cannot account for the possible dependence among subjects who associate similarly with the disease, and may restrict the selection of influential genes. A novel approach to gene selection is proposed based on kernel similarities and kernel weights. We do not assume uniformity of subject contributions. Weights are calculated via regularized least squares support vector regression (RLS-SVR) of class levels on kernel similarities and are used to weight the subject contributions. The cumulative sums of weighted expression levels are then ranked to select responsible genes. These procedures also work for multiclass classification. We demonstrate this algorithm on acute leukemia, colon cancer, small round blue cell tumors of childhood, breast cancer, and lung cancer studies, using kernel Fisher discriminant analysis and support vector machines as classifiers. Other procedures are compared as well. The proposed approach is easy to implement and fast in computation for both binary and multiclass problems. The gene set provided by the RLS-SVR weight-based approach contains fewer genes and achieves higher accuracy than those of other procedures.