Dimension Reduction for Tree-Structured Data
Erdinç Durak, Department of Industrial Engineering, METU
Statistical analysis of tree-structured data is one of the exciting research areas with expanding application areas. In classical data analysis, the data objects are points in the Euclidean space, whereas they are trees in the analysis of tree-structured data. The replacement of points in the Euclidean space with trees brings in additional complexity in data analysis and necessitates the use of dimension reduction techniques. In this study, we aim to develop dimension reduction techniques for tree-structured data where each tree is rooted and labeled. We consider two classical dimension reduction techniques, namely, the principal component analysis (PCA) and multidimensional scaling (MDS), and adapt them to tree-structured data. Unlike a previous study on the PCA for tree-structured data, the PCA techniques proposed in this study maximize a measure of variance. Computational experiments on randomly generated data and real-life data show the superiority of the proposed PCA techniques over the existing one.
In the literature, there is no study that performs MDS on tree-structured data to project them in the tree space. In this direction, we propose the first MDS method for tree-structured data, where the edges of the trees are considered as the dimensions. In the proposed MDS method, the aim is to keep the Hamming distances between pairs of trees proportionally similar. For this purpose, we propose a mixed-integer linear programming model which finds the edges to be kept in an optimal way and heuristic methods where the edges are greedily selected one by one. Computational experiments show that the proposed MDS methods are successful in keeping useful information as high clustering accuracy is achieved with only a fraction of the edges.
To be able to perform the computational experiments in a systematic way, a random tree generator algorithm is developed. This algorithm is able to generate clusters of trees with different skewness and density parameters. By changing these parameters systematically, we are able to understand the strength and weaknesses of the proposed methods.
Erdinç Durak is currently a Ph.D. student at the Industrial Engineering (IE) Department at Middle East Technical University (METU). He holds his Master's degree from the METU-IE. He also received his Bachelor's degree from the same department, where he has been working as a teaching assistant for two and a half years.
Friday, November 19, 2021, 4.00 pm - Zoom Meeting