
Dimension reduction for scientific applications: Reducing the number of dimensions, that is, the number of features representing a data point, is important in scientific applications to minimize the effect of irrelevant or redundant features in any subsequent analysis. Often many different types of features are extracted for each data point using a range of techniques, and domain information alone may not be sufficient to prune the features to keep only the relevant ones. We investigated filters, wrappers, and several non-linear dimension reduction techniques for their effectiveness in scientific applications ranging from remote sensing to astronomy and plasma physics.
Select publications (available from Google Scholar):
- Y. J. Fan and C. Kamath. “On the Selection of Dimension Reduction Techniques for Scientific Applications,” in Real World Data Mining Applications, Springer Annals of Information Systems, Volume 17, pp 91-122, 2015.
- Cantú-Paz, E., Newsam, S., Kamath, C., “Feature Selection in Scientific Applications,” Proceedings, ACM International Conference on Knowledge Discovery and Data Mining, pp 788-793, August 22-25, 2004, Seattle, WA. UCRL-CONF-202657.
- Fodor, I. K., and C. Kamath, “Dimension reduction techniques and the Classification of Bent Double Galaxies,” Computational Statistics and Data Analysis journal, Volume 41, pp. 91-122, 2002.

ASPEN – Approximate splitting for ensembles: Ensembles of classifiers, in which different classifiers are created from the same data set through randomization, can improve classification accuracy. To reduce the cost of creating multiple decision-tree classifiers, we considered two ways to randomize the split decision at each node of a tree: use a sub-sample of the instances at the node to identify the best split, or create a histogram, evaluate splits at the mid-point of each bin, and select the split point randomly within the bin that contains the best split. Combining both ideas can further reduce the cost of building the ensemble.
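The two randomized split strategies described above can be illustrated with a minimal Python sketch for a single numeric feature at one tree node. This is not the ASPEN implementation: the function names, the use of Gini impurity as the split criterion, equal-width bins, and the parameters `sample_fraction` and `n_bins` are all assumptions made for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)


def subsample(values, labels, sample_fraction, rng):
    """Draw a random sub-sample (without replacement) of the node's instances."""
    n = len(values)
    k = min(n, max(2, int(sample_fraction * n)))
    idx = rng.sample(range(n), k)
    return [values[i] for i in idx], [labels[i] for i in idx]


def sampled_split(values, labels, sample_fraction, rng):
    """Strategy 1: find the best threshold using only a sub-sample."""
    xs, ys = subsample(values, labels, sample_fraction, rng)
    distinct = sorted(set(xs))
    if len(distinct) < 2:
        return distinct[0]
    # Candidate thresholds lie halfway between adjacent sampled values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_impurity(xs, ys, t))


def histogram_split(values, labels, n_bins, rng):
    """Strategy 2: score each bin's mid-point, then pick a random
    threshold inside the best-scoring bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / n_bins  # equal-width bins (an assumption)
    midpoints = [lo + (b + 0.5) * width for b in range(n_bins)]
    best_bin = min(range(n_bins),
                   key=lambda b: split_impurity(values, labels, midpoints[b]))
    left_edge = lo + best_bin * width
    return rng.uniform(left_edge, left_edge + width)


def combined_split(values, labels, sample_fraction, n_bins, rng):
    """Both ideas together: build the histogram from a sub-sample."""
    xs, ys = subsample(values, labels, sample_fraction, rng)
    return histogram_split(xs, ys, n_bins, rng)


# Toy usage: one feature whose class changes at 0.5.
rng = random.Random(0)
values = [i / 100 for i in range(100)]
labels = [0 if v < 0.5 else 1 for v in values]
t_sampled = sampled_split(values, labels, sample_fraction=0.2, rng=rng)
t_hist = histogram_split(values, labels, n_bins=10, rng=rng)
t_both = combined_split(values, labels, sample_fraction=0.2, n_bins=10, rng=rng)
```

Because each tree in the ensemble would draw a different sub-sample or a different random point within the best bin, the resulting trees differ from one another even though they are built from the same data, while each split decision examines far fewer candidate thresholds than an exhaustive search.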
Select publications (available from Google Scholar):
- Kamath, C., E. Cantú-Paz, and D. Littau, “Approximate Splitting for Ensembles of Trees using Histograms,” Proceedings, Second SIAM International Conference on Data Mining, pp. 370-383, April 2002.
- Kamath, C., and E. Cantú-Paz, “Creating ensembles of decision trees through sampling,” Proceedings, 33rd Symposium on the Interface of Computing Science and Statistics, Costa Mesa, CA, June 2001. Also available as Lawrence Livermore National Laboratory technical report, UCRL-JC-14226.