TY - JOUR
T1 - Randomized PCA forest for approximate k-nearest neighbor search
AU - Rajabinasab, Muhammad
AU - Pakdaman, Farhad
AU - Zimek, Arthur
AU - Gabbouj, Moncef
N1 - https://doi.org/10.1016/j.eswa.2024.126254
PY - 2024/12
Y1 - 2024/12
N2 - k-Nearest Neighbors (kNN) search is the problem of finding k points which are the closest to a given query point. It is used widely in a wide range of tasks and is among the most important tools in applied machine learning. Traditional algorithms for kNN search require computing distances between a query point and all other points in the dataset, and therefore is very slow and inefficient for large data. In this paper, we propose an approximate algorithm for kNN search to find the nearest neighbors fast and efficiently. We employ a tree-based structure which offers robustness and scalability. We propose to use Principal Component Analysis (PCA) to find the best splitting direction to fit the data on the trees. Seeking solutions with low computational complexity, (1) we use a randomized Singular Value Decomposition solver, which reduces PCA complexity from being associated with the number of features to being associated with the number of required principal values; (2) we reuse PCA calculations in multiple nodes to save computation while maintaining accuracy; (3) we ensemble these trees for improved performance, and (4) finally, we propose several variants of the proposed method which target a higher accuracy or a higher efficiency. Extensive experimental results show that proposed solutions outperform existing methods in terms of accuracy, while maintaining competitive complexity. The fast implementation variant of the proposed method outperforms existing techniques in terms of complexity and shows competitive accuracy in performing k-nearest neighbors’ search
AB - k-Nearest Neighbors (kNN) search is the problem of finding k points which are the closest to a given query point. It is used widely in a wide range of tasks and is among the most important tools in applied machine learning. Traditional algorithms for kNN search require computing distances between a query point and all other points in the dataset, and therefore is very slow and inefficient for large data. In this paper, we propose an approximate algorithm for kNN search to find the nearest neighbors fast and efficiently. We employ a tree-based structure which offers robustness and scalability. We propose to use Principal Component Analysis (PCA) to find the best splitting direction to fit the data on the trees. Seeking solutions with low computational complexity, (1) we use a randomized Singular Value Decomposition solver, which reduces PCA complexity from being associated with the number of features to being associated with the number of required principal values; (2) we reuse PCA calculations in multiple nodes to save computation while maintaining accuracy; (3) we ensemble these trees for improved performance, and (4) finally, we propose several variants of the proposed method which target a higher accuracy or a higher efficiency. Extensive experimental results show that proposed solutions outperform existing methods in terms of accuracy, while maintaining competitive complexity. The fast implementation variant of the proposed method outperforms existing techniques in terms of complexity and shows competitive accuracy in performing k-nearest neighbors’ search
U2 - 10.1016/j.eswa.2024.126254
DO - 10.1016/j.eswa.2024.126254
M3 - Tidsskriftartikel
SN - 0957-4174
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 126254
ER -