The Advanced Matrix Factorization Jungle

A living document on state-of-the-art algorithms, implementations, and phase transitions

Welcome to the Jungle

This Matrix Factorization Jungle page ventures beyond classical decompositions into factorizations that impose structural assumptions on the unknowns: sparsity, low rank, non-negativity, subspace membership, and more. As sparse recovery solvers matured, a rich ecosystem emerged. Sparse Coding & Dictionary Learning (Olshausen & Field) and NMF (Lee & Seung) were the first such decompositions to be widely used with constraints on the factor matrices. Yet, while some practitioners adopted these factorizations, there was a sense that they "weren't rigorous enough," since little theoretical understanding accompanied them. The discovery of compressed sensing (Candès, Romberg, Tao, Donoho) and its extension to matrices, namely the realization that the nuclear norm can serve as a convex proxy for rank, opened the floodgates to powerful new decomposition techniques.

What distinguishes these methods is that they come with phase transitions: sharp boundaries in parameter space where algorithms switch from succeeding to failing. These phase transitions are generally a sign that the algorithm, or the factorization itself, has hit a hard limit, one that may stem from the computational complexity of the problem being solved. This document maps that landscape, from video surveillance to recommendation systems, blending randomized linear algebra, nonconvex optimization, and probabilistic modeling.

Traditional Matrix Decompositions: There is a stellar and comprehensive treatment of classical matrix factorization techniques (LU, Cholesky, QR, SVD, Polar, Eigendecomposition, Jordan, Schur, CUR/Skeleton, Hessenberg, Tridiagonal, and Bidiagonal decompositions) in Jun Lu's monographs "Matrix Decomposition and Applications" and "Numerical Matrix Decomposition and its Modern Applications". Most of these traditional techniques place no constraints on the matrix or data being factorized. In scientific computing, however, operators are sometimes approximated with decompositions and factorizations to permit the design of better inverse problem solvers (see the recent survey "Fast Direct Solvers" by Per-Gunnar Martinsson and Michael O'Neil).

The New Frontier: Additional background on LoRA methods used with LLMs comes from this survey by Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang.

Notation: A = observed  |  L = low-rank  |  S = sparse  |  N = noise  |  D = dictionary  |  X = coefficients
Matrix Factorization Jungle Evolution of Phase Transitions

"The Sparsest of Them All" — Evolution of phase transitions from Gauss to Krzakala, Mézard, Sausset, Sun, Zdeborová

References

This page was first announced on Nuit Blanche in August 2011.

Quick Reference for Advanced Matrix Factorization: Structural Assumptions & Unknowns

FACTORIZATIONS (A = DX, UVᵀ)

Problem | Form | Loss D(A; DX) | Known | Unknown | Key Constraints | Phase Trans.
SVD | A = DX | ‖A - DX‖_F² | A | D, X | DᵀD = I, XXᵀ = Λ (diagonal) |
pLSI | A = DX | KL(A; DX) | A | D, X | 1ᵀD1 = 1, d_ij ≥ 0, 1ᵀX = 1, x_ij ≥ 0 |
NMF | A = DX | ‖A - DX‖_F² or KL(A; DX) | A | D, X | D ≥ 0, X ≥ 0 |
Dictionary Learning | A = DX | ‖A - DX‖_F² | A | D, X | ‖d_j‖_2 = 1, X sparse | Yes
Sparse PCA | A = DX | ‖A - DX‖_F² | A | D, X | ‖d‖_2 = 1, D sparse | Yes
K-Means | A = DX | ‖A - DX‖_F² | A | D, X | XXᵀ = I, X_ij ∈ {0,1} | Yes
K-Medians | A = DX | ‖A - DX‖_1 | A | D, X | XXᵀ = I, X_ij ∈ {0,1} | Yes
Subspace Clustering | A = AX | ‖A - AX‖_F² | A | X | diag(X) = 0, X sparse and/or low-rank |
MMV / Comp. Sensing | Y = AX | ‖Y - AX‖_F² | Y, A | X | X jointly row-sparse | Yes
BSS / ICA | Y = AX | ‖Y - AX‖_F² | Y | A, X | rows of X independent |
Recommender Systems | R = UVᵀ | ‖R - UVᵀ‖_F² | R_Ω | U, V | U, V low-rank |
LoRA | W' = W + BA | L(W + BA) | W | B, A | B ∈ R^{d×r}, A ∈ R^{r×k}, r ≪ d, k |
Autoencoders | X = D(E(X)) | ‖X - D(E(X))‖_F² | X | E, D | bottleneck dimension k; linear case ≡ PCA |
Non-Convex Opt. | M = UVᵀ | ‖M - UVᵀ‖_F² | M | U, V | rank(UVᵀ) ≤ r | Yes
Tensor Decomp. | T = Σ_r a_r⊗b_r⊗c_r | ‖T - Σ_r a_r⊗b_r⊗c_r‖_F² | T | factors | CP/Tucker/TT structure |
Graph Matching | A = XBXᵀ | ‖A - XBXᵀ‖_F² | A | X, B | X permutation matrix | Yes
Generalized MF | W⊙L = W⊙UVᵀ | ‖W⊙(L - UVᵀ)‖_F² | W (mask) | U, V, L | L of minimal rank |
Archetypal Analysis | A = DX, D = AB | ‖A - DX‖_F² | A | D, X, B | X, B ≥ 0, archetypes in convex hull |
Matrix Comp. Sensing | A(L) = b | ‖A(L) - b‖_2² | A, b | L | L rank-r | Yes
Kernel Factorizations | K = ΦΦᵀ | ‖K - ΦΦᵀ‖_F² | K (kernel) | Φ (features) | K PSD |

DECOMPOSITIONS (A = L + S)

Problem | Form | Loss D(A; L, S) | Known | Unknown | Key Constraints | Phase Trans.
Robust PCA | A = L + S | ‖L‖_* + λ‖S‖_1 | A | L, S | L low-rank, S sparse | Yes
Matrix Completion | P_Ω(M) = P_Ω(L) | ‖L‖_* | M_Ω | L | L low-rank, incoherent | Yes
SPCP / Noisy RPCA | A = L + S + N | ‖L‖_* + λ‖S‖_1 + ‖N‖_F² | A | L, S, N | L low-rank, S sparse, N dense noise | Yes

Jun Lu's Taxonomy of Classical Matrix Decompositions

Matrix Decomposition World Map

Figure 1: Matrix Decomposition World Map — relationships between classical decompositions

Matrix Decomposition World Map Under Conditions

Figure 2: Matrix Decomposition World Map Under Conditions — taxonomy by matrix structure

From Jun Lu, "Matrix Decomposition and Applications" (arXiv:2201.00145)

Decomposition (L + S)
A = L + S + N

Applications: Video surveillance, face recognition, Netflix recommendations

Factorization (UVᵀ)
A = D X

Applications: Image denoising, compression, feature learning

Spectral Clustering & K-Means

$\underbrace{A}_{\text{known}} = \underbrace{D}_{\text{unknown}} \underbrace{X}_{\text{unknown}}$ s.t. $XX^T = I$, $X_{ij} \in \{0,1\}$ Phase Transition

Matrix factorization view: Given data A, find unknown centroids D and assignment matrix X. The constraint XXT = I enforces that each data point belongs to exactly one cluster, while Xij ∈ {0,1} makes assignments binary. K-Median uses L1 loss instead of L2. Spectral methods relax the binary constraint and partition via graph Laplacian eigenvectors.
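
As a quick illustration of this A ≈ DX reading, here is a minimal sketch using scikit-learn's KMeans: D is assembled from the centroids, X from one-hot assignments, and ‖A - DX‖_F² coincides with the reported inertia. Sizes and cluster count are arbitrary choices for the example.

```python
# Minimal sketch of the "A ~ D X" view of K-means (columns of A are data points).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
m, n, k = 5, 300, 3                      # feature dim, number of points, clusters
A = np.hstack([rng.normal(loc=c, size=(m, n // k)) for c in (-4, 0, 4)])

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A.T)  # sklearn wants samples as rows
D = km.cluster_centers_.T                # m x k  (unknown centroids)
X = np.zeros((k, n))
X[km.labels_, np.arange(n)] = 1.0        # k x n  binary assignments, one per column

# The squared loss ||A - D X||_F^2 is exactly the K-means objective (inertia).
print(np.linalg.norm(A - D @ X) ** 2, km.inertia_)
```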

Operators: Random walk, Adjacency, Laplacian, Modularity, Non-Backtracking

K-Means++ Arthur, Vassilvitskii — improved seeding
Spectral Clustering Ng, Jordan, Weiss — paper
sklearn.cluster.KMeans scikit-learn — KMeans(n_clusters=k)64.6k Python
sklearn.cluster.SpectralClustering scikit-learn — graph Laplacian eigenvectors64.6k Python
K-means/K-median phase transitions

K-means/K-median: Empirical probability of integrality for SDP/LP relaxations. Figure from: Awasthi, Bandeira, Charikar, Krishnaswamy, Villar, Ward — "Relax, no need to round: integrality of clustering formulations" (2015)

Subspace Clustering

$\underbrace{A}_{\text{known}} = \underbrace{A}_{\text{known}} \underbrace{X}_{\text{unknown}}$ s.t. $\text{diag}(X) = 0$, $X$ is sparse and/or low-rank

Matrix factorization view: Given data matrix A, find unknown coefficient matrix X where each column xi expresses data point ai as linear combination of other points. Constraint diag(X) = 0 prevents trivial self-representation. Additional regularization: ||X||1 (SSC), ||X||* (LRR), or ||X||F (LSR).
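
A rough sketch of the self-expressive step is below, assuming scikit-learn's Lasso as the per-column ℓ1 solver with an illustrative penalty; the symmetrized |X| + |X|ᵀ affinity is then passed to spectral clustering, as in SSC.

```python
# Sketch of the self-expressive step of Sparse Subspace Clustering (SSC):
# each column a_i is regressed on all other columns with an L1 penalty;
# diag(X) = 0 is enforced by excluding a_i from its own regression.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# two 2-dimensional subspaces in R^5, 40 points each
bases = [np.linalg.qr(rng.normal(size=(5, 2)))[0] for _ in range(2)]
A = np.hstack([B @ rng.normal(size=(2, 40)) for B in bases])   # 5 x 80
n = A.shape[1]

X = np.zeros((n, n))
for i in range(n):
    others = np.delete(np.arange(n), i)
    lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=5000)
    lasso.fit(A[:, others], A[:, i])
    X[others, i] = lasso.coef_

W = np.abs(X) + np.abs(X).T                 # symmetrized affinity matrix
labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(W)
print(labels[:40].mean(), labels[40:].mean())  # each block should be (near) constant
```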

SSC Elhamifar, Vidal
LRR Liu et al.
LSR Canyi Lu
SMCE Sparse Manifold
LLE Local Linear Embedding
SSC-ADMM JHU Vision Lab — Elhamifar official45 Matlab
SSC (Becker) S. Becker — proximal gradient + ADMM19 Matlab
LRR (Python) port from G. Liu's MATLAB34 Python
Subspace criteria

Subspace clustering criteria satisfying EBD conditions. Figure from: Elhamifar, Vidal — "Sparse Subspace Clustering: Algorithm, Theory, and Applications" (2013)

Non-Negative Matrix Factorization (NMF)

$\min_{D,X} \|\underbrace{A}_{\text{known}} - \underbrace{D}_{\text{unknown}} \underbrace{X}_{\text{unknown}}\|_F^2$ s.t. $D \in \mathbb{R}^{m \times r}_+$, $X \in \mathbb{R}^{r \times n}_+$, $D, X \geq 0$

Matrix factorization view: Given non-negative data matrix A, find unknown dictionary D and coefficients X, both constrained to be element-wise non-negative. The rank r controls the number of parts/topics. Non-negativity yields interpretable, parts-based representations.
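
A minimal example with scikit-learn's NMF is sketched below; the rank and initialization are illustrative choices. Note that sklearn writes the factorization as A ≈ WH with samples as rows, which maps onto the A = DX convention up to transposition.

```python
# Minimal sketch: parts-based factorization A ~ W H with scikit-learn's NMF.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = np.abs(rng.normal(size=(100, 40)))        # non-negative data, rows = samples

model = NMF(n_components=5, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(A)                    # 100 x 5, non-negative coefficients
H = model.components_                         # 5 x 40,  non-negative dictionary rows

print(W.min() >= 0, H.min() >= 0)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))   # relative Frobenius error
```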

Deep NMF (2018+): Multiple layers of factorization discover hierarchical structure. Neural NMF uses backpropagation for end-to-end training.

SPAMS J. Mairal (Inria) et al.21 Python Matlab
HALS Gillis, Glineur — Accelerated HALS
ANLS-NMF Kim, Park
HotTopixx Bittorf, Recht, Ré, Tropp NeurIPS 2012
Near-Separable NMF Gillis — robust LP-based
K-SVD Elad, Aharon
Deep NMF Trigeorgis et al. CVPR 2014 — multi-layer
sklearn.decomposition.NMF scikit-learn — Lee & Seung multiplicative updates64.6k Python
nmf (Python) MUR, ANLS, ADMM, AO-ADMM10 Python
nmflib Lee & Seung original KL + Euclidean Matlab

Matrix Completion

$\min_L \text{rank}(\underbrace{L}_{\text{unknown}})$ s.t. $P_{\underbrace{\Omega}_{\text{known}}}(L) = P_\Omega(\underbrace{M}_{\text{known}})$, $L$ is low-rank Phase Transition

Matrix factorization view: Given partial observations MΩ at index set Ω, recover unknown low-rank matrix L. Convex relaxation: min ||L||* (nuclear norm). Phase transition at m ≈ O(nr log n) samples for rank-r recovery. Made famous by Netflix Prize.
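
The sketch below is a bare-bones soft-impute/SVT-style iteration on synthetic data: it alternates between restoring the observed entries and soft-thresholding the singular values. The threshold and iteration count are illustrative, not tuned, and production work should use one of the solvers listed below.

```python
# Bare-bones iterative sketch in the spirit of SVT / soft-impute.
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))     # ground-truth low-rank matrix
mask = rng.random((n, n)) < 0.4                            # observed index set Omega

L, tau = np.zeros((n, n)), 5.0
for _ in range(200):
    L[mask] = M[mask]                                      # enforce P_Omega(L) = P_Omega(M)
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    L = (U * np.maximum(s - tau, 0.0)) @ Vt                # singular-value soft-thresholding

print(np.linalg.norm(L - M) / np.linalg.norm(M))           # relative recovery error
```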

Graph-Based (2019+): GNN-based inductive methods generalize to unseen users/items and even transfer across datasets.

SVT Cai, Candès, Shen
TFOCS S. Becker, Candès, Grant142 Matlab
OptSpace Keshavan et al.
LMaFit Wen, Yin, Zhang — nonlinear SOR
GRASTA He et al. — Grassmannian streaming
LRGeomCG Vandereycken
Matrix ALPS Kyrillidis, Cevher 2012
GROUSE Balzano, Nowak, Recht — rank-one updates
Jellyfish Recht, Ré VLDB 2013 — parallel SGD
BiG-AMP Parker et al.
SVP Meka et al.
RTRMC Boumal, Absil — Riemannian trust-region
IGMC Zhang & Chen ICLR 2020 — GNN inductive Python
IMC-GAE 2021 — Graph Autoencoder
ScaledGD nonconvex, near-optimal
Matrix-Completion-Methods SVT, SVP, TNNR-ADMM42 Matlab
MatrixCompletion.jl OptSpace, OR1MP, SVT Julia

Robust PCA & Stable PCA

$\underbrace{A}_{\text{known}} = \underbrace{L}_{\text{unknown}} + \underbrace{S}_{\text{unknown}}$, solve $\min_{L,S} \|L\|_* + \lambda\|S\|_1$, $L$ is low-rank, $S$ is sparse Phase Transition

Matrix factorization view: Given observed matrix A, decompose into unknown low-rank component L (rank(L) ≤ r) and unknown sparse component S (||S||0 ≤ s). Nuclear norm ||L||* is convex proxy for rank; L1 norm for sparsity. Phase transition: recovery succeeds when rank(L) ≤ ρn and ||S||0 ≤ αn².
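
Below is a compact ADMM/inexact-ALM style sketch of Principal Component Pursuit on synthetic data. The penalty-parameter heuristic and iteration count are illustrative rather than tuned; for real use, prefer the solvers listed below.

```python
# Sketch of Principal Component Pursuit: alternate singular-value thresholding
# on L, entrywise soft-thresholding on S, and a dual update.
import numpy as np

def shrink(X, t):                     # entrywise soft-thresholding
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def svt(X, t):                        # singular-value thresholding
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

rng = np.random.default_rng(0)
n, r = 80, 2
L0 = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))               # low-rank part
S0 = (rng.random((n, n)) < 0.05) * rng.normal(scale=10, size=(n, n)) # sparse outliers
A = L0 + S0

lam = 1.0 / np.sqrt(n)
mu = 0.25 * n * n / np.abs(A).sum()   # common heuristic for the penalty parameter
L = np.zeros_like(A); S = np.zeros_like(A); Y = np.zeros_like(A)
for _ in range(300):
    L = svt(A - S + Y / mu, 1.0 / mu)
    S = shrink(A - L + Y / mu, lam / mu)
    Y = Y + mu * (A - L - S)

print(np.linalg.norm(L - L0) / np.linalg.norm(L0))   # relative error on the low-rank part
```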

Streaming (2015+): Nearly-linear time and single-pass algorithms now available. GRASTA tracks non-stationary subspaces online.

ALM / IALM Lin, Chen, Ma 2010
ADMM Boyd et al. — Alternating Direction
GoDec Zhou, Tao ICML 2011 — randomized BRP
GoDec (Rust) Rust implementation Rust
DECOLOR Zhou, Yang, Yu — contiguous outliers
ReProCS Qiu, Vaswani — recursive robust PCA
TGA Hauberg et al. CVPR 2014 — Grassmann averages
SpaRCS Waters et al. NeurIPS 2011
Bayesian RPCA Babacan et al. — sparse Bayesian learning
GRASTA He et al. — Grassmannian streaming
Online RPCA Xiao et al. — change point detection
Streaming Outlier RPCA Diakonikolas et al. 2023 — near-linear
History PCA streaming, parameter robust
robust-pca Candès et al. ADMM implementation249 Python
RobustPCA (MATLAB) Candès, Wright PCP Matlab
lrslibrary A. Sobral — 100+ algorithms for video882 Matlab

Sparse PCA

$\max_v \underbrace{v}_{\text{unknown}}^T \underbrace{A}_{\text{known}}^T A v$ s.t. $\|v\|_2 = 1$, $\|v\|_0 \leq k$, $v$ is sparse Phase Transition

Matrix factorization view: Given data matrix A, find principal components v with at most k non-zero entries. Sparsity constraint ||v||0 ≤ k yields interpretable loadings. NP-hard; convex relaxation via SDP or L1 penalty. Computational-statistical gap: info-theoretic recovery possible below algorithmic threshold.
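
A short sketch of the truncated power method idea (in the spirit of Yuan & Zhang) on a synthetic spiked model appears below; the spectral initialization, sparsity level, and signal strength are illustrative assumptions.

```python
# Sketch of the truncated power method for sparse PCA: power iteration on the
# sample covariance, hard-truncating to the k largest entries at every step.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 400, 50, 5
v_true = np.zeros(d); v_true[:k] = 1.0 / np.sqrt(k)                  # sparse spike
X = rng.normal(size=(n, d)) + 3.0 * rng.normal(size=(n, 1)) * v_true # spiked data
Sigma = X.T @ X / n                                                   # sample covariance

v = np.linalg.eigh(Sigma)[1][:, -1]          # spectral initialization (top eigenvector)
for _ in range(100):
    w = Sigma @ v
    idx = np.argsort(np.abs(w))[:-k]         # indices of the d - k smallest |entries|
    w[idx] = 0.0                             # hard truncation to k non-zeros
    v = w / np.linalg.norm(w)

print(np.abs(v @ v_true))    # close to 1 when the sparse spike is recovered
```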

SPAMS Mairal, Bach, Ponce — sparse modeling monograph
DSPCA d'Aspremont et al. — SDP relaxation
Optimal Sparse PCA d'Aspremont, Bach, El Ghaoui 2008
Truncated Power Yuan, Zhang — sparse eigenvalue
Structured Sparse PCA Jenatton, Obozinski, Bach AISTATS 2010
Sparse Functional PCA Allen, Weylandt — joint sparsity+smoothness
sklearn.decomposition.SparsePCA scikit-learn — L1 penalty64.6k Python
spca Variable Projection — regular, randomized, robust71 R

Dictionary Learning

$\min_{D,X} \|\underbrace{A}_{\text{known}} - \underbrace{D}_{\text{unknown}} \underbrace{X}_{\text{unknown}}\|_F^2 + \lambda\|X\|_1$ s.t. $\|d_j\|_2 = 1$, $X$ is sparse Phase Transition

Matrix factorization view: Given data matrix A, learn unknown dictionary D (columns normalized ||dj||2=1) and unknown sparse codes X (||xi||0 ≤ k). Dictionary can be overcomplete (more atoms than dimensions). Alternates between sparse coding (fix D, solve for X) and dictionary update.
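
A minimal sketch with scikit-learn's MiniBatchDictionaryLearning on synthetic sparse data follows; the atom count, penalty, and batch size are illustrative choices.

```python
# Minimal sketch: learn an overcomplete dictionary D and sparse codes X.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
# synthetic data: sparse combinations of a random ground-truth dictionary
D_true = rng.normal(size=(20, 50))
D_true /= np.linalg.norm(D_true, axis=0)            # unit-norm atoms
codes = rng.normal(size=(1000, 50)) * (rng.random((1000, 50)) < 0.05)
A = codes @ D_true.T                                 # 1000 samples x 20 features

dl = MiniBatchDictionaryLearning(n_components=50, alpha=0.1,
                                 batch_size=64, random_state=0)
X = dl.fit_transform(A)       # sparse codes, 1000 x 50
D = dl.components_            # learned dictionary, 50 atoms x 20 features

print((X != 0).mean())                                # fraction of non-zero codes
print(np.linalg.norm(A - X @ D) / np.linalg.norm(A))  # relative reconstruction error
```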

SPAMS Mairal — online21 Python Matlab
JL-SPCADL G. Madhuri, Atul Negi, Kaluri V. Rangarao — Implementation of JLSPCADL: Johnson-Lindenstrauss lemma based optimal projections for discriminative dictionary learning with Modified Supervised PCA Python
K-SVD Elad, Aharon 2006
Online DL Mairal, Bach, Ponce, Sapiro ICML 2009
BPFA Zhou, Carin et al. — Beta Process
BiG-AMP Parker, Schniter, Cevher
Efficient Sparse Coding Honglak Lee et al. NeurIPS 2006
sklearn.decomposition.DictionaryLearning scikit-learn — online + batch64.6k Python
sparselandtools F. Herzog — K-SVD, OMP, MP (archived)91 Python
ksvd-sparse-dictionary K-SVD + OMP sensing Python
K-SVD Toolbox M. Elad, Aharon original Matlab

MMV & Compressive Sensing

$\underbrace{Y}_{\text{known}} = \underbrace{A}_{\text{known}} \underbrace{X}_{\text{unknown}}$ s.t. $\|X\|_{\text{row},0} \leq k$ (joint row-sparsity) Phase Transition

Matrix factorization view: Given measurements Y and known sensing matrix A, recover unknown signal matrix X with at most k non-zero rows (joint sparsity across L measurement vectors). Donoho-Tanner phase transition: sharp boundary in the undersampling-sparsity plane (δ, ρ) = (m/n, k/m). Multiple measurements improve recovery over the single-vector case (SMV).
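
The sketch below is a generic proximal-gradient (group-ISTA) baseline for joint row-sparse recovery; it is not an implementation of any solver listed below, and the penalty and step size are ad hoc choices.

```python
# Sketch of joint row-sparse recovery for Y = A X via proximal gradient
# (ISTA with a row-wise group soft-threshold).
import numpy as np

rng = np.random.default_rng(0)
m, n, L, k = 40, 100, 8, 6                       # measurements, dim, vectors, row sparsity
A = rng.normal(size=(m, n)) / np.sqrt(m)
X_true = np.zeros((n, L))
rows = rng.choice(n, size=k, replace=False)
X_true[rows] = rng.normal(size=(k, L))
Y = A @ X_true

step = 1.0 / np.linalg.norm(A, 2) ** 2           # 1 / Lipschitz constant of the gradient
lam = 0.01
X = np.zeros((n, L))
for _ in range(500):
    G = X - step * A.T @ (A @ X - Y)             # gradient step on ||Y - AX||_F^2 / 2
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    X = np.maximum(1 - step * lam / np.maximum(norms, 1e-12), 0.0) * G  # row-wise shrinkage

print(np.linalg.norm(X - X_true) / np.linalg.norm(X_true))
```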

SA-MUSIC Lee, Bresler — subspace-augmented
CS-MUSIC Kim et al.
AMP-MMV Ziniel, Schniter — Bayesian AMP
T-MSBL Zhang, Rao — temporal SBL
REMBO Mishali, Eldar
EM-GM-AMP Vila, Schniter — Gaussian mixture
AMP-Compressed-Sensing Krzakala et al. 2012 — Bayesian optimal Python
CoSaMP Needell, Tropp — iterative greedy
IHT / NIHT Blumensath, Davies — hard thresholding
CS-algorithms OMP, SP, IHT, CoSaMP, FISTA, ISTA Matlab Python
sparse_recovery BP, OMP, nonnegative variants Python
CS C & MATLAB OMP, AMP, IHT Matlab C++
ISTA-Net Deep unrolled ISTA (CVPR 2018) Python
Diff-OneBit Youming Chen, Zhaoqiang Liu — Implementation of Diff-OneBit algorithm for signal recovery under 1-bit quantization using diffusion model priors Python
Probabilistic reconstruction phase diagram

Probabilistic reconstruction. From: Krzakala, Mézard, Sausset, Sun, Zdeborová (2012)

SCoSaMP Phase Transitions

SCoSaMP. From: Blanchard, Tanner (2012)

Algorithm Comparison

CGIHT/NIHT. From: Blanchard, Tanner, Wei (2015)

Permutation recovery

Permutation recovery. From: Pananjady, Wainwright, Courtade (2016)

Blind Source Separation / ICA

$\underbrace{Y}_{\text{known}} = \underbrace{A}_{\text{unknown}} \underbrace{X}_{\text{unknown}}$ s.t. rows of $X$ are independent

Matrix factorization view: Given mixed observations Y, recover unknown mixing matrix A and unknown source signals X. Key constraint: rows of X are statistically independent (ICA) or sparse (SCA). Non-Gaussianity of sources enables identification up to permutation and scaling.
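
A minimal sketch with scikit-learn's FastICA on two synthetic sources follows; the recovered sources are checked only up to permutation and scaling, consistent with the identifiability statement above.

```python
# Minimal sketch of blind source separation with FastICA: mix two independent
# non-Gaussian sources and recover them up to permutation and scaling.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),           # square-ish wave
          np.sin(5 * t)]                    # sinusoid (non-Gaussian, independent)
A_mix = np.array([[1.0, 0.5],
                  [0.4, 1.0]])              # unknown mixing matrix
Y = S @ A_mix.T                             # observed mixtures, 2000 x 2

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(Y)                # estimated sources (up to perm/scale)
A_hat = ica.mixing_                         # estimated mixing matrix

# one estimated source should correlate strongly with exactly one true source
print(np.corrcoef(S_hat[:, 0], S[:, 0])[0, 1], np.corrcoef(S_hat[:, 0], S[:, 1])[0, 1])
```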

ICALab ICA Toolbox
FastICA Hyvärinen
BLISS BSS software
DUET Sparse Component
MISEP Mutual Information
sklearn.decomposition.FastICA scikit-learn — Hyvärinen & Oja64.6k Python

Linear & Non-Linear Autoencoders

$\underbrace{X}_{\text{known}} \approx \underbrace{D}_{\text{unknown}}(\underbrace{E}_{\text{unknown}}(X))$, linear case $\equiv$ PCA

Matrix factorization view: Given data X, learn encoder E: Rd→Rk and decoder D: Rk→Rd minimizing ||X - D(E(X))||2. Linear case: when E, D are linear maps (matrices), optimal solution recovers PCA—the encoder projects onto top-k principal components. Non-linear: E = f(VX), D = g(WZ) with nonlinearities f, g learn curved manifolds.

Key insight (Baldi & Hornik 1989): Linear autoencoders with bottleneck dimension k learn the subspace spanned by top-k principal components. Deep linear networks still recover PCA but with different learning dynamics.
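
The sketch below checks this numerically with plain NumPy gradient descent on an unconstrained linear encoder/decoder pair, then compares the learned decoder's column space to the top-k principal components; the learning-rate heuristic and step count are illustrative.

```python
# Numerical check of Baldi & Hornik: a trained linear autoencoder spans the
# same subspace as the top-k principal components.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
X -= X.mean(axis=0)

W_enc = 0.01 * rng.normal(size=(k, d))          # encoder  R^d -> R^k
W_dec = 0.01 * rng.normal(size=(d, k))          # decoder  R^k -> R^d
lr = 0.5 / np.linalg.norm(X, 2) ** 2            # step size scaled by the data spectral norm
for _ in range(5000):
    R = X - X @ W_enc.T @ W_dec.T               # reconstruction residual
    G_dec = -R.T @ (X @ W_enc.T)                # gradient of ||R||_F^2 / 2 w.r.t. W_dec
    G_enc = -(W_dec.T @ R.T @ X)                # gradient w.r.t. W_enc
    W_dec -= lr * G_dec
    W_enc -= lr * G_enc

# compare the decoder's column space with the top-k principal components
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P_pca = Vt[:k].T                                 # d x k, top-k right singular vectors
Q, _ = np.linalg.qr(W_dec)
print(np.linalg.svd(P_pca.T @ Q, compute_uv=False))  # cosines of principal angles, near 1
```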

Deep Autoencoders Hinton & Salakhutdinov 2006
VAE Kingma & Welling — variational
Denoising AE Vincent et al. — robust features
Sparse AE sparsity penalty on hidden units
β-VAE Higgins et al. — disentangled
VQ-VAE van den Oord et al. — discrete codes
sklearn.decomposition.PCA scikit-learn — linear AE equivalent64.6k Python
PyTorch VAE official example Python

LoRA & Low-Rank Adaptation (LLMs)

$W' = \underbrace{W}_{\text{frozen}} + \underbrace{B}_{\text{unknown}} \underbrace{A}_{\text{unknown}}$ s.t. $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d,k)$

Matrix factorization view: Given frozen pretrained weight W, learn unknown low-rank update ΔW = BA where B∈Rd×r and A∈Rr×k are trainable. Rank r controls parameter count: O(r(d+k)) vs O(dk). Typical r = 4–64 for LLMs with d,k ~ 104.

Key insight: Pretrained weight updates during fine-tuning have low intrinsic rank. LoRA exploits this structure to achieve full fine-tuning quality with <1% trainable parameters.
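
A minimal PyTorch sketch of a LoRA-augmented linear layer is below; the zero-initialized B, Gaussian A, and alpha/r scaling follow the original paper's conventions, while the layer sizes and rank are illustrative.

```python
# Minimal sketch of a LoRA linear layer: W is frozen, only the low-rank update
# B @ A is trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                      # freeze pretrained W (and bias)
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # r x k
        self.B = nn.Parameter(torch.zeros(d_out, r))          # d x r, zero init => W' = W at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), r=8)
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable, 512 * 512)    # O(r(d + k)) trainable vs d*k frozen parameters
```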

LoRA Hu et al. 2021 — original paper
QLoRA 4-bit quantized + LoRA
DoRA Weight-Decomposed Low-Rank ICML 2024
AdaLoRA adaptive rank allocation
PEFT HuggingFace library20.4k Python
GaLore gradient low-rank projection
LoftQ quantization-aware LoRA init
microsoft/LoRA Microsoft Research — pip install loralib13.1k Python
LQ-LoRA Han Guo, Philip Greengard, Eric P. Xing et al. — Paper: LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning $\underbrace{W}_{\text{known}} \approx \underbrace{Q}_{\text{unk., fixed quantized}} + \underbrace{L_1}_{\text{unk.}} \underbrace{L_2}_{\text{unk.}}$ Python
DyLoRA Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev et al. — Paper: DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation $\underbrace{W_0}_{\text{frozen}} + \underbrace{B_{1:r(t)}}_{\text{unk.}} \underbrace{A_{1:r(t)}}_{\text{unk.}} \text{ where } r(t) \sim \text{uniform}(1, r_{\max})$ Python
SoRA Ning Ding, Xingtai Lv, Qiaosen Wang et al. — Paper: Sparse Low-rank Adaptation of Pre-trained Language Models $\underbrace{W}_\text{known} + \underbrace{G}_\text{unk.} \odot (\underbrace{B}_\text{unk.} \underbrace{A}_\text{unk.}) \text{ where } G \in \{0,1\}^r \text{ is gate vector}$
VeRA Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano — Paper: VeRA: Vector-based Random Matrix Adaptation $\underbrace{W_0}_\text{frozen} + \underbrace{\Lambda_d}_\text{unk.} \underbrace{B}_\text{frozen} \underbrace{A}_\text{frozen} \underbrace{\Lambda_b}_\text{unk.}$
MoSLoRA Taiqiang Wu, Jiahao Wang, Zhe Zhao et al. — Paper: Mixture-of-Subspaces in Low-Rank Adaptation $\underbrace{W_0}_\text{frozen} + \sum_{i=1}^{r'} \sum_{j=1}^{r'} \underbrace{W_{ij}}_\text{mixer} \underbrace{A_i}_\text{unk.} \underbrace{B_j}_\text{unk.}$
Intrinsic Dimension Fine-tuning Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta — Paper: Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning $\theta = \underbrace{\theta_0}_{\text{known}} + \underbrace{P}_{\text{known}} \underbrace{\theta_d}_{\text{unknown}}$
COMPACTER Rabeeh Karimi Mahabadi, James Henderson, Sebastian Ruder — Paper: Compacter: Efficient Low-Rank Hypercomplex Adapter Layers $\underbrace{W_{task}}_{\text{unk.}} = \sum_{i} \underbrace{S_i}_{\text{slow weights}} \otimes \underbrace{F_i}_{\text{fast rank-1}}$ Python

Tensor Decomposition & Tensor Networks

CP: $\underbrace{\mathcal{T}}_{\text{known}} = \sum_r \underbrace{a_r \otimes b_r \otimes c_r}_{\text{unknown}}$; Tucker: $\mathcal{T} = \underbrace{G}_{\text{unk.}} \times_1 \underbrace{U}_{\text{unk.}} \times_2 \underbrace{V}_{\text{unk.}} \times_3 \underbrace{W}_{\text{unk.}}$

Tensor factorization view: Given observed tensor T, decompose into unknown factors. CP: sum of R rank-1 tensors (outer products). Tucker: core tensor G and unknown factor matrices U,V,W. Tensor Train: chain of 3-way cores Gk, storage O(dnr²) vs O(nᵈ) for the full tensor. All factors unknown.

Tensor Networks (2011+): Tensor Train (TT) achieves O(dnr²) storage. Tensor Ring (TR, 2016) adds circular structure for permutation invariance. Tensor Wheel (2024) balances TR and fully-connected networks.
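
A small CP example with TensorLy (assuming a recent version that exposes parafac and cp_to_tensor) is sketched below: build a rank-3 tensor from known factors, refit it, and check the reconstruction error.

```python
# Minimal sketch of CP decomposition with TensorLy.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
factors_true = [rng.normal(size=(dim, 3)) for dim in (10, 12, 14)]
T = tl.cp_to_tensor((np.ones(3), factors_true))   # sum of 3 rank-1 terms, shape (10, 12, 14)

cp = parafac(tl.tensor(T), rank=3, n_iter_max=500, tol=1e-10)
T_hat = tl.cp_to_tensor(cp)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))   # small relative error
```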

CP/PARAFAC rank-1 sum decomposition
Tucker core tensor + factor matrices
Tensor Train (TT) Oseledets 2011 — linear chain
Tensor Ring (TR) Zhao et al. 2016 — circular
Tensor Wheel (TW) SIAM 2024 — randomized
Hierarchical Tucker tree-structured
TensorLy J. Kossaifi et al. — NumPy/PyTorch/JAX1.7k Python
TensorLy-Torch deep tensor networks Python
Tensorlab Vervliet et al. (KU Leuven) Matlab
NTFLAB Nonnegative Tensor
Fast NTF Kim, Park
t3f Novikov, Oseledets et al. — TensorFlow TT228 Python

Recommender Systems & Neural MF

$\min_{U,V} \sum_{(i,j) \in \Omega} (\underbrace{R_{ij}}_{\text{known}} - \underbrace{u_i}_{\text{unk.}}^T \underbrace{v_j}_{\text{unk.}})^2 + \lambda(\|U\|^2 + \|V\|^2)$

Matrix factorization view: Given sparse rating matrix R (observed at Ω), learn unknown user embeddings U∈Rm×k and item embeddings V∈Rn×k. Predict Rij ≈ uiTvj. Regularization prevents overfitting. Implicit feedback: weighted by confidence cij.

NCF Debate (2020): Properly tuned dot product often beats learned MLP similarities. Simple baselines remain competitive.
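
The sketch below is a bare-bones explicit-feedback ALS in NumPy, alternating ridge-regularized least-squares solves for the user and item factors over the observed entries; dimensions and λ are illustrative.

```python
# Sketch of Alternating Least Squares on an explicit-feedback rating matrix.
import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam = 60, 80, 4, 0.1
U_true, V_true = rng.normal(size=(m, k)), rng.normal(size=(n, k))
R = U_true @ V_true.T
Omega = rng.random((m, n)) < 0.3                      # observed ratings

U, V = rng.normal(size=(m, k)), rng.normal(size=(n, k))
for _ in range(30):
    for i in range(m):                                # update user factors
        j = np.where(Omega[i])[0]
        Vj = V[j]
        U[i] = np.linalg.solve(Vj.T @ Vj + lam * np.eye(k), Vj.T @ R[i, j])
    for j in range(n):                                # update item factors
        i = np.where(Omega[:, j])[0]
        Ui = U[i]
        V[j] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(k), Ui.T @ R[i, j])

err = np.linalg.norm((U @ V.T - R)[~Omega]) / np.linalg.norm(R[~Omega])
print(err)    # generalization error on the unobserved entries
```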

ALS Alternating Least Squares — classic
WRMF Hu, Koren, Volinsky 2008 — implicit
BPR Rendle et al. 2009 — pairwise ranking
NCF / NeuMF He et al. 2017 — MLP + GMF
LightFM Lyst — hybrid factorization5.1k Python
Implicit B. Frederickson — GPU ALS/BPR3.8k Python
Surprise N. Hug — scikit-surprise6.8k Python
NeuMF++ 2024 — + stacked autoencoders
IGMC GNN-based inductive
RecBole RUC AI Box — 94 algorithms4.2k Python

Graph Matching

$\underbrace{A}_{\text{known}} = \underbrace{X}_{\text{unknown}} \underbrace{B}_{\text{unknown}} X^T$ s.t. $X$ is a permutation matrix Phase Transition

Matrix factorization view: Given adjacency matrix A of a graph, find unknown permutation matrix X and unknown structure B such that A = XBXT. Key constraint: X is a permutation matrix (binary, doubly stochastic). Used for matching nodes across two graphs or recovering hidden structure.

FAQ Vogelstein et al. — Fast Approximate QAP
GRAMPA Fan et al. — spectral graph matching
SGM Seeded Graph Matching
Erdös-Rényi Phase Transition sharp threshold for exact recovery

Generalized Matrix Factorization

$\underbrace{W}_{\text{known}} \odot \underbrace{L}_{\text{unknown}} = W \odot \underbrace{U}_{\text{unk.}} \underbrace{V}_{\text{unk.}}^T$, minimize $\text{rank}(L)$

Matrix factorization view: Given mask W indicating observed entries, find low-rank matrix L = UVT matching observations. Generalizes matrix completion (binary W) to weighted observations. The Hadamard product W⊙L selects which entries matter.

Weighted Low-Rank Srebro, Jaakkola
Weighted MC heterogeneous noise

Archetypal Analysis

$\underbrace{A}_{\text{known}} = \underbrace{D}_{\text{unknown}} \underbrace{X}_{\text{unknown}}$, $D = A\underbrace{B}_{\text{unknown}}$ s.t. $X \geq 0$, $B \geq 0$

Matrix factorization view: Given data A, find archetypes D that lie on the convex hull of the data (D = AB with B ≥ 0, columns sum to 1), and coefficients X representing each data point as convex combination of archetypes. Unlike NMF, archetypes are extreme points of the data itself.

SPAMS Archetypal Mairal et al.
Original AA Cutler, Breiman 1994
Robust AA Chen, Mairal et al. 2014 — outlier handling
py_pcha U. Aslak — Principal Convex Hull Analysis41 Python

Binary Matrix Factorization

$\underbrace{D}_{\text{known}} = \underbrace{T}_{\text{unknown}} \underbrace{A}_{\text{unknown}}$ s.t. $T \in \{0,1\}^{m \times r}$

Matrix factorization view: Given data matrix D∈Rm×n, find unknown binary factor T∈{0,1}m×r and coefficient matrix A∈Rr×n such that D = TA. Optional constraint: AT·1r = 1n (columns sum to 1 for convex combinations). Unlike Boolean Matrix Factorization, A is real-valued.

Key insight (Slawski et al. 2013): Despite the combinatorial constraint T∈{0,1}m×r, the problem is tractable when r is small. The algorithm finds all binary vertices of aff(D) using the Littlewood-Offord lemma, achieving O(m·2^{r-1}) complexity.

Applications: DNA methylation unmixing (T = methylation profiles per cell type, A = mixing proportions), binary classification ensembles, topic modeling with hard assignments.

Binary Components M. Slawski, M. Hein, P. Lutsik — NeurIPS 2013
binary_matrix_factorization Python implementation Python
MeDeCom DNA methylation deconvolution (Lutsik et al. 2017)
Boolean MF Miettinen et al. 2008 — both factors binary
PROXIMUS Koyutürk, Grama — recursive binary partitioning
Kronecker Product Factorization Yannis Voet, Leonardo De Novellis — Implementation of algorithms for identifying all possible Kronecker product factorizations of binary matrices, including decomposition graph visualization and applications to sparse matrix analysis. Author Python

Matrix Compressive Sensing (MCS)

$\underbrace{\mathcal{A}}_{\text{known}}(\underbrace{L}_{\text{unknown}}) = \underbrace{b}_{\text{known}}$ s.t. $L$ is rank-$r$ Phase Transition

Matrix factorization view: Given linear measurements b = A(L) of unknown low-rank matrix L, recover L. Extends compressive sensing from vectors to matrices. Phase transition: O(r(m+n)) measurements suffice for rank-r recovery. RIP for matrices analogous to vector case.

Nuclear Norm Min Recht, Fazel, Parrilo 2007
Matrix ALPS Kyrillidis, Cevher 2012
SVP Jain, Meka, Dhillon 2009
ADMiRA Lee, Bresler 2009

SPCP / Noisy Robust PCA

$\underbrace{A}_{\text{known}} = \underbrace{L}_{\text{unknown}} + \underbrace{S}_{\text{unknown}} + \underbrace{N}_{\text{unknown}}$ s.t. $L$ is low-rank, $S$ is sparse, $N$ is noise Phase Transition

Matrix factorization view: Given noisy observations A, decompose into unknown low-rank L, unknown sparse S, and unknown dense noise N. Extends Robust PCA to handle small dense perturbations. Stable Principal Component Pursuit (SPCP) solves: min ||L||* + λ||S||1 s.t. ||A - L - S||F ≤ ε.

SPCP Zhou et al. — stable PCP
Noisy RPCA Agarwal, Negahban, Wainwright 2011
GoDec+ Zhou, Tao 2013

Kernel Factorizations

$\underbrace{K}_{\text{known}} = \underbrace{\Phi}_{\text{unknown}} \Phi^T$ s.t. $K$ is PSD

Matrix factorization view: Given kernel matrix K (positive semi-definite), find feature representation Φ such that Kij = φ(xi)Tφ(xj). Kernel PCA, Nyström approximation, and random features all exploit this factorization. Low-rank kernel approximations enable scalable kernel methods.
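
A small NumPy sketch of the Nyström idea follows: sample landmark columns C and the corresponding block W of an RBF kernel matrix, and form Φ = C W^{-1/2} so that K ≈ ΦΦᵀ; the landmark count and kernel width are illustrative.

```python
# Sketch of the Nystrom approximation K ~ C W^+ C^T = Phi Phi^T.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
gamma = 0.5
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                              # RBF kernel matrix, 500 x 500

c = 50
idx = rng.choice(len(X), size=c, replace=False)      # landmark points
C = K[:, idx]                                        # n x c
W = K[np.ix_(idx, idx)]                              # c x c
eigval, eigvec = np.linalg.eigh(W)
inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12))) @ eigvec.T
Phi = C @ inv_sqrt                                   # n x c approximate feature map
K_nys = Phi @ Phi.T                                  # = C W^+ C^T

print(np.linalg.norm(K - K_nys) / np.linalg.norm(K))
```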

Nyström Williams, Seeger 2001
Random Features Rahimi, Recht NIPS 2007
Kernel PCA Schölkopf, Smola, Müller 1998
FastFood Le, Sarlos, Smola 2013
Recursive Nyström Musco, Musco 2017
CERF Jianqiao Wangni, Jingwei Zhuo, Jun Zhu — Paper: Learning Random Fourier Features by Hybrid Constrained Optimization $\phi_{\text{learner}}(x) = \underbrace{\Omega}_{\text{mask/block struct.}} \underbrace{\phi_{\text{teacher}}(x)}_{\text{known RFF}} \text{ where } \underbrace{\Omega}_{\text{unknown}} \text{ is an orthogonal transformation}$

Non-Convex Optimization & Implicit Regularization

$\min_{U,V} \|\underbrace{U}_{\text{unknown}} \underbrace{V}_{\text{unknown}}^T - \underbrace{M}_{\text{known}}\|_F^2$ s.t. $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ Phase Transition

Matrix factorization view: Given observed matrix M, find unknown factors U, V such that UVT approximates M. The factorization implicitly constrains rank(UVT) ≤ r. Despite non-convexity, all local minima are global when r is large enough.

Key result (Gunasekar et al. 2017): GD on matrix factorization with small init implicitly minimizes nuclear norm. Deep factorization amplifies this bias.
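
A tiny NumPy sketch of factorized gradient descent on ‖UVᵀ - M‖_F² from small random initialization is below; the step-size heuristic is illustrative, and with r ≥ rank(M) the loss is reliably driven to (near) zero in practice.

```python
# Sketch of non-convex factorized gradient descent: fit M ~ U V^T.
import numpy as np

rng = np.random.default_rng(0)
n, r_true, r = 60, 3, 5
M = rng.normal(size=(n, r_true)) @ rng.normal(size=(r_true, n))

U = 0.01 * rng.normal(size=(n, r))            # small random initialization
V = 0.01 * rng.normal(size=(n, r))
eta = 0.3 / np.linalg.norm(M, 2)              # step size scaled by the data norm
for _ in range(2000):
    E = U @ V.T - M                           # residual
    U, V = U - eta * E @ V, V - eta * E.T @ U # simultaneous gradient updates

print(np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))
```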

Scaled GD (ScaledGD) O(log(d/ε)) convergence from random init
Wirtinger Flow Candès et al. 2015 — phase retrieval
Reshaped WF O(n) sample complexity
Burer-Monteiro SDP via X=RRT, no spurious minima when r≳√d
Riemannian GD optimization on manifolds
Alternating Minimization provable guarantees post-2015
Deep MF Implicit Bias rank regularization grows with depth

Randomized Algorithms & Sketching (RandNLA)

$\tilde{A} = \Pi A$ where $\Pi \in \mathbb{R}^{k \times m}$, $k \ll m$

Matrix factorization view: Randomized numerical linear algebra uses random projections and sketching to compute approximate matrix factorizations efficiently. Key techniques include randomized SVD, Johnson-Lindenstrauss transforms, CountSketch, and streaming algorithms like Frequent Directions. These methods trade exact computation for massive speedups while providing provable approximation guarantees.
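
A short NumPy sketch of the Halko, Martinsson, Tropp randomized SVD recipe (random test matrix, optional power iterations, QR, small SVD) is below; the oversampling and power-iteration counts are illustrative.

```python
# Sketch of the randomized range finder + SVD in the reduced space.
import numpy as np

def randomized_svd(A, k, oversample=10, n_power=2, seed=None):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + oversample))      # random test matrix
    Y = A @ Omega
    for _ in range(n_power):                          # power iterations sharpen the spectrum
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                            # orthonormal basis for approx range(A)
    B = Q.T @ A                                       # small (k + p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 300))   # effectively rank-50 matrix
U, s, Vt = randomized_svd(A, k=20, seed=0)
s_exact = np.linalg.svd(A, compute_uv=False)[:20]
print(np.max(np.abs(s - s_exact) / s_exact))          # top singular values match closely
```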

Randomized SVD Halko, Martinsson, Tropp 2011
Frequent Directions Liberty 2013 — streaming PCA
Sparse JL Kane, Nelson 2014
CountSketch Clarkson, Woodruff 2017
Sketching Survey Woodruff 2014
τ-GRA Pettie, Wang 2022 — HyperLogLog/PCSA estimators Python
Improved Compressed Counting Ping Li — Implementation of the optimal power estimator for Compressed Counting algorithm to estimate frequency moments and Shannon entropy of data streams Python

Other & Online SVD

Boolean factorization, alignment (RASL/TILT), streaming decompositions, and other specialized methods.

RASL Batch Alignment
TILT Low-rank Textures
BMF Miettinen et al. TKDE 2008
Archetypal SPAMS v2.5
PROPACK Lanczos SVD
BLWS Block Lanczos
Oja's Algorithm J. Math. Biol. 1982

Modern Themes (2015–2024)

Contemporary research directions that cut across classical categories, reflecting the field's evolution toward neural network integration, implicit regularization, and computational-statistical trade-offs.

Implicit Regularization GD on overparameterized models learns low-rank/sparse solutions
Neural Collapse deep classifiers converge to symmetric low-rank geometry
Double Descent test error decreases again past interpolation threshold
Random Feature Models kernel approximation via random projections
Overparameterization more parameters than data points aids optimization
Edge of Stability GD operates at maximum stable learning rate
Grokking delayed generalization after memorization phase
Lottery Tickets sparse subnetworks match dense performance

Computational-Statistical Gaps

Phase Transition

The "hard phase" regime where information-theoretically recovery is possible but no polynomial-time algorithm is known. Understanding these gaps is central to modern high-dimensional statistics.

Key example: In sparse PCA, the spiked covariance model exhibits a gap: spectral methods require signal strength Θ(n^{1/4}) while the information-theoretic limit is Θ(1/√n).

Berthet-Rigollet 2013 planted clique reduction for sparse PCA
Statistical Query Lower Bounds Feldman et al. 2012
Low-Degree Polynomials Hopkins et al. FOCS 2017
Sum-of-Squares Hierarchy Meka, Potechin, Wigderson 2015
Tensor PCA Richard, Montanari 2014
Community Detection Abbe survey 2017

Approximate Message Passing (AMP)

Phase Transition

Iterative algorithms with exact asymptotic characterization via state evolution. Predict sharp phase transitions for low-rank recovery, compressed sensing, and community detection.

State Evolution: In the high-dimensional limit, the AMP error follows a deterministic recursion that correctly predicts the success/failure boundary.
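
The sketch below is a hedged NumPy implementation of soft-thresholding AMP for y = Ax (in the Donoho, Maleki, Montanari style), including the Onsager correction that distinguishes AMP from plain iterative thresholding; the threshold policy is a simple heuristic, not a tuned state-evolution schedule.

```python
# Sketch of AMP with soft thresholding for compressed sensing y = A x.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(0)
n, m, k = 1000, 500, 50
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n); x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
y = A @ x_true

x, z = np.zeros(n), y.copy()
for _ in range(30):
    tau = np.linalg.norm(z) / np.sqrt(m)            # estimate of the effective noise level
    r = x + A.T @ z                                  # pseudo-data
    x_new = soft(r, tau)
    # Onsager term: (1/delta) * z * <eta'(r)>, with delta = m/n and
    # <eta'(r)> = (# nonzeros of x_new) / n, so the coefficient is nnz / m.
    onsager = (z / m) * np.count_nonzero(x_new)
    z = y - A @ x_new + onsager
    x = x_new

print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```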

GAMP Generalized AMP — Rangan
BiG-AMP Bilinear GAMP for matrix factorization
Low-RAMP Lesieur, Krzakala, Zdeborová 2017
BiG-VAMP Vector AMP for matrix problems
VAMP Vector AMP — Rangan, Fletcher
Spectral + AMP Spectral init + AMP refinement

Randomized Algorithms & Sketching (RandNLA)

Use random projections to reduce massive matrices into tractable smaller problems while preserving essential structure. Sketching enables single-pass streaming algorithms.

Frequent Directions Liberty et al. 2015 — deterministic, optimal
CountSketch sparse projection, GPU-friendly
SRHT Subsampled Randomized Hadamard
Randomized SVD Halko, Martinsson, Tropp 2011
Randomized LU Shabat, Shmueli, Averbuch 2013
gensim R. Řehůřek (RARE Technologies)16.3k Python
BLENDENPIK Avron, Maymounkov, Toledo
Redsvd Okanohara — Randomized SVD C++
Random Kitchen Sinks FastFood — Le, Sarlos, Smola
Multisketching Higgins 2025 — CountSketch + Gaussian, 77% faster

Frameworks & Platforms

Comprehensive software packages integrating multiple factorization methods. GPU acceleration now mainstream via RAPIDS and PyTorch backends.

scikit-learn NMF, TruncatedSVD, PCA64.6k Python
cuML (RAPIDS) GPU-accelerated, 10-50× speedup5.1k Python
TensorLy tensor decomposition, multi-backend1.7k Python
PyTorch / JAX differentiable linalg, autograd Python
SPAMS Sparse Modeling21 Python Matlab
TFOCS S. Becker, Candès142 Matlab
Implicit GPU recommenders (ALS, BPR)3.8k Python
PEFT HuggingFace LoRA/adapters20.4k Python
Manopt manifold optimization Matlab Python
cvxpy convex optimization6.1k Python
Dask-ML distributed, out-of-core Python

Preconditioners & Fast Multipole Methods

Direct solvers/FMM

Fast multipole methods (FMM) can be viewed as hierarchical matrix factorizations. Exploiting low-rank structure of off-diagonal blocks enables O(N) or O(N log N) complexity for dense matrix-vector products and direct solvers. H-matrices, H²-matrices, and butterfly factorizations provide fast approximate inverses and preconditioners.

Matrix Factorization View: The multipole expansion A ≈ B̃C decomposes dense kernel matrices into structured low-rank products. Quadtree/octree decompositions yield hierarchical factorizations with sparse + low-rank blocks.
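
A toy NumPy illustration of why this works is below: the off-diagonal block of a 1/r kernel matrix between two well-separated point clusters is numerically low-rank, so a truncated SVD compresses it to a handful of terms. The geometry and tolerance are illustrative.

```python
# Numerical rank of an off-diagonal kernel block between well-separated clusters.
import numpy as np

rng = np.random.default_rng(0)
src = rng.random((400, 2))                              # source points in [0, 1]^2
trg = rng.random((400, 2)) + np.array([3.0, 0.0])       # well-separated target cluster

d = np.linalg.norm(trg[:, None, :] - src[None, :, :], axis=-1)
K = 1.0 / d                                             # off-diagonal block of a 1/r kernel

s = np.linalg.svd(K, compute_uv=False)
rank_tol = int(np.sum(s / s[0] > 1e-8))                 # numerical rank at relative tol 1e-8
print(rank_tol, K.shape)                                # a handful of terms instead of 400
```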

Inverse FMM Coulier, Pouransari, Darve — fast approximate direct solver as preconditioner
FMM Tutorial Martinsson — Fast algorithms in scientific computing
Fast Direct Solvers Martinsson, O'Neil — survey on hierarchical factorizations
H²-matrix decomposition

H²-matrix decomposition: A = S + UKVᵀ (sparse + low-rank). The quadtree structure generalizes to higher dimensions.

Multipole expansion as matrix factorization

Multipole expansion as matrix factorization: A ≈ B̃C. Figure from: Martinsson — FMM Tutorial

Essential References

Matrix Computations
Convex Optimization
Topics in Matrix Analysis
NMF & Tensor Factorizations
Matrix Manifolds
Convex Opt. in SP
Optimization for ML
LAPACK Guide

Phase Transition Atlas (2020s Framing)

Modern papers separate information-theoretic limits from algorithmic limits, and often explain gaps with spectral thresholds or state-evolution analysis.

Information-Theoretic Limit feasibility boundary, often via counting or mutual information
Algorithmic Threshold practical recoverability for a given solver family
Spectral / BBP Transition Baik-Ben Arous-Péché (2005): eigenvalue separates from bulk
State Evolution (AMP) predicts error curves and sharp transitions
Computational-Statistical Gap "hard phase" where info-theoretic recovery possible but poly-time fails
Planted Clique Hardness reduction-based lower bounds (Berthet-Rigollet 2013)
Rank / Sparsity vs Sampling classic completion and sparse recovery axes
Noise vs Signal Strength robustness regimes and breakdown points