The Advanced Matrix Factorization Jungle

A living document on state-of-the-art algorithms, implementations, and phase transitions

Welcome to the Jungle

This Matrix Factorization Jungle page ventures beyond classical decompositions into factorizations that impose structural assumptions on the unknowns: sparsity, low rank, non-negativity, subspace membership, and more. As sparse recovery solvers matured, a rich ecosystem emerged. Sparse Coding & Dictionary Learning (Olshausen & Field) and NMF (Lee & Seung) were the first such decompositions to be widely used with constraints on the factor matrices. Yet, while some practitioners adopted these factorizations, there was a sense that they "weren't rigorous enough," since little theoretical understanding accompanied them. The discovery of compressed sensing (Candès, Romberg, Tao, Donoho) and its extension to matrices, namely the realization that the nuclear norm can serve as a convex proxy for rank, opened the floodgates to powerful new decomposition techniques.

What distinguishes these methods is that they come with phase transitions: sharp boundaries in parameter space where algorithms switch from succeeding to failing. These phase transitions are generally a sign that the algorithm, or the factorization itself, has hit a hard limit, one that may stem from the computational complexity of the problem being solved. This document maps that landscape, from video surveillance to recommendation systems, blending randomized linear algebra, nonconvex optimization, and probabilistic modeling.

Traditional Matrix Decompositions: There is a stellar and comprehensive treatment of classical matrix factorization techniques (LU, Cholesky, QR, SVD, Polar, Eigendecomposition, Jordan, Schur, CUR/Skeleton, Hessenberg, Tridiagonal, and Bidiagonal decompositions) in Jun Lu's monographs "Matrix Decomposition and Applications" and "Numerical Matrix Decomposition and its Modern Applications". Most of these traditional techniques place no constraints on the matrix or data being factorized. In scientific computing, however, operators are sometimes approximated with decompositions and factorizations to permit the design of better inverse problem solvers (see the recent survey "Fast Direct Solvers" by Per-Gunnar Martinsson and Michael O'Neil).

The New Frontier: Additional background on LoRA methods used with LLMs comes from this survey by Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang.

Notation: A = observed  |  L = low-rank  |  S = sparse  |  N = noise  |  D = dictionary  |  X = coefficients
Matrix Factorization Jungle Evolution of Phase Transitions

"The Sparsest of Them All" — Evolution of phase transitions from Gauss to Krzakala, Mézard, Sausset, Sun, Zdeborová

References

This page was first announced on Nuit Blanche in August 2011.

Quick Reference for Advanced Matrix Factorization: Structural Assumptions & Unknowns

FACTORIZATIONS (A = DX, UVᵀ)

Problem | Form | Loss D(A; DX) | Known | Unknown | Key Constraints | Phase Trans.
SVD | A = DX | ‖A - DX‖_F² | A | D, X | DᵀD = I, XXᵀ = Λ (diagonal) |
pLSI | A = DX | KL(A; DX) | A | D, X | 1ᵀD1 = 1, d_ij ≥ 0, 1ᵀX = 1, x_ij ≥ 0 |
NMF | A = DX | ‖A - DX‖_F² or KL(A; DX) | A | D, X | D ≥ 0, X ≥ 0 |
Dictionary Learning | A = DX | ‖A - DX‖_F² | A | D, X | ‖d_j‖_2 = 1, X sparse | Yes
Sparse PCA | A = DX | ‖A - DX‖_F² | A | D, X | ‖d‖_2 = 1, D sparse | Yes
K-Means | A = DX | ‖A - DX‖_F² | A | D, X | XXᵀ = I, X_ij ∈ {0,1} | Yes
K-Medians | A = DX | ‖A - DX‖_1 | A | D, X | XXᵀ = I, X_ij ∈ {0,1} | Yes
Subspace Clustering | A = AX | ‖A - AX‖_F² | A | X | diag(X) = 0, X sparse and/or low-rank |
MMV / Comp. Sensing | Y = AX | ‖Y - AX‖_F² | Y, A | X | X jointly row-sparse | Yes
BSS / ICA | Y = AX | ‖Y - AX‖_F² | Y | A, X | rows of X independent |
Recommender Systems | R = UVᵀ | ‖R - UVᵀ‖_F² | R_Ω | U, V | U, V low-rank |
LoRA | W' = W + BA | L(W + BA) | W | B, A | B ∈ R^{d×r}, A ∈ R^{r×k}, r ≪ d, k |
Autoencoders | X = D(E(X)) | ‖X - D(E(X))‖_F² | X | E, D | bottleneck dimension k; linear case ≡ PCA |
Non-Convex Opt. | M = UVᵀ | ‖M - UVᵀ‖_F² | M | U, V | rank(UVᵀ) ≤ r | Yes
Tensor Decomp. | T = Σ_r a_r⊗b_r⊗c_r | ‖T - Σ_r a_r⊗b_r⊗c_r‖_F² | T | factors | CP/Tucker/TT structure |
Graph Matching | A = XBXᵀ | ‖A - XBXᵀ‖_F² | A | X, B | X permutation matrix | Yes
Generalized MF | W⊙L = W⊙UVᵀ | ‖W⊙(L - UVᵀ)‖_F² | W (mask) | U, V, L | L of minimal rank |
Archetypal Analysis | A = DX, D = AB | ‖A - DX‖_F² | A | D, X, B | X, B ≥ 0, archetypes in convex hull |
Matrix Comp. Sensing | A(L) = b | ‖A(L) - b‖_2² | A, b | L | L rank-r | Yes
Kernel Factorizations | K = ΦΦᵀ | ‖K - ΦΦᵀ‖_F² | K (kernel) | Φ (features) | K PSD |

DECOMPOSITIONS (A = L + S)

Problem | Form | Loss D(A; L, S) | Known | Unknown | Key Constraints | Phase Trans.
Robust PCA | A = L + S | ‖L‖_* + λ‖S‖_1 | A | L, S | L low-rank, S sparse | Yes
Matrix Completion | P_Ω(M) = P_Ω(L) | ‖L‖_* | M_Ω | L | L low-rank, incoherent | Yes
SPCP / Noisy RPCA | A = L + S + N | ‖L‖_* + λ‖S‖_1 + ‖N‖_F² | A | L, S, N | L low-rank, S sparse, N dense noise | Yes

Jun Lu's Taxonomy of Classical Matrix Decompositions

Matrix Decomposition World Map

Figure 1: Matrix Decomposition World Map — relationships between classical decompositions

Matrix Decomposition World Map Under Conditions

Figure 2: Matrix Decomposition World Map Under Conditions — taxonomy by matrix structure

From Jun Lu, "Matrix Decomposition and Applications" (arXiv:2201.00145)

Decomposition (L + S)
A = L + S + N

Applications: Video surveillance, face recognition, Netflix recommendations

Factorization (UVᵀ)
A = D X

Applications: Image denoising, compression, feature learning

Spectral Clustering & K-Means

$\underbrace{A}_{\text{known}} = \underbrace{D}_{\text{unknown}} \underbrace{X}_{\text{unknown}}$ s.t. $XX^T = I$, $X_{ij} \in \{0,1\}$ Phase Transition

Matrix factorization view: Given data A, find unknown centroids D and assignment matrix X. The constraint XXT = I enforces that each data point belongs to exactly one cluster, while Xij ∈ {0,1} makes assignments binary. K-Median uses L1 loss instead of L2. Spectral methods relax the binary constraint and partition via graph Laplacian eigenvectors.
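
As a quick illustration of this A ≈ DX reading, here is a minimal sketch using scikit-learn's KMeans: D is assembled from the centroids, X from one-hot assignments, and ‖A - DX‖_F² coincides with the reported inertia. Sizes and cluster count are arbitrary choices for the example.

```python
# Minimal sketch of the "A ~ D X" view of K-means (columns of A are data points).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
m, n, k = 5, 300, 3                      # feature dim, number of points, clusters
A = np.hstack([rng.normal(loc=c, size=(m, n // k)) for c in (-4, 0, 4)])

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A.T)  # sklearn wants samples as rows
D = km.cluster_centers_.T                # m x k  (unknown centroids)
X = np.zeros((k, n))
X[km.labels_, np.arange(n)] = 1.0        # k x n  binary assignments, one per column

# The squared loss ||A - D X||_F^2 is exactly the K-means objective (inertia).
print(np.linalg.norm(A - D @ X) ** 2, km.inertia_)
```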

Operators: Random walk, Adjacency, Laplacian, Modularity, Non-Backtracking

K-Means++ Arthur, Vassilvitskii — improved seeding
Spectral Clustering Ng, Jordan, Weiss — paper
sklearn.cluster.KMeans scikit-learn — KMeans(n_clusters=k)64.6k Python
sklearn.cluster.SpectralClustering scikit-learn — graph Laplacian eigenvectors64.6k Python
K-means/K-median phase transitions

K-means/K-median: Empirical probability of integrality for SDP/LP relaxations. Figure from: Awasthi, Bandeira, Charikar, Krishnaswamy, Villar, Ward — "Relax, no need to round: integrality of clustering formulations" (2015)

Subspace Clustering

$\underbrace{A}_{\text{known}} = \underbrace{A}_{\text{known}} \underbrace{X}_{\text{unknown}}$ s.t. $\text{diag}(X) = 0$, $X$ is sparse and/or low-rank

Matrix factorization view: Given data matrix A, find unknown coefficient matrix X where each column xi expresses data point ai as linear combination of other points. Constraint diag(X) = 0 prevents trivial self-representation. Additional regularization: ||X||1 (SSC), ||X||* (LRR), or ||X||F (LSR).
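
A rough sketch of the self-expressive step is below, assuming scikit-learn's Lasso as the per-column ℓ1 solver with an illustrative penalty; the symmetrized |X| + |X|ᵀ affinity is then passed to spectral clustering, as in SSC.

```python
# Sketch of the self-expressive step of Sparse Subspace Clustering (SSC):
# each column a_i is regressed on all other columns with an L1 penalty;
# diag(X) = 0 is enforced by excluding a_i from its own regression.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# two 2-dimensional subspaces in R^5, 40 points each
bases = [np.linalg.qr(rng.normal(size=(5, 2)))[0] for _ in range(2)]
A = np.hstack([B @ rng.normal(size=(2, 40)) for B in bases])   # 5 x 80
n = A.shape[1]

X = np.zeros((n, n))
for i in range(n):
    others = np.delete(np.arange(n), i)
    lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=5000)
    lasso.fit(A[:, others], A[:, i])
    X[others, i] = lasso.coef_

W = np.abs(X) + np.abs(X).T                 # symmetrized affinity matrix
labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(W)
print(labels[:40].mean(), labels[40:].mean())  # each block should be (near) constant
```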

SSC Elhamifar, Vidal
LRR Liu et al.
LSR Canyi Lu
SMCE Sparse Manifold
LLE Local Linear Embedding
SSC-ADMM JHU Vision Lab — Elhamifar official45 Matlab
SSC (Becker) S. Becker — proximal gradient + ADMM19 Matlab
LRR (Python) port from G. Liu's MATLAB34 Python
Subspace criteria

Subspace clustering criteria satisfying EBD conditions. Figure from: Elhamifar, Vidal — "Sparse Subspace Clustering: Algorithm, Theory, and Applications" (2013)

Non-Negative Matrix Factorization (NMF)

$\min_{D,X} \|\underbrace{A}_{\text{known}} - \underbrace{D}_{\text{unknown}} \underbrace{X}_{\text{unknown}}\|_F^2$ s.t. $D \in \mathbb{R}^{m \times r}_+$, $X \in \mathbb{R}^{r \times n}_+$, $D, X \geq 0$

Matrix factorization view: Given non-negative data matrix A, find unknown dictionary D and coefficients X, both constrained to be element-wise non-negative. The rank r controls the number of parts/topics. Non-negativity yields interpretable, parts-based representations.
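
A minimal example with scikit-learn's NMF is sketched below; the rank and initialization are illustrative choices. Note that sklearn writes the factorization as A ≈ WH with samples as rows, which maps onto the A = DX convention up to transposition.

```python
# Minimal sketch: parts-based factorization A ~ W H with scikit-learn's NMF.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = np.abs(rng.normal(size=(100, 40)))        # non-negative data, rows = samples

model = NMF(n_components=5, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(A)                    # 100 x 5, non-negative coefficients
H = model.components_                         # 5 x 40,  non-negative dictionary rows

print(W.min() >= 0, H.min() >= 0)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))   # relative Frobenius error
```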

Deep NMF (2018+): Multiple layers of factorization discover hierarchical structure. Neural NMF uses backpropagation for end-to-end training.

SPAMS J. Mairal (Inria) et al.21 Python Matlab
HALS Gillis, Glineur — Accelerated HALS
ANLS-NMF Kim, Park
HotTopixx Bittorf, Recht, Ré, Tropp NeurIPS 2012
Near-Separable NMF Gillis — robust LP-based
K-SVD Elad, Aharon
Deep NMF Trigeorgis et al. CVPR 2014 — multi-layer
sklearn.decomposition.NMF scikit-learn — Lee & Seung multiplicative updates64.6k Python
nmf (Python) MUR, ANLS, ADMM, AO-ADMM10 Python
nmflib Lee & Seung original KL + Euclidean Matlab

Matrix Completion

$\min_L \text{rank}(\underbrace{L}_{\text{unknown}})$ s.t. $P_{\underbrace{\Omega}_{\text{known}}}(L) = P_\Omega(\underbrace{M}_{\text{known}})$, $L$ is low-rank Phase Transition

Matrix factorization view: Given partial observations MΩ at index set Ω, recover unknown low-rank matrix L. Convex relaxation: min ||L||* (nuclear norm). Phase transition at m ≈ O(nr log n) samples for rank-r recovery. Made famous by Netflix Prize.
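
The sketch below is a bare-bones soft-impute/SVT-style iteration on synthetic data: it alternates between restoring the observed entries and soft-thresholding the singular values. The threshold and iteration count are illustrative, not tuned, and production work should use one of the solvers listed below.

```python
# Bare-bones iterative sketch in the spirit of SVT / soft-impute.
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))     # ground-truth low-rank matrix
mask = rng.random((n, n)) < 0.4                            # observed index set Omega

L, tau = np.zeros((n, n)), 5.0
for _ in range(200):
    L[mask] = M[mask]                                      # enforce P_Omega(L) = P_Omega(M)
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    L = (U * np.maximum(s - tau, 0.0)) @ Vt                # singular-value soft-thresholding

print(np.linalg.norm(L - M) / np.linalg.norm(M))           # relative recovery error
```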

Graph-Based (2019+): GNN-based inductive methods generalize to unseen users/items and even transfer across datasets.

SVT Cai, Candès, Shen
TFOCS S. Becker, Candès, Grant142 Matlab
OptSpace Keshavan et al.
LMaFit Wen, Yin, Zhang — nonlinear SOR
GRASTA He et al. — Grassmannian streaming
LRGeomCG Vandereycken
Matrix ALPS Kyrillidis, Cevher 2012
GROUSE Balzano, Nowak, Recht — rank-one updates
Jellyfish Recht, Ré VLDB 2013 — parallel SGD
BiG-AMP Parker et al.
SVP Meka et al.
RTRMC Boumal, Absil — Riemannian trust-region
IGMC Zhang & Chen ICLR 2020 — GNN inductive Python
IMC-GAE 2021 — Graph Autoencoder
ScaledGD nonconvex, near-optimal
Matrix-Completion-Methods SVT, SVP, TNNR-ADMM42 Matlab
MatrixCompletion.jl OptSpace, OR1MP, SVT Julia

Robust PCA & Stable PCA

$\underbrace{A}_{\text{known}} = \underbrace{L}_{\text{unknown}} + \underbrace{S}_{\text{unknown}}$, solve $\min_{L,S} \|L\|_* + \lambda\|S\|_1$, $L$ is low-rank, $S$ is sparse Phase Transition

Matrix factorization view: Given observed matrix A, decompose into unknown low-rank component L (rank(L) ≤ r) and unknown sparse component S (||S||0 ≤ s). Nuclear norm ||L||* is convex proxy for rank; L1 norm for sparsity. Phase transition: recovery succeeds when rank(L) ≤ ρn and ||S||0 ≤ αn².
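
Below is a compact ADMM/inexact-ALM style sketch of Principal Component Pursuit on synthetic data. The penalty-parameter heuristic and iteration count are illustrative rather than tuned; for real use, prefer the solvers listed below.

```python
# Sketch of Principal Component Pursuit: alternate singular-value thresholding
# on L, entrywise soft-thresholding on S, and a dual update.
import numpy as np

def shrink(X, t):                     # entrywise soft-thresholding
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def svt(X, t):                        # singular-value thresholding
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

rng = np.random.default_rng(0)
n, r = 80, 2
L0 = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))               # low-rank part
S0 = (rng.random((n, n)) < 0.05) * rng.normal(scale=10, size=(n, n)) # sparse outliers
A = L0 + S0

lam = 1.0 / np.sqrt(n)
mu = 0.25 * n * n / np.abs(A).sum()   # common heuristic for the penalty parameter
L = np.zeros_like(A); S = np.zeros_like(A); Y = np.zeros_like(A)
for _ in range(300):
    L = svt(A - S + Y / mu, 1.0 / mu)
    S = shrink(A - L + Y / mu, lam / mu)
    Y = Y + mu * (A - L - S)

print(np.linalg.norm(L - L0) / np.linalg.norm(L0))   # relative error on the low-rank part
```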

Streaming (2015+): Nearly-linear time and single-pass algorithms now available. GRASTA tracks non-stationary subspaces online.

ALM / IALM Lin, Chen, Ma 2010
ADMM Boyd et al. — Alternating Direction
GoDec Zhou, Tao ICML 2011 — randomized BRP
GoDec (Rust) Rust implementation Rust
DECOLOR Zhou, Yang, Yu — contiguous outliers
ReProCS Qiu, Vaswani — recursive robust PCA
TGA Hauberg et al. CVPR 2014 — Grassmann averages
SpaRCS Waters et al. NeurIPS 2011
Bayesian RPCA Babacan et al. — sparse Bayesian learning
GRASTA He et al. — Grassmannian streaming
Online RPCA Xiao et al. — change point detection
Streaming Outlier RPCA Diakonikolas et al. 2023 — near-linear
History PCA streaming, parameter robust
robust-pca Candès et al. ADMM implementation249 Python
RobustPCA (MATLAB) Candès, Wright PCP Matlab
lrslibrary A. Sobral — 100+ algorithms for video882 Matlab

Sparse PCA

$\max_v \underbrace{v}_{\text{unknown}}^T \underbrace{A}_{\text{known}}^T A v$ s.t. $\|v\|_2 = 1$, $\|v\|_0 \leq k$, $v$ is sparse Phase Transition

Matrix factorization view: Given data matrix A, find principal components v with at most k non-zero entries. Sparsity constraint ||v||0 ≤ k yields interpretable loadings. NP-hard; convex relaxation via SDP or L1 penalty. Computational-statistical gap: info-theoretic recovery possible below algorithmic threshold.
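
A short sketch of the truncated power method idea (in the spirit of Yuan & Zhang) on a synthetic spiked model appears below; the spectral initialization, sparsity level, and signal strength are illustrative assumptions.

```python
# Sketch of the truncated power method for sparse PCA: power iteration on the
# sample covariance, hard-truncating to the k largest entries at every step.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 400, 50, 5
v_true = np.zeros(d); v_true[:k] = 1.0 / np.sqrt(k)                  # sparse spike
X = rng.normal(size=(n, d)) + 3.0 * rng.normal(size=(n, 1)) * v_true # spiked data
Sigma = X.T @ X / n                                                   # sample covariance

v = np.linalg.eigh(Sigma)[1][:, -1]          # spectral initialization (top eigenvector)
for _ in range(100):
    w = Sigma @ v
    idx = np.argsort(np.abs(w))[:-k]         # indices of the d - k smallest |entries|
    w[idx] = 0.0                             # hard truncation to k non-zeros
    v = w / np.linalg.norm(w)

print(np.abs(v @ v_true))    # close to 1 when the sparse spike is recovered
```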

SPAMS Mairal, Bach, Ponce — sparse modeling monograph
DSPCA d'Aspremont et al. — SDP relaxation
Optimal Sparse PCA d'Aspremont, Bach, El Ghaoui 2008
Truncated Power Yuan, Zhang — sparse eigenvalue
Structured Sparse PCA Jenatton, Obozinski, Bach AISTATS 2010
Sparse Functional PCA Allen, Weylandt — joint sparsity+smoothness
sklearn.decomposition.SparsePCA scikit-learn — L1 penalty64.6k Python
spca Variable Projection — regular, randomized, robust71 R

Dictionary Learning

$\min_{D,X} \|\underbrace{A}_{\text{known}} - \underbrace{D}_{\text{unknown}} \underbrace{X}_{\text{unknown}}\|_F^2 + \lambda\|X\|_1$ s.t. $\|d_j\|_2 = 1$, $X$ is sparse Phase Transition

Matrix factorization view: Given data matrix A, learn unknown dictionary D (columns normalized ||dj||2=1) and unknown sparse codes X (||xi||0 ≤ k). Dictionary can be overcomplete (more atoms than dimensions). Alternates between sparse coding (fix D, solve for X) and dictionary update.
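
A minimal sketch with scikit-learn's MiniBatchDictionaryLearning on synthetic sparse data follows; the atom count, penalty, and batch size are illustrative choices.

```python
# Minimal sketch: learn an overcomplete dictionary D and sparse codes X.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
# synthetic data: sparse combinations of a random ground-truth dictionary
D_true = rng.normal(size=(20, 50))
D_true /= np.linalg.norm(D_true, axis=0)            # unit-norm atoms
codes = rng.normal(size=(1000, 50)) * (rng.random((1000, 50)) < 0.05)
A = codes @ D_true.T                                 # 1000 samples x 20 features

dl = MiniBatchDictionaryLearning(n_components=50, alpha=0.1,
                                 batch_size=64, random_state=0)
X = dl.fit_transform(A)       # sparse codes, 1000 x 50
D = dl.components_            # learned dictionary, 50 atoms x 20 features

print((X != 0).mean())                                # fraction of non-zero codes
print(np.linalg.norm(A - X @ D) / np.linalg.norm(A))  # relative reconstruction error
```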

SPAMS Mairal — online21 Python Matlab
JL-SPCADL G. Madhuri, Atul Negi, Kaluri V. Rangarao — Implementation of JLSPCADL: Johnson-Lindenstrauss lemma based optimal projections for discriminative dictionary learning with Modified Supervised PCA Python
K-SVD Elad, Aharon 2006
Online DL Mairal, Bach, Ponce, Sapiro ICML 2009
BPFA Zhou, Carin et al. — Beta Process
BiG-AMP Parker, Schniter, Cevher
Efficient Sparse Coding Honglak Lee et al. NeurIPS 2006
sklearn.decomposition.DictionaryLearning scikit-learn — online + batch64.6k Python
sparselandtools F. Herzog — K-SVD, OMP, MP (archived)91 Python
ksvd-sparse-dictionary K-SVD + OMP sensing Python
K-SVD Toolbox M. Elad, Aharon original Matlab

MMV & Compressive Sensing

$\underbrace{Y}_{\text{known}} = \underbrace{A}_{\text{known}} \underbrace{X}_{\text{unknown}}$ s.t. $\|X\|_{\text{row},0} \leq k$ (joint row-sparsity) Phase Transition

Matrix factorization view: Given measurements Y and known sensing matrix A, recover unknown signal matrix X with at most k non-zero rows (joint sparsity across L measurement vectors). Donoho-Tanner phase transition: sharp boundary in the undersampling-sparsity plane (δ, ρ) = (m/n, k/m). Multiple measurements improve recovery over the single-vector case (SMV).
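
The sketch below is a generic proximal-gradient (group-ISTA) baseline for joint row-sparse recovery; it is not an implementation of any solver listed below, and the penalty and step size are ad hoc choices.

```python
# Sketch of joint row-sparse recovery for Y = A X via proximal gradient
# (ISTA with a row-wise group soft-threshold).
import numpy as np

rng = np.random.default_rng(0)
m, n, L, k = 40, 100, 8, 6                       # measurements, dim, vectors, row sparsity
A = rng.normal(size=(m, n)) / np.sqrt(m)
X_true = np.zeros((n, L))
rows = rng.choice(n, size=k, replace=False)
X_true[rows] = rng.normal(size=(k, L))
Y = A @ X_true

step = 1.0 / np.linalg.norm(A, 2) ** 2           # 1 / Lipschitz constant of the gradient
lam = 0.01
X = np.zeros((n, L))
for _ in range(500):
    G = X - step * A.T @ (A @ X - Y)             # gradient step on ||Y - AX||_F^2 / 2
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    X = np.maximum(1 - step * lam / np.maximum(norms, 1e-12), 0.0) * G  # row-wise shrinkage

print(np.linalg.norm(X - X_true) / np.linalg.norm(X_true))
```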

SA-MUSIC Lee, Bresler — subspace-augmented
CS-MUSIC Kim et al.
AMP-MMV Ziniel, Schniter — Bayesian AMP
T-MSBL Zhang, Rao — temporal SBL
REMBO Mishali, Eldar
EM-GM-AMP Vila, Schniter — Gaussian mixture
AMP-Compressed-Sensing Krzakala et al. 2012 — Bayesian optimal Python
CoSaMP Needell, Tropp — iterative greedy
IHT / NIHT Blumensath, Davies — hard thresholding
CS-algorithms OMP, SP, IHT, CoSaMP, FISTA, ISTA Matlab Python
sparse_recovery BP, OMP, nonnegative variants Python
CS C & MATLAB OMP, AMP, IHT Matlab C++
ISTA-Net Deep unrolled ISTA (CVPR 2018) Python
Diff-OneBit Youming Chen, Zhaoqiang Liu — Implementation of Diff-OneBit algorithm for signal recovery under 1-bit quantization using diffusion model priors Python
Probabilistic reconstruction phase diagram

Probabilistic reconstruction. From: Krzakala, Mézard, Sausset, Sun, Zdeborová (2012)

SCoSaMP Phase Transitions

SCoSaMP. From: Blanchard, Tanner (2012)

Algorithm Comparison

CGIHT/NIHT. From: Blanchard, Tanner, Wei (2015)

Permutation recovery

Permutation recovery. From: Pananjady, Wainwright, Courtade (2016)

Blind Source Separation / ICA

$\underbrace{Y}_{\text{known}} = \underbrace{A}_{\text{unknown}} \underbrace{X}_{\text{unknown}}$ s.t. rows of $X$ are independent

Matrix factorization view: Given mixed observations Y, recover unknown mixing matrix A and unknown source signals X. Key constraint: rows of X are statistically independent (ICA) or sparse (SCA). Non-Gaussianity of sources enables identification up to permutation and scaling.
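
A minimal sketch with scikit-learn's FastICA on two synthetic sources follows; the recovered sources are checked only up to permutation and scaling, consistent with the identifiability statement above.

```python
# Minimal sketch of blind source separation with FastICA: mix two independent
# non-Gaussian sources and recover them up to permutation and scaling.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),           # square-ish wave
          np.sin(5 * t)]                    # sinusoid (non-Gaussian, independent)
A_mix = np.array([[1.0, 0.5],
                  [0.4, 1.0]])              # unknown mixing matrix
Y = S @ A_mix.T                             # observed mixtures, 2000 x 2

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(Y)                # estimated sources (up to perm/scale)
A_hat = ica.mixing_                         # estimated mixing matrix

# one estimated source should correlate strongly with exactly one true source
print(np.corrcoef(S_hat[:, 0], S[:, 0])[0, 1], np.corrcoef(S_hat[:, 0], S[:, 1])[0, 1])
```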

ICALab ICA Toolbox
FastICA Hyvärinen
BLISS BSS software
DUET Sparse Component
MISEP Mutual Information
sklearn.decomposition.FastICA scikit-learn — Hyvärinen & Oja64.6k Python

Linear & Non-Linear Autoencoders

$\underbrace{X}_{\text{known}} \approx \underbrace{D}_{\text{unknown}}(\underbrace{E}_{\text{unknown}}(X))$, linear case $\equiv$ PCA

Matrix factorization view: Given data X, learn encoder E: Rd→Rk and decoder D: Rk→Rd minimizing ||X - D(E(X))||2. Linear case: when E, D are linear maps (matrices), optimal solution recovers PCA—the encoder projects onto top-k principal components. Non-linear: E = f(VX), D = g(WZ) with nonlinearities f, g learn curved manifolds.

Key insight (Baldi & Hornik 1989): Linear autoencoders with bottleneck dimension k learn the subspace spanned by top-k principal components. Deep linear networks still recover PCA but with different learning dynamics.
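
The sketch below checks this numerically with plain NumPy gradient descent on an unconstrained linear encoder/decoder pair, then compares the learned decoder's column space to the top-k principal components; the learning-rate heuristic and step count are illustrative.

```python
# Numerical check of Baldi & Hornik: a trained linear autoencoder spans the
# same subspace as the top-k principal components.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
X -= X.mean(axis=0)

W_enc = 0.01 * rng.normal(size=(k, d))          # encoder  R^d -> R^k
W_dec = 0.01 * rng.normal(size=(d, k))          # decoder  R^k -> R^d
lr = 0.5 / np.linalg.norm(X, 2) ** 2            # step size scaled by the data spectral norm
for _ in range(5000):
    R = X - X @ W_enc.T @ W_dec.T               # reconstruction residual
    G_dec = -R.T @ (X @ W_enc.T)                # gradient of ||R||_F^2 / 2 w.r.t. W_dec
    G_enc = -(W_dec.T @ R.T @ X)                # gradient w.r.t. W_enc
    W_dec -= lr * G_dec
    W_enc -= lr * G_enc

# compare the decoder's column space with the top-k principal components
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P_pca = Vt[:k].T                                 # d x k, top-k right singular vectors
Q, _ = np.linalg.qr(W_dec)
print(np.linalg.svd(P_pca.T @ Q, compute_uv=False))  # cosines of principal angles, near 1
```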

Deep Autoencoders Hinton & Salakhutdinov 2006
VAE Kingma & Welling — variational
Denoising AE Vincent et al. — robust features
Sparse AE sparsity penalty on hidden units
β-VAE Higgins et al. — disentangled
VQ-VAE van den Oord et al. — discrete codes
sklearn.decomposition.PCA scikit-learn — linear AE equivalent64.6k Python
PyTorch VAE official example Python

LoRA & Low-Rank Adaptation (LLMs)

$W' = \underbrace{W}_{\text{frozen}} + \underbrace{B}_{\text{unknown}} \underbrace{A}_{\text{unknown}}$ s.t. $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d,k)$

Matrix factorization view: Given frozen pretrained weight W, learn unknown low-rank update ΔW = BA where B∈Rd×r and A∈Rr×k are trainable. Rank r controls parameter count: O(r(d+k)) vs O(dk). Typical r = 4–64 for LLMs with d,k ~ 104.

Key insight: Pretrained weight updates during fine-tuning have low intrinsic rank. LoRA exploits this structure to achieve full fine-tuning quality with <1% trainable parameters.
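
A minimal PyTorch sketch of a LoRA-augmented linear layer is below; the zero-initialized B, Gaussian A, and alpha/r scaling follow the original paper's conventions, while the layer sizes and rank are illustrative.

```python
# Minimal sketch of a LoRA linear layer: W is frozen, only the low-rank update
# B @ A is trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                      # freeze pretrained W (and bias)
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # r x k
        self.B = nn.Parameter(torch.zeros(d_out, r))          # d x r, zero init => W' = W at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), r=8)
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable, 512 * 512)    # O(r(d + k)) trainable vs d*k frozen parameters
```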

LoRA Hu et al. 2021 — original paper
QLoRA 4-bit quantized + LoRA
DoRA Weight-Decomposed Low-Rank ICML 2024
AdaLoRA adaptive rank allocation
PEFT HuggingFace library20.4k Python
GaLore gradient low-rank projection
LoftQ quantization-aware LoRA init
microsoft/LoRA Microsoft Research — pip install loralib13.1k Python
LQ-LoRA Han Guo, Philip Greengard, Eric P. Xing et al. — Paper: LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning $\underbrace{W}_{\text{known}} \approx \underbrace{Q}_{\text{unk., fixed quantized}} + \underbrace{L_1}_{\text{unk.}} \underbrace{L_2}_{\text{unk.}}$ Python
DyLoRA Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev et al. — Paper: DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation $\underbrace{W_0}_{\text{frozen}} + \underbrace{B_{1:r(t)}}_{\text{unk.}} \underbrace{A_{1:r(t)}}_{\text{unk.}} \text{ where } r(t) \sim \text{uniform}(1, r_{\max})$ Python
SoRA Ning Ding, Xingtai Lv, Qiaosen Wang et al. — Paper: Sparse Low-rank Adaptation of Pre-trained Language Models $\underbrace{W}_\text{known} + \underbrace{G}_\text{unk.} \odot (\underbrace{B}_\text{unk.} \underbrace{A}_\text{unk.}) \text{ where } G \in \{0,1\}^r \text{ is gate vector}$
VeRA Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano — Paper: VeRA: Vector-based Random Matrix Adaptation $\underbrace{W_0}_\text{frozen} + \underbrace{\Lambda_d}_\text{unk.} \underbrace{B}_\text{frozen} \underbrace{A}_\text{frozen} \underbrace{\Lambda_b}_\text{unk.}$
MoSLoRA Taiqiang Wu, Jiahao Wang, Zhe Zhao et al. — Paper: Mixture-of-Subspaces in Low-Rank Adaptation $\underbrace{W_0}_\text{frozen} + \sum_{i=1}^{r'} \sum_{j=1}^{r'} \underbrace{W_{ij}}_\text{mixer} \underbrace{A_i}_\text{unk.} \underbrace{B_j}_\text{unk.}$
Intrinsic Dimension Fine-tuning Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta — Paper: Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning $\theta = \underbrace{\theta_0}_{\text{known}} + \underbrace{P}_{\text{known}} \underbrace{\theta_d}_{\text{unknown}}$
COMPACTER Rabeeh Karimi Mahabadi, James Henderson, Sebastian Ruder — Paper: Compacter: Efficient Low-Rank Hypercomplex Adapter Layers $\underbrace{W_{task}}_{\text{unk.}} = \sum_{i} \underbrace{S_i}_{\text{slow weights}} \otimes \underbrace{F_i}_{\text{fast rank-1}}$ Python

Tensor Decomposition & Tensor Networks

CP: $\underbrace{\mathcal{T}}_{\text{known}} = \sum_r \underbrace{a_r \otimes b_r \otimes c_r}_{\text{unknown}}$; Tucker: $\mathcal{T} = \underbrace{G}_{\text{unk.}} \times_1 \underbrace{U}_{\text{unk.}} \times_2 \underbrace{V}_{\text{unk.}} \times_3 \underbrace{W}_{\text{unk.}}$

Tensor factorization view: Given observed tensor T, decompose into unknown factors. CP: sum of R rank-1 tensors (outer products). Tucker: core tensor G and unknown factor matrices U,V,W. Tensor Train: chain of 3-way cores Gk, storage O(dnr²) vs O(nᵈ) for the full tensor. All factors unknown.

Tensor Networks (2011+): Tensor Train (TT) achieves O(dnr²) storage. Tensor Ring (TR, 2016) adds circular structure for permutation invariance. Tensor Wheel (2024) balances TR and fully-connected networks.
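
A small CP example with TensorLy (assuming a recent version that exposes parafac and cp_to_tensor) is sketched below: build a rank-3 tensor from known factors, refit it, and check the reconstruction error.

```python
# Minimal sketch of CP decomposition with TensorLy.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
factors_true = [rng.normal(size=(dim, 3)) for dim in (10, 12, 14)]
T = tl.cp_to_tensor((np.ones(3), factors_true))   # sum of 3 rank-1 terms, shape (10, 12, 14)

cp = parafac(tl.tensor(T), rank=3, n_iter_max=500, tol=1e-10)
T_hat = tl.cp_to_tensor(cp)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))   # small relative error
```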

CP/PARAFAC rank-1 sum decomposition
Tucker core tensor + factor matrices
Tensor Train (TT) Oseledets 2011 — linear chain
Tensor Ring (TR) Zhao et al. 2016 — circular
Tensor Wheel (TW) SIAM 2024 — randomized
Hierarchical Tucker tree-structured
TensorLy J. Kossaifi et al. — NumPy/PyTorch/JAX1.7k Python
TensorLy-Torch deep tensor networks Python
Tensorlab Vervliet et al. (KU Leuven) Matlab
NTFLAB Nonnegative Tensor
Fast NTF Kim, Park
t3f Novikov, Oseledets et al. — TensorFlow TT228 Python

Recommender Systems & Neural MF

$\min_{U,V} \sum_{(i,j) \in \Omega} (\underbrace{R_{ij}}_{\text{known}} - \underbrace{u_i}_{\text{unk.}}^T \underbrace{v_j}_{\text{unk.}})^2 + \lambda(\|U\|^2 + \|V\|^2)$

Matrix factorization view: Given sparse rating matrix R (observed at Ω), learn unknown user embeddings U∈Rm×k and item embeddings V∈Rn×k. Predict Rij ≈ uiTvj. Regularization prevents overfitting. Implicit feedback: weighted by confidence cij.

NCF Debate (2020): Properly tuned dot product often beats learned MLP similarities. Simple baselines remain competitive.
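
The sketch below is a bare-bones explicit-feedback ALS in NumPy, alternating ridge-regularized least-squares solves for the user and item factors over the observed entries; dimensions and λ are illustrative.

```python
# Sketch of Alternating Least Squares on an explicit-feedback rating matrix.
import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam = 60, 80, 4, 0.1
U_true, V_true = rng.normal(size=(m, k)), rng.normal(size=(n, k))
R = U_true @ V_true.T
Omega = rng.random((m, n)) < 0.3                      # observed ratings

U, V = rng.normal(size=(m, k)), rng.normal(size=(n, k))
for _ in range(30):
    for i in range(m):                                # update user factors
        j = np.where(Omega[i])[0]
        Vj = V[j]
        U[i] = np.linalg.solve(Vj.T @ Vj + lam * np.eye(k), Vj.T @ R[i, j])
    for j in range(n):                                # update item factors
        i = np.where(Omega[:, j])[0]
        Ui = U[i]
        V[j] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(k), Ui.T @ R[i, j])

err = np.linalg.norm((U @ V.T - R)[~Omega]) / np.linalg.norm(R[~Omega])
print(err)    # generalization error on the unobserved entries
```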

ALS Alternating Least Squares — classic
WRMF Hu, Koren, Volinsky 2008 — implicit
BPR Rendle et al. 2009 — pairwise ranking
NCF / NeuMF He et al. 2017 — MLP + GMF
LightFM Lyst — hybrid factorization5.1k Python
Implicit B. Frederickson — GPU ALS/BPR3.8k Python
Surprise N. Hug — scikit-surprise6.8k Python
NeuMF++ 2024 — + stacked autoencoders
IGMC GNN-based inductive
RecBole RUC AI Box — 94 algorithms4.2k Python

Graph Matching

$\underbrace{A}_{\text{known}} = \underbrace{X}_{\text{unknown}} \underbrace{B}_{\text{unknown}} X^T$ s.t. $X$ is a permutation matrix Phase Transition

Matrix factorization view: Given adjacency matrix A of a graph, find unknown permutation matrix X and unknown structure B such that A = XBXT. Key constraint: X is a permutation matrix (binary, doubly stochastic). Used for matching nodes across two graphs or recovering hidden structure.

FAQ Vogelstein et al. — Fast Approximate QAP
GRAMPA Fan et al. — spectral graph matching
SGM Seeded Graph Matching
Erdös-Rényi Phase Transition sharp threshold for exact recovery

Generalized Matrix Factorization

$\underbrace{W}_{\text{known}} \odot \underbrace{L}_{\text{unknown}} = W \odot \underbrace{U}_{\text{unk.}} \underbrace{V}_{\text{unk.}}^T$, minimize $\text{rank}(L)$

Matrix factorization view: Given mask W indicating observed entries, find low-rank matrix L = UVT matching observations. Generalizes matrix completion (binary W) to weighted observations. The Hadamard product W⊙L selects which entries matter.

Weighted Low-Rank Srebro, Jaakkola
Weighted MC heterogeneous noise

Archetypal Analysis

$\underbrace{A}_{\text{known}} = \underbrace{D}_{\text{unknown}} \underbrace{X}_{\text{unknown}}$, $D = A\underbrace{B}_{\text{unknown}}$ s.t. $X \geq 0$, $B \geq 0$

Matrix factorization view: Given data A, find archetypes D that lie on the convex hull of the data (D = AB with B ≥ 0, columns sum to 1), and coefficients X representing each data point as convex combination of archetypes. Unlike NMF, archetypes are extreme points of the data itself.

SPAMS Archetypal Mairal et al.
Original AA Cutler, Breiman 1994
Robust AA Chen, Mairal et al. 2014 — outlier handling
py_pcha U. Aslak — Principal Convex Hull Analysis41 Python

Binary Matrix Factorization

$\underbrace{D}_{\text{known}} = \underbrace{T}_{\text{unknown}} \underbrace{A}_{\text{unknown}}$ s.t. $T \in \{0,1\}^{m \times r}$

Matrix factorization view: Given data matrix D∈Rm×n, find unknown binary factor T∈{0,1}m×r and coefficient matrix A∈Rr×n such that D = TA. Optional constraint: AT·1r = 1n (columns sum to 1 for convex combinations). Unlike Boolean Matrix Factorization, A is real-valued.

Key insight (Slawski et al. 2013): Despite the combinatorial constraint T∈{0,1}m×r, the problem is tractable when r is small. The algorithm finds all binary vertices of aff(D) using the Littlewood-Offord lemma, achieving O(m·2^{r-1}) complexity.

Applications: DNA methylation unmixing (T = methylation profiles per cell type, A = mixing proportions), binary classification ensembles, topic modeling with hard assignments.

Binary Components M. Slawski, M. Hein, P. Lutsik — NeurIPS 2013
binary_matrix_factorization Python implementation Python
MeDeCom DNA methylation deconvolution (Lutsik et al. 2017)
Boolean MF Miettinen et al. 2008 — both factors binary
PROXIMUS Koyutürk, Grama — recursive binary partitioning
Kronecker Product Factorization Yannis Voet, Leonardo De Novellis — Implementation of algorithms for identifying all possible Kronecker product factorizations of binary matrices, including decomposition graph visualization and applications to sparse matrix analysis. Author Python

Matrix Compressive Sensing (MCS)

$\underbrace{\mathcal{A}}_{\text{known}}(\underbrace{L}_{\text{unknown}}) = \underbrace{b}_{\text{known}}$ s.t. $L$ is rank-$r$ Phase Transition

Matrix factorization view: Given linear measurements b = A(L) of unknown low-rank matrix L, recover L. Extends compressive sensing from vectors to matrices. Phase transition: O(r(m+n)) measurements suffice for rank-r recovery. RIP for matrices analogous to vector case.

Nuclear Norm Min Recht, Fazel, Parrilo 2007
Matrix ALPS Kyrillidis, Cevher 2012
SVP Jain, Meka, Dhillon 2009
ADMiRA Lee, Bresler 2009

SPCP / Noisy Robust PCA

$\underbrace{A}_{\text{known}} = \underbrace{L}_{\text{unknown}} + \underbrace{S}_{\text{unknown}} + \underbrace{N}_{\text{unknown}}$ s.t. $L$ is low-rank, $S$ is sparse, $N$ is noise Phase Transition

Matrix factorization view: Given noisy observations A, decompose into unknown low-rank L, unknown sparse S, and unknown dense noise N. Extends Robust PCA to handle small dense perturbations. Stable Principal Component Pursuit (SPCP) solves: min ||L||* + λ||S||1 s.t. ||A - L - S||F ≤ ε.

SPCP Zhou et al. — stable PCP
Noisy RPCA Agarwal, Negahban, Wainwright 2011
GoDec+ Zhou, Tao 2013

Kernel Factorizations

$\underbrace{K}_{\text{known}} = \underbrace{\Phi}_{\text{unknown}} \Phi^T$ s.t. $K$ is PSD

Matrix factorization view: Given kernel matrix K (positive semi-definite), find feature representation Φ such that Kij = φ(xi)Tφ(xj). Kernel PCA, Nyström approximation, and random features all exploit this factorization. Low-rank kernel approximations enable scalable kernel methods.
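
A small NumPy sketch of the Nyström idea follows: sample landmark columns C and the corresponding block W of an RBF kernel matrix, and form Φ = C W^{-1/2} so that K ≈ ΦΦᵀ; the landmark count and kernel width are illustrative.

```python
# Sketch of the Nystrom approximation K ~ C W^+ C^T = Phi Phi^T.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
gamma = 0.5
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                              # RBF kernel matrix, 500 x 500

c = 50
idx = rng.choice(len(X), size=c, replace=False)      # landmark points
C = K[:, idx]                                        # n x c
W = K[np.ix_(idx, idx)]                              # c x c
eigval, eigvec = np.linalg.eigh(W)
inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12))) @ eigvec.T
Phi = C @ inv_sqrt                                   # n x c approximate feature map
K_nys = Phi @ Phi.T                                  # = C W^+ C^T

print(np.linalg.norm(K - K_nys) / np.linalg.norm(K))
```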

Nyström Williams, Seeger 2001
Random Features Rahimi, Recht NIPS 2007
Kernel PCA Schölkopf, Smola, Müller 1998
FastFood Le, Sarlos, Smola 2013
Recursive Nyström Musco, Musco 2017
CERF Jianqiao Wangni, Jingwei Zhuo, Jun Zhu — Paper: Learning Random Fourier Features by Hybrid Constrained Optimization $\phi_{\text{learner}}(x) = \underbrace{\Omega}_{\text{mask/block struct.}} \underbrace{\phi_{\text{teacher}}(x)}_{\text{known RFF}} \text{ where } \underbrace{\Omega}_{\text{unknown}} \text{ is an orthogonal transformation}$

Non-Convex Optimization & Implicit Regularization

$\min_{U,V} \|\underbrace{U}_{\text{unknown}} \underbrace{V}_{\text{unknown}}^T - \underbrace{M}_{\text{known}}\|_F^2$ s.t. $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ Phase Transition

Matrix factorization view: Given observed matrix M, find unknown factors U, V such that UVT approximates M. The factorization implicitly constrains rank(UVT) ≤ r. Despite non-convexity, all local minima are global when r is large enough.

Key result (Gunasekar et al. 2017): GD on matrix factorization with small init implicitly minimizes nuclear norm. Deep factorization amplifies this bias.
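
A tiny NumPy sketch of factorized gradient descent on ‖UVᵀ - M‖_F² from small random initialization is below; the step-size heuristic is illustrative, and with r ≥ rank(M) the loss is reliably driven to (near) zero in practice.

```python
# Sketch of non-convex factorized gradient descent: fit M ~ U V^T.
import numpy as np

rng = np.random.default_rng(0)
n, r_true, r = 60, 3, 5
M = rng.normal(size=(n, r_true)) @ rng.normal(size=(r_true, n))

U = 0.01 * rng.normal(size=(n, r))            # small random initialization
V = 0.01 * rng.normal(size=(n, r))
eta = 0.3 / np.linalg.norm(M, 2)              # step size scaled by the data norm
for _ in range(2000):
    E = U @ V.T - M                           # residual
    U, V = U - eta * E @ V, V - eta * E.T @ U # simultaneous gradient updates

print(np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))
```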

Scaled GD (ScaledGD) O(log(d/ε)) convergence from random init
Wirtinger Flow Candès et al. 2015 — phase retrieval
Reshaped WF O(n) sample complexity
Burer-Monteiro SDP via X=RRT, no spurious minima when r≳√d
Riemannian GD optimization on manifolds
Alternating Minimization provable guarantees post-2015
Deep MF Implicit Bias rank regularization grows with depth

Randomized Algorithms & Sketching (RandNLA)

$\tilde{A} = \Pi A$ where $\Pi \in \mathbb{R}^{k \times m}$, $k \ll m$

Matrix factorization view: Randomized numerical linear algebra uses random projections and sketching to compute approximate matrix factorizations efficiently. Key techniques include randomized SVD, Johnson-Lindenstrauss transforms, CountSketch, and streaming algorithms like Frequent Directions. These methods trade exact computation for massive speedups while providing provable approximation guarantees.
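
A short NumPy sketch of the Halko, Martinsson, Tropp randomized SVD recipe (random test matrix, optional power iterations, QR, small SVD) is below; the oversampling and power-iteration counts are illustrative.

```python
# Sketch of the randomized range finder + SVD in the reduced space.
import numpy as np

def randomized_svd(A, k, oversample=10, n_power=2, seed=None):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + oversample))      # random test matrix
    Y = A @ Omega
    for _ in range(n_power):                          # power iterations sharpen the spectrum
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                            # orthonormal basis for approx range(A)
    B = Q.T @ A                                       # small (k + p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 300))   # effectively rank-50 matrix
U, s, Vt = randomized_svd(A, k=20, seed=0)
s_exact = np.linalg.svd(A, compute_uv=False)[:20]
print(np.max(np.abs(s - s_exact) / s_exact))          # top singular values match closely
```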

Randomized SVD Halko, Martinsson, Tropp 2011
Frequent Directions Liberty 2013 — streaming PCA
Sparse JL Kane, Nelson 2014
CountSketch Clarkson, Woodruff 2017
Sketching Survey Woodruff 2014
τ-GRA Pettie, Wang 2022 — HyperLogLog/PCSA estimators Python
Improved Compressed Counting Ping Li — Implementation of the optimal power estimator for Compressed Counting algorithm to estimate frequency moments and Shannon entropy of data streams Python

Other & Online SVD

Boolean factorization, alignment (RASL/TILT), streaming decompositions, and other specialized methods.

RASL Batch Alignment
TILT Low-rank Textures
BMF Miettinen et al. TKDE 2008
Archetypal SPAMS v2.5
PROPACK Lanczos SVD
BLWS Block Lanczos
Oja's Algorithm J. Math. Biol. 1982

Modern Themes (2015–2024)

Contemporary research directions that cut across classical categories, reflecting the field's evolution toward neural network integration, implicit regularization, and computational-statistical trade-offs.

Implicit Regularization GD on overparameterized models learns low-rank/sparse solutions
Neural Collapse deep classifiers converge to symmetric low-rank geometry
Double Descent test error decreases again past interpolation threshold
Random Feature Models kernel approximation via random projections
Overparameterization more parameters than data points aids optimization
Edge of Stability GD operates at maximum stable learning rate
Grokking delayed generalization after memorization phase
Lottery Tickets sparse subnetworks match dense performance

Computational-Statistical Gaps

Phase Transition

The "hard phase" regime where information-theoretically recovery is possible but no polynomial-time algorithm is known. Understanding these gaps is central to modern high-dimensional statistics.

Key example: In sparse PCA, the spiked covariance model exhibits a gap: spectral methods require signal strength Θ(n^{1/4}) while the information-theoretic limit is Θ(1/√n).

Berthet-Rigollet 2013 planted clique reduction for sparse PCA
Statistical Query Lower Bounds Feldman et al. 2012
Low-Degree Polynomials Hopkins et al. FOCS 2017
Sum-of-Squares Hierarchy Meka, Potechin, Wigderson 2015
Tensor PCA Richard, Montanari 2014
Community Detection Abbe survey 2017

Approximate Message Passing (AMP)

Phase Transition

Iterative algorithms with exact asymptotic characterization via state evolution. Predict sharp phase transitions for low-rank recovery, compressed sensing, and community detection.

State Evolution: In the high-dimensional limit, the AMP error follows a deterministic recursion that correctly predicts the success/failure boundary.
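
The sketch below is a hedged NumPy implementation of soft-thresholding AMP for y = Ax (in the Donoho, Maleki, Montanari style), including the Onsager correction that distinguishes AMP from plain iterative thresholding; the threshold policy is a simple heuristic, not a tuned state-evolution schedule.

```python
# Sketch of AMP with soft thresholding for compressed sensing y = A x.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(0)
n, m, k = 1000, 500, 50
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n); x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
y = A @ x_true

x, z = np.zeros(n), y.copy()
for _ in range(30):
    tau = np.linalg.norm(z) / np.sqrt(m)            # estimate of the effective noise level
    r = x + A.T @ z                                  # pseudo-data
    x_new = soft(r, tau)
    # Onsager term: (1/delta) * z * <eta'(r)>, with delta = m/n and
    # <eta'(r)> = (# nonzeros of x_new) / n, so the coefficient is nnz / m.
    onsager = (z / m) * np.count_nonzero(x_new)
    z = y - A @ x_new + onsager
    x = x_new

print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```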

GAMP Generalized AMP — Rangan
BiG-AMP Bilinear GAMP for matrix factorization
Low-RAMP Lesieur, Krzakala, Zdeborová 2017
BiG-VAMP Vector AMP for matrix problems
VAMP Vector AMP — Rangan, Fletcher
Spectral + AMP Spectral init + AMP refinement

Randomized Algorithms & Sketching (RandNLA)

Use random projections to reduce massive matrices into tractable smaller problems while preserving essential structure. Sketching enables single-pass streaming algorithms.

Frequent Directions Liberty et al. 2015 — deterministic, optimal
CountSketch sparse projection, GPU-friendly
SRHT Subsampled Randomized Hadamard
Randomized SVD Halko, Martinsson, Tropp 2011
Randomized LU Shabat, Shmueli, Averbuch 2013
gensim R. Řehůřek (RARE Technologies)16.3k Python
BLENDENPIK Avron, Maymounkov, Toledo
Redsvd Okanohara — Randomized SVD C++
Random Kitchen Sinks FastFood — Le, Sarlos, Smola
Multisketching Higgins 2025 — CountSketch + Gaussian, 77% faster

Frameworks & Platforms

Comprehensive software packages integrating multiple factorization methods. GPU acceleration now mainstream via RAPIDS and PyTorch backends.

scikit-learn NMF, TruncatedSVD, PCA64.6k Python
cuML (RAPIDS) GPU-accelerated, 10-50× speedup5.1k Python
TensorLy tensor decomposition, multi-backend1.7k Python
PyTorch / JAX differentiable linalg, autograd Python
SPAMS Sparse Modeling21 Python Matlab
TFOCS S. Becker, Candès142 Matlab
Implicit GPU recommenders (ALS, BPR)3.8k Python
PEFT HuggingFace LoRA/adapters20.4k Python
Manopt manifold optimization Matlab Python
cvxpy convex optimization6.1k Python
Dask-ML distributed, out-of-core Python

Preconditioners & Fast Multipole Methods

Direct solvers/FMM

Fast multipole methods (FMM) can be viewed as hierarchical matrix factorizations. Exploiting low-rank structure of off-diagonal blocks enables O(N) or O(N log N) complexity for dense matrix-vector products and direct solvers. H-matrices, H²-matrices, and butterfly factorizations provide fast approximate inverses and preconditioners.

Matrix Factorization View: The multipole expansion A ≈ B̃C decomposes dense kernel matrices into structured low-rank products. Quadtree/octree decompositions yield hierarchical factorizations with sparse + low-rank blocks.
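
A toy NumPy illustration of why this works is below: the off-diagonal block of a 1/r kernel matrix between two well-separated point clusters is numerically low-rank, so a truncated SVD compresses it to a handful of terms. The geometry and tolerance are illustrative.

```python
# Numerical rank of an off-diagonal kernel block between well-separated clusters.
import numpy as np

rng = np.random.default_rng(0)
src = rng.random((400, 2))                              # source points in [0, 1]^2
trg = rng.random((400, 2)) + np.array([3.0, 0.0])       # well-separated target cluster

d = np.linalg.norm(trg[:, None, :] - src[None, :, :], axis=-1)
K = 1.0 / d                                             # off-diagonal block of a 1/r kernel

s = np.linalg.svd(K, compute_uv=False)
rank_tol = int(np.sum(s / s[0] > 1e-8))                 # numerical rank at relative tol 1e-8
print(rank_tol, K.shape)                                # a handful of terms instead of 400
```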

Inverse FMM Coulier, Pouransari, Darve — fast approximate direct solver as preconditioner
FMM Tutorial Martinsson — Fast algorithms in scientific computing
Fast Direct Solvers Martinsson, O'Neil — survey on hierarchical factorizations
H²-matrix decomposition

H²-matrix decomposition: A = S + UKVᵀ (sparse + low-rank). The quadtree structure generalizes to higher dimensions.

Multipole expansion as matrix factorization

Multipole expansion as matrix factorization: A ≈ B̃C. Figure from: Martinsson — FMM Tutorial

Essential References

Matrix Computations
Convex Optimization
Topics in Matrix Analysis
NMF & Tensor Factorizations
Matrix Manifolds
Convex Opt. in SP
Optimization for ML
LAPACK Guide

Phase Transition Atlas (2020s Framing)

Modern papers separate information-theoretic limits from algorithmic limits, and often explain gaps with spectral thresholds or state-evolution analysis.

Information-Theoretic Limit feasibility boundary, often via counting or mutual information
Algorithmic Threshold practical recoverability for a given solver family
Spectral / BBP Transition Baik-Ben Arous-Péché (2005): eigenvalue separates from bulk
State Evolution (AMP) predicts error curves and sharp transitions
Computational-Statistical Gap "hard phase" where info-theoretic recovery possible but poly-time fails
Planted Clique Hardness reduction-based lower bounds (Berthet-Rigollet 2013)
Rank / Sparsity vs Sampling classic completion and sparse recovery axes
Noise vs Signal Strength robustness regimes and breakdown points