We propose a deterministic sampling framework using Score-Based Transport Modeling for sampling an unnormalized target density $ \pi $ given only its score $ \nabla \log \pi $. Our method approximates the Wasserstein gradient flow on $ \mathrm{KL}(f_t\|\pi) $ by learning the time-varying score $ \nabla \log f_t $ on the fly using score matching. While having the same marginal distribution as Langevin dynamics, our method produces smooth deterministic trajectories, resulting in monotone noise-free convergence. We prove that our method dissipates relative entropy at the same rate as the exact gradient flow, provided sufficient training. Numerical experiments validate our theoretical findings: our method converges at the optimal rate, has smooth trajectories, and is often more sample efficient than its stochastic counterpart. Experiments on high-dimensional image data show that our method produces high-quality generations in as few as 15 steps and exhibits natural exploratory behavior. The memory and runtime scale linearly in the sample size.
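The deterministic dynamics described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: a 1D standard Gaussian target, a linear score model $ s(x) = ax + b $ in place of a neural network, and a closed-form minimiser of the implicit score-matching loss $ \mathbb{E}[\tfrac12 s(x)^2 + s'(x)] $ in place of gradient-based training. The function names are hypothetical. Particles follow the standard velocity field of the Wasserstein gradient flow of the KL divergence, $ \dot x = \nabla \log \pi(x) - \nabla \log f_t(x) $, with the second term replaced by the fitted score.

```python
import random


def target_score(x):
    """Score of the assumed target pi = N(0, 1): d/dx log pi(x) = -x."""
    return -x


def fit_linear_score(xs):
    """Fit s(x) = a*x + b by minimising the implicit score-matching loss
    E[0.5 * s(x)**2 + s'(x)] over the particles.

    For a linear model the minimiser is closed-form (an assumption made
    to keep the sketch short): a = -1/var, b = mean/var.
    """
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return -1.0 / var, mean / var


def transport(xs, dt=0.01, steps=1000):
    """Forward-Euler integration of dx/dt = score_pi(x) - s_t(x).

    The particle score s_t is refitted at every step; no noise is injected,
    so each trajectory is deterministic given the initial particles.
    """
    xs = list(xs)
    for _ in range(steps):
        a, b = fit_linear_score(xs)
        xs = [x + dt * (target_score(x) - (a * x + b)) for x in xs]
    return xs


def mean_var(xs):
    """Empirical mean and (population) variance of a list of floats."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / n


random.seed(0)
# Initial particles drawn from N(3, 0.25), an arbitrary illustrative choice.
particles = [random.gauss(3.0, 0.5) for _ in range(500)]
final = transport(particles)
print("initial mean/var:", mean_var(particles))
print("final mean/var:  ", mean_var(final))
```

In this toy setting the particle mean and variance move toward those of the target (0 and 1). A neural score model trained by a few gradient steps per time step would take the place of `fit_linear_score` for targets that are not Gaussian.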
Figure 3. Experiment 4.1, log-concave target. Top: relative entropy dissipation rate of SBTM (ours) and SDE (stochastic). SBTM approximates the entropy decay rate well, while SDE is noisy. Bottom left: relative entropy of SBTM, SDE, and the ground truth. SBTM approximates the ground truth well. Bottom right: $ L^2 $ error relative to the ground-truth solution. SBTM produces lower error with a smoother trajectory.
Figure 4. Experiment 4.2, 1D Gaussian mixture. Left: KL divergence of SBTM (ours) and SDE (stochastic) over time. SBTM exhibits smoother convergence. Right: entropy dissipation of SBTM and SDE. SBTM approximates the entropy decay rate perfectly with the computable quantity $ {\operatorname{F}\!\left({f_t}\, \|\, {\pi}\right)} $, while SDE is noisy.
Figure 5. Experiment 4.3, well-separated 1D Gaussian mixture. Left: reconstructed density of SBTM. It approximates the solution well despite the non-log-concavity. Right: entropy dissipation of SBTM (ours) and SDE (stochastic). SBTM approximates the entropy decay rate perfectly even in annealed dynamics.
Figure 7. Experiment 4.5, well-separated 2D Gaussian mixture. Scatter plot over time. Top: SBTM (ours) with the dilation annealing. Bottom: SDE with the dilation annealing [2]. SBTM separates into modes earlier than the SDE.
Figure 10. Experiment 4.6, high-dimensional. Starting from the same initial point, SBTM produces distinct sample trajectories depending on the training schedule. The amount of training controls the strength of interaction between particles. Top to bottom: SBTM without training (equivalent to gradient ascent on $ \nabla \log \pi $), SBTM with a small amount of training, SBTM with a large amount of training, and finally the SDE (Langevin).
Figure 11. Experiment 4.6, high-dimensional. Cosine similarity between the true score $ \nabla \log \pi $ and the learned score $ \nabla \log f_t $ over simulation time. Even with very little training, the model learns the score well. Different lines use different numbers of epochs per time step, effectively changing $ \eta $ in (3.8).
Table 1.
KL divergence (
| Method / Sample size | 100 | 300 | 1000 | 3000 | 10000 |
| SBTM (ours) | 0.013 | 0.0032 | 0.0019 | 0.0020 | 0.00099 |
| SDE | 0.020 | 0.022 | 0.0094 | 0.0036 | 0.0012 |
| SVGD | 0.33 | 0.23 | 0.16 | 0.20 | 0.29 |
Table 2.
KL divergence (
| Method / Sample size | 100 | 300 | 1000 | 3000 | 10000 |
| SBTM (ours) | 0.022 | 0.018 | 0.0082 | 0.0082 | 0.0036 |
| SDE | 0.029 | 0.013 | 0.014 | 0.0068 | 0.0043 |
| SVGD | 2.8 | 1.4 | 2.4 | 2.1 | 2.0 |
| Method / Annealing | Non-annealed | Geometric | Dilation |
| SBTM (ours) | 0.022 | 0.058 | 0.037 |
| SDE | 0.029 | 0.060 | 0.062 |
| SVGD | 2.800 | 0.470 | 0.480 |
[1] N. M. Boffi and E. Vanden-Eijnden, Probability flow solution of the Fokker–Planck equation, Mach. Learn.: Sci. Technol., 4 (2023), 35 pp. doi: 10.1088/2632-2153/ace2aa.
[2] O. Chehab and A. Korba, A Practical Diffusion Path for Sampling, preprint, 2024, arXiv: 2406.14040.
[3] J. Chemseddine, C. Wald, R. Duong and G. Steidl, Neural sampling from Boltzmann densities: Fisher-Rao curves in the Wasserstein geometry, preprint, 2024, arXiv: 2410.03282.
[4] S. Chen, S. Chewi, H. Lee, Y. Li, J. Lu and A. Salim, The probability flow ODE is provably fast, Adv. Neural Inf. Process. Syst., 36 (2024).
[5] S. Chen, S. Chewi, J. Li, Y. Li, A. Salim and A. R. Zhang, Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions, preprint, 2023, arXiv: 2209.11215.
[6] S. Chewi, Log-concave sampling, Available from: https://chewisinho.github.io.
[7] M. Corrales, S. Berti, B. Denel, P. Williamson, M. Aleardi and M. Ravasi, Annealed Stein variational gradient descent for improved uncertainty estimation in full-waveform inversion, Geophys. J. Int., 241 (2025), 1088-1113.
[8] A. Dalalyan and A. Karagulyan, User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient, Stochastic Processes Appl., 129 (2019), 5278-5311. doi: 10.1016/j.spa.2019.02.016.
[9] L. L. di Langosco, V. Fortuin and H. Strathmann, Neural Variational Gradient Descent, in Fourth Symposium on Advances in Approximate Bayesian Inference, 2022.
[10] A. Duncan, N. Nüsken and L. Szpruch, On the geometry of Stein variational gradient descent, J. Mach. Learn. Res., 24 (2023), 1-39.
[11] K. Elamvazhuthi, X. Zhang, M. Jacobs, S. Oymak and F. Pasqualetti, A Score-Based Deterministic Diffusion Algorithm with Smooth Scores for General Distributions, Proc. AAAI Conf. Artif. Intell., 38 (2024), 11866-11873. doi: 10.1609/aaai.v38i11.29072.
[12] F. Han, S. Osher and W. Li, Convergence of noise-free sampling algorithms with regularized Wasserstein proximals, preprint, 2024, arXiv: 2409.01567.
[13] Y. He and C. Zhang, On the query complexity of sampling from non-log-concave distributions, preprint, 2025, arXiv: 2502.06200.
[14] D. Z. Huang, J. Huang and Z. Lin, Convergence analysis of probability flow ODE for score-based generative models, IEEE Trans. Inform. Theory, 71 (2025), 4581-4601. doi: 10.1109/TIT.2025.3557050.
[15] Y. Huang and L. Wang, A score-based particle method for homogeneous Landau equation, J. Comput. Phys., 536 (2025), Paper No. 114053, 23 pp. doi: 10.1016/j.jcp.2025.114053.
[16] M. F. Hutchinson, A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines, Commun. Stat. Simul. Comput., 18 (1989), 1059-1076.
[17] A. Hyvärinen, Estimation of Non-Normalized Statistical Models by Score Matching, J. Mach. Learn. Res., 6 (2005), 695-709.
[18] V. Ilin, H. Wang, J. Wang and Z. Wang, Transport based particle methods for the Fokker–Planck–Landau equation, Commun. Math. Sci., 23 (2025), 1763-1788.
[19] R. Jordan, D. Kinderlehrer and F. Otto, The variational formulation of the Fokker–Planck equation, SIAM J. Math. Anal., 29 (1998), 1-17.
[20] K. Karhadkar, M. Murray and G. Montufar, Bounds for the smallest eigenvalue of the NTK for arbitrary spherical data of arbitrary dimension, Adv. Neural Inf. Process. Syst., 37 (2024), 138197-138249.
[21] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE, 86 (1998), 2278-2324. doi: 10.1109/5.726791.
[22] Q. Liu, Stein variational gradient descent as gradient flow, preprint, 2017, arXiv: 1704.07520.
[23] Q. Liu and D. Wang, Stein variational gradient descent: A general purpose Bayesian inference algorithm, Adv. Neural Inf. Process. Syst., 29 (2016), 9 pp.
[24] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li and J. Zhu, DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps, Adv. Neural Inf. Process. Syst., 35 (2022), 5775-5787.
[25] J. Lu, Y. Wu and Y. Xiang, Score-based transport modeling for mean-field Fokker-Planck equations, J. Comput. Phys., 503 (2024), Paper No. 112859, 19 pp. doi: 10.1016/j.jcp.2024.112859.
[26] B. Máté and F. Fleuret, Learning Interpolations between Boltzmann Densities, Mach. Learn. Res., (2023), 15 pp.
[27] R. M. Neal, Annealed importance sampling, Stat. Comput., 11 (2001), 125-139.
[28] J. Song, C. Meng and S. Ermon, Denoising Diffusion Implicit Models, in International Conference on Learning Representations, 2021.
[29] Y. Song and S. Ermon, Generative modeling by estimating gradients of the data distribution, Adv. Neural Inf. Process. Syst., 32 (2019), 23 pp.
[30] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon and B. Poole, Score-Based Generative Modeling through Stochastic Differential Equations, in International Conference on Learning Representations, 2021.
[31] H. Tan, Y. Ye, S. Osher and W. Li, Noise-free sampling algorithms via regularized Wasserstein proximals, Res. Math. Sci., 11 (2024), Paper No. 65, 32 pp.
[32] S. Vempala and A. Wibisono, Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices, Adv. Neural Inf. Process. Syst., 32 (2019), 53 pp.
[33] A. Wibisono, Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem, Conf. Learn. Theory, (2018), 2093-3027.
[34] C. Xu, X. Cheng, X. Liu and Y. Xie, Normalizing flow neural networks by JKO scheme, Adv. Neural Inf. Process. Syst., 36 (2023), 47379-47405.
[35] J. Zhuo, J. Liu, C. Shi, J. Zhu, J. Chen, N. Zhang and B. Zhang, Message passing Stein variational gradient descent, Int. Conf. Mach. Learn., (2018), 6018-6027.
Figure 1. Left: Langevin dynamics (stochastic). Right: ours (deterministic). The deterministic algorithm has the same marginal distributions as the stochastic one but with smooth trajectories. In this plot, both algorithms interpolate between the unit Gaussian and a mixture of two Gaussians.
Figure 2. Visualization of (GF ODE). Particles are pulled towards the target.
Figure 6. Experiment 4.4, noisy circle. SBTM (ours, top) leaves the vacuum region empty, while SDE (bottom) fills it, demonstrating the effect of determinism.
Figure 8. Experiment 4.5, well-separated 2D Gaussian mixture. The estimate in (3.4) holds empirically, indicating small score-matching loss and good score approximation in annealed dynamics.
Figure 9. Experiment 4.6, high-dimensional. Sampled MNIST digits using