Diffusion models have become one of the most exciting ideas in modern generative AI, and at their core is a beautiful interplay between randomness and structure described by stochastic differential equations (SDEs). In this assignment, we dive into that world: from exploring classical processes like Brownian motion and the Ornstein–Uhlenbeck dynamics, to simulating trajectories with Euler–Maruyama, to building neural networks that learn how data is gradually corrupted and then reconstructed. By moving from simple 1D SDEs all the way to UNet-based image models, we get to experience how mathematical theory, numerical simulation, and deep learning come together to form the foundation of today’s powerful diffusion and flow-based generative models.

Your tasks are to
• Get familiar with SDEs
• Set up a training/inference pipeline for generative models
• Sample from a 2D distribution using a flow model
• Define a UNet architecture for image generation
• Generate images from a distribution of your choice

1. The theoretical basis for diffusion models is stochastic differential equations (SDEs). An SDE is a differential equation symbolically of the form

   dX_t = u_t(X_t) dt + σ_t dW_t,   X_0 = x_0,   (1)

where X_t is the trajectory, u_t is the vector field or drift coefficient, σ_t is the diffusion coefficient determining the amount of noise, and W_t is a Brownian motion. The term σ_t dW_t is what gives rise to the stochasticity (randomness) in the system – without it the system reduces to an ordinary differential equation (ODE)

   dX_t/dt = u_t(X_t).

(a) (1p) Different choices of u_t yield different stochastic processes that can be used to model many different physical phenomena. The Ornstein–Uhlenbeck (OU) process is given by

   u_t(X_t) = −θX_t,   σ_t = σ,

for some constants θ, σ. It can be used to model massive particles under the influence of friction.
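As a quick illustration of how such an SDE behaves, here is a minimal simulation sketch of the OU process, using the simple discretization that part (b) below introduces as the Euler–Maruyama method. All function names and parameter values here are placeholders for illustration, not part of the assignment:

```python
import numpy as np

def euler_maruyama(u, sigma, x0, n_steps, h, rng=None):
    """Simulate dX_t = u(X_t) dt + sigma dW_t, returning the whole trajectory."""
    rng = rng or np.random.default_rng(0)
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        # deterministic drift step + Gaussian increment scaled by sqrt(h)
        x = x + h * u(x) + np.sqrt(h) * sigma * rng.standard_normal()
        xs.append(x)
    return np.array(xs)

# OU process: the drift -theta * x pulls trajectories toward 0, while the noise
# keeps them fluctuating around it with stationary variance sigma^2 / (2 * theta).
theta, sigma = 1.0, 0.5
traj = euler_maruyama(lambda x: -theta * x, sigma, x0=3.0, n_steps=1000, h=0.01)
```

The same function with `u = lambda x: 0` produces the Brownian motions of part (c).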
Show that the OU process is equivalent to Langevin dynamics, defined by the drift coefficient

   u_t(X_t) = (1/2) σ_t² ∇ log p(X_t),   σ_t = σ,

when p(X_t) = N(0, σ²/(2θ)).

(b) (1p) Being able to simulate SDEs will be needed to sample new data points from generative models. The simplest solver for SDEs is the Euler–Maruyama method, defined as

   X_{t+h} = X_t + h u_t(X_t) + √h σ_t ϵ_t,   ϵ_t ∼ N(0, I),   (2)

where h is the step size. This reduces to the Euler method for ODEs when σ = 0. Implement the Euler–Maruyama method.

(c) (1p) Simulate 10 Brownian motions, defined by σ_t = σ, u_t = 0, X_0 = 0. What happens for different values of σ?

(d) (2p) Simulate the OU process defined above. Run it with a variety of different initial points X_0. Pay attention to the ratio σ²/(2θ), and comment on the convergence behavior of the solutions. Are they approaching a particular point or a distribution?

2. A diffusion model is an SDE where the vector field u_t is parametrized by a learnable neural network u_t^θ. If there is no diffusion term, so that the SDE reduces to an ODE, we get a so-called flow model. The network u_t^θ is trained on data points X_1 ∼ p_data that have been corrupted with varying amounts of noise corresponding to different time steps X_t, t ∈ [0, 1], of the SDE/ODE. By simulating the SDE forwards from t = 0 to t = 1 with our trained vector field, starting from pure noise X_0 ∼ p_noise, we can generate new data samples from p_data. In a common type of generative model, called a denoising diffusion model, the conditional path (which describes how a data point z is corrupted) is given by

   x_t = α_t z + β_t ϵ,   ϵ ∼ N(0, I),

where α_t, β_t are continuously differentiable and monotonic noise schedulers with α_0 = β_1 = 0 and α_1 = β_0 = 1. This implies x_t ∼ p_t(· | z) = N(α_t z, β_t² I).

(a) (1p) Implement the linear noise schedulers α_t = t, β_t = 1 − t.

(b) (1p) Implement the conditional path that, given a data point z and a time t, returns the corrupted data point x_t = α_t z + β_t ϵ.

(c) (1p) Define a 2D toy distribution (e.g.
a Gaussian mixture) p_data. Simulate the corruption process on 1000 samples from p_data for t = 0, 0.25, 0.50, 0.75, 1, and plot one 2D histogram per time point.

(d) (1p) Modify the noise schedulers so that they are non-linear (while still satisfying the requirements stated above). Then generate a plot analogous to part (c) using the same data points z corrupted under this new schedule, and discuss how the change in scheduling influences the corruption process.

3. There are many ways to train generative models. In this case, we will minimize the conditional flow matching loss given by

   L(θ) = E_{t∼U[0,1], z∼p_data, x∼p_t(·|z)} [ ‖u_t^θ(x) − u_t(x | z)‖² ].

(a) (1p) The conditional vector field is given by

   u_t(x | z) = (α̇_t − (β̇_t/β_t) α_t) z + (β̇_t/β_t) x.

Find a simplified expression for u_t(x_t | z) when x_t is drawn from the conditional path p_t(· | z), and implement it.

(b) (1p) Implement an MLP architecture that takes (x, t) ∈ R³ as input and outputs the estimated vector field u_t^θ(x_t) ∈ R².

(c) (1p) Implement a training loop according to Algorithm 1, and train an MLP with 4 hidden layers of dimension 64 on the 2D toy dataset. Choose noise schedulers α_t, β_t to your liking (fulfilling the needed criteria), and state what you used. Does the loss converge?

(d) (1p) Let the number of time steps n_t = 1000, and sample 300 realizations from p_data using the Euler method (σ_t = 0) applied to the ODE parametrized by your trained vector field. Plot a 2D scatter plot of 300 corrupted data points according to the true conditional path for t = 0, 0.25, 0.50, 0.75, 1. Provide a similar plot for the 300 generated data points, together with a plot of the simulated trajectories X_t. How does the choice of n_t affect the sampling?

Algorithm 1: Conditional flow matching
Require: data set with samples from p_data, vector field u_t^θ
for each batch do
   Sample data point z ∼ p_data.
   Sample random time t ∼ U[0, 1].
   Sample noise ϵ ∼ N(0, I).
   Set x_t = α_t z + β_t ϵ.
   Compute loss L(θ) = ‖u_t^θ(x_t) − u_t(x_t | z)‖².
   Update model parameters θ using a gradient step on L(θ).
end

4. Let us now turn to image generation. To handle high-dimensional image data, we need a different architecture than an MLP to parameterize our vector field. We will use the famous UNet architecture, with some modification to allow for the time embedding (the network must be informed about the current time t, and for CNNs this is not as straightforward as for MLPs, where we simply feed t as an additional input).

(a) (1p) Select a few data points z from your choice of image data set, and provide a plot of the corresponding corrupted data points x_t for time points t = 0, 0.25, 0.50, 0.75, 1.

(b) (3p) Implement the residual layer of the UNet architecture and add the pre-defined time embedding according to Figure 1. Train a vector field over 5000 epochs on the image data, with a batch size of 250, using your UNet.

(c) (1p) Sample a couple of images from the image distribution by solving the flow ODE dX_t = u_t^θ(X_t) dt, using your trained UNet as the drift term.

(d) (1p) Play around with different noise schedulers and comment on how they affect the sample quality.

5. So far we have only trained flow models (no diffusion term). Via the Fokker–Planck equation, it can be shown that the SDE

   dX_t = [u_t^θ(X_t) + (σ_t²/2) ∇ log p_t(X_t)] dt + σ_t dW_t   (3)

has the same probability paths p_t as the flow ODE dX_t/dt = u_t^θ(X_t). Hence, to convert the flow model into a diffusion model, we have to learn the second drift term ∇ log p_t(X_t), called the score function. This can be done by training a network s_t^θ using the conditional score matching loss

   L(θ) = E_{t∼U[0,1], z∼p_data, x∼p_t(·|z)} [ ‖s_t^θ(x) − ∇ log p_t(x | z)‖² ].
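Since p_t(· | z) = N(α_t z, β_t² I) is Gaussian, its log-density is quadratic in x and the conditional score has a simple closed form. A minimal sketch (the scheduler choice and function name here are illustrative, not prescribed by the assignment):

```python
import numpy as np

def conditional_score(x, z, t):
    """Score grad_x log p_t(x | z) for p_t(. | z) = N(alpha_t z, beta_t^2 I),
    shown here with the linear schedulers alpha_t = t, beta_t = 1 - t."""
    alpha_t, beta_t = t, 1.0 - t
    # log N(x; m, b^2 I) = -||x - m||^2 / (2 b^2) + const, so the gradient is -(x - m) / b^2
    return -(x - alpha_t * z) / beta_t**2
```

The score vanishes at x = α_t z and points back toward α_t z elsewhere; note that the division by β_t² blows up as t → 1, which is why samplers typically stop just short of the endpoint.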
[Figure: UNet block diagram — Initial Conv → Encoder ×3 → Bottleneck → Decoder ×3 → Final Conv, mapping (x_t, t) to u_t^θ(x_t); the time t is embedded via an MLP and added inside each residual layer (SiLU + BatchNorm + Conv, twice, with a residual connection).]

Figure 1: UNet architecture together with a detailed overview of the residual layer. Each encoder consists of two residual layers followed by downsampling. Each decoder consists of upsampling followed by two residual layers. The bottleneck module consists of three residual layers.

(a) (3p) Derive an explicit formula for the conditional score function ∇ log p_t(x | z) from the conditional path, and implement it. Then train a score network s_t^θ on the image dataset using the conditional score matching loss.

(b) (3p) Sample new data with the diffusion model by combining the trained vector field u_t^θ and the score network s_t^θ, and simulate the SDE in (3). How does the value of σ_t affect the samples?

Final question: Did you use an AI tool (other than the machine learning models you trained in this exercise) for anything other than information searching when solving this problem set? If so, please write a brief statement of how you used AI.

Total number of points: 25

Motivate your answers wherever applicable. Good luck!
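For reference, the sampling step in question 5(b) amounts to running Euler–Maruyama on the SDE in (3). A minimal sketch, where the callables `u_theta` and `s_theta` stand in for your trained networks:

```python
import numpy as np

def sample_sde(u_theta, s_theta, sigma, x0, n_steps=1000, rng=None):
    """Simulate dX_t = (u_theta + sigma^2/2 * s_theta) dt + sigma dW_t from t=0 to t=1."""
    rng = rng or np.random.default_rng(0)
    h = 1.0 / n_steps
    x = np.asarray(x0, dtype=float)
    for i in range(n_steps):
        t = i * h
        # combined drift: learned vector field plus score correction from (3)
        drift = u_theta(x, t) + 0.5 * sigma**2 * s_theta(x, t)
        x = x + h * drift + np.sqrt(h) * sigma * rng.standard_normal(x.shape)
    return x
```

With sigma = 0 this collapses to the Euler method for the flow ODE, so the same sampler covers both the flow and the diffusion variants.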