Cluster Multi-GPU Support in DiffEqGPU The DiffEqGPU automated GPU parallelism tools now support multiple GPUs. The README now shows that one can do things like: # Setup processes with different CUDA devices using Distributed addprocs(numgpus) import CUDAdrv, CUDAnative let gpuworkers = asyncmap(collect(zip(workers(), CUDAdrv.devices()))) do (p, d) remotecall_wait(CUDAnative.device!, p, d) p end to setup each individual process with a separate GPU, and then the standard usage of DiffEqGPU.jl: function lorenz(du,u,p,t) @inbounds begin du = p*(u-u) du = u*(p-u) - u du = u*u - p*u end nothing end u0 = Float32[1.0;0.0;0.0] tspan = (0.0f0,100.0f0) p = (10.0f0,28.0f0,8/3f0) prob = ODEProblem(lorenz,u0,tspan,p) prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p) monteprob = EnsembleProblem(prob, prob_func = prob_func) @time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000, batch_size = 10_000, saveat=1.0f0) will now make use of these GPUs per batch of trajectories. We can see effective parallel solving of over 100,000 ODEs all simultaneously using this approach on just a few compute nodes! SciPy and deSolve (R) (+Updated MATLAB) Common Interface Bindings for Ease of Translation With the new SciPyDiffEq.jl, deSolveDiffEq.jl, and the update MATLABDiffEq.jl bindings, you can now solve common interface defined ordinary differential equations using the solver suites from Python, R, and MATLAB respectively. These libraries have been developed due to popular demand as a large influx of users from these communities who want to ensure that their Julia-translated models are correct. Now, one can install these solvers can double check their models with the original libraries to double check that the translation is correct. To see this in action, the following solves the Lorenz equations with SciPy’s solve_ivp’s RK45, deSolve’s (R) lsoda wrapper, and MATLAB’s ode45: using SciPyDiffEq, MATLABDiffEq, deSolveDiffEq function lorenz(u,p,t) du1 = 10.0(u-u) du2 = u*(28.0-u) - u du3 = u*u - (8/3)*u [du1, du2, du3] end tspan = (0.0,10.0) u0 = [1.0,0.0,0.0] prob = ODEProblem(lorenz,u0,tspan) sol = solve(prob,SciPyDiffEq.RK45()) sol = solve(prob,MATLABDiffEq.ode45()) sol = solve(prob,deSolveDiffEq.lsoda()) As an added bonus, this gives us a fairly simple way to track performance differences between the common ODE solver packages of each language. A new benchmark page is focused on cross language wrapper overhead and showcases the performance differences between these language’s differential equation suites on 4 ODE test problems (non-stiff and stiff). For example, on a system of 7 stiff ODEs, we see the following: which showcases the native Julia solvers as the fastest, benchmarking close to 50x faster than MATLAB, 100x faster than deSolve (R), and nearly 10,000x faster than SciPy. Thus, with these new tools, users can have a one line change to both ensure their models have translated correctly while understanding the true performance difference in their real-world context. Automated GPU-based Parameter Parallelism Support for Stiff ODEs and Event Handling DiffEqGPU now supports stiff ODEs through implicit and Rosenbrock methods, and callbacks (both ContinuousCallback and DiscreteCallback) are allowed. To see this in action, one could for example do the following: function lorenz(du,u,p,t) @inbounds begin du = p*(u-u) du = u*(p-u) - u du = u*u - p*u end nothing end u0 = Float32[1.0;0.0;0.0] tspan = (0.0f0,100.0f0) p = (10.0f0,28.0f0,8/3f0) function lorenz_jac(J,u,p,t) @inbounds begin σ = p ρ = p β = p x = u y = u z = u J[1,1] = -σ J[2,1] = ρ - z J[3,1] = y J[1,2] = σ J[2,2] = -1 J[3,2] = x J[1,3] = 0 J[2,3] = -x J[3,3] = -β end nothing end function lorenz_tgrad(J,u,p,t) nothing end func = ODEFunction(lorenz,jac=lorenz_jac,tgrad=lorenz_tgrad) prob_jac = ODEProblem(func,u0,tspan,p) prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p) monteprob_jac = EnsembleProblem(prob_jac, prob_func = prob_func) solve(monteprob_jac,Rodas5(linsolve=LinSolveGPUSplitFactorize()),EnsembleGPUArray(),dt=0.1,trajectories=10_000,saveat=1.0f0) solve(monteprob_jac,TRBDF2(linsolve=LinSolveGPUSplitFactorize()),EnsembleGPUArray(),dt=0.1,trajectories=10_000,saveat=1.0f0) This solves the Lorenz equations with Rosenbrock and implicit ODE solvers for 10,000 different parameters. On an example stiff ODE we’ve been testing (26 ODEs), a single RTX 2080 card was 5x faster than a multithreaded 16 core Xeon computer, meaning the time savings to do a parameter sweep with just one GPU can be tremendous, even (especially) on a stiff ODE. Stiff ODE Linear Solver Performance Improvements Thanks to Yingbo Ma (@YingboMa), our implicit ODE solvers got a pretty major improvement in certain stiff ODEs which have fast oscillatory terms. Now it’s hard to find a stiff ODE benchmark where a native Julia method isn’t performing the best, except for super large systems where Newton-Krylov methods are used. Our next goal is to better enhance the performance of our Newton-Krylov support. More Precise Package Maintenance: Strict Versioning and Bounds All of JuliaDiffEq now has upper bounds on its packages, along with CompatHelper installed so that every dependency change gets an automatic pull request and a notification to the JuliaDiffEq maintainers to inform us about changes in the wider Julia ecosystem. This should help us stay on top of all changes and keep the system stable. Next Directions Here’s some things to look forward to: Automated matrix-free finite difference PDE operators Jacobian reuse efficiency in Rosenbrock-W methods Native Julia fully implicit ODE (DAE) solving in OrdinaryDiffEq.jl High Strong Order Methods for Non-Commutative Noise SDEs Stochastic delay differential equations
This release covers the completion of another successful summer. We have now completed a new round of tooling for solving large stiff and sparse differential equations. Most of this is covered in the exciting…. New Tutorial: Solving Stiff Equations for Advanced Users! That is right, we now have a new tutorial added to the documentation on solving stiff differential equations. This tutorial goes into depth, showing how to use our recent developments to do things like automatically detect and optimize a solver with respect to sparsity pattern, or automatically symbolically calculate a Jacobian from a numerical code. This should serve as a great resource for the advanced users who want to know how to get started with those finer details like sparsity patterns and mass matrices. Automatic Colorization and Optimization for Structured Matrices As showcased in the tutorial, if you have jac_prototype be a structured matrix, then the colorvec is automatically computed, meaning that things like BandedMatrix are now automatically optimized. The default linear solvers make use of their special methods, meaning that DiffEq has full support for these structured matrix objects in an optimal manner. Implicit Extrapolation and Parallel DIRK for Stiff ODEs At the tail end of the summer, a set of implicit extrapolation methods were completed. We plan to parallelize these over the next year, seeing what can happen on small stiff ODEs if parallel W-factorizations are allowed. Automatic Conversion of Numerical to Symbolic Code with Modelingtoolkitize This is just really cool and showcased in the new tutorial. If you give us a function for numerically computing the ODE, we can now automatically convert said function into a symbolic form in order to compute quantities like the Jacobia and then build a Julia code for the generated Jacobian. Check out the new tutorial if you’re curious, because although it sounds crazy… this is now a standard feature! GPU-Optimized Sparse (Colored) Automatic and Finite Differentiation SparseDiffTools.jl and DiffEqDiffTools.jl were made GPU-optimized, meaning that the stiff ODE solvers now do not have a rate-limiting step at the Jacobian construction. DiffEqBiological.jl: Homotopy Continuation DiffEqBiological got support for automatic bifurcation plot generation by connecting with HomotopyContinuation.jl. See the new tutorial Greatly improved delay differential equation solving David Widmann (@devmotion) greatly improved the delay differential equation solver’s implicit step handling, along with adding a bunch of tests to show that it passes the special RADAR5 test suite! Color Differentiation Integration with Native Julia DE Solvers The ODEFunction, DDEFunction, SDEFunction, DAEFunction, etc. constructors now allow you to specify a color vector. This will reduce the number of f calls required to compute a sparse Jacobian, giving a massive speedup to the computation of a Jacobian and thus of an implicit differential equation solve. The color vectors can be computed automatically using the SparseDiffTools.jl library’s matrix_colors function. Thank JSoC student Langwen Huang (@huanglangwen) for this contribution. Improved compile times Compile times should be majorly improved now thanks to work from David Widmann (@devmotion) and others. Next Directions Our current development is very much driven by the ongoing GSoC/JSoC projects, which is a good thing because they are outputting some really amazing results! Here’s some things to look forward to: Automated matrix-free finite difference PDE operators Jacobian reuse efficiency in Rosenbrock-W methods Native Julia fully implicit ODE (DAE) solving in OrdinaryDiffEq.jl High Strong Order Methods for Non-Commutative Noise SDEs Stochastic delay differential equations
Let’s just jump right in! This time we have a bunch of new GPU tools and sparsity handling. (Breaking with Deprecations) DiffEqGPU: GPU-based Ensemble Simulations The MonteCarloProblem interface received an overhaul. First of all, the interface has been renamed to Ensemble. The changes are: MonteCarloProblem -> EnsembleProblem MonteCarloSolution -> EnsembleSolution MonteCarloSummary -> EnsembleSummary num_monte -> trajectories Specifying parallel_type has been deprecated and a deprecation warning is thrown mentioning this. So don’t worry: your code will work but will give warnings as to what to change. Additionally, the DiffEqMonteCarlo.jl package is no longer necessary for any of this functionality. Now, solve of a EnsembleProblem works on the same dispatch mechanism as the rest of DiffEq, which looks like solve(ensembleprob,Tsit5(),EnsembleThreads(),trajectories=n) where the third argument is an ensembling algorithm to specify the threading-based form. Code with the deprecation warning will work until the release of DiffEq 7.0, at which time the alternative path will be removed. See the updated ensembles page for more details The change to dispatch was done for a reason: it allows us to build new libraries specifically for sophisticated handling of many trajectory ODE solves without introducing massive new dependencies to the standard DifferentialEquations.jl user. However, many people might be interested in the first project to make use of this: DiffEqGPU.jl. DiffEqGPU.jl lets you define a problem, like an ODEProblem, and then solve thousands of trajectories in parallel using your GPU. The syntax looks like: monteprob = EnsembleProblem(my_ode_prob) solve(monteprob,Tsit5(),EnsembleGPUArray(),num_monte=100_000) and it will return 100,000 ODE solves. We have seen between a 12x and 90x speedup depending on the GPU of the test systems, meaning that this can be a massive improvement for parameter space exploration on smaller systems of ODEs. Currently there are a few limitations of this method, including that events cannot be used, but those will be solved shortly. Additional methods for GPU-based parameter parallelism are coming soon to the same interface. Also planned are GPU-accelerated multi-level Monte Carlo methods for faster weak convergence of SDEs. Again, this is utilizing compilation tricks to take the user-defined f and recompile it on the fly to a .ptx kernel, and generating kernel-optimized array-based formulations of the existing ODE solvers Automated Sparsity Detection Shashi Gowda (@shashigowda) implemented a sparsity detection algorithm which digs through user-defined Julia functions with Cassette.jl to find out what inputs influence the output. The basic version checks at a given trace, but a more sophisticated version, which we are calling Concolic Combinatoric Analysis, looks at all possible branch choices and utilizes this to conclusively build a Jacobian whose sparsity pattern captures the possible variable interactions. The nice part is that this functionality is very straightforward to use. For example, let’s say we had the following function: function f(dx,x,p,t) for i in 2:length(x)-1 dx[i] = x[i-1] - 2x[i] + x[i+1] end dx = -2x + x dx[end] = x[end-1] - 2x[end] nothing end If we want to find out the sparsity pattern of f, we would simply call: sparsity_pattern = sparsity!(f,output,input,p,t) where output is an array like dx, input is an array like x, p are possible parameters, and t is a possible t. The function will then be analyzed and sparsity_pattern will return a Sparsity type of I and J which denotes the terms in the Jacobian with non-zero elements. By doing sparse(sparsity_pattern) we can turn this into a SparseMatrixCSC with the correct sparsity pattern. This functionality highlights the power of Julia since there is no way to conclusively determine the Jacobian of an arbitrary program f using numerical techniques, since all sorts of scenarios lead to “fake zeros” (cancelation, not checking a place in parameter space where a branch is false, etc.). However, by directly utilizing Julia’s compiler and the SSA provided by a Julia function definition we can perform a non-standard interpretation that tells all of the possible numerical ways the program can act, thus conclusively determining all of the possible variable interactions. Of course, you can still specify analytical Jacobians and sparsity patterns if you want, but if you’re lazy… :) See SparsityDetection.jl’s README for more details. GPU Offloading in Implicit DE Solving We are pleased to announce the LinSolveGPUFactorize option which allows for automatic offloading of linear solves to the GPU. For a problem with a large enough dense Jacobian, using linsolve=LinSolveGPUFactorize() will now automatically perform the factorization and back-substitution on the GPU, allowing for better scaling. For example: using CuArrays Rodas5(linsolve = LinSolveGPUFactorize()) This simply requires a working installation of CuArrays.jl. See the linear solver documentation for more details. Experimental: Automated Accelerator (GPU) Offloading We have been dabbling in allowing automated accelerator (GPU, multithreading, distributed, TPU, etc.) offloading when the right hardware is detected and the problem size is sufficient to success a possible speedup. A working implementation exists as a PR for DiffEqBase which would allow automated acceleration of linear solves in implicit DE solving. However, this somewhat invasive of a default, and very architecture dependent, so it is unlikely we will be releasing this soon. However, we are investigating this concept in more detail in the AutoOffload.jl. If you’re interested in Julia-wide automatic acceleration, please take a look at the repo and help us get something going! A Complete Set of Iterative Solver Routines for Implicit DEs Previous releases had only a pre-built GMRES implementation. However, as detailed on the linear solver page, we now have an array of iterative solvers readily available, including: LinSolveGMRES – GMRES LinSolveCG – CG (Conjugate Gradient) LinSolveBiCGStabl – BiCGStabl Stabilized Bi-Conjugate Gradient LinSolveChebyshev – Chebyshev LinSolveMINRES – MINRES These are all compatible with matrix-free implementations of a AbstractDiffEqOperator. Exponential integrator improvements Thanks to Yingbo Ma (@YingboMa), the exprb methods have been greatly improved. Next Directions Our current development is very much driven by the ongoing GSoC/JSoC projects, which is a good thing because they are outputting some really amazing results! Here’s some things to look forward to: Automated matrix-free finite difference PDE operators Surrogate optimization Jacobian reuse efficiency in Rosenbrock-W methods Native Julia fully implicit ODE (DAE) solving in OrdinaryDiffEq.jl High Strong Order Methods for Non-Commutative Noise SDEs GPU-Optimized Sparse (Colored) Automatic Differentiation Parallelized Implicit Extrapolation of ODEs
DifferentialEquations.jl v6.6.0: Sparse Jacobian Coloring, Quantum Computer ODE Solvers, and Stiff SDEs
Sparsity Performance: Jacobian coloring with numerical and forward differentiation If you have a function f!(du,u) which has a Tridiagonal Jacobian, you could calculate that Jacobian by mixing perturbations. For example, instead of doing u .+ [epsilon,0,0,0,0,0,0,0,...], you’d do u .+ [epsilon,0,0,epsilon,0,0,...]. Because the epsilons will never overlap, you can then decode this “compressed” Jacobian into the sparse form. Do that 3 times and boom, full Jacobian in 4 calls to f! no matter the size of u! Without a color vector, this matrix would take 1+length(u) f! calls, so I’d say that’s a pretty good speedup. This is called Jacobian coloring. [1,2,3,1,2,3,1,2,3,...] are the colors in this example, and places with the same color can be differentiated simultaneously. Now, the DiffEqDiffTools.jl internals allow for passing a color vector into the numerical differentiation libraries and automatically decompressing into a sparse Jacobian. This means that DifferentialEquations.jl will soon be compatible with this dramatic speedup technique. In addition, other libraries in Julia with rely on our utility libraries, like Optim.jl, could soon make good use of this. What if you don’t know a good color vector for your Jacobian? No sweat! The soon to be released SparseDiffTools.jl repository has methods for automatically generating color vectors using heuristic graphical techniques. DifferentialEquations.jl will soon make use of this automatically if you specify a sparse matrix for your Jacobian! Note that the SparseDiffTools.jl repository also includes functions for calculating the sparse Jacobians using color vectors and forward-mode automatic differentiation (using Dual numbers provided by ForwardDiff.jl). In this case, the number of Dual partials is equal to the number of colors, which can be dramatically lower than the length(u) (the dense default!), thereby dramatically reducing compile and run time. Stay tuned for the next releases which begin to auto-specialize everything along the way based on sparsity structure. Thanks to JSoC student Pankaj (@pkj-m) for this work. Higher weak order SROCK methods for stiff SDEs Deepesh Thakur (@deeepeshthakur) continues his roll with stiff stochastic differential equation solvers by implementing not 1 but 7 new high weak order stiff SDE solvers. SROCK1 with generalized noise, SKSROCK, and a bunch of variants of SROCK2. Benchmark updates will come soon, but I have a feeling that these new methods may be by far the most stable methods in the library, and the ones which achieve the lowest error in the mean solution most efficiently. DiffEqBot GSoC student Kanav Gupta (@kanav99) implemented a bot for the JuliaDiffEq team that allows us to run performance regression benchmarks on demand with preset Gitlab runners. Right now this has a dedicated machine for CPU and parallelism performance testing, and soon we’ll have a second machine up and running for performance testing on GPUs. If you haven’t seen the Julialang blog post on this topic, please check it out!. Quantum ODE Solver QuLDE If you happen to have a quantum computer handy, hold your horses. QuLDE from QuDiffEq.jl is an ODE solver designed for quantum computers. It utilizes the Yao.jl quantum circuit simulator to run, but once Yao.jl supports QASM then this will compile to something compatible with (future) quantum computing hardware. This means that, in order to enter the new age of computing, all you have to do is change solve(prob,Tsit5()) to solve(prob,QuLDE()) and you’re there. Is it practical? Who knows (please let us know). Is it cool? Oh yeah! See the quantum ODE solver blog post for more details. Commutative Noise GPU compatibility The commutative noise SDE solvers are now GPU-compatible thanks to GSoC student Deepesh Thakur (@deeepeshthakur). The next step will be to implement high order non-commutative noise SDE solvers and the associated iterated integral approximations in a manner that is GPU-compatible. New benchmark and tutorial repository setups DiffEqBenchmarks.jl and DiffEqTutorials.jl are now fully updated to a Weave.jl form. We still need to fix up a few benchmarks, but it’s in a state that is ready for new contributions. Optimized multithreaded extrapolation The GBS extrapolation methods have gotten optimized, and they now are the one of the most efficient methods at lower tolerances of the Float64 range for non-stiff ODEs: Thank you to Konstantin Althaus (@AlthausKonstantin) for contributing the first version of this algorithm and GSoC student Saurabh Agarwal (@saurabhkgp21) for adding automatic parallelization of the method. This method will soon see improvements as multithreading will soon be improved in Julia v1.2. The new PARTR features will allow our internal @threads loop to perform dynamic work-stealing which will definitely be a good improvement to the current parallelism structure. So stay tuned: this will likely benchmark even better in a few months. Fully non-allocating exp! in exponential integrators Thanks to Yingbo Ma (@YingboMa) for making the internal exp calls of the exponential integrators non-allocating. Continued improvements to this category of methods is starting to show promise in the area of semilinear PDEs. Rosenbrock-W methods JSoC student Langwen Huang (@huanglangwen) has added the Rosenbrock-W class of methods to OrdinaryDiffEq.jl. These methods are like the Rosenbrock methods but are able to reuse their W matrix for multiple steps, allowing the method to scale to larger ODEs more efficiently. Since the Rosenbrock methods benchmark as the fastest methods for small ODEs right now, this is an exciting new set of methods which will get optimized over the course of the summer. Efficient Jacobian reuse techniques and the ability to utilize the sparse differentiation tooling are next on this project. Next Directions Our current development is very much driven by the ongoing GSoC/JSoC projects, which is a good thing because they are outputting some really amazing results! Here’s some things to look forward to: Higher order SDE methods for non-commutative noise Parallelized methods for stiff ODEs Integration of sparse colored differentiation into the differential equation solvers Jacobian reuse efficiency in Rosenbrock-W methods Exponential integrator improvements Native Julia fully implicit ODE (DAE) solving in OrdinaryDiffEq.jl Automated matrix-free finite difference PDE operators Surrogate optimization GPU-based Monte Carlo parallelism
Well, we zoomed towards this one. In this release we have a lot of very compelling new features for performance in specific domains. Large ODEs, stiff SDEs, high accuracy ODE solving, many callbacks, etc. are all specialized on and greatly improved in this PR.