Stratus3D

A blog on software engineering by Trevor Brown

Debugging a Slow Starting Elixir Application

I recently had to fix an Elixir service that was slow to start. I was able to pinpoint the issue with only a few commands and I want to share a couple of the things I learned.

Dependencies

In Elixir all dependencies are "applications". The term "application" means something different than it does outside of Elixir. In Elixir an "application" is a set of modules and behaviors. Some of these applications define their own supervision trees and must be started by Application.start/2 before they can be used. When you start your Elixir service, either via Mix or a generated Elixir release, the dependencies you specified in your mix.exs file will be started before your own code is started. If an application listed as a dependency is slow to start your application must wait until the dependency is running before it can be started.

While the behavior is simple it is recursive. Each application has its own set of dependencies that must be running before that application can be started, and some of those dependencies have dependencies of their own that must be running before they can start. This results in a dependency tree structure. To illustrate this with a little ASCII:

- your_app
  - dependency_1
    - hidden_dependency_1
    - hidden_dependency_2
  - dependency_2
    - hidden_dependency_3

For this application, the Erlang VM would likely start these applications in this order:

  1. hidden_dependency_3

  2. dependency_2

  3. hidden_dependency_2

  4. hidden_dependency_1

  5. dependency_1

  6. your_app

The application I had to fix had a lot of dependencies. Profiling each application would be tedious and time-consuming, and I had a hunch there was probably a single dependency that was the problem. Turns out it’s pretty easy to write a little code that times the start up of each application.

Profiling

Start an IEx shell the --no-start flag so that the application is available but not yet loaded or started:

iex -S mix run --no-start

Then load this code into the shell:

defmodule StartupBenchmark do
  def run(application) do
    complete_deps = deps_list(application) # (1)

    dep_start_times = Enum.map(complete_deps, fn(app) -> # (2)
      case :timer.tc(fn() -> Application.start(app) end) do
        {time, :ok} -> {time, app}
        # Some dependencies like :kernel may have already been started, we can ignore them
        {time, {:error, {:already_started, _}}} -> {time, app}
        # Raise an exception if we get an non-successful return value
        {time, error} -> raise(error)
      end
    end)

    dep_start_times
    |> Enum.sort() # (3)
    |> Enum.reverse()
  end

  defp deps_list(app) do
    # Get all dependencies for the app
    deps = Application.spec(app, :applications)

    # Recursively call to get all sub-dependencies
    complete_deps = Enum.map(deps, fn(dep) -> deps_list(dep) end)

    # Build a complete list of sub dependencies, with the top level application
    # requiring them listed last, also remove any duplicates
    [complete_deps, [app]]
    |> List.flatten()
    |> Enum.uniq()
  end
end

To highlight the important pieces from this module:

  1. Recursively get all applications that must be started in the order they need to be started in.

  2. Start each application in order; timing each one.

  3. Sort applications by start time so the slowest application is the first item in the list.

With this code finding applications that are slow to start is easy:

> StartupBenchmark.run(:your_app)
[
  {6651969, :prometheus_ecto},
  {19621, :plug_cowboy},
  {14336, :postgrex},
  {13598, :ecto_sql},
  {5123, :yaml_elixir},
  {3871, :phoenix_live_dashboard},
  {1159, :phoenix_ecto},
  {123, :prometheus_plugs},
  {64, :ex_json_logger},
  {56, :prometheus_phoenix},
  {56, :ex_ops},
  {36, :kernel},
  ...
]

These times are in microseconds so in this case prometheus_ecto is taking 6.6 seconds to start. All other applications are taking less than 20 milliseconds to start and many of them are taking less than 1 millisecond to start. prometheus_ecto is the culprit here.

Conclusion

With the code above I was able to identify prometheus_ecto as the problem. With this information I was then able to use eFlambe and a few other tools to figure out why prometheus_ecto was so slow and quickly fixed the issue.

I hope the snippet of code above will be helpful to some of you. If you like reading my blog posts please subscribe to my newsletter. I send emails out once a month with my latest posts.