Performance Improvements in ASP.NET Core 8
ASP.NET Core 8 and .NET 8 bring many exciting performance improvements. In this blog post, we will highlight some of the enhancements made in ASP.NET Core and show you how they can boost your web app’s speed and efficiency. This is a continuation of last year’s post on Performance improvements in ASP.NET Core 7. And, of course, it continues to be inspired by Performance Improvements in .NET 8. Many of those improvements either indirectly or directly improve the performance of ASP.NET Core as well.
Benchmarking Setup
We will use BenchmarkDotNet for many of the examples in this blog post.
To set up a benchmarking project:
- Create a new console app (`dotnet new console`)
- Add a NuGet reference to BenchmarkDotNet (`dotnet add package BenchmarkDotNet`), version 0.13.8+
- Change `Program.cs` to `var summary = BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run();`
- Add the benchmarking code snippet below that you want to run
- Run `dotnet run -c Release` and enter the number of the benchmark you want to run when prompted
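Putting those steps together, a complete `Program.cs` might look like the following sketch. The benchmark class here is a placeholder of our own, not one from this post; replace it with whichever snippet you want to run:

```csharp
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Lists the benchmarks in this assembly and runs the one you pick.
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

// Placeholder benchmark; swap in a snippet from this post.
[MemoryDiagnoser]
public class Utf8DecodeBenchmark
{
    private readonly byte[] _payload = "hello world"u8.ToArray();

    [Benchmark]
    public string DecodeUtf8() => Encoding.UTF8.GetString(_payload);
}
```

`[MemoryDiagnoser]` adds the Gen 0/Allocated columns you'll see in the result tables throughout this post.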
Some of the benchmarks test internal types, so a self-contained benchmark can't be written. In those cases we'll either reference numbers obtained by running the benchmarks in the repository (and link to the code in the repository), or we'll provide a simplified example to showcase what the improvement is doing.
There are also some cases where we reference our end-to-end benchmarks, which are public at https://aka.ms/aspnet/benchmarks, although we only display the last few months of data so that the page loads in a reasonable amount of time.
Servers
We have three server implementations in ASP.NET Core: Kestrel, Http.Sys, and IIS. The latter two are only usable on Windows and share a lot of code. Server performance is extremely important because the server is what processes incoming requests and forwards them to your application code. The faster we can process a request, the faster you can start running application code.
Kestrel
Header parsing is one of the first parts of processing done by a server for every request, which means its performance is critical to getting requests to your application code as fast as possible. In Kestrel we read bytes off the connection into a `System.IO.Pipelines.Pipe`, which is essentially a list of `byte[]`s. When parsing headers we read from that list of `byte[]`s and have two different code paths: one for when the full header is inside a single `byte[]`, and another for when a header is split across multiple `byte[]`s.
dotnet/aspnetcore#45044 updated the second (slower) code path to avoid allocating a `byte[]` when parsing the header, and optimized our `SequenceReader` usage to work mostly with the underlying `ReadOnlySequence<byte>`, which can be faster in some cases.
This resulted in a ~18% performance improvement for multi-span headers and made the path allocation free, which helps reduce GC pressure. The following microbenchmark uses internal types in Kestrel and isn't easy to isolate as a minimal sample. For those interested, it lives with the Kestrel source code and was run before and after the change.
Method | Mean | Op/s | Gen 0 | Allocated |
---|---|---|---|---|
MultispanUnicodeHeader – Before | 573.8 ns | 1,742,893.2 | – | 48 B |
MultispanUnicodeHeader – After | 484.9 ns | 2,062,450.8 | – | – |
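To illustrate the multi-span case without Kestrel's internal types, the sketch below (the `Segment` helper is ours, not Kestrel's) builds a `ReadOnlySequence<byte>` whose payload is split across two buffers, the shape header bytes take when they straddle pipe segments. `SequenceReader<byte>.UnreadSequence` exposes the remaining bytes as a `ReadOnlySequence<byte>` that can be parsed across segments without copying into a temporary `byte[]`:

```csharp
using System.Buffers;
using System.Text;

// A header name split across two buffers, simulating the multi-span path.
var first = new Segment(Encoding.ASCII.GetBytes("Content-"));
var last = first.Append(Encoding.ASCII.GetBytes("Length"));
var sequence = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);

var reader = new SequenceReader<byte>(sequence);
// UnreadSequence hands back the remaining bytes as a ReadOnlySequence<byte>,
// so code can compare/parse across segments without a byte[] allocation.
Console.WriteLine(reader.UnreadSequence.Length); // 14

// Minimal linked-segment implementation for building a multi-segment sequence.
sealed class Segment : ReadOnlySequenceSegment<byte>
{
    public Segment(byte[] data) => Memory = data;

    public Segment Append(byte[] data)
    {
        var next = new Segment(data) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}
```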
Below is an allocation profile from an end-to-end benchmark we run on our CI showing the difference with this change. We reduced the `byte[]` allocations of the scenario by 73%, from 7.8GB to 2GB (over the lifetime of the benchmark run).
dotnet/aspnetcore#48368 replaced some internal custom vectorized code for ASCII comparison checks with the new `Ascii` class in .NET 8. This allowed us to remove ~400 lines of code and take advantage of improvements like AVX512 and ARM AdvSIMD that are implemented in the `Ascii` code but weren't in Kestrel's implementation.
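As a standalone illustration (ours, not Kestrel's actual code), the `System.Text.Ascii` helpers validate and compare ASCII bytes directly, with the vectorized paths handled inside the BCL:

```csharp
using System.Text;

byte[] headerName = "Content-Length"u8.ToArray();

// Ascii helpers are vectorized internally (including AVX512/AdvSIMD paths),
// so callers no longer need hand-rolled SIMD comparison code.
bool isAscii = Ascii.IsValid(headerName);                            // true
bool matches = Ascii.EqualsIgnoreCase(headerName, "content-length"); // true

Console.WriteLine($"{isAscii} {matches}");
```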
Http.Sys
Near the end of 7.0 we removed some extra thread pool dispatching in Kestrel, which improved performance significantly; more details are in last year's performance post. At the beginning of 8.0 we made similar changes to the Http.Sys server in dotnet/aspnetcore#44409. This improved our JSON end-to-end benchmark by 11%, from ~469k to ~522k RPS.
Another change we made especially affects large responses over higher-latency connections. dotnet/aspnetcore#47776 adds an on-by-default option to enable kernel-mode response buffering. This allows application writes to be buffered in the OS layer regardless of whether the client connection has acked previous writes, and the OS can then optimize sending the data by parallelizing writes and/or sending larger chunks at a time. The benefits are clear on connections with higher latency.
To show a specific example, we hosted a server in Sweden and a client on the US West Coast to create some latency in the connection. The following server code was used:
var builder = WebApplication.CreateBuilder(args);
builder.WebHost.UseHttpSys(options =>
{
options.UrlPrefixes.Add("http://+:12345");
options.Authentication.Schemes = AuthenticationSchemes.None;
options.Authentication.AllowAnonymous = true;
options.EnableKernelResponseBuffering = true; // <-- new setting in 8.0
});
var app = builder.Build();
app.UseRouting();
app.MapGet("/file", () =>
{
return TypedResults.File(File.Open("pathToLargeFile", FileMode.Open, FileAccess.Read));
});
app.Run();
The latency was around 200ms round-trip between client and server, and the server responded to client requests with a 212MB file.
With `HttpSysOptions.EnableKernelResponseBuffering` set to `false`, the file download took ~11 minutes; with it set to `true`, the download took ~30 seconds. That's a massive improvement, ~22x faster in this specific scenario!
More details on how response buffering works can be found in this blog post.
dotnet/aspnetcore#44561 refactors the internals of response writing in Http.Sys to remove a bunch of `GCHandle` allocations, and conveniently removes a `List<GCHandle>` that was used to track handles for freeing. It does this by allocating `NativeMemory` and writing headers directly to it. By not pinning managed memory we reduce GC pressure and help reduce heap fragmentation. A downside is that we need to be extra careful to free the memory, because the allocations are no longer tracked by the GC.
Running a simple web app and tracking `GCHandle` usage shows that in 7.0 a small response with 4 headers used 8 `GCHandle`s per request, plus 2 more `GCHandle`s per additional header. In 8.0 the same app uses only 4 `GCHandle`s per request, regardless of the number of headers.
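A minimal sketch of the idea (not the actual Http.Sys code): allocate unmanaged memory for the header bytes, hand it to the native API, and free it explicitly, instead of pinning a managed array with a `GCHandle`. The sketch uses `Marshal.AllocHGlobal` to keep the sample free of `unsafe` pointers; the real change uses `NativeMemory`, which works with raw pointers but follows the same allocate/write/free pattern:

```csharp
using System.Runtime.InteropServices;
using System.Text;

byte[] header = Encoding.ASCII.GetBytes("Server: Http.Sys\r\n");

// An allocation the GC knows nothing about: nothing to pin, no heap
// fragmentation from pinned objects, but we must free it ourselves.
IntPtr buffer = Marshal.AllocHGlobal(header.Length);
try
{
    Marshal.Copy(header, 0, buffer, header.Length);
    // ... pass (buffer, header.Length) to the native write API here ...
}
finally
{
    Marshal.FreeHGlobal(buffer); // forgetting this would leak native memory
}
```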
dotnet/aspnetcore#45156 by @ladeak improved the implementation of `HttpContext.Request.Headers.Keys` and `HttpContext.Request.Headers.Count` in Http.Sys, which is also the same implementation used by IIS, so double win. Before, those properties had generic implementations that used `IEnumerable` and LINQ expressions. Now they manually count and minimize allocations, making accessing `Count` completely allocation free.
This benchmark uses internal types, so I’ll link to the microbenchmark source instead of providing a standalone microbenchmark.
Before:
Method | Mean | Op/s | Gen 0 | Allocated |
---|---|---|---|---|
CountSingleHeader | 381.3 ns | 2,622,896.1 | 0.0010 | 176 B |
CountLargeHeaders | 3,293.4 ns | 303,639.9 | 0.0534 | 9,032 B |
KeysSingleHeader | 483.5 ns | 2,068,299.5 | 0.0019 | 344 B |
KeysLargeHeaders | 3,559.4 ns | 280,947.4 | 0.0572 | 9,648 B |
After:
Method | Mean | Op/s | Gen 0 | Allocated |
---|---|---|---|---|
CountSingleHeader | 249.1 ns | 4,014,316.0 | – | – |
CountLargeHeaders | 278.3 ns | 3,593,059.3 | – | – |
KeysSingleHeader | 506.6 ns | 1,974,125.9 | – | 32 B |
KeysLargeHeaders | 1,314.6 ns | 760,689.5 | 0.0172 | 2,776 B |
Native AOT
Native AOT was first introduced in .NET 7 and worked only with console applications and a limited set of libraries. In .NET 8 we've expanded the set of libraries that support Native AOT and added support for ASP.NET Core applications. AOT apps can have a smaller disk footprint, faster startup, and lower memory demand. But before we talk more about AOT and show some numbers, we should talk about a prerequisite: trimming.
Starting in .NET 6, trimming applications became a fully supported feature. Enabling it with `<PublishTrimmed>true</PublishTrimmed>` in your `.csproj` runs the trimmer during publish and removes code your application isn't using. This can result in smaller deployed application sizes, which is useful when running on memory-constrained devices. Trimming isn't free though: libraries might need to annotate types and method calls to tell the trimmer about code usage it can't determine on its own; otherwise the trimmer might remove code you're relying on and your app won't run as expected. The trimmer raises warnings when it sees code that might not be compatible with trimming. Until .NET 8 the `<TrimMode>` property for publishing web apps was set to `partial`, meaning only assemblies that explicitly stated they supported trimming would be trimmed. Now in 8.0, `<TrimMode>` defaults to `full`, which means all assemblies used by the app will be trimmed. These settings are documented in the trimming options docs.
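For reference, a sketch of what those settings look like in a `.csproj`; `full` is already the 8.0 default for web apps, shown here only for explicitness:

```xml
<PropertyGroup>
  <PublishTrimmed>true</PublishTrimmed>
  <!-- 'partial' (the pre-8.0 web default) trims only assemblies that opt in -->
  <TrimMode>full</TrimMode>
</PropertyGroup>
```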
In .NET 6 and .NET 7 a lot of libraries weren’t compatible with trimming yet, notably ASP.NET Core libraries. If you tried to publish a simple ASP.NET Core app in 7.0 you would get a bunch of trimmer warnings because most of ASP.NET Core didn’t support trimming yet.
The following is an ASP.NET Core app used to compare trimming on net7.0 vs. net8.0. All the numbers are for a Windows publish.
<Project Sdk="Microsoft.NET.Sdk.Web">
<PropertyGroup>
<TargetFrameworks>net7.0;net8.0</TargetFrameworks>
<Nullable>enable</Nullable>
<ImplicitUsings>enable</ImplicitUsings>
</PropertyGroup>
</Project>
// dotnet publish --self-contained --runtime win-x64 --framework net7.0 -p:PublishTrimmed=true -p:PublishSingleFile=true --configuration Release
var app = WebApplication.Create();
app.Run((c) => c.Response.WriteAsync("hello world"));
app.Run();
TFM | Trimmed | Warnings | App Size | Publish duration |
---|---|---|---|---|
net7.0 | false | 0 | 88.4MB | 3.9 sec |
net8.0 | false | 0 | 90.9MB | 3.9 sec |
net7.0 | true | 16 | 28.9MB | 16.4 sec |
net8.0 | true | 0 | 17.3MB | 10.8 sec |
In addition to no more warnings when publishing trimmed in net8.0, the app size is smaller because we’ve annotated more libraries so the linker can find more code that isn’t being used by the app. Part of annotating the libraries involved analyzing what code is being kept by the trimmer and changing code to improve what can be trimmed. You can see numerous PRs to help this effort; dotnet/aspnetcore#47567, dotnet/aspnetcore#47454, dotnet/aspnetcore#46082, dotnet/aspnetcore#46015, dotnet/aspnetcore#45906, dotnet/aspnetcore#46020, and many more.
The *Publish duration* field was measured with `Measure-Command` in PowerShell (deleting `/bin/` and `/obj/` between every run). As you can see, enabling trimming can increase publish time because the trimmer has to analyze the whole program to determine what it can remove, which isn't a free operation.
We also introduced two smaller versions of `WebApplication`, `CreateSlimBuilder` and `CreateEmptyBuilder`, if you want even smaller apps.
Changing the previous app to use `CreateSlimBuilder`:
// dotnet publish --self-contained --runtime win-x64 --framework net8.0 -p:PublishTrimmed=true -p:PublishSingleFile=true --configuration Release
var builder = WebApplication.CreateSlimBuilder(args);
var app = builder.Build();
app.Run((c) => c.Response.WriteAsync("hello world"));
app.Run();
will result in an app size of 15.5MB.
And then going one step further with `CreateEmptyBuilder`:
// dotnet publish --self-contained --runtime win-x64 --framework net8.0 -p:PublishTrimmed=true -p:PublishSingleFile=true --configuration Release
var builder = WebApplication.CreateEmptyBuilder(new WebApplicationOptions()
{
Args = args
});
var app = builder.Build();
app.Run((c) => c.Response.WriteAsync("hello world"));
app.Run();
will result in an app size of 13.7MB, although in this case the app won't work because there is no server implementation registered. If we add Kestrel via `builder.WebHost.UseKestrelCore();` the app size becomes 15MB.
TFM | Builder | App Size |
---|---|---|
net8.0 | Create | 17.3MB |
net8.0 | Slim | 15.5MB |
net8.0 | Empty | 13.7MB |
net8.0 | Empty+Server | 15.0MB |
Note that both of these APIs are available starting in 8.0; they remove a lot of defaults, making the app more pay-for-play.
Now that we've taken a small look at trimming and seen that 8.0 has more trim-compatible libraries, let's look at Native AOT. Just like with trimming, if your app or library isn't compatible with Native AOT you'll get warnings when building, and there are additional limitations on what works in Native AOT.
Using the same app as before, we'll enable Native AOT by adding `<PublishAot>true</PublishAot>` to our `.csproj`.
TFM | AOT | App Size | Publish duration |
---|---|---|---|
net7.0 | false | 88.4MB | 3.9 sec |
net8.0 | false | 90.9MB | 3.9 sec |
net7.0 | true | 40MB | 71.7 sec |
net8.0 | true | 12.6MB | 22.7 sec |
And just like with trimming, we can test the `WebApplication` APIs that have fewer defaults enabled.
TFM | Builder | App Size |
---|---|---|
net8.0 | Create | 12.6MB |
net8.0 | Slim | 8.8MB |
net8.0 | Empty | 5.7MB |
net8.0 | Empty+Server | 7.8MB |
That’s pretty cool! A small net8.0 app is 90.9MB and when published as Native AOT it’s 12.6MB, or as low as 7.8MB (assuming we want a server, which we probably do).
Now let’s take a look at some other performance characteristics of a Native AOT app; startup speed, memory usage, and RPS.
In order to properly show E2E benchmark numbers we need a multi-machine setup so that the server and client processes don't steal CPU from each other, and so there are no stray processes running like there would be on a local machine. I'll be using our internal benchmarking infrastructure, which makes use of the benchmarking tool crank and our aspnet-citrine-win and aspnet-citrine-lin machines for server and load respectively. Both machine specs are described in our benchmarks readme. And finally, I'll be using an application that uses Minimal APIs to return a JSON payload. This app uses the Slim builder we showed earlier and sets `<InvariantGlobalization>true</InvariantGlobalization>` in the csproj.
If we run the app without any extra settings:
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/goldilocks.benchmarks.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/ci.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/steadystate.profile.yml --scenario basicminimalapivanilla --profile intel-win-app --profile intel-lin-load --application.framework net8.0 --application.options.collectCounters true
This gives us a ~293ms startup time, 444MB working set, and ~762k RPS.
If we run the same app but publish it as Native AOT:
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/goldilocks.benchmarks.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/ci.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/steadystate.profile.yml --scenario basicminimalapipublishaot --profile intel-win-app --profile intel-lin-load --application.framework net8.0 --application.options.collectCounters true
We get ~67ms startup time, 56MB working set, and ~681k RPS. That's ~77% faster startup, ~87% lower working set, and ~12% lower RPS. The faster startup is expected: the app was fully optimized at publish time, so there is no JIT warming up code at startup. Also, in non-Native AOT apps, startup methods are likely called only once, so tiered compilation never recompiles them and they won't be as optimized as they could be, whereas in Native AOT the startup path is fully optimized. The working set is a bit surprising; it is lower because Native AOT apps by default run with the new Dynamic Adaptation To Application Sizes (DATAS) GC. This GC mode tries to balance throughput and overall memory usage, which we can see it doing here with an ~87% lower working set at the cost of some RPS. You can read more about the new GC setting in Maoni0's blog.
Let's also compare the Native AOT vs. non-Native AOT apps with the Server GC, by adding `--application.environmentVariables DOTNET_GCDynamicAdaptationMode=0` when running the Native AOT app.
This time we get ~64ms startup time, 403MB working set, and ~730k RPS. The startup time is still extremely fast because changing the GC doesn’t affect that, our working set is closer to the non-Native AOT app but smaller due in part to not having the JIT compiler loaded and running, and our RPS is closer to the non-Native AOT app because we’re using the Server GC which optimizes throughput more than memory usage.
AOT | GC | Startup | Working Set | RPS |
---|---|---|---|---|
false | Server | 293ms | 444MB | 762k |
false | DATAS | 303ms | 77MB | 739k |
true | Server | 64ms | 403MB | 730k |
true | DATAS | 67ms | 56MB | 681k |
Non-Native AOT apps have the JIT optimizing code while it’s running, and starting in .NET 8 the JIT by default will make use of dynamic PGO, this is a really cool feature that Native AOT isn’t able to benefit from and is one reason non-Native AOT apps can have more throughput than Native AOT apps. You can read more about dynamic PGO in the .NET 8 performance blog.
If you're willing to trade some publish size for potentially more optimized code, you can pass `/p:OptimizationPreference=Speed` when building and publishing your Native AOT app. When we do this for our benchmark app (with Server GC) we get a publish size of 9.5MB instead of 8.9MB, and 745k RPS instead of 730k.
The app we've been using makes use of Minimal APIs, which by default isn't trim friendly: it relies on reflection and dynamic code generation that isn't statically analyzable, so the trimmer can't safely trim the app. So why don't we see warnings when we publish this app as Native AOT? Because we wrote a source generator called the Request Delegate Generator (RDG) that replaces your `MapGet`, `MapPost`, etc. calls with trim-friendly code. This source generator is automatically used for ASP.NET Core apps when trimming/AOT publishing, which leads us into the next section, where we dive into RDG.
Request Delegate Generator
The Request Delegate Generator (RDG) is a source generator created to make Minimal APIs trimmer- and Native AOT-friendly. Without RDG, using Minimal APIs will result in many warnings and your app likely won't work as expected. Here is a quick example of an endpoint that throws an exception when using Native AOT without RDG, but works with RDG enabled (or when not using Native AOT).
app.MapGet("/test", (Bindable b) => "Hello world!");
public class Bindable
{
public static ValueTask<Bindable?> BindAsync(HttpContext context, ParameterInfo parameter)
{
return new ValueTask<Bindable?>(new Bindable());
}
}
This app throws when you send a `GET` request to `/test` because the `Bindable.BindAsync` method is referenced via reflection, so the trimmer can't statically determine that the method is being used and removes it. Minimal APIs then sees the `MapGet` call as needing a request body, which isn't allowed by default for `GET` requests.
Besides fixing warnings and making the app work as expected in Native AOT, we get improved time to first response and reduced publish size.
Without RDG, the first request to the app is when the expression trees for all endpoints in the application are generated. Because RDG generates the source for each endpoint at compile time, no expression tree generation is needed: the code for a specific endpoint is already available and can execute immediately.
If we take the app used earlier for benchmarking AOT and look at time to first request we get ~187ms when not running as AOT and without RDG. We then get ~130ms when we enable RDG. When publishing as AOT, the time to first request is ~60ms regardless of using RDG. But this app only has 2 endpoints, so let’s add 1000 more endpoints and see the difference!
2 Routes:
AOT | RDG | First Request | Publish Size |
---|---|---|---|
false | false | 187ms | 97MB |
false | true | 130ms | 97MB |
true | false | 60ms | 11.15MB |
true | true | 60ms | 8.89MB |
1002 Routes:
AOT | RDG | First Request | Publish Size |
---|---|---|---|
false | false | 1082ms | 97MB |
false | true | 176ms | 97MB |
true | false | 157ms | 11.15MB |
true | true | 84ms | 8.89MB |
Runtime APIs
In this section we’ll be looking at changes that mainly involve updating to use new APIs introduced in .NET 8 in the Base Class Library (BCL).
SearchValues
dotnet/aspnetcore#45300 by @gfoidl, dotnet/aspnetcore#47459, dotnet/aspnetcore#49114, and dotnet/aspnetcore#49117 all make use of the new `SearchValues` type, which lets these code paths take advantage of optimized search implementations for the specific values being searched for. The `SearchValues` section of the .NET 8 performance blog explains more details about the different search algorithms used and why this type is so cool!
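As an illustration of the pattern (ours, not the actual ASP.NET Core code), a `SearchValues<char>` is typically created once, usually in a static field, and reused for every search:

```csharp
using System.Buffers;

Console.WriteLine(TokenValidator.ContainsSeparator("text/html")); // True
Console.WriteLine(TokenValidator.ContainsSeparator("gzip"));      // False

public static class TokenValidator
{
    // Created once; SearchValues picks an optimized (often vectorized)
    // search algorithm for this exact set of characters.
    private static readonly SearchValues<char> s_separators =
        SearchValues.Create("(),/:;<=>?@[]{}\"");

    public static bool ContainsSeparator(ReadOnlySpan<char> value) =>
        value.IndexOfAny(s_separators) >= 0;
}
```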
Spans
dotnet/aspnetcore#46098 makes use of the new `MemoryExtensions.Split(ReadOnlySpan<char> source, Span<Range> destination, char separator)` method. This allows certain uses of `string.Split(...)` to be replaced with a non-allocating version, saving the `string[]` allocation as well as the individual `string` allocations for the items in the `string[]`. More details on this new API can be seen in the .NET 8 performance post's span section.
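A quick sketch of the new API: the destination `Span<Range>` receives index ranges into the original span, so no substrings are allocated during the split itself:

```csharp
ReadOnlySpan<char> encodings = "gzip, deflate, br";
Span<Range> ranges = stackalloc Range[4];

// Split fills 'ranges' with slices of the original span; nothing is allocated.
int count = encodings.Split(ranges, ',');

for (int i = 0; i < count; i++)
{
    // Each item is a view into the original span; Trim is also allocation-free.
    ReadOnlySpan<char> item = encodings[ranges[i]].Trim();
    Console.WriteLine(item.ToString()); // "gzip", "deflate", "br"
}
```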
FrozenDictionary
Another new type introduced in .NET 8 is `FrozenDictionary`. It lets you construct a dictionary optimized for read operations at the cost of slower construction.
dotnet/aspnetcore#49714 switches a `Dictionary` in routing to a `FrozenDictionary`. This dictionary is used when routing an HTTP request to the appropriate endpoint, which happens for almost every request to an application. The following tables show the cost of creating a dictionary vs. a frozen dictionary, and then the per-operation cost of using each. You can see that constructing a `FrozenDictionary` can be up to 13x slower, but the overall time is still in the microsecond range (1/1000th of a millisecond) and the `FrozenDictionary` is only constructed once for the app. What we all like to see is that the per-operation performance of `FrozenDictionary` is 2.5x-3.5x faster than a `Dictionary`!
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory)]
public class JumpTableMultipleEntryBenchmark
{
private string[] _strings;
private int[] _segments;
private JumpTable _dictionary;
private JumpTable _frozenDictionary;
private List<(string text, int _)> _entries;
[Params(1000)]
public int NumRoutes;
[GlobalSetup]
public void Setup()
{
_strings = GetStrings(1000);
_segments = new int[1000];
for (var i = 0; i < _strings.Length; i++)
{
_segments[i] = _strings[i].Length;
}
var samples = new int[NumRoutes];
for (var i = 0; i < samples.Length; i++)
{
samples[i] = i * (_strings.Length / NumRoutes);
}
_entries = new List<(string text, int _)>();
for (var i = 0; i < samples.Length; i++)
{
_entries.Add((_strings[samples[i]], i));
}
_dictionary = new DictionaryJumpTable(0, -1, _entries.ToArray());
_frozenDictionary = new FrozenDictionaryJumpTable(0, -1, _entries.ToArray());
}
[BenchmarkCategory("GetDestination"), Benchmark(Baseline = true, OperationsPerInvoke = 1000)]
public int Dictionary()
{
var strings = _strings;
var segments = _segments;
var destination = 0;
for (var i = 0; i < strings.Length; i++)
{
destination = _dictionary.GetDestination(strings[i], segments[i]);
}
return destination;
}
[BenchmarkCategory("GetDestination"), Benchmark(OperationsPerInvoke = 1000)]
public int FrozenDictionary()
{
var strings = _strings;
var segments = _segments;
var destination = 0;
for (var i = 0; i < strings.Length; i++)
{
destination = _frozenDictionary.GetDestination(strings[i], segments[i]);
}
return destination;
}
[BenchmarkCategory("Create"), Benchmark(Baseline = true)]
public JumpTable CreateDictionaryJumpTable() => new DictionaryJumpTable(0, -1, _entries.ToArray());
[BenchmarkCategory("Create"), Benchmark]
public JumpTable CreateFrozenDictionaryJumpTable() => new FrozenDictionaryJumpTable(0, -1, _entries.ToArray());
private static string[] GetStrings(int count)
{
var strings = new string[count];
for (var i = 0; i < count; i++)
{
var guid = Guid.NewGuid().ToString();
// Between 5 and 36 characters
var text = guid.Substring(0, Math.Max(5, Math.Min(i, 36)));
if (char.IsDigit(text[0]))
{
// Convert first character to a letter.
text = ((char)(text[0] + ('G' - '0'))) + text.Substring(1);
}
if (i % 2 == 0)
{
// Lowercase half of them
text = text.ToLowerInvariant();
}
strings[i] = text;
}
return strings;
}
}
public abstract class JumpTable
{
public abstract int GetDestination(string path, int segmentLength);
}
internal sealed class DictionaryJumpTable : JumpTable
{
private readonly int _defaultDestination;
private readonly int _exitDestination;
private readonly Dictionary<string, int> _dictionary;
public DictionaryJumpTable(
int defaultDestination,
int exitDestination,
(string text, int destination)[] entries)
{
_defaultDestination = defaultDestination;
_exitDestination = exitDestination;
_dictionary = entries.ToDictionary(e => e.text, e => e.destination, StringComparer.OrdinalIgnoreCase);
}
public override int GetDestination(string path, int segmentLength)
{
if (segmentLength == 0)
{
return _exitDestination;
}
var text = path.Substring(0, segmentLength);
if (_dictionary.TryGetValue(text, out var destination))
{
return destination;
}
return _defaultDestination;
}
}
internal sealed class FrozenDictionaryJumpTable : JumpTable
{
private readonly int _defaultDestination;
private readonly int _exitDestination;
private readonly FrozenDictionary<string, int> _dictionary;
public FrozenDictionaryJumpTable(
int defaultDestination,
int exitDestination,
(string text, int destination)[] entries)
{
_defaultDestination = defaultDestination;
_exitDestination = exitDestination;
_dictionary = entries.ToFrozenDictionary(e => e.text, e => e.destination, StringComparer.OrdinalIgnoreCase);
}
public override int GetDestination(string path, int segmentLength)
{
if (segmentLength == 0)
{
return _exitDestination;
}
var text = path.Substring(0, segmentLength);
if (_dictionary.TryGetValue(text, out var destination))
{
return destination;
}
return _defaultDestination;
}
}
Method | NumRoutes | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
CreateDictionaryJumpTable | 25 | 735.797 ns | 8.5503 ns | 7.5797 ns | 1.00 | 0.00 |
CreateFrozenDictionaryJumpTable | 25 | 4,677.927 ns | 80.4279 ns | 71.2972 ns | 6.36 | 0.11 |
CreateDictionaryJumpTable | 50 | 1,433.309 ns | 19.4435 ns | 17.2362 ns | 1.00 | 0.00 |
CreateFrozenDictionaryJumpTable | 50 | 10,065.905 ns | 188.7031 ns | 176.5130 ns | 7.03 | 0.12 |
CreateDictionaryJumpTable | 100 | 2,712.224 ns | 46.0878 ns | 53.0747 ns | 1.00 | 0.00 |
CreateFrozenDictionaryJumpTable | 100 | 28,397.809 ns | 358.2159 ns | 335.0754 ns | 10.46 | 0.20 |
CreateDictionaryJumpTable | 1000 | 28,279.153 ns | 424.3761 ns | 354.3733 ns | 1.00 | 0.00 |
CreateFrozenDictionaryJumpTable | 1000 | 313,515.684 ns | 6,148.5162 ns | 8,208.0925 ns | 11.26 | 0.33 |
Dictionary | 25 | 21.428 ns | 0.1816 ns | 0.1516 ns | 1.00 | 0.00 |
FrozenDictionary | 25 | 7.137 ns | 0.0588 ns | 0.0521 ns | 0.33 | 0.00 |
Dictionary | 50 | 21.630 ns | 0.1978 ns | 0.1851 ns | 1.00 | 0.00 |
FrozenDictionary | 50 | 7.476 ns | 0.0874 ns | 0.0818 ns | 0.35 | 0.00 |
Dictionary | 100 | 23.508 ns | 0.3498 ns | 0.3272 ns | 1.00 | 0.00 |
FrozenDictionary | 100 | 7.123 ns | 0.0840 ns | 0.0745 ns | 0.30 | 0.00 |
Dictionary | 1000 | 23.761 ns | 0.2360 ns | 0.2207 ns | 1.00 | 0.00 |
FrozenDictionary | 1000 | 8.516 ns | 0.1508 ns | 0.1337 ns | 0.36 | 0.01 |
Other
This section is a compilation of changes that enhance performance but do not fall under any of the preceding categories.
Regex
As part of the AOT effort, we noticed the regex created in `RegexRouteConstraint` (see route constraints for more info) was adding ~1MB to the published app size. This is because route constraints are dynamic (application code defines them) and we were using the `Regex` constructor that accepts `RegexOptions`. This meant the trimmer had to keep all regex code that could potentially be used, including the `NonBacktracking` engine, which accounts for ~0.8MB of code. By adding `RegexOptions.Compiled`, the trimmer can now see that the `NonBacktracking` code will not be used and can reduce the application size by ~0.8MB. Additionally, compiled regexes are faster than interpreted ones. The quick fix was to add `RegexOptions.Compiled` when creating the `Regex`, which was done in dotnet/aspnetcore#46192 by @eugeneogongo. The problem is that this slows down app startup, because we resolve constraints when starting the app and compiled regexes are slower to construct.
dotnet/aspnetcore#46323 fixes this by lazily initializing the regexes, so app startup is actually faster than in 7.0, when we weren't using compiled regexes. It also added caching of route constraints, which means that if you use the same constraint in multiple routes, you save allocations by sharing the constraint across routes.
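The shape of that fix can be sketched like this (a hypothetical constraint type of our own, not the actual routing code): construction stays cheap at startup, and the expensive `RegexOptions.Compiled` regex is only built the first time the route is matched:

```csharp
using System.Text.RegularExpressions;

public sealed class LazyRegexConstraint
{
    private readonly Lazy<Regex> _regex;

    public LazyRegexConstraint(string pattern)
    {
        // Defer the costly compiled-regex construction until first use.
        _regex = new Lazy<Regex>(() => new Regex(
            pattern,
            RegexOptions.Compiled | RegexOptions.CultureInvariant,
            TimeSpan.FromSeconds(10)));
    }

    public bool Match(string value) => _regex.Value.IsMatch(value);
}
```

A shared instance of such a constraint can then be cached and reused across routes, which is the other half of what the PR does.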
Running a microbenchmark of the route builder to measure startup performance shows an almost 450% improvement with 1000 routes, because the regexes are no longer initialized eagerly. The benchmark lives in the dotnet/aspnetcore repo; it has a lot of setup code and would be too long to include in this post.
Before with interpreted regexes:
Method | Mean | Op/s | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|
Build | 6.739 ms | 148.4 | 15.6250 | – | 7 MB |
After with compiled and lazy regexes:
Method | Mean | Op/s | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|
Build | 1.521 ms | 657.2 | 5.8594 | 1.9531 | 2 MB |
Another Regex improvement came from dotnet/aspnetcore#44770 which switched a Regex usage in routing to use the Regex Source Generator. This moves the cost of compiling the Regex to build time, as well as resulting in faster Regex code due to optimizations the source generator takes advantage of that the in-process Regex compiler does not.
We’ll show a simplified example that demonstrates using the generated regex vs. the compiled regex.
public partial class AlphaRegex
{
static Regex Net7Constraint = new Regex(
@"^[a-z]*$",
RegexOptions.CultureInvariant | RegexOptions.Compiled | RegexOptions.IgnoreCase,
TimeSpan.FromSeconds(10));
static Regex Net8Constraint = GetAlphaRouteRegex();
[GeneratedRegex(@"^[A-Za-z]*$")]
private static partial Regex GetAlphaRouteRegex();
[Benchmark(Baseline = true)]
public bool CompiledRegex()
{
return Net7Constraint.IsMatch("Administration") && Net7Constraint.IsMatch("US");
}
[Benchmark]
public bool SourceGenRegex()
{
return Net8Constraint.IsMatch("Administration") && Net8Constraint.IsMatch("US");
}
}
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
CompiledRegex | 86.92 ns | 0.572 ns | 0.447 ns | 1.00 |
SourceGenRegex | 57.81 ns | 0.860 ns | 0.805 ns | 0.66 |
Analyzers
Analyzers are useful for pointing out issues in code that are hard to convey in API signatures, for suggesting more readable code patterns, and for suggesting more performant ways to write code. dotnet/aspnetcore#44799 and dotnet/aspnetcore#44791, both from @martincostello, enabled CA1854, which helps avoid 2 dictionary lookups when only 1 is needed, and dotnet/aspnetcore#44269 enables a number of analyzers, many of which help use more performant APIs and are described in more detail in last year's .NET 7 performance post.
I would encourage developers who are interested in the performance of their own products to check out performance focused analyzers, which contains a list of many analyzers that will help avoid easy-to-fix performance issues.
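As a small illustration of what CA1854 flags (the dictionary and key here are made up for the example), the first lookup pattern hashes the key twice, while TryGetValue hashes it once:

```csharp
using System;
using System.Collections.Generic;

var headers = new Dictionary<string, string> { ["Accept"] = "text/html" };

// Flagged by CA1854: two lookups, one for ContainsKey and one for the indexer.
string? accept1 = null;
if (headers.ContainsKey("Accept"))
{
    accept1 = headers["Accept"];
}

// Preferred: a single lookup via TryGetValue.
headers.TryGetValue("Accept", out var accept2);

Console.WriteLine($"{accept1} / {accept2}");
```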
StringBuilder
StringBuilder is an extremely useful class for constructing a string when you either can't precompute the size of the string to create or want an easy way to construct a string without the complications involved with using string.Create(...).

StringBuilder comes with a lot of helpful methods as well as a custom implementation of an InterpolatedStringHandler. What this means is that you can "create" strings to add to the StringBuilder without actually allocating the string. For example, previously you might have written stringBuilder.Append(FormattableString.Invariant($"{key} = {value}"));. This would have allocated a string via FormattableString.Invariant(...) and then put it in the StringBuilder's internal char[] buffer, making the string a temporary allocation. Instead you can write stringBuilder.Append(CultureInfo.InvariantCulture, $"{key} = {value}");. This also looks like it would allocate a string via $"{key} = {value}", but because StringBuilder has a custom InterpolatedStringHandler the string isn't actually allocated and is instead written directly to the internal char[].
dotnet/aspnetcore#44691 fixes some usage patterns with StringBuilder to avoid allocations and makes use of the InterpolatedStringHandler overload(s). One specific example was taking a byte[] and converting it into a string in hexadecimal format so we could send it as a query string.
[MemoryDiagnoser]
public class AppendBenchmark
{
    private byte[] _b = new byte[30];

    [GlobalSetup]
    public void Setup()
    {
        RandomNumberGenerator.Fill(_b);
    }

    [Benchmark]
    public string AppendToString()
    {
        var sb = new StringBuilder();
        foreach (var b in _b)
        {
            sb.Append(b.ToString("x2", CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }

    [Benchmark]
    public string AppendInterpolated()
    {
        var sb = new StringBuilder();
        foreach (var b in _b)
        {
            sb.Append(CultureInfo.InvariantCulture, $"{b:x2}");
        }
        return sb.ToString();
    }
}
Method | Mean | Gen0 | Allocated |
---|---|---|---|
AppendToString | 748.7 ns | 0.1841 | 1448 B |
AppendInterpolated | 739.7 ns | 0.0620 | 488 B |
Summary
Thanks for reading! Try out .NET 8 and let us know how your app’s performance has changed! We are always looking for feedback on how to improve the product and look forward to your contributions, be it an issue report or a PR. If you want more performance goodness, you can read the Performance Improvements in .NET 8 post. Also, take a look at Developer Stories which showcases multiple teams at Microsoft migrating from .NET Framework to .NET or to newer versions of .NET and seeing major performance and operating cost wins.
5 comments
Exciting improvements, thanks!
Could anyone write a blog on ArrayPool? In the past, I thought the ArrayPool was something like a
LinkedList<T[]>
that holds all the arrays it allocates, and thus there would be a memory leak if you don't return to the pool. But as mentioned in dotnet/aspnetcore#45044, it seems it's ok to let the GC collect the ArrayPool array.

The logic behind renting is effectively to check whether there's a usable array currently stored in the pool: if there is, take it out and return it, and if there isn't, allocate a new one and return it. The logic behind returning is effectively to check whether there's currently space in the pool to store it: if there is, put it back, and if there isn't, throw it away. As such, if you don't return an array to the pool, it just means that someone else who comes along to rent one is going to be more likely to need to allocate one, but then when they return theirs, it'll still be stored in the pool. So it's not a permanent leak.
There are three main downsides to renting and not returning (as opposed to just allocating an array without using the pool):
1. There's more overhead associated with Rent than there is generally with new[], so if you're not going to return the array, you should just be using new[].
2. The arrays from the pool are generally more valuable than ones freshly allocated, because ones in the pool are more likely to have already been promoted to gen2. Thus by using one of those arrays and then dropping it, you're creating more pressure for the GC to perform a gen2 GC.
3. Code that might have been written previously to avoid allocation might now use ArrayPool under the premise that it's always going to get a pooled array. But if some code somewhere is taking arrays from the pool and frequently not returning them, that can in turn violate those assumptions made by other code and make other code more expensive.
In general, then, you should strive to return arrays you rent, but not doing so every once in a while isn't a big deal. We always return arrays on success paths (or if we don't, we consider that a bug), but we're ok dropping an array here or there in the case of exceptions occurring.
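The rent/use/return discipline described here is usually written with a try/finally so the array goes back on every path. A minimal sketch (the 256-byte request size and the summing work are invented for illustration):

```csharp
using System;
using System.Buffers;

int result = SumBytes(new byte[] { 1, 2, 3 }); // 1 + 2 + 3 = 6
Console.WriteLine(result);

static int SumBytes(ReadOnlySpan<byte> data)
{
    // Rent may hand back a larger array than requested, so only touch data.Length bytes.
    byte[] buffer = ArrayPool<byte>.Shared.Rent(256);
    try
    {
        data.CopyTo(buffer);
        int sum = 0;
        for (int i = 0; i < data.Length; i++)
        {
            sum += buffer[i];
        }
        return sum;
    }
    finally
    {
        // Return on every path; pass clearArray: true if the contents are sensitive.
        ArrayPool<byte>.Shared.Return(buffer);
    }
}
```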
I wondered about that too… mini post…
Firstly, your point about the linked list would be an issue if the implementation were actually using a linked list; as you point out, the GC would be held at bay by the references in the list. However, this is not what happens. There is a CLR (common language runtime) mechanism to hold object references (arrays in this case) without keeping them alive. In C# this shows up as a "WeakReference"; WeakReferences are tracked separately by the GC and don't count as "alive". WeakReference is a wrapper around a GCHandle with the "weak" flag set, which is how the GC can track WeakReferences separately. It is a core feature of the GCHandle system for managed memory; the other one is "Pinning" (the third feature flag is a scary place).
Looking at the implementation for "ArrayPool.Shared": this must keep track of the arrays, but that table of references should not prevent the garbage collector from recycling the memory that has been rented out as arrays from the pool (so that if there are no longer any references elsewhere, the GC can reuse the space). The specific "ArrayPool.Shared" implementation uses a ConditionalWeakTable which holds "weakreferences". So not returning every single array is not a crime; the GC will find them.
Since the "Pool" classes are abstract (also nice and simple), one can implement them in other ways, so this discussion is not really about "the" ArrayPool, but just a specific "ArrayPool.Shared" implementation detail.
Two key things to know about renting Arrays from the runtime ArrayPool implementations:
1. The Arrays are full of dirty dishes, meaning the data is not wiped clean: you get whatever was left there the last time this array was rented out. Sounds simple, but I learnt it the hard way. (There is a flag to get the arrays cleaned in recent versions.) The reason is that it is faster not to clear the memory, of course (but is it really?).
2. The Arrays are the wrong size (most of the time). The Arrays in the runtime implementations are allocated in a small set of discrete sizes (powers of two today, but don't rely on that). You always get at least what you ask for, but usually more. (The reason for that is to avoid keeping tables for every possible size.) This is inconvenient, but hard to fix. One way round the "wrong length" issue is to use "ArraySegments", which can be the right length and cough up a reference to the array when it is time to return the used array to the ArrayPool for the next customer. AsMemory and Slice/Span let you do a similar thing, but one way or another you need to bring the original array back to return it to the ArrayPool (otherwise why use a pool).
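A sketch of that "wrong length" workaround: keep the original rented array so it can be returned, and hand out a correctly-sized view. The 100-byte request is invented, and the actual bucket sizes are an implementation detail:

```csharp
using System;
using System.Buffers;

// Ask for 100 bytes; Shared will typically hand back a larger bucket.
byte[] rented = ArrayPool<byte>.Shared.Rent(100);
Console.WriteLine($"requested 100, got {rented.Length}");

// Expose only the length callers actually asked for.
var view = new ArraySegment<byte>(rented, 0, 100);
// Memory<byte> slice = rented.AsMemory(0, 100); // equivalent slicing via AsMemory

// ... use view ...

// The segment still holds a reference to the original array, so it can go back.
ArrayPool<byte>.Shared.Return(view.Array!);
```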
Two common basic use cases:
Firstly, renting arrays for temporary space where we rent the array, then "try" and use it for file parsing (for example), and "finally" we return it to the pool. This is especially useful for something that happens often (e.g. network IO) and for which we would rather avoid allocating a new array each time. This feels right compared to static/threadlocal.
While the first use case is local and small, the second use case is the opposite, specifically solving the problem of non-local creation of arrays in one place and disposal of the same arrays for reuse in another place in the system. For example, imagine we are receiving live temperature data from UDP datagram messages in one Task or thread and displaying the temperature on a chart in a different Task/thread. We can use a channel to queue messages from the Task receiving UDP messages to the Task updating the temperature plot Chart. That's ok, but to make it hum we can use an ArrayPool to avoid allocating new messages every time. The ArrayPool allows us to rent arrays at the UDP network end to fill with data when the UDP messages arrive, then we push the newly filled rented messages into the channel. At the other end we use the data in the messages for our temperature plot and simply return each message to the ArrayPool after the message is read. The arrays in this case cycle between the Pool and the channel. The ArrayPool.Shared is a good fit for this.
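A compressed sketch of that producer/consumer shape, with the UDP receive replaced by a stub so it stays self-contained (the message count, buffer size, and fake temperature values are all invented):

```csharp
using System;
using System.Buffers;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateUnbounded<(byte[] Buffer, int Length)>();

// Producer: rent a buffer per "datagram", fill it, push it into the channel.
var producer = Task.Run(async () =>
{
    for (int i = 0; i < 3; i++)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(64);
        buffer[0] = (byte)(20 + i); // stand-in for a received temperature reading
        await channel.Writer.WriteAsync((buffer, 1));
    }
    channel.Writer.Complete();
});

// Consumer: read each message, use it, then return the buffer to the pool.
int messages = 0;
await foreach (var (buffer, length) in channel.Reader.ReadAllAsync())
{
    Console.WriteLine($"temperature sample: {buffer[0]}");
    ArrayPool<byte>.Shared.Return(buffer);
    messages++;
}
await producer;
```

The arrays cycle between the pool and the channel, so steady-state throughput needs no per-message allocation.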
Here is some space to draw your own diagram…
References
Jeff Richter’s book on the CLR
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Buffers/SharedArrayPool.cs,5f646655a4d1632b
Uncalled for opinion: CLR Arrays are great and awful and capricious at the same time, almost as annoying as IDisposable.
Fascinating stuff! Keep up the amazing work!