
Hasura and Well-Typed collaborate on Haskell tooling

Well-Typed and Hasura have been working together since 2020 to improve Haskell tooling for commercial Haskell users, combining Well-Typed's expertise in maintaining the Glasgow Haskell Compiler (GHC) with Hasura's experience using Haskell in production at scale. Over the last two years we have continued this productive relationship, working on a wide variety of projects, in particular around the compiler's profiling and debugging capabilities, many of which have been reinvented or spruced up. In this post we'll look back at the progress we have made together.

Memory profiling and heap analysis

ghc-debug

One of the first big projects we worked on was ghc-debug, a new heap analysis tool that can gather detailed information about the heap of a running process or analyse a snapshot. This tool gives precise results so it can be used to reliably investigate memory usage issues, and we have used it numerous times to fix bugs in the GHC code base. Within Hasura we have used it to investigate fragmentation issues more closely and also to diagnose a critical memory leak regression before a release.

Since GHC 9.2, ghc-debug is supported natively in GHC. All the libraries and executables are on Hackage so it can be installed and used like any normal Haskell library.
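
To analyse a live process, the target program is lightly instrumented so that a debugger client can attach to it. A minimal sketch, assuming the ghc-debug-stub package (module and function names reflect recent releases; check the documentation for the version you are using):

    module Main where

    import GHC.Debug.Stub (withGhcDebug)

    main :: IO ()
    main = withGhcDebug $ do
      -- The wrapped action runs as normal, but the process additionally
      -- listens on a unix socket (configured via the GHC_DEBUG_SOCKET
      -- environment variable) so that a ghc-debug client can connect,
      -- pause the process, traverse the live heap, or request a snapshot
      -- for offline analysis.
      putStrLn "application running with ghc-debug enabled"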

Info table profiling

Also in GHC 9.2 we introduced "info table profiling" (or -hi profiling), a new heap profiling mode that analyses memory usage over time and relates it to source code locations. Crucially, it does not require introducing cost centres and recompiling with profiling enabled (which may distort performance). It works by storing a map from info tables to meta-information, such as where each closure originated and what type it has.
The resulting profile can be viewed using eventlog2html to give a detailed table about the memory behaviour of each closure type over the course of a program.
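
As an illustration, a typical -hi profiling session might look roughly as follows (flag spellings follow the GHC 9.2+ user's guide and the eventlog2html documentation, though the details depend on your build setup). Note that no profiling-mode build is required, only the extra info table metadata in the binary:

    ghc -finfo-table-map -fdistinct-constructor-tables -eventlog Demo.hs
    ./Demo +RTS -hi -l -RTS
    eventlog2html Demo.eventlog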

We have used info table profiling extensively on GHC itself to find and resolve memory issues, meaning that GHC 9.2 and 9.4 bring significant reductions in compile-time memory usage.

Understanding memory fragmentation

Our early work with Hasura investigated why there was a large discrepancy between the memory usage reported by the operating system and that reported by the Haskell runtime. The initial hypothesis was that, due to the extensive use of pinned bytestrings in Hasura's code base, we were losing memory to heap fragmentation.

We developed an understanding of how exactly fragmentation could occur on a Haskell heap, tooling to analyse the extent of fragmentation and ultimately some fixes to GHC's memory allocation strategy to reduce fragmentation caused by short-lived bytestring allocations.

This investigation also led to a much deeper understanding of the memory retention behaviour of the GHC runtime, and to some additional improvements in how much memory the runtime will optimistically retain. For long-lived server applications, the amount of memory used should return to a steady baseline after the application has been idle for a long period.

This work also highlighted how other compilers trigger idle garbage collections. In particular, we may want to investigate triggering idle collections based on allocation rate rather than simple idleness, as applications may still do a small amount of work during their idle periods.

Runtime performance profiling and monitoring

Late cost centre profiling

Cost centre profiling, the standard tool recommended to GHC users for profiling their Haskell programs, allows recording both time/allocation and heap profiles. It requires compiling the project in profiling mode, which inserts cost centres into the compiled code. Traditionally, the problem with cost centre profiling has been that adding cost centres severely affects how your program is optimised. This means that the existing strategies for automatically inserting cost centres (such as -fprof-auto) can lead to major skew if cost centres are inserted in inopportune places.

We have implemented a new cost centre insertion mode in GHC 9.4, -fprof-late, which inserts cost centres after the optimiser has finished running. Therefore the cost centres will not affect how your code is optimised, and the profile gives a more accurate view of how your unprofiled program would perform. The trade-off is that the cost centres are named after GHC's internal identifiers, but these are nearly always easy to relate back to the original source.

The utility of this mode cannot be overstated: you now get a very fine-grained profile that accurately reflects the actual runtime behaviour of your program. It's made me start using the cost centre profiler again!
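
For a Cabal-based project, enabling the new mode might look something like the following cabal.project.local sketch (an assumption on our part rather than a prescribed setup; a profiling build is still required, it is only the point at which cost centres are inserted that changes):

    package *
      profiling: True
      ghc-options: -fprof-late

The profiled executable is then run with the usual RTS flags, such as +RTS -p -RTS, to produce a time profile.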

We also developed a plugin which can be used to approximate this mode if you are using
GHC 9.0 or 9.2.

Ticky-ticky profiling

Hasura have a suite of benchmarks that track different runtime metrics, such as bytes allocated.[1] Investigating regressions in these benchmarks requires a profiling tool geared towards allocations. GHC has long had support for ticky-ticky profiling, which gives a low-level view of which functions are allocating. However, in the past ticky profiling has been used almost exclusively by GHC developers rather than users, and profiles were only available in a rudimentary text-based format.

In GHC 9.4 we added support for emitting ticky samples via the eventlog, along with support in eventlog2html for rendering the profile as an interactive HTML table. In addition, we integrated the new info table mapping (as used by ghc-debug and -hi profiling) to give precise source locations for each ticky counter, making the profile easier to interpret.
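
In rough terms, the intended workflow looks like the following sketch (-ticky must be passed when compiling and linking the modules of interest; you may also need to enable further eventlog event classes, so check the GHC 9.4 user's guide and eventlog2html documentation for the exact invocation):

    ghc -O2 -ticky -eventlog Demo.hs
    ./Demo +RTS -l -RTS
    eventlog2html Demo.eventlog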

Live profiling and monitoring via the eventlog

For a long time we have been interested in unifying GHC's various profiling
mechanisms via the eventlog, and making them easier to monitor. We developed a prototype live monitoring setup for Hasura,
eventlog-live, that could
attach to the eventlog and read events whilst the program was running. This
prototype was subsequently extended thanks to funding from IOG.

Native Stack Pointer register

GHC-compiled programs use separate registers for the C stack and Haskell stack.
One consequence of this is that native Linux debugging and statistical profiling tools (such as perf) see only the C stack pointer, and hence provide a very limited window into the behaviour of Haskell programs.

Hasura commissioned some experimental investigatory work to see whether it would be possible to use the native stack pointer register for the Haskell stack, and hence get more useful output from off-the-shelf debugging tools. Unfortunately we ran into issues getting perf to understand the debugging information generated by GHC, and there are challenges related to maintaining LLVM compatibility, but we remain interested in exploring this further.

Haskell Language Server

Lately we have started to support maintenance of the Haskell Language Server
(HLS). The language server is now a key part of many developers' workflows, so it is a priority to make sure it is kept up-to-date and works reliably, and sponsorship from companies like Hasura is crucial to enabling this.

Recently our work on HLS has included:

  • Supporting the GHC 9.2 release series, as Hasura were keen to upgrade and
    have access to all the improved functionality we discussed in this post.
  • Diagnosing and resolving difficult-to-reproduce segfaults experienced by HLS users. It turned out that the version compatibility checks were not strict enough, and HLS could load incompatible object files when running Template Haskell. In particular, you must build haskell-language-server with exactly the same version of GHC with which you compiled your package dependencies, so that object files for dependencies have the correct ABI.
  • Starting to take advantage of the recently completed support for Multiple Home Units in GHC to make HLS work more robustly for projects consisting of multiple components.

Conclusion

Well-Typed are grateful to Hasura for funding this work, as it will benefit the whole Haskell community. With their help we have made significant progress over the last two years in improving the debugging and profiling capabilities of the compiler, as well as the developer experience of using HLS. We look forward to continuing our productive collaboration in the future.

As well as experimenting with all these tools on Hasura's code base, we have
also been using them to great effect on GHC's code base, in order to reduce
memory usage and increase performance of the compiler itself (e.g. by profiling GHC compiling Hasura's graphql-engine). The new profiling tools have been useful in finding places to optimise: ghc-debug and -hi profiling made eliminating memory leaks straightforward, the late cost centre patch gives a great overview of where GHC spends its time, and ticky profiling gives a low-level view of allocations. They have also been very helpful for our work on improving HLS performance.

Well-Typed are actively looking for funding to continue maintaining and enhancing GHC and HLS. If your company relies on robust Haskell tooling and could support this work, or would like help improving the developer experience for your Haskell engineers, please get in touch with us via [email protected]!


  1. The number of bytes allocated acts as a proxy for the amount of computation performed, since Haskell programs tend to allocate frequently, and allocation counts are more consistent between runs than CPU or wall-clock time.

About the author

This post was written by Matthew from Well-Typed.

Matthew learned Haskell as an undergraduate at the University of Oxford before completing a PhD in Computer Science at the University of Bristol in 2020. His thesis focused on improvements to Typed Template Haskell and other challenges related to multi-stage programming. He has also contributed to the development of many open source Haskell libraries including ghcide, worked on profiling and debugging tools, and implemented many patches for GHC.

11 May, 2022