PyOxidizer 0.7
April 09, 2020 at 09:00 PM | categories: Python, PyOxidizer
I am very pleased to announce the 0.7 release of PyOxidizer, a modern Python application packaging tool.
There are a host of notable new features in this release. You can read all about them in the project history.
I want to use this blog post to call out the more meaningful ones.
I started PyOxidizer as a science experiment of sorts: I set out to prove the hypothesis that it was possible to produce high performance single file executables embedding Python and all of its resources (Python modules, non-module resource files, compiled extensions, etc). PyOxidizer has achieved this on Windows, Linux, and macOS since its very earliest releases. Hypothesis confirmed!
In order to actually achieve single file executables, you have to fundamentally change aspects of Python's behavior. Some of these changes invalidate deeply rooted assumptions about how Python works, such as the existence of __file__ in modules. As you can imagine, these broken assumptions translated to numerous compatibility issues and PyOxidizer didn't work with many popular Python packages.
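To make the __file__ assumption concrete, here is a minimal sketch of the kind of code that breaks under in-memory importing, along with a defensive variant. The helper name data_dir is hypothetical and the in-memory module is simulated with types.ModuleType; this is an illustration, not code from PyOxidizer.

```python
import json
import os
import types


def data_dir(module):
    """Return the directory holding a module's file, or None when the
    module was not loaded from the filesystem (it has no __file__)."""
    path = getattr(module, "__file__", None)
    if path is None:
        return None
    return os.path.dirname(os.path.abspath(path))


# A module object created in memory carries no __file__ attribute,
# which is the situation a package sees when imported from memory.
in_memory = types.ModuleType("in_memory_example")

# json is normally loaded from a file, so it has a __file__.
from_disk_dir = data_dir(json)
from_memory_dir = data_dir(in_memory)
```

Packages that call os.path.dirname(__file__) unconditionally raise an error in the second case, which is the class of compatibility issue described above.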
With the science experiment phase of PyOxidizer out of the way, I have been making a concerted effort to broaden the user base of PyOxidizer. While single file executables can be an amazing property, it isn't critical for many use cases and the issues it was causing were preventing people from exploring PyOxidizer.
This brings us to what I think are the major new features in PyOxidizer 0.7.
Better Support for Loading Extension Modules
Earlier versions of PyOxidizer insisted that you compile Python (C) extension modules from source and statically link them into a produced binary. This requirement prevented the use of pre-built extension modules (commonly found in Python binary wheels available on PyPI) with PyOxidizer, forcing people to compile them locally. While this often just worked, it frequently failed on complex extension modules and on Windows.
PyOxidizer now supports loading compiled extension modules from standalone files (typically .so or .pyd files, which are actually shared libraries). There are still some sharp edges and known deficiencies. But in many cases, if you tell PyOxidizer to run pip install and package the result, pre-built wheels can be installed and PyOxidizer will pick up the standalone files.
On Windows, PyOxidizer even supports embedding the shared library data into the produced .exe and loading the .pyd/DLL directly from memory.
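As a rough illustration of what "standalone extension module files" means in practice, the sketch below uses Python's own importlib.machinery.EXTENSION_SUFFIXES to recognize them by filename. The helper is_extension_file is a hypothetical name, and this is not how PyOxidizer itself classifies files.

```python
import importlib.machinery


def is_extension_file(filename: str) -> bool:
    """True if the filename ends with a suffix this interpreter would
    accept for a compiled extension module (for example .so or .pyd)."""
    return filename.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES))


# The exact suffixes depend on the platform and Python version.
suffixes = list(importlib.machinery.EXTENSION_SUFFIXES)
```

Printing suffixes on a given interpreter shows which file names from an installed wheel would be treated as extension modules there.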
Loading Resources from the Filesystem
Binaries built with PyOxidizer contain a blob holding an index of available Python resources along with their data.
Earlier versions of PyOxidizer only allowed you to define resources as in-memory. If the resource was defined in this blob, it was imported from memory. Otherwise it wasn't known to PyOxidizer. You could still install files next to the produced binary and tell PyOxidizer to enable Python's default filesystem-based importer. But PyOxidizer didn't explicitly know about these files on the filesystem.
In PyOxidizer 0.7, the blob index of Python resources is able to express different locations for each resource. Currently, a resource can have its data made available in-memory or filesystem-relative. in-memory works as before: the raw data is embedded next to the index in the binary and loaded from memory (using 0-copy). filesystem-relative encodes a filesystem path to the resource. During packaging, PyOxidizer will place the resource next to the executable (using a typical Python file layout scheme) and store the relative path to that resource in the resources index.
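The idea of one index entry carrying either embedded data or a relative path can be sketched as a small data structure. Everything here (the Resource class, its field names, the example module names and paths) is hypothetical and only models the concept; PyOxidizer's real index is a binary blob with its own format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Resource:
    """One entry in a hypothetical resource index."""
    name: str
    in_memory: Optional[bytes] = None    # raw data embedded in the binary
    relative_path: Optional[str] = None  # path relative to the executable

    def location(self) -> str:
        """Report which of the two locations this entry uses."""
        if self.in_memory is not None:
            return "in-memory"
        if self.relative_path is not None:
            return "filesystem-relative"
        raise ValueError(f"resource {self.name!r} has no data location")


def build_index(resources: Iterable[Resource]) -> Dict[str, Resource]:
    """Key every resource by its name for constant-time lookup."""
    return {r.name: r for r in resources}


index = build_index([
    Resource("app.main", in_memory=b"print('hello')\n"),
    Resource("vendored.pkg", relative_path="lib/vendored/pkg/__init__.py"),
])
```

A lookup such as index["vendored.pkg"].location() then tells the importer whether to read bytes it already holds or to open a file next to the executable.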
The filesystem-relative resource indexing feature has a few implications for PyOxidizer.
First, it is more standard. When PyOxidizer loads a Python module from the filesystem, it sets __file__, __path__, etc and the module semantics should behave as if the file were imported by Python's standard importer. This means that if a package is having issues with in-memory importing, you can simply fall back to filesystem-relative to get standard Python behavior and everything should just work.
Second, PyOxidizer's filesystem resource loading is faster than Python's! When Python's standard importer goes to import a module, it needs to stat() various paths to first locate the file. It then performs some sanity checking and other minor actions before actually importing the module. All of this has overhead. Since the goal of PyOxidizer is to produce standalone applications and applications should be immutable, PyOxidizer can avoid most of this overhead. PyOxidizer simply tries to open() and read() the relative path baked into the resource index at build time. If that works, the resource is loaded. Else there is a failure. The code path in PyOxidizer to locate a Python resource is effectively a lookup in a Rust HashMap<&str, T>.
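The lookup-then-read approach described above can be approximated in pure Python with a custom meta path finder. This is a toy sketch under stated assumptions: the class name IndexedFinder, the module name oxidized_demo_mod, and the dict-based index are all made up for illustration, and PyOxidizer's real importer is written in Rust and does considerably more.

```python
import importlib
import importlib.abc
import importlib.util
import os
import sys
import tempfile


class IndexedFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Toy importer: module name -> relative path, fixed ahead of time."""

    def __init__(self, root, index):
        self.root = root
        self.index = index  # a dict lookup replaces the stat() search

    def find_spec(self, fullname, path=None, target=None):
        rel = self.index.get(fullname)
        if rel is None:
            return None  # not in the index; let other finders try
        origin = os.path.join(self.root, rel)
        spec = importlib.util.spec_from_loader(fullname, self, origin=origin)
        spec.has_location = True  # makes the import system set __file__
        return spec

    def create_module(self, spec):
        return None  # use Python's default module creation

    def exec_module(self, module):
        # Just open() and read() the recorded path; fail if it is missing.
        with open(module.__spec__.origin, "rb") as fh:
            source = fh.read()
        exec(compile(source, module.__spec__.origin, "exec"), module.__dict__)


# Demo: write one module to a temporary directory and import it via the index.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "oxidized_demo_mod.py"), "w") as fh:
        fh.write("VALUE = 42\n")
    finder = IndexedFinder(root, {"oxidized_demo_mod": "oxidized_demo_mod.py"})
    sys.meta_path.insert(0, finder)
    try:
        module = importlib.import_module("oxidized_demo_mod")
    finally:
        sys.meta_path.remove(finder)

demo_value = module.VALUE
demo_file = module.__file__
```

The Python dict plays the role of the HashMap here: locating a module is a single key lookup, and no directories are scanned or stat()ed along the way.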
I thought it would be interesting to isolate the performance benefits of this new feature. I ran Mercurial's test harness with different variants of hg on Linux on my Ryzen 3950X.
- traditional - A hg script with a #!/path/to/python3.7 shebang.
- oxidized - A hg executable built with PyOxidizer, without PyOxidizer's custom module importer.
- filesystem - A hg executable built with PyOxidizer using the new filesystem-relative resource index.
- in-memory - A hg executable built with PyOxidizer with all resources loaded from memory (how PyOxidizer has traditionally worked).
The results are quite clear:
Variant | CPU Time (s) | Delta (s) | % Orig |
---|---|---|---|
traditional | 11,287 | 0 | 100 |
oxidized | 10,735 | -552 | 95.1 |
filesystem | 10,186 | -1,101 | 90.2 |
in-memory | 9,883 | -1,404 | 87.6 |
We see a nice win just from using a native executable built with PyOxidizer (traditional to oxidized).
Then from oxidized to filesystem we see another jump of ~5%. This difference is attributed to using PyOxidizer's Rust-powered importer with an index of resources available on the filesystem. In other words, all that work that Python's standard importer is doing to discover files and then operate on them is non-trivial!
Finally, the smaller jump from filesystem to in-memory isolates the benefits of importing resource data from memory instead of involving filesystem I/O. (Filesystems are generally slow.) While I haven't measured explicitly, I hypothesize that macOS and Windows will see a bigger jump between these two variants, as the filesystem performance on these platforms generally isn't as good as it is on Linux.
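For readers who want to check the arithmetic, the Delta and % Orig columns follow directly from the CPU times in the table. The short sketch below recomputes them relative to the traditional variant; the variable names are mine.

```python
# CPU time in seconds for each variant, taken from the table above.
cpu_seconds = {
    "traditional": 11287,
    "oxidized": 10735,
    "filesystem": 10186,
    "in-memory": 9883,
}

baseline = cpu_seconds["traditional"]

# For each variant: (delta vs. baseline in seconds, percent of baseline).
rows = {
    name: (secs - baseline, round(100 * secs / baseline, 1))
    for name, secs in cpu_seconds.items()
}

# Step between consecutive variants, as a percentage of the baseline.
oxidized_to_filesystem = round(
    100 * (cpu_seconds["oxidized"] - cpu_seconds["filesystem"]) / baseline, 1
)
```

The oxidized to filesystem step works out to 4.9% of the baseline, which is the "~5%" jump mentioned above.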
PyOxidizer's Future
With PyOxidizer now supporting a couple of much-needed features that open it up to a broader set of users, I'm hoping that future releases continue to broaden its utility.
The over-arching goal of PyOxidizer is to solve large aspects of the Python application packaging and distribution problem. So far a lot of focus has been spent on the former. PyOxidizer in its current form can materialize files on the filesystem that you can copy or package up manually and distribute. But I want these processes to be part of PyOxidizer: I want it to be possible for PyOxidizer to emit a Windows MSI installer, a macOS dmg, a Debian package, etc for a Python application.
In order to support the aforementioned marquee features of this PyOxidizer release, I had to pay down a lot of technical debt in the code base left over from the science experiment phase of PyOxidizer's inception.
In the short term, I plan to continue shoring up the code base and rounding out support for features requested in the issue tracker on GitHub. The next release of PyOxidizer will also likely require Python 3.8, as this will improve run-time control over the embedded Python interpreter and enable PyOxidizer to better support package metadata (importlib.metadata), enabling support for features like entry points.
I've also been thinking about extracting PyOxidizer's custom module importer to be usable as a standalone Python extension module. I think there's some value in publishing a pyoxidizer_importer package on PyPI that you can easily add to your installed packages to speed up Python's standard filesystem importer by a few percent. If nothing else, this may drum up interest in the larger Python community for standardizing a format for serializing Python resources in a single file. Perhaps we can get other Python packaging tools producing the same packed resources data blob that PyOxidizer uses so we can all standardize on a more efficient mechanism for loading Python modules. Time will tell.
Enjoy the new release. File issues at https://github.com/indygreg/PyOxidizer as you encounter them.