Improved portability and performance 🦀

libimagequant is a library for generating high-quality palettes, useful for compression of transparent PNG images (~75% smaller!) and making nice GIF animations.

libimagequant is now a pure Rust library. The new version is a drop-in replacement (ABI-compatible), so C projects can continue using it. The C version will be supported for a while to give library users and distros time to switch.

Why?

In short: There's no standard for building C programs. It's a non-portable mess, and a time sink. Cross-compilation of OpenMP was the last straw. Rust/Cargo is much more dependable, and enables me to support more features on more platforms.

I didn't plan to rewrite the library yet. The C codebase is stable and gets the job done. However, I've switched to an M1 Mac, and this made x86-64 a cross-compilation target for me. Building of redistributable executables with OpenMP was already an annoyance, but cross-compilation added a whole another level of brokenness to deal with. It felt silly to get a new fast CPU only to emulate an old one, or rely on someone else's CPU in the cloud instead. Over the holiday break I tested whether it would be easier to rewrite the library in Rust instead, and indeed it was.

Why not just drop multi-threading, or OpenMP?

That's what I used to do: OpenMP was optional, but being limited to 1/8th or 1/16th of a CPU sucked. I use libimagequant in gif.ski, which for sad GIFfy reasons is CPU-intensive, with quantization is in its hot path, so it really matters.

OpenMP was actually quite nice for basic for loops — when it worked. Unfortunately, OpenMP is not a library, but a compiler extension, so it's at mercy of built-in compiler support. It can be outdated, incomplete, or completely absent.

I could have kept the library in C and replaced OpenMP with some other solution instead, but I wasn't keen on reinventing this wheel. Even basic things like spawning a thread run into C's portability woes. The platonic ideal portable C exists only as a hypothetical construct in the C standard. The C that exists in the real world is whatever Microsoft, Apple, and others have shipped. That C is a mess of vendor-specific toolchains, each with its own way of doing things, missing features, broken headers, and leaky abstractions. Shouting "it's not C's fault, screw <insert vendor name>!" doesn't solve the problem, but switching to Rust does.

How?

Some parts were simply rewritten from scratch. Some parts were converted with Citrus, which is my C-to-Rust "transpiler". Unlike c2rust, it's not semantically accurate, but it generates readable Rust source code that is easier to clean up. I did the first quick'n'dirty (but fully functional) conversion at a rate of about 1000 lines per day. Later rewriting, refactoring and polishing took the average down to under 200 lines per day. rust-analyzer, cargo clippy, and cargo fix have been very helpful.

Results

I'm quite happy with the result. 3425 lines of mostly C have been replaced with 3413 lines of Rust. The new version exports the same ABI, so for C programs and FFI users (Python, Java, etc.) it's a drop-in replacement. The new code uses the same algorithm, but there are minor differences from adaptation to Rust's idioms.

It's a coincidence that the number of lines ended up being so close, because different parts of code translated to significantly different amount of lines. For example, I was able to delete pngquant's entire DIY hash table implementation, and use one from Rust's standard library instead. OTOH I've written an abstraction for iterating over images, which has replaced a few lines of pointer arithmetic with a bunch of objects, methods, and interfaces.

Executable size

The Rust executable statically links with the Rust's standard library (unlike C executables that usually use OS's libc) and this costs ~260KB of executable size. I'm not happy about this, but fortunately it's only a one-time cost that doesn't grow with code size. Rust's standard library can be disabled, so I have an option of trimming it down if necessary, but I'm not that bothered by it.

I've compared sizes of executables (aarch64, with LTO, panic=abort, and stripped). Apart from the libstd overhead, byte size of the single-threaded Rust version of the library has grown by about 11% compared to the C version without OpenMP. Rust version using rayon was actually 4% smaller than with OpenMP linked statically. I've liberally used high-level Rust abstractions, generic libstd containers, and a few dependencies, so I'm very happy that the Rust version turned out to be reasonably small. rayon is a substantial library with thread pools, work-stealing queues, parallel iterators, and many generic interfaces. It being smaller than the "compiler-native" OpenMP was a nice surprise too.

Performance

Performance of the Rust version is similar, with a couple of things noticeably faster:

Rust has also added a bunch tiny overheads all over the place, like glue code for panics and destructors, extra bounds checks, use of 64-bit usize instead of 32-bit int, but these things did not show up in profiling, and other wins outweigh them.

I think it's an interesting comparison, because it wasn't a rewrite of an old legacy codebase with a shiny new design. The C code was as good as I could make it, and the Rust version uses the same algorithm (with different bugs ;)

Nice things

I've tried to be careful about implicit numeric type conversions in C (including mixing of float and double), but there were a few that I've missed. They became clear after conversion to Rust. Rust is strict about numeric types to the point of being annoying, but at least I know the only type conversions are the ones that I wrote.

Rust's enums worked well. The library has various ways of representing images in memory (e.g. they can be dynamically streamed without buffering). In C this needed a few inter-dependent nullable pointers and boolean flags, which were hard to keep track of. In Rust it translated nicely to different sets of enum fields.

Lifetime annotations helped me catch a bug. There's an image convolution pass that needs to look at pixels from the current, previous, and next line at the same time. However, the statefully-complex image may have had only a 1-line buffer, so all three line reads would end up overwriting each other in that buffer. In Rust such loop didn't compile due to a borrow checking error (can't mutably borrow the internal temp buffer more than once).

When users pass invalid pointers to the library (freed memory or other garbage) it makes my code to crash. Users report it to me as my bug, and I end up debugging their bugs. The good news is that it doesn't happen in Rust, but I still expose a C API, so I've ported my defence from C: I first dereference all user-supplied C pointers in a specific function named its_your_fault_not_mine (or a more professional version of it). This triggers the crashes immediately, right there, rather than sometime later deep in my code, so the crash stacktrace gets a built-in disclaimer. In C, there's no standard feature for preventing a function from being inlined or optimized out. There's no standard for making a "useless" memory read happen without causing a compiler warning. In Rust, there are explicit features for all of this. BTW, volatile in Rust is a function call. That makes so much more sense!

Rust has x86 SIMD intrinsics and CPU detection built-in, so I was able to port my SSE code easily. When I first considered a Rust rewrite a few years ago they weren't there yet, so that's a reminder that Rust is improving.

Corners cut

The C API had support for custom allocators. I've dropped it. Rust has its own mechanism for configuring a global allocator which is hopefully enough. Custom allocators per object are possible, but I don't think it's worth the hassle.

In C I've used a "flexible array member" trick to allocate structs with a head and variable-length data in one malloc. The same technique is doable in Rust with some gymnastics, but also turned out to be entirely unnecessary. Rust prefers to return objects "by value", but with calling convention magic or inlining avoid actually copying them. This usually eliminates heap allocations of structs entirely, or at least makes structs with variable-length data still cost one heap allocation, not two. It was a microoptimization anyway.

The C library used its own memory pool to avoid small allocations. It was probably an overkill even for C, and didn't improve performance in Rust (I've tested it), so I've dropped it.

Quirks

The C API promised to hold image data allocated with malloc, and even dynamically toggle whether it should free it or not. From idiomatic Rust perspective that's weird and sloppy. I've managed to fully reimplement it, but it was 130 lines of not-so-pretty code that shouldn't need to exist in a Rust library.

Neither C nor Rust do a good job of autovectorizing my code that uses 4xf32 pixels. I've tried giving them a 16-byte alignment, reordering operations to be as close to SSE instructions as possible, enabled arch=native, and so on, but it's a quartet of single-data instructions every time, in both languages.

Rust isn't really an object-oriented language, but it has enough syntax sugar to look like one. Trying to make the library more object-oriented exposed a "drum-stick" design issue: does play(drum, stick) translate to drum.play(stick) or stick.hit(drum)? The C library only had global functions, and all fields of all structs were public to the library (opaque structs in C are for the public API/ABI, not the internals), so "where does this code belong" was never a question. The Rust-OO conversion ended up with a few odd functions that don't fit anywhere, and more crate-public fields than I'd like.

By default Rust assumes that any structure containing a C pointer is not thread-safe, but I've inherited an API that is full of C pointers! I've been forced to manually override it with "trust me, it's thread-safe" in a few places. Usually it's Rust telling me whether my code is thread-safe or not, so I'm uneasy about the overrides. miri test passes.

Parallelism

Rust rayon is conceptually similar to OpenMP, but instead of #pragma omp for there's par_iter(). It has delivered the same linear speedups where possible.

The C version used buffers[omp_get_thread_num()] to have exclusive per-thread buffers. This was basically a DIY thread-local storage, so I've used thread-local storage in Rust too. For summing-up results from multiple threads Rayon supports parallel fold/reduce, but my data is simple enough that summing it up serially at the end is slightly faster (parallel reduce adds intermediate steps that didn't pay off here).

I've parallelized all the simple cases. I'd like to tackle more, but the rest requires bigger architectural and algorithmic changes, which is too much for the first release. Rayon is easy to safely experiment with, and Rust is very nice to refactor, so I hope to do more later.

Conclusion

I'm ecstatic that I can now support all platforms from Android to Windows, with the same cargo build command. I won't have to worry again about the ___aarch64_cas8_acq_rel symbol, vcomp.dll, compiler-private lib dirs, ignored #pragmas, or #ifdefs for vendor-specific bugs/features. The next gif.ski will be 2.2 times faster.

Where/when to get it?

I've released the Rust library as a beta on crates.io. The source is on GitHub. I haven't published it in pngquant.org tarballs yet, because I still have some things to finish in pngquant exe, update all build/release processes, packages, documentation, etc.


Jan 2022, by Kornel.

Return to home page.