Simer Plaha

Compression @ SwayDB

Enabling full compression can reduce data size by 89.3%!


Compression is important for performance: more data can be stored & read into memory within a block, which reduces IOps and the cost of hosting data on servers or in the cloud. Over-compression, however, can get expensive because of the time required for decompression, which could affect read performance, but it can still be useful for applications that need cold storage or just high compression.

A configurable compression strategy is required to support unique storage requirements, allowing us to define how much & how often we want to compress data. For example, if your data file (Segment) is 100MB in size with 100K key-values, you can express your compression requirements like:

  • Compress the full 100MB data file.
  • Compress data every 4MB.
  • Compress data every 1000 key-values.
  • Compress all data but reset compression every 10th key-value.
  • Leave data uncompressed but compress every 10th key-value.
  • Compress binary-search-indexes & hash-indexes but not other data-blocks.

Note: data-block refers to a logical set of bytes (Array<Byte>) stored within a Segment, such as indexes, keys, values etc. A Segment itself is a data-block that stores other data-blocks within itself (Array<Array<Byte>>).

Compression strategies

You can combine, enable or disable any or all of the following compression strategies:

  • Internal-compression includes prefix compression and duplicate value elimination.
  • External-compression uses LZ4 and/or Snappy, which can be applied selectively to parts of a file or to the entire file.

Prefix compression

Prefix compression stores all keys in a compressed group format in their sorted order. Reading a single key from that group requires decompressing all keys that exist before the searched key within the group, so for read performance it is useful to leave some keys uncompressed. You can also prefix compress all keys if you just want high compression.

The following Map is configured to compress 4 keys into a group and start a new group at every 5th key. The boolean parameter named keysOnly is set to true, which applies prefix-compression to keys only; if false, prefix-compression is applied to keys and all metadata that gets written with each key, which results in higher compression.

Map<Integer, String, Void> map =  
  MapConfig  
    .functionsOff(Paths.get("myMap"), intSerializer(), stringSerializer())  
    .setSortedKeyIndex(  
      SortedKeyIndex  
        .builder()  
        .prefixCompression(new PrefixCompression.Enable(true, PrefixCompression.resetCompressionAt(5)))  
        ...
    )  
    .get();  

map.put(1, "one");  
map.get(1); //Optional[one]


In the following configuration prefix compression is applied to every 5th key.

.prefixCompression(new PrefixCompression.Enable(false, PrefixCompression.compressAt(5)))

Prefix compression can also be disabled, which optionally allows optimising the sorted-index for direct binary-search without creating a dedicated binary-search byte array. You can read more about this under normaliseIndexForBinarySearch in the SwayDB documentation.

.prefixCompression(new PrefixCompression.Disable(false))
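
If the snippet above reads the way I think it does (an assumption; this post does not name the parameter), the boolean passed to PrefixCompression.Disable is that normaliseIndexForBinarySearch flag, so enabling normalisation while keeping prefix compression off would look like:

.prefixCompression(new PrefixCompression.Disable(true)) //assumed: true = normaliseIndexForBinarySearch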

Duplicate value elimination

Time-series or event data like weather, electricity or solar readings often contains duplicate values. Such duplicates can be detected and eliminated with the following configuration.

ValuesConfig  
  .builder()  
  .compressDuplicateValues(true)  
  .compressDuplicateRangeValues(true)

Duplicate value elimination is very cost-effective because it does not create or leave decompression markers on compressed data; instead, all decompression information for that key is embedded within an already existing 1-2 byte space.

Range values created by the range APIs like remove-range, update-range & expire-range are most likely to have duplicate values and can be eliminated/compressed with the compressDuplicateRangeValues(true) config.
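
To make the kind of data this targets concrete, here is a minimal sketch with hypothetical solar readings. The map setup mirrors the earlier MapConfig snippet, and the ValuesConfig above is assumed to be wired into the same builder; long runs of keys share one value, which is exactly what duplicate value elimination compresses away.

Map<Integer, String, Void> readings =
  MapConfig
    .functionsOff(Paths.get("solarReadings"), intSerializer(), stringSerializer())
    //the ValuesConfig shown above is assumed to be applied to this builder
    ...
    .get();

//overnight output: hundreds of timestamps storing the same repeated value,
//which duplicate value elimination only needs to store once
for (int minute = 0; minute < 480; minute++) {
  readings.put(minute, "0.0 kW");
}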

External compression

Every data-block written into a Segment file is compressible! A Segment file is nothing special; it is just another data-block that stores other data-blocks within itself.

You will find a compression property that configures external compression in all data-blocks that form a Segment - SortedKeyIndex, RandomKeyIndex, BinarySearchIndex, MightContainIndex, ValuesConfig & SegmentConfig.

All LZ4 instances, as well as Snappy, are supported.

The following snippet demos how to apply compression to a SortedKeyIndex/linear-search-index. It tries LZ4 first, requiring a minimum of 20.0% compression savings; if the saving is lower than 20.0%, Snappy is tried with the same requirement.

SortedKeyIndex
  .builder()
  .compressions((UncompressedBlockInfo info) ->
    Arrays.asList(
      //try running LZ4 with minimum 20.0% compression  
      Compression.lz4Pair(
        new Pair(LZ4Instance.fastestJavaInstance(), new LZ4Compressor.Fast(20.0)),
        new Pair(LZ4Instance.fastestJavaInstance(), LZ4Decompressor.fastDecompressor())
      ),
      //if not try Snappy  
      new Compression.Snappy(20.0)
    )
    ...
  ) 

UncompressedBlockInfo provides the data size (info.uncompressedSize()) of the data-block being compressed, which can optionally be used to determine whether it should be compressed at all. For example, if the data size is already small you can skip compression by returning Collections.emptyList().

.compression(
  (UncompressedBlockInfo blockInfo) -> {
    if (blockInfo.uncompressedSize() < StorageUnits.mb(1)) {
      return Collections.emptyList(); //data too small - skip compression
    } else { //else do compression
      return {your compression};
    }
  }
)


How to apply compression at a file level? Similar to the above, you can apply file-level compression with SegmentConfig.

.setSegmentConfig(  
  SegmentConfig  
    .builder()  
    ...  
    .compression((UncompressedBlockInfo info) ->
       {your compression config here}
    )  
)

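
As a hedged sketch of what the placeholder above could contain, the following reuses the LZ4-then-Snappy list from the earlier SortedKeyIndex example and skips compression for Segments smaller than 1MB; it assumes nothing beyond what the previous snippets already show.

.setSegmentConfig(
  SegmentConfig
    .builder()
    ...
    .compression((UncompressedBlockInfo info) -> {
      if (info.uncompressedSize() < StorageUnits.mb(1)) {
        return Collections.emptyList(); //Segment too small - skip compression
      } else {
        return Arrays.asList(
          //try LZ4 first with minimum 20.0% compression
          Compression.lz4Pair(
            new Pair(LZ4Instance.fastestJavaInstance(), new LZ4Compressor.Fast(20.0)),
            new Pair(LZ4Instance.fastestJavaInstance(), LZ4Decompressor.fastDecompressor())
          ),
          //if not, try Snappy with the same requirement
          new Compression.Snappy(20.0)
        );
      }
    })
)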

How to limit compression by size & key-value count?

The property minSegmentSize sets the compressible size of a Segment: if the above compression property is defined for a data-block, that data gets compressed once for every minSegmentSize-sized block of data.

The property maxKeyValuesPerSegment also controls the compressible limit of a Segment: along with minSegmentSize, it limits the maximum number of key-values stored within a compressible Segment.

.setSegmentConfig(  
  SegmentConfig  
    .builder()  
    .minSegmentSize(StorageUnits.mb(4))
    .maxKeyValuesPerSegment(100000)
    ...
)

Summary

SwayDB's compression is highly configurable and can be tuned for unique storage requirements with different tradeoffs.
