Source: https://github.com/facebook/rocksdb/wiki/Universal-Compaction

Overview

Universal Compaction Style is a compaction style targeting use cases that require low write amplification, at the cost of relatively higher read amplification and space amplification.

With this compaction style, all SST files are organized as sorted runs covering the whole key range. A sorted run contains the data generated during a time range, and different sorted runs never overlap in their time ranges. Compaction can only happen among two or more sorted runs with adjacent time ranges, and the output is a single sorted run whose time range is the union of the input sorted runs' time ranges. After any compaction, the invariant that sorted runs never overlap in time ranges still holds. A sorted run can be implemented as an L0 file, or as a "level" whose data is stored in files partitioned by key range.

Limitations

Double Size Issue

In universal style compaction, a full compaction is sometimes needed, in which case the output data size is similar to the input size. During the compaction, both the input files and the output files need to be kept, so the DB temporarily doubles its disk space usage. Make sure there is enough free space for a full compaction.

DB (Column Family) Size if num_levels=1

When using Universal Compaction with num_levels = 1, all data of the DB (or, to be precise, the Column Family) is sometimes compacted into one single SST file. There is a limitation on the size of a single SST file: in RocksDB, a block cannot exceed 4GB (so that its size fits in a uint32). If the single SST file is too large, the index block can exceed this limit. The size of the index block depends on the data. In one of our use cases, we observed the DB hitting this limit when it grew to about 250GB with a 4K data block size.

This problem is mitigated if users set num_levels to be much larger than 1. In that case, larger "files" are put into larger "levels", where they are divided into smaller files (see the description below). L0->L0 compaction can still happen for parallel compactions, but most files on L0 will be much smaller. A minimal configuration sketch follows.
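
As a minimal sketch, opening a DB with universal compaction and num_levels larger than 1 might look like this (the path and the value 7 are illustrative placeholders, not recommendations):

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Switch from the default leveled style to universal compaction.
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      // With num_levels much larger than 1, a large sorted run is stored as
      // key-range-partitioned files in a non-zero level instead of one huge
      // L0 file, so the index block stays below the 4GB block limit.
      options.num_levels = 7;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/universal_db", &db);
      if (!s.ok()) return 1;
      delete db;
      return 0;
    }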

Data Layout and Placement

Sorted Runs

As mentioned above, the data is organized as sorted runs. A sorted run is laid out by the update time of the data in it and is stored either as L0 files or as a whole "level".

Here is an example of a typical file layout:

Level 0: File0_0, File0_1, File0_2
Level 1: (empty)
Level 2: (empty)
Level 3: (empty)
Level 4: File4_0, File4_1, File4_2, File4_3
Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7

Levels with larger numbers contain older sorted runs than levels with smaller numbers. In this example, there are 5 sorted runs: the three files on level 0, plus level 4 and level 5. Level 5 is the oldest sorted run, level 4 is newer, and the level 0 files are the newest.

Placement of Compaction Outputs

Compaction is always scheduled for sorted runs with consecutive time ranges, and the output is always a new sorted run. Following the rule that levels with larger numbers hold older data, we always place the compaction output into the highest possible level. (A sketch of this rule follows the examples below.)

Using the same example as above, we have the following sorted runs:

File0_0
File0_1
File0_2
Level 4: File4_0, File4_1, File4_2, File4_3
Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7

If we compact all the data, the resulting sorted run will be placed in level 5, so the layout becomes:

Level 5: File5_0', File5_1', File5_2', File5_3', File5_4', File5_5', File5_6', File5_7'

Starting from the initial state again, let's see how the output run is placed if we schedule some other compactions instead:

If we compact File0_1, File0_2 and level 4, the output run is placed in level 4.

Level 0: File0_0
Level 1: (empty)
Level 2: (empty)
Level 3: (empty)
Level 4: File4_0', File4_1', File4_2', File4_3'
Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7

If we compact File0_0, File0_1 and File0_2, the output run is placed in level 3.

Level 0: (empty)
Level 1: (empty)
Level 2: (empty)
Level 3: File3_0, File3_1, File3_2
Level 4: File4_0, File4_1, File4_2, File4_3
Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7

If we compact only File0_0 and File0_1, the output run is still placed in level 0, because the older run File0_2 remains there.

Level 0: File0_0', File0_2
Level 1: (empty)
Level 2: (empty)
Level 3: (empty)
Level 4: File4_0, File4_1, File4_2, File4_3
Level 5: File5_0, File5_1, File5_2, File5_3, File5_4, File5_5, File5_6, File5_7
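
The placement rule in these examples can be summarized in a small sketch. This is only an illustration of the rule described above, not RocksDB's actual implementation; the function name and parameters are made up for this example.

    // oldest_input_level:       largest level among the compaction inputs
    //                           (0 if only L0 files are being compacted)
    // inputs_include_oldest_l0: whether the inputs cover the oldest L0 run
    // next_older_run_level:     level of the newest sorted run older than
    //                           all inputs; -1 if no older run exists
    int PickOutputLevel(int oldest_input_level, bool inputs_include_oldest_l0,
                        int next_older_run_level, int num_levels) {
      if (oldest_input_level > 0) {
        // An input already lives on a non-zero level: reuse that level.
        return oldest_input_level;
      }
      if (!inputs_include_oldest_l0) {
        // An older sorted run remains on L0, so the output must stay on L0
        // to keep larger level numbers holding older data.
        return 0;
      }
      if (next_older_run_level < 0) {
        // Nothing older exists: place the output on the bottom level.
        return num_levels - 1;
      }
      // Highest possible level just above the next older sorted run.
      return next_older_run_level - 1;
    }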

Special case options.num_levels=1

If options.num_levels=1, we still follow the same placement rule: all files are placed under level 0, and each file is a sorted run. The behavior is the same as the original universal compaction, so it can be used as a backward-compatible mode.

Compaction Picking Algorithm

Assuming we have sorted runs

    R1, R2, R3, ..., Rn

where R1 contains the data of the most recent updates to the DB and Rn contains the data of the oldest updates. Note that the ordering is by the age of the data inside the sorted runs, not by the sorted runs themselves. According to this sorting order, after a compaction, the output sorted run is always placed into the position where the inputs were.

Here is how all compactions are picked up:

Precondition: n >= options.level0_file_num_compaction_trigger

Unless the number of sorted runs reaches this threshold, no compaction will be triggered at all.

(Note that although the option name uses the word "file", the trigger counts "sorted runs", for historical reasons. In the names of all options mentioned below, "file" also means sorted run for the same reason.)

If the precondition is satisfied, there are three conditions, each of which can trigger a compaction:

1. Compaction Triggered by Space Amplification

If the estimated size amplification ratio is larger than options.compaction_options_universal.max_size_amplification_percent / 100, all files will be compacted to one single sorted run.

Here is how the size amplification ratio is estimated:

    size amplification ratio = (size(R1) + size(R2) + ... + size(Rn-1)) / size(Rn)

Please note that the size of Rn is not included: 0 is the perfect size amplification, 100 means the DB size is double the space of the live data, and 200 means triple.

The reason we estimate size amplification this way: in a DB of stable size, the incoming rate of deletions should be similar to the rate of insertions, which means every sorted run except Rn should contain a similar number of insertions and deletions. After applying R1, R2, ..., Rn-1 to Rn, their size effects cancel each other out, so the output should have roughly the size of Rn, which is the size of the live data and is used as the baseline of size amplification.
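
As a sketch, the estimate and the trigger check could be written as follows (illustrative code, not RocksDB's implementation; run sizes are ordered newest first, R1..Rn):

    #include <cstdint>
    #include <vector>

    // Returns true if the space-amplification condition would trigger a
    // full compaction of all sorted runs.
    bool SpaceAmpTriggersFullCompaction(const std::vector<uint64_t>& run_sizes,
                                        uint64_t max_size_amplification_percent) {
      if (run_sizes.size() < 2) return false;
      uint64_t newer_total = 0;  // size(R1) + size(R2) + ... + size(Rn-1)
      for (size_t i = 0; i + 1 < run_sizes.size(); ++i) {
        newer_total += run_sizes[i];
      }
      // ratio > percent / 100, rearranged to avoid floating point:
      return newer_total * 100 > run_sizes.back() * max_size_amplification_percent;
    }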

If options.compaction_options_universal.max_size_amplification_percent = 25, we keep the total space of the DB under 125% of the total size of the live data. Let's use this value in an example. Assume compaction is only triggered by space amplification, options.level0_file_num_compaction_trigger = 1, the file size after each mem table flush is always 1, and the compacted size always equals the total input size. After two flushes, we have two files of size 1; since 1/1 > 25%, we need to do a full compaction:

1 1  =>  2

After another mem table flush we have

1 2   =>  3

This again triggers a full compaction because 1/2 > 25%. And in the same way:

1 3  =>  4

But after another flush, the compaction is not triggered:

1 4

Because 1/4 <= 25%. Another mem table flush will trigger another compaction:

1 1 4  =>  6

Because (1+1) / 4 > 25%.

And it keeps going like this:

1
1 1  =>  2
1 2  =>  3
1 3  =>  4
1 4
1 1 4  =>  6
1 6
1 1 6  =>  8
1 8
1 1 8
1 1 1 8  =>  11
1 11
1 1 11
1 1 1 11  =>  14
1 14
1 1 14
1 1 1 14
1 1 1 1 14  =>  18
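
This progression can be reproduced with a toy simulation of the rules above, under the same assumptions (flush size always 1, level0_file_num_compaction_trigger = 1, max_size_amplification_percent = 25); this is a sketch, not RocksDB code:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
      std::vector<uint64_t> runs;  // sorted run sizes, newest first
      for (int flush = 0; flush < 20; ++flush) {
        runs.insert(runs.begin(), 1);  // a mem table flush adds a run of size 1
        uint64_t newer_total = 0;      // everything except the oldest run
        for (size_t i = 0; i + 1 < runs.size(); ++i) newer_total += runs[i];
        if (runs.size() >= 2 && newer_total * 100 > runs.back() * 25) {
          runs.assign(1, newer_total + runs.back());  // full compaction
        }
        for (uint64_t size : runs) std::printf("%llu ", (unsigned long long)size);
        std::printf("\n");  // prints the state after each flush, as above
      }
      return 0;
    }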

2. Compaction Triggered by Individual Size Ratio

We calculate the size ratio trigger as

     size_ratio_trigger = 1 + options.compaction_options_universal.size_ratio / 100

Usually options.compaction_options_universal.size_ratio is close to 0, so the size ratio trigger is close to 1.

We start from R1: if size(R2) / size(R1) <= size_ratio_trigger, then (R1, R2) qualify to be compacted together. We continue from there to determine whether R3 can be added too: if size(R3) / size(R1 + R2) <= size_ratio_trigger, we include (R1, R2, R3). We then do the same for R4, and keep comparing the total size of the candidates so far to the next sorted run until the size ratio trigger condition no longer holds.
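
Here is a sketch of this picking loop (illustrative, not RocksDB's actual code), assuming run sizes are given newest first:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Returns how many runs, starting from R1, qualify to be compacted
    // together under the individual size ratio condition.
    size_t PickBySizeRatio(const std::vector<uint64_t>& run_sizes,
                           uint64_t size_ratio /* percent, usually ~0 */) {
      if (run_sizes.empty()) return 0;
      uint64_t candidate_total = run_sizes[0];  // start from R1
      size_t picked = 1;
      // size(next) / candidate_total <= 1 + size_ratio/100, rearranged
      // into integer arithmetic:
      while (picked < run_sizes.size() &&
             run_sizes[picked] * 100 <= candidate_total * (100 + size_ratio)) {
        candidate_total += run_sizes[picked];
        ++picked;
      }
      return picked;  // compact R1..Rpicked, subject to min/max_merge_width
    }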

Here is an example to make it easier to understand. Assume options.compaction_options_universal.size_ratio = 0, the total mem table flush size is always 1, the compacted size always equals the total input size, compaction is only triggered by the individual size ratio, and options.level0_file_num_compaction_trigger = 1 (so the number of files won't block a compaction from running). We start with only one file of size 1. After another mem table flush, we have two files of size 1, which triggers a compaction:

1 1  =>  2

After another mem table flush,

1 2  (no compaction triggered)

which doesn't trigger a compaction because 2/1 > 1. But another mem table flush will trigger a compaction of all the files:

1 1 2  =>  4

This is because 1/1 <= 1 and 2/(1+1) <= 1.

The compaction will keep working like this:

1 1  =>  2
1 2  (no compaction triggered)
1 1 2  =>  4
1 4  (no compaction triggered)
1 1 4  =>  2 4
1 2 4  (no compaction triggered)
1 1 2 4 => 8
1 8  (no compaction triggered)
1 1 8  =>  2 8
1 2 8  (no compaction triggered)
1 1 2 8  =>  4 8
1 4 8  (no compaction triggered)
1 1 4 8  =>  2 4 8
1 2 4 8  (no compaction triggered)
1 1 2 4 8  =>  16
1 16  (no compaction triggered)
......

A compaction is only triggered when the number of input sorted runs is at least options.compaction_options_universal.min_merge_width, and the number of sorted runs used as inputs is capped at options.compaction_options_universal.max_merge_width.

3. Compaction Triggered by Number of Sorted Runs

If neither 1 nor 2 is triggered when we try to schedule a compaction, we try to compact whichever sorted runs from R1, R2, ... are needed so that after the compaction the total number of sorted runs is not greater than options.level0_file_num_compaction_trigger + 1. If we would need to compact more than options.compaction_options_universal.max_merge_width sorted runs, we cap the number at options.compaction_options_universal.max_merge_width.
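
A sketch of how many of the newest sorted runs would be picked under this rule (illustrative, not RocksDB's actual code):

    #include <algorithm>
    #include <cstddef>

    size_t NumRunsToCompact(size_t num_sorted_runs,
                            size_t trigger,  // level0_file_num_compaction_trigger
                            size_t max_merge_width) {
      if (num_sorted_runs <= trigger) return 0;  // precondition not met
      // Merging k runs into one reduces the count by k - 1, so to end up
      // with at most trigger + 1 runs we need k = num_sorted_runs - trigger.
      size_t k = num_sorted_runs - trigger;
      return std::min(k, max_merge_width);
    }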

"Try to schedule" I mentioned below will happen when after flushing a memtable, finished a compaction. Sometimes duplicated attempts are scheduled.

See Universal Style Compaction Example for how output sorted run sizes look under a common setting.

Parallel compactions are possible if options.max_background_compactions > 1. As with all other compaction styles, parallel compactions will not work on the same sorted run.

Subcompaction

Subcompactions are supported in universal compaction. If the output level of a compaction is not "level" 0, we try to range-partition the inputs and use up to options.max_subcompactions threads to compact them in parallel. This helps with the problem that a full compaction in universal compaction takes too long.

Options to Tune

The following options affect universal compaction (a combined configuration sketch follows the list):

  • options.compaction_options_universal: various options mentioned above
  • options.level0_file_num_compaction_trigger: the triggering condition of any compaction. It also means that after all compactions finish, the number of sorted runs will be under options.level0_file_num_compaction_trigger+1
  • options.level0_slowdown_writes_trigger: if the number of sorted runs exceeds this value, writes will be artificially slowed down.
  • options.level0_stop_writes_trigger: if the number of sorted runs exceeds this value, writes will stop until a compaction finishes and the number of sorted runs drops below this threshold.
  • options.num_levels: if this value is 1, all sorted runs will be stored as level 0 files. Otherwise, we try to fill non-zero levels as much as possible. The larger num_levels is, the less likely we are to have large files on level 0.
  • options.target_file_size_base: effective if options.num_levels > 1. Files in levels other than level 0 will be cut so that no file is larger than this threshold.
  • options.target_file_size_multiplier: it takes effect, but we don't know of a way to use it in universal compaction that makes sense, so we don't recommend tuning it.
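
Putting the tunable options together, a configuration might look like the following sketch; every value is an illustrative placeholder, not a recommendation:

    #include "rocksdb/options.h"
    #include "rocksdb/universal_compaction.h"

    rocksdb::Options MakeUniversalOptions() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      options.num_levels = 7;                          // allow non-zero levels
      options.level0_file_num_compaction_trigger = 4;  // compact at >= 4 sorted runs
      options.level0_slowdown_writes_trigger = 10;     // slow writes beyond this
      options.level0_stop_writes_trigger = 20;         // stop writes beyond this
      options.target_file_size_base = 64 << 20;        // 64MB files on levels > 0
      options.compaction_options_universal.size_ratio = 1;
      options.compaction_options_universal.min_merge_width = 2;
      options.compaction_options_universal.max_merge_width = 10;
      options.compaction_options_universal.max_size_amplification_percent = 200;
      return options;
    }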

The following options DO NOT affect universal compaction:

  • options.max_bytes_for_level_base: only for level-based compaction
  • options.level_compaction_dynamic_level_bytes: only for level-based compaction
  • options.max_bytes_for_level_multiplier and options.max_bytes_for_level_multiplier_additional: only for level-based compaction
  • options.expanded_compaction_factor: only for level-based compactions
  • options.source_compaction_factor: only for level-based compactions
  • options.max_grandparent_overlap_factor: only for level-based compactions
  • options.soft_rate_limit and options.hard_rate_limit: deprecated
  • options.hard_pending_compaction_bytes_limit: only used for level-based compaction
  • options.compaction_pri: only supported in level-based compaction
