| # Tutorial 2: Customize Datasets |
|
|
| ## Customize datasets by reorganizing data |
|
|
The simplest way is to reorganize your dataset into the folder structure expected by MMSegmentation.

An example file structure is as follows.
|
|
```none
├── data
│   ├── my_dataset
│   │   ├── img_dir
│   │   │   ├── train
│   │   │   │   ├── xxx{img_suffix}
│   │   │   │   ├── yyy{img_suffix}
│   │   │   │   ├── zzz{img_suffix}
│   │   │   ├── val
│   │   ├── ann_dir
│   │   │   ├── train
│   │   │   │   ├── xxx{seg_map_suffix}
│   │   │   │   ├── yyy{seg_map_suffix}
│   │   │   │   ├── zzz{seg_map_suffix}
│   │   │   ├── val
```
|
|
A training pair consists of an image in `img_dir` and the annotation in `ann_dir` that share the same filename prefix.
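Under this layout, a dataset config can point at the folders above. The sketch below is a hypothetical example, assuming the generic `CustomDataset` type and placeholder `.jpg`/`.png` suffixes; `train_pipeline` is stubbed out here and would be a real data pipeline in practice.

```python
# Hypothetical config sketch: the paths mirror the folder layout above,
# while the suffixes and pipeline are placeholders to adapt to your data.
train_pipeline = []  # placeholder; a real config defines the data pipeline

dataset_train = dict(
    type='CustomDataset',
    data_root='data/my_dataset',
    img_dir='img_dir/train',
    ann_dir='ann_dir/train',
    img_suffix='.jpg',       # stands in for {img_suffix} in the tree above
    seg_map_suffix='.png',   # stands in for {seg_map_suffix}
    pipeline=train_pipeline)
```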
|
|
If the `split` argument is given, only part of the files in `img_dir`/`ann_dir` will be loaded.
The split txt file lists the filename prefixes to be included in the split.
|
|
More specifically, for a split txt like the following,
|
|
| ```none |
| xxx |
| zzz |
| ``` |
|
|
only
`data/my_dataset/img_dir/train/xxx{img_suffix}`,
`data/my_dataset/img_dir/train/zzz{img_suffix}`,
`data/my_dataset/ann_dir/train/xxx{seg_map_suffix}`, and
`data/my_dataset/ann_dir/train/zzz{seg_map_suffix}` will be loaded.
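A split file is just a plain-text list of prefixes, one per line. As a minimal sketch (the `CustomDataset` type is the generic base; the paths are illustrative assumptions):

```python
import os
import tempfile

# Write a split file listing the prefixes to load, one per line.
split_dir = tempfile.mkdtemp()
split_path = os.path.join(split_dir, 'train.txt')
with open(split_path, 'w') as f:
    f.write('xxx\nzzz\n')

# A config would then reference it, e.g. (hypothetical sketch):
dataset_train = dict(
    type='CustomDataset',
    data_root='data/my_dataset',
    img_dir='img_dir/train',
    ann_dir='ann_dir/train',
    split=split_path,  # only the prefixes listed in the file are loaded
)
```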
|
|
Note: The annotations are single-channel images of shape (H, W), and each pixel value should fall in the range `[0, num_classes - 1]`.
You may use the `'P'` mode of [pillow](https://pillow.readthedocs.io/en/stable/handbook/concepts.html#palette) to create your annotation image with color.
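As a sketch of the Pillow approach (the label values and palette colors below are arbitrary assumptions): a `'P'`-mode image stores the raw class indices, while the palette only controls how they are displayed.

```python
import numpy as np
from PIL import Image

# Hypothetical 2x2 label map with class indices in [0, num_classes - 1].
seg_map = np.array([[0, 1],
                    [2, 0]], dtype=np.uint8)

# 'P' mode keeps the class indices as pixel values.
ann = Image.fromarray(seg_map, mode='P')

# Palette: 3 RGB values per class index, padded to 256 entries.
palette = [0, 0, 0,      # class 0 -> black
           255, 0, 0,    # class 1 -> red
           0, 255, 0]    # class 2 -> green
palette += [0] * (768 - len(palette))
ann.putpalette(palette)

# ann.save('data/my_dataset/ann_dir/train/xxx.png')

# Reading the image back yields the original indices, not RGB colors.
assert (np.array(ann) == seg_map).all()
```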
|
|
## Customize datasets by mixing datasets
|
|
MMSegmentation also supports mixing datasets for training.
Currently it supports concatenating and repeating datasets.
|
|
| ### Repeat dataset |
|
|
We use `RepeatDataset` as a wrapper to repeat a dataset.
For example, suppose the original dataset is `Dataset_A`; to repeat it, the config looks like the following.
|
|
| ```python |
| dataset_A_train = dict( |
| type='RepeatDataset', |
| times=N, |
| dataset=dict( # This is the original config of Dataset_A |
| type='Dataset_A', |
| ... |
| pipeline=train_pipeline |
| ) |
| ) |
| ``` |
|
|
| ### Concatenate dataset |
|
|
There are two ways to concatenate datasets.
|
|
1. If the datasets you want to concatenate are of the same type but have different annotation files,
   you can concatenate the dataset configs as follows.
| |
| 1. You may concatenate two `ann_dir`. |
|
|
| ```python |
| dataset_A_train = dict( |
| type='Dataset_A', |
    img_dir='img_dir',
    ann_dir=['anno_dir_1', 'anno_dir_2'],
| pipeline=train_pipeline |
| ) |
| ``` |
| |
| 2. You may concatenate two `split`. |
|
|
| ```python |
| dataset_A_train = dict( |
| type='Dataset_A', |
    img_dir='img_dir',
    ann_dir='anno_dir',
    split=['split_1.txt', 'split_2.txt'],
| pipeline=train_pipeline |
| ) |
| ``` |
| |
   3. You may concatenate two `ann_dir` and two `split` files simultaneously.
|
|
| ```python |
| dataset_A_train = dict( |
| type='Dataset_A', |
    img_dir='img_dir',
    ann_dir=['anno_dir_1', 'anno_dir_2'],
    split=['split_1.txt', 'split_2.txt'],
| pipeline=train_pipeline |
| ) |
| ``` |
| |
   In this case, `anno_dir_1` and `anno_dir_2` correspond to `split_1.txt` and `split_2.txt`, respectively.
| |
2. If the datasets you want to concatenate are of different types, you can concatenate the dataset configs as follows.
|
|
| ```python |
| dataset_A_train = dict() |
| dataset_B_train = dict() |
| |
| data = dict( |
| imgs_per_gpu=2, |
| workers_per_gpu=2, |
    train=[
        dataset_A_train,
        dataset_B_train
    ],
    val=dataset_A_val,
    test=dataset_A_test
| ) |
| ``` |
| |
   A more complex example that repeats `Dataset_A` and `Dataset_B` N and M times, respectively, and then concatenates the repeated datasets is as follows.
|
|
| ```python |
| dataset_A_train = dict( |
| type='RepeatDataset', |
| times=N, |
| dataset=dict( |
| type='Dataset_A', |
| ... |
| pipeline=train_pipeline |
| ) |
| ) |
| dataset_A_val = dict( |
| ... |
| pipeline=test_pipeline |
| ) |
| dataset_A_test = dict( |
| ... |
| pipeline=test_pipeline |
| ) |
| dataset_B_train = dict( |
| type='RepeatDataset', |
| times=M, |
| dataset=dict( |
| type='Dataset_B', |
| ... |
| pipeline=train_pipeline |
| ) |
| ) |
| data = dict( |
| imgs_per_gpu=2, |
| workers_per_gpu=2, |
    train=[
        dataset_A_train,
        dataset_B_train
    ],
    val=dataset_A_val,
    test=dataset_A_test
)
```
|
|