Introduction
This package was developed to help prepare some samples to be send to a facility. It assumes that you have collected the samples and the information but you still need to do the experiment in several batches due to technical or practical limitations. The question that tries to answer is:
Which samples go with each batch?
Of all the possible combinations of samples, it looks for the combination which minimizes the differences between each subgroup according to the following rules:
- If the variable is categorical it tires to randomize the variable across the subgroups.
- If the variable is numeric it tries to distribute evenly following the original distribution of values.
- If there are
NA
(not available values) it looks to distribute them randomly.
Even with this measures you might end up with some batch effect due to: - Confounding variables not provided for their randomization on the batches Sometimes due to being unknown, impossible to measure. - Lack of replicates(samples with the same conditions) If you can’t provide new replicates, aim to provide more technical replicates. Technical replicates mean reanalyzing the same sample twice or more, the more samples with technical replicates the more accurate your measures will be and easier to avoid or detect batch effects. - Processing If there is a change on the methodology, or you pause and resume later the sample collection there might be changes on the outcome due to external factors.
Previous work
Before building this package I would like to give credit to those that made also efforts in this direction:
The CRAN task View of Experimental Design includes many packages relevant for designing an experiment before collecting data, but none of them provides how to manage them once the samples are already collected.
Two packages allow to distribute the samples on batches:
The OSAT package handles categorical variables but not numeric data. It doesn’t work with our data.
The minDiff package reported in Stats.SE, handles both numeric and categorical data. But it can only optimize for two nominal criteria. It doesn’t work for our data.
The Omixer package handles both numeric and categorical data (converting categorical variables to numeric). But both the same way either Pearson’s Chi-squared Test if there are few samples or Kendall’s correlation. It does allow to protect some spots from being used.
If you are still designing the experiment and do not have collected any data DeclareDesign might be relevant for you.
Question in Bioinformatics.SE I made before developing the package.
Design of Experiment
Imagine you have some samples already collected and you want to distributed them in batches:
library("experDesign")
metadata <- expand.grid(height = seq(60, 80, 5),
weight = seq(100, 300, 50),
sex = c("Male","Female"))
head(metadata, 15)
## height weight sex
## 1 60 100 Male
## 2 65 100 Male
## 3 70 100 Male
## 4 75 100 Male
## 5 80 100 Male
## 6 60 150 Male
## 7 65 150 Male
## 8 70 150 Male
## 9 75 150 Male
## 10 80 150 Male
## 11 60 200 Male
## 12 65 200 Male
## 13 70 200 Male
## 14 75 200 Male
## 15 80 200 Male
First we can check the data to see if there are any concerns regarding the categories:
check_data(metadata)
## [1] TRUE
There are none confounding variables in this artificial dataset (see the examples, and you’ll find some). However, if you block incorrectly and end up with a group in a single batch we will end up with batch effect.
In order to avoid this design()
helps you assign each
sample to a batch (in this case each batch has 24 samples at most).
First we can explore the number of samples and the number of
batches:
size_data <- nrow(metadata)
size_batch <- 24
(batches <- optimum_batches(size_data, size_batch))
## [1] 3
# So now the best number of samples for each batch is less than the available
(size <- optimum_subset(size_data, batches))
## [1] 17
# The distribution of samples per batch
sizes_batches(size_data, size, batches)
## [1] 17 17 16
Note that instead of using a whole batch and then leave a single sample on the third distributes all the samples in the three batches that will be needed.
Randomization
We can directly look for the distribution of the samples given our max number of samples per batch:
desi <- design(metadata, size_batch)
# It is a list but we can convert it to a vector with:
batch_names(desi)
## [1] "SubSet2" "SubSet2" "SubSet1" "SubSet2" "SubSet1" "SubSet1" "SubSet1"
## [8] "SubSet1" "SubSet3" "SubSet3" "SubSet3" "SubSet2" "SubSet1" "SubSet3"
## [15] "SubSet2" "SubSet3" "SubSet2" "SubSet2" "SubSet2" "SubSet2" "SubSet1"
## [22] "SubSet3" "SubSet3" "SubSet1" "SubSet2" "SubSet3" "SubSet3" "SubSet2"
## [29] "SubSet3" "SubSet2" "SubSet1" "SubSet1" "SubSet1" "SubSet1" "SubSet3"
## [36] "SubSet1" "SubSet1" "SubSet3" "SubSet3" "SubSet2" "SubSet2" "SubSet2"
## [43] "SubSet3" "SubSet3" "SubSet2" "SubSet3" "SubSet2" "SubSet1" "SubSet1"
## [50] "SubSet1"
Naively one would either fill some batches fully or distribute them not evenly (the first 17 packages together, the next 17 and so on). This solution ensures that the data is randomized. For more random distribution you can increase the number of iterations performed to calculate this distribution.
Randomization and replicates
If you need space for replicates to control for batch effect you can use:
repli <- replicates(metadata, size_batch, 5)
lengths(repli)
## SubSet1 SubSet2 SubSet3
## 21 21 18
repli
## $SubSet1
## [1] 3 9 10 16 17 20 27 28 30 31 32 33 34 35 40 41 42 44 45 48 49
##
## $SubSet2
## [1] 4 5 7 8 9 10 13 15 18 21 22 25 26 32 33 34 36 37 38 39 47
##
## $SubSet3
## [1] 1 2 6 9 10 11 12 14 19 23 24 29 32 33 34 43 46 50
Which seeks as controls the most diverse values and adds them to the samples distribution. Note that if the sample is already present on that batch is not added again, that’s why the number of samples per batch is different from the design without replicates.
Layout
We can analyze how these samples would be distributed in a layout of 6x4:
spati <- spatial(repli, metadata, rows = LETTERS[1:6], columns = 1:4)
head(spati)
## $A1
## SubSet1 SubSet2 SubSet3
## 32 43 56
##
## $A2
## SubSet1 SubSet2 SubSet3
## 3 37 27
##
## $A3
## SubSet1 SubSet2 SubSet3
## 34 29 28
##
## $A4
## SubSet1 SubSet2 SubSet3
## 36 5 60
##
## $B1
## SubSet1 SubSet2 SubSet3
## 52 13 33
##
## $B2
## SubSet1 SubSet2 SubSet3
## 39 7 53
Report for easy on field usage
We can add the batches to the initial data with
inspect()
:
report <- inspect(repli, metadata)
report2 <- inspect(spati, report, index_name = "position")
head(report2)
## height weight sex batch position
## 1 60 100 Male SubSet3 C3
## 2 65 100 Male SubSet3 C2
## 3 70 100 Male SubSet1 A2
## 4 75 100 Male SubSet2 E1
## 5 80 100 Male SubSet2 A4
## 6 60 150 Male SubSet3 B4
And now we can see the batch and position of each sample
Compare indices
If you have two indices you can also compare them:
desi2 <- create_subset(nrow(metadata), size_batch)
compare_index(metadata, desi, desi2)
## subgroups
## SubSet1 SubSet2 SubSet3
## height 0.21717472 0.30918015 -0.008630805
## weight -3.20170175 -3.83581391 -11.626327348
## sex -0.00669491 0.00669491 -0.011421802
As many of the variables for multiple subsets are negative it shows
that desi
is better.
Unbalanced setting
In the previous case the data was mostly balanced (check it out in
the orig
object) but let’s create an unbalanced dataset to
check it.
n <- 99
samples <- 100
unbalanced <- data.frame(Classroom = rep(c("A", "B"), each = samples/2),
Sex = c(rep("M", n), rep("F", samples-n)),
Age = rnorm(samples, mean = 25, sd = 3))
table(unbalanced[, 1:2])
## Sex
## Classroom F M
## A 0 50
## B 1 49
In this dataset there is only one female, resulting in a classroom full of males. Age is independent of the sex or classroom.
i <- design(unbalanced, 15)
## Warning: There might be some problems with the data use check_data().
evaluation <- evaluate_index(i, unbalanced)
## Warning: There might be some problems with the data use check_data().
# Mean entropy en each subset
rowMeans(evaluation["entropy", , ])
## Classroom Sex Age mix_cat
## 0.96035770 0.05303319 0.00000000 0.93994182
# Original entropy on the dataset
evaluate_orig(unbalanced)["entropy", ]
## Warning in evaluate_orig(unbalanced): There might be some problems with the
## data use check_data().
## Classroom Sex Age mix_cat
## 1.00000000 0.08079314 0.00000000 0.67554928
# Dispersion of the entropy
apply(evaluation["entropy", , ], 1, sd)
## Classroom Sex Age mix_cat
## 0.05017590 0.14031263 0.00000000 0.07994645
We can see that in this simple case where a single variable has all the other cases we approximately reached the same entropy levels.
Quality check
If you need a subset with the samples that are more diverse you can use the following function:
data(survey, package = "MASS")
head(survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 1 Female 18.5 18.0 Right R on L 92 Left Some Never 173.00 Metric
## 2 Male 19.5 20.5 Left R on L 104 Left None Regul 177.80 Imperial
## 3 Male 18.0 13.3 Right L on R 87 Neither None Occas NA <NA>
## 4 Male 18.8 18.9 Right R on L NA Neither None Never 160.00 Metric
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00 Metric
## 6 Female 18.0 17.7 Right L on R 64 Right Some Never 172.72 Imperial
## Age
## 1 18.250
## 2 17.583
## 3 16.917
## 4 20.333
## 5 23.667
## 6 21.000
samples <- extreme_cases(survey, size = 10)
survey[samples, ]
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00 Metric
## 9 Male 20.0 19.5 Right R on L 72 Right Some Never 175.00 Metric
## 84 Female 17.0 17.4 Right R on L NA Neither Some Never NA <NA>
## 99 Male 19.5 19.4 Right Neither NA Right Freq Never 165.00 Metric
## 107 Female 16.2 16.4 Right R on L NA Right Freq Occas 172.00 Metric
## 171 Female 16.5 17.0 Right L on R NA Right Some Never 168.00 Metric
## 177 Male 20.5 20.5 Right Neither 76 Right Freq Regul 172.72 Imperial
## 191 Male 20.0 19.5 Right R on L 92 Right Some Never 179.10 Imperial
## 201 Female 16.2 15.8 Right R on L 61 Right Some Occas 167.00 Metric
## 202 Male 21.0 21.0 Right L on R 48 Neither Freq Never 174.00 Metric
## Age
## 5 23.667
## 9 19.000
## 84 17.167
## 99 18.083
## 107 17.000
## 171 73.000
## 177 36.583
## 191 18.917
## 201 19.250
## 202 21.333
You can also test a given index with check_index()
:
check_index(unbalanced, i)
## Warning in check_index(unbalanced, i): There might be some problems with the
## data use check_data().
## subgroups
## SubSet1 SubSet2 SubSet3 SubSet4 SubSet5 SubSet6
## Classroom 0.02116629 0.02116629 0.04000152 0.04000152 0.06572331 0.02009684
## Sex 0.17853702 0.17853702 0.24841904 0.17853702 0.17853702 0.17853702
## Age 0.20351157 0.19733814 0.25375813 0.28706685 0.12655149 0.14621487
## mix_cat 0.08031059 0.08031059 0.03045638 0.06618417 0.04689282 0.08111268
## subgroups
## SubSet7
## Classroom 0.02502079
## Sex 0.17853702
## Age 0.45676502
## mix_cat 0.07741971
Each row has information about how accurate is a given variable to
the samples available (on this case unbalanced
). Some
variables are distributed more randomly than others on this index.
If we are not satisfied we could design()
a new index
increasing the iterations to obtain a potentially better distribution.
If you want a good stratified randomization you should increase the
iterations used 10 fold.
Internals
We could check all the combinations to select those that allow us to
do this comparison. But as this would be too long with
experDesign
we can try to find the combination with the
best design by comparing each combination with the original according to
multiple statistics.
# To reduce the variables used:
omit <- c("Wr.Hnd", "NW.Hnd", "Fold", "Pulse", "Clap", "Exer", "Height", "M.I")
(keep <- colnames(survey)[!colnames(survey) %in% omit])
## [1] "Sex" "W.Hnd" "Smoke" "Age"
head(survey[, keep])
## Sex W.Hnd Smoke Age
## 1 Female Right Never 18.250
## 2 Male Left Regul 17.583
## 3 Male Right Occas 16.917
## 4 Male Right Never 20.333
## 5 Male Right Never 23.667
## 6 Female Right Never 21.000
# Set a seed for reproducibility
# Looking for groups at most of 70 samples.
index <- create_subset(nrow(survey), size_subset = 70)
index
## $SubSet1
## [1] 9 12 21 25 34 37 43 47 58 59 65 67 68 73 74 77 79 80 103
## [20] 104 113 120 136 140 142 143 144 148 152 153 162 164 165 170 172 173 174 176
## [39] 178 181 182 185 189 190 192 194 197 202 203 205 208 214 215 217 219 221 224
## [58] 231 235 237
##
## $SubSet2
## [1] 1 2 7 8 22 23 29 30 33 36 39 42 45 46 51 54 55 64 71
## [20] 75 76 91 102 105 109 111 114 115 117 118 119 123 125 128 132 135 138 154
## [39] 155 159 167 177 187 191 193 195 196 198 200 201 207 211 212 218 222 227 228
## [58] 230 233
##
## $SubSet3
## [1] 3 4 11 13 14 15 18 19 20 24 28 31 32 35 41 48 49 50 52
## [20] 70 72 81 84 85 86 87 92 93 94 98 112 121 122 124 126 127 129 130
## [39] 134 137 139 141 147 151 157 158 161 168 171 179 183 199 204 210 213 220 223
## [58] 225 234
##
## $SubSet4
## [1] 5 6 10 16 17 26 27 38 40 44 53 56 57 60 61 62 63 66 69
## [20] 78 82 83 88 89 90 95 96 97 99 100 101 106 107 108 110 116 131 133
## [39] 145 146 149 150 156 160 163 166 169 175 180 184 186 188 206 209 216 226 229
## [58] 232 236
We can measure how does this index does:
score_index1 <- check_index(survey[, keep], index)
## Warning in check_index(survey[, keep], index): There might be some problems
## with the data use check_data().
score_index1
## subgroups
## SubSet1 SubSet2 SubSet3 SubSet4
## Sex 0.285472670 0.28479310 0.45087536 0.28479310
## W.Hnd 0.406863574 0.56967175 0.38262251 0.38311366
## Smoke 0.352226528 0.35423141 0.51717322 0.36004142
## Age 0.966161840 0.88451852 0.78376910 1.52773530
## mix_cat 0.009637159 0.02713277 0.02836273 0.00787233
These values come from calculating the difference between the
original data and the samples for each subset for the median, mean, mad,
NA, entropy and independence (chisq.test()
p.value).
On themselves these values have no meaning, but internally they are used to compare with other possible index:
index2 <- create_subset(nrow(survey), size_subset = 70)
Then it compares to the previous index
score_index2 <- check_index(survey[, keep], index2)
## Warning in check_index(survey[, keep], index2): There might be some problems
## with the data use check_data().
If this score is lower than the previous one the new index is kept.
This is done similarly for the spatial search, which adds two new categorical variables with the position of the samples before calculating any statistics.
SessionInfo
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] experDesign_0.4.0
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.2 knitr_1.46 rlang_1.1.3
## [5] xfun_0.44 purrr_1.0.2 textshaping_0.3.7 jsonlite_1.8.8
## [9] htmltools_0.5.8.1 ragg_1.3.2 sass_0.4.9 rmarkdown_2.27
## [13] evaluate_0.23 jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.8
## [17] lifecycle_1.0.4 memoise_2.0.1 compiler_4.4.0 fs_1.6.4
## [21] systemfonts_1.1.0 digest_0.6.35 R6_2.5.1 magrittr_2.0.3
## [25] bslib_0.7.0 tools_4.4.0 pkgdown_2.0.9 cachem_1.1.0
## [29] desc_1.4.3