Generate a covariate-matched control set of ranges

matchRanges() uses a propensity score-based method to generate a covariate-matched control set of DataFrame, GRanges, or GInteractions objects.

matchRanges(focal, pool, covar, method = "nearest", replace = TRUE, ...)

# S4 method for class 'DF_OR_df_OR_dt,DF_OR_df_OR_dt,formula,character_OR_missing,logical_OR_missing'
matchRanges(focal, pool, covar, method, replace)

# S4 method for class 'GRanges,GRanges,formula,character_OR_missing,logical_OR_missing'
matchRanges(focal, pool, covar, method, replace)

# S4 method for class 'GInteractions,GInteractions,formula,character_OR_missing,logical_OR_missing'
matchRanges(focal, pool, covar, method, replace)

Arguments

focal: A DataFrame, GRanges, or GInteractions object containing the focal data to match.
pool: A DataFrame, GRanges, or GInteractions object containing the pool from which to select matches.
covar: A rhs formula with covariates on which to match.
method: A character describing which matching method to use. supported options are either 'nearest', 'rejection', or 'stratified'.
replace: TRUE/FALSE describing whether to select matches with or without replacement.
...: Additional arguments.

Value

A covariate-matched control set of data.

Details

Available inputs for focal and pool include data.frame, data.table, DataFrame, GRanges, or GInteractions. data.frame and data.table inputs are coerced to DataFrame objects and returned as MatchedDataFrame while GRanges and GInteractions objects are returned as MatchedGRanges or MatchedGInteractions, respectively.

Methodology

matchRanges uses propensity scores to perform subset selection on the pool set such that the resulting matched set contains similar distributions of covariates to that of the focal set. A propensity score is the conditional probability of assigning an element (in our case, a genomic range) to a particular outcome (Y) given a set of covariates. Propensity scores are estimated using a logistic regression model where the outcome Y=1 for focal and Y=0 for pool, over the provided covariates covar.

Matching methods

method = 'nearest': Nearest neighbor matching with replacement. Finds the nearest neighbor by using a rolling join with data.table. Matching without replacement is not currently supported.
method = 'rejection': (Default) Rejection sampling with or without replacement. Uses a probability-based approach to select options in the pool that match the focal distribition.
method = 'stratified': Iterative stratified sampling with or without replacement. Bins focal and pool propensity scores by value and selects matches within bins until all focal items have a corresponding match in pool.

References

matchRanges manuscript:

Eric S. Davis, Wancen Mu, Stuart Lee, Mikhail G. Dozmorov, Michael I. Love, Douglas H. Phanstiel. 2023. "matchRanges: Generating null hypothesis genomic ranges via covariate-matched sampling." Bioinformatics. doi: 10.1093/bioinformatics/btad197

Examples

## Match with DataFrame
set.seed(123)
x <- makeExampleMatchedDataSet(type = 'DataFrame')
matchRanges(focal = x[x$feature1,],
            pool = x[!x$feature1,],
            covar = ~feature2 + feature3)
#> MatchedDataFrame with 500 rows and 3 columns
#>      feature1  feature2    feature3
#>     <logical> <numeric> <character>
#> 1       FALSE   2.87088           c
#> 2       FALSE   3.54290           c
#> 3       FALSE   7.11436           c
#> 4       FALSE  10.78965           b
#> 5       FALSE   4.25960           c
#> ...       ...       ...         ...
#> 496     FALSE  0.173349           e
#> 497     FALSE  4.362421           a
#> 498     FALSE  3.182474           e
#> 499     FALSE  4.688994           d
#> 500     FALSE  5.068635           d

## Match with GRanges
set.seed(123)
x <- makeExampleMatchedDataSet(type = "GRanges")
matchRanges(focal = x[x$feature1,],
            pool = x[!x$feature1,],
            covar = ~feature2 + feature3)
#> MatchedGRanges object with 500 ranges and 3 metadata columns:
#>         seqnames    ranges strand |  feature1  feature2    feature3
#>            <Rle> <IRanges>  <Rle> | <logical> <numeric> <character>
#>     [1]     chr1 8696-8795      * |     FALSE   2.87088           c
#>     [2]     chr1 4386-4485      * |     FALSE   3.54290           c
#>     [3]     chr1 1094-1193      * |     FALSE   7.11436           c
#>     [4]     chr1 5705-5804      * |     FALSE  10.78965           b
#>     [5]     chr1 1643-1742      * |     FALSE   4.25960           c
#>     ...      ...       ...    ... .       ...       ...         ...
#>   [496]     chr1 7288-7387      * |     FALSE  0.173349           e
#>   [497]     chr1 5539-5638      * |     FALSE  4.362421           a
#>   [498]     chr1 8499-8598      * |     FALSE  3.182474           e
#>   [499]     chr1 6507-6606      * |     FALSE  4.688994           d
#>   [500]     chr1 1860-1959      * |     FALSE  5.068635           d
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Match with GInteractions
set.seed(123)
x <- makeExampleMatchedDataSet(type = "GInteractions")
matchRanges(focal = x[x$feature1,],
            pool = x[!x$feature1,],
            covar = ~feature2 + feature3)
#> MatchedGInteractions object with 500 interactions and 3 metadata columns:
#>         seqnames1   ranges1     seqnames2   ranges2 |  feature1  feature2
#>             <Rle> <IRanges>         <Rle> <IRanges> | <logical> <numeric>
#>     [1]      chr1 8696-8795 ---      chr1 8696-8795 |     FALSE   2.87088
#>     [2]      chr1 4386-4485 ---      chr1 4386-4485 |     FALSE   3.54290
#>     [3]      chr1 1094-1193 ---      chr1 1094-1193 |     FALSE   7.11436
#>     [4]      chr1 5705-5804 ---      chr1 5705-5804 |     FALSE  10.78965
#>     [5]      chr1 1643-1742 ---      chr1 1643-1742 |     FALSE   4.25960
#>     ...       ...       ... ...       ...       ... .       ...       ...
#>   [496]      chr1 7288-7387 ---      chr1 7288-7387 |     FALSE  0.173349
#>   [497]      chr1 5539-5638 ---      chr1 5539-5638 |     FALSE  4.362421
#>   [498]      chr1 8499-8598 ---      chr1 8499-8598 |     FALSE  3.182474
#>   [499]      chr1 6507-6606 ---      chr1 6507-6606 |     FALSE  4.688994
#>   [500]      chr1 1860-1959 ---      chr1 1860-1959 |     FALSE  5.068635
#>            feature3
#>         <character>
#>     [1]           c
#>     [2]           c
#>     [3]           c
#>     [4]           b
#>     [5]           c
#>     ...         ...
#>   [496]           e
#>   [497]           a
#>   [498]           e
#>   [499]           d
#>   [500]           d
#>   -------
#>   regions: 10500 ranges and 0 metadata columns
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Nearest neighbor matching with replacement
set.seed(123)
x <- makeExampleMatchedDataSet(type = 'DataFrame')
matchRanges(focal = x[x$feature1,],
            pool = x[!x$feature1,],
            covar = ~feature2 + feature3,
            method = 'nearest',
            replace = TRUE)
#> MatchedDataFrame with 500 rows and 3 columns
#>      feature1  feature2    feature3
#>     <logical> <numeric> <character>
#> 1       FALSE   2.87088           c
#> 2       FALSE   3.54290           c
#> 3       FALSE   7.11436           c
#> 4       FALSE  10.78965           b
#> 5       FALSE   4.25960           c
#> ...       ...       ...         ...
#> 496     FALSE  0.173349           e
#> 497     FALSE  4.362421           a
#> 498     FALSE  3.182474           e
#> 499     FALSE  4.688994           d
#> 500     FALSE  5.068635           d

## Rejection sampling without replacement
set.seed(123)
x <- makeExampleMatchedDataSet(type = 'DataFrame')
matchRanges(focal = x[x$feature1,],
            pool = x[!x$feature1,],
            covar = ~feature2 + feature3,
            method = 'rejection',
            replace = FALSE)
#> MatchedDataFrame with 500 rows and 3 columns
#>      feature1  feature2    feature3
#>     <logical> <numeric> <character>
#> 1       FALSE   5.27263           c
#> 2       FALSE  10.02427           c
#> 3       FALSE   8.19732           a
#> 4       FALSE   7.83727           c
#> 5       FALSE   4.69472           c
#> ...       ...       ...         ...
#> 496     FALSE   1.95550           a
#> 497     FALSE   4.98042           c
#> 498     FALSE   2.62560           e
#> 499     FALSE   1.40300           c
#> 500     FALSE   3.89279           b

## Stratified sampling without replacement
set.seed(123)
x <- makeExampleMatchedDataSet(type = 'DataFrame')
matchRanges(focal = x[x$feature1,],
            pool = x[!x$feature1,],
            covar = ~feature2 + feature3,
            method = 'stratified',
            replace = FALSE)
#> MatchedDataFrame with 500 rows and 3 columns
#>      feature1  feature2    feature3
#>     <logical> <numeric> <character>
#> 1       FALSE   2.87088           c
#> 2       FALSE   3.54290           c
#> 3       FALSE   7.11436           c
#> 4       FALSE  10.78965           b
#> 5       FALSE   4.25960           c
#> ...       ...       ...         ...
#> 496     FALSE   1.32077           b
#> 497     FALSE  10.96586           d
#> 498     FALSE   4.32895           b
#> 499     FALSE   4.68899           d
#> 500     FALSE   5.06864           d