Introduction

When performing statistical analysis on any set of genomic ranges it is often important to compare focal sets to null sets that are carefully matched for possible covariates that may influence the analysis. To address this need, the nullranges package implements matchRanges(), an efficient and convenient tool for selecting a covariate-matched set of null hypothesis ranges from a pool of background ranges within the Bioconductor framework.

In this vignette, we provide an overview of matchRanges() and its associated functions. We start with a simulated example generated with the utility function makeExampleMatchedDataSet(). We also provide an overview of the class struture and a guide for choosing among the supported matching methods. To see matchRanges() used in real biological examples, visit the Case study I: CTCF occupancy, and Case study II: CTCF orientation vignettes.

For a description of the method, see Davis et al. (2023).

Terminology

matchRanges references four sets of data: focal, pool, matched and unmatched. The focal set contains the outcome of interest (Y=1) while the pool set contains all other observations (Y=0). matchRanges generates the matched set, which is a subset of the pool that is matched for provided covariates (i.e. covar) but does not contain the outcome of interest (i.e Y=0). Finally, the unmatched set contains the remaining unselected elements from the pool. The diagram below depicts the relationships between the four sets.

Methodology

matchRanges uses propensity scores to perform subset selection on the pool set such that the resulting matched set contains similar distributions of covariates to that of the focal set. A propensity score is the conditional probability of assigning an element (in our case, a genomic range) to a particular outcome (Y) given a set of covariates. Propensity scores are estimated using a logistic regression model where the outcome Y=1 for focal and Y=0 for pool, over the provided covariates covar. The resulting propensity scores are used to select matches using one of three available matching options: “nearest”, “rejection”, or “stratified” with or without replacement. For more information see the section on Choosing the method parameter below.

Using matchRanges()

We will use a simulated data set to demonstrate matching across covarying features:

library(nullranges)
set.seed(123)
x <- makeExampleMatchedDataSet(type = 'GRanges')
x
## GRanges object with 10500 ranges and 3 metadata columns:
##           seqnames      ranges strand |  feature1  feature2    feature3
##              <Rle>   <IRanges>  <Rle> | <logical> <numeric> <character>
##       [1]     chr1       1-100      * |      TRUE   2.87905           c
##       [2]     chr1       2-101      * |      TRUE   3.53965           c
##       [3]     chr1       3-102      * |      TRUE   7.11742           c
##       [4]     chr1       4-103      * |      TRUE   4.14102           a
##       [5]     chr1       5-104      * |      TRUE   4.25858           c
##       ...      ...         ...    ... .       ...       ...         ...
##   [10496]     chr1 10496-10595      * |     FALSE   1.23578           b
##   [10497]     chr1 10497-10596      * |     FALSE   1.69671           a
##   [10498]     chr1 10498-10597      * |     FALSE   6.11140           a
##   [10499]     chr1 10499-10598      * |     FALSE   2.21657           d
##   [10500]     chr1 10500-10599      * |     FALSE   5.33003           b
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Our simulated dataset has 3 features: logical feature1, numeric feature2, and character/factor feature3. We can use matchRanges() to compare ranges where feature1 is TRUE to ranges where feature1 is FALSE, matched by feature2 and/or feature3:

set.seed(123)
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3)
mgr
## MatchedGRanges object with 500 ranges and 3 metadata columns:
##         seqnames    ranges strand |  feature1  feature2    feature3
##            <Rle> <IRanges>  <Rle> | <logical> <numeric> <character>
##     [1]     chr1 8696-8795      * |     FALSE   2.87088           c
##     [2]     chr1 4386-4485      * |     FALSE   3.54290           c
##     [3]     chr1 1094-1193      * |     FALSE   7.11436           c
##     [4]     chr1 5705-5804      * |     FALSE  10.78965           b
##     [5]     chr1 1643-1742      * |     FALSE   4.25960           c
##     ...      ...       ...    ... .       ...       ...         ...
##   [496]     chr1 7288-7387      * |     FALSE  0.173349           e
##   [497]     chr1 5539-5638      * |     FALSE  4.362421           a
##   [498]     chr1 8499-8598      * |     FALSE  3.182474           e
##   [499]     chr1 6507-6606      * |     FALSE  4.688994           d
##   [500]     chr1 1860-1959      * |     FALSE  5.068635           d
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The resulting MatchedGRanges object is a set of null hypothesis ranges selected from our pool of options that is the same length as our input focal ranges and matched for covar features 2 and 3. These matched ranges print and behave just as normal GRanges would:

## MatchedGRanges object with 500 ranges and 3 metadata columns:
##         seqnames      ranges strand |  feature1  feature2    feature3
##            <Rle>   <IRanges>  <Rle> | <logical> <numeric> <character>
##     [1]     chr1     508-607      * |     FALSE   3.54480           c
##     [2]     chr1     508-607      * |     FALSE   3.54480           c
##     [3]     chr1     510-609      * |     FALSE   5.28640           c
##     [4]     chr1     526-625      * |     FALSE   3.82276           a
##     [5]     chr1     529-628      * |     FALSE   3.22564           b
##     ...      ...         ...    ... .       ...       ...         ...
##   [496]     chr1 10414-10513      * |     FALSE   9.14097           b
##   [497]     chr1 10446-10545      * |     FALSE  11.23394           e
##   [498]     chr1 10480-10579      * |     FALSE   5.64670           a
##   [499]     chr1 10483-10582      * |     FALSE   3.72415           c
##   [500]     chr1 10495-10594      * |     FALSE   5.37738           b
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can change the type argument of makeExampleMatchedDataSet to input data.frames, data.tables, DataFrames, GRanges and GInteractions objects - all of which work as inputs for matchRanges. These produce either MatchedDataFrame, MatchedGRanges, or MatchedGInteractions objects. For more information about the Matched class structure and available methods, see the Class structure section below or the help documentation for each class, ?MatchedDataFrame, ?MatchedGRanges, or ?MatchedGInteractions.

matchRanges() uses propensity scores to select matches using one of three available matching options: “nearest”, “rejection”, or “stratified” with or without replacement. For more information see the section on Choosing the method parameter below.

Assessing quality of matching

We can assess the quality of Matched classes with overview(), plotCovariate(), and plotPropensity(). overview() provides a quick assessment of overall matching quality by reporting the mean and standard deviation for covariates and propensity scores of the focal, pool, matched, and unmatched sets. For factor, character, or logical covariates (e.g. categorical covariates) the N per set (frequency) is returned. It also reports the mean difference in focal-matched sets:

## MatchedGRanges object: 
##        set     N feature2.mean feature2.sd feature3.a feature3.b feature3.c
##      focal   500           4.1         1.9         66        157        206
##    matched   500           4.6         2.7         48        155        223
##       pool 10000           6.0         3.4       4248       3121       1117
##  unmatched  9559           6.0         3.4       4200       2978        936
##  feature3.d feature3.e ps.mean ps.sd
##          49         22   0.100 0.076
##          48         26   0.100 0.076
##         992        522   0.045 0.051
##         946        499   0.042 0.048
## --------
## focal - matched: 
##  feature2.mean feature2.sd feature3.a feature3.b feature3.c feature3.d
##          -0.48       -0.79         18          2        -17          1
##  feature3.e  ps.mean    ps.sd
##          -4 -5.8e-06 -6.8e-06

Visualizing propensity scores can show how well sets were matched overall:

The distributions of features can be visualized in each set with plotCovariate():

Since these functions return ggplots, patchwork can be used to visualize all covariates like this:

library(patchwork)
plots <- lapply(covariates(mgr), plotCovariate, x=mgr, sets = c('f', 'm', 'p'))
Reduce('/', plots)

By default, continuous features are plotted as density line plots while categorical features are plotted as stacked bar plots. All sets are also shown by default. Defaults can be overridden by setting the type and sets arguments.

Results from matchRanges can also be used in conjunction with cobalt for assessing covariate balance. We recommend using cobalt to calculate and report summary statistics to indicate adequately matched sets. For more detail on assessing covariate balance, refer to the detailed documentation on this topic in the cobalt vignette: vignette("cobalt", package = "cobalt"). For an example on how to use cobalt with matchRanges see Using cobalt to assess balancing.

Accessing matched data

Custom plots can be made by extracting data from the Matched object:

##        id feature2 feature3         ps       set
##     1:  1 2.879049        c 0.21095908     focal
##     2:  1 3.539645        c 0.19210984     focal
##     3:  1 7.117417        c 0.11193396     focal
##     4:  1 4.141017        a 0.01771986     focal
##     5:  1 4.258575        c 0.17308581     focal
##    ---                                          
## 20555:  0 1.235781        b 0.08945367 unmatched
## 20556:  0 1.696712        a 0.02707977 unmatched
## 20557:  0 6.111404        a 0.01255772 unmatched
## 20558:  0 2.216575        d 0.07578989 unmatched
## 20559:  0 5.330029        b 0.04535856 unmatched

Attributes of the Matched object can be extracted with the following accessor functions:

## [1] "feature2" "feature3"
## [1] "nearest"
## [1] TRUE

Each set can also be extracted with the following accessor functions:

## [1] "GRanges object with 500 ranges and 3 metadata columns"
## [1] "GRanges object with 10000 ranges and 3 metadata columns"
## [1] "GRanges object with 500 ranges and 3 metadata columns"
## [1] "GRanges object with 9559 ranges and 3 metadata columns"

A “tidy” version of key sets can be obtained using plyranges and the bind_ranges function. This enables efficient comparisons across sets with other plyranges functionality (group_by, summarize, etc.).

library(plyranges)
bind_ranges(
  focal = focal(mgr),
  pool = pool(mgr),
  matched = matched(mgr), .id="type"
)
## GRanges object with 11000 ranges and 4 metadata columns:
##           seqnames    ranges strand |  feature1  feature2    feature3    type
##              <Rle> <IRanges>  <Rle> | <logical> <numeric> <character>   <Rle>
##       [1]     chr1     1-100      * |      TRUE   2.87905           c   focal
##       [2]     chr1     2-101      * |      TRUE   3.53965           c   focal
##       [3]     chr1     3-102      * |      TRUE   7.11742           c   focal
##       [4]     chr1     4-103      * |      TRUE   4.14102           a   focal
##       [5]     chr1     5-104      * |      TRUE   4.25858           c   focal
##       ...      ...       ...    ... .       ...       ...         ...     ...
##   [10996]     chr1 7288-7387      * |     FALSE  0.173349           e matched
##   [10997]     chr1 5539-5638      * |     FALSE  4.362421           a matched
##   [10998]     chr1 8499-8598      * |     FALSE  3.182474           e matched
##   [10999]     chr1 6507-6606      * |     FALSE  4.688994           d matched
##   [11000]     chr1 1860-1959      * |     FALSE  5.068635           d matched
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The indices() function can be used to find the original indices for each set. For example, indices(x, set="matched") will supply the indices from the pool set that corresponds to the matched set. In fact, matched(x) is a convenient wrapper around pool(x)[indices(x, set='matched'):

identical(matched(mgr), pool(mgr)[indices(mgr, set = 'matched')])
## [1] TRUE

Using cobalt to assess balancing

This is straight-forward by accessing the data with matchedData(x):

##  cobalt (Version 4.5.1, Build Date: 2023-04-27)
res <- bal.tab(f.build("set", covariates(mgr)),
               data = matchedData(mgr),
               distance = "ps", # name of column containing propensity score
               focal = "focal", # name of focal group in set column
               which.treat = "focal", # compare everything to focal
               s.d.denom = "all") # how to adjust standard deviation

res
## Balance by treatment pair
## 
##  - - - focal (0) vs. matched (1) - - - 
## Balance Measures
##                Type Diff.Un
## ps         Distance  0.0001
## feature2    Contin.  0.1399
## feature3_a   Binary -0.0360
## feature3_b   Binary -0.0040
## feature3_c   Binary  0.0340
## feature3_d   Binary -0.0020
## feature3_e   Binary  0.0080
## 
## Sample sizes
##     focal matched
## All   500     500
## 
##  - - - focal (0) vs. pool (1) - - - 
## Balance Measures
##                Type Diff.Un
## ps         Distance -1.1274
## feature2    Contin.  0.5523
## feature3_a   Binary  0.2928
## feature3_b   Binary -0.0019
## feature3_c   Binary -0.3003
## feature3_d   Binary  0.0012
## feature3_e   Binary  0.0082
## 
## Sample sizes
##     focal  pool
## All   500 10000
## 
##  - - - focal (0) vs. unmatched (1) - - - 
## Balance Measures
##                Type Diff.Un
## ps         Distance -1.1745
## feature2    Contin.  0.5703
## feature3_a   Binary  0.3074
## feature3_b   Binary -0.0025
## feature3_c   Binary -0.3141
## feature3_d   Binary  0.0010
## feature3_e   Binary  0.0082
## 
## Sample sizes
##     focal unmatched
## All   500      9559
##  - - - - - - - - - - - - - - - - - - - - - - - - - -
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the `stars` argument to distinguish between them and appropriately label the x-axis.

Choosing the method parameter

There are currently 3 available methods for selecting a matched set:

  1. Nearest-neighbor matching with replacement

  2. Rejection sampling with/without replacement

  3. Stratified sampling with/without replacement

Currently, nearest-neighbor matching without replacement is not implemented, but stratified sampling without replacement is a suitable substitute.

Nearest-neighbor matching

Attempts to find the nearest neighbor for each range by using a rolling-join (as implemented in the data.table package) between focal and pool propensity scores.

set.seed(123)
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3,
                   method = 'nearest',
                   replace = TRUE)
nn <- overview(mgr)
plotPropensity(mgr)

This method is best if you have a very large dataset because it is usually the fastest matching method. However, because sampling is done with replacement the user should be careful to assess the number of duplicate ranges pulled. This can be done using the indices() function:

## Total number of duplicated indices
length(which(duplicated(indices(mgr))))

sum(table(indices(mgr)) > 1) # used more than once
sum(table(indices(mgr)) > 2) # used more than twice
sum(table(indices(mgr)) > 3) # used more than thrice
## [1] 59
## [1] 51
## [1] 8
## [1] 0

Duplicate ranges can be pulled since this method selects the closest matching propensity-score in the focal set to each range in the pool set. It is important to inspect the duplicates when using this method particularly when there are very few well-matching options to select from in your pool set to ensure your matched set has a diverse set of ranges.

Nearest neighbor matching without replacement is not currently supported due to its computational complexity. However, stratified sampling without replacement is an acceptable alternative.

Rejection sampling

Uses a probability-based approach to select options in the pool that distributionally match the focal set based on propensity scores. The rejection sampling method first generates kernal-density estimates for both the focal and pool sets. Then a scale factor is determined by finding the point at which the difference in focal and pool densities is maximized. This scale factor is then applied such that the pool distribution covers the focal distribution at all points. Random sampling is then conducted, with probability of accepting a pool range into the matched set given by the ratio between the height of the density and the scaled (covering) density. If method or replace is not supplied, the default values are rejection sampling without replacement.

set.seed(123)
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3,
                   method = 'rejection',
                   replace = FALSE)
rs <- overview(mgr)
plotPropensity(mgr)

Rejection sampling is the fastest available matching method for sampling without replacement. Therefore, it is ideal to use on large datasets when sampling without replacement is important. However, this method can be unstable, particularly when the pool set is not much larger than the focal set. In those cases, the best method to use is stratified sampling.

Stratified sampling

Performs iterative sampling on increasingly large bins of data. focal and pool propensity scores are binned by their value with high granularity, options are randomly selected (with or without replacement) within each bin and subsequently removed from the pool of available options. This procedure is repeated, decreasing the number of bins (and increasing bin size) until the number of selected matches is equal to the focal set. While matches are being found in each bin the bins stay small. However, as the number of bins with no matches increases the algorithm expands bin size faster, which maintains matching quality while decreasing run-time.

set.seed(123)
mgr <- matchRanges(focal = x[x$feature1],
                   pool = x[!x$feature1],
                   covar = ~feature2 + feature3,
                   method = 'stratified',
                   replace = FALSE)
ss <- overview(mgr)
plotPropensity(mgr)

For very large data sets, users might notice a slight increase in run time compared to the other methods. Stratified sampling tends to work very well for discrete data, and often produces the best matches even on continuous data:

## Extract difference in propensity scores
## between focal and matched sets
fmps <- sapply(c(nn, rs, ss), `[[`, "quality")
c('nearest', 'rejection', 'stratified')[which.min(fmps)]
## [1] "nearest"

Class structure

Since matchRanges() automatically constructs the relevant classes, this section is not essential for using any of the nullranges package functionality. Instead, this section serves as a guide for developers who wish to extend these classes or those more interested in S4 implementation details.

Implementation details

matchRanges() acts as a constructor, combining a Matched superclass - which contains the matching results - with either a DataFrame(data.frame/data.table), GRanges, or GInteractions superclass. This results in the MatchedDataFrame, MatchedGRanges, or MatchedGInteractions subclasses.

Internally, each Matched subclass uses a “delegate” object of the same type to assign its slots. The delegate object used is the matched set. Therefore, the resulting Matched* object behaves as a combination of both its superclasses - with access to methods from both.

For example, using matchRanges() on GRanges objects assigns a GRanges delegate object which is used to populate GRanges-specific slots. This results in a MatchedGRanges object, with access to both Matched functions (e.g. plotCovariate) as well as normal GRanges methods (e.g.s seqnames, resize, etc…).

Session information

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cobalt_4.5.1         plyranges_1.20.0     patchwork_1.1.3     
##  [4] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1  IRanges_2.34.1      
##  [7] S4Vectors_0.38.1     BiocGenerics_0.46.0  ggplot2_3.4.3       
## [10] plotgardener_1.6.4   nullranges_1.7.10   
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                rlang_1.1.1                
##  [3] magrittr_2.0.3              matrixStats_1.0.0          
##  [5] ggridges_0.5.4              compiler_4.3.0             
##  [7] systemfonts_1.0.4           vctrs_0.6.3                
##  [9] stringr_1.5.0               pkgconfig_2.0.3            
## [11] crayon_1.5.2                fastmap_1.1.1              
## [13] backports_1.4.1             XVector_0.40.0             
## [15] labeling_0.4.2              utf8_1.2.3                 
## [17] Rsamtools_2.16.0            rmarkdown_2.24             
## [19] pracma_2.4.2                strawr_0.0.91              
## [21] ragg_1.2.5                  purrr_1.0.2                
## [23] xfun_0.40                   zlibbioc_1.46.0            
## [25] cachem_1.0.8                jsonlite_1.8.7             
## [27] progress_1.2.2              highr_0.10                 
## [29] DelayedArray_0.26.7         chk_0.9.0                  
## [31] BiocParallel_1.34.2         prettyunits_1.1.1          
## [33] parallel_4.3.0              R6_2.5.1                   
## [35] bslib_0.5.1                 stringi_1.7.12             
## [37] RColorBrewer_1.1-3          rtracklayer_1.60.1         
## [39] jquerylib_0.1.4             Rcpp_1.0.11                
## [41] SummarizedExperiment_1.30.2 knitr_1.43                 
## [43] Matrix_1.6-1                tidyselect_1.2.0           
## [45] abind_1.4-5                 yaml_2.3.7                 
## [47] codetools_0.2-19            curl_5.0.2                 
## [49] lattice_0.21-8              tibble_3.2.1               
## [51] InteractionSet_1.28.1       Biobase_2.60.0             
## [53] ks_1.14.1                   withr_2.5.0                
## [55] evaluate_0.21               gridGraphics_0.5-1         
## [57] desc_1.4.2                  mclust_6.0.0               
## [59] Biostrings_2.68.1           pillar_1.9.0               
## [61] MatrixGenerics_1.12.3       KernSmooth_2.23-22         
## [63] generics_0.1.3              rprojroot_2.0.3            
## [65] RCurl_1.98-1.12             hms_1.1.3                  
## [67] munsell_0.5.0               scales_1.2.1               
## [69] glue_1.6.2                  tools_4.3.0                
## [71] BiocIO_1.10.0               data.table_1.14.8          
## [73] GenomicAlignments_1.36.0    mvtnorm_1.2-2              
## [75] fs_1.6.3                    XML_3.99-0.14              
## [77] colorspace_2.1-0            GenomeInfoDbData_1.2.10    
## [79] restfulr_0.0.15             cli_3.6.1                  
## [81] textshaping_0.3.6           fansi_1.0.4                
## [83] S4Arrays_1.0.5              dplyr_1.1.2                
## [85] gtable_0.3.3                yulab.utils_0.0.7          
## [87] sass_0.4.7                  digest_0.6.33              
## [89] ggplotify_0.1.2             rjson_0.2.21               
## [91] farver_2.1.1                memoise_2.0.1              
## [93] htmltools_0.5.6             pkgdown_2.0.7              
## [95] lifecycle_1.0.3

References

Davis, Eric S., Wancen Mu, Stuart Lee, Mikhail G. Dozmorov, Michael I. Love, and Douglas H. Phanstiel. 2023. “matchRanges: Generating Null Hypothesis Genomic Ranges via Covariate-Matched Sampling.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btad197.