Introduction

The following vignette describes the nullranges implementation of the block bootstrap with respect to a genomic segmentation. See the main nullranges vignette for an overview of the idea of bootstrapping, and there is additionally a vignette on block boostrapping without respect to segmentation.

As proposed by Bickel et al. (2010), nullranges contains an implementation of a block bootstrap, such that features are sampled from the genome in blocks. Blocks are sampled and placed within regions of a genome segmentation. That is, for a genome segmented into states 1,2,…,S, blocks from state s will be used to tile the ranges of state s in each bootstrap sample.

The segmented block bootstrap has two options, either:

  • Perform a de-novo segmentation of the genome using feature density, e.g. gene density
  • Use exiting segmentation (e.g. ChromHMM, etc.) downloaded from AnnotationHub or external to Bioconductor (BED files imported with rtracklayer)

In this vignette, we give an example of segmenting the hg38 genome by Ensembl gene density, then we profile the time to generate a single block bootstrap sample. Finally, we use a toy dataset to visualize what a segmented block bootstrap sample looks like with respect to a genome segmentation. Future versions of this vignette will demonstrate the functions used within an overlap analysis. See also the unsegmented block bootstrap vignette in nullranges, if it is not desired to bootstrap with respect to a genome segmentation.

A finally consideration is whether the blocks should scale proportionally to the segment state length, with the default setting of proportionLength=TRUE. When blocks scale proportionally, blockLength provides the maximal length of a block, while the actual block length used for a segmentation state is proportional to the fraction of genomic basepairs covered by that state. This option is visualized on toy data at the end of this vignette.

Segmentation by gene density

First we obtain the Ensembl genes (Howe et al. 2020) for segmenting by gene density. We obtain these using the ensembldb package (Rainer, Gatto, and Weichenberger 2019).

suppressPackageStartupMessages(library(EnsDb.Hsapiens.v86))
edb <- EnsDb.Hsapiens.v86
filt <- AnnotationFilterList(GeneIdFilter("ENSG", "startsWith"))
g <- genes(edb, filter = filt)

We perform some processing to align the sequences (chromosomes) of g with our excluded regions and our features of interest (DNase hypersensitive sites, or DHS, defined below).

library(GenomeInfoDb)
g <- keepStandardChromosomes(g, pruning.mode = "coarse")
seqlevels(g, pruning.mode="coarse") <- setdiff(seqlevels(g), "MT")
# normally we would assign a new style, but for recent host issues
## seqlevelsStyle(g) <- "UCSC" 
seqlevels(g) <- paste0("chr", seqlevels(g))
genome(g) <- "hg38"
g <- sortSeqlevels(g)
g <- sort(g)
table(seqnames(g))
## 
##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
##  5194  3971  3010  2505  2868  2863  2867  2353  2242  2204  3235  2940  1304 
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY 
##  2224  2152  2511  2995  1170  2926  1386   835  1318  2359   523

Import excluded regions

We next import excluded regions, as created by Amemiya, Kundaje, and Boyle (2019), and packaged for R/Bioconductor in the excluderanges package by Mikhail Dozmorov.

## snapshotDate(): 2021-10-20
# `excludeGR.hg38.Kundaje.1` -- see excluderanges vignette
exclude <- ah[["AH95917"]]
## loading from cache
all(seqlengths(g) == seqlengths(exclude))
## [1] TRUE

CBS segmentation

We first demonstrate the use a CBS segmentation as implemented in DNAcopy (Olshen et al. 2004).

We load the nullranges and plyranges packages, and patchwork in order to produce grids of plots.

We subset the excluded ranges to those which are 500 bp or larger. The motivation for this step is to avoid segmenting the genome into many small pieces due to an abundance of short excluded regions. Note that we re-save the excluded ranges to exclude.

set.seed(5)
exclude <- exclude %>%
  plyranges::filter(width(exclude) >= 500)
L_s <- 1e6
seg <- segmentDensity(g, n = 3, L_s = L_s,
                      exclude = exclude, type = "cbs")
## Analyzing: Sample.1
plots <- lapply(c("ranges","barplot","boxplot"), function(t) {
  plotSegment(seg, exclude, type = t)
})
plots[[1]]

plots[[2]] + plots[[3]]

Note here, the default ranges plot gives whole genome and every fractured bind regions represents state transformations happens. However, some transformations within small ranges cannot be visualized, e.g 1kb. If user want to look into specific ranges of segmentation state, the region argument is flexible to support.

region <- GRanges("chr16", IRanges(3e7,4e7))
plotSegment(seg, exclude, type="ranges", region=region)

Alternatively: HMM segmentation

Here we use an alternative segmentation implemented in the RcppHMM CRAN package, using the initGHMM, learnEM, and viterbi functions.

seg_hmm <- segmentDensity(g, n = 3, L_s = L_s,
                          exclude = exclude, type = "hmm")
## Finished at Iteration: 100 with Error: 9.31905e-06
plots <- lapply(c("ranges","barplot","boxplot"), function(t) {
  plotSegment(seg_hmm, exclude, type = t)
})
plots[[1]]

plots[[2]] + plots[[3]]

Segmented block bootstrap

We use a set of DNase hypersensitivity sites (DHS) from the ENCODE project (ENCODE 2012) in A549 cell line (ENCSR614GWM). Here, for speed, we work with a pre-processed data object from ExperimentHub, created using the following steps:

  • Download ENCODE DNase hypersensitive peaks in A549 from AnnotationHub
  • Subset to standard chromosomes and remove mitochondrial DNA
  • Use a chain file from UCSC to lift ranges from hg19 to hg38
  • Sort the DHS features to be bootstrapped

These steps are included in nullrangesData in the inst/scripts/make-dhs-data.R script.

## snapshotDate(): 2021-10-18
dhs <- DHSA549Hg38()
## see ?nullrangesData and browseVignettes('nullrangesData') for documentation
## loading from cache
## 
##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
## 15240 14436  9549  6034  9714  8043  9549  8463  8247  8924  8618 10597  2732 
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY 
##  5785  6929  6596 11438  2643  8818  5986  1471  3010  4024   239

Now we apply a segmented block bootstrap with blocks of size 500kb, to the peaks. Here we simply show generation of two iterations of a block bootstrap, while we plan in future sections of this vignette to demonstrate its use in a workflow including plyranges (Lee, Cook, and Lawrence 2019).

blockLength <- 5e5
br <- bootRanges(dhs, blockLength, R = 2, seg = seg, exclude=exclude)
br
## bootRanges object with 354761 ranges and 9 metadata columns:
##            seqnames            ranges strand |        name     score
##               <Rle>         <IRanges>  <Rle> | <character> <numeric>
##        [1]     chr1 55046962-55047111      * |        <NA>         0
##        [2]     chr1 55066027-55066176      * |        <NA>         0
##        [3]     chr1 55112487-55112636      * |        <NA>         0
##        [4]     chr1 55179402-55179551      * |        <NA>         0
##        [5]     chr1 55179862-55180011      * |        <NA>         0
##        ...      ...               ...    ... .         ...       ...
##   [354757]    chr22 50534629-50534778      * |        <NA>         0
##   [354758]    chr22 50788902-50789051      * |        <NA>         0
##   [354759]    chr22 50818469-50818468      * |        <NA>         0
##   [354760]    chr22 50818469-50818468      * |        <NA>         0
##   [354761]    chr22 50818469-50818468      * |        <NA>         0
##            signalValue    pValue    qValue      peak     block  iter
##              <numeric> <numeric> <numeric> <numeric> <integer> <Rle>
##        [1]           3        -1        -1        -1         1     1
##        [2]          18        -1        -1        -1         1     1
##        [3]          13        -1        -1        -1         1     1
##        [4]         166        -1        -1        -1         1     1
##        [5]          41        -1        -1        -1         1     1
##        ...         ...       ...       ...       ...       ...   ...
##   [354757]          14        -1        -1        -1     12468     2
##   [354758]          14        -1        -1        -1     12471     2
##   [354759]          36        -1        -1        -1     12472     2
##   [354760]          14        -1        -1        -1     12472     2
##   [354761]          14        -1        -1        -1     12472     2
##            blockLength
##                  <Rle>
##        [1]      500000
##        [2]      500000
##        [3]      500000
##        [4]      500000
##        [5]      500000
##        ...         ...
##   [354757]      500000
##   [354758]      500000
##   [354759]      500000
##   [354760]      500000
##   [354761]      500000
##   -------
##   seqinfo: 24 sequences from hg38 genome

What is returned here? The bootRanges function returns a bootRanges object, which is a simple sub-class of GRanges. The iteration (iter) and the block length (blockLength) are recorded as metadata columns, accessible via mcols. We chose to return the bootstrapped ranges as GRanges rather than GRangesList, as the former is more compatible with downstream overlap joins with plyranges, where the iteration column can be used with group_by to provide per bootstrap summary statistics.

Note that we use the exclude object from the previous step, which does not contain small ranges. If one wanted to also avoid generation of bootstrapped features that overlap small excluded ranges, then omit this filtering step (use the original, complete exclude feature set).

Timing on DHS peaks

Here, we test the speed of the various options for bootstrapping (see below for visualization of the difference).

library(microbenchmark)
microbenchmark(
  list=alist(
    prop = bootRanges(dhs, blockLength, seg = seg, proportionLength = TRUE),
    no_prop = bootRanges(dhs, blockLength, seg = seg, proportionLength = FALSE)
), times=10)
## Unit: milliseconds
##     expr      min       lq     mean   median       uq       max neval
##     prop 244.0722 249.3901 253.5678 251.3907 258.2528  266.1374    10
##  no_prop 238.3742 243.9228 327.2324 246.9875 258.6189 1031.4601    10

Visualizing toy bootstrap samples

Below we present a toy example for visualizing the segmented block bootstrap. First, we define a helper function for plotting GRanges using plotgardener (Kramer et al. 2021). A key aspect here is that we color the original and bootstrapped ranges by the genomic state (the state of the segmentation that the original ranges fall in).

suppressPackageStartupMessages(library(plotgardener))
my_palette <- function(n) {
  head(c("red","green3","red3","dodgerblue",
         "blue2","green4","darkred"), n)
}
plotGRanges <- function(gr) {
  pageCreate(width = 5, height = 5, xgrid = 0,
                ygrid = 0, showGuides = TRUE)
  for (i in seq_along(seqlevels(gr))) {
    chrom <- seqlevels(gr)[i]
    chromend <- seqlengths(gr)[[chrom]]
    suppressMessages({
      p <- pgParams(chromstart = 0, chromend = chromend,
                    x = 0.5, width = 4*chromend/500, height = 2,
                    at = seq(0, chromend, 50),
                    fill = colorby("state_col", palette=my_palette))
      prngs <- plotRanges(data = gr, params = p,
                          chrom = chrom,
                          y = 2 * i,
                          just = c("left", "bottom"))
      annoGenomeLabel(plot = prngs, params = p, y = 0.1 + 2 * i)
    })
  }
}

Create a toy genome segmentation:

library(GenomicRanges)
seq_nms <- rep(c("chr1","chr2"), c(4,3))
seg <- GRanges(
  seqnames = seq_nms,
  IRanges(start = c(1, 101, 201, 401, 1, 201, 301),
          width = c(100, 100, 200, 100, 200, 100, 100)),
  seqlengths=c(chr1=500,chr2=400),
  state = c(1,2,1,3,3,2,1),
  state_col = factor(1:7)
)

We can visualize with our helper function:

plotGRanges(seg)

Now create small ranges distributed uniformly across the toy genome:

set.seed(1)
n <- 200
gr <- GRanges(
  seqnames=sort(sample(c("chr1","chr2"), n, TRUE)),
  IRanges(start=round(runif(n, 1, 500-10+1)), width=10)
)
suppressWarnings({
  seqlengths(gr) <- seqlengths(seg)
})
gr <- gr[!(seqnames(gr) == "chr2" & end(gr) > 400)]
gr <- sort(gr)
idx <- findOverlaps(gr, seg, type="within", select="first")
gr <- gr[!is.na(idx)]
idx <- idx[!is.na(idx)]
gr$state <- seg$state[idx]
gr$state_col <- factor(seg$state_col[idx])
plotGRanges(gr)

Not scaling by segmentation

We can visualize block bootstrapped ranges when the blocks do not scale to segment state length:

set.seed(1)
gr_prime <- bootRanges(gr, blockLength = 25, seg = seg,
                       proportionLength = FALSE)
plotGRanges(gr_prime)

Scaling by segmentation

This time the blocks scale to the segment state length. Note that in this case blockLength is the maximal block length possible, but the actual block lengths per segment will be smaller (proportional to the fraction of basepairs covered by that state in the genome segmentation).

set.seed(1)
gr_prime <- bootRanges(gr, blockLength = 50, seg = seg,
                       proportionLength = TRUE)
plotGRanges(gr_prime)

Note that some ranges from adjacent states are allowed to be placed within different states in the bootstrap sample. This is because, during the random sampling of blocks of original data, a block is allowed to extend beyond the segmentation region of the state being sampled, and features from adjacent states are not excluded from the sampled block.

Session information

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] plotgardener_0.99.16        microbenchmark_1.4-7       
##  [3] nullrangesData_0.99.2       InteractionSet_1.21.1      
##  [5] SummarizedExperiment_1.23.5 MatrixGenerics_1.5.4       
##  [7] matrixStats_0.61.0          ExperimentHub_2.1.4        
##  [9] patchwork_1.1.1             plyranges_1.13.1           
## [11] nullranges_0.99.19          excluderanges_0.99.6       
## [13] AnnotationHub_3.1.7         BiocFileCache_2.1.1        
## [15] dbplyr_2.1.1                EnsDb.Hsapiens.v86_2.99.0  
## [17] ensembldb_2.17.4            AnnotationFilter_1.17.1    
## [19] GenomicFeatures_1.45.2      AnnotationDbi_1.55.2       
## [21] Biobase_2.53.0              GenomicRanges_1.45.0       
## [23] GenomeInfoDb_1.29.10        IRanges_2.27.2             
## [25] S4Vectors_0.31.5            BiocGenerics_0.39.2        
## 
## loaded via a namespace (and not attached):
##   [1] systemfonts_1.0.3             plyr_1.8.6                   
##   [3] RcppHMM_1.2.2                 lazyeval_0.2.2               
##   [5] BiocParallel_1.27.17          ggplot2_3.3.5                
##   [7] digest_0.6.28                 yulab.utils_0.0.4            
##   [9] htmltools_0.5.2               fansi_0.5.0                  
##  [11] magrittr_2.0.1                memoise_2.0.0                
##  [13] ks_1.13.2                     Biostrings_2.61.2            
##  [15] pkgdown_1.6.1                 prettyunits_1.1.1            
##  [17] colorspace_2.0-2              blob_1.2.2                   
##  [19] rappdirs_0.3.3                textshaping_0.3.6            
##  [21] xfun_0.27                     dplyr_1.0.7                  
##  [23] crayon_1.4.1                  RCurl_1.98-1.5               
##  [25] jsonlite_1.7.2                glue_1.4.2                   
##  [27] gtable_0.3.0                  zlibbioc_1.39.0              
##  [29] XVector_0.33.0                strawr_0.0.9                 
##  [31] DelayedArray_0.19.4           scales_1.1.1                 
##  [33] mvtnorm_1.1-3                 DBI_1.1.1                    
##  [35] Rcpp_1.0.7                    xtable_1.8-4                 
##  [37] progress_1.2.2                gridGraphics_0.5-1           
##  [39] bit_4.0.4                     mclust_5.4.7                 
##  [41] httr_1.4.2                    RColorBrewer_1.1-2           
##  [43] speedglm_0.3-3                ellipsis_0.3.2               
##  [45] pkgconfig_2.0.3               XML_3.99-0.8                 
##  [47] farver_2.1.0                  sass_0.4.0                   
##  [49] utf8_1.2.2                    DNAcopy_1.67.0               
##  [51] ggplotify_0.1.0               tidyselect_1.1.1             
##  [53] labeling_0.4.2                rlang_0.4.12                 
##  [55] later_1.3.0                   munsell_0.5.0                
##  [57] BiocVersion_3.14.0            tools_4.1.1                  
##  [59] cachem_1.0.6                  generics_0.1.0               
##  [61] RSQLite_2.2.8                 ggridges_0.5.3               
##  [63] evaluate_0.14                 stringr_1.4.0                
##  [65] fastmap_1.1.0                 yaml_2.2.1                   
##  [67] ragg_1.1.3                    knitr_1.36                   
##  [69] bit64_4.0.5                   fs_1.5.0                     
##  [71] purrr_0.3.4                   KEGGREST_1.33.0              
##  [73] mime_0.12                     pracma_2.3.3                 
##  [75] xml2_1.3.2                    biomaRt_2.49.7               
##  [77] compiler_4.1.1                filelock_1.0.2               
##  [79] curl_4.3.2                    png_0.1-7                    
##  [81] interactiveDisplayBase_1.31.2 tibble_3.1.5                 
##  [83] bslib_0.3.1                   stringi_1.7.5                
##  [85] highr_0.9                     desc_1.4.0                   
##  [87] lattice_0.20-45               ProtGenerics_1.25.1          
##  [89] Matrix_1.3-4                  vctrs_0.3.8                  
##  [91] pillar_1.6.4                  lifecycle_1.0.1              
##  [93] BiocManager_1.30.16           jquerylib_0.1.4              
##  [95] data.table_1.14.2             bitops_1.0-7                 
##  [97] httpuv_1.6.3                  rtracklayer_1.53.1           
##  [99] R6_2.5.1                      BiocIO_1.3.0                 
## [101] promises_1.2.0.1              KernSmooth_2.23-20           
## [103] MASS_7.3-54                   assertthat_0.2.1             
## [105] rprojroot_2.0.2               rjson_0.2.20                 
## [107] withr_2.4.2                   GenomicAlignments_1.29.0     
## [109] Rsamtools_2.9.1               GenomeInfoDbData_1.2.7       
## [111] parallel_4.1.1                hms_1.1.1                    
## [113] grid_4.1.1                    rmarkdown_2.11               
## [115] shiny_1.7.1                   restfulr_0.0.13

References

Amemiya, Haley M, Anshul Kundaje, and Alan P Boyle. 2019. “The ENCODE Blacklist: Identification of Problematic Regions of the Genome.” Scientific Reports 9 (1): 9354. https://doi.org/10.1038/s41598-019-45839-z.
Bickel, Peter J., Nathan Boley, James B. Brown, Haiyan Huang, and Nancy R. Zhang. 2010. “Subsampling Methods for Genomic Inference.” The Annals of Applied Statistics 4 (4): 1660–97. https://doi.org/10.1214/10-{AOAS363}.
ENCODE. 2012. An integrated encyclopedia of DNA elements in the human genome.” Nature 489 (7414): 57–74. https://doi.org/10.1038/nature11247.
Howe, Kevin L, Premanand Achuthan, James Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, et al. 2020. Ensembl 2021.” Nucleic Acids Research 49 (D1): D884–91. https://doi.org/10.1093/nar/gkaa942.
Kramer, Nicole E, Eric S Davis, Craig D Wenger, Erika M Deoudes, Sarah M Parker, Michael I Love, and Douglas H Phanstiel. 2021. Plotgardener: Cultivating precise multi-panel figures in R.” bioRxiv. https://doi.org/10.1101/2021.09.08.459338.
Lee, Stuart, Dianne Cook, and Michael Lawrence. 2019. “Plyranges: A Grammar of Genomic Data Transformation.” Genome Biology 20 (1): 4. https://doi.org/10.1186/s13059-018-1597-8.
Olshen, A. B., E. S. Venkatraman, R. Lucito, and M. Wigler. 2004. Circular binary segmentation for the analysis of array-based DNA copy number data.” Biostatistics 5 (4): 557–72.
Rainer, Johannes, Laurent Gatto, and Christian X Weichenberger. 2019. ensembldb: an R package to create and use Ensembl-based annotation resources.” Bioinformatics 35 (17): 3151–53. https://doi.org/10.1093/bioinformatics/btz031.