CRISPRviewR-vignette

Albert C. Vill

7 November 2022

library(CRISPRviewR)

Read in and merge minCED data

Process file pairs and merge the objects into a single dataset.

baseurl <- "https://raw.githubusercontent.com/acvill/CRISPRviewR/master/example_data_minced/"

merged <- merge_minced(
  s1 = read_minced(txt = url(paste0(baseurl, "s1.txt")),
                   gff = url(paste0(baseurl, "s1.gff"))),
  s2 = read_minced(txt = url(paste0(baseurl, "s2.txt")),
                   gff = url(paste0(baseurl, "s2.gff"))),
  s3 = read_minced(txt = url(paste0(baseurl, "s3.txt")),
                   gff = url(paste0(baseurl, "s3.gff"))),
  s4 = read_minced(txt = url(paste0(baseurl, "s4.txt")),
                   gff = url(paste0(baseurl, "s4.gff"))),
  s5 = read_minced(txt = url(paste0(baseurl, "s5.txt")),
                   gff = url(paste0(baseurl, "s5.gff")))
  )

Inspect a subset of the data. Note that running crispr_summary() on the full merged object may give a spurious output as array labels are repeated across samples.

crispr_summary(merged |> dplyr::filter(id == "s1"))
#> # A tibble: 235 × 4
#> # Groups:   array [235]
#>    array    contig                              length spacer_count
#>    <chr>    <chr>                                <dbl>        <int>
#>  1 CRISPR1  NODE_11_length_163482_cov_26.862691    499            7
#>  2 CRISPR2  NODE_12_length_155701_cov_23.751642   3341           50
#>  3 CRISPR3  NODE_15_length_139312_cov_42.730247    697           12
#>  4 CRISPR4  NODE_43_length_91727_cov_51.432073    4917           75
#>  5 CRISPR5  NODE_83_length_64643_cov_17.445052     250            5
#>  6 CRISPR6  NODE_94_length_61915_cov_16.552279     206            4
#>  7 CRISPR7  NODE_94_length_61915_cov_16.552279     559            9
#>  8 CRISPR8  NODE_114_length_56943_cov_14.370324   2902           48
#>  9 CRISPR9  NODE_141_length_50913_cov_5.048252     691           11
#> 10 CRISPR10 NODE_153_length_49486_cov_10.982885    220            4
#> # … with 225 more rows

Fix truncated repeats

minCED relies on CRISPR Recognition Tool (CRT) to detect CRISPR repeats. The CRT algorithm requires repeats to be identical, and this stringency can lead to the misassignment of portions of repeat sequences to spacers. See, for example, the first array of sample s1.

merged |>
  dplyr::filter(id == "s1" & array == "CRISPR1") |>
  dplyr::select(rep, spacer)
#> # A tibble: 7 × 2
#>   rep                              spacer                                       
#>   <chr>                            <chr>                                        
#> 1 GTTGTCATTAGCTTCCAGATTCCGTACCTTCA CACTTGCTAATACAGCTGTGGTTGAGCCAAACAATGAGATGGTA…
#> 2 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA TACTTGCTAATACAGCGCACGCGAGACCTTCACGCGACTAGGAC…
#> 3 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA TACTTGCTAATACAGCCACGAGCCTCATCACGCGAACTCTCATC…
#> 4 GTTGTGATTAGCTTTCAGATTCCGCACCTTCA TACTTGCTAATCCAGCCGAATTATTGCAACGCTTATCCTCGCCT…
#> 5 GTTGTGATTAGCTTTCGAATTCCGTACCTTCA CACTTGCTAACACAGCATAAAAACGACGACGACACGACCGACAG…
#> 6 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA CACTTGCTAATACAGCTCGGAGGAGTGAAGAATAGCCAGCACCT…
#> 7 GTTGTGATTAGCTTTCAGATTCCGTACCTCCA <NA>

Note the similarity of the first 16 nt of each spacer. If fix_repeats = TRUE is specified, read_minced() will attempt to reassign portions of spacers to repeats by calculating rolling average conservation scores across spacer consensus sequences representing both possible array orientations. Reading from the left end of the conservation score vector, a new spacer-repeat boundary is drawn at the position where the conservation score falls below the threshold value given by cutoff. When cutoff = 0.7 (default), this method reliably reassigns more-similar-than-expected spacer substrings to their neighboring repeats.

baseurl <- "https://raw.githubusercontent.com/acvill/CRISPRviewR/master/example_data_minced/"
read_minced(txt = url(paste0(baseurl, "s1.txt")),
            gff = url(paste0(baseurl, "s1.gff")),
            fix_repeats = TRUE) |>
  dplyr::filter(array == "CRISPR1") |>
  dplyr::select(rep, spacer)
#> # A tibble: 7 × 2
#>   rep                                              spacer                       
#>   <chr>                                            <chr>                        
#> 1 GTTGTCATTAGCTTCCAGATTCCGTACCTTCACACTTGCTAATACAGC TGTGGTTGAGCCAAACAATGAGATGGTA…
#> 2 GTTGTGATTAGCTTTCAGATTCCGTACCTTCATACTTGCTAATACAGC GCACGCGAGACCTTCACGCGACTAGGAC…
#> 3 GTTGTGATTAGCTTTCAGATTCCGTACCTTCATACTTGCTAATACAGC CACGAGCCTCATCACGCGAACTCTCATC…
#> 4 GTTGTGATTAGCTTTCAGATTCCGCACCTTCATACTTGCTAATCCAGC CGAATTATTGCAACGCTTATCCTCGCCT…
#> 5 GTTGTGATTAGCTTTCGAATTCCGTACCTTCACACTTGCTAACACAGC ATAAAAACGACGACGACACGACCGACAG…
#> 6 GTTGTGATTAGCTTTCAGATTCCGTACCTTCACACTTGCTAATACAGC TCGGAGGAGTGAAGAATAGCCAGCACCT…
#> 7 GTTGTGATTAGCTTTCAGATTCCGTACCTCCA                 <NA>

Because we are only working with the sequence context of the arrays given in the minCED data, reassignment of spacer sequences to repeats means that one repeat at one end of the array will retain its original sequence.

Group CRISPR arrays by shared repeats

The group_arrays() function determines a consensus repeat (`repcon``) for each array, then associates arrays with identical consensus sequences.

groups <- group_arrays(merged)
group_summary(groups)
#> # A tibble: 366 × 4
#>    repcon_grp repcon                               sample_list sample_count
#>         <int> <chr>                                <list>             <int>
#>  1         65 ATTTCAAAACCAYATCCCCGCAAGGGGACGGAAAC  <chr [5]>              5
#>  2        102 CAGTATATCAAAGGGGATTGGTGATTGATTCCCAGC <chr [5]>              5
#>  3        107 CCAGCCCTGGCGCGTGCGGGCGTTCCC          <chr [5]>              5
#>  4        108 CCAGCCGCCTTCGGGCGGCTGTGTGTTGAAAC     <chr [5]>              5
#>  5        114 CCCGTCCCCGTGAGCGCGGGCGCAGC           <chr [5]>              5
#>  6        116 CCCTCAATGAAGTGCAGTCCCCGAAGGAACTGCAAG <chr [5]>              5
#>  7        131 CCTTAATCTCCGTATGCGCGGAGCTGTC         <chr [5]>              5
#>  8        152 GAAGTCTATCAAGGGGTCTGGTGACTGATTCCCAGC <chr [5]>              5
#>  9        173 GCCTCAATGAAGTGCAGTTCCCGAAGGAACTGCGAC <chr [5]>              5
#> 10        182 GGAACTACCTCCGCGCACGCGGAGAATAC        <chr [5]>              5
#> # … with 356 more rows

Groups that are represented across all of the samples may be interesting to plot. Let’s get the labels for those groups.

group_summary(groups) |>
  dplyr::filter(sample_count == length(unique(merged$id))) |>
  dplyr::select(repcon_grp) |>
  as.vector()
#> $repcon_grp
#>  [1]  65 102 107 108 114 116 131 152 173 182 185 196 200 205 206 211 227 235 236
#> [20] 242 254 266 268 272 296 313 314 325 334 343 348

Plot CRISPR arrays

Given a repcon_grp number selected from group_summary(), you can subset your grouped data to plot arrays with a shared repeat.

plot_arrays(group = dplyr::filter(groups, repcon_grp == 116))

Now we see the utility of this package! By comparing CRISPR arrays with shared repeat sequences, we observe that an array resident in sample s3 lacks the spacer diversity of the arrays seen in the other samples. Before interpreting these plots as representative of biological truth, be sure to review the list of caveats.

By default, a sequence logo representing the repeat consensus is plotted below the arrays. This can be toggled with the plot_logo switch.

plot_arrays(group = dplyr::filter(groups, repcon_grp == 116),
            plot_logo = FALSE)

To better differentiate spacers, cluster numbers can be added with the number_spacers option.

plot_arrays(group = dplyr::filter(groups, repcon_grp == 116),
            number_spacers = TRUE)

Spacer clustering is controlled by the cdist parameter, which represents a proportion of the maximum pairwise Levenshtein distance for the set of spacers associated with a consensus repeat sequence. The default value is 0.1. Decreasing this value creates more spacer clusters, and increasing this value reduces the number of clusters.

plot_arrays(group = dplyr::filter(groups, repcon_grp == 116),
            cdist = 0.5, number_spacers = TRUE)

The default palette used by plot_arrays() has 34 colors. Custom palettes can be specified with the palette parameter. If a palette has fewer colors than there are spacer clusters in the array group, colors are recycled, and number_spacers = TRUE should be specified.

alt_palette <- c("#7FC97F","#BEAED4","#FDC086","#FFFF99",
                 "#386CB0","#F0027F","#BF5B17","#666666")
plot_arrays(group = dplyr::filter(groups, repcon_grp == 116),
            palette = alt_palette, number_spacers = TRUE)

When a single sample has more than one array with the same consensus repeat sequence, as in the case when one CRISPR array is split across multiple contigs, the arrays are plotted side-by-side with gaps.

plot_arrays(group = dplyr::filter(groups, repcon_grp == 268))