library(CRISPRviewR)
Process file pairs and merge the objects into a single dataset.
<- "https://raw.githubusercontent.com/acvill/CRISPRviewR/master/example_data_minced/"
baseurl
<- merge_minced(
merged s1 = read_minced(txt = url(paste0(baseurl, "s1.txt")),
gff = url(paste0(baseurl, "s1.gff"))),
s2 = read_minced(txt = url(paste0(baseurl, "s2.txt")),
gff = url(paste0(baseurl, "s2.gff"))),
s3 = read_minced(txt = url(paste0(baseurl, "s3.txt")),
gff = url(paste0(baseurl, "s3.gff"))),
s4 = read_minced(txt = url(paste0(baseurl, "s4.txt")),
gff = url(paste0(baseurl, "s4.gff"))),
s5 = read_minced(txt = url(paste0(baseurl, "s5.txt")),
gff = url(paste0(baseurl, "s5.gff")))
)
Inspect a subset of the data. Note that running
crispr_summary()
on the full merged
object may
give a spurious output as array
labels are repeated across
samples.
crispr_summary(merged |> dplyr::filter(id == "s1"))
#> # A tibble: 235 × 4
#> # Groups: array [235]
#> array contig length spacer_count
#> <chr> <chr> <dbl> <int>
#> 1 CRISPR1 NODE_11_length_163482_cov_26.862691 499 7
#> 2 CRISPR2 NODE_12_length_155701_cov_23.751642 3341 50
#> 3 CRISPR3 NODE_15_length_139312_cov_42.730247 697 12
#> 4 CRISPR4 NODE_43_length_91727_cov_51.432073 4917 75
#> 5 CRISPR5 NODE_83_length_64643_cov_17.445052 250 5
#> 6 CRISPR6 NODE_94_length_61915_cov_16.552279 206 4
#> 7 CRISPR7 NODE_94_length_61915_cov_16.552279 559 9
#> 8 CRISPR8 NODE_114_length_56943_cov_14.370324 2902 48
#> 9 CRISPR9 NODE_141_length_50913_cov_5.048252 691 11
#> 10 CRISPR10 NODE_153_length_49486_cov_10.982885 220 4
#> # … with 225 more rows
minCED relies on CRISPR Recognition Tool (CRT) to detect CRISPR
repeats. The CRT algorithm requires repeats
to be identical, and this stringency can lead to the misassignment
of portions of repeat sequences to spacers. See, for example, the first
array of sample s1
.
|>
merged ::filter(id == "s1" & array == "CRISPR1") |>
dplyr::select(rep, spacer)
dplyr#> # A tibble: 7 × 2
#> rep spacer
#> <chr> <chr>
#> 1 GTTGTCATTAGCTTCCAGATTCCGTACCTTCA CACTTGCTAATACAGCTGTGGTTGAGCCAAACAATGAGATGGTA…
#> 2 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA TACTTGCTAATACAGCGCACGCGAGACCTTCACGCGACTAGGAC…
#> 3 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA TACTTGCTAATACAGCCACGAGCCTCATCACGCGAACTCTCATC…
#> 4 GTTGTGATTAGCTTTCAGATTCCGCACCTTCA TACTTGCTAATCCAGCCGAATTATTGCAACGCTTATCCTCGCCT…
#> 5 GTTGTGATTAGCTTTCGAATTCCGTACCTTCA CACTTGCTAACACAGCATAAAAACGACGACGACACGACCGACAG…
#> 6 GTTGTGATTAGCTTTCAGATTCCGTACCTTCA CACTTGCTAATACAGCTCGGAGGAGTGAAGAATAGCCAGCACCT…
#> 7 GTTGTGATTAGCTTTCAGATTCCGTACCTCCA <NA>
Note the similarity of the first 16 nt of each spacer. If
fix_repeats = TRUE
is specified, read_minced()
will attempt to reassign portions of spacers to repeats by calculating
rolling average conservation scores across spacer consensus sequences
representing both possible array orientations. Reading from the left end
of the conservation score vector, a new spacer-repeat boundary is drawn
at the position where the conservation score falls below the threshold
value given by cutoff
. When cutoff = 0.7
(default), this method reliably reassigns more-similar-than-expected
spacer substrings to their neighboring repeats.
<- "https://raw.githubusercontent.com/acvill/CRISPRviewR/master/example_data_minced/"
baseurl read_minced(txt = url(paste0(baseurl, "s1.txt")),
gff = url(paste0(baseurl, "s1.gff")),
fix_repeats = TRUE) |>
::filter(array == "CRISPR1") |>
dplyr::select(rep, spacer)
dplyr#> # A tibble: 7 × 2
#> rep spacer
#> <chr> <chr>
#> 1 GTTGTCATTAGCTTCCAGATTCCGTACCTTCACACTTGCTAATACAGC TGTGGTTGAGCCAAACAATGAGATGGTA…
#> 2 GTTGTGATTAGCTTTCAGATTCCGTACCTTCATACTTGCTAATACAGC GCACGCGAGACCTTCACGCGACTAGGAC…
#> 3 GTTGTGATTAGCTTTCAGATTCCGTACCTTCATACTTGCTAATACAGC CACGAGCCTCATCACGCGAACTCTCATC…
#> 4 GTTGTGATTAGCTTTCAGATTCCGCACCTTCATACTTGCTAATCCAGC CGAATTATTGCAACGCTTATCCTCGCCT…
#> 5 GTTGTGATTAGCTTTCGAATTCCGTACCTTCACACTTGCTAACACAGC ATAAAAACGACGACGACACGACCGACAG…
#> 6 GTTGTGATTAGCTTTCAGATTCCGTACCTTCACACTTGCTAATACAGC TCGGAGGAGTGAAGAATAGCCAGCACCT…
#> 7 GTTGTGATTAGCTTTCAGATTCCGTACCTCCA <NA>
Because we are only working with the sequence context of the arrays given in the minCED data, reassignment of spacer sequences to repeats means that one repeat at one end of the array will retain its original sequence.
Given a repcon_grp
number selected from
group_summary()
, you can subset your grouped data to plot
arrays with a shared repeat.
plot_arrays(group = dplyr::filter(groups, repcon_grp == 116))
Now we see the utility of this package! By comparing CRISPR arrays
with shared repeat sequences, we observe that an array resident in
sample s3
lacks the spacer diversity of the arrays seen in
the other samples. Before interpreting these plots as
representative of biological truth, be sure to review the list of
caveats.
By default, a sequence logo representing the repeat consensus is
plotted below the arrays. This can be toggled with the
plot_logo
switch.
plot_arrays(group = dplyr::filter(groups, repcon_grp == 116),
plot_logo = FALSE)
To better differentiate spacers, cluster numbers can be added with
the number_spacers
option.
plot_arrays(group = dplyr::filter(groups, repcon_grp == 116),
number_spacers = TRUE)
Spacer clustering is controlled by the cdist
parameter,
which represents a proportion of the maximum pairwise Levenshtein
distance for the set of spacers associated with a consensus repeat
sequence. The default value is 0.1. Decreasing this value creates more
spacer clusters, and increasing this value reduces the number of
clusters.
plot_arrays(group = dplyr::filter(groups, repcon_grp == 116),
cdist = 0.5, number_spacers = TRUE)
The default palette used by plot_arrays()
has 34 colors.
Custom palettes can be specified with the palette
parameter. If a palette has fewer colors than there are spacer clusters
in the array group, colors are recycled, and
number_spacers = TRUE
should be specified.
<- c("#7FC97F","#BEAED4","#FDC086","#FFFF99",
alt_palette "#386CB0","#F0027F","#BF5B17","#666666")
plot_arrays(group = dplyr::filter(groups, repcon_grp == 116),
palette = alt_palette, number_spacers = TRUE)
When a single sample has more than one array with the same consensus repeat sequence, as in the case when one CRISPR array is split across multiple contigs, the arrays are plotted side-by-side with gaps.
plot_arrays(group = dplyr::filter(groups, repcon_grp == 268))