Parsing GTF files in the tidyverse

9 Nov 2021
metagenomics bioinformatics r

Like many other people working with sequencing data in R, I rely on different packages to do different things. For reading annotations in GTF and GFF formats, there are a few well-regarded choices to get the job done: refGenome and rtracklayer (see Dave Tang’s blog for a description of each). Unfortunately, refGenome has not been updated since 2019 and has since been removed from the CRAN repository. As for rtracklayer, it is updated regularly as part of the Bioconductor suite of tools, though I’ve found import() can be finicky if your annotation file contains unexpected characters in certain places. Moreover, R’s opaque error reporting can make it hard to diagnose whether an error is caused by rtracklayer or one of its dependencies.

stringr logo

I love the tidyverse. Though it has a reputation for instability, I find the consistent grammar and interoperability of its packages make data cleaning a simple and transparent process. And when breaking changes are implemented, the tidyverse’s meticulous lifecycle documentation means that deprecated code is easy to track down and change. So, I sought a tidyverse approach to read genome annotations. Below is my code to read in a GTF file as a tibble using functions from stringr and dplyr.

I would like to credit Stack Overflow user akrun for providing the regex for attribute string parsing.

This above function correctly parses the attribute column for files in which the number and types of attributes are inconsistent between features. It also handles instances where the attribute field separator (;) is found in quoted attribute strings. Compare this to tidy_gff() from Vince Buffalo’s annotatr package, which parses GFF files using a similar tidy scheme. Digging into the source code, however, we see that the function only extracts a fixed subset of attribute types: ID, Parent, and parent_type.

extract_attr <- function(x, key)  {
  has_key <- regexpr(key, x)
  value <- gsub(sprintf('.*%s=([^;]+).*', key), '\\1', x)
  value[has_key == -1] <- NA
  out <- strsplit(value, ',')
  multi_value <- any(sapply(out, length) > 1)
  if (!multi_value) return(unlist(out))
  out
}

tidy_gff <- function(x, chroms=NULL,
                     features=c('five_prime_UTR', 'three_prime_UTR', 
                                'exon', 'intron', 'gene', 'CDS')) {
  out <- dplyr::select(dplyr::mutate(dplyr::filter(x, feature %in% features), 
                id=extract_attr(attr, 'ID'),
                parent=extract_attr(attr, 'Parent'),
                parent_type=extract_attr(attr, 'parent_type')),
        chrom, start, end, strand, feature, id, parent, parent_type)
  if (!is.null(chroms)) {
    out[out$chrom %in% chroms, ]
  }
  out
}

For most applications, this is sufficient and fast. However, metagenomic annotations often contain exotic attribute types, as is the case with files output by programs that aggregate features from multiple sources like Torsten Seemann’s prokka. In such analyses, you may want the full attribute set, and read_gtf() is here to help.