Assembly and validation of conserved long non-coding RNAs in the ruminant transcriptome

mRNA-like long non-coding RNAs (lncRNA) are a significant component of mammalian transcriptomes, although most are expressed only at low levels, with high tissue-specificity and/or at specific developmental stages. In many cases, therefore, lncRNA detection by RNA-sequencing (RNA-seq) is compromised by stochastic sampling. To account for this and create a catalogue of ruminant lncRNA, we compared de novo assembled lncRNA derived from large RNA-seq datasets in transcriptional atlas projects for sheep and goats with previous lncRNA assembled in cattle and human. Few lncRNA could be reproducibly assembled from a single dataset, even with deep sequencing of the same tissues from multiple animals. Furthermore, there was little sequence overlap between lncRNA assembled from pooled RNA-seq data. We combined positional conservation (synteny) with cross-species mapping of candidate lncRNA to identify a consensus set of ruminant lncRNA and then used the RNA-seq data to demonstrate detectable and reproducible expression in each species. The majority of lncRNA were encoded by single exons, and expressed at < 1 TPM. In sheep, 20-30% of lncRNA had expression profiles significantly correlated with neighbouring protein-coding genes, suggesting association with enhancers. Alongside substantially expanding the ruminant lncRNA repertoire, the outcomes of our analysis demonstrate that stochastic sampling can be partly overcome by combining RNA-seq datasets from related species. This has practical implications for the future discovery of lncRNA in other species.

In the largest assembly of predicted lncRNAs, from humans, the transfrags (transcript lncRNAs [72]. Before assembling transfrags, machine learning methods were employed to 2 8 7 filter, from each library, any library-specific background noise (genomic DNA contamination 2 8 8 and incompletely processed RNA). Filtered libraries were then merged before assembling the 2 8 9 final gene models, in effect pooling together transfrags (which may be partial or full-length 2 9 0 transcripts) from all possible libraries. Consequently, a given set of transfrags can be 2 9 1 assembled into a consensus transcript for a lncRNA, but that consensus transcript might not 2 9 2 actually exist in any one cellular source. The only unequivocal means to confirm the full 2 9 3 length expression would be to clone the full length cDNA. However, additional confidence 1 4 replicate. In the sheep expression atlas, 31 diverse tissues/cell types were sampled in each of 2 9 6 6 individual adults (3 females, 3 males, all unrelated virgin animals approximately 2 years of 2 9 7 age). By taking a subset of 31 common tissues per individual, each of the 6 adults was represented by ~0.75 billion reads. In a typical lncRNA assembly pipeline, read alignments from all individuals are merged, to 3 0 0 maximise the number of candidate gene models (using, for instance, StringTie --merge; see reconstructed in, is shown in Table S21. The GTFs themselves are available as Dataset S1  Only 812 of the 12,296 sheep lncRNRAs (6.6%) could be fully reconstructed by any of the 3 0 8 63 GTF combinations (Table S21). One caveat in this assessment is that these sheep libraries are exclusively from adults. Many of the 12,296 lncRNA models may instead be expressed 3 1 0 during embryonic development. There is evidence of extensive embryonic lncRNA 3 1 1 expression in human [15,73] and mouse [16,74]. The lack of embryonic tissues could also  Table S22). In many cases, full-length sheep lncRNAs cannot be reconstructed using all reads sequenced 3 1 7 from a given individual. For instance, the known lncRNA ENSOARG00000025201 is  (Table S21). Only 189 lncRNAs (1.5%) were fully reconstructed in 3 2 0 all 63 possible GTFs. Notably, 154 of these are known Ensembl lncRNAs (Table S21). although a distinct subset -are arbitrarily classified as lncRNAs [79]. eRNAs are likely to be 3 2 7 co-expressed with protein-coding genes in their immediate genomic vicinity.

2 8
To identify co-regulated sets of protein-coding and non-coding loci, we performed network 3 2 9 cluster analysis of the sheep expression level dataset (Table S19)  highly expressed) are included in this analysis (n = 5353). The resulting graph contained only highly structured, organised into clusters of genes with a tissue or cell-type specific 3 3 6 expression profile (Table S23).

7
We expect that for a given cluster of co-expressed genes (which contains x lncRNAs and y 3 3 8 protein-coding genes, each on chromosome z), the distance between an enhancer-derived 3 3 9 lncRNA and the nearest protein-coding gene should be significantly shorter than the distance 3 4 0 between that lncRNA and a random subset of protein-coding genes. For the purposes of this 3 4 1 test, each random subset, of size y, is drawn from the complete set of protein-coding genes on 1 6 and their degree of co-expression with the lncRNA. The significance of any difference in 3 4 4 distance was then assessed using a randomisation test (see Methods).

4 5
Of the 5353 lncRNAs included in the analysis, 1351 (25%) were found on the same 3 4 6 chromosome as a highly co-expressed protein-coding gene (Table S24), with 252 of these 3 4 7 (19%) significantly closer to the co-expressed gene than to randomly selected genes from the 3 4 8 same chromosome (p < 0.05; Table S25).

4 9
Even where the lncRNA is reproducibly expressed in each of 6 animals, there is still 3 5 0 substantial noise in the expression estimates with compromises co-expression analysis. We  (Table S26). The distance to the nearest gene correlates 2.2x10 -16 ) and downstream (rho = -0.21, p < 2.2x10 -16 ) of the lncRNA (Table S26). This suggests that, in general, the expression profile of a lncRNA is more similar to nearer than 3 5 7 more distant protein-coding genes. Using a variant of the above randomisation test, we also the lncRNA and its nearest protein-coding gene, was significantly greater than the value of r 3 6 0 obtained when correlating the lncRNA with 1000 random protein-coding genes drawn from Comparative analysis of lncRNAs assembled using RNA-seq data from several closely 3 7 1 related species -sheep, goat and cattle -demonstrates that for the de novo assembly of 3 7 2 lncRNAs requires very high-depth RNA-seq datasets with a large number of replicates (> 6 3 7 3 replicates per sample, each sequencing >> 100 million reads). The transcription of many 3 7 4 lncRNAs identified by this cross-species approach is conserved, effectively validating their 3 7 5 existence. We identified a subset of lncRNAs in close proximity to protein-coding genes with 3 7 6 which they are strongly co-expressed, consistent with the evolutionary origin of some substantially expanding the lncRNA repertoire for several livestock species, we demonstrate 3 8 0 that the conventional approach to lncRNA detection -that is, species-specific de novo 3 8 1 assembly -can be reliably supplemented by data from related species. The majority of these libraries were sequenced to a depth of >25 million paired-end reads per 3 9 4 sample using the Illumina TruSeq mRNA library preparation protocol (polyA-selected) 3 9 5 (Illumina; Part: 15031047, Revision E). A subset of 11 transcriptionally rich 'core' tissues 3 9 6 (bicep muscle, hippocampus, ileum, kidney medulla, left ventricle, liver, ovary, reticulum, 3 9 7 spleen, testes, thymus), plus one cell type in two conditions (bone marrow derived (LPS)), were sequenced to a depth of >100 million paired-end reads per sample using the   All RNA-seq libraries for goat were prepared by Edinburgh Genomics (Edinburgh Genomics, Edinburgh, UK) (as above) and sequenced using the Illumina HiSeq 4000 sequencing 4 1 1 platform (Illumina, San Diego, USA). These libraries were sequenced to a depth of >30 set of de novo assembled transcripts. The same pipeline is applied to the goat RNA-seq data.

2 2
This pipeline culminates in a single file per species, merged.gtf; that is, the output of replicates of pre-existing bone marrow-derived macrophage libraries, prepared using an 4 2 7 mRNA-seq rather than a total RNA-seq protocol). Not all transcript models in either GTF 4 2 8 will be stranded. This is because HISAT2 infers the transcription strand of a given transcript 4 2 9 by reference to its splice sites; this is not possible for single exon transcripts, which are un- The GTF was parsed to distinguish candidate lncRNAs from assembly artefacts, and from Ensembl-hosted Oar v3.1 annotation and assumed true of all gene models in the ARS1 4 3 6 annotation), or (c) were associated with multiple transcript models (which are more likely to 4 3 7 be spurious). For single-exon gene models, we used a more conservative length threshold of  sheep annotation, the goat annotation is currently only available via NCBI).

5 4
Each longlist of candidates was assessed for coding potential using three different tools: training data for this species).

6 9
For each remaining gene model, we concatenated its exon sequence and identified the longest coding', this translated ORF was considered the most likely peptide encoded by the gene. and HMMER hits -are given in Tables S1 and S2, respectively.  conservation was not expected. we considered successful alignments to be those containing one or more consecutive runs of 5 3 4 20 identical residues, without gaps (the majority of these alignments in any case have >= 5 3 5 75% identity across the entire length of the transcript (Table S18)). The probability that a 5 3 6 transcript randomly matches 20 consecutive residues, within a pre-defined region, is 5 3 7 extremely low.

3 8
For successful alignments, the target sequence (that is, an extract from the intergenic region)   chromatin immunoprecipitation and sequencing (ChIP-seq) data from 13 cell lines. For the 11 'core' tissues of the sheep expression atlas, plus unstimulated and LPS-stimulated lncRNA models (Table S1) and those lncRNAs assembled in either human (n = 18), goat  transcripts. This suggests that those lncRNAs expected to be found at a given genomic location are captured in only one species, not both, 6 6 5 consistent with the stochastic sampling of lncRNAs by RNA-seq libraries.  Table 2. lncRNA transcripts assembled using the RNA-seq libraries of only one species can in many cases be directly mapped to the genome of 6 7 0 another species, assuming the lncRNA is located within a region of conserved synteny.     showing sequence homology to either a known protein (in Swiss-Prot) or protein domain (in 1 0 1 0 Pfam-A).  Pfam-A).             Table S17. High-confidence lncRNA pairs, those conserved across species both sequentially 1 0 4 9 and positionally.  Table S18. lncRNAs inferred in one species by the genomic alignment of a transcript 1 0 5 2 assembled with the RNA-seq libraries from a related species.        Table S25. Distance between lncRNAs and protein-coding genes within the same co-