
Analysis of Transcriptome Assembly Pipelines for Wheat

Natasha Pavlovikj1, Kevin Begcy2, Sairam Behera1, Malachy Campbell2, Harkamal Walia2, Jitender S. Deogun1

1Department of Computer Science and Engineering, 2Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, USA

Abstract—With advances in next-generation sequencing technologies, transcriptome sequencing has emerged as a powerful tool for performing transcriptome analysis for various organisms. Obtaining a draft transcriptome of an organism is a complex multi-stage pipeline with several steps such as data cleaning, error correction and assembly. Based on the analysis performed in this paper, we conclude that the best assembly is produced when the error correction method is used with Velvet Oases and the “multi-k” strategy that combines the five k-mer assemblies with the highest N50. Our results provide valuable insight for designing a good de novo transcriptome assembly pipeline for a given application.

Keywords—transcriptome assembly; de novo assembly; error correction; digital normalization; multi-k method; k-mer length.

I. INTRODUCTION

With recent advances in sequencing technologies, transcriptome sequencing (RNA-Seq) has emerged as a powerful tool for obtaining large amounts of functional genomic data in both model and non-model organisms. The assembly of raw sequence data into a draft transcriptome is a complex process that includes multiple steps such as data cleaning, contaminant removal, error correction, de novo or reference-based assembly, redundancy removal, and assembly validation. Implementing all these steps requires substantial knowledge of different algorithms and bioinformatics tools. An assembly pipeline simplifies the entire process by automating these steps to produce correct transcripts.

Several tools are available for examining the quality of the sequenced reads, their length, quality scores, duplication levels and overrepresented sequences; FastQC, FASTX-Toolkit, and the R package qrqc [1][2] are some of the widely used ones. Many software packages have been developed to remove artificial elements from the sequenced reads. Tools such as Cutadapt, Scythe and TagCleaner trim adaptors from the raw reads, while Sickle and Prinseq [7] remove low-quality bases. The digital normalization algorithm reduces the memory and computational requirements of transcriptome assembly by decreasing differences in gene coverage, discarding redundant data and removing most errors, while the error correction method yields significant improvements in assembly accuracy [3]. After the data is cleaned and filtered in the pre-processing stage, the next step is to generate the transcriptome assembly from the filtered reads. There are two basic approaches to generating a transcriptome assembly: the reference-based approach and the de novo approach [4]. Most de novo assemblers are based on de Bruijn graphs; Velvet Oases, SOAPdenovo-Trans, Trinity, and Trans-ABySS [5][6] are some of the main de novo transcriptome assembly tools.

In this paper, we develop and analyze 21 different de novo transcriptome assembly pipelines using three de novo assemblers together with different tools and packages for the individual assembly steps. We evaluate the performance of the pipelines when the digital normalization algorithm, the error correction method, or the combination of both is used. Moreover, we investigate the range of k values that should be combined with the “multi-k” method to produce more accurate and full-length transcriptomes. By comparing the performance of these tools and the assemblies they generate, we draw conclusions and provide guidelines for developing good transcriptome assemblies for a given application.

II. MATERIALS AND METHODS

We implement de novo transcriptome assembly pipelines that incorporate the pre-processing, assembly and post-processing stages, as well as assembly annotation. All experiments in this paper are performed on Tusker, one of the High-Performance Computing clusters at the University of Nebraska-Lincoln Holland Computing Center (HCC) [9].

A. Dataset

For the evaluation of the de novo transcriptome assembly pipelines, the diploid wheat Triticum urartu (T. urartu) dataset is used. The sequencing was performed on an Illumina HiSeq2000 machine at the University of California Davis (UCD) Genome Center using the 100 bp paired-end protocol. This produced a total of 82 GB of sequence data with an average of 248.5 million reads. The raw T. urartu sequence data was submitted to the NCBI Sequence Read Archive by the UCD group under NCBI BioProject PRJNA191053.

B. Pre-processing stage

The overall read quality of the 248.5 million 100 bp raw T. urartu Illumina paired-end reads is assessed using FastQC. After this initial quality check, the first step of the pre-processing stage is to remove the artificially added Illumina adaptors and to trim off reads with an average quality score under 20 and a length of less than 20 bp. Furthermore, we investigate the importance of applying digital normalization and error correction to the reads before the assembly by using the programs Khmer, Seecer and the combination of both (Khmer-Seecer). For clarity, we denote the datasets of processed reads generated with Khmer, Seecer and Khmer-Seecer as Pipeline K, Pipeline S and Pipeline KS, respectively.
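As a minimal illustration of the filtering criteria described above (average quality of at least 20 and length of at least 20 bp), the sketch below filters FASTQ records in pure Python. It assumes Phred+33 quality encoding and hypothetical file names; it is not the actual trimming tool used in the pipeline.

```python
# Minimal sketch of the pre-processing filter described above:
# keep reads whose mean Phred quality is >= 20 and whose length is >= 20 bp.
# Assumes Phred+33 quality encoding (standard for Illumina HiSeq 2000 output).

def read_fastq(path):
    """Yield (header, sequence, quality) tuples from a FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            seq = handle.readline().rstrip()
            handle.readline()            # '+' separator line
            qual = handle.readline().rstrip()
            yield header, seq, qual

def mean_quality(qual, offset=33):
    """Average Phred score of a quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_reads(in_path, out_path, min_q=20.0, min_len=20):
    """Write only the reads passing the quality and length thresholds."""
    kept = total = 0
    with open(out_path, "w") as out:
        for header, seq, qual in read_fastq(in_path):
            total += 1
            if len(seq) >= min_len and mean_quality(qual) >= min_q:
                kept += 1
                out.write(f"{header}\n{seq}\n+\n{qual}\n")
    return kept, total

if __name__ == "__main__":
    # hypothetical file names, for illustration only
    kept, total = filter_reads("raw_reads_R1.fastq", "filtered_R1.fastq")
    print(f"kept {kept} of {total} reads")
```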

C. De novo transcriptome assembly

After the pre-processing stage, the paired-end and single-end reads from Pipeline K, Pipeline KS and Pipeline S are used for the transcriptome assembly. In this paper, three de novo assemblers, Velvet Oases, SOAPdenovo-Trans and Trinity, are used and evaluated individually. Assemblies are carried out using the three datasets Pipeline K, Pipeline KS and Pipeline S.
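The single-k assemblies that feed the “multi-k” strategy can be generated by invoking the chosen assembler once per k value. The sketch below shows such a sweep with a placeholder command template; the actual velveth/velvetg/oases and SOAPdenovo-Trans invocations are not reproduced here and should be taken from each tool's documentation.

```python
# Sketch of sweeping a de Bruijn graph assembler over several k-mer lengths.
# The command template is a placeholder, not the real assembler command line.
import shlex
import subprocess

K_VALUES = range(21, 92, 10)   # e.g. k = 21, 31, ..., 91, as in one multi-k group
COMMAND_TEMPLATE = "my_assembler --kmer {k} --reads {reads} --out {outdir}"  # placeholder

def run_multi_k(reads_path):
    """Run one single-k assembly per k value and collect the output directories."""
    outdirs = []
    for k in K_VALUES:
        outdir = f"assembly_k{k}"
        cmd = COMMAND_TEMPLATE.format(k=k, reads=reads_path, outdir=outdir)
        subprocess.run(shlex.split(cmd), check=True)   # fail fast on assembler errors
        outdirs.append(outdir)
    return outdirs

if __name__ == "__main__":
    print(run_multi_k("pipeline_K_reads.fastq"))      # hypothetical input file
```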

Velvet Oases invokes two programs, “velveth” and “velvetg”, to generate the contigs, while the transcripts are created with Oases. SOAPdenovo-Trans has two independent executables that support different maximum values of k. The figure showing the distribution of N50 lengths for all 13 assemblies with different k values for Velvet Oases and SOAPdenovo-Trans with Pipeline K, Pipeline KS and Pipeline S, respectively, is omitted to save space. Since Trinity uses only a single k value of 25, its N50 length is 2,714 bp for Pipeline K, 2,791 bp for Pipeline KS and 2,471 bp for Pipeline S.
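The “multi-k” selection used later relies on ranking single-k assemblies by their N50 length. A minimal sketch of that ranking is shown below; the per-k FASTA file names are hypothetical, and the k range is only an example.

```python
# Sketch of the "multi-k" selection: compute the N50 of each single-k assembly
# and keep the five assemblies with the highest N50. The file naming scheme
# ("assembly_k{K}.fa") is an assumption for illustration only.

def contig_lengths(fasta_path):
    """Return the lengths of all sequences in a FASTA file."""
    lengths, current = [], 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def n50(lengths):
    """N50: length L such that contigs of length >= L cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def top_k_assemblies(k_values, n_best=5):
    """Rank the single-k assemblies by N50 and return the best n_best of them."""
    scored = []
    for k in k_values:
        path = f"assembly_k{k}.fa"          # hypothetical per-k assembly file
        scored.append((n50(contig_lengths(path)), k, path))
    scored.sort(reverse=True)
    return [(k, path, value) for value, k, path in scored[:n_best]]

if __name__ == "__main__":
    # example k range (21 to 91, step 10); the five best by N50 are reported
    for k, path, value in top_k_assemblies(range(21, 92, 10)):
        print(f"k={k}  N50={value}  ({path})")
```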

D. Post-processing stage

After the assembly process, the next stage is post-processing. Since the “multi-k” strategy introduces redundancy, the redundant transcripts are removed using CD-HIT-EST and CAP3. Also, to lower the possibility of merging incorrect transcripts, repetitive sequences are identified and removed using the Triticeae Repeat Sequence Database (TREP) [10] and BLAST. Once the post-processing stage is complete, the 21 transcriptome assemblies are evaluated based on the previously used assembly quality metrics. The distribution of the N50 length is shown in Fig. 1. The post-processing steps significantly increase the N50 lengths for Velvet Oases. On the other hand, for the three assemblies generated with Trinity, these additional post-processing steps actually decrease the N50 lengths. SOAPdenovo-Trans does not show a big improvement with the post-processing steps either.

Fig. 1. Distribution of N50 lengths for all steps of the post-processing stage for the assemblies generated for Pipelines K, KS and S with Velvet Oases, SOAPdenovo-Trans and Trinity when the “multi-k” strategy is tested with three different groups of combined k values (|k| = 5, 8, 10).

E. Mapping reads to each assembly

To obtain assembly statistics for the number of raw reads mapped to the resulting transcripts, we use Bowtie2 and Samtools to perform the mapping and examine the resulting .bam files. The percentages of paired-end and single-end reads mapped more than once to the resulting transcripts, as well as the overall alignment rate, are shown in Table I (part A). All three assemblers, datasets and groups of k values show a high alignment rate, which tells us that most of the raw reads are used in the transcriptome assembly. In parallel with the final transcriptome assemblies, we also map the raw reads to the transcripts generated before the post-processing stage. These results are also shown in Table I (part A). From these results, we observe that in most cases the overall alignment rates are either the same or slightly higher for the assemblies obtained before the post-processing stage.

F. Annotation of the final transcriptome assemblies

To test the overall quality of the assembly pipelines and techniques, we align the resulting transcripts against 19,200 sequences from the full-length common cDNA wheat dataset from TriFLDB, with an average read length of 1,652.9 bp [11], using BLASTN. The total number of transcripts, the number of transcripts aligned more than once and the overall alignment rate before and after the post-processing stage are shown in Table I (part B). From these results, we observe that the post-processing steps slightly improve the overall alignment rate for all pipelines.
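As a rough sketch of how the overall alignment rate in Table I (part A) could be recomputed from one of the Bowtie2/Samtools .bam files, the snippet below uses the pysam library (our choice for illustration; the paper itself reports using Bowtie2 and Samtools directly). The .bam file name is hypothetical.

```python
# Rough sketch: overall alignment rate from a BAM file produced by Bowtie2.
# pysam is an assumption for this illustration; "mapped.bam" is a hypothetical name.
import pysam

def overall_alignment_rate(bam_path):
    """Percentage of reads (primary records only) that aligned at least once."""
    total = mapped = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for record in bam:
            # skip secondary/supplementary records so each read is counted once
            if record.is_secondary or record.is_supplementary:
                continue
            total += 1
            if not record.is_unmapped:
                mapped += 1
    return 100.0 * mapped / total if total else 0.0

if __name__ == "__main__":
    print(f"overall alignment rate: {overall_alignment_rate('mapped.bam'):.2f}%")
```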

III. DISCUSSION AND CONCLUSION

In this study, we investigate the impact of a few key parameters for generating transcriptome assemblies. First, we compare the importance of using a digital normalization algorithm (Khmer), an error correction method (Seecer), or a combination of both (Khmer-Seecer) in the pre-processing stage of the assembly pipeline. Next, we use three publicly available de novo transcriptome assemblers with these datasets. Furthermore, we investigate the importance of the different k values that are combined with the “multi-k” strategy, using three groups of k values. The first group combines the assemblies of the 5 k values with the highest N50 lengths. The second group combines assemblies for 8 k values in the range 21 to 91 with an increment of 10. The third group contains 10 k values recommended by a previous study. Finally, we assess the importance of the post-processing stage by examining the alignment rates of the transcripts before and after this stage. Altogether, our final experiments include 21 different assembly pipelines. Nine of them are for Velvet Oases, where the three datasets (Pipeline K, Pipeline KS and Pipeline S) and the three groups of k values are combined with the “multi-k” strategy. Similarly, nine are for SOAPdenovo-Trans with the same datasets and groups of k values. Since Trinity has a fixed k value of 25, three final assemblies are generated with this assembler, one each for Pipeline K, Pipeline KS and Pipeline S. To compare the transcripts generated by the 21 transcriptome assembly pipelines implemented in this paper, assembly quality metrics such as the number of transcripts, the average transcript length and the N50 length are calculated for each pipeline. In general, a better assembly consists of a larger number of transcripts combined with a higher average transcript length. We hope that the provided analysis of various transcriptome assembly pipelines can improve the existing transcriptome assembly pipeline approach and demonstrate useful optimizing strategies for forthcoming de novo transcriptome assemblies of diploid wheat.

A. Comparison of the digital normalization algorithm, the error correction method, and the combination of both

One of the objectives addressed in this paper is to investigate the importance of applying a digital normalization algorithm, an error correction method or a combination of both to the reads before the assembly step. For this purpose, we individually apply Khmer, Seecer and Khmer-Seecer (the combination of both) to the trimmed and filtered raw reads. When Seecer is used with the trimmed and filtered reads, the total number of reads in Pipeline S is reduced by 1,938,921 reads. For all three assemblers and all three groups of k values, the assemblies generated with Pipeline S give the highest alignment rates. When Khmer is used with the trimmed and filtered reads, the total number of reads in Pipeline K is reduced by 77%, to 57,359,540 reads. The same number of reads is obtained with the combined Khmer-Seecer approach. After the post-processing stage, slightly higher N50 lengths and alignment rates are obtained for Pipeline KS than for Pipeline K. The digital normalization algorithm significantly reduces the computational and assembly costs; however, Seecer outperforms it in the number of aligned reads and full-length assemblies.
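Assuming the BLASTN searches against TriFLDB are run with tabular output (-outfmt 6), per-assembly counts such as the “transcripts aligned more than once” column of Table I (part B) could be derived roughly as sketched below; the output file name is hypothetical and the counting is one plausible reading of that metric.

```python
# Rough sketch: summarize BLASTN tabular output (-outfmt 6) for one assembly,
# counting how many query transcripts hit TriFLDB at least once and more than
# once. "transcripts_vs_trifldb.tsv" is a hypothetical file name.
from collections import Counter

def summarize_blast_hits(tabular_path):
    """Count hit lines per query transcript from BLAST tabular (-outfmt 6) output."""
    hits_per_query = Counter()
    with open(tabular_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            query_id = line.split("\t", 1)[0]   # column 1: qseqid
            hits_per_query[query_id] += 1
    aligned = len(hits_per_query)
    aligned_more_than_once = sum(1 for n in hits_per_query.values() if n > 1)
    return aligned, aligned_more_than_once

if __name__ == "__main__":
    aligned, multi = summarize_blast_hits("transcripts_vs_trifldb.tsv")
    print(f"transcripts with >=1 hit: {aligned}, with >1 hit: {multi}")
```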

TABLE I. ASSEMBLY ANNOTATION SUMMARY STATISTICS FOR ALL 21 ASSEMBLY PIPELINES. PART A: ALIGNMENT RATES WHEN THE RAW READS ARE MAPPED TO THE TRANSCRIPTS. PART B: ALIGNMENT RATES WHEN THE TRANSCRIPTS ARE MAPPED AGAINST THE TRIFLDB DATABASE.

Part A: raw reads mapped against the transcripts ("after"/"before" refer to the post-processing stage; all values in %)

                              Paired-end reads          Single-end reads          Overall
                              aligned more than once    aligned more than once    alignment rate
Assembler           Pipeline  after      before         after      before         after      before
Velvet Oases        K-5k      64.32      68.50          4.76       4.87           97.34      97.35
Velvet Oases        K-8k      70.45      73.11          4.91       4.96           97.99      98.00
Velvet Oases        K-10k     71.23      73.25          4.91       4.94           97.95      97.96
Velvet Oases        KS-5k     63.98      68.10          4.78       4.88           97.49      97.50
Velvet Oases        KS-8k     70.11      72.67          4.92       4.97           98.11      98.12
Velvet Oases        KS-10k    70.84      72.89          4.92       4.95           98.10      98.11
Velvet Oases        S-5k      79.51      80.58          1.47       1.49           99.30      99.31
Velvet Oases        S-8k      81.90      82.88          1.49       1.51           99.46      99.46
Velvet Oases        S-10k     81.15      81.91          1.50       1.51           99.50      99.51
Trinity             K         63.47      68.19          4.66       4.90           97.65      97.66
Trinity             KS        63.06      67.90          4.66       4.91           97.77      97.78
Trinity             S         52.89      78.11          1.07       1.48           98.95      98.96
SOAPdenovo-Trans    K-5k      17.80      21.93          2.16       2.51           97.69      97.69
SOAPdenovo-Trans    K-8k      28.45      32.57          2.82       3.15           98.36      98.36
SOAPdenovo-Trans    K-10k     23.51      32.04          2.57       3.13           98.38      98.38
SOAPdenovo-Trans    KS-5k     17.33      21.35          2.15       2.48           97.83      97.83
SOAPdenovo-Trans    KS-8k     27.50      31.68          2.78       3.10           98.46      98.46
SOAPdenovo-Trans    KS-10k    23.01      31.32          2.55       3.09           98.49      98.49
SOAPdenovo-Trans    S-5k      25.91      30.65          0.75       0.82           99.32      99.32
SOAPdenovo-Trans    S-8k      43.76      47.95          0.98       1.05           99.58      99.58
SOAPdenovo-Trans    S-10k     33.39      43.61          0.86       1.01           99.55      99.55

Part B: transcripts mapped against the TriFLDB database ("after"/"before" refer to the post-processing stage; alignment values in %)

                              Total transcripts         Transcripts aligned       Overall
                                                        more than once            alignment rate
Assembler           Pipeline  after      before         after      before         after      before
Velvet Oases        K-5k      210,463    285,535        47.58      39.09          62.55      57.96
Velvet Oases        K-8k      409,262    606,175        47.42      35.59          62.79      56.14
Velvet Oases        K-10k     377,440    790,870        41.45      37.00          57.51      56.29
Velvet Oases        KS-5k     206,450    280,393        46.64      38.54          62.16      57.78
Velvet Oases        KS-8k     399,845    587,113        47.42      35.64          62.77      56.09
Velvet Oases        KS-10k    367,166    765,377        40.90      36.61          57.07      55.91
Velvet Oases        S-5k      356,696    428,340        56.89      46.28          67.85      62.42
Velvet Oases        S-8k      466,964    593,277        54.62      41.83          66.22      59.31
Velvet Oases        S-10k     441,723    867,719        47.83      42.14          61.21      58.97
Trinity             K         407,585    410,383        32.89      20.95          46.39      35.24
Trinity             KS        403,359    576,175        32.76      32.55          46.19      47.37
Trinity             S         291,401    568,944        20.75      32.47          33.66      47.20
SOAPdenovo-Trans    K-5k      108,464    121,305        25.01      23.58          43.48      41.43
SOAPdenovo-Trans    K-8k      130,594    145,822        25.20      22.43          43.52      40.28
SOAPdenovo-Trans    K-10k     130,328    163,532        23.18      22.24          40.30      40.42
SOAPdenovo-Trans    KS-5k     106,096    118,832        24.99      23.56          43.55      41.55
SOAPdenovo-Trans    KS-8k     128,455    144,689        24.86      22.33          43.28      40.37
SOAPdenovo-Trans    KS-10k    128,432    163,184        23.12      22.21          40.22      40.41
SOAPdenovo-Trans    S-5k      111,523    123,813        25.83      24.16          44.39      42.11
SOAPdenovo-Trans    S-8k      135,806    149,507        26.27      23.17          44.84      41.10
SOAPdenovo-Trans    S-10k     134,221    166,237        23.92      22.79          41.05      40.82

B. Comparison of the efficiency of different k values

The k-mer length (k value) affects the accuracy of the overall assembly. Shorter k values are better for less expressed transcripts, while larger k values produce higher coverage [8]. From Fig. 1, we observe that the group of 10 k values gives the worst N50 values. On the other hand, better performance is observed when the group of 5 and the group of 8 k values are used for Velvet Oases (Pipeline K-5k, Pipeline KS-5k, Pipeline S-5k) and SOAPdenovo-Trans (Pipeline K-8k, Pipeline KS-8k, Pipeline S-8k), respectively. Moreover, from Table I, when the raw reads are mapped against the transcripts, the lowest alignment rate occurs when |k| is 5, while the highest is when |k| is 8. On the other hand, when the transcripts are aligned to the TriFLDB database, the lowest alignment rate is observed for |k| of 10, while for both |k| of 5 and 8 the alignment rates are considerably better. Even though we cannot identify a single range of k values that produces the best assembly, we believe that simply using the assemblies for all k values does not improve the assembly quality. Instead, the k values should be chosen based on the highest N50 value and/or additional metrics that need to be further investigated.

C. Comparison of three de novo transcriptome assemblers

To compare the performance of each assembler, we measure the number of assembled transcripts, their average length and their N50 length. Moreover, for each generated assembly, we calculate the number of paired-end and single-end reads mapped more than once when the raw reads are mapped against the transcripts. The overall alignment rate is also calculated when the transcripts are mapped against the TriFLDB database. Throughout the post-processing steps, SOAPdenovo-Trans consistently reports the lowest number of transcripts and the lowest N50 lengths. Trinity performs second best, while Velvet Oases has the highest N50 for all three datasets (Pipeline K, Pipeline KS, Pipeline S) and all three groups of k values (|k| = 5, 8, 10). The percentages of paired-end and single-end reads mapped more than once against the transcripts are shown in Table I (part A). The overall alignment rate when the transcripts are mapped against the TriFLDB database is shown in Table I (part B). From these analyses and the calculated statistics, we conclude that Velvet Oases produces the best transcriptomes.

D. Strategy for developing good de novo transcriptome assembly pipelines

From our analyses, we find the following useful insights for choosing the best strategy for optimizing de novo transcriptome assembly pipelines:

• Both the digital normalization algorithm and the error correction method are useful pre-processing steps.

• The “multi-k” method gives better overall assembly results.

• From the results obtained in this paper, we believe that combining the single-k assemblies with the highest N50 lengths through the “multi-k” strategy leads to a better transcriptome.

• In our experiments, Velvet Oases remains the best de novo transcriptome assembler.

• The post-processing stage improves the overall transcriptome quality.

Utilizing error correction methods, as well as merging assemblies with k values that have good metrics, can improve the overall quality of the transcriptome assembly and provide a stable base for further biological analysis of the transcriptome. Developing a multi-stage assembly pipeline is a crucial part of generating an accurate and meaningful transcriptome assembly. Since choosing the optimal tools and parameters for building quality transcriptome assembly pipelines is a difficult task, the experiments performed as part of this paper provide useful guidelines for choosing the best strategy for optimizing de novo transcriptome assembly pipelines.

ACKNOWLEDGMENT

This work was supported in part by a grant from the UNL Wheat, Wheat Products Research Program and was completed utilizing the Holland Computing Center of the University of Nebraska.

REFERENCES

B "Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data", Bioinformatics.babraham.ac.uk, 2016. [Online]. Available: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. [2] "FASTX-Toolkit", Hannonlab.cshl.edu, 2016. [Online]. Available: http://hannonlab.cshl.edu/fastx_toolkit/. [3] M. MacManes and M. Eisen, "Improving transcriptome assembly through error correction of high-throughput sequence reads", PeerJ, vol. 1, p. e113, 2013. [4] S. Schuster, "Next-generation sequencing transforms today's biology", Nature Methods, vol. 5, no. 1, pp. 16-18, 2007. [5] D. Zerbino and E. Birney, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs", Genome Research, vol. 18, no. 5, pp. 821-829, 2008. [6] M. Grabherr et al., "Full-length transcriptome assembly from RNA-Seq data without a reference genome", Nature Biotechnology, vol. 29, no. 7, pp. 644-652, 2011. [7] X. Li, Y. Kong, Q. Zhao, Y. Li and P. Hao, "De novo assembly of transcriptome from next-generation sequencing data", Quantitative Biology, vol. 4, no. 2, pp. 94-105, 2016. [8] B. Li et al., "Evaluation of de novo transcriptome assemblies from RNA-Seq data", Genome Biol, vol. 15, no. 12, 2014. [9] "Holland Computing Center | University of Nebraska–Lincoln", hcc.unl.edu, 2016. [Online]. Available: http://hcc.unl.edu/. [10] T. Wicker, D. Matthews and B. Keller, "TREP: a database for Triticeae repetitive elements", Trends in Plant Science, vol. 7, no. 12, pp. 561562, 2002. [11] K. Mochida, T. Yoshida, T. Sakurai, Y. Ogihara and K. Shinozaki, "TriFLDB: A Database of Clustered Full-Length Coding Sequences from Triticeae with Applications to Comparative Grass Genomics", PLANT PHYSIOLOGY, vol. 150, no. 3, pp. 1135-1146, 2009.