Bioinformatics workflows are critical for producing quality data sets in NGS-based projects. Beyond the effort of generating the best possible data, another critical step is evaluating it. Assessing the quality of raw reads is essential before the data are used for further analysis. Quality control steps provide the information that later steps depend on and give assurance that the data are of high quality. Sequencing experiments generally face three major quality issues. First, the raw reads output by the sequencer can only be of high quality if the original DNA template is of high quality. Second, quality control steps such as adaptor removal and the trimming and filtering of low-quality bases cannot assume that the raw reads are of high quality. Lastly, a large portion of the raw reads are of low quality and contain many errors, and quality control steps cannot fully correct these reads. CLC Genomics Workbench has provided HiSeq- and MiSeq-specific Quality Control and Sequence Tag Removal tools, but the workflow should still expect short reads, reads with adaptor sequences, and low-quality bases from the sequencer.
A detailed survey of high-quality next-generation sequencing data is presented. The effects on downstream processing, including the trimming and filtering of short reads and low-quality bases, are discussed. Removing short reads and low-quality bases is essential for producing high-quality data. The consequences for effective data production are also discussed.
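The trimming and filtering steps mentioned above can be illustrated with a minimal Python sketch. It is not the CLC Genomics Workbench implementation. It assumes reads are already parsed into (header, sequence, quality) tuples with Phred+33 quality encoding, and the names `trim_3prime` and `quality_filter` and the thresholds `min_q` and `min_len` are hypothetical choices made for illustration. Adaptor removal is not covered.

```python
from typing import Iterable, Iterator, List, Tuple

# A read as (header, sequence, quality string); this layout is an assumption.
Record = Tuple[str, str, str]


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, qual: str, min_q: int = 20,
                offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def quality_filter(records: Iterable[Record], min_q: int = 20,
                   min_len: int = 30, offset: int = 33) -> Iterator[Record]:
    """Trim each read, then discard reads shorter than min_len."""
    for header, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q, offset)
        if len(seq) >= min_len:
            yield header, seq, qual


# Illustrative reads: 'I' encodes Phred 40 and '#' encodes Phred 2.
reads = [
    ("@read1", "ACGTACGT", "IIIII###"),
    ("@read2", "ACGT", "I###"),
]
kept = list(quality_filter(reads, min_q=20, min_len=4))
```

With these illustrative thresholds, the first read loses its three low-quality 3' bases and is kept, while the second is trimmed to a single base and discarded as too short. Suitable values for `min_q` and `min_len` depend on the project and are not prescribed by the text.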
Sequencing data are often received in a variety of file formats. Because each file format requires specific software or algorithms to process, it is important to know which formats your data are in so that you can make the right choices about data processing. In this chapter we present the formats used in the HiSeq and MiSeq Next-Generation Sequencing projects, and provide details about how to process the different file types.
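As a rough illustration of checking which format a file is in before choosing how to process it, the following Python sketch inspects the first few lines of a file. The passage above does not name specific formats, so the choice of FASTA and FASTQ (optionally gzip-compressed) is an assumption, and the helper names `read_head` and `guess_format` are hypothetical. The check is a heuristic, not a validator.

```python
import gzip
from typing import List

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def read_head(path: str, n_lines: int = 4) -> List[str]:
    """Return up to n_lines text lines, transparently handling gzip input."""
    with open(path, "rb") as fh:
        compressed = fh.read(2) == GZIP_MAGIC
    opener = gzip.open if compressed else open
    with opener(path, "rt") as fh:
        return [line.rstrip("\n") for _, line in zip(range(n_lines), fh)]


def guess_format(lines: List[str]) -> str:
    """Guess 'fasta', 'fastq', or 'unknown' from the leading lines.

    Assumed conventions: FASTA headers start with '>'; a FASTQ record has
    an '@' header on its first line and a '+' separator on its third.
    """
    if not lines:
        return "unknown"
    if lines[0].startswith(">"):
        return "fasta"
    if len(lines) >= 3 and lines[0].startswith("@") and lines[2].startswith("+"):
        return "fastq"
    return "unknown"
```

A call such as `guess_format(read_head("sample.fastq.gz"))`, where the file name is a placeholder, would then decide which parser to use. Looking at file content instead of the extension is a deliberate choice here, since extensions can be missing or misleading.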