<kbd id="afajh"><form id="afajh"></form></kbd>
<strong id="afajh"><dl id="afajh"></dl></strong>
    <del id="afajh"><form id="afajh"></form></del>
        1. <th id="afajh"><progress id="afajh"></progress></th>
          <b id="afajh"><abbr id="afajh"></abbr></b>
          <th id="afajh"><progress id="afajh"></progress></th>

          細菌WGS測序數(shù)據(jù):質(zhì)控、過濾、組裝、評估

          共 13074字,需瀏覽 27分鐘

           ·

          2022-10-28 03:56

          使用 trimmomatic 去除fastq數(shù)據(jù)的接頭

          # 創(chuàng)建 Raw Reads目錄mkdir -p $OUTWD/Raw_reads# 拷貝 Raw Readscp -f ~/wgs_prokaryote/fastq/*fastq.gz $OUTWD/Raw_reads/# 創(chuàng)建 Clean Reads 目錄mkdir -p $OUTWD/cleaned_reads# 查看變量和接頭文件echo $TRIMMOMATIC $CPUS $OUTWD $ADAPTERS ${QUALITY} ${MINLEN}head $ADAPTERS# 用 trimmomatic 循環(huán)、批量去除fastq數(shù)據(jù)的接頭for i in `ls $OUTWD/Raw_reads | sed 's/_/\t/g' | cut -f 1 | sort -u | head -n 4`; do  # 用trimmomatic:過濾  # $TRIMMOMATIC PE -threads $CPUS -phred33 $OUTWD/Raw_reads/$i\_1.fastq.gz $OUTWD/Raw_reads/$i\_2.fastq.gz \  #   $OUTWD/cleaned_reads/$i\.ok_1.fastq.gz /dev/null $OUTWD/cleaned_reads/$i\.ok_2.fastq.gz /dev/null \  #   ILLUMINACLIP:$ADAPTERS:1:30:11 LEADING:${QUALITY} TRAILING:${QUALITY} MINLEN:${MINLEN}  # 用trimmomatic:去接頭  $TRIMMOMATIC PE -threads $CPUS -phred33 $OUTWD/Raw_reads/$i\_1.fastq.gz $OUTWD/Raw_reads/$i\_2.fastq.gz \    $OUTWD/cleaned_reads/$i\.noadapt.R1.fastq.gz /dev/null $OUTWD/cleaned_reads/$i\.noadapt.R2.fastq.gz /dev/null \    ILLUMINACLIP:$ADAPTERS:1:30:11done
          使用 prinseq 程序過濾 fastq
          # 將4個樣本名寫入文件ls $OUTWD/Raw_reads | sed 's/_/\t/g' | cut -f 1 | sort -u | head -n 4 > $OUTWD/reads-list.tmp
          # 解壓fastq$PARALLEL -j $CPUS --gnu gzip -d ::: $OUTWD/cleaned_reads/*gz# -d, --decompress decompress
          # 使用 prinseq 程序過濾 fastqecho $PARALLEL $CPUS $OUTWD $PRINSEQ $MINLEN $QUALITY# 默認的過濾閾值(QUALITY=25, MINLEN=125)可能過于嚴格,導(dǎo)致無法生成滿足條件的、合適大小的fastq文件$PARALLEL -j $CPUS -a $OUTWD/reads-list.tmp \ perl $PRINSEQ -verbose \ -fastq $OUTWD/cleaned_reads/{}.noadapt.R1.fastq -fastq2 $OUTWD/cleaned_reads/{}.noadapt.R2.fastq \ -out_good $OUTWD/cleaned_reads/{}.ok -out_format 3 -out_bad null \ -min_len $MINLEN -min_qual_mean $QUALITY \ -trim_qual_right $QUALITY -trim_qual_window 15 -trim_qual_type mean
          prinseq-lite.pl程序的參數(shù):
             -verbose
                     Prints status and info messages during processing.
             ***** FILTER OPTIONS *****
             -min_len <integer>
                     Filter sequence shorter than min_len.
             -min_qual_mean <integer>
                     Filter sequence with quality score mean below min_qual_mean.
             ***** TRIM OPTIONS *****
             -trim_qual_right <integer>
                     Trim sequence by quality score from the 3'-end with this
                     threshold score.
             -trim_qual_window <integer>
                     The sliding window size used to calculate quality score by type.
                     To stop at the first base that fails the rule defined, use a
                     window size of 1. [default: 1]
             -trim_qual_type <string>
                     Type of quality score calculation to use. Allowed options are
                     min, mean, max and sum. [default: min]
             ***** INPUT OPTIONS *****
             -fastq <file>
                     Input file in FASTQ format that contains the sequence and
                     quality data. Use stdin instead of a file name to read from
                     STDIN (-fasta stdin). This can be useful to process compressed
                     files using Unix pipes.
             -fastq2 <file>
                     For paired-end data only. Input file in FASTQ format that
                     contains the sequence and quality data. The sequence identifiers
                     for two matching paired-end sequences in separate files can be
                     marked by /1 and /2, or _L and _R, or _left and _right, or must
                     have the exact same identifier in both input files. The input
                     sequences must be sorted by their sequence identifiers.
                     Singletons are allowed in the input files.
             ***** OUTPUT OPTIONS *****
             -out_format <integer>
                     To change the output format, use one of the following options.
                     If not defined, the output format will be the same as the input
                     format.
                   1 (FASTA only), 2 (FASTA and QUAL), 3 (FASTQ), 4 (FASTQ and
                     FASTA), or 5 (FASTQ, FASTA and QUAL)
             -out_good <string>
                     By default, the output files are created in the same directory
                     as the input file containing the sequence data with an
                     additional "_prinseq_good_XXXX" in their name (where XXXX is
                     replaced by random characters to prevent overwriting previous
                     files). To change the output filename and location, specify the
                     filename using this option. The file extension will be added
                     automatically (either .fasta, .qual, or .fastq). For paired-end
                     data, filenames contain additionally "_1", "_1_singletons",
                     "_2", and "_2_singletons" before the file extension. Use
                     "-out_good null" to prevent the program from generating the
                     output file(s) for data passing all filters. Use "-out_good
                     stdout" to write data passing all filters to STDOUT (only for
                     FASTA or FASTQ output files).
             -out_bad <string>
                     By default, the output files are created in the same directory
                     as the input file containing the sequence data with an
                     additional "_prinseq_bad_XXXX" in their name (where XXXX is
                     replaced by random characters to prevent overwriting previous
                     files). To change the output filename and location, specify the
                     filename using this option. The file extension will be added
                     automatically (either .fasta, .qual, or .fastq). For paired-end
                     data, filenames contain additionally "_1" and "_2" before the
                     file extension. Use "-out_bad null" to prevent the program from
                     generating the output file(s) for data not passing any filter.
                     Use "-out_bad stdout" to write data not passing any filter to
                     STDOUT (only for FASTA or FASTQ output files).
                   Example: use "file_filtered" to generate the output file
                     file_filtered.fasta in the current directory
                   Example: "-out_good stdout -out_bad null" will write data
                     passing filters to STDOUT and data not passing any filter will

                     be ignored

          程序運行時的提示
           If you use programs that use GNU Parallel to process data for an article in a
           scientific publication, please cite:
             Tange, O. (2020, July 22). GNU Parallel 20200722 ('Privacy Shield').
             Zenodo. https://doi.org/10.5281/zenodo.3956817
           Estimate size of input data for status report (this might take a while for large files)
                   done
           Parse and process input data
                   done          
           Clean up empty files
                   done
           Input and filter stats:
                   Input sequences (file 1): 1,445,931
                   Input bases (file 1): 167,706,607
                   Input mean length (file 1): 115.99
                   Input sequences (file 2): 1,445,931
                   Input bases (file 2): 156,139,854
                   Input mean length (file 2): 107.99
                   Good sequences (pairs): 1,445,301
                   Good bases (pairs): 323,719,542
                   Good mean length (pairs): 223.98
                   Good sequences (singletons file 1): 544 (0.04%)
                   Good bases (singletons file 1): 63,066
                   Good mean length (singletons file 1): 115.93
                   Good sequences (singletons file 2): 20 (0.00%)
                   Good bases (singletons file 2): 2,160
                   Good mean length (singletons file 2): 108.00
                   Bad sequences (file 1): 86 (0.01%)
                   Bad bases (file 1): 2,813
                   Bad mean length (file 1): 32.71
                   Bad sequences (file 2): 544 (0.04%)
                   Bad bases (file 2): 57,217
                   Bad mean length (file 2): 105.18
                   Sequences filtered by specified parameters:
                   trim_qual_right: 18
                   min_len: 549

                   min_qual_mean: 129

          # 刪除中間文件,壓縮結(jié)果文件$PARALLEL -j $CPUS --gnu rm -f ::: $OUTWD/cleaned_reads/*singletons*$PARALLEL -j $CPUS --gnu rm -rf ::: $OUTWD/cleaned_reads/*noadapt*$PARALLEL -j $CPUS --gnu gzip ::: $OUTWD/cleaned_reads/*fastq
          # Quality filtering statsfor i in `ls $OUTWD/Raw_reads | sed 's/_/\t/g' | cut -f 1 | sort -u | head -n 4`; do zcat $OUTWD/cleaned_reads/$i\.ok_*.fastq.gz | awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total\tavg\n%d\t%f\n",n,m,sq/n-m*m);}' \ > $OUTWD/cleaned_reads/$i\.ok.stats sed -i 's/\..*//' $OUTWD/cleaned_reads/$i\.ok.statsdone
          # Number of reads surviving to log filefor i in `ls $OUTWD/Raw_reads | sed 's/_/\t/g' | cut -f 1 | sort -u | head -n 4`; do echo -e "Number of reads surviving for sample ${i}: \c" cut -f 1 $OUTWD/cleaned_reads/${i}.ok.stats | tail -n 1done

          Number of reads surviving for sample ERR025837:  3954038

          Number of reads surviving for sample ERR025838:  2890602

          Number of reads surviving for sample ERR038741: 12890710

          Number of reads surviving for sample SRR057610: 13663540

          使用 fastqc 程序進一步檢查 fastq 的數(shù)據(jù)質(zhì)量
          ## Chechking quality with FASTQCfor i in 1 2; do  $PARALLEL -j $CPUS -a $OUTWD/reads-list.tmp $FASTQC $OUTWD/cleaned_reads/{}.ok_${i}.fastq.gz -t 1 -o $OUTWD/cleaned_reads/donerm $OUTWD/cleaned_reads/*zip

          組 裝

          Software for genome assembly chosen: spades (耗時、耗內(nèi)存)

          mkdir -p $OUTWD/assemblymkdir -p $OUTWD/genomes
          echo $CPUS $SPADES
          for i in `ls $OUTWD/Raw_reads | sed 's/_/\t/g' | cut -f 1 | sort -u | head -n 4`; do python $SPADES --careful -1 $OUTWD/cleaned_reads/$i\.ok_1.fastq.gz -2 $OUTWD/cleaned_reads/$i\.ok_2.fastq.gz \ -o $OUTWD/assembly/$i\_assembly -t $CPUSdone
          # Software for genome assembly chosen: megahit
          # echo $MEGAHIT $CPUS# for i in `ls $OUTWD/Raw_reads | sed 's/_/\t/g' | cut -f 1 | sort -u | head -n 4`; do# $MEGAHIT -1 $OUTWD/cleaned_reads/$i\.ok_1.fastq.gz -2 $OUTWD/cleaned_reads/$i\.ok_2.fastq.gz \# -o $OUTWD/assembly/$i\_assembly -t $CPUS# done
          # 對組裝后的基因組重命名: final.contigs.fa --> contigs.fastafor i in `ls $OUTWD/Raw_reads | sed 's/_/\t/g' | cut -f 1 | sort -u | head -n 4`; do mv $OUTWD/assembly/$i\_assembly/final.contigs.fa $OUTWD/assembly/$i\_assembly/contigs.fastadone
          使用 Prinseq 過濾組裝后的基因組,并對后者進行質(zhì)量控制
          # 啟動 Prinseq$PARALLEL -j $CPUS -a $OUTWD/reads-list.tmp perl $PRINSEQ -fasta $OUTWD/assembly/{}\_assembly/contigs.fasta \  -min_len $MINCONTIGLEN -out_good $OUTWD/genomes/{} -out_bad null# 壓縮、刪除之前的組裝結(jié)果# parallel -j $CPUS --gnu rm -rf ::: $OUTWD/genomes/1/$PARALLEL -j $CPUS -a $OUTWD/reads-list.tmp --gnu tar cfz $OUTWD/assembly/{}\_assembly.tgz -C $OUTWD/assembly/{}\_assembly$PARALLEL -j $CPUS -a $OUTWD/reads-list.tmp --gnu rm -rf $OUTWD/assembly/{}\_assembly
          使用 QUAST 評估、統(tǒng)計組裝結(jié)果
            1. Quality assessment for genome assemblies
            2. Count the number of contigs in genome
          mkdir -p $OUTWD/genome_stats
          for i in `ls $OUTWD/Raw_reads | sed 's/_/\t/g' | cut -f 1 | sort -u | head -n 4`; do$QUAST $OUTWD/genomes/${i}.fasta -o $OUTWD/genome_stats/${i}_genome_stats -t $CPUS \ --min-contig $MINCONTIGLEN --no-icarus --silent --no-svdone

          往期精品(點擊圖片直達文字對應(yīng)教程)

          機器學(xué)習(xí)

          后臺回復(fù)“生信寶典福利第一波”或點擊閱讀原文獲取教程合集


          瀏覽 12
          點贊
          評論
          收藏
          分享

          手機掃一掃分享

          分享
          舉報
          評論
          圖片
          表情
          推薦
          點贊
          評論
          收藏
          分享

          手機掃一掃分享

          分享
          舉報
          <kbd id="afajh"><form id="afajh"></form></kbd>
          <strong id="afajh"><dl id="afajh"></dl></strong>
            <del id="afajh"><form id="afajh"></form></del>
                1. <th id="afajh"><progress id="afajh"></progress></th>
                  <b id="afajh"><abbr id="afajh"></abbr></b>
                  <th id="afajh"><progress id="afajh"></progress></th>
                  好看的一区二区 | 亚洲搞逼 | 在线观看黄色片 | 精品无码第一页在线观看 | 久久精品国产亚洲7777 |