這個(gè)處理不同基因組區(qū)域關(guān)系的工具集很不錯(cuò)!
Bedtools是處理基因組信息分析的強(qiáng)大工具集合,本文列出自己學(xué)習(xí)其官方文檔的幾個(gè)點(diǎn),對(duì)后面計(jì)算不同樣品peak相似性的腳本做了下更新和調(diào)整,使用起來(lái)更為簡(jiǎn)單方便。
內(nèi)容摘要
區(qū)域注釋?zhuān)鏿eak注釋?zhuān)琾eak分布分析,peak與調(diào)控元件交集等。
區(qū)域合并,如求算多樣品peak合集,或合并重疊區(qū)域
區(qū)域互補(bǔ),如得到非基因區(qū)
利用比對(duì)結(jié)果對(duì)測(cè)序廣度和深度評(píng)估
多樣品peak相似性計(jì)算,評(píng)估ChIP類(lèi)區(qū)域結(jié)果的樣品相似性。
bedtools主要功能
bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage: bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
求區(qū)域之間的交集,可以用來(lái)注釋peak,計(jì)算reads比對(duì)到的基因組區(qū)域
不同樣品的peak之間的peak重疊情況。
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
尋找最近但可能不重疊的區(qū)域
coverage Compute the coverage over defined intervals.
計(jì)算區(qū)域覆蓋度
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
合并重疊或相接的區(qū)域
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
獲得互補(bǔ)區(qū)域
subtract Remove intervals based on overlaps b/w two files.
計(jì)算區(qū)域差集
slop Adjust the size of intervals.
調(diào)整區(qū)域大小,如獲得轉(zhuǎn)錄起始位點(diǎn)上下游3 K的區(qū)域
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
排序,部分命令需要排序過(guò)的bed文件
random Generate random intervals in a genome.
獲得隨機(jī)區(qū)域,作為背景集
shuffle Randomly redistrubute intervals in a genome.
根據(jù)給定的bed文件獲得隨機(jī)區(qū)域,作為背景集
sample Sample random records from file using reservoir sampling.
spacing Report the gap lengths between intervals in a file.
annotate Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
提取給定位置的FASTA序列
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
計(jì)算數(shù)據(jù)集相似性
reldist Calculate the distribution of relative distances b/w two files.
fisher Calculate Fisher statistic b/w two feature files.
[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
用于生成一個(gè)腳本,批量捕獲IGV截圖
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
把給定區(qū)域劃分成指定大小和間隔的小區(qū)間 (bin)
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
分組結(jié)算,不只可以用于bed文件。
expand Replicate lines based on lists of values in columns.
split Split a file into multiple files with equal records or base pairs.安裝bedtools
ct@ehbio:~$ conda install bedtools獲得測(cè)試數(shù)據(jù)集(http://quinlanlab.org/tutorials/bedtools/bedtools.html)
ct@ehbio:~$ mkdir bedtools
ct@ehbio:~$ cd bedtools
ct@ehbio:~$ url=https://s3.amazonaws.com/bedtools-tutorials/web
ct@ehbio:~/bedtools$ curl -O ${url}/maurano.dnaseI.tgz
ct@ehbio:~/bedtools$ curl -O ${url}/cpg.bed
ct@ehbio:~/bedtools$ curl -O ${url}/exons.bed
ct@ehbio:~/bedtools$ curl -O ${url}/gwas.bed
ct@ehbio:~/bedtools$ curl -O ${url}/genome.txt
ct@ehbio:~/bedtools$ curl -O ${url}/hesc.chromHmm.bed交集 (intersect)

查看輸入文件,bed格式,至少三列,分別是染色體,起始位置(0-based,
包括),終止位置
(1-based,不包括)。第四列一般為區(qū)域名字,第五列一般為空,第六列為鏈的信息。更詳細(xì)解釋見(jiàn)http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1。
自己做研究CpG島信息可以從UCSC的Table Browser獲得,具體操作見(jiàn)http://blog.genesino.com/2013/05/ucsc-usages/。
ct@ehbio:~/bedtools$ head -n 3 cpg.bed exons.bed
==> cpg.bed <==
chr1 28735 29810 CpG:_116
chr1 135124 135563 CpG:_30
chr1 327790 328229 CpG:_29
==> exons.bed <==
chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f 0 +
chr1 12612 12721 NR_046018_exon_1_0_chr1_12613_f 0 +
chr1 13220 14409 NR_046018_exon_2_0_chr1_13221_f 0 +獲得重疊區(qū)域(既是外顯子,又是CpG島的區(qū)域)
ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed | head -5
chr1 29320 29370 CpG:_116
chr1 135124 135563 CpG:_30
chr1 327790 328229 CpG:_29
chr1 327790 328229 CpG:_29
chr1 327790 328229 CpG:_29輸出重疊區(qū)域?qū)?yīng)的原始區(qū)域(與外顯子存在交集的CpG島)
ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb > | head -5chr1 28735 29810 CpG:_116 chr1 29320 29370
NR_024540_exon_10_0_chr1_29321_r 0 -
chr1 135124 135563 CpG:_30 chr1 134772 139696
NR_039983_exon_0_0_chr1_134773_r 0 -
chr1 327790 328229 CpG:_29 chr1 324438 328581
NR_028322_exon_2_0_chr1_324439_f 0 +
chr1 327790 328229 CpG:_29 chr1 324438 328581
NR_028325_exon_2_0_chr1_324439_f 0 +
chr1 327790 328229 CpG:_29 chr1 327035 328581
NR_028327_exon_3_0_chr1_327036_f 0 +
計(jì)算重疊堿基數(shù)
ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10chr1 28735 29810 CpG:_116 chr1 29320 29370
NR_024540_exon_10_0_chr1_29321_r 0 - 50
chr1 135124 135563 CpG:_30 chr1 134772 139696
NR_039983_exon_0_0_chr1_134773_r 0 - 439
chr1 327790 328229 CpG:_29 chr1 324438 328581
NR_028322_exon_2_0_chr1_324439_f 0 + 439
chr1 327790 328229 CpG:_29 chr1 324438 328581
NR_028325_exon_2_0_chr1_324439_f 0 + 439
chr1 327790 328229 CpG:_29 chr1 327035 328581
NR_028327_exon_3_0_chr1_327036_f 0 + 439
chr1 713984 714547 CpG:_60 chr1 713663 714068
NR_033908_exon_6_0_chr1_713664_r 0 - 84
chr1 762416 763445 CpG:_115 chr1 761585 762902
NR_024321_exon_0_0_chr1_761586_r 0 - 486
chr1 762416 763445 CpG:_115 chr1 762970 763155
NR_015368_exon_0_0_chr1_762971_f 0 + 185
chr1 762416 763445 CpG:_115 chr1 762970 763155
NR_047519_exon_0_0_chr1_762971_f 0 + 185
chr1 762416 763445 CpG:_115 chr1 762970 763155
NR_047520_exon_0_0_chr1_762971_f 0 + 185
計(jì)算第一個(gè)(-a)bed區(qū)域有多少個(gè)重疊的第二個(gè)(-b)bed文件中有多少個(gè)區(qū)域
ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -c | head
chr1 28735 29810 CpG:_116 1
chr1 135124 135563 CpG:_30 1
chr1 327790 328229 CpG:_29 3
chr1 437151 438164 CpG:_84 0
chr1 533219 534114 CpG:_94 0
chr1 544738 546649 CpG:_171 0
chr1 713984 714547 CpG:_60 1
chr1 762416 763445 CpG:_115 10
chr1 788863 789211 CpG:_28 9另外還有-v取出不重疊的區(qū)域,
-f限定重疊最小比例,-sorted可以對(duì)按sort -k1,1 -k2,2n排序好的文件加速操作。
同時(shí)對(duì)多個(gè)區(qū)域求交集 (可以用于peak的多維注釋)
# -names標(biāo)注注釋來(lái)源
# -sorted: 如果使用了這個(gè)參數(shù),提供的一定是排序好的bed文件
ct@ehbio:~/bedtools$ bedtools intersect -a exons.bed \
-b cpg.bed gwas.bed hesc.chromHmm.bed -sorted -wa -wb -names cpg gwas chromhmm \
| head -10000 | tail -10chr1 27632676 27635124
NM_001276252_exon_15_0_chr1_27632677_chromhmm chr1 27633213
27635013 5_Strong_Enhancer
chr1 27632676 27635124
NM_001276252_exon_15_0_chr1_27632677_chromhmm chr1 27635013
27635413 7_Weak_Enhancer
chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f
chromhmm chr1 27632613 27632813 6_Weak_Enhancer
chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f
chromhmm chr1 27632813 27633213 7_Weak_Enhancer
chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f
chromhmm chr1 27633213 27635013 5_Strong_Enhancer
chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f
chromhmm chr1 27635013 27635413 7_Weak_Enhancer
chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f cpg
chr1 27648453 27649006 CpG:_63
chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f
chromhmm chr1 27648613 27649413 1_Active_Promoter
chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f cpg
chr1 27648453 27649006 CpG:_63
chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f
chromhmm chr1 27648613 27649413 1_Active_Promoter
合并區(qū)域

bedtools merge輸入的是按sort -k1,1 -k2,2n排序好的bed文件。
只需要輸入一個(gè)排序好的bed文件,默認(rèn)合并重疊或鄰接區(qū)域。
ct@ehbio:~/bedtools$ bedtools merge -i exons.bed | head -n 5
chr1 11873 12227
chr1 12612 12721
chr1 13220 14829
chr1 14969 15038
chr1 15795 15947合并區(qū)域并輸出此合并后區(qū)域是由幾個(gè)區(qū)域合并來(lái)的
ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -c 1 -o count | head -n 5
chr1 11873 12227 1
chr1 12612 12721 1
chr1 13220 14829 2
chr1 14969 15038 1
chr1 15795 15947 1合并相距90 nt內(nèi)的區(qū)域,并輸出是由哪些區(qū)域合并來(lái)的
# -c: 指定對(duì)哪些列進(jìn)行操作
# -o: 與-c對(duì)應(yīng),表示對(duì)指定列進(jìn)行哪些操作
# 這里的用法是對(duì)第一列做計(jì)數(shù)操作,輸出這個(gè)區(qū)域是由幾個(gè)區(qū)域合并來(lái)的
# 對(duì)第4列做收集操作,記錄合并的區(qū)域的名字,并逗號(hào)分隔顯示出來(lái)
ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -d 340 -c 1,4 -o count,collapse | head -4
chr1 11873 12227 1 NR_046018_exon_0_0_chr1_11874_f
chr1 12612 12721 1 NR_046018_exon_1_0_chr1_12613_f
chr1 13220 15038 3 NR_046018_exon_2_0_chr1_13221_f,NR_024540_exon_0_0_chr1_14362_r,NR_024540_exon_1_0_chr1_14970_r
chr1 15795 15947 1 NR_024540_exon_2_0_chr1_15796_r計(jì)算互補(bǔ)區(qū)域
給定一個(gè)全集,再給定一個(gè)子集,求另一個(gè)子集。比如給定每條染色體長(zhǎng)度和外顯子區(qū)域,求非外顯子區(qū)域。給定基因區(qū),求非基因區(qū)。給定重復(fù)序列,求非重復(fù)序列等。
重復(fù)序列區(qū)域的獲取也可以用下面提供的鏈接:http://blog.genesino.com/2013/05/ucsc-usages/。
ct@ehbio:~/bedtools$ head genome.txt
chr1 249250621
chr10 135534747
chr11 135006516
chr11_gl000202_random 40103
chr12 133851895
chr13 115169878
chr14 107349540
chr15 102531392
ct@ehbio:~/bedtools$ bedtools complement -i exons.bed -g genome.txt | head -n 5
chr1 0 11873
chr1 12227 12612
chr1 12721 13220
chr1 14829 14969
chr1 15038 15795基因組覆蓋廣度和深度
計(jì)算基因組某個(gè)區(qū)域是否被覆蓋,覆蓋深度多少。有下圖多種輸出格式,也支持RNA-seq數(shù)據(jù),計(jì)算junction-reads覆蓋。

genome.txt里面的內(nèi)容就是染色體及對(duì)應(yīng)的長(zhǎng)度。
# 對(duì)單行FASTA,可如此計(jì)算
# 如果是多行FASTA,則需要累加
ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="\t"}{\
if($0~/>/) {seq_name=$0;sub(">","",seq_name);} \
else {print seq_name,length;} }' ../bio/genome.fa | tee ../bio/genome.txt
chr1 60001
chr2 54001
chr3 54001
chr4 60001
ct@ehbio:~/bedtools$ bedtools genomecov -ibam ../bio/map.sortP.bam -bga \
-g ../bio/genome.txt | head
# 這個(gè)warning很有意思,因?yàn)锽AM中已經(jīng)有這個(gè)信息了,就不需要提供了
*****
*****WARNING: Genome (-g) files are ignored when BAM input is provided.
*****
# bedgraph文件,前3列與bed相同,最后一列表示前3列指定的區(qū)域的覆蓋度。
chr1 0 11 0
chr1 11 17 1
chr1 17 20 2
chr1 20 31 3
chr1 31 36 4
chr1 36 43 6
chr1 43 44 7
chr1 44 46 8
chr1 46 48 9
chr1 48 54 10兩個(gè)思考題:
怎么計(jì)算有多少基因組區(qū)域被測(cè)到了?
怎么計(jì)算平均測(cè)序深度是多少?
數(shù)據(jù)集相似性
bedtools jaccard計(jì)算的是給定的兩個(gè)bed文件之間交集區(qū)域(intersection)占總區(qū)域(union-intersection)的比例(jaccard)和交集的數(shù)目(n_intersections)。
ct@ehbio:~/bedtools$ bedtools jaccard \
-a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
-b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed
intersection union-intersection jaccard n_intersections
81269248 160493950 0.50637 130852小思考:1. 如何用bedtools其它工具算出這個(gè)結(jié)果?2. 如果需要比較的文件很多,怎么充分利用計(jì)算資源?
一個(gè)辦法是使用for循環(huán),
雙層嵌套。這種用法也很常見(jiàn),不管是單層還是雙層for循環(huán),都有利于簡(jiǎn)化重復(fù)運(yùn)算。
ct@ehbio:~/bedtools$ for i in *.merge.bed; do \
for j in *.merge.bed; do \
bedtools jaccard -a $i -b $j | cut -f3 | tail -n +2 | sed "s/^/$i\t$j\t/"; \
done; done >total.similarity另一個(gè)辦法是用parallel,不只可以批量,更可以并行。
root@ehbio:~# yum install parallel.noarch
# parallel 后面雙引號(hào)("")內(nèi)的內(nèi)容為希望用parallel執(zhí)行的命令,
# 整體寫(xiě)法與Linux下命令寫(xiě)法一致。
# 雙引號(hào)后面的 三個(gè)相鄰冒號(hào) (:::)默認(rèn)用來(lái)傳遞參數(shù)的,可多個(gè)連寫(xiě)。
# 每個(gè)三冒號(hào)后面的參數(shù)會(huì)被循環(huán)調(diào)用,而在命令中的引用則是根據(jù)其出現(xiàn)的位置,分別用{1}, {2}
# 表示第一個(gè)三冒號(hào)后的參數(shù),第二個(gè)三冒號(hào)后的參數(shù)。
#
# 這個(gè)命令可以替換原文檔里面的整合和替換, 相比于原文命令生成多個(gè)文件,這里對(duì)每個(gè)輸出結(jié)果
# 先進(jìn)行了比對(duì)信息的增加,最后結(jié)果可以輸入一個(gè)文件中。
#
ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 \
| sed 's/^/{1}\t{2}\t/'" ::: `ls *.merge.bed` ::: `ls *.merge.bed` >totalSimilarity.2
# 上面的命令也有個(gè)小隱患,并行計(jì)算時(shí)的輸出沖突問(wèn)題,可以修改為輸出到單個(gè)文件,再cat到一起
ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 \
| sed 's/^/{1}\t{2}\t/' >{1}.{2}.totalSimilarity_tmp" ::: `ls *.merge.bed` ::: `ls *.merge.bed`
ct@ehbio:~/bedtools$ cat *.totalSimilarity_tmp >totalSimilarity.2
# 替換掉無(wú)關(guān)信息
ct@ehbio:~/bedtools$ sed -i -e 's/.hotspot.twopass.fdr0.05.merge.bed//' \
-e 's/.hg19//' totalSimilarity.2 totalSimilarity.2數(shù)據(jù)表格式如下 (數(shù)據(jù)是假的):
fMusle_leg-DS19115 fMusle_back-DS18454 0.55
fMusle_leg-DS19115 fHeart-DS15643 0.4
fHeart-DS15643 fHeart-DS16621 0.8
fHeart-DS15643 fHeart-DS15839 0.7
fHeart-DS16621 fHeart-DS15839 0.7得到結(jié)果后,就可以繪制相關(guān)性熱圖或PCA了。

也可以使用下面的命令轉(zhuǎn)換成Wide format矩陣,用高顏值可定制在線繪圖工具-第三版繪制。
# 這里面b和c可以用一個(gè),因?yàn)槭且粋€(gè)對(duì)稱(chēng)陣
# 如果2和3列內(nèi)容不同,此腳本也可用
ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="\t"}{a[$1, $2]=$3; b[$1]=1; c[$2]=1;}END\
{printf("ID"); for(i in c) printf("\t%s", i); \
for (i in b) {printf("%s", i); for(j in c) {printf("\t%s", a[i, j]);} print "";}}原文檔(http://quinlanlab.org/tutorials/bedtools/bedtools.html)的命令,稍微有些復(fù)雜,利于學(xué)習(xí)不同命令的組合。使用時(shí)推薦使用上面的命令。
ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} \
| awk 'NR>1' | cut -f 3 \
> {1}.{2}.jaccard" \
::: `ls *.merge.bed` ::: `ls *.merge.bed`This command will create a single file containing the pairwise Jaccard
measurements from all 400 tests.
find . \
| grep jaccard \
| xargs grep "" \
| sed -e s"/\.\///" \
| perl -pi -e "s/.bed./.bed\t/" \
| perl -pi -e "s/.jaccard:/\t/" \
> pairwise.dnase.txtA bit of cleanup to use more intelligible names for each of the samples.
cat pairwise.dnase.txt \
| sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \
| sed -e 's/.hg19//g' \
> pairwise.dnase.shortnames.txtNow let’s make a 20x20 matrix of the Jaccard statistic. This will allow
the data to play nicely with R.
awk 'NF==3' pairwise.dnase.shortnames.txt \
| awk '$1 ~ /^f/ && $2 ~ /^f/' \
| python make-matrix.py \
> dnase.shortnames.distance.matrix往期精品(點(diǎn)擊圖片直達(dá)文字對(duì)應(yīng)教程)
后臺(tái)回復(fù)“生信寶典福利第一波”或點(diǎn)擊閱讀原文獲取教程合集




























