In order to integrate a new feature of on/off target levels of coverage in NGSphy, I needed to understand coverage variation that one could find in a capture experiment. Goals are basically:
Calypte_anna.gene.CDS.gff
and Chaetura_pelagica.CDS.gff
)Genome sizes
Anna | Swift |
---|---|
1,114,626,264 | 1,125,117,173 |
48birds_ortholog.list.chi.anna.cpe.hum.finch
).Data used comes from Rute’s Hummingbird project. Most of the information reported here about the samples and targets was extracted from the paper.
BAM
files corresponding to the mappings of the targeted-sequencing to 2 references.
map2Anna
).
map2Swift
).
Need more information on missing data id + removal
Workspace triploid
(UVigo):
triploid.uvigo.es
/home/merly/research/cph-visit/coverage-analysis
Workspace randy
(KU):
randy.popgen.dk
/home/merly/anna
/home/merly/swift
Targeted regions are exons, retrieved a posteriori, due to the fact that the original probes for this project were lost during a flood.
Ingroup species reference
: Original target file given was a GFF file: Calypte_anna.gene.CDS.2750.gff
[1.6MB]. This contains the exons from the 2,750 This file was converted into a BED file, keeping only chromosome (scaffold), start and end position of the targets.
Outgroup reference:
: For this one, I received 2 files. One that had the known one-to-one ortholog relation between the genes from Anna and Swift (48birds_ortholog.list.chi.anna.cpe.hum.finch
[656KB]). The second, is a GFF file, which contains the exons from Swift (Chaetura_pelagica.CDS.gff
[8.5MB]). I had to get the targeted regions from the GFF files, taking into account that I needed the genes matching to Anna, and also I needed those genes that were actually used in the Anna’s targets (Calypte_anna.gene.CDS.2750.gff
):
48birds_ortholog.list.chi.anna.cpe.hum.finch
.Chaetura_pelagica.CDS.gff
where codon regions (CDS
).Calypte_anna.gene.CDS.2750.gff
so I’ll have the same covered genes.Chaetura_pelagica.CDS.gff
and keep those targets from genes matching both Anna and Swift.Finally, datasets will be filter to match gene number.
birds48filename=paste0(WD,"48birds_ortholog.list.chi.anna.cpe.hum.finch")
swiftgff=paste0(WD,"Chaetura_pelagica.CDS.gff")
annagff=paste0(WD,"Calypte_anna.gene.CDS.2750.gff")
birds=read.table(birds48filename, header=T,
colClasses=c("character","numeric","character","character",
"character","character","character"))
swift=read.table(swiftgff,
colClasses=c("character","character","character","numeric",
"numeric","character","character","character","character"))
anna=read.table(annagff,
colClasses=c("character","character","character","numeric",
"numeric","character","character","character","character"))
birds.2=birds[birds$anna!="-",]
birds.3=birds.2[birds.2$swift!="-",]
birds.4=unique(birds.3)
birds.5=birds.4[,c("anna","swift")]
birds.5$swift=paste0("Parent=",birds.5$swift,";")
birds.5$anna=paste0("Parent=Aan_", birds.5$anna,";")
annaGenes=unique(anna$V9[birds.5$anna %in% anna$V9])
birds.6=birds.5[birds.5$anna %in% annaGenes, ]
swiftGenes=birds.6$swift
swift.2=swift[swift$V3=="CDS",]
swift.3=swift.2[swift.2$V9 %in% birds.6$swift,]
swift.3$size=swift.3$V5-swift.3$V4
swift.4=swift.3[swift.3$size>0,]
anna.2=anna[(anna$V5-anna$V4)>0,]
swift.4=swift.4[,1:(ncol(swift.4)-1)]
anna.3=anna.2[anna.2$V9 %in% unique(birds.6$anna),]
anna.3$targetname=paste0( anna.3$V1,
rep("-", nrow(anna.3)),
anna.3$V4,
rep("-", nrow(anna.3)),
anna.3$V5)
swift.4$targetname=paste0( swift.4$V1,
rep("-", nrow(swift.4)),
swift.4$V4,
rep("-", nrow(swift.4)),
swift.4$V5)
This is a quantitative summary description of the resulting target files. The difference between number of genes, and subsequently in number of targets and covered genome, in both Anna and Swift, is not precisely alarming. The experiment was done using known ortholog genes, which not necessarily match in number of exons (targets) or their size. There are two (2
) Anna datasets, the original and the reduced (1,469 genes) and the Swift dataset.
Number of targets | Size of targeted genome (bp) | Number of genes | Total of scaffolds | |
---|---|---|---|---|
Anna - Original | 24,044 | 3,934,867 | 2,750 | 389 |
Anna - Reduced | 14,740 | 2,285,746 | 1,469 | 314 |
Swift | 14,764 | 2,279,626 | 1,469 | 372 |
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
---|---|---|---|---|---|---|
Anna - Original | 104 | 724 | 1,136 | 1,431 | 1,800 | 13,150 |
Anna - Reduced | 110 | 842 | 1,285 | 1,556 | 1,953 | 13,150 |
Swift | 110 | 843 | 1,281 | 1,552 | 1,943 | 13,000 |
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
---|---|---|---|---|---|---|
Anna - Original | 1 | 90 | 125 | 163.7 | 170 | 5,246 |
Anna - Reduced | 1 | 90 | 125 | 163.2 | 169 | 5,246 |
Swift | 1 | 89 | 123 | 154.4 | 166 | 5,240 |
‘> 5kb’ fragments? ok, “idea was to capture whole genes” but “targets” are CONTIGUOUS regions, right? pretty much like our loci, right? Were probes designed from fragments > 5kb? hard to believe…
Target datasets used from now on will contain the same number of genes (1,469).
I obtained the coverage of the targeted regions using bedtools
[9] (v. 2.22.0), module coverage
. With this, and the option -hist
I can report a histogram of coverage for each feature in A as well as a summary histogram for _all_
features in A. Output (tab delimited) after each feature in A (from bedtools
documentation))
#
bases at depthfor bamfile in $(find $HOME/anna/originals -name "*.bam"); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup bedtools coverage -hist -abam $bamfile -b "Calypte_anna.gene.CDS.2750.3.gff" | gzip > $HOME/anna/bedtools/cov/${tag}.cov.gz &
done
for bamfile in $(find $HOME/swift/originals -name "*.bam"); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup bedtools coverage -hist -abam $bamfile -b "Chaetura_pelagica.CDS.2.gff" | gzip > $HOME/swift/bedtools/cov/${tag}.cov.gz &
done
And so, I filtered the output to keep the coverage per region and the summary histogram separated.
# (i.e)
for tag in $(cat $HOME/anna/files/samples.txt ); do
nohup zcat $HOME/anna/bedtools/cov/H09.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist/$tag.nohist.gz &
nohup zcat $HOME/anna/bedtools/cov/H09.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist/$tag.hist.gz &
done
for tag in $(cat $HOME/swift/files/samples.txt | tail -n+2 | head -46 ); do
nohup zcat $HOME/swift/bedtools/cov/H09.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist/$tag.nohist.gz &
nohup zcat $HOME/swift/bedtools/cov/H09.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist/$tag.hist.gz &
done
With the information extracted from the bedtools coverage -hist
, we can see the relation between the breadth of coverage obtained and the depth per sample. This gives the fraction of the captured genome against the coverage.
Map2 - target | Coverage per target overview |
---|---|
map2Anna | |
map2Swift |
(Click image to enlarge)
Zoom: 0x-100x.
map2Anna | map2Swift |
---|---|
(Click image to enlarge)
On the depth vs breath… the zoom 0x-100x… is it OK? Is the scale the same? It does not look like…
Outliers presented in this report were calculated using the following:
qnt <- quantile(data, probs=c(.25, .75))
H <- 1.5 * IQR(data)
y <- data
y[data < (qnt[1] - H)] <- "Outlier"
y[data > (qnt[2] + H)] <- "Outlier"
From the bedtools coverage
output I was able to extract the depth of coverage per target, gene and sample.
I generated 2 matrices:
Each cell of the matrix correspond to the sum of the number of times each base was covered within the target or the gene. Then, coverage was calculated as the mean value of all the depth of coverage of all the samples for a specific target/gene divided by the size of the corresponding target/gene.
For the depth of coverage for the samples, I summed the coverage of all targets and divided it by the total amount of bases that were targeted.
Everything is just average depth of coverage (individual/gene and target)
Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | Var. | Std. Dev | |
---|---|---|---|---|---|---|---|---|
map2Anna | 2.115 | 47.76 | 77.73 | 101.3 | 115.8 | 310.2 | 6,616.195 | 81.34 |
map2Swift | 1.715 | 38.27 | 62.29 | 83.8 | 98.07 | 265.6 | 4,931.13 | 70.22201 |
map2Anna without outliers | 2.115 | 44.22 | 64.82 | 74.42 | 93.6 | 177.6 | 1,886.191 | 43.4303 |
map2Swift without outliers | 1.715 | 35.94 | 50.84 | 61.49 | 76.74 | 156.5 | 1,530.202 | 39.1178 |
Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | Var. | Std. Dev | |
---|---|---|---|---|---|---|---|---|
map2Anna | 12.05 | 72.38 | 98.38 | 105 | 133.5 | 290.3 | 2,134.744 | 46.2033 |
map2Swift | 0.8245 | 54.24 | 81.06 | 85.55 | 111.5 | 702.6 | 2,039.718 | 45.16324 |
Identification and removal of the outliers
scaffold116
, positions 5,306,081
-5,306,303
. A
Shows the coverage distribution, as plotted by IGV. While, B
, shows the bottom of the region coverage peak, and the corresponding nucleotide sequence.extHighCoverageGene=coveragePerGeneSwift[outliersSwift>0]
extHighCoverageGene=extHighCoverageGene[extHighCoverageGene>300]
# Cpe_R001479
swiftSplitGene=split(swift.4,swift.4$V9)
dd=swiftSplitGene[names(extHighCoverageGene)][[1]]
dd$Size=dd$V5-dd$V4
dd=dd[c(1,4,5,9,11)]
colnames(dd)=c("Scaffold","Start", "End","Gene name","Size")
kable(dd, format="markdown", row.names = F)
Scaffold | Start | End | Gene name | Size |
---|---|---|---|---|
scaffold116 | 5296564 | 5296613 | Parent=Cpe_R001479; | 49 |
scaffold116 | 5301761 | 5301875 | Parent=Cpe_R001479; | 114 |
scaffold116 | 5302539 | 5302571 | Parent=Cpe_R001479; | 32 |
scaffold116 | 5305161 | 5305355 | Parent=Cpe_R001479; | 194 |
scaffold116 | 5306081 | 5306303 | Parent=Cpe_R001479; | 222 |
scaffold116 | 5308066 | 5308165 | Parent=Cpe_R001479; | 99 |
scaffold116 | 5316336 | 5316420 | Parent=Cpe_R001479; | 84 |
Num. outliers | Min. | 1st. Qu. | Median | Mean | 3rd. Qu. | Max. | |
---|---|---|---|---|---|---|---|
map2Anna | 21 | 225.2 | 237.3 | 251.2 | 255.5 | 272.8 | 290.3 |
map2Swift | 19 | 199.1 | 207.3 | 215.7 | 246.4 | 234.3 | 702.6 |
map2Anna Gene | map2Anna Size | map2Swift Gene | map2Size Gene |
---|---|---|---|
Parent=Aan_R000557; | 1025 | Parent=Cpe_R000257; | 1008 |
Parent=Aan_R000645; | 1400 | Parent=Cpe_R001479; | 794 |
Parent=Aan_R000757; | 1675 | Parent=Cpe_R001695; | 145 |
Parent=Aan_R000784; | 1089 | Parent=Cpe_R002655; | 431 |
Parent=Aan_R000837; | 2118 | Parent=Cpe_R003268; | 137 |
Parent=Aan_R001079; | 884 | Parent=Cpe_R003836; | 461 |
Parent=Aan_R001728; | 598 | Parent=Cpe_R004636; | 640 |
Parent=Aan_R001861; | 604 | Parent=Cpe_R005332; | 1643 |
Parent=Aan_R001895; | 3276 | Parent=Cpe_R008000; | 282 |
Parent=Aan_R002476; | 896 | Parent=Cpe_R008889; | 709 |
Parent=Aan_R003858; | 992 | Parent=Cpe_R011470; | 349 |
Parent=Aan_R004237; | 3287 | Parent=Cpe_R012383; | 1214 |
Parent=Aan_R004624; | 801 | Parent=Cpe_R012980; | 925 |
Parent=Aan_R005273; | 508 | Parent=Cpe_R013031; | 753 |
Parent=Aan_R005275; | 340 | Parent=Cpe_R013115; | 774 |
Parent=Aan_R006006; | 360 | Parent=Cpe_R013182; | 2619 |
Parent=Aan_R006018; | 584 | Parent=Cpe_R013834; | 743 |
Parent=Aan_R006633; | 1151 | Parent=Cpe_R014755; | 608 |
Parent=Aan_R007365; | 2134 | Parent=Cpe_R014832; | 1370 |
Parent=Aan_R007414; | 1060 | ||
Parent=Aan_R008340; | 964 |
Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | Var. | Std. Dev | |
---|---|---|---|---|---|---|---|---|
map2Anna without outliers | 12.05 | 72.16 | 98.54 | 105 | 133.5 | 290.3 | 2,144.958 | 46.31369 |
map2Swift without outliers | 0.8245 | 54.06 | 80.49 | 83.44 | 109.8 | 195.4 | 1,566.158 | 39.57471 |
I’m looking for a relation between gene size and depth of coverage. So I fit a linear model to the data.
lm()
function that need to be discussed
##
## Call:
## lm(formula = coveragePerGeneAnnaFiltered ~ genesAnnaFiltered$Size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.192 -32.445 -5.977 28.228 178.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.955749 2.051036 54.585 < 2e-16 ***
## genesAnnaFiltered$Size -0.004442 0.001065 -4.173 3.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.05 on 1454 degrees of freedom
## Multiple R-squared: 0.01183, Adjusted R-squared: 0.01115
## F-statistic: 17.41 on 1 and 1454 DF, p-value: 3.19e-05
lm()
function that need to be discussed
##
## Call:
## lm(formula = coveragePerGeneSwiftFiltered ~ genesSwiftFiltered$Size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.810 -29.445 -2.706 26.272 111.522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.8002830 1.7674363 47.979 <2e-16 ***
## genesSwiftFiltered$Size -0.0008706 0.0009156 -0.951 0.342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.58 on 1448 degrees of freedom
## Multiple R-squared: 0.000624, Adjusted R-squared: -6.614e-05
## F-statistic: 0.9042 on 1 and 1448 DF, p-value: 0.3418
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
---|---|---|---|---|---|---|
map2Anna | 0 | 34.12 | 84.28 | 167.6 | 178.3 | 42,890 |
map2Swift | 0 | 23.04 | 64.59 | 136.6 | 144.9 | 56,610 |
Num. outliers | Min. | 1st. Qu. | Median | Mean | 3rd. Qu. | Max. | |
---|---|---|---|---|---|---|---|
map2Anna | 1,066 | 0 | 31.09 | 75.25 | 102 | 150 | 394.5 |
map2Swift | 1,096 | 0 | 20.64 | 57.1 | 80.72 | 119.5 | 327.2 |
Information of the depth coverage distribution of the outliers
Num. outliers | Min. | 1st. Qu. | Median | Mean | 3rd. Qu. | Max. | |
---|---|---|---|---|---|---|---|
map2Anna | 810 | 0 | 34.75 | 80.78 | 106.3 | 155.3 | 394.5 |
map2Swift | 814 | 0 | 23.12 | 60.83 | 84.11 | 124.2 | 327.2 |
Size distribution after removing outliers
Min. | 1st. Qu. | Median | Mean | 3rd. Qu. | Max. | |
---|---|---|---|---|---|---|
map2Anna | 1 | 91 | 122 | 128.4 | 160 | 285 |
map2Swift | 1 | 91 | 122 | 128.3 | 160 | 285 |
Coverage distribution after removing size and coverage related outliers
Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | Var. | Std. Dev | |
---|---|---|---|---|---|---|---|---|
map2Anna without outliers | 0 | 34.75 | 80.78 | 106.3 | 155.3 | 394.5 | 8,395.816 | 91.62869 |
map2Swift without outliers | 0 | 23.12 | 60.83 | 84.11 | 124.2 | 327.2 | 5,944.672 | 77.1017 |
I’m looking for a relation between target size and depth of coverage.
lm()
function that need to be discussed
##
## Call:
## lm(formula = anna.6$Coverage ~ anna.6$Size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -167.95 -65.13 -20.57 47.08 337.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.43121 2.13600 78.85 <2e-16 ***
## anna.6$Size -0.48399 0.01548 -31.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.31 on 12742 degrees of freedom
## Multiple R-squared: 0.07122, Adjusted R-squared: 0.07115
## F-statistic: 977.1 on 1 and 12742 DF, p-value: < 2.2e-16
lm()
function that need to be discussed
##
## Call:
## lm(formula = swift.7$Coverage ~ swift.7$Size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.95 -56.29 -19.71 39.86 279.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 126.28067 1.81543 69.56 <2e-16 ***
## swift.7$Size -0.32870 0.01316 -24.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.28 on 12734 degrees of freedom
## Multiple R-squared: 0.0467, Adjusted R-squared: 0.04663
## F-statistic: 623.8 on 1 and 12734 DF, p-value: < 2.2e-16
While there is no specific color key for the coverage values, this will only serve as an overview of the coverage per target/gene per sample.
red/yellow = min/max coveragre (absolute? of the coverage distribution?)… just that is weird that most of the plot is red (or there were darker reds and this represents somehow an “intermediate” value of the cov distribution? (From the “targets” plots… one can see that there is a “sample” and a “locus/target” effect.“sample” effect is stronger… this goes on the lines of what we had though for your sims… using a “factor” per “sample/individual”… right?
>0x
.Full (1.469 Genes) | Without Outliers |
---|---|
Parent=Cpe_R000823; | Parent=Cpe_R000823; |
Parent=Cpe_R008410; | Parent=Cpe_R008410; |
Parent=Cpe_R008909; | Parent=Cpe_R008909; |
This is a summary of the target regions that were not captured per sample.
Dataset | Number of common targets not captured (within samples) | Percentage (common target/total target regions) | |
---|---|---|---|
Within map2Anna samples | Full dataset | 268 | 1.81818181818182 |
Within map2Anna samples | Outliers removed | 268 | 2.10295040803515 |
Within map2Swift samples | Full dataset | 153 | 1.0363045245191 |
Within map2Swift samples | Outliers removed | 153 | 1.20131909547739 |
Across datasets | Full Dataset | 0 | 0 |
Across datasets | Outliers removed | 0 | 0 |
Ànna
and Swift
references are exactly the same.map2Swift
dataset, even though the percentages are similar (1%), this was expected.Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | |
---|---|---|---|---|---|---|
Anna | 1 | 91 | 122 | 128.4 | 160 | 285 |
Swift | 1 | 91 | 122 | 128.3 | 160 | 285 |
" This shows that there are not common target regions across datasets. Makes sense that they do not match, because the target file of the map2Swift was match (by genes) to the map2Anna target file, and coordinates are different"
The off-target regions correspond to those regions of the genome that were not included in the target list.
To calculate this, for the purpose of this analysis, there are two (2
) approaches for the off-target regions.
Calypte_anna.gene.CDS.2750.2.gff
and the regions in Chaetura_pelagica.CDS.2.gff
, for their corresponding dataset.
In order to know what is the size of the off-target regions, I first needed to obtain the size of the genome per each of the species. To do that, I needed the size of the scaffolds, which I got using:
samtools faidx Calypte_anna.fasta
cut -f1,2 Calypte_anna.fasta.fai > Calypte_anna.sizes
samtools faidx Chaetura_pelagica.fasta
cut -f1,2 Chaetura_pelagica.fasta.fai > Chaetura_pelagica.sizes
With the size of the scaffolds per each species, I can generate a BED
file describing the coordinates of the scaffolds.
finalBed=cbind(genomeSizesAnna$Scaffold,rep(1,nrow(genomeSizesAnna)),genomeSizesAnna$Size)
write.table(finalBed,paste0(WD,"scaffolds.anna.bed"), row.names = F,col.names = F,quote = F, sep="\t")
finalBed=cbind(genomeSizesSwift$Scaffold,rep(1,nrow(genomeSizesSwift)),genomeSizesSwift$Size)
write.table(finalBed,paste0(WD,"scaffolds.swift.bed"), row.names = F,col.names = F,quote = F, sep="\t")
We know the set of scaffolds that represent the reference genomes and the position of the codons per dataset (represented in the darker box, bottom), the Calypte_anna.gene.CDS.2750.gff
file, for the map2Anna
dataset and the Chaetura_pelagica.CDS.2.gff
for the map2Swift
dataset with the relation of the codons to their corresponding genes.
Then, used bedtools subtract
(searches for features in B that overlap A. If an overlapping feature is found in B, the overlapping portion is removed from A and the remaining portion of A is reported) to removed all possible targets from the GFF
files out of their corresponding scaffolds file, obtaining two (2
) BED
files, one per dataset.
bedtools subtract -a scaffolds.anna.bed -b $gffAnna > "$HOME/files/offtarget.anna.bed" &
bedtools subtract -a scaffolds.anna.bed -b $gffSwift > "$HOME/files/offtarget.swift.bed" &
General overview of scaffolds and resulting off-target regions
Number of scaffolds | Genome Size | Number of resulting off-target regions | Genome Size (Off-target) | |
---|---|---|---|---|
Anna | 122,611 | 1,114,626,264 | 137,351 | 1,112,203,167 |
Swift | 60,234 | 1,125,117,173 | 74,998 | 1,122,762,549 |
General overview of the size distribution of off-target regions
Unit | Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | |
---|---|---|---|---|---|---|---|
Anna | Scaffold | 100 | 118 | 169 | 9,091 | 605 | 23,310,000 |
Swift | Scaffold | 100 | 111 | 129 | 18,680 | 295 | 19,540,000 |
Anna | Target | 1 | 119 | 196 | 8,098 | 770 | 5,756,000 |
Swift | Target | 1 | 113 | 156 | 14,970 | 699 | 5,805,000 |
The idea is to identify the off-target regions, as well as the size that covers and how is the coverage distribution within this regions compared to the on-target coverage.
To obtain coverage information from the off-target regions:
for bamfile in $(find $originalsAnna -name "*.bam" ); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.anna.bed" | gzip > $HOME/anna/bedtools/cov2/${tag}.cov.gz &
done
for bamfile in $(find $originalsSwift -name "*.bam" ); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.swift.bed" | gzip > $HOME/swift/bedtools/cov2/${tag}.cov.gz &
done
# hist split
for tag in $(cat $HOME/anna/files/samples.txt ); do
zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist2/$tag.nohist.gz
zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist2/$tag.hist.gz
done
for tag in $(cat $HOME/swift/files/samples.txt ); do
zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist2/$tag.nohist.gz
zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist2/$tag.hist.gz
done
I followed what has been done for the on-target regions to get this information. Getting how much of the off-target regions was captured, and at which depth.
map2Anna |
map2Swift |
---|---|
X-axis: 0-20 | X-axis: 0-20 |
map2Anna
dataset, for both off-target sets breadth vs. depth looks the same. Same situation with the mat2Swift
dataset.map2Anna
, we see that the general depth is low, 25% of the off-target regions are covered at 1x for most of the samples.map2Swift
, we see that there are 3 situations:
AnnaBGI
and SwiftBGI
in the map2Anna
datasets, is because these samples were not in that mapping.Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | Var. | Std. Dev | |
---|---|---|---|---|---|---|---|---|
map2Anna | 0 | 0 | 0.002091 | 1.789 | 0.02666 | 18,010 | 5,273.295 | 72.61746 |
map2Swift | 0 | 0 | 0.001225 | 4.87 | 0.2126 | 6,299 | 2,939.12560967077 | 54.2137031540068 |
I am trying to analyze if there is any correlation between the depth of coverage and the phylogenetic distance from the reference species, used for the probe generation and mapping.
Information of the phylogenetic reconstruction I got from the Hummingbirds Paper, this says that it was made with mitochondrial genome and nuclear gene trees using either a concatenation approach with RAxML [3] or the multi-species coalescent approach implemented in ASTRAL [4] and ASTRAL-II [5].
The phylogenies were built using subsets of the nuclear genes present the same topology between the main groups of hummingbirds with high confidence. The three subsets correspond to:
This is the Supplementary Table S5 (from the Hummingbirds Paper), describing the subsets selected:
number of genes | minimum number of sites per species (concatenated) | maximum number of sites per species (concatenated) | includes outgroup | ASTRAL score |
---|---|---|---|---|
741 | 949,470 | 1,542,750 | yes | 87% |
1987 | 1,667,071 | 2,783,832 | yes | 78% |
2949 | 2,473,664 | 3,792,500 |
Cpe
, outgroup reference.Aan
, ingroup reference.For the analysis shown above as well as for phylogeny reconstruction using the dataset (1) with 2,949 genes.
To obtain the distance matrix I used the tree obtained with RAxML, from the concatenation of 2,750 genes. The tree gives me the branch lengths in number of substitutions per site. So, the distance calculated here, is the pairwise distance between the tips. This was done with the R package phangorn
[7], and its cophenetic
function. This function computes the pairwise distances between the pairs of tips from a phylogenetic network using its branch lengths.
sampleColors = colorRampPalette(brewer.pal(9, "Set1"))(48)
treefile="/home/merly/git/cph-visit/coverage-analysis/empirical/data/trees/RAxML_bipartitions.2011.concat"
tree=read.nexus(treefile)
distMatrix=cophenetic(tree)
dCpe=distMatrix["Cpe",]
dAan=distMatrix["Aan",]
distMatrix[upper.tri(distMatrix)]=NA
Following the figure F3A in Braggs’s paper [8], where it shows sequencing coverage as a function of genetic divergence. I have calculated the depth of coverage for both datasets and phylogenetic distance. In the plot below you can see the median target coverage per sample.
Is the “Detailed” part, a long description of all steps you did to get above, or different things… bit confusing to me… it seems like a different thing, where you’ll now look at this relationship PER GENE, right??
After genes with missing data were removed leaving 2,750 genes, and after matching those genes to the ones available in Swift, 1,469.
For obtain the relation between depth of coverage and phylogenetic distance gene-wise, I will used a smaller dataset (2), with 1,987 genes, which I will intersect with those from which I do have the coverage information.
To do that, I generated a new dataset from the intersection of the genes found in Calypte_anna.gene.CDS.2750.3.gff
(1,469), which are those ortholog genes matching across datasets, and the available gene trees (1,988).
cat "$WD/files/Calypte_anna.gene.CDS.2750.3.gff" | cut -f9 | uniq | tr "_" " " | awk '{print $2}' | tr ";" " " > g.trees.to.filter.txt
for gtreefile in $(cat $WD/files/g.trees.to.filter.txt); do
cp "$WD/trees/gtrees/$gtreefile" "$WD/trees/gtrees.2/"
done
R001806, R005562, R009279, R014527, R014590
.ls "$WD/trees/seqs" | tr "." " " | awk '{print $1}' > "$WD/files/g.trees.to.filter.seqs.2.txt"
mkdir "$WD/trees/seqs.2"
for gtreefile in $(cat $WD/files/g.trees.to.filter.seqs.2.txt); do
filename="$WD/trees/seqs/${gtreefile}.x.aa.aln.backTranslated.gz"
cp $filename "$WD/trees/seqs.2/"
done
You made an underlying assumption that “hamming distances” will reflect phylogenetic distance. May not be exactly like that, but, ok, I think it works to approximate.
Anna
and to Swift
. I did this using the dist.dna()
function, from the ape
R
package, with the model raw
. This function computes a matrix of pairwise distances from DNA sequences using a model of DNA evolution. The model raw
is simply the proportion or the number of sites that differ between each pair of sequences.geneList=unlist(read.table(paste0(WD,"g.trees.to.filter.seqs.2.txt"), stringsAsFactors=F))
distancesAnna=matrix(rep(0,length(geneList)*length(sampleNamesSwift)), length(geneList), length(sampleNamesSwift))
distancesSwift=matrix(rep(0,length(geneList)*length(sampleNamesSwift)), length(geneList), length(sampleNamesSwift))
rownames(distancesAnna)=geneList; colnames(distancesAnna)=sampleNamesSwift;
rownames(distancesSwift)=geneList; colnames(distancesSwift)=sampleNamesSwift
gtreefiles <- list.files(
path = paste0("/media/merly/Baymax/research/cph-visit/coverage-analysis/trees/seqs.2/"),
pattern="*.x.aa.aln.backTranslated")
gtreefiles=paste0("/media/merly/Baymax/research/cph-visit/coverage-analysis/trees/seqs.2/",gtreefiles)
labelsCorrect=c("H05","H01","H08","H03","H04","H06","H02","H07")
labelsWronglyFormated=c("H5","H1","H8","H3","H4","H6","H2","H7")
for (i in 1:length(gtreefiles)) {
print(i)
msa=as.matrix(read.FASTA(gzfile(gtreefiles[i])))
hammDist=as.matrix(dist.dna(msa, model="raw"))
indexAnna=which(startsWith(rownames(hammDist),"Aan"))
indexSwift=which(startsWith(rownames(hammDist),"Cpe"))
namesIds=which(rownames(hammDist) %in% labelsCorrect)
rownames(hammDist)[c(indexAnna, indexSwift, namesIds)]=c("AnnaBGI","SwiftBGI",labelsWronglyFormated)
colnames(hammDist)[c(indexAnna, indexSwift, namesIds)]=c("AnnaBGI","SwiftBGI",labelsWronglyFormated)
gene=strsplit(basename(gtreefiles[i]),"\\.")[[1]][1]
distancesAnna[gene,colnames(distancesAnna)]=hammDist[indexAnna,colnames(distancesAnna)]
distancesSwift[gene,colnames(distancesSwift)]=hammDist[indexSwift,colnames(distancesSwift)]
}
write.table(distancesAnna,file=paste0(WD,"distance.matrix.per.gene.anna.txt"), col.names=T,row.names=T)
write.table(distancesSwift,file=paste0(WD,"distance.matrix.per.gene.swift.txt"), col.names=T,row.names=T)
distancesAnna=read.table(file=paste0(WD,"distance.matrix.per.gene.anna.txt"), header=T)
distancesSwift=read.table(file=paste0(WD,"distance.matrix.per.gene.swift.txt"), header=T)
birds48filename=paste0(WD,"48birds_ortholog.list.chi.anna.cpe.hum.finch")
birds=read.table(birds48filename, header=T,
colClasses=c("character","numeric","character","character",
"character","character","character"))
birds.2=birds[birds$anna!="-",]
birds.3=birds.2[birds.2$swift!="-",]
birds.4=unique(birds.3)
birds.5=birds.4[,c("anna","swift")]
birds.5$swift=paste0("Parent=",birds.5$swift,";")
birds.5$anna=paste0("Parent=Aan_", birds.5$anna,";")
annaGenes=unique(gffAnna$V9)
birds.6=birds.5[birds.5$anna %in% annaGenes, ]
birds.6$swift.plain=unlist(strsplit(unlist(lapply(strsplit(birds.6$swift,"_"), function(x){x[[2]]})), ";"))
birds.6$anna.plain=unlist(strsplit(unlist(lapply(strsplit(birds.6$anna,"_"), function(x){x[[2]]})), ";"))
smallbird=birds.6[birds.6$anna.plain %in% intersect(birgit status
ds.6$anna.plain,geneList),]
distancesAnna=distancesAnna[smallbird$anna.plain,]
distancesSwift=distancesSwift[smallbird$anna.plain,]
rownames(distancesAnna)=smallbird$anna
rownames(distancesSwift)=smallbird$swift
samplesFilenameAnna="/home/merly/git/cph-visit/coverage-analysis/empirical/data/anna/samples.txt"
samplesFilenameSwift="/home/merly/git/cph-visit/coverage-analysis/empirical/data/swift/samples.txt"
sampleNamesAnna=unlist(read.table(samplesFilenameAnna, stringsAsFactors = F))
sampleNamesSwift=unlist(read.table(samplesFilenameSwift, stringsAsFactors = F))
coverageMatrixFileAnna="/home/merly/git/cph-visit/coverage-analysis/empirical/data/anna/coverage.matrix.per.gene.txt"
coverageMatrixFileSwift="/home/merly/git/cph-visit/coverage-analysis/empirical/data/swift/coverage.matrix.per.gene.txt"
coverageMatrixAnna=read.table(coverageMatrixFileAnna, header=T)/genesAnna$Size
coverageMatrixSwift=read.table(coverageMatrixFileSwift, header=T)/genesSwift$Size
rownames(coverageMatrixAnna)=genesAnna$Gene
rownames(coverageMatrixSwift)=genesSwift$Gene
#-------------------------------------------------------------------------------
cmswift=coverageMatrixSwift[smallbird$swift,]
cmanna=coverageMatrixAnna[smallbird$anna,]
cmanna=cbind(cmanna,AnnaBGI=rep(0,nrow(cmanna)), SwiftBGI=rep(0,nrow(cmanna)))
(Click image to enlarge)
##
## Call:
## lm(formula = finalCoverageSwift ~ finalDistanceSwift)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.29 -36.23 -12.25 23.84 152.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.9422 0.8903 70.696 <2e-16 ***
## finalDistanceSwift -8.1952 11.2783 -0.727 0.467
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 48.22 on 20806 degrees of freedom
## Multiple R-squared: 2.538e-05, Adjusted R-squared: -2.269e-05
## F-statistic: 0.528 on 1 and 20806 DF, p-value: 0.4675
(Click image to enlarge)
(Click image to enlarge)
##
## Call:
## lm(formula = finalCoverageAnna ~ finalDistanceAnna)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.18 -39.74 -12.89 26.50 182.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.0355 0.6766 103.52 <2e-16 ***
## finalDistanceAnna 423.2795 33.4987 12.64 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 54.82 on 19718 degrees of freedom
## Multiple R-squared: 0.008032, Adjusted R-squared: 0.007982
## F-statistic: 159.7 on 1 and 19718 DF, p-value: < 2.2e-16
(Click image to enlarge)
Now, the plots below show the relation between phylogenetic distance and coverage, of two (2
) samples from the Cocquettes, Brilliants, Mangoes and Hermits groups, following the ML tree from the Hummingbird paper (F1,A). These samples correspond to one with high coverage and one with low coverage.
Coquettes | Brilliants | Mangoes | Hermits | |
---|---|---|---|---|
Low Coverage - map2Anna | 0.0078472 | 0.1558059 | 1.52e-05 | 0.0000023 |
High Coverage - map2Anna | 0.0000386 | 0.1587529 | 1.43e-04 | 0.0054126 |
Low Coverage - map2Swift | 0.0000003 | 0.0000021 | 0.00e+00 | 0.0000091 |
High Coverage - map2Swift | 0.0031817 | 0.0045679 | 8.60e-06 | 0.0000005 |
100
and standard deviation ~40
~2%
which for this case is not known if it is just because of the capture, or if it is because some of the targets used for mapping were not used to design the probes.~100x
), the off-target regions have less that 1% of the mean coverage, and so I think that the variation predicted fits perfectly.Source files
Bash script with the commands for the coverage calculation [bash] Download
On target coverage analysis [Rscript] Download
Off target coverage analysis [Rscript] Download
Phylogenetic distance vs. depth of coverage (Gene-wise) Download
Phylogenetic distance vs. depth of coverage (RaxML) Download