This R code describes our analyses of haplotype distributions in Mediterranean Halimeda tuna.
Clean up history and load dependencies:
rm(list=ls(all=TRUE))
library(ape)
library(pegas)
library(RColorBrewer)
library(plotrix)
library(rworldmap)
Show the version of R and packages:
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rworldmap_1.3-6 sp_1.3-1 plotrix_3.7-6
## [4] RColorBrewer_1.1-2 pegas_0.11 adegenet_2.1.1
## [7] ade4_1.7-13 ape_5.3
##
## loaded via a namespace (and not attached):
## [1] maps_3.3.0 splines_3.6.1 dotCall64_1.0-0
## [4] gtools_3.8.1 shiny_1.3.2 assertthat_0.2.1
## [7] expm_0.999-4 yaml_2.2.0 LearnBayes_2.15.1
## [10] pillar_1.4.2 lattice_0.20-38 glue_1.3.1
## [13] digest_0.6.21 promises_1.0.1 colorspace_1.4-1
## [16] htmltools_0.3.6 httpuv_1.5.2 Matrix_1.2-17
## [19] plyr_1.8.4 pkgconfig_2.0.2 gmodels_2.18.1
## [22] purrr_0.3.2 xtable_1.8-4 scales_1.0.0
## [25] gdata_2.18.0 later_0.8.0 tibble_2.1.3
## [28] mgcv_1.8-28 ggplot2_3.2.1 lazyeval_0.2.2
## [31] magrittr_1.5 crayon_1.3.4 mime_0.7
## [34] deldir_0.1-23 maptools_0.9-5 evaluate_0.14
## [37] nlme_3.1-140 MASS_7.3-51.4 foreign_0.8-71
## [40] class_7.3-15 vegan_2.5-6 tools_3.6.1
## [43] stringr_1.4.0 munsell_0.5.0 cluster_2.1.0
## [46] compiler_3.6.1 e1071_1.7-2 rlang_0.4.0
## [49] classInt_0.4-1 units_0.6-4 grid_3.6.1
## [52] spam_2.3-0 igraph_1.2.4.1 rmarkdown_1.15
## [55] boot_1.3-22 gtable_0.3.0 DBI_1.0.0
## [58] reshape2_1.4.3 R6_2.4.0 knitr_1.25
## [61] dplyr_0.8.3 seqinr_3.6-1 spdep_1.1-3
## [64] KernSmooth_2.23-15 permute_0.9-5 stringi_1.4.3
## [67] parallel_3.6.1 Rcpp_1.0.2 fields_9.8-6
## [70] sf_0.8-0 spData_0.3.2 tidyselect_0.2.5
## [73] xfun_0.9 coda_0.19-3
Load the tufA dataset and calculate the haplotypes contained in it. Note that the sequence file was prepared to have the sequence labels refer to the region where the sample was from. For example:
>Malta
ATGATCACTGGAGCGGC…
>Sardinia
ATGATCACTGGAGCGGC…
>Sardinia
ATGATCACTGGAGCGGC…
>Cataluna
ATGATCACTGGAGCGGC…
tufA = read.dna("tufA_loc.fas",format="fasta")
tufAhaps = haplotype(tufA)
tufAhaps
##
## Haplotypes extracted from: tufA
##
## Number of haplotypes: 7
## Sequence length: 822
##
## Haplotype labels and frequencies:
##
## I II III IV V VI VII
## 5 23 3 6 1 2 1
Okay, so there are seven different haplotypes in the dataset. Now let’s calculate the haplotype network and show some basic information about it.
tufAnet = haploNet(tufAhaps)
tufAnet
## Haplotype network with:
## 7 haplotypes
## 8 links
## link lengths between 1 and 1 steps
##
## Use print.default() to display all elements.
And let’s build a table that indicates where the different haplotypes are found.
tufA.ind.hap = with(stack(setNames(attr(tufAhaps,"index"), rownames(tufAhaps))),table(hap=ind, individuals=rownames(tufA)[values]))
tufA.ind.hap
## individuals
## hap Adriatic Cataluna ItalyNW.France ItalyW Malta Murcia Sardinia
## I 1 0 2 0 2 0 0
## II 7 7 3 4 0 1 1
## III 2 1 0 0 0 0 0
## IV 1 0 0 0 0 3 2
## V 0 0 0 1 0 0 0
## VI 0 0 0 0 0 0 0
## VII 0 0 0 1 0 0 0
## individuals
## hap Sicilia
## I 0
## II 0
## III 0
## IV 0
## V 0
## VI 2
## VII 0
Okay, time to plot the network. First we will choose a set of 8 colours from RColorBrewer (one per geographic area in the dataset), then plot the haplotype network, showing pie charts indicating in which geographic areas the haplotypes are found, and add a legend.
colors=brewer.pal(8, "Set1")
plot(tufAnet,size=attr(tufAnet,"freq"),scale.ratio=2,cex=0.8,pie=tufA.ind.hap,bg=colors,show.mutation=2,threshold=0)
legend("bottomleft", colnames(tufA.ind.hap), fill=colors)
We will do two analyses of the data from the ribosomal proteins cluster. First, we will analyse the original dataset, which doesn’t take the length polymorphism of the simple sequence repeats into account. All code is the same as for tufA above, so we’ll just run through and produce a haplotype network plot.
rp = read.dna("rp_loc.fas",format="fasta")
rphaps = haplotype(rp)
rphaps
##
## Haplotypes extracted from: rp
##
## Number of haplotypes: 4
## Sequence length: 1499
##
## Haplotype labels and frequencies:
##
## I II III IV
## 4 10 30 1
rpnet = haploNet(rphaps)
rp.ind.hap<-with(stack(setNames(attr(rphaps,"index"), rownames(rphaps))),table(hap=ind, individuals=rownames(rp)[values]))
rp.ind.hap
## individuals
## hap Adriatic Cataluna ItalyNW.France ItalyW Malta Murcia Sardinia Sicily
## I 0 0 0 0 0 0 4 0
## II 10 0 0 0 0 0 0 0
## III 0 7 6 8 3 3 0 3
## IV 0 0 0 0 0 1 0 0
colors=brewer.pal(8, "Set1")
plot(rpnet,size=attr(rpnet,"freq"),scale.ratio=2,cex=0.8,pie=rp.ind.hap,bg=colors,show.mutation=2,threshold=0)
legend("topleft", colnames(rp.ind.hap), fill=colors)
Because the ribosomal proteins dataset contains several simple repeats, these repeats were manually re-coded so that each additional repeat would count as a single point mutation. We will now analyse the resulting dataset.
rp = read.dna("rp_loc_recoded.fas",format="fasta")
rphaps = haplotype(rp)
rphaps
##
## Haplotypes extracted from: rp
##
## Number of haplotypes: 10
## Sequence length: 1514
##
## Haplotype labels and frequencies:
##
## I II III IV V VI VII VIII IX X
## 4 10 3 8 8 1 1 8 1 1
So instead of the 4 haplotypes from above, with the re-coding we now get 10 haplotypes from the ribosomal proteins data. Let’s continue the analysis and plot the network.
rpnet = haploNet(rphaps)
rp.ind.hap<-with(stack(setNames(attr(rphaps,"index"), rownames(rphaps))),table(hap=ind, individuals=rownames(rp)[values]))
rp.ind.hap
## individuals
## hap Adriatic Cataluna ItalyNW.France ItalyW Malta Murcia Sardinia
## I 0 0 0 0 0 0 4
## II 10 0 0 0 0 0 0
## III 0 1 0 0 0 0 0
## IV 0 0 1 3 3 0 0
## V 0 1 3 2 0 2 0
## VI 0 0 0 1 0 0 0
## VII 0 0 0 0 0 1 0
## VIII 0 4 2 1 0 1 0
## IX 0 0 0 1 0 0 0
## X 0 1 0 0 0 0 0
## individuals
## hap Sicily
## I 0
## II 0
## III 2
## IV 1
## V 0
## VI 0
## VII 0
## VIII 0
## IX 0
## X 0
colors=brewer.pal(8, "Set1")
plot(rpnet,size=attr(rpnet,"freq"),scale.ratio=2,cex=0.8,pie=rp.ind.hap,bg=colors,show.mutation=2,threshold=0)
legend("topleft", colnames(rp.ind.hap), fill=colors)
Okay, we will now plot the haplotype information on a map of the western Mediterranean Sea, using the re-coded ribosomal proteins data as that seems to show the finest resolution among all datasets analysed. We’ll start by plotting the map itself.
worldmap <- getMap(resolution = "low")
plot(worldmap,xlim=c(-6,20),ylim=c(38,40),col="gray95")
Now we need some latitude/longitude coordinates for the different geographic areas. This was prepared as a table that we’ll import here as a data frame.
df = read.table(file="regions_coordinates,txt",header=T,sep="\t")
df
## region long lat
## 1 Adriatic 15.60 42.4
## 2 Cataluna 2.90 41.6
## 3 ItalyNW.France 7.80 43.7
## 4 ItalyW 12.70 41.1
## 5 Malta 14.30 35.4
## 6 Murcia -0.73 37.8
## 7 Sardinia 9.00 39.5
## 8 Sicily 14.00 38.2
Let’s plot the map again with some text labels to indicate these regions.
plot(worldmap,xlim=c(-10,20),ylim=c(38,41),col="gray95")
text(x=df$long,y=df$lat,labels=df$region)
For downstream analyses, it’s important to double-check that the order of regions in imported table matches that in the rp.ind.hap
variable, so we’ll check that here.
names(rp.ind.hap[1,]) == df$region
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Looks good. Now we will create a matrix from the haplotype frequency information that was calculated earlier and stored in rp.ind.hap
.
sec = NULL
for (i in 1:nrow(rp.ind.hap)) {
sec = cbind(sec,rp.ind.hap[i,])
}
sec
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## Adriatic 0 10 0 0 0 0 0 0 0 0
## Cataluna 0 0 1 0 1 0 0 4 0 1
## ItalyNW.France 0 0 0 1 3 0 0 2 0 0
## ItalyW 0 0 0 3 2 1 0 1 1 0
## Malta 0 0 0 3 0 0 0 0 0 0
## Murcia 0 0 0 0 2 0 1 1 0 0
## Sardinia 4 0 0 0 0 0 0 0 0 0
## Sicily 0 0 2 1 0 0 0 0 0 0
Okay, with all of that in place we can now plot the pie charts onto the map. First, we’ll pick 10 colors (corresponding to the 10 haplotypes in the dataset). Then we’ll plot the map and text labels again, and loop through each region to plot a pie chart and add the number of samples sequenced in that region.
colors=brewer.pal(10, "Set3")
plot(worldmap,xlim=c(-6,20),ylim=c(38,40),col="gray95")
text(x=df$long,y=df$lat+1.1,labels=df$region)
for (i in 1:nrow(df)) {
r = sec[i,]
names(r) = colors
r=sort(r,decreasing=T)
floating.pie(df$long[i],df$lat[i],r,radius=1,col=names(r))
text(df$long[i],df$lat[i]-1,cex=.8,labels=paste0("(n=",sum(sec[i,]),")"))
}
Last thing to do is relate the colours used in the map to the haplotypes in the network, so we’ll re-plot the network with those colours.
plot(rpnet,size=attr(rpnet,"freq"),fast=FALSE,show.mutation=2,threshold=0,bg=colors)
That’s it.