DIVERSIFICATION INTO THE GENUS BADNAVIRUS: PHYLOGENY AND POPULATION GENETIC VARIABILITY

Badnaviruses (family Caulimoviridae) have semicircular dsDNA genomes encapsidated into bacilliform particles. The genus Badnavirus is the most important due to the high number of species reported infecting cultivated plants worldwide. This study aimed to evaluate the phylogenetic positioning and population genetic variability within Badnavirus. Data sets comprising badnavirus complete genomes and partial sequences of the RT and RNaseH genes were obtained from the GenBank database. Multiple nucleotide sequence alignments of the complete genome, ORFIII, and the complete (1020 bp) and partial (579 bp) RT/RNaseH genomic domain were performed. A total of 127 genomes were obtained, representing 53 badnavirus species. Nucleotide sequence comparisons for the RT/RNaseH domain showed that only a few isolates reported as distinct species shared ≥80% identity, the current threshold used for species demarcation in this genus. Phylogenetic trees for the complete genome and for ORFIII showed four well-supported clusters (badnavirus groups 1-4), with clusters 1 and 3 being sister groups comprising predominantly sugarcane- and banana-infecting species. Non-tree-like evolution analysis provided evidence of putative recombination events among badnaviruses, and at least 23 independent events were detected. High levels of nucleotide diversity were observed for the partial RT/RNaseH region in isolates of 11 badnavirus species. These results show that mutation and recombination are important mechanisms acting on badnavirus diversification.


INTRODUCTION
Viruses belonging to the family Caulimoviridae have semicircular, double-stranded (ds)DNA genomes, 7.2-9.2 kbp in length, which are encapsidated into isometric or bacilliform particles and replicate through an RNA intermediate (plant pararetroviruses; Geering and Hull, 2012). This family is divided into eight genera (Badnavirus, Caulimovirus, Cavemovirus, Petuvirus, Rosadnavirus, Solendovirus, Soymovirus and Tungrovirus) according to host range, insect vector, genomic organization and phylogeny (Geering and Hull, 2012; Bath et al., 2016). Badnaviruses are transmitted mostly by mealybugs (a few species by aphids) in a semi-persistent manner (Geering and Hull, 2012; Bath et al., 2016) and are among the most important plant viruses with a DNA genome.
The semicircular dsDNA of badnaviruses has site-specific discontinuities and at least three ORFs, named I, II and III (Bouhida et al., 1993; Hagen et al., 1993; Harper and Hull, 1998; Geering and Hull, 2012). The proteins encoded by ORFs I and II have been reported to be virion-associated (Cheng et al., 1996) and nucleic acid-binding (ORF II; Jacquot et al., 1996). ORFIII encodes a polyprotein of 208-216 kDa that is proteolytically cleaved to generate the movement and coat proteins, the aspartate protease responsible for the polyprotein cleavage, and the reverse transcriptase (RT) and ribonuclease H (RNaseH), both genomic domains involved in viral replication (Medberry et al., 1990; Harper and Hull, 1998). The criterion of ≥80% nucleotide sequence identity for the RT/RNaseH domains was established for species demarcation in the genus Badnavirus (Geering and Hull, 2012), and specific primer pairs are widely used to amplify this viral genomic region (Yang et al., 2003). However, different studies have shown this criterion to be insufficient to separate some badnavirus species, mostly those infecting banana and sugarcane (Muller et al., 2011; Karuppaiah et al., 2013; Silva et al., 2015). Furthermore, the existence of endogenous badnavirus sequences represents a great challenge for the taxonomy and diagnosis of members of this genus.
An in silico large-scale study was carried out to obtain more information about the genetic relationships and variability in Badnavirus. The threshold for nucleotide sequence comparisons of the RT/RNaseH genomic region currently used for species demarcation identifies most reported badnaviruses. However, this criterion alone is unable to differentiate all sugarcane- and banana-infecting badnaviruses. A new badnavirus phylogenetic clade is proposed, here named badnavirus group 4. Additionally, badnaviruses were found to be a highly diverse viral group, with recombination and mutation being important factors contributing to the high levels of nucleotide diversity observed in some badnavirus populations.

Badnavirus data set
Full-length genome sequences of badnaviruses were retrieved from the non-redundant GenBank database (www.ncbi.nlm.nih.gov/genbank; accessed in Nov 2017) (Table 1). Data sets of nucleotide sequences of ORFIII and of the full (1020 bp) and partial (579 bp) RT/RNaseH domains were obtained from the complete genomes. The partial RT/RNaseH data set comprises the genomic region amplified by the primer pairs widely used for detection and identification of badnaviruses (Yang et al., 2003). A previous analysis using sequences from ORFs I and II showed that these regions are inconclusive, with low support and insufficient phylogenetic signal; for this reason, they were not included in this study (data not shown).

Sequence analysis
Multiple amino acid sequence alignments were prepared for ORFIII using the MUSCLE algorithm (Edgar, 2004), manually edited in the MEGA7 package (Kumar et al., 2016) and converted back to nucleotide sequences for subsequent analyses. The RT/RNaseH sequences (full and partial) were obtained from the ORFIII data set. Additionally, multiple nucleotide alignments were obtained for the complete genome.
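The conversion of an amino acid alignment back to nucleotides can be illustrated with a minimal sketch. It assumes that each unaligned coding sequence starts in frame and encodes exactly the residues of its aligned protein; the function name and the gap symbol are hypothetical and do not reproduce the MEGA7 implementation.

```python
def back_translate(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Each residue in the protein alignment is replaced by the codon that
    encodes it, and each alignment gap by three gap characters.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the protein it should encode")

    codons = []
    pos = 0  # position of the next unused codon in the coding sequence
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)
```

Aligning at the protein level and threading codons afterwards keeps gaps in multiples of three, so the reading frame is preserved in the nucleotide alignment.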
To confirm the taxonomy attributed to the badnavirus isolates retrieved from GenBank, pairwise nucleotide sequence comparisons were performed for all data sets using the Sequence Demarcation Tool (SDT) v.1.2 (Muhire et al., 2013).
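As a rough illustration of how a pairwise comparison relates to the 80% demarcation threshold, the sketch below computes percent identity between sequences that are assumed to be already aligned, ignoring columns where either sequence has a gap. The function names are hypothetical, and SDT's own alignment and scoring procedure may differ in detail.

```python
from itertools import combinations


def pairwise_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def pairs_at_or_above(alignment: dict, threshold: float = 80.0) -> list:
    """Return (name_a, name_b, identity) for every pair at or above the threshold."""
    hits = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        identity = pairwise_identity(seq_a, seq_b)
        if identity >= threshold:
            hits.append((name_a, name_b, identity))
    return hits
```

Pairs returned by `pairs_at_or_above` for isolates assigned to different species are the cases in which the current criterion fails to separate them.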

Phylogenetic inference
To determine whether the phylogenetic relationships observed for the RT/RNaseH region reflect the clustering inferred for the complete genomes of badnaviruses, Bayesian phylogenetic trees were obtained for the complete genome, ORFIII and RT/RNaseH (full and partial) data sets. Analyses were run using MrBayes v. 3.2 (Ronquist et al., 2012) through the CIPRES web portal (Miller et al., 2010), assuming GTR+I+G as the evolutionary model. Two replicates with four chains each were run for 20 million generations, sampling every 2,000 generations. The first 2,500 trees were discarded as a burn-in phase in each run. Posterior probabilities (Rannala and Yang, 1996) were determined from a majority-rule consensus tree generated with the 15,000 remaining trees. The trees were edited in FigTree v.1.4 (tree.bio.ed.ac.uk/software/figtree) and Inkscape (https://inkscape.org/pt/).
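The number of trees retained after burn-in follows directly from the run settings given above. The short calculation below makes the arithmetic explicit; the variable names are illustrative, and any extra tree written for the starting state of each run is ignored.

```python
GENERATIONS = 20_000_000        # generations per run
SAMPLE_EVERY = 2_000            # sampling frequency (generations)
RUNS = 2                        # independent replicates
BURNIN_TREES_PER_RUN = 2_500    # trees discarded from each run

sampled_per_run = GENERATIONS // SAMPLE_EVERY                # 10,000 trees per run
retained_per_run = sampled_per_run - BURNIN_TREES_PER_RUN    # 7,500 trees per run
retained_total = RUNS * retained_per_run                     # 15,000 trees in total
burnin_fraction = BURNIN_TREES_PER_RUN / sampled_per_run     # 0.25 of each run

print(retained_total, burnin_fraction)
```

The burn-in therefore corresponds to the first 25% of the samples in each run, and the 15,000 trees used for the consensus are the pooled remainder of both replicates.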

Recombination analysis
Evidence of non-tree-like evolution was assessed for the complete genome, ORFIII, and RT/RNaseH (full and partial) data sets using the Neighbor-Net method implemented in SplitsTree v.4.10 (Huson and Bryant, 2006). Putative parental sequences and recombination breakpoints for the complete genome data set were determined using the methods RDP, Geneconv, Bootscan, Maximum Chi Square, Chimaera, SisterScan and 3Seq implemented in the RDP v.4.0 package (Martin et al., 2015). Alignments were analyzed with default settings for the different methods, and statistical significance was inferred by a P-value lower than a Bonferroni-corrected cut-off of 0.05. Only events detected by at least five different methods were considered to be reliable.
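The acceptance rule described above (a Bonferroni-corrected cut-off and support from at least five methods) can be sketched as a simple filter. This is an illustration only: the RDP package applies its own multiple-testing correction internally, and the number of tests, the data layout and the function names used here are assumptions.

```python
# Method names as listed in the text; the dictionary layout is hypothetical.
METHODS = ("RDP", "Geneconv", "Bootscan", "MaxChi", "Chimaera", "SisterScan", "3Seq")


def supporting_methods(p_values: dict, n_tests: int, alpha: float = 0.05) -> list:
    """Methods whose P-value falls below the Bonferroni-corrected cut-off.

    p_values maps a method name to its P-value, or to None when the method
    did not detect the event. n_tests is the assumed number of comparisons.
    """
    cutoff = alpha / n_tests  # Bonferroni correction
    return [
        method for method in METHODS
        if p_values.get(method) is not None and p_values[method] < cutoff
    ]


def is_reliable(p_values: dict, n_tests: int, alpha: float = 0.05,
                min_methods: int = 5) -> bool:
    """Accept an event only if enough methods support it."""
    return len(supporting_methods(p_values, n_tests, alpha)) >= min_methods
```

Requiring agreement among several methods reduces the chance that an artefact of any single detection algorithm is reported as a recombination event.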

Population genetic variability
Partial nucleotide sequences of the RT/RNaseH region of badnaviruses infecting different hosts were retrieved from the non-redundant GenBank database (www.ncbi.nlm.nih.gov/genbank; accessed in Dec 2017). The mean pairwise number of nucleotide differences per site (nucleotide diversity, π) was estimated for each population using DnaSP v. 6.10 (Rozas et al., 2017).
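For clarity, π as defined above is the average, over all pairs of sequences in a population, of the proportion of sites at which the two sequences differ. The sketch below is a minimal illustration that assumes aligned sequences and excludes every column containing a gap or ambiguous base; the function name is hypothetical, and DnaSP's handling of such sites may differ.

```python
from itertools import combinations


def nucleotide_diversity(sequences: list) -> float:
    """Mean pairwise number of nucleotide differences per site (pi)."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    seqs = [seq.upper() for seq in sequences]
    # Keep only columns in which every sequence has an unambiguous base.
    sites = [i for i in range(length) if all(seq[i] in "ACGT" for seq in seqs)]
    if not sites:
        raise ValueError("no comparable sites")

    pairs = list(combinations(seqs, 2))
    total_differences = sum(
        sum(a[i] != b[i] for i in sites) for a, b in pairs
    )
    return total_differences / (len(pairs) * len(sites))
```

Because π is expressed per site, values from regions of different length, such as the 579 bp partial RT/RNaseH fragment and longer regions, can be compared directly.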

Badnavirus isolates
A total of 127 full-length genomes were obtained from GenBank (Table 1), comprising 53 different badnavirus species. Cacao swollen shoot virus (CSSV) was the badnavirus represented by the highest number of sequences/isolates (12), while 25 species were represented by only one sequence (Table 1).

Species demarcation
Pairwise sequence comparisons for the partial (579 bp) and full (1020 bp) RT/RNaseH regions, widely used for badnavirus species identification, showed percent nucleotide identities among species ranging from 57.2 to 82.4% and from 57.1 to 83.8%, respectively. Therefore, some comparisons exceeded the currently used 80% nucleotide identity criterion for species demarcation in the genus Badnavirus (Geering and Hull, 2012).
The Sweet potato badnavirus B (SPBVB) and Sweet potato pakakuy virus (SPPV) isolates shared 81.9% and 83.8% nucleotide identity for the partial and full RT/RNaseH data sets, respectively. For all badnaviruses represented by more than one sequence/isolate, percent nucleotide identities were higher than 80.0% within species.
When the ORFIII data set, which comprises the RT/RNaseH domains, was analyzed, SCBGAV showed the highest nucleotide identities, 80.3% and 79.8%, with BSOLV and BSCAV, respectively. However, SCBBBV, SCBIMV and SCBMOV shared up to 75.7% identity. For the complete genome, all pairwise comparisons between distinct badnaviruses were lower than 80% identity, with SCBGAV isolates showing the highest level of nucleotide identity (79.5%) with BSOLV and BSCAV isolates.

Phylogenetic relationship
In the complete genome Bayesian phylogenetic tree, the three badnavirus clusters (badnavirus groups 1, 2 and 3) described by Muller et al. (2011) were observed (Figure 1). Additionally, a fourth clade can be observed, here named badnavirus group 4 (Figure 1). Clusters 1 and 3 formed sister groups, composed predominantly of badnaviruses infecting sugarcane and banana (Figure 1). Similar results were observed for the ORFIII data set (Figure 2), which represents ~80% of the complete genome. These results reinforce the idea that badnaviruses infecting sugarcane and banana are closely related, as indicated by the pairwise comparisons and recombination analyses (see below).
In the phylogenetic trees for the RT/RNaseH data sets (full and partial), the badnavirus groups are still evident, but with many topological incongruences and very low resolution (SFigure 1 and SFigure 2). Some species in badnavirus group 2 (mainly the cacao-infecting isolates) clustered with isolates in groups 3 and 4, while the other group 2 isolates clustered with badnaviruses in group 1 (SFigure 1). However, statistical support was considerably lower for trees based on RT/RNaseH sequences than for those based on ORFIII and the complete genome, indicating that the RT/RNaseH sequences have insufficient phylogenetic signal.

Recombination events
Non-tree-like evolution analysis for the complete genome, ORF III and RT/RNaseH revealed evidence of putative recombination events affecting the evolution of badnaviruses, with both intra- and interspecies recombination being observed (Figure 3, Figure 4, SFigure 3 and SFigure 4). These events were more pronounced among badnaviruses infecting sugarcane and banana (phylogenetic groups 1 and 3), and cacao (phylogenetic group 2). Besides recombination, a strong effect of mutation on the diversification of members of this genus was evident, as indicated by the long branches associated with many isolates (Figure 3, Figure 4, SFigure 3 and SFigure 4). To investigate putative parental sequences and recombination breakpoints, the complete genome data set was analyzed using the RDP4 package. Based on a stringent set of criteria, at least 23 independent recombination events were detected among badnavirus isolates, with ten of them involving species associated with sugarcane and banana (Table 2). Additional recombination events involved viruses infecting citrus, cacao, black pepper, pineapple and grapevine. Most recombination breakpoints were located in ORF III and in the intergenic region (Table 2).
Footnotes to Table 2: * Numbering starts at the 5' end of the minus-strand primer-binding site and increases clockwise. (?), breakpoint could not be precisely pinpointed. † R, RDP; G, GeneConv; B, Bootscan; M, MaxChi; C, Chimaera; S, SisScan; 3, 3SEQ. ‡ The reported P values are for the methods indicated in red, and they are the lowest P values calculated for the region in question. ^ The recombinant sequence may have been misidentified (one of the identified parents might be the recombinant).
In the present study, nucleotide sequence comparisons of a large sequence data set showed that the threshold of ≥80% identity for the RT/RNaseH genomic region [partial (579 bp) or full (1020 bp)] allowed the species demarcation of most badnavirus isolates with full-length genome sequences available in GenBank. However, as previously reported, it was not possible to distinguish a few badnaviruses from banana and sugarcane (Muller et al., 2011; Karuppaiah et al., 2013; Silva et al., 2015), even when the entire ORF III (which corresponds to ~80% of the badnavirus genome) was analyzed.
We suggest a few alternatives to solve the identification problems observed for badnaviruses infecting banana and sugarcane. First, if the criterion of ≥80% nucleotide sequence identity for the RT/RNaseH genomic region is maintained, closely related banana streak virus (BSV) and sugarcane bacilliform virus (SCBV) species sharing ≥80% nucleotide sequence identity should be considered different strains of the same species. Taking into account biological aspects such as host range (Geering and Hull, 2012), cross infection involving sugarcane- and banana-infecting badnaviruses has been observed (Lockhart and Autrey, 1988; Bouhida et al., 1993; Jones and Lockhart, 1993), which reinforces the idea that SCBV and BSV isolates sharing more than 80% identity may belong to the same viral species. Additionally, the isolates SPPV (accession FJ560943) and SPBVB (accession FJ560944), previously reported as distinct species, must belong to the same badnavirus species, as they share more than 80% identity for RT/RNaseH and the same host range, besides their close phylogenetic relationship. Second, a new threshold for species demarcation based on nucleotide identity of the RT/RNaseH domain could be established. In our analysis, the badnavirus species shared no more than 82.5% nucleotide identity for the RT/RNaseH domain, and therefore higher values could be adopted as the new threshold for species demarcation. Third, considering the inability of the currently established criteria to differentiate some badnavirus species, full-length genome sequences could be used for taxonomy.
Phylogenetic analysis showed clear intermixing of SCBV and BSV isolates in at least two distinct badnavirus clusters (i.e., badnavirus groups 1 and 3 sensu Muller et al., 2011). The close genetic relationship among sugarcane- and banana-infecting badnaviruses, and the polyphyletic structure of these viruses (Gayral and Iskra-Caruana, 2009; Muller et al., 2011), strongly support the hypothesis of a host shift, although it is not possible to determine whether sugarcane or banana was the original host (Gayral and Iskra-Caruana, 2009; Muller et al., 2011).
The phylogenetic relationships observed for all data sets analyzed, in which SCBV and BSV isolates are grouped in sister clusters, agree with the results of pairwise sequence comparisons. Besides the three well-defined badnavirus groups reported by Muller et al. (2011), a fourth phylogenetic cluster is proposed here, named badnavirus group 4. In the phylogeny based on the full RT/RNaseH region, groups 1 and 3 (both composed predominantly of sugarcane- and banana-infecting viruses) are more distantly related, with group 3 being closer to group 4. However, in phylogenies based on either ORFIII or complete genomes, a close relationship between groups 1 and 3 was evident, so they can be considered sister groups. These results reinforce the hypothesis that sugarcane- and banana-infecting badnaviruses have a common evolutionary history.
The high levels of nucleotide diversity for the partial RT/RNaseH region in some badnavirus species (BSUAV, BSULV, BSUMV, PYMoV, SPBVB, SCBIMV and DBALV) are comparable to values observed for RNA viruses (Guimarães et al., 2015). However, this variability is lower than that of less conserved regions [ORFs I and II and the intergenic region (IGR)], with the IGR showing the highest genetic diversity (0.468), followed by ORFII (0.367; Sharma et al., 2015). These results suggest that, although the partial RT/RNaseH region is variable in different badnavirus populations, non-conserved regions tend to have higher nucleotide diversity.
The high genetic variability observed for badnaviruses has been attributed to error-prone replication by their reverse transcriptase (Bousalem et al., 2008). Reverse transcriptases (RT) are known to produce errors in retroviruses and retroelements for which the fidelity rates have been estimated (Svarovskaia et al., 2003). Although fidelity rates of caulimovirid RTs have not been estimated, it is believed that the lack of proofreading activity contributes to the high mutation rates observed for these viruses (Svarovskaia et al., 2003). Nevertheless, the contribution of recombination to the genetic diversity of badnaviruses must also be considered (Govind et al., 2014).
One of the basic assumptions for successful recombination is the occurrence of mixed infections, with the presence of the viruses in the same host cell (Zhou et al., 1997; García-Andrés et al., 2006; Graham et al., 2010). Cross infection of badnaviruses seems to be an uncommon event; however, it has been reported for viruses infecting sugarcane and banana, which could explain the putative recombinant origin of SCBV and BSV isolates (Lockhart and Autrey, 1988; Bouhida et al., 1993; Sharma et al., 2015; Bath et al., 2016). Here, besides affecting the evolution of SCBV and BSV isolates, putative recombination also seems to be an important evolutionary mechanism for the diversification of cacao-infecting badnaviruses.
Nucleotide sequence comparisons of the partial RT/RNaseH region are sufficient for species demarcation of most badnaviruses currently known. However, the ≥80% nucleotide identity threshold alone is unable to differentiate all SCBV and BSV species and should be reviewed. Finally, mutation and putative recombination events may be involved in the high levels of genetic variability and diversity observed in Badnavirus.

Table 1. Full-length badnavirus sequences retrieved from the non-redundant GenBank database in Nov 2017.

Table 2. Putative recombination events detected within badnavirus isolates, based on complete genome sequences.

Table 3. Genetic variability of the RT/RNaseH region of badnavirus populations infecting distinct hosts.