Comparative Phylogenetic and Residue Analysis of Hepatitis C Virus E1 Protein from the Middle East and North Africa Region

authors:

Muhammad Umar Sohail 1, Asma A Al Thani 2, Hadi Mohamad Yassine 2, *

Biomedical Research Center, Qatar University, Doha, Qatar
Department of Biomedical Sciences, College of Health Sciences, QU Health, Qatar University, Doha, Qatar

how to cite: Sohail M U, Al Thani A A, Yassine H M. Comparative Phylogenetic and Residue Analysis of Hepatitis C Virus E1 Protein from the Middle East and North Africa Region. Hepat Mon. 2019;19(8):e92437. https://doi.org/10.5812/hepatmon.92437.

Abstract

Hepatitis C virus (HCV) is a major public health problem in the Middle East and North Africa (MENA) region, with an estimated 15 million or more chronically infected patients. However, the circulating genotypes in the MENA region remain poorly characterized at the molecular level. Here, we performed a comparative phylogenomic analysis of the E1 gene sequences available to date (937), originating from eight countries in the MENA region. All HCV E1 sequences from the MENA region present in NCBI were retrieved and cataloged by year and country of origin. Phylogenetic analysis revealed the greatest diversity of genotypes and subtypes in Saudi Arabia [G-1 (1a, 1b, 1g), G-2 (2a, 2c), G-3 (3a), and G-4 (4a, 4d, 4n, 4o, 4r, 4s)], followed by Egypt [G-1 (1b, 1g) and G-4 (4a, 4l, 4n, 4m, 4u)], Iran [G-1 (1b), G-3 (3a), and G-6 (6a)], Tunisia [G-1 (1b) and G-2 (2a, 2b, 2c)], Algeria [G-1 (1i), G-4 (4f), and G-5 (5a)], Pakistan [G-1 (1a) and G-3 (3a, 3b)], Afghanistan [G-1 (1a) and G-3 (3a)], and Yemen [G-4 (4r)]. The calculated evolution rate of the retrieved sequences was 1.601 × 10⁻³ substitutions/site/year and the mean nucleotide diversity was 0.2684 (P < 0.001). The mean ratio of non-synonymous to synonymous substitutions (dN/dS) was higher in genotypes 2 and 4 than in genotypes 1 and 3. A higher degree of nucleotide identity in the E1 gene was found between subtypes 1a and 1b, between 2c and 2g, and among 4a, 4d, and 4o. Comparative residue analysis of the E1 epitope sequences recognized by the previously reported H111, A4, and A6 monoclonal antibodies showed relatively poor, genotype-specific conservancy. Notably, none of the reported epitope sequences had an immunogenicity score higher than 0.4 (the minimum threshold for vaccine sequence prediction). Furthermore, these epitope sequences were heavily glycosylated at amino acid sites 196, 209, and 234 in all genotypes.
In conclusion, high genetic variability in the E1 protein, coupled with heavy glycosylation, may promote antigenic heterogeneity and subsequent escape from vaccine-generated immune responses, which underscores the need for targeted interventions for disease management and control.

1. Context

Hepatitis C virus (HCV) is a pathogen of global health significance. Africa and Asia are the most affected regions, with the highest prevalence rates, and over 15 million individuals are chronically infected with HCV in the Middle East and North Africa (MENA) region alone (1). About 80% of acute hepatitis C cases progress to chronic infection, and 10% - 20% of these develop complications characterized by chronic inflammation, liver cirrhosis, and hepatocellular carcinoma (2). The virus replicates rapidly, which leads to high genetic diversity, and is classified into seven genotypes and more than 67 subtypes (3). Country-specific genotype distribution in the MENA region was estimated previously by Mahmud et al. (1). The authors report that the pooled mean proportion of genotype 1 is higher in Algeria, Morocco, Iran, Tunisia, and Bahrain, whereas genotype 3 mainly prevails in Pakistan and Afghanistan. Genotype 4 predominates in Egypt, Saudi Arabia, Jordan, and Palestine.

The virus has a 9.6-kb RNA genome that encodes three structural proteins (core, E1, and E2) and seven non-structural proteins (p7, NS2, NS3, NS4A, NS4B, NS5A, and NS5B). The greatest variability is observed in the genes encoding the envelope glycoproteins E1 and E2 (4). These structural glycoproteins facilitate virus entry into the host cell and escape from the host immune system through different adaptive mechanisms. They bind surface receptors on hepatic cells and participate in endocytosis and virion formation (5). Envelope protein E2 has been extensively characterized and was initially assumed to be responsible for virus binding and fusion with host cells (6). In contrast, recent work proposes that E1, alone or in combination with E2, may be involved in virus fusion (7). A conserved hydrophobic motif (CSALYVGDLC, residues 272-281) in E1 is believed to be associated with the virion fusion process (8). E1 and E2 are type I transmembrane proteins that are released from the polyprotein by signal peptidase cleavage and are linked by disulfide and covalent bonds into the E1/E2 heterodimer (9). After cleavage, the E1 protein is targeted to the endoplasmic reticulum lumen, where it is modified by N-glycosylation. All HCV genotypes carry 4 to 5 potential N-glycosylation sites on the E1 protein, which play important roles in protein folding, virion morphogenesis, and immune escape (10, 11).

The high glycosylation rate of the E1 protein significantly limits the immunogenicity of HCV and restricts the binding of antibodies to their epitopes. Chronic infections arise through high mutation rates in the envelope proteins and the formation of glycosylation-associated cellular aggregates and viral quasispecies that help the virus escape the immune system (12). Recently, the proposed HCV E1 crystal structure and in vitro analysis of the N-terminal domain (192-270) revealed a conserved protein domain organization encompassing N-glycosylation sites, antigenic sites defined by neutralizing monoclonal antibodies, and direct drug binding sites (13). Limited data are available on E1 polymorphism, mutagenesis, and immunogenicity from large sequence sets analyzed with in silico bioinformatics tools. Here, we performed these analyses on 937 HCV E1 sequences that originate from the MENA region. Such knowledge is important not only to elucidate ongoing evolution in HCV but also to facilitate interventions for effective control of HCV through novel vaccine and antiviral drug approaches. Although countries in the MENA region have the highest HCV prevalence rates, relatively little attention has been paid to HCV genome characterization in this region. In this study, the E1 protein was chosen because sequences from the MENA region are available in the GenBank database and because it has been less explored at the molecular level.

2. Materials and Methods

2.1. E1 Sequences

HCV E1 gene sequences from the MENA region were downloaded from the GenBank database maintained by NCBI (14). The search combined the following queries: “Hepatitis C Virus”, “HCV”, “envelope protein”, “E1 protein”, and the country name. The search returned 1,199 complete or partial E1 sequences: Saudi Arabia (446), Egypt (428), Tunisia (119), Iran (57), Pakistan (112), Afghanistan (29), Yemen (5), and Algeria (3). All sequences were aligned against the HCV genotype 1a reference strain H77 (coordinates 915 - 1491) using MAFFT v7.310 (15), and the alignments were clipped to codon positions 192 - 300 of the E1 protein, corresponding to nucleotides 915 - 1240 of the H77 reference genome. Partial sequences shorter than 300 bp or lacking the N-terminal domain were removed, leaving 937 sequences for further analysis. These sequences were then analyzed systematically with different bioinformatics tools (Figure 1).
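The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary of already-clipped aligned sequences, and the 30-column window used to decide whether the N-terminal end is present are all assumptions made for the example.

```python
GAP_CHARS = "-."  # characters treated as alignment gaps (assumption)


def ungapped_length(seq: str) -> int:
    """Number of nucleotide characters, ignoring alignment gaps."""
    return sum(1 for ch in seq if ch not in GAP_CHARS)


def has_n_terminus(seq: str, window: int = 30) -> bool:
    """True if the first `window` alignment columns contain any residue.

    The window size is a hypothetical choice; the paper does not state how
    a missing N-terminal domain was detected.
    """
    return any(ch not in GAP_CHARS for ch in seq[:window])


def filter_records(records: dict, min_len: int = 300, window: int = 30) -> dict:
    """Keep aligned sequences of at least `min_len` bases that cover the N-terminus."""
    return {
        name: seq
        for name, seq in records.items()
        if ungapped_length(seq) >= min_len and has_n_terminus(seq, window)
    }
```

Applied to the clipped alignment, a filter of this kind would reproduce the reduction from 1,199 retrieved sequences to the 937 retained, provided the same two criteria are used.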

Figure 1. Flowchart of the bioinformatics analysis performed on the HCV E1 protein. Blue boxes show the work performed and green ovals show the software used.

2.2. Sequence Alignment and Phylogenetic Analysis

For phylogenetic analysis, we randomly selected 278 sequences that represented all genotypes, years, and countries of isolation. The phylogenetic tree was constructed with MEGA® V. 7.0 software (16) using the distance-based neighbor-joining method with 1,000 bootstrap replicates. Reference genotype sequences for the tree were obtained from the https://hcv.lanl.gov database. All assembled sequences were further aligned with genotype-specific reference sequences using the multiple sequence alignment methods in MAFFT for analysis of E1 nucleotide mutations and amino acid substitutions. BioEdit® V. 5.0.6 (17) software was used to identify nucleotide mutations and for comparative residue analysis of all the sequences.
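One way to draw a random subset that still covers every genotype, year, and country is stratified sampling. The sketch below is an assumption about how such a selection could be made; the paper does not describe the exact procedure, and the record layout, function name, and fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum=1, seed=0):
    """Randomly pick up to `per_stratum` records from every stratum.

    `key` maps a record to its stratum, e.g. (genotype, country, year).
    A fixed seed keeps the selection reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for stratum in sorted(strata):
        group = strata[stratum]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Because every stratum contributes at least one record, rare subtypes represented by a single sequence are never dropped by chance.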

2.3. Evolution Rate and Site-Specific Selection Pressure

The evolution rate of the HCV E1 gene was estimated using the Bayesian Markov chain Monte Carlo (MCMC) algorithm implemented in BEAST V. 1.8.4 (18). Virus isolation dates were used to calibrate a strict molecular clock under the HKY substitution model. The XML file was generated in BEAUti with gamma-distributed site heterogeneity and a chain length of 200 million, echoed every 1,000 states. Convergence of the MCMC sampling was evaluated by the estimated effective sample sizes (ESS ≥ 200). The sampling prior and mean clock rate were estimated in Tracer V. 1.7. To elucidate patterns of adaptive evolution in the E1 protein, the single-likelihood ancestor counting (SLAC) model in the web-based Datamonkey suite was used. Site-specific selection pressure was analyzed as the ratio of non-synonymous (dN) to synonymous (dS) nucleotide substitutions per site (19). A codon position with a dN/dS value greater than 1 at a P value of 0.05 was considered a positively selected site. Nucleotide diversity and the average number of pairwise nucleotide differences were determined using DnaSP V. 6.12.01 (20), together with Tajima’s D test of neutrality.
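To make the polymorphism statistics concrete, the following sketch computes per-site nucleotide diversity, the number of segregating sites, and Tajima's D from a gap-free alignment using the standard formulas from Tajima (1989). It is an illustration only: DnaSP handles gaps, ambiguity codes, and multiple mutations per site differently, so its output will not match this simplified version exactly.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return (nucleotide diversity per site, segregating sites, Tajima's D).

    Assumes equal-length sequences without gaps or ambiguity codes.
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("Tajima's D needs at least 4 sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")

    # S: alignment columns with more than one distinct base
    seg_sites = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    # k: average number of pairwise differences
    n_pairs = n * (n - 1) / 2
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in combinations(seqs, 2))
    k = total / n_pairs
    pi = k / length
    if seg_sites == 0:
        return pi, 0, float("nan")

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return pi, seg_sites, (k - seg_sites / a1) / sqrt(variance)
```

A negative D indicates an excess of rare variants relative to neutral expectation, which can arise from selection but also from population expansion, so the sign alone does not distinguish between the two.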

2.4. Comparative Residue Analysis

Residue diversity and/or conservancy at each amino acid site was analyzed with the online WebLogo program (21). Pairwise sequence identity was analyzed with 1,000 bootstrap replicates using MEGA 7 (16). Residue analysis of key motifs recognized as monoclonal antibody epitope sites was performed for all genotypes. Epitope sequences were compared with consensus references. Furthermore, residue conservancy of these motifs was assessed using the epitope conservancy analysis program implemented in the Immune Epitope Database and Analysis Resource (IEDB) (22). The program calculated epitope conservancy as a percentage of the number of polymorphic sites over the epitope length. Epitope immunogenicity was assessed against HLA class I alleles using the IEDB epitope analysis tool (23). Finally, a literature review was conducted to collect available information on amino acid mutations that may affect the efficacy of direct-acting antiviral drugs targeting the E1 protein, which identified the following eight residue sites: T213A, W239, I262A, D263-, Q289H, M267V, F291I, and Y297H (7, 24, 25). Amino acid substitutions at these sites were compared between genotypes and reported in a graph as percentages (Figure 5).
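The conservancy columns reported later (percent conservancy, minimum and maximum identity) can be illustrated with a small sketch. It compares an epitope against the same aligned window in each sequence, which is a simplification: the IEDB tool itself searches each protein for the best-matching window, and the function names, the default offset of 192, and the 100% identity threshold here are assumptions for the example.

```python
def percent_identity(epitope: str, window: str) -> float:
    """Percentage of epitope positions matched by the aligned window."""
    matches = sum(a == b for a, b in zip(epitope, window))
    return 100.0 * matches / len(epitope)


def epitope_conservancy(epitope, start, sequences, seq_start=192, threshold=100.0):
    """Summarize how well an epitope is conserved across aligned sequences.

    `start` is the epitope's first residue number and `seq_start` the residue
    number of the first character of every sequence (192 for the clipped E1
    fragment described in the text).
    """
    offset = start - seq_start
    identities = []
    for seq in sequences:
        window = seq[offset:offset + len(epitope)]
        if len(window) == len(epitope):  # skip sequences too short to cover the epitope
            identities.append(percent_identity(epitope, window))
    if not identities:
        raise ValueError("no sequence covers the epitope")
    conserved = sum(i >= threshold for i in identities)
    return {
        "percent_conserved": 100.0 * conserved / len(identities),
        "min_identity": min(identities),
        "max_identity": max(identities),
    }
```

For an 11-residue epitope, identities computed this way fall on multiples of 1/11 (for example 72.73%, 81.82%, 90.91%).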

Figure 2. Percentage nucleotide identity and divergence of the HCV E1 gene, analyzed using the pairwise distance maximum composite likelihood algorithm implemented in MEGA 7. Percent nucleotide identity is given below the diagonal and percent divergence above the diagonal.

2.5. Prediction of Glycosylation Sites

The NetNGlyc online tool was used to predict N-linked glycosylation (26). The default settings were used to predict N-X-T/S sequons. Only sites with a score above the default threshold of 0.5 and a jury agreement of 9/9 were considered potential glycosylation sites.
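The N-X-T/S sequon itself is easy to locate with a regular expression. The sketch below only finds candidate sequons (with the standard exclusion of proline at position X); it does not reproduce the neural-network scoring or jury agreement of NetNGlyc. The function name and the default numbering offset of 192 are assumptions for the example.

```python
import re

# N followed by any residue except proline, then S or T.
# The lookahead lets overlapping sequons be reported separately.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(protein: str, first_residue: int = 192) -> list:
    """Return residue numbers of asparagines in N-X-S/T sequons.

    `protein` is an ungapped amino acid string whose first character
    corresponds to residue number `first_residue`.
    """
    return [m.start() + first_residue for m in SEQUON.finditer(protein.upper())]
```

For the genotype 1a reference fragment starting at residue 192, `find_sequons("YQVRNSSGLYHVTNDC")` returns `[196]`, the first of the E1 sites discussed in the text.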

3. Results

3.1. Phylogenetic and Evolutionary Analysis

Phylogenetic analysis of the HCV E1 gene revealed a distinct, genotype-specific distribution pattern: all E1 sequences clustered in the clades of their respective genotypes. Most of the retrieved sequences belonged to genotypes 1, 2, 3, and 4; subtypes 5a and 6a were each represented by a single sequence, from Algeria and Iran, respectively. Overall, 19 genotypes (subtypes) clustered in the phylogenetic tree, as presented in Figure 3. The analysis illustrates the diverse distribution of HCV genotypes and subtypes in the MENA region. Subtype 4a was the most abundant, followed by 3a and 2c. The greatest diversity of genotypes and subtypes was observed in Saudi Arabia [G-1 (1a, 1b, 1g), G-2 (2a, 2c), G-3 (3a), and G-4 (4a, 4d, 4n, 4o, 4r, 4s)], followed by Egypt [G-1 (1b, 1g) and G-4 (4a, 4l, 4n, 4m, 4u)], Iran [G-1 (1b), G-3 (3a), and G-6 (6a)], Tunisia [G-1 (1b) and G-2 (2a, 2b, 2c)], Algeria [G-1 (1i), G-4 (4f), and G-5 (5a)], Pakistan [G-1 (1a) and G-3 (3a, 3b)], Afghanistan [G-1 (1a) and G-3 (3a)], and Yemen [G-4 (4r)].

Figure 3. Comparative phylogenetic analysis of 278 HCV E1 sequences divides the sequences into genotype-specific clusters. Each branch label contains the NCBI accession number, country, genotype name, and year of sample isolation. A few branch labels lack a genotype description because the genotype record was missing at NCBI.

The mean evolution rate of the HCV E1 gene in the MENA region was 1.601 × 10⁻³ substitutions/site/year (95% HPD interval: 1.122 × 10⁻³ to 2.081 × 10⁻³), and the sampling mean prior rate was -2.055 × 10³. Based on the individual coding gene sequences, the average nucleotide diversity in the E1 gene was 0.268. On average, 454 mutation sites (maximum: 956 for subtype 4a; minimum: 36 for subtype 4l) were observed across genotypes, with an average Tajima’s D of -1.067 (Table 1) (27). Tajima’s D was negative for every genotype, which we interpret as suggestive of positive selection on the HCV E1 gene, although none of the values reached statistical significance (P > 0.10). The Datamonkey analysis of coding DNA sequences (CDS) for dN/dS is summarized in Tables 2 and 3. Most of the E1 CDS had a SLAC-based mean dN/dS ratio of less than 1 (P < 0.05). The highest dN-dS values at positively selected sites were observed in subtypes 4a and 4d, followed by 2a, 4u, 1b, and 2c. Subtype 4a also had the most positively selected sites (five), followed by 2a and 4d (four each) and 1b (three).

Table 1.

Brief Summary of the DnaSP-Derived Polymorphism Analysis for HCV E1 Gene [a]

Genotype | Number of Mutations | Nucleotide Diversity | Standard Deviation | Tajima’s D
1a | 393 | 0.26919 | 0.06676 | -1.28820
1b | 513 | 0.16632 | 0.02228 | -1.73551
1g | 257 | 0.22298 | 0.03579 | -0.70035
2a | 804 | 0.37273 | 0.04727 | -1.26680
2c | 752 | 0.32333 | 0.03277 | -0.96337
2g | 68 | 0.09485 | 0.01732 | -0.58834
3a | 321 | 0.28047 | 0.09067 | -1.50896
4a | 956 | 0.25293 | 0.01373 | -1.13219
4d | 612 | 0.29852 | 0.03532 | -0.90753
4o | 225 | 0.29330 | 0.09484 | -0.8886 [b]
4r | 531 | 0.27851 | 0.06631 | -1.45598
4u | 645 | 0.39274 | 0.05652 | -0.36978

Table 2.

Determination of Synonymous and Non-Synonymous Substitution Rates Using SLAC Method for Genotypes 1-3 of Envelope Glycoprotein E1 Gene Circulating in the MENA Region [a]

Genotypes | 1a | 1b | 1g | 2a | 2c | 2g | 3a
Mean dN/dS | 0.1585 | 0.2249 | 0.19876 | 0.2379 | 0.2313 | 0.2133 | 0.1466
Positive selection sites | 0 | 3 | 0 | 4 | 2 | 0 | 0
Codon position | 0 | 28, 44, 106 | 0 | 8, 40, 73, 108 | 58, 73 | 0 | 0
dN-dS value | 0 | 1.8525, 2.0785, 1.4124 | 0 | 4.1352, 2.8212, 3.2354, 3.3597 | 1.1095, 1.3131 | 0 | 0
P value | 0 | 0.0863, 0.0447, 0.0995 | 0 | 0.0004, 0.0129, 0.0007, 0.0103 | 0.0781, 0.0313 | 0 | 0
Negative selection sites | 70 | 63 | 22 | 88 | 64 | 27 | 22

Table 3.

Determination of Synonymous and Non-Synonymous Substitution Rates Using SLAC Method for Genotype 4 of Envelope Glycoprotein E1 Gene Circulating in the MENA Region [a]

Genotypes | 4a | 4d | 4o | 4r | 4u
Mean dN/dS | 0.1997 | 0.1669 | 0.2588 | 0.1672 | 0.1314
Positive selection sites | 5 | 4 | 0 | 0 | 2
Codon position | 12, 41, 44, 57, 65 | 12, 41, 44, 86 | 0 | 0 | 58, 60
dN-dS value | 26.0757, 7.0615, 8.8971, 8.0157, 16.8035 | 8.3964, 16.5213, 21.0521, 4.4983 | 0 | 0 | 2.9202, 3.4037
P value | 0.0012, 3.9927e-10, 1.758e-08, 0.00016, 0.0243 | 0.0093, 0.0044, 0.0005, 0.0833 | 0 | 0 | 0.0814, 0.04218
Negative selection sites | 11 | 17 | 93 | 23 | 41

3.2. Comparative Residue Analysis and Sequence Conservancy

Residue diversity and/or conservancy at all amino acid sites is presented as WebLogo plots (Figure 4). WebLogo produces a consensus display from the input sequences, in which the residues observed at each position are drawn as a stack of letters. The height of each letter within a stack is proportional to the relative frequency of that residue at that position. The pairwise analysis of genetic distances shows a higher degree of nucleotide identity in the E1 gene between subtypes 1a and 1b, between 2c and 2g, and among 4a, 4d, and 4o (Figure 2).
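How a logo stack is built from one alignment column can be sketched as follows. In the usual sequence-logo convention, the total stack height is the information content of the column in bits, and each letter takes a share proportional to its frequency. This simplified version omits the small-sample correction that WebLogo applies, and the function name and gap characters are assumptions for the example.

```python
from collections import Counter
from math import log2


def logo_column(column: str, alphabet_size: int = 20) -> dict:
    """Letter heights (bits) for one alignment column of a protein logo.

    Stack height = log2(alphabet_size) - Shannon entropy of the column;
    each letter's height is its relative frequency times the stack height.
    Gap characters are ignored.
    """
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return {}
    counts = Counter(residues)
    total = sum(counts.values())
    freqs = {res: n / total for res, n in counts.items()}
    entropy = -sum(f * log2(f) for f in freqs.values())
    stack_height = log2(alphabet_size) - entropy
    return {res: f * stack_height for res, f in freqs.items()}
```

A fully conserved position therefore yields a single tall letter, whereas a position split evenly between two residues yields two shorter letters of equal height.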

Figure 4. Residue analysis of the HCV E1 protein. After multiple sequence alignment, the diversity and/or conservation of residues at each position was analyzed with WebLogo 3.1 (21). Vertical rectangles highlight the epitope regions. Owing to a software limitation, genotypes with fewer than 3 sequences (3b, 4l, 4m, 4n, 4g) were excluded.

The relative conservancy of the E1 protein epitope sequences is presented in Table 4. Subtypes 2a and 4m had the lowest conservancy of all the genotypes. The highest conservancy (100%) of the H111 monoclonal antibody epitope (192YQVRNSSGLYH202) was observed in subtypes 1a and 3a, while multiple substitutions were observed in all other genotypes (Table 5). Similarly, the epitopes of the other two reported monoclonal antibodies, A4 (197SSGLYHVTNDC207) and A6 (230VREGNASRCW239), showed relatively high sequence conservancy in 1a compared with the rest of the genotypes. A high degree of conservancy among genotypes was observed for the 257QLRRHIDLLV266 epitope (this epitope was predicted in the immune epitope database (28) and has not been reported in the literature before). In addition to the epitope motifs, the classical 226CXXC229 motif that mediates the isomerization of disulfide bonds in E1 during virus entry (29) is also highly conserved among all genotypes (Table 5). The percentage substitution at key residue sites was also determined. Direct drug binding sites that trigger resistance against direct-acting antivirals were mostly conserved across genotypes (Figure 5).

Table 4.

Epitope Conservancy and Immunogenicity Analysis of Envelope Glycoprotein E1 for Each Genotype [a]

Genotype | Epitope Sequence | Percent Conservancy (Median) | Min Identity, % | Max Identity, % | Immunogenicity
1a192YQVRNSSGLYH20210090.00100.000.23502
197SSGLYHVTDC207070.0080.000.11232
230VREGNASRCW23992.350.00100.00-0.04545
257QLRRHIDLLV26610090.91100.00-0.24264
1b192YQVRNSSGLYH20269.8977.78100.000.08
197SSGLYHVTDC20779.4872.00100.000.25706
230VREGNASRCW23910.9840.00100.00-0.169
257QLRRHVDLLV26695.970.00100.000.08034
1g192YQVRNSSGLYH20255.670.0090.000.12609
197SSGLYHVTDC20777.880.0090.00-0.31315
230 VREGNASRCW 23910090.91100.000.25706
257QLRRHIDLLV26677.872.73100.000.08034
2a192VQVRNTSDSYM202030.0070.000.0495
197TSDSYMVTNDC207027.2781.82-0.325
230VRTGNKSRCW2393.230.0090.00-0.31315
257SLRRHVDLMV266045.4572.730.25706
2c192 VEVRNTSTSYM2021.840.00100.000.07943
197TSTSYMATNDC20742.960.00100.000.19978
230VRTGNKSRCW 23976.872.73100.00-0.246
257SLRRHVDLMV26633.963.64100.00-0.16936
3a192VEVKNNSDTYM20276.254.55100.000.01476
197NSDTYMVDLLV20790.570.00100.000.0495
230VRTGNKSRCW 23995.272.73100.000.06968
257SLRRHVDLMV266040.0080.00-0.07535
4a192VHYRNVSGIYH2027840.00100.000.0495
197VSGIYHVTNDC2073.527.2790.910.065
230VRTGNKSRCW23910.430.0090.00-0.27415
257SLRRHVDLMV26654.627.27100.000.25706
4d192VHYRNVSGIYH20298.3680.00100.00-0.02337
197VSGIYHVTNDC20714.7550.0090.00-0.14895
230VRTGNKSRCW23957.3872.7390.910.18971
257SLRRHVDLMV266036.3663.64-0.22298
4l192VHYRNVSGIYH20210090.91100.000.25028
197VSGIYHVTNDC20710090.00100.000.0495
230VRTGNKSRCW23910090.9190.91-0.0248
257SLRRHVDLMV266080.0080.00-0.19419
4m192VHYRNVSGIYH2027581.8290.910.25028
197VSGIYHVTNDC2072560.0090.000.0495
230VRTGNKSRCW239081.8281.82-0.0248
257SLRRHVDLMV266060.0080.000.25706
4n192VHYRNVSGIYH202072.7381.82-0.05824
197VSGIYHVTNDC20733.381.8290.910.25408
230VRTGNKSRCW239070.0080.00-0.21958
257SLRRHVDLMV26633.350.00100.00-0.31315
4o192VHYRNVSGIYH2020.581.8290.910.15514
197VSGIYHVTNDC2077581.8290.910.34646
230VRTGNKSRCW23910090.00100.000.0495
257SLRRHVDLMV2662570.0090.00-0.293
4r192VHYRNVSGIYH20210090.91100.000.34646
197VSGIYHVTNDC20794.181.82100.000.06364
230VRTGNKSRCW23910090.00100.000.0495
257SLRRHVDLMV26635.370.0090.00-0.19419
4u192VHYRNVSGIYH202100100.00100.00-0.17642
197VSGIYHVTNDC20710090.91100.000.00437
230VRTGNKSRCW23910090.00100.000.16386
257SLRRHVDLMV2662570.00100.00-0.06237

Table 5.

Comparative Residue Analysis of Envelope Glycoprotein E1 for Important Motifs [a]

Genotype/SubtypeYearCountryN-Terminal Domain Motifs
1
Reference2002USA192YQVRNSSGLYH202197SSGLYHVTNDC207226CXXC229230VREGNASRCW239257QLRRHIDLLV266
1a2011KSA……T…..T……..….…..T….……….
2011IRN………....……..….…..S.K..……….
2010IRN………....……..….…..S.K..……….
2008IRN……T…..T……..….……….……….
2007IRN………....……..….…..S.K..……….
1993EGP………....……..….…..S.K..……….
1b2011KSA.EE..V..EF.V..EF….N.….……….TI…V…G
2010IRN.E…A..V....V…….….……….TI…V….
2008IRN.E…A..V..A..V…….….……….TI…V….
2007IRNFE…A..M.QA..M.Q…..….…N.S….TI…V….
2000TUN.E…V..A..V..A…….….…..Y….TI…V….
2000KSA.E…V..A..V..A…….….…..T….TI…V….
1993EGP.E…V..A..V..A…….….…..R.Q..TI…V….
1g2011KSA.KI..V..I..V..I…….….…..V….DV…V….
2002EGP.EI..V..I..V..I…….….……….DV…V….
1993EGP.EI..V..I..V..I…….….…..V….DV…V….
2
Reference2011CA192VEVKNNSDTYM202197NSDTYMATNDC207226CXXC229230EREGNNSRCW239257GLRAHIDIIV266
2a2011SA.Q…T.NS..T.NS..V….…..N…T….…….V..
2009TUNA….T.Q…T.Q……..…..D…T….…T…L..
2008TUN…R.T.Q…T.Q……..…..KDN.T...…T…L..
2007TUN…R.T.Q…T.Q……..….…..T….…T…A..
2006TUNA….T.EL.IT.EL.I…..…...KD.E….…S.V….
2005TUN.Q…TTTS..TTTS…….…..LK..S.F..…T…T..
2004TUN…..T.Q…T.Q…V….…..LV..K.L..…T…L..
2003TUN…..T.Q…T.Q…V….…..SVNNV….…T…L..
2c2004TUN…R.T.I…T.I……..…...I..V….…T…T..
2003TUN....T.VL..T.VL…….…..QT..V….…T…T..
2g2011KSA..IR.I.NS..I.NS…….…..RI..V….…….V..
2009TUN….NT..S..NT..S…….…..QI..V…...….A..
2008TUN…..T.KS..T.KS…….…..RN..V….…….V..
2007TUN…..T.NS..T.NS…….…...T..V….……….
2005TUN…..T.TS..T.TS…….…..KLD.V….…….V..
2004TUN…..T..S..T..S…….….EQI..I….…….V..
2003TUN…..T.EL..T.EL…….…...S..G.W..…….V..
3
Reference2012CA192LEYRNSSGLYV202197SSGLYVLTNDC207226CXXC229230VRKGNTSQCW239257SLRSHVDLMV266
3a2011KSA..W..T…..T……….…..Q…..M...I.G….L.
2016PAK..W..T…..T........AR…..QTG…K...I.G….L.
2011IRN..W..T…..T……….…..QD….T...V.R….L.
2008IRN..W..T…..T……….…..QD….T...I……L.
3bPAK…..T…..T……….…..PCVT.G.K...I.N….L.
4
Reference2011UK192VHYRNVSGIYH202197VSGIYHVTNDC207226CXXC229230VRTGNKSRCW239257SLRRHVDLMV266
4a2011KSA.N…A…..A…..I….….…..L….…S……
2012EGPIN………………..…..K…Q….…S……
2013EGPIN…A..V..A..V…….…...V..Q.S..…S……
2006EGPTN…A..V..A..V…….….……….…S……
2003EGPTN………………..…...S..Q….…S…..G
2002EGPTN……………I….…...E..Q….…S……
1993EGPVN…I..V..I..V…….…...V..Q….…S……
1993EGPIN………………..…..RE..Q….…S……
4d2011KSAYN…S..V..S..V..V….…...V….T..……….
4l2011EGPI….A.DV..A.DV..V….…..KV..R.Q..…K……
4m2002EGPI.…A..V..A..V..V….….…..V….E..H…ML.
1993EGPA….A..V..A..V…….…..K…V….A……ML.
4n2011KSAI.H..S…..S……….…...S..V….……….
4o2011KSAI..H.T…..T……….…...V..I….……….
2002EGPI….T…..T……….….V.E…….…Q…...
4r2011YEME….A…..A……….…..K…V…..F……..
2011KSAE….A…..A……….…...T..V…..F……..
1994YEME….A…..A……….…..K…V…..F……..
4f2000ALG…H.T..V..T..V…….….…..R.Q...V……..
5
Reference2011USA192VHYRNVSGIYH202197VSGIYHITNDC207226CXXC229230VRKGNKSRCW239257PLRRHVDLLA266
5a2009ALG.P…A..V..A..V…….….…D.V….….A..Y..
6
Reference2010CN192LTYGNSSGLYH202197SSGLYHLTNDC207226CXXC229230VKVDNQSTCW239257GFRRHVDLLA266
6a2011IRN………..………..….……….……….
Figure 5. Residue analysis at direct drug binding sites (*T213A, *W239, #I262A, #D263-, #Q289H, #M267V, #F291I, #Y297H). Mutations at these residue sites confer resistance to direct-acting antiviral drugs (flunarizine, phenothiazines, pimozide, ferroquine, and aminoquinoline-derivative molecules). Results are stated as the percentage change at each residue site. *Residue mutations associated with the virus entry cycle (7). #Residue mutations associated with drug resistance (7, 24, 25).

3.3. Epitope Conservancy and Immunogenicity Analysis

In silico immune responses of cytotoxic T lymphocytes (CTLs) to previously recognized E1 protein epitopes were studied. Epitope immunogenicity predicted with the IEDB MHC class I analysis tools was poor: all scores were below 0.4 (the threshold for producing strong neutralizing antibodies against HCV E1 protein). The percentage conservancy and immunogenicity predicted in IEDB are presented in Table 4. The percentage substitution at key residue sites was also determined. Direct drug binding sites that trigger resistance against direct-acting antivirals (flunarizine, phenothiazines, pimozide, ferroquine, and aminoquinoline-derivative molecules) (7, 24, 25) showed high variability in different genotypes. Mutations within epitope sequences may therefore influence their immunogenic potential.

3.4. N-Linked Glycosylation Prediction

HCV E1 is a heavily glycosylated protein that possesses four to five conserved glycosylation sites (196, 209, 234, 305, and 325) in all genotypes. We used the NetNGlyc 1.0 web server to predict N-glycosylation of the HCV E1 protein sequences originating from the MENA region. Regardless of origin or genotype, all E1 sequences were heavily glycosylated. Because our sequences covered only amino acids 192-300, we could analyze only the first three glycosylation sites (196, 209, and 234) in all genotypes (Table 6). Owing to the uneven genotype- and country-specific sample sizes, no appropriate statistical model could be applied.

Table 6.

N-Glycosylation at Different Amino Acid Sites in the Envelope Glycoprotein E1 Circulating in the MENA Region [a]

GenotypeAA Site (Percentage per Site)b
Saudi ArabiaEgyptIranTunisiaYemenAfghanistanPakistanAlgeria
Nc196209234N196209234N196209234N196209234N196209234N196209234N196209234196209234
1a11010092010002009500055021000
1b407878511001000540600309790300000
1g21001001007718643000000
1i00000010
3a7145700120338.30022536387.8187.80
3b0000005060400
4a27236864250678012000000
4d614.9932641007550000000
4f00000001100100
4n300670000000
4o11001001003100670000000
4r1292337500051000100000
4s20000000
2a3010000019166.6430000
2c51000100005601.7520000
4l03010033000000
4m045010075000000
4u0205855000000
5a00000001100
6a00100000

4. Discussion

Hepatitis C virus (HCV) is a major public health problem worldwide, with more than 2.5% of the world population (177.5 million adults) infected (2). Asia (2.8%) and Africa (2.9%) have the highest prevalence rates (2). Globally, 170 million individuals are chronically infected, of whom 15 million live in the MENA region (30). Despite the high HCV prevalence in the MENA region, very few complete genome sequences are available from it, and molecular characterization of the circulating strains is also limited. HCV genotyping is generally performed by sequencing the 5’-untranslated region, the hypervariable region of the envelope protein, or the NS5 protein (31). In the present study, we analyzed 937 partial HCV E1 sequences from different countries in the MENA region, chosen because of the functional significance of E1 and the relatively high availability of E1 sequences. To our knowledge, this is the first study of E1 polymorphism, adaptive mutations, and evolutionary rates and dynamics in this region, and of their possible impact on virus immunogenicity.

Genotype 4 is the most abundant circulating genotype in Saudi Arabia and Egypt, with more than half of the sequences belonging to subtype 4a. Although subtypes 1a and 3a are the most abundant worldwide (2), only a few E1 sequences for these subtypes are available in the NCBI nucleotide repository for the MENA region. Based on the available E1 sequences, phylogenetic analysis revealed the highest diversity in genotype 4, which clustered into several subtypes (4a, 4d, 4f, 4m, 4n, 4o, 4r, 4s, and 4u). The mean evolution rate of E1 was 1.601 × 10⁻³ substitutions/site/year, which is comparable to previous estimates (32, 33). A phylogenetic and evolution rate analysis of E1 reported previously from China indicated that the E1 protein is under positive selection (32). Similarly, we observed positive selection for E1, with an average nucleotide diversity of 0.268 and a Tajima’s D of -1.067.

The positively selected sites were scattered across the entire N-terminal domain (NTD) in subtypes 4a, 4d, 2a, 1b, 4u, and 2c. The NTD is thought to be exposed on the protein surface and may therefore act as a target for host antibody responses (33). The E1 protein is known to contain several monoclonal antibody epitopes on the NTD with potent neutralizing effect (34-36). In the present sequences, high residue divergence and poor immunogenicity scores support the concept of immune-driven evolution at these epitope sites (37). Both HCV genotype and host genetics may determine epitope immunogenicity. Keck et al. (35) reported that the H111 antibody epitope (192YQVRNSSGLYH202) is highly conserved in subtypes 1a, 1b, 2b, and 3a and that the antibody blocks HCV virion attachment to host cells. Contrary to these findings, we observed poor conservancy and immunogenicity of this epitope across the HCV genotypes circulating in the MENA region. Such epitope studies are mostly performed on genotype 1 peptide sequences, which may not align with those of other genotypes (35). Antibodies specific for one genotype may therefore neutralize other HCV genotypes poorly and compromise the efficiency of a vaccine. Owing to the uneven number of sequences per genotype, no statistical trend in epitope conservancy and immunogenicity could be established among genotypes in this study. Notably, all glycosylation sites (n = 783) predicted here fall within these antibody epitopes. These glycosylations might mask epitopes from host antibody responses and enhance protein folding and virion formation (38).

In conclusion, the current study analyzes the HCV E1 protein for genotype distribution and polymorphism in the MENA region. The study identifies high genetic diversity and polymorphism in the HCV E1 protein, despite sparse and uneven genotype-specific sequence coverage in the region. Most of the sequences were reported from Saudi Arabia and Egypt and belong to genotype 4. The variability in glycosylation and residue mutation scores among genotypes correlates with the number of sequences in each genotype; for example, 4a has more than 500 sequences, whereas subtypes 4m to 4u have only a few sequences, which show fewer glycosylation sequons and lower mutation scores. Virus pathogenicity is nevertheless genotype-specific, since genotype 1 responds poorly to interferon therapy compared with genotypes 2 and 3. The development of direct-acting antiviral drugs can significantly improve responses to interferon therapy. We could not determine specific markers of pathogenicity in the E1 protein because of the limited protein sequence data available in the literature. The study elucidates E1 protein polymorphism that tends to evade the host immune response and resist direct-acting antivirals. High genetic variability in the E1 protein, together with superimposed glycosylation, enhances virus polymorphism and immune escape and greatly complicates the development of an effective vaccine or drug against the E1 protein of the studied genotypes and their many subtypes. Antigenically, the E1 protein evolves quickly, with correspondingly high rates of positive selection, as inferred from Tajima’s D test and the MCMC analysis. The genetic diversity of HCV E1 reflects these observed changes; determining the functional implications of these mutations would shed light on their specific roles in virus evolution and pathogenicity. Further efforts are required for comprehensive genome analysis that may support effective control of HCV infection in the region.

References

  • 1.

    Mahmud S, Al-Kanaani Z, Chemaitelly H, Chaabna K, Kouyoumjian SP, Abu-Raddad LJ. Hepatitis C virus genotypes in the Middle East and North Africa: Distribution, diversity, and patterns. J Med Virol. 2018;90(1):131-41. [PubMed ID: 28842995]. [PubMed Central ID: PMC5724492]. https://doi.org/10.1002/jmv.24921.

  • 2.

    Petruzziello A, Marigliano S, Loquercio G, Cozzolino A, Cacciapuoti C. Global epidemiology of hepatitis C virus infection: An up-date of the distribution and circulation of hepatitis C virus genotypes. World J Gastroenterol. 2016;22(34):7824-40. [PubMed ID: 27678366]. [PubMed Central ID: PMC5016383]. https://doi.org/10.3748/wjg.v22.i34.7824.

  • 3.

    Appleby TC, Perry JK, Murakami E, Barauskas O, Feng J, Cho A, et al. Viral replication. Structural basis for RNA replication by the hepatitis C virus polymerase. Science. 2015;347(6223):771-5. [PubMed ID: 25678663]. https://doi.org/10.1126/science.1259210.

  • 4.

    Beljelarskaya SN, Orlova OV, Drutsa VL, Orlov VA, Timohova AV, Koroleva NN, et al. Hepatitis C virus: The role of N-glycosylation sites of viral genotype 1b proteins for formation of viral particles in insect and mammalian cells. Biochem Biophys Rep. 2016;7:98-105. [PubMed ID: 28955895]. [PubMed Central ID: PMC5613296]. https://doi.org/10.1016/j.bbrep.2016.05.019.

  • 5.

    Mazumdar B, Banerjee A, Meyer K, Ray R. Hepatitis C virus E1 envelope glycoprotein interacts with apolipoproteins in facilitating entry into hepatocytes. Hepatology. 2011;54(4):1149-56. [PubMed ID: 21735466]. [PubMed Central ID: PMC3184191]. https://doi.org/10.1002/hep.24523.

  • 6.

    Krey T, d'Alayer J, Kikuti CM, Saulnier A, Damier-Piolle L, Petitpas I, et al. The disulfide bonds in glycoprotein E2 of hepatitis C virus reveal the tertiary organization of the molecule. PLoS Pathog. 2010;6(2):e1000762. [PubMed ID: 20174556]. [PubMed Central ID: PMC2824758]. https://doi.org/10.1371/journal.ppat.1000762.

  • 7.

    Perin PM, Haid S, Brown RJ, Doerrbecker J, Schulze K, Zeilinger C, et al. Flunarizine prevents hepatitis C virus membrane fusion in a genotype-dependent manner by targeting the potential fusion peptide within E1. Hepatology. 2016;63(1):49-62. [PubMed ID: 26248546]. [PubMed Central ID: PMC4688136]. https://doi.org/10.1002/hep.28111.

  • 8.

    Garry RF, Dash S. Proteomics computational analyses suggest that hepatitis C virus E1 and pestivirus E2 envelope glycoproteins are truncated class II fusion proteins. Virology. 2003;307(2):255-65. [PubMed ID: 12667795]. https://doi.org/10.1016/s0042-6822(02)00065-x.

  • 9.

    Balasco N, Barone D, Sandomenico A, Ruggiero A, Doti N, Berisio R, et al. Structural versatility of hepatitis C virus proteins: Implications for the design of novel anti-HCV intervention strategies. Curr Med Chem. 2017;24(36):4081-101. [PubMed ID: 28482787]. https://doi.org/10.2174/0929867324666170508105544.

  • 10.

    Haddad JG, Rouille Y, Hanoulle X, Descamps V, Hamze M, Dabboussi F, et al. Identification of novel functions for hepatitis C virus envelope glycoprotein E1 in virus entry and assembly. J Virol. 2017;91(8). [PubMed ID: 28179528]. [PubMed Central ID: PMC5375667]. https://doi.org/10.1128/JVI.00048-17.

  • 11.

    Lavie M, Hanoulle X, Dubuisson J. Glycan shielding and modulation of hepatitis C virus neutralizing antibodies. Front Immunol. 2018;9:910. [PubMed ID: 29755477]. [PubMed Central ID: PMC5934428]. https://doi.org/10.3389/fimmu.2018.00910.

  • 12.

    Bukh J. The history of hepatitis C virus (HCV): Basic research reveals unique features in phylogeny, evolution and the viral life cycle with new perspectives for epidemic control. J Hepatol. 2016;65(1 Suppl):S2-S21. [PubMed ID: 27641985]. https://doi.org/10.1016/j.jhep.2016.07.035.

  • 13.

    Guest JD, Pierce BG. Computational modeling of hepatitis C virus envelope glycoprotein structure and recognition. Front Immunol. 2018;9:1117. [PubMed ID: 29892287]. [PubMed Central ID: PMC5985375]. https://doi.org/10.3389/fimmu.2018.01117.

  • 14.

    Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2005;33(Database issue):D34-8. [PubMed ID: 15608212]. [PubMed Central ID: PMC540017]. https://doi.org/10.1093/nar/gki063.

  • 15.

    Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008;9(4):286-98. [PubMed ID: 18372315]. https://doi.org/10.1093/bib/bbn013.

  • 16.

    Kumar S, Stecher G, Tamura K. MEGA7: Molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016;33(7):1870-4. [PubMed ID: 27004904]. https://doi.org/10.1093/molbev/msw054.

  • 17.

    Hall T, Biosciences I, Carlsbad C. BioEdit: An important software for molecular biology. GERF Bull Biosci. 2011;2(1):60-1.

  • 18.

    Bouckaert R, Heled J, Kuhnert D, Vaughan T, Wu CH, Xie D, et al. BEAST 2: A software platform for Bayesian evolutionary analysis. PLoS Comput Biol. 2014;10(4):e1003537. [PubMed ID: 24722319]. [PubMed Central ID: PMC3985171]. https://doi.org/10.1371/journal.pcbi.1003537.

  • 19.

    Delport W, Poon AF, Frost SD, Kosakovsky Pond SL. Datamonkey 2010: A suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics. 2010;26(19):2455-7. [PubMed ID: 20671151]. [PubMed Central ID: PMC2944195]. https://doi.org/10.1093/bioinformatics/btq429.

  • 20.

    Rozas J, Ferrer-Mata A, Sanchez-DelBarrio JC, Guirao-Rico S, Librado P, Ramos-Onsins SE, et al. DnaSP 6: DNA sequence polymorphism analysis of large data sets. Mol Biol Evol. 2017;34(12):3299-302. [PubMed ID: 29029172]. https://doi.org/10.1093/molbev/msx248.

  • 21.

    Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator. Genome Res. 2004;14(6):1188-90. [PubMed ID: 15173120]. [PubMed Central ID: PMC419797]. https://doi.org/10.1101/gr.849004.

  • 22.

    Kim Y, Sette A, Peters B. Applications for T-cell epitope queries and tools in the Immune Epitope Database and Analysis Resource. J Immunol Methods. 2011;374(1-2):62-9. [PubMed ID: 21047510]. [PubMed Central ID: PMC3041860]. https://doi.org/10.1016/j.jim.2010.10.010.

  • 23.

    Calis JJ, Maybeno M, Greenbaum JA, Weiskopf D, De Silva AD, Sette A, et al. Properties of MHC class I presented peptides that enhance immunogenicity. PLoS Comput Biol. 2013;9(10):e1003266. [PubMed ID: 24204222]. [PubMed Central ID: PMC3808449]. https://doi.org/10.1371/journal.pcbi.1003266.

  • 24.

    Vausselin T, Calland N, Belouzard S, Descamps V, Douam F, Helle F, et al. The antimalarial ferroquine is an inhibitor of hepatitis C virus. Hepatology. 2013;58(1):86-97. [PubMed ID: 23348596]. https://doi.org/10.1002/hep.26273.

  • 25.

    Vausselin T, Seron K, Lavie M, Mesalam AA, Lemasson M, Belouzard S, et al. Identification of a new benzimidazole derivative as an antiviral against Hepatitis C virus. J Virol. 2016;90(19):8422-34. [PubMed ID: 27412600]. [PubMed Central ID: PMC5021404]. https://doi.org/10.1128/JVI.00404-16.

  • 26.

    Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics. 2004;4(6):1633-49. [PubMed ID: 15174133]. https://doi.org/10.1002/pmic.200300771.

  • 27.

    Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123(3):585-95. [PubMed ID: 2513255]. [PubMed Central ID: PMC1203831].

  • 28.

    Ohno S, Moriya O, Yoshimoto T, Hayashi H, Akatsuka T, Matsui M. Immunogenic variation between multiple HLA-A*0201-restricted, hepatitis C virus-derived epitopes for cytotoxic T lymphocytes. Viral Immunol. 2006;19(3):458-67. [PubMed ID: 16987064]. https://doi.org/10.1089/vim.2006.19.458.

  • 29.

    Wahid A, Helle F, Descamps V, Duverlie G, Penin F, Dubuisson J. Disulfide bonds in hepatitis C virus glycoprotein E1 control the assembly and entry functions of E2 glycoprotein. J Virol. 2013;87(3):1605-17. [PubMed ID: 23175356]. [PubMed Central ID: PMC3554189]. https://doi.org/10.1128/JVI.02659-12.

  • 30.

    Mohd Hanafiah K, Groeger J, Flaxman AD, Wiersma ST. Global epidemiology of hepatitis C virus infection: New estimates of age-specific antibody to HCV seroprevalence. Hepatology. 2013;57(4):1333-42. [PubMed ID: 23172780]. https://doi.org/10.1002/hep.26141.

  • 31.

    Simmonds P, Bukh J, Combet C, Deleage G, Enomoto N, Feinstone S, et al. Consensus proposals for a unified system of nomenclature of hepatitis C virus genotypes. Hepatology. 2005;42(4):962-73. [PubMed ID: 16149085]. https://doi.org/10.1002/hep.20819.

  • 32.

    Lu L, Wang M, Xia W, Tian L, Xu R, Li C, et al. Migration patterns of hepatitis C virus in China characterized for five major subtypes based on samples from 411 volunteer blood donors from 17 provinces and municipalities. J Virol. 2014;88(13):7120-9. [PubMed ID: 24719413]. [PubMed Central ID: PMC4054444]. https://doi.org/10.1128/JVI.00414-14.

  • 33.

    Maurin G, Fresquet J, Granio O, Wychowski C, Cosset FL, Lavillette D. Identification of interactions in the E1E2 heterodimer of hepatitis C virus important for cell entry. J Biol Chem. 2011;286(27):23865-76. [PubMed ID: 21555519]. [PubMed Central ID: PMC3129168]. https://doi.org/10.1074/jbc.M110.213942.

  • 34.

    Dubuisson J, Hsu HH, Cheung RC, Greenberg HB, Russell DG, Rice CM. Formation and intracellular localization of hepatitis C virus envelope glycoprotein complexes expressed by recombinant vaccinia and Sindbis viruses. J Virol. 1994;68(10):6147-60. [PubMed ID: 8083956]. [PubMed Central ID: PMC237034].

  • 35.

    Keck ZY, Sung VM, Perkins S, Rowe J, Paul S, Liang TJ, et al. Human monoclonal antibody to hepatitis C virus E1 glycoprotein that blocks virus attachment and viral infectivity. J Virol. 2004;78(13):7257-63. [PubMed ID: 15194801]. [PubMed Central ID: PMC421663]. https://doi.org/10.1128/JVI.78.13.7257-7263.2004.

  • 36.

    Mesalam AA, Desombere I, Farhoudi A, Van Houtte F, Verhoye L, Ball J, et al. Development and characterization of a human monoclonal antibody targeting the N-terminal region of hepatitis C virus envelope glycoprotein E1. Virology. 2018;514:30-41. [PubMed ID: 29128754]. [PubMed Central ID: PMC5784761]. https://doi.org/10.1016/j.virol.2017.10.019.

  • 37.

    Liu L, Fisher BE, Dowd KA, Astemborski J, Cox AL, Ray SC. Acceleration of hepatitis C virus envelope evolution in humans is consistent with progressive humoral immune selection during the transition from acute to chronic infection. J Virol. 2010;84(10):5067-77. [PubMed ID: 20200239]. [PubMed Central ID: PMC2863818]. https://doi.org/10.1128/JVI.02265-09.

  • 38.

    Tong Y, Lavillette D, Li Q, Zhong J. Role of hepatitis C virus envelope glycoprotein E1 in virus entry and assembly. Front Immunol. 2018;9:1411. [PubMed ID: 29971069]. [PubMed Central ID: PMC6018474]. https://doi.org/10.3389/fimmu.2018.01411.