1. Context
2. Materials and Methods
2.1. E1 Sequences
2.2. Sequence Alignment and Phylogenetic Analysis
2.3. Evolution Rate and Site-Specific Selection Pressure
2.4. Comparative Residue Analysis
Percentage nucleotide identity and divergence of the HCV E1 region, calculated from pairwise distances under the maximum composite likelihood model implemented in MEGA 7. Percent nucleotide identity is given below the diagonal, and percent divergence is given above the diagonal.
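The layout of such an identity/divergence matrix can be illustrated with a minimal sketch. It assumes pre-aligned sequences and uses a plain proportion of differing sites (p-distance) rather than the maximum composite likelihood model of MEGA 7, so the numbers it produces are not comparable to those of the study; the function names are hypothetical.

```python
from itertools import combinations

def percent_identity(a: str, b: str) -> float:
    """Percent of gap-free aligned columns in which two sequences agree."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def identity_divergence_matrix(seqs: list[str]) -> list[list[float]]:
    """Identity below the diagonal, divergence (100 - identity) above it."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal is left as a placeholder
    for i, j in combinations(range(n), 2):  # always i < j
        ident = percent_identity(seqs[i], seqs[j])
        matrix[j][i] = ident           # row > column: below the diagonal
        matrix[i][j] = 100.0 - ident   # row < column: above the diagonal
    return matrix

# Toy, hypothetical sequences (not taken from the data set).
toy = ["ACGT", "ACGA", "TCGA"]
for row in identity_divergence_matrix(toy):
    print(row)
```

A model-based distance such as the one used in the study corrects for multiple substitutions at the same site, so it will generally exceed the raw proportion computed here for divergent pairs.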
2.5. Prediction of Glycosylation Sites
3. Results
3.1. Phylogenetic and Evolutionary Analysis
Comparative phylogenetic analysis of 278 HCV E1 protein sequences divides the sequences into genotype-specific clusters. Each branch label contains the NCBI accession number, country, genotype name, and year of sample isolation. A few branch labels lack a genotype description because the genotype record was missing at NCBI.
| Genotype | Number of Mutations | Nucleotide Diversity | Standard Deviation | Tajima’s D test |
|---|---|---|---|---|
| 1a | 393 | 0.26919 | 0.06676 | -1.28820 |
| 1b | 513 | 0.16632 | 0.02228 | -1.73551 |
| 1g | 257 | 0.22298 | 0.03579 | -0.70035 |
| 2a | 804 | 0.37273 | 0.04727 | -1.26680 |
| 2c | 752 | 0.32333 | 0.03277 | -0.96337 |
| 2g | 68 | 0.09485 | 0.01732 | -0.58834 |
| 3a | 321 | 0.28047 | 0.09067 | -1.50896 |
| 4a | 956 | 0.25293 | 0.01373 | -1.13219 |
| 4d | 612 | 0.29852 | 0.03532 | -0.90753 |
| 4o | 225 | 0.29330 | 0.09484 | -0.8886ᵇ |
| 4r | 531 | 0.27851 | 0.06631 | -1.45598 |
| 4u | 645 | 0.39274 | 0.05652 | -0.36978 |
ᵃAnalysis was not performed because too few sequences were available for Tajima’s D test.
ᵇStatistically significant (P < 0.001).
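For readers unfamiliar with the statistic, Tajima’s D compares the mean number of pairwise differences with the number of segregating sites scaled by sample size; negative values, as seen for all genotypes in the table, indicate an excess of rare variants. The sketch below follows the standard formula and assumes gap-free, equal-length aligned nucleotide sequences. It is not the implementation used in the study, and the function name is hypothetical.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs: list[str]) -> float:
    """Tajima's D for equal-length, gap-free aligned nucleotide sequences."""
    n = len(seqs)
    if n < 4:
        # The variance term is zero for n < 4, so D is undefined.
        raise ValueError("Tajima's D needs at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # S: number of segregating (polymorphic) alignment columns.
    s_sites = sum(len(set(col)) > 1 for col in zip(*seqs))
    if s_sites == 0:
        raise ValueError("no segregating sites; D is undefined")

    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    # Constants from Tajima's derivation.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * s_sites + e2 * s_sites * (s_sites - 1)
    return (pi - s_sites / a1) / sqrt(variance)

# Toy, hypothetical alignment in which every variant is a singleton.
print(tajimas_d(["TAAA", "ATAA", "AATA", "AAAT"]))
```

The requirement of at least four sequences follows from the formula itself: for smaller samples both variance constants vanish, which is consistent with excluding sparsely sampled genotypes from the test.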
| Genotype | Mean dN/dS | Positive selection sites | Codon position | dN-dS value | P value | Negative selection sites |
|---|---|---|---|---|---|---|
| 1a | 0.1585 | 0 | 0 | 0 | 0 | 70 |
| 1b | 0.2249 | 3 | 28, 44, 106 | 1.8525, 2.0785, 1.4124 | 0.0863, 0.0447, 0.0995 | 63 |
| 1g | 0.19876 | 0 | 0 | 0 | 0 | 22 |
| 2a | 0.2379 | 4 | 8, 40, 73, 108 | 4.1352, 2.8212, 3.2354, 3.3597 | 0.0004, 0.0129, 0.0007, 0.0103 | 88 |
| 2c | 0.2313 | 2 | 58, 73 | 1.1095, 1.3131 | 0.0781, 0.0313 | 64 |
| 2g | 0.2133 | 0 | 0 | 0 | 0 | 27 |
| 3a | 0.1466 | 0 | 0 | 0 | 0 | 22 |
ᵃThe Single-Likelihood Ancestor Counting (SLAC) method implemented in the Datamonkey online tool was used to identify dN (non-synonymous) and dS (synonymous) substitutions at a P value threshold of 0.1. SLAC uses a maximum-likelihood model to infer substitution rates at each site and reports the significance of selection as a P value. dN/dS > 1 indicates positive selection, while dN/dS < 1 indicates negative selection at a site. Analysis was performed only on genotypes with ≥ 3 sequences.
| Genotype | Mean dN/dS | Positive selection sites | Codon position | dN-dS value | P value | Negative selection sites |
|---|---|---|---|---|---|---|
| 4a | 0.1997 | 5 | 12, 41, 44, 57, 65 | 26.075, 77.061, 58.897, 18.0157, 16.8035 | 0.0012, 3.9927e-10, 1.758e-08, 0.00016, 0.0243 | 111 |
| 4d | 0.1669 | 4 | 12, 41, 44, 86 | 8.3964, 16.5213, 21.0521, 4.4983 | 0.0093, 0.0044, 0.0005, 0.0833 | 79 |
| 4o | 0.2588 | 0 | 0 | 0 | 0 | 3 |
| 4r | 0.1672 | 0 | 0 | 0 | 0 | 23 |
| 4u | 0.1314 | 2 | 58, 60 | 2.9202, 3.4037 | 0.0814, 0.04218 | 41 |
ᵃThe Single-Likelihood Ancestor Counting (SLAC) method implemented in the Datamonkey online tool was used to identify dN (non-synonymous) and dS (synonymous) substitutions at a P value threshold of 0.1. SLAC uses a maximum-likelihood model to infer substitution rates at each site and reports the significance of selection as a P value. dN/dS > 1 indicates positive selection, while dN/dS < 1 indicates negative selection at a site. Analysis was performed only on genotypes with ≥ 3 sequences.
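The decision rule described in the footnote can be sketched as follows. The sketch assumes that a site is called only when its P value is below the 0.1 threshold and that the sign of dN-dS decides the direction (a positive difference corresponds to dN/dS > 1 whenever dS > 0). The function names are hypothetical; the two 4u sites come from the table, while the last two entries are invented for illustration.

```python
from collections import Counter

def classify_site(dn_minus_ds: float, p_value: float, alpha: float = 0.1) -> str:
    """Label one codon as 'positive', 'negative', or 'neutral'."""
    if p_value >= alpha or dn_minus_ds == 0:
        return "neutral"
    return "positive" if dn_minus_ds > 0 else "negative"

def summarize(sites: list[tuple[int, float, float]]) -> Counter:
    """Count site classes for (codon, dN-dS, P value) records."""
    return Counter(classify_site(diff, p) for _, diff, p in sites)

sites = [
    (58, 2.9202, 0.0814),   # genotype 4u, from the table
    (60, 3.4037, 0.04218),  # genotype 4u, from the table
    (99, -1.5, 0.02),       # hypothetical site under negative selection
    (100, 0.8, 0.4),        # hypothetical non-significant site
]
print(summarize(sites))
```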
3.2. Comparative Residue Analysis and Sequence Conservancy
Residue analysis of HCV E1 protein. Multiple sequence alignments were constructed, and the diversity and conservation of residues at each position were analyzed with WebLogo 3.1 (21). Vertical rectangles highlight the epitope regions. Owing to a software limitation, genotypes with fewer than 3 sequences (3b, 4l, 4m, 4n, 4g) were excluded.
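The quantity displayed by a sequence logo is the per-position information content, which is high for conserved columns and low for variable ones. A minimal sketch of that calculation is given below, assuming a 20-letter amino acid alphabet and omitting the small-sample correction that logo software may apply; the function name and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2

def column_information(alignment: list[str], alphabet_size: int = 20) -> list[float]:
    """Information content (bits) per alignment column: log2(A) - entropy."""
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    info = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        if not residues:
            info.append(0.0)  # an all-gap column carries no information
            continue
        total = len(residues)
        entropy = -sum(
            (count / total) * log2(count / total)
            for count in Counter(residues).values()
        )
        info.append(log2(alphabet_size) - entropy)
    return info

# Toy alignment: first column fully conserved, second split evenly.
print(column_information(["YQ", "YE", "YE", "YQ"]))
```

A fully conserved column reaches the maximum of log2(20) bits, and each additional equally frequent residue lowers the value, which is why tall letter stacks in a logo mark conserved positions.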
| Genotype | Epitope Sequence | Percent Conservancy: Median | Percent Conservancy: Min Identity, % | Percent Conservancy: Max Identity, % | Immunogenicity |
|---|---|---|---|---|---|
| 1a | 192YQVRNSSGLYH202 | 100 | 90.00 | 100.00 | 0.23502 |
| | 197SSGLYHVTDC207 | 0 | 70.00 | 80.00 | 0.11232 |
| | 230VREGNASRCW239 | 92.3 | 50.00 | 100.00 | -0.04545 |
| | 257QLRRHIDLLV266 | 100 | 90.91 | 100.00 | -0.24264 |
| 1b | 192YQVRNSSGLYH202 | 69.89 | 77.78 | 100.00 | 0.08 |
| | 197SSGLYHVTDC207 | 79.48 | 72.00 | 100.00 | 0.25706 |
| | 230VREGNASRCW239 | 10.98 | 40.00 | 100.00 | -0.169 |
| | 257QLRRHVDLLV266 | 95.9 | 70.00 | 100.00 | 0.08034 |
| 1g | 192YQVRNSSGLYH202 | 55.6 | 70.00 | 90.00 | 0.12609 |
| | 197SSGLYHVTDC207 | 77.8 | 80.00 | 90.00 | -0.31315 |
| | 230VREGNASRCW239 | 100 | 90.91 | 100.00 | 0.25706 |
| | 257QLRRHIDLLV266 | 77.8 | 72.73 | 100.00 | 0.08034 |
| 2a | 192VQVRNTSDSYM202 | 0 | 30.00 | 70.00 | 0.0495 |
| | 197TSDSYMVTNDC207 | 0 | 27.27 | 81.82 | -0.325 |
| | 230VRTGNKSRCW239 | 3.2 | 30.00 | 90.00 | -0.31315 |
| | 257SLRRHVDLMV266 | 0 | 45.45 | 72.73 | 0.25706 |
| 2c | 192VEVRNTSTSYM202 | 1.8 | 40.00 | 100.00 | 0.07943 |
| | 197TSTSYMATNDC207 | 42.9 | 60.00 | 100.00 | 0.19978 |
| | 230VRTGNKSRCW239 | 76.8 | 72.73 | 100.00 | -0.246 |
| | 257SLRRHVDLMV266 | 33.9 | 63.64 | 100.00 | -0.16936 |
| 3a | 192VEVKNNSDTYM202 | 76.2 | 54.55 | 100.00 | 0.01476 |
| | 197NSDTYMVDLLV207 | 90.5 | 70.00 | 100.00 | 0.0495 |
| | 230VRTGNKSRCW239 | 95.2 | 72.73 | 100.00 | 0.06968 |
| | 257SLRRHVDLMV266 | 0 | 40.00 | 80.00 | -0.07535 |
| 4a | 192VHYRNVSGIYH202 | 78 | 40.00 | 100.00 | 0.0495 |
| | 197VSGIYHVTNDC207 | 3.5 | 27.27 | 90.91 | 0.065 |
| | 230VRTGNKSRCW239 | 10.4 | 30.00 | 90.00 | -0.27415 |
| | 257SLRRHVDLMV266 | 54.6 | 27.27 | 100.00 | 0.25706 |
| 4d | 192VHYRNVSGIYH202 | 98.36 | 80.00 | 100.00 | -0.02337 |
| | 197VSGIYHVTNDC207 | 14.75 | 50.00 | 90.00 | -0.14895 |
| | 230VRTGNKSRCW239 | 57.38 | 72.73 | 90.91 | 0.18971 |
| | 257SLRRHVDLMV266 | 0 | 36.36 | 63.64 | -0.22298 |
| 4l | 192VHYRNVSGIYH202 | 100 | 90.91 | 100.00 | 0.25028 |
| | 197VSGIYHVTNDC207 | 100 | 90.00 | 100.00 | 0.0495 |
| | 230VRTGNKSRCW239 | 100 | 90.91 | 90.91 | -0.0248 |
| | 257SLRRHVDLMV266 | 0 | 80.00 | 80.00 | -0.19419 |
| 4m | 192VHYRNVSGIYH202 | 75 | 81.82 | 90.91 | 0.25028 |
| | 197VSGIYHVTNDC207 | 25 | 60.00 | 90.00 | 0.0495 |
| | 230VRTGNKSRCW239 | 0 | 81.82 | 81.82 | -0.0248 |
| | 257SLRRHVDLMV266 | 0 | 60.00 | 80.00 | 0.25706 |
| 4n | 192VHYRNVSGIYH202 | 0 | 72.73 | 81.82 | -0.05824 |
| | 197VSGIYHVTNDC207 | 33.3 | 81.82 | 90.91 | 0.25408 |
| | 230VRTGNKSRCW239 | 0 | 70.00 | 80.00 | -0.21958 |
| | 257SLRRHVDLMV266 | 33.3 | 50.00 | 100.00 | -0.31315 |
| 4o | 192VHYRNVSGIYH202 | 0.5 | 81.82 | 90.91 | 0.15514 |
| | 197VSGIYHVTNDC207 | 75 | 81.82 | 90.91 | 0.34646 |
| | 230VRTGNKSRCW239 | 100 | 90.00 | 100.00 | 0.0495 |
| | 257SLRRHVDLMV266 | 25 | 70.00 | 90.00 | -0.293 |
| 4r | 192VHYRNVSGIYH202 | 100 | 90.91 | 100.00 | 0.34646 |
| | 197VSGIYHVTNDC207 | 94.1 | 81.82 | 100.00 | 0.06364 |
| | 230VRTGNKSRCW239 | 100 | 90.00 | 100.00 | 0.0495 |
| | 257SLRRHVDLMV266 | 35.3 | 70.00 | 90.00 | -0.19419 |
| 4u | 192VHYRNVSGIYH202 | 100 | 100.00 | 100.00 | -0.17642 |
| | 197VSGIYHVTNDC207 | 100 | 90.91 | 100.00 | 0.00437 |
| | 230VRTGNKSRCW239 | 100 | 90.00 | 100.00 | 0.16386 |
| | 257SLRRHVDLMV266 | 25 | 70.00 | 100.00 | -0.06237 |
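How conservancy and identity ranges of this kind can be derived is sketched below. The sketch assumes that each epitope is compared against every same-length window of each protein sequence, that the best window defines the identity for that sequence, and that conservancy is the share of sequences reaching a chosen identity threshold (100% here). This is an illustration of the idea rather than the tool used in the study; the function names and the flanking residues in the toy sequences are hypothetical.

```python
def best_identity(epitope: str, sequence: str) -> float:
    """Highest percent identity of the epitope over all ungapped windows."""
    k = len(epitope)
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the epitope")
    best = 0
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        matches = sum(a == b for a, b in zip(epitope, window))
        best = max(best, matches)
    return 100.0 * best / k

def conservancy(epitope: str, sequences: list[str], threshold: float = 100.0) -> dict:
    """Share of sequences at or above the threshold, plus the identity range."""
    identities = [best_identity(epitope, s) for s in sequences]
    conserved = sum(identity >= threshold for identity in identities)
    return {
        "percent_conserved": 100.0 * conserved / len(identities),
        "min_identity": min(identities),
        "max_identity": max(identities),
    }

# Epitope 230-239 of genotype 1a and a variant carrying two substitutions;
# the "AA" flanks are placeholders, not real E1 residues.
toy = ["AAVREGNASRCWAA", "AAVREGNSSKCWAA"]
print(conservancy("VREGNASRCW", toy))
```

Lowering the threshold below 100% would count near-identical variants as conserved, which is one way an epitope can show a low conservancy value while its minimum identity remains high.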
| Genotype/Subtype | Year | Country | N-Terminal Domain Motifs | ||||
|---|---|---|---|---|---|---|---|
| 1 | |||||||
| Reference | 2002 | USA | 192YQVRNSSGLYH202 | 197SSGLYHVTNDC207 | 226CXXC229 | 230VREGNASRCW239 | 257QLRRHIDLLV266 |
| 1a | 2011 | KSA | ……T…. | .T…….. | …. | …..T…. | ………. |
| 2011 | IRN | ……….. | ..…….. | …. | …..S.K.. | ………. | |
| 2010 | IRN | ……….. | ..…….. | …. | …..S.K.. | ………. | |
| 2008 | IRN | ……T…. | .T…….. | …. | ………. | ………. | |
| 2007 | IRN | ……….. | ..…….. | …. | …..S.K.. | ………. | |
| 1993 | EGP | ……….. | ..…….. | …. | …..S.K.. | ………. | |
| 1b | 2011 | KSA | .EE..V..EF. | V..EF….N. | …. | ………. | TI…V…G |
| 2010 | IRN | .E…A..V.. | ..V……. | …. | ………. | TI…V…. | |
| 2008 | IRN | .E…A..V.. | A..V……. | …. | ………. | TI…V…. | |
| 2007 | IRN | FE…A..M.Q | A..M.Q….. | …. | …N.S…. | TI…V…. | |
| 2000 | TUN | .E…V..A.. | V..A……. | …. | …..Y…. | TI…V…. | |
| 2000 | KSA | .E…V..A.. | V..A……. | …. | …..T…. | TI…V…. | |
| 1993 | EGP | .E…V..A.. | V..A……. | …. | …..R.Q.. | TI…V…. | |
| 1g | 2011 | KSA | .KI..V..I.. | V..I……. | …. | …..V…. | DV…V…. |
| 2002 | EGP | .EI..V..I.. | V..I……. | …. | ………. | DV…V…. | |
| 1993 | EGP | .EI..V..I.. | V..I……. | …. | …..V…. | DV…V…. | |
| 2 | |||||||
| Reference | 2011 | CA | 192VEVKNNSDTYM202 | 197NSDTYMATNDC207 | 226CXXC229 | 230EREGNNSRCW239 | 257GLRAHIDIIV266 |
| 2a | 2011 | SA | .Q…T.NS.. | T.NS..V…. | …. | .N…T…. | …….V.. |
| 2009 | TUN | A….T.Q… | T.Q…….. | …. | .D…T…. | …T…L.. | |
| 2008 | TUN | …R.T.Q… | T.Q…….. | …. | .KDN.T... | …T…L.. | |
| 2007 | TUN | …R.T.Q… | T.Q…….. | …. | …..T…. | …T…A.. | |
| 2006 | TUN | A….T.EL.I | T.EL.I….. | …. | ..KD.E…. | …S.V…. | |
| 2005 | TUN | .Q…TTTS.. | TTTS……. | …. | .LK..S.F.. | …T…T.. | |
| 2004 | TUN | …..T.Q… | T.Q…V…. | …. | .LV..K.L.. | …T…L.. | |
| 2003 | TUN | …..T.Q… | T.Q…V…. | …. | .SVNNV…. | …T…L.. | |
| 2c | 2004 | TUN | …R.T.I… | T.I…….. | …. | ..I..V…. | …T…T.. |
| 2003 | TUN | ....T.VL.. | T.VL……. | …. | .QT..V…. | …T…T.. | |
| 2g | 2011 | KSA | ..IR.I.NS.. | I.NS……. | …. | .RI..V…. | …….V.. |
| 2009 | TUN | ….NT..S.. | NT..S……. | …. | .QI..V…. | ..….A.. | |
| 2008 | TUN | …..T.KS.. | T.KS……. | …. | .RN..V…. | …….V.. | |
| 2007 | TUN | …..T.NS.. | T.NS……. | …. | ..T..V…. | ………. | |
| 2005 | TUN | …..T.TS.. | T.TS……. | …. | .KLD.V…. | …….V.. | |
| 2004 | TUN | …..T..S.. | T..S……. | …. | EQI..I…. | …….V.. | |
| 2003 | TUN | …..T.EL.. | T.EL……. | …. | ..S..G.W.. | …….V.. | |
| 3 | |||||||
| Reference | 2012 | CA | 192LEYRNSSGLYV202 | 197SSGLYVLTNDC207 | 226CXXC229 | 230VRKGNTSQCW239 | 257SLRSHVDLMV266 |
| 3a | 2011 | KSA | ..W..T….. | T………. | …. | .Q…..M.. | .I.G….L. |
| 2016 | PAK | ..W..T….. | T........AR | …. | .QTG…K.. | .I.G….L. | |
| 2011 | IRN | ..W..T….. | T………. | …. | .QD….T.. | .V.R….L. | |
| 2008 | IRN | ..W..T….. | T………. | …. | .QD….T.. | .I……L. | |
| 3b | PAK | …..T….. | T………. | …. | .PCVT.G.K.. | .I.N….L. | |
| 4 | |||||||
| Reference | 2011 | UK | 192VHYRNVSGIYH202 | 197VSGIYHVTNDC207 | 226CXXC229 | 230VRTGNKSRCW239 | 257SLRRHVDLMV266 |
| 4a | 2011 | KSA | .N…A….. | A…..I…. | …. | …..L…. | …S…… |
| 2012 | EGP | IN……… | ……….. | …. | .K…Q…. | …S…… | |
| 2013 | EGP | IN…A..V.. | A..V……. | …. | ..V..Q.S.. | …S…… | |
| 2006 | EGP | TN…A..V.. | A..V……. | …. | ………. | …S…… | |
| 2003 | EGP | TN……… | ……….. | …. | ..S..Q…. | …S…..G | |
| 2002 | EGP | TN……… | ……I…. | …. | ..E..Q…. | …S…… | |
| 1993 | EGP | VN…I..V.. | I..V……. | …. | ..V..Q…. | …S…… | |
| 1993 | EGP | IN……… | ……….. | …. | .RE..Q…. | …S…… | |
| 4d | 2011 | KSA | YN…S..V.. | S..V..V…. | …. | ..V….T.. | ………. |
| 4l | 2011 | EGP | I….A.DV.. | A.DV..V…. | …. | .KV..R.Q.. | …K…… |
| 4m | 2002 | EGP | I.…A..V.. | A..V..V…. | …. | …..V…. | E..H…ML. |
| 1993 | EGP | A….A..V.. | A..V……. | …. | .K…V…. | A……ML. | |
| 4n | 2011 | KSA | I.H..S….. | S………. | …. | ..S..V…. | ………. |
| 4o | 2011 | KSA | I..H.T….. | T………. | …. | ..V..I…. | ………. |
| 2002 | EGP | I….T….. | T………. | …. | V.E……. | …Q…... | |
| 4r | 2011 | YEM | E….A….. | A………. | …. | .K…V…. | .F…….. |
| 2011 | KSA | E….A….. | A………. | …. | ..T..V…. | .F…….. | |
| 1994 | YEM | E….A….. | A………. | …. | .K…V…. | .F…….. | |
| 4f | 2000 | ALG | …H.T..V.. | T..V……. | …. | …..R.Q.. | .V…….. |
| 5 | |||||||
| Reference | 2011 | USA | 192VHYRNVSGIYH202 | 197VSGIYHITNDC207 | 226CXXC229 | 230VRKGNKSRCW239 | 257PLRRHVDLLA266 |
| 5a | 2009 | ALG | .P…A..V.. | A..V……. | …. | …D.V…. | ….A..Y.. |
| 6 | |||||||
| Reference | 2010 | CN | 192LTYGNSSGLYH202 | 197SSGLYHLTNDC207 | 226CXXC229 | 230VKVDNQSTCW239 | 257GFRRHVDLLA266 |
| 6a | 2011 | IRN | ……….. | ……….. | …. | ………. | ………. |
Abbreviations: ALG, Algeria; CA, Canada; CN, China; EGP, Egypt; IRN, Iran; KSA, Kingdom of Saudi Arabia; PAK, Pakistan; TUN, Tunisia; USA, United States of America; YEM, Yemen.
ᵃSelected isolates from each genotype are presented in this table to represent genotype-specific residue mutations in the genotypes reported from the MENA region. Only genotypes 1i and 4u are omitted, because their reference genomes were unavailable in the www.hcv.lanl.gov/ database. All motif positions are numbered according to the whole-genome sequence of HCV. Identical amino acids are shown as “.”
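The dot convention used in the table can be reproduced with a short helper, shown here as a minimal sketch that assumes the isolate motif and the reference motif have the same length. The function name is hypothetical; the example uses the genotype 1 reference motif at positions 230 to 239 and the variant reported for several genotype 1a isolates.

```python
def dot_notation(reference: str, variant: str) -> str:
    """Show residues identical to the reference as '.', others as letters."""
    if len(reference) != len(variant):
        raise ValueError("motifs must have the same length")
    return "".join("." if r == v else v for r, v in zip(reference, variant))

print(dot_notation("VREGNASRCW", "VREGNSSKCW"))  # prints .....S.K..
```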
Residue analysis at direct drug binding sites *T213A, *W239, #I262A, #D263-, #Q289H, #M267V, #F291I, #Y297H. Mutations at these residues confer resistance to direct-acting antiviral drugs (flunarizine, phenothiazines, pimozide, ferroquine, and aminoquinoline-derivative molecules). Results are expressed as the percentage of change at each residue site. *Residue mutations associated with the virus entry cycle (7). #Residue mutations associated with drug resistance (7, 24, 25).
3.3. Epitope Conservancy and Immunogenicity Analysis
3.4. N-Linked Glycosylation Prediction
| Genotype | AA Site (Percentage per Site)b | |||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Saudi Arabia | Egypt | Iran | Tunisia | Yemen | Afghanistan | Pakistan | Algeria | |||||||||||||||||||||||||
| Nc | 196 | 209 | 234 | N | 196 | 209 | 234 | N | 196 | 209 | 234 | N | 196 | 209 | 234 | N | 196 | 209 | 234 | N | 196 | 209 | 234 | N | 196 | 209 | 234 | 196 | 209 | 234 | ||
| 1a | 11 | 0 | 100 | 9 | 2 | 0 | 100 | 0 | 20 | 0 | 95 | 0 | 0 | 0 | 5 | 50 | 2 | 100 | 0 | |||||||||||||
| 1b | 40 | 78 | 78 | 5 | 1 | 100 | 100 | 0 | 5 | 40 | 60 | 0 | 30 | 97 | 90 | 30 | 0 | 0 | 0 | 0 | ||||||||||||
| 1g | 2 | 100 | 100 | 100 | 7 | 71 | 86 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | ||||||||||||||||||
| 1i | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ||||||||||||||||||||||||
| 3a | 7 | 14 | 57 | 0 | 0 | 12 | 0 | 33 | 8.3 | 0 | 0 | 22 | 5 | 36 | 38 | 7.8 | 18 | 7.8 | 0 | |||||||||||||
| 3b | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 60 | 40 | 0 | |||||||||||||||||||||
| 4a | 272 | 36 | 86 | 4 | 250 | 67 | 80 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | ||||||||||||||||||
| 4d | 61 | 4.9 | 93 | 26 | 4 | 100 | 75 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | ||||||||||||||||||
| 4f | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 100 | 100 | ||||||||||||||||||||||
| 4n | 3 | 0 | 0 | 67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||||||||||||||||||||
| 4o | 1 | 100 | 100 | 100 | 3 | 100 | 67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||||||||||||||||||
| 4r | 12 | 92 | 33 | 75 | 0 | 0 | 0 | 5 | 100 | 0 | 100 | 0 | 0 | 0 | ||||||||||||||||||
| 4s | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||||||||||||||||||||||||
| 2a | 3 | 0 | 100 | 0 | 0 | 0 | 19 | 16 | 6.6 | 43 | 0 | 0 | 0 | 0 | ||||||||||||||||||
| 2c | 5 | 100 | 0 | 100 | 0 | 0 | 56 | 0 | 1.7 | 52 | 0 | 0 | 0 | 0 | ||||||||||||||||||
| 4l | 0 | 3 | 0 | 100 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | |||||||||||||||||||||
| 4m | 0 | 4 | 50 | 100 | 75 | 0 | 0 | 0 | 0 | 0 | 0 | |||||||||||||||||||||
| 4u | 0 | 20 | 5 | 85 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | |||||||||||||||||||||
| 5a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 100 | |||||||||||||||||||||||
| 6a | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ||||||||||||||||||||||||
ᵃThe analysis was performed with the NetNGlyc (26) online tool to predict N-glycosylation at AA sites 196, 209, and 234.
ᵇResults are presented as percentages.
ᶜN represents the number of sequences for each genotype.




