Genetic Path of the Emergence of SARS-CoV-2


Taslima Nasrin 1, Safdar Ali 1, *

Aliah University, Kolkata, India

How to cite: Nasrin T, Ali S. Genetic Path of the Emergence of SARS-CoV-2. Gene Cell Tissue. 2022;9(1):e118302.



Context:

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the seventh coronavirus known to infect humans. Because of its highly infectious nature, the WHO declared it a public health emergency of international concern toward the end of January 2020. The present review traces the journey of SARS-CoV-2 to its present form and attempts to assess the genetic basis of its pandemic-causing abilities.

Evidence Acquisition:

The data for the present review were accessed through different publications and preprint repositories.


Results:

SARS-CoV-2 is a beta-coronavirus of approximately 60-140 nm in size. Its crown-like appearance under an electron microscope gave rise to the name ‘Coronavirus’. Comparative genome and proteome analysis reveals both similarities to and differences from SARS-CoV. The open reading frames (ORFs) found in the SARS-CoV-2 genome and their corresponding proteins are discussed. Bats may act as reservoir hosts, but not exclusively. Snakes and other intermediate hosts may plausibly have carried the virus before it reached humans. This is supported by ACE2 receptor diversity and conservation across different tissues and organisms. The spike glycoprotein, which interacts with the receptor through specific residues to invade host cells, is an attractive therapeutic target, but variations in it and their impact on the interaction pose challenges.


Conclusions:

Although the differences between the MERS-CoV, SARS-CoV, and SARS-CoV-2 genomes point to the amino acid changes that led to the present pandemic, the continued emergence of new variants signifies that the journey is ongoing and requires monitoring.

1. Context

The emergence of SARS-CoV-2 began with many people reporting pneumonia in Wuhan, China, in December 2019. The outbreak was subsequently found to be caused by a new human coronavirus, which was named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (1). The WHO declared the outbreak a public health emergency of international concern on 30 January 2020, and the world is still struggling to combat its menace (2). As of 11 July 2021, there were 187,419,263 reported cases and 4,045,647 deaths worldwide due to SARS-CoV-2, of which 30,837,222 cases and 408,040 deaths were reported from India. Coronaviruses have a history of causing human disease: SARS-CoV and MERS-CoV are known to cause severe disease, whereas OC43, HKU1, NL63, and 229E have been reported to cause only mild symptoms (3, 4). In the present review, we discuss the origin of SARS-CoV-2 by comparing its genome, host range, target receptor, viral spike protein, and the strength of interaction between its receptor-binding domain (RBD) and the host receptor with those of earlier reported coronaviruses.

2. Evidence Acquisition

The data for the present review were accessed through different publications and preprint repositories.

3. Results

3.1. Genomic Variation of SARS-CoV-2

SARS-CoV-2 belongs to the beta-coronaviruses. Morphologically, it may be round or oval, with a diameter of about 60-140 nm. Its characteristic crown-shaped appearance under an electron microscope is the basis of the name ‘Coronavirus’ (5). Besides beta (β), the Coronaviridae family includes the genera α, γ, and δ. Among them, members of the α and β genera are known to infect mammals, including humans. Coronaviruses harbor a single-stranded RNA genome of about 26 kb to 32 kb (6). Among those infecting humans, the MERS-CoV genome has been reported to be a positive-sense RNA of ~30.1 kb (7), while the SARS-CoV-2 reference genome is 29.9 kb (8). The SARS-CoV-2 genome has been reported to encode around 26 proteins, whose details and comparative mutation profiles across deceased and asymptomatic patients have been reported by our group recently (9). The 3’ end of the genome encodes four structural proteins: (1) spike (S), (2) envelope (E), (3) membrane (M), and (4) nucleocapsid (N), of which the S protein is known to interact with and attach to the host cell, aiding viral entry.

The genomes of SARS-CoV-2 and SARS-CoV show more than 80% identity to SARS-like bat coronaviruses (10). Shared features include ORF1ab, which encodes 16 non-structural proteins (NSPs), and four structural proteins. However, SARS-CoV-2 ORF3b and ORF10 have little similarity to those of SARS-CoV. Further, ORF8 is undivided in the former and split into ORF8a and ORF8b in the latter (7). A comparative illustration of the genomes of SARS-CoV, SARS-CoV-2, and MERS-CoV is shown in Figure 1. There are similarities and dissimilarities at the protein level as well. For instance, the 3b protein has 154 amino acids in SARS-CoV but only 22 amino acids in SARS-CoV-2 (10). NSP7, NSP13, envelope, matrix, and accessory proteins have no amino acid substitutions, whereas there are variations in NSP2, NSP3, and the S protein (11). Of these, NSP2 and NSP3 mutations are known to affect infectivity (12). The newly emerging variants are discussed later.

Figure 1. Genome organization of three different beta-coronaviruses. A, SARS-CoV; B, SARS-CoV-2; and C, MERS-CoV. The figure was developed using CGView DNA Plotter (13).

3.2. Definitive and Intermediate Host Range

The SARS-CoV-2 pandemic owes its origin to a seafood market, and efforts have been made to study the animals in the market that may have acted as hosts. Snakes have been one of the probable candidates, supported by similar codon usage (14). The high level of genome similarity between SARS-CoV-2 and SARS-like bat coronaviruses supports bats acting as reservoirs (15). This was also supported by phylogenetic studies that not only placed the bat-derived coronaviruses across all five subgenera of beta-coronavirus but also found that SARS-CoV-2 was evolutionarily closest to bat-SL-CoVZC45 and bat-SL-CoVZXC21 (16). Further evidence is provided by over 96% genome similarity between bat CoV RaTG13 and SARS-CoV-2 (2). Moreover, similar receptor sequences across species indicate that turtles, pangolins, and snakes can also act as intermediate hosts (17). Notably, SARS-CoV and MERS-CoV also have bats as natural reservoirs, with masked palm civets and dromedary camels, respectively, as intermediate hosts. Humans are the terminal hosts (18). The transmission of SARS-CoV-2 from bats to humans may have happened through animals being sold at the Wuhan market.
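The codon-usage argument mentioned above can be illustrated with a minimal sketch. It is a simplification and not the method of the cited study (14): it only counts codon frequencies in a coding sequence and measures how far apart two frequency profiles are. The function names `codon_frequencies` and `usage_distance` are hypothetical, and the input sequence is a toy string rather than real viral or host data.

```python
from collections import Counter
from math import sqrt


def codon_frequencies(cds: str) -> dict:
    """Return the relative frequency of each codon in a coding sequence.

    Assumes the sequence starts in frame; trailing bases that do not
    complete a codon are ignored. RNA input (U) is converted to DNA (T).
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(freq_a: dict, freq_b: dict) -> float:
    """Euclidean distance between two codon-frequency profiles.

    Smaller values mean more similar codon usage. This is an
    illustrative metric chosen for simplicity (an assumption).
    """
    codons = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) ** 2
                    for c in codons))


# Toy example only: three codons, two of them identical.
toy_profile = codon_frequencies("ATGGCTGCT")
```

A profile computed for a virus could then be compared against profiles for several candidate hosts, with the smallest distance pointing to the most similar codon usage. Similar codon usage alone is suggestive rather than conclusive evidence of a host relationship.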

Metagenomic analysis of samples from pangolins reported the presence of β-CoV28 in most. One of the studied samples showed almost 99% similarity to SARS-CoV-2. This supports the candidature of the pangolin as an intermediate host (19). Further support comes from several pangolin coronaviruses that carry identical sequences at important positions of the receptor-binding domain (RBD) of the S protein, implying an ability to bind the human ACE2 receptor that would facilitate their movement to humans (20).

3.3. Angiotensin-Converting Enzyme 2 (ACE2) as Target Receptor

The S protein is a glycosylated protein found on the surface of coronaviruses, where it appears as a spike. It mediates viral entry by binding, through its RBD, to ACE2 receptors on the host cell (21). ACE2 encodes an exopeptidase that is responsible for converting angiotensin into different forms (22). SARS-CoV-2 is believed to enter the host cell by a method similar to that used by SARS-CoV (23).

The involvement of the ACE2 receptor in viral entry makes cells expressing this receptor susceptible to SARS-CoV-2 infection. The diversity of cells that can act as hosts may contribute to the severity of symptoms. The major cells known to express this receptor include the epithelial cells of lung alveoli, the smooth muscle cells of arteries, the epithelium of the kidney tubule, and the epithelium of the small intestine (23). Though SARS-CoV-2 is primarily associated with respiratory problems, the presence of ACE2 receptors across tissues makes the body more vulnerable. Corresponding physiological manifestations have been reported as well, with diarrhea observed in patients and the virus subsequently detected in feces (22).

As expected, the expression level of ACE2 varies across tissues, which adds to the dynamics. Its expression is normally low in the lungs but can be enhanced under some physiological conditions (24). It is also plausible that co-receptors aid ACE2. Aminopeptidase N (ANPEP) and dipeptidyl-peptidase 4 (DPP4) are candidates for this role (25).

3.4. ACE2 Receptor Diversity in Nature

ACE2 expression has been reported in species ranging from fish to mammals. Moreover, the protein is structurally conserved. Human ACE2, when compared with that of the civet, bat, bird, snake, frog, and fish, showed amino acid sequence identities of 83, 81, 83, 61, 60, and 59%, respectively (24). Thus, ACE2 from these species can potentially interact with the RBD of SARS-CoV-2, making these animals not only prone to infection but also potential reservoirs. This also highlights the host adaptability of the virus (24).
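Sequence identity values such as those quoted above are typically computed from a pairwise alignment. The following is a minimal sketch of that calculation, assuming the two sequences have already been aligned to equal length with `-` marking gaps. The function name `percent_identity` is hypothetical, and the choice of denominator (aligned, non-gap columns) is one of several conventions in use, so it may not match the one used in the cited work (24).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two pre-aligned sequences.

    Columns where either sequence has a gap are skipped, so the
    denominator is the number of aligned residue pairs (an assumption;
    other conventions divide by the full alignment length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * matches / compared


# Toy alignment (not real ACE2 sequences): 5 matches out of 6 compared pairs.
toy_identity = percent_identity("ACDE-FG", "ACDQWFG")
```

In practice the alignment itself would come from a dedicated alignment tool; this sketch only covers the final counting step.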

3.5. Spike Glycoprotein

The role of the spike (S) glycoprotein in mediating the entry of SARS-CoV-2 into human host cells is well documented. At the protein level, it comprises an ectodomain at the N terminus, followed by a transmembrane region and a C-terminal intracellular tail (21). The ectodomain consists of subunits S1 (receptor-binding) and S2 (membrane-fusion). The RBD, which is critical for interacting with host receptors, is located in the S1 region. Interestingly, the S2 region is highly conserved, with 99% identity to not only SARS-CoV but also several bat coronaviruses (24). Once S1 is bound to the target receptor, it is acted on by S2 as well as by host proteases (26). The induced conformational changes can lead to the activation of membrane fusion proteins (27).

In terms of variation, the RBD of the S protein is highly diverse, but there are six amino acids that, if altered, directly affect the interaction with the host receptor. They are summarized in Table 1. Structural comparison reveals that the RBDs of SARS-CoV and SARS-CoV-2 are nearly superimposable. The only exception is a loop. The structure consists primarily of beta-sheets with no breakers and contains multiple disulfide bond-forming cysteine residues: C366/C418 and C467/C474 (24). The S protein is also conserved with respect to 22 glycosylation sites, with the exception of N370, which is present only in SARS-CoV-2 (28).

Table 1. Amino Acid Changes in Spike Protein Between SARS-CoV and SARS-CoV-2

| S. No. | SARS-CoV | SARS-CoV-2 | Effect on RBD/ACE2 Interaction | References |
|---|---|---|---|---|
| 1 | N442 | L455 | RBD of SARS-CoV-2 has better efficiency than that of SARS-CoV | (29) |
| 2 | L472 | F486 | Hot spot 31; Adds flexibility to the interaction | (24) |
| 3 | N479 | Q493 | Hot spot 31; Promotes the civet to human transmission | (30) |
| 4 | D480 | S494 | Enhances viral binding to human ACE2 | (18) |
| 5 | T487 | N501 (hot spot 353) | RBD/ACE2 interaction is more efficient in SARS-CoV-2 | (24) |
| 6 | Y491 | Y505 | This Y491/Y505 alteration has the enhanced RBD/ACE2 interaction at various temperature. | (31) |
| 7 | T499 | P499 | P499 forms better adaptation for host binding in SARS-CoV-2 | (32) |
| 8 | CTPPALNC; 68 - 471 | CNGVEGFNC; 482 - 485 | Flexibility to RBD/ACE2 interaction | (24) |
| 9 | RRAR (cleavage site) | PRRA | The addition of Proline makes the cleavage site more unique and affects transmission and pathogenesis in the animal model | (19) |

SARS-CoV-2 also has a polybasic cleavage site (RRAR), the cleavage of which influences viral infectivity and host range. For instance, a change to PRRA creates a cleavage site exclusive to SARS-CoV-2. This mutation has helped the spike protein evolve to be able to bind to human ACE2 (19).

3.6. RBD and ACE2 Interaction

The binding energy for the interaction between the RBD and the ACE2 receptor has been shown to be positive, and the interaction involves two β-sheets and three loop structures. Alterations to the residues directly involved in this interaction have implications for viral entry and hence pathogenesis (24). There are a total of 14 residues in the ACE2/RBD interface. Of these, nine are conserved, and four are variable (33). At position 479, asparagine (N) is replaced by lysine (K) in civet and by arginine (R) in bat. This substitution determines the species barrier of host preference in SARS-CoV infections. The N479K mutation reduces binding affinity drastically (34). In contrast, the S487T mutation enhances the RBD/ACE2 interaction (33). Thus, mutations favoring this interaction would support transmission to humans, and mutations weakening it would hinder transmission. As an example, SARS-CoV with the T487S mutation, which greatly decreases binding affinity, exhibited no human-to-human transmission, implying the importance of the hydroxyl group (34).

The presence of the hydrophobic amino acid L472 is crucial for RBD/ACE2 binding, which is substituted by proline and phenylalanine in SARS-CoV and SARS-CoV-2 (F486), respectively. It is located in the loop region formed by C467/C474. This loop differs between SARS-CoV (CTPPALNC) and SARS-CoV-2 (CNGVEGFNC) (Table 1). The substitution of prolines by glycines alters the structural flexibility. Aromatic residues, when present in the region, further increase binding efficiency (24). Another substitution, at P499 in SARS-CoV-2, has been reported to lead to better adaptability (32). Another important aspect of the S protein is its glycosylation sites. Of the 23 reported sites of this type, only two are present on the RBD, and their importance in the RBD/ACE2 interaction, if any, remains to be ascertained.

3.7. Various New Strains of SARS-CoV-2 Lineage

Since November 2020, a new strain of SARS-CoV-2 has manifested in Britain. It is believed that this new strain accounted for 60% of recent infections in London. As of 20 December 2020, a large number of areas in London, as well as South-East and East England, had confirmed cases of this new variant. Scientists have named the new strain “VUI-202012/01” (originally termed B.1.1.7 by Public Health England), and it became dominant over existing versions of SARS-CoV-2 (35). The 3D images in Figure 2 indicate the locations of key mutations in SARS-CoV, SARS-CoV-2, the B.1.1.7 lineage, and the B.1.617 lineage. High rates of mutation over a short period were observed in immunodeficient and immunosuppressed people infected with SARS-CoV-2 (36). This strain harbored nonsynonymous mutations and deletions (36).

Figure 2. The 3D images indicating the location of the key mutations. A, SARS-CoV; B, SARS-CoV-2; C, B.1.1.7 lineage; and D, B.1.617 lineage (images are from PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC).

A study showed prominent changes in the spike protein. Importantly, position 501 in the RBD changes from N (asparagine) to Y (tyrosine), and this N501Y mutation can increase the binding affinity for the human ACE2 receptor (37). The N501Y mutation also enhances virus adaptation in the animal model (38). This mutation has been reported to raise virus transmissibility by 70% (39). Another important mutation, P681H, occurs in the S1/S2 furin cleavage site (40). This furin cleavage site has been shown to enhance access to respiratory epithelial cells and also helps in virus transmission in the animal model (41). A deletion of HV at positions 69-70 in the N-terminal region of the spike protein has also been reported (42). Other important nonsynonymous mutations in the spike protein include A570D, P681H, T716I, S982A, D1118H, and the Y144 deletion (40).
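Substitution labels such as N501Y follow a compact convention: the reference residue, the 1-based position, and the replacement residue. The sketch below shows how such a label can be parsed and applied to a protein sequence. The names `Substitution`, `parse_substitution`, and `apply_substitution` are hypothetical, only single-residue substitutions are handled (not deletions such as the 69-70 HV deletion), and the example uses a short toy sequence rather than the real spike protein.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    ref: str       # residue in the reference sequence
    position: int  # 1-based position in the protein
    alt: str       # residue in the variant


# One letter, one or more digits, one letter, e.g. "N501Y".
_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label: str) -> Substitution:
    """Parse a label such as 'N501Y' into its three parts."""
    match = _SUBSTITUTION.match(label.strip())
    if match is None:
        raise ValueError(f"not a single-residue substitution: {label!r}")
    ref, position, alt = match.groups()
    return Substitution(ref, int(position), alt)


def apply_substitution(sequence: str, label: str) -> str:
    """Return a copy of `sequence` carrying the substitution.

    Raises ValueError if the position is out of range or the reference
    residue does not match, which guards against numbering mismatches.
    """
    sub = parse_substitution(label)
    index = sub.position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {sub.position} is outside the sequence")
    if sequence[index] != sub.ref:
        raise ValueError(
            f"expected {sub.ref} at position {sub.position}, "
            f"found {sequence[index]}"
        )
    return sequence[:index] + sub.alt + sequence[index + 1:]


# Toy example: replace the asparagine at position 3 with tyrosine.
variant = apply_substitution("MKNLY", "N3Y")
```

Checking the reference residue before substituting is a deliberate choice: it catches cases where a label is applied to a sequence with different numbering, which is a common source of error when comparing lineages.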

A study suggested that SARS-CoV-2 lineages accumulate one to two mutations per month (43). Various SARS-CoV-2 lineages have emerged throughout the world, and a few important mutations are listed in Table 2. Some mutations help the virus escape the immune system and resist antibodies. Multiple mutations give rise to a new strain. Mutations in viruses often result from low polymerase fidelity and serve as a survival mechanism for adjusting to a new host tropism (44). Recently, an Indian variant named B.1.617 emerged with two mutations in the spike protein, L452R and E484Q (45). These two mutations coexist and enhance the affinity for binding to ACE2 receptors (46). As per the World Health Organization, lineage B.1.617.2 was named the ‘delta variant’, with substitutions T478K, P681R, and L452R. Further, the B.1.617 lineage also includes the unusual combination L452R and E484Q (designated B.1.617.1 or kappa variant) and the V383L mutation along with L452R and E484Q (known as B.1.617.3) (WHO Coronavirus disease (COVID-19) situation reports).
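The reported pace of one to two mutations per month (43) can be put on a per-site scale with simple arithmetic. The sketch below converts it into substitutions per site per year, using the 29.9 kb reference genome length given in Section 3.1 rounded to 29,900 nucleotides. The rounding and the function name `per_site_per_year` are assumptions made for illustration, and the result is a back-of-the-envelope figure rather than a phylogenetic rate estimate.

```python
GENOME_LENGTH_NT = 29_900  # 29.9 kb reference genome, rounded (assumption)
MONTHS_PER_YEAR = 12


def per_site_per_year(mutations_per_month: float,
                      genome_length: int = GENOME_LENGTH_NT) -> float:
    """Convert whole-genome mutations per month to a per-site yearly rate."""
    return mutations_per_month * MONTHS_PER_YEAR / genome_length


low = per_site_per_year(1)   # 12 / 29,900, about 4.0e-4
high = per_site_per_year(2)  # 24 / 29,900, about 8.0e-4
print(f"{low:.1e} to {high:.1e} substitutions per site per year")
```

Under these assumptions, one to two mutations per month corresponds to roughly 4 to 8 substitutions per 10,000 sites per year.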

Table 2. Important Mutations in SARS-CoV-2 Lineages

| S No. | Lineage B.1.1.7 or 501Y.V1/Alpha (UK) | Lineage B.1.351 or 501Y.V2/Beta (South Africa) | Lineage B.1.1.24 or P.1/Gamma Variant (Brazil) | B.1.617.2/Delta (India) | B.1.617.1/Kappa (India) | B.1.617.3 | Effect of Amino Acid Changes in the New Strain | References |
|---|---|---|---|---|---|---|---|---|
| 1 | HV 69-70 del | L18F; D80A | L18F; T20N; P26S | T19R; 156-148 del | T19R; 156-148 del | T19R; 156-148 del | The H69/V70 single mutant showed two-fold more infectivity compared to the wild-type virus. | (47) |
| 2 | Y144 del | D215G; 242-245 del; R246I | D138Y; R190S | L452R mutation (without E484Q) | L452R | V383L mutation along with L452R | Also present in California B.1.427/B.1.429. Affecting resistance to antibodies | (46) |
| 3 | E484K a; N501Y a | K417N; E484K a; N501Y a | K417T; E484K a; N501Y a | T478K | E484Q a | E484Q a | 1. N501Y mutation was associated with increased transmissibility of the virus; 2. The E484K was reported to be an escape mutation from a monoclonal antibody, which neutralizes SARS-CoV-2; 3. E484 in RBM interacts with the K31 interaction hotspot, enhancing binding affinity; 4. K417N mutation has a functional significance in the receptor-binding domain; 5. E484K mutation in B.1.351, P.1, and P.3 imparts partial resistance to antibodies. | (31, 37, 45, 47, 48) |
| 4 | D614G a; A570D | D614G a; A701V | D614G a; H655Y | D614G a | D614G a | D614G a | Scientists showed that the 614 position of spike protein has a serine protease elastase 2 proteolytic site. This D614G mutation participates in proteolytic activity, which enhances viral entry into 293T-ACE2 cells. | (49) |
| 5 | P681H; T716L | - | - | P681R | P681R | P681R | 1. P681H mutation is involved in creating a furin cleavage site at S1/S2, which promotes the entry of coronavirus to respiratory epithelial cells and animal model; 2. P681R arrived due to an increase in the amount of cleaved spike protein (S) on virions. | (40, 41, 46) |
| 6 | Q27Stop | | | | | | Located in ORF8, it is a stop mutation that truncates the protein making it inactive. | (40) |

4. Conclusions

The biggest scientific challenge in combating the present pandemic has been the interpretation of constantly accumulating data from all over the world. The spike glycoprotein has emerged as the frontrunner therapeutic candidate, but its mutations and their impacts make it highly unpredictable. The widespread presence of ACE2 receptors in the body has led to varying physiological manifestations of SARS-CoV-2 infection. With long-term effects of the infection still unknown, a comprehensive understanding of the disease remains elusive, but the strong correlation at both genome and proteome levels with other coronaviruses should help us manage the pandemic.