Identification and functional analysis of the SARS-COV-2 nucleocapsid protein

A severe form of pneumonia, named coronavirus disease 2019 (COVID-19) by the World Health Organization, has spread across the whole world. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was shown to be the causative agent of COVID-19. In the present study, we conducted an in-depth analysis of the SARS-COV-2 nucleocapsid (N) protein to identify potential therapeutic targets. The subcellular localization and physicochemical properties of the SARS-COV-2 N protein were analyzed with PSORT II Prediction and the ProtParam tool. The SOPMA tool and Swiss-model were then applied to analyze the structure of the N protein. Next, its biological function was explored by mass spectrometry analysis and flow cytometry. Finally, its potential phosphorylation sites were analyzed with the NetPhos3.1 Server and PROVEAN PROTEIN. The SARS-COV-2 N protein, composed of 419 aa, is a 45.6 kDa, positively charged, unstable, hydrophobic protein. It has 91 and 49% similarity to the SARS-CoV and MERS-CoV N proteins, respectively, and is predicted to be predominantly a nuclear protein. It mainly consists of random coil (55.13%), and its tertiary structure was further determined with high reliability (95.76%). Cells transfected with SARS-COV-2 N protein usually show a G1/S phase block accompanied by increased expression of TUBA1C and TUBB6. Finally, our analysis of the SARS-COV-2 N protein predicted a total of 12 phosphorylated sites and 9 potential protein kinases that would significantly affect SARS-COV-2 N protein function. In this study, we report the physicochemical properties, subcellular localization, and biological function of the SARS-COV-2 N protein. The 12 phosphorylated sites and 9 potential protein kinase sites in the SARS-COV-2 N protein may serve as promising targets for drug discovery and for development of a recombinant virus vaccine.


Background
On February 12, 2020, the World Health Organization officially named the disease caused by the new coronavirus behind the pneumonia epidemic in Wuhan coronavirus disease 2019 (COVID-19) [1]. As of September 17, 2020, there were approximately 30,055,710 confirmed cases and 943,433 deaths worldwide [2]. The latest research shows that the impact of COVID-19 has far exceeded that of severe acute respiratory syndrome (SARS) in 2003 [3,4]. At present, there are no clinically validated SARS-COV-2 vaccine candidates or therapeutic antibodies to prevent infection. Diagnosis is still based on viral nucleic acid detection, and false negative cases pose a problem [5]. In response to the COVID-19 outbreak, identifying potential viral genetic or protein information as soon as possible will greatly help clinicians improve diagnosis and treatment efficiency and will aid subsequent vaccine development.
The Coronaviridae family is made up of two subfamilies: Letovirinae and Orthocoronavirinae. The Orthocoronavirinae subfamily consists of the α-coronavirus, β-coronavirus, γ-coronavirus, and δ-coronavirus genera [6]. Among them, the β-coronaviruses that infect humans usually cause severe respiratory diseases and include SARS-CoV, the Middle East respiratory syndrome coronavirus (MERS-CoV) and, currently, SARS-CoV-2. Coronaviruses are enveloped, positive-sense, single-stranded RNA viruses with mammalian and avian hosts. The SARS-CoV-2 genome is approximately 30 kb long and encodes at least 29 proteins, including 16 non-structural proteins (NSP), 9 accessory proteins and 4 structural proteins (spike [S], envelope [E], membrane [M] and nucleocapsid [N]) [7]. The coronavirus N protein is an important viral structural protein that plays an important role in genome packaging, RNA chaperoning, intracellular protein transport, DNA degradation, interference with host translation, and restriction of host immune responses [8]. It has been reported that the coronavirus N protein may help tether the genome to the replicase-transcriptase complex (RTC) and package the encapsidated genome into virions by binding the nsp3 protein. The N protein is also an antagonist of interferon and a virus-encoded repressor (VSR) of RNA interference (RNAi), which further benefits viral replication [9]. The SARS-CoV N protein, the most abundant protein in virus-infected cells, has also been shown to be a genetically stable protein, which is a primary requirement for an efficient drug target candidate [10]. Phylogenetic analysis of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) determined that it is most closely related (89.1% nucleotide similarity) to SARS-CoV, which has a history of genomic recombination [11]. The N protein of SARS-COV-2 may also play an important part in host specificity and in the evolution of the interactions between N and host cell proteins.
To date, little is known about SARS-COV-2 N protein. Our aim is to conduct a bioinformatic analysis of the primary, secondary and tertiary structure of SARS-COV-2 N protein to inform the research community about potential targets for development of anti-viral agents.

Results
The sequence and location of SARS-COV-2 N protein
The complete N protein sequence was analyzed using NCBI protein-blast, which showed that the SARS-COV-2 N protein was composed of 419 amino acids and had 91 and 49% similarity to the SARS-CoV and MERS-CoV N proteins, respectively. Subcellular localization analysis predicted that the protein had a predominantly nuclear distribution, although it was also present to some extent in the cytoplasm and cell membrane (k = 23, Table 1). We also found that a small amount of the protein was predicted to be distributed to cell vesicles, suggesting that SARS-COV-2 may spread in the human body through cell vesicles.

The physicochemical properties of SARS-COV-2 N protein
As shown in Table 1, we further studied the physicochemical properties of the SARS-COV-2 N protein with the ProtParam tool, which indicated that it was a 45.6 kDa, positively charged (pI > 10) and unstable (instability index < 60) protein. Its aliphatic index (< 70) and GRAVY (< 0) indicated that it was also a hydrophobic protein with poor heat resistance.

The structure of SARS-COV-2 N protein
The secondary structure of SARS-COV-2 N protein was predicted using a window width of 17, a similarity threshold of 8 and 4 states, as shown in Fig. 1. The results indicated that the SARS-COV-2 N protein was made up of alpha helix (21.24%), beta fold (16.71%), beta turn (6.92%), and random coil (55.13%). As shown in Table 2, 231 of 419 amino acid residues localized to random coil, indicating that it might be the main secondary structure of SARS-COV-2 N protein. In addition, the secondary structure of SARS-COV-2 N protein was compared with those of the SARS-CoV and MERS-CoV N proteins, and all of them showed a high similarity to each other (supplementary Table 1).
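The percentages above follow directly from per-residue state counts. The sketch below is illustrative only: it assumes a SOPMA-style one-letter state string (h, e, t, c), and the counts of 89, 70 and 29 residues are back-calculated from the reported percentages rather than taken from the paper; only the 231 random-coil residues are stated in the text.

```python
from collections import Counter

# Assumed one-letter encoding of the four predicted states.
STATE_NAMES = {
    "h": "alpha helix",
    "e": "beta fold",
    "t": "beta turn",
    "c": "random coil",
}

def state_percentages(states: str) -> dict[str, float]:
    """Return the percentage of residues assigned to each state."""
    counts = Counter(states.lower())
    total = sum(counts.values())
    return {
        name: round(100 * counts.get(code, 0) / total, 2)
        for code, name in STATE_NAMES.items()
    }

# Hypothetical state string: 231 coil residues are reported in the text;
# 89, 70 and 29 are back-calculated from the reported percentages.
example = "h" * 89 + "e" * 70 + "t" * 29 + "c" * 231
percentages = state_percentages(example)
print(percentages)
```

The order of residues does not matter for this summary, so a real prediction string can be passed in unchanged.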
With the help of Swiss-model, the tertiary structure was constructed with 95.76% sequence identity (Fig. 2). SAVES v5.0, which contains more than 5 different verification methods, was then used to verify the tertiary structure model of SARS-COV-2 N protein (Table 2). The overall quality factor of ERRAT was higher than 90 (Fig. 3a) and the z-score of Prove was close to 1 (Fig. 3b). Whatcheck analysis of expected properties showed green positive results, which supported the usefulness of our SARS-COV-2 N protein tertiary structure model (Fig. 3c).

The biological function of SARS-COV-2 N protein on cell cycle
As shown in Fig. 4a, cells transfected with SARS-COV-2 N protein or negative control were analyzed by mass spectrometry. Significantly higher expression of the TUBA1C, IFIT1, TUBB6, CCT3, WDR1 and SYNCRIP proteins was found in the SARS-COV-2 N protein transfection group than in the negative control. The six proteins were then analyzed with the STRING database. We found that high expression of TUBA1C and TUBB6 might be related to Cell Cycle, Mitotic regulation (p = 0.0199). Therefore, cell cycle analysis was performed. The results showed that host cells transfected with SARS-COV-2 N plasmid had a higher proportion of cells in G1 phase and lower proportions in S or G2 phase than the other groups, which indicated that the G1/S transition was blocked by the SARS-COV-2 N protein (Fig. 4b, p < 0.05).

The SARS-COV-2 N protein phosphorylated sites prediction
A total of 56 phosphorylated sites were identified in SARS-COV-2 N protein (Fig. 3d). Only 46 of them were assigned a probable protein kinase other than unsp (non-specific). Finally, 18 phosphorylated sites with specific predicted kinases were found among the remaining 46 sites (Table 3), and 9 protein kinases (PKA, PKC, PKG, EGFR, DNAPK, CKI, CKII, CDC2 and ATM) were predicted to be the main kinases involved in SARS-COV-2 N protein phosphorylation. Next, to further assess the impact of the phosphorylated sites on the biological function of the N protein, PROVEAN PROTEIN analysis was performed (Table 3). Unfortunately, no positive results were found for the phosphorylated sites 49T, 232S, 245T, 366T, 379T, 391T, 393T, 413S and 417T. Therefore, the effect of multiple amino acid variants was further analyzed, in which 49T, 245T and 366T showed a significant impact on protein function (Table 4). 232S, 379T, 391T, 393T, 413S and 417T might not be the main amino acids for SARS-COV-2 N protein function.
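The filtering logic described here can be sketched as follows, using the cut-offs stated in the prediction methods (a NetPhos score above 0.500 is a positive prediction, "unsp" marks a non-specific prediction, and a PROVEAN score at or below −2.5 is "Deleterious"). The record layout, the function names and every value in the example rows are hypothetical and are not taken from Table 3 or Table 4.

```python
from dataclasses import dataclass

NETPHOS_CUTOFF = 0.500   # scores above this are positive predictions
PROVEAN_CUTOFF = -2.5    # scores at or below this are "Deleterious"

@dataclass(frozen=True)
class SitePrediction:
    """One predicted site/kinase pair (hypothetical record layout)."""
    position: int
    residue: str
    kinase: str
    score: float

def kinase_specific_sites(predictions):
    """Keep positive predictions that name a specific kinase."""
    return [
        p for p in predictions
        if p.score > NETPHOS_CUTOFF and p.kinase != "unsp"
    ]

def provean_call(score: float) -> str:
    """Apply the default PROVEAN threshold to a variant score."""
    return "Deleterious" if score <= PROVEAN_CUTOFF else "Neutral"

# Made-up rows for illustration only; not results from this study.
rows = [
    SitePrediction(10, "T", "PKC", 0.81),   # kept: positive, specific kinase
    SitePrediction(10, "T", "unsp", 0.93),  # dropped: non-specific
    SitePrediction(25, "S", "PKA", 0.42),   # dropped: below the cut-off
]
kept = kinase_specific_sites(rows)
print(kept)
print(provean_call(-3.1), provean_call(-1.0))
```

Keeping the two thresholds as named constants makes it easy to see which sites change status if a different cut-off is chosen.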

Discussion
In order to thoroughly control the spread of SARS-COV-2 and design rational drugs for prevention and treatment, we must first understand the biological functions of the SARS-COV-2 structural proteins. In this study, we primarily analyzed the SARS-COV-2 N protein, a 45.6 kDa, positively charged, unstable, hydrophobic protein with poor heat resistance that is composed of 419 amino acid residues.
The coronavirus N protein can deregulate the cell cycle, which offers a better environment for it to bind viral RNA, form the ribonucleocapsid and promote virus replication, transcription and translation [12]. One study suggested that this might be due to its localization to the nucleolus [13]. Later studies of SARS-CoV and MERS-CoV N protein function also confirmed the interaction with nucleic acids. In this study, we found that the SARS-COV-2 N protein might be located mainly in the nucleus and had 91 and 49% similarity to SARS-CoV and MERS-CoV, respectively, not only in protein sequence but also in secondary structure, which indicated that the SARS-COV-2 N protein should also play an important role in SARS-COV-2 replication.
The phosphorylation of virus proteins can regulate their activity, localization and interactions with host intracellular proteins, and is an important sign of active viral replication [14]. Moreover, phosphorylation of the coronavirus N protein was reported to play an important role in its localization and interactions with the host cell nucleolus, which could further delay the cell cycle and create conditions conducive to viral RNA translation [15,16]. Studies on SARS-CoV showed that its N protein phosphorylation was significantly correlated with nucleocytoplasmic shuttling capacity, which might further block the host cell G1/S phase [10,13]. In this study, a significant G1/S phase block was also observed in cells transfected with SARS-COV-2 N protein. Furthermore, a total of 12 phosphorylation sites were identified on SARS-COV-2 N protein and found to be significantly associated with N protein functions. Research has shown that microtubules, together with many microtubule-related proteins such as α-, β- and γ-tubulin, assemble to carry out various cellular functions in the cell cycle (mitosis and meiosis) [17,18]. Moreover, TUBA1C, a subtype of α-tubulin and a component of the microtubule structure, was reported to be overexpressed and to promote oncogenesis in pancreatic ductal adenocarcinoma via regulation of the cell cycle [19].
In this study, we also found that cells transfected with SARS-COV-2 N protein usually had higher TUBA1C expression. The SARS-COV-2 N protein might block the host cell G1/S phase through up-regulated TUBA1C expression. In addition, TUBB6, one of the β-tubulins, was also found to be highly expressed in SARS-COV-2 N protein transfected cells and might participate in host cell cycle regulation [20]. However, many more studies are needed in this area. The aliphatic index of the SARS-COV-2 N protein indicated poor heat resistance, which might be good news for SARS-COV-2 prevention. Unfortunately, although many anti-CoV agents have been developed against virus proteases, polymerases, MTases and entry proteins since the SARS and MERS epidemics, none of them has been proven in clinical therapy [21,22]. In herpes simplex virus type 1 (HSV-1), a phosphorylation site is located at S187. After mutation of this site to alanine, the replication ability and virulence of HSV-1 in the mouse central nervous system decreased significantly [23]. Moreover, replication of influenza C virus was significantly lower than that of wild-type recombinant influenza C virus when the phosphorylation site of the second membrane protein at position 78 and/or 103 was replaced with an alanine residue [24]. The coronavirus N protein shows the least variation in gene sequence, indicating that it is a genetically stable protein, which is a primary requirement for an efficient drug target candidate [25]. Through X-ray crystallography analysis, studies have reported that the N-terminal domains of SARS and MERS are structurally adjacent to the receptor binding region, which might be a promising target for neutralizing antibodies [26]. In this study, a tertiary structure model of SARS-COV-2 N protein with potential phosphorylation sites was built, which may assist further drug exploration and development of a recombinant virus vaccine.
However, owing to the limits of our laboratory safety level, there were still some limitations in this study. Although the SARS-COV-2 tertiary structure model was successfully built, the Verify 3D and Procheck scores of the SAVES v5.0 evaluation system were not good enough, so the tertiary structure should be further improved. In addition, although the N protein shows the least variation in gene sequence, its main secondary structure of random coil and its instability also create difficulties for future studies.
Figure caption (tertiary structure model): The global quality estimate results showed that most scores were higher than −4.0, which represents a high quality of the tertiary structure. Ramachandran analysis was then performed, which showed a 91.38% score (> 90%). Moreover, the QMEAN and local quality estimate scores were also calculated, which were 0.17 and higher than 50% similarity to target.

Conclusion
In general, we primarily analyzed the physicochemical properties, subcellular localization and protein structure of the SARS-COV-2 N protein. A total of 12 SARS-COV-2 N protein phosphorylated sites and 9 potential protein kinases were also found in this study, which represent promising targets for further drug exploration and development of a recombinant virus vaccine. More studies of the SARS-COV-2 N protein are needed.

The sequence and location of SARS-COV-2 N protein
The DNA and protein sequences were downloaded from NCBI (YP_009724397.2). The DNA sequence encoding the SARS-COV-2 N protein was cloned into the pET28a-N plasmid and successfully expressed in E. coli by the New Testing Technology Center of Guangdong Experimental Animal Monitoring Institute (supplementary Fig. 1). The plasmid will be available free of charge for scientific research on SARS-COV-2 (https://jinshuju.net/f/9BnU6j). NCBI protein-blast was used to compare the SARS-COV-2 N sequence with those of SARS-CoV and Middle East respiratory syndrome-related coronavirus (MERS). PSORT II Prediction was utilized to predict protein subcellular localization in human cells.
Fig. 3 Systemic evaluation of potential phosphorylated sites in the SARS-COV-2 N protein tertiary structure. a: the ERRAT analysis of the SARS-COV-2 N protein tertiary structure. The overall quality factor was 93.5 (higher than 90); b: the Prove z-score of the SARS-COV-2 N protein tertiary structure; c: the Whatcheck analysis results for the SARS-COV-2 N protein tertiary structure. The favorable results are colored green and were significantly higher than 50%; d: the distribution of phosphorylated sites on SARS-COV-2 N protein. The phosphorylated sites are colored blue on the SARS-COV-2 N protein tertiary structure.

Western-blot of SARS-COV-2 N protein
HCT-116 cells transfected with SARS-COV-2 N plasmid or negative control were harvested and total proteins were extracted. Then, the protein concentration was quantified using a BCA protein assay kit (Beyotime, Shanghai, China). Sodium dodecyl sulfate (SDS)-polyacrylamide gel electrophoresis and Western blot analyses were performed according to standard procedures. Next, the gel was stained with Coomassie Brilliant Blue for 1 h and destained overnight.

Mass spectrometry analysis
HCT-116 cells treated with SARS-COV-2 N plasmid or negative control were harvested and total proteins were extracted. Then, the protein concentration was quantified using a BCA protein assay kit (Beyotime, Shanghai, China). Western blot was further used to separate the proteins. The N protein complexes were then denatured, reduced, alkylated and digested with immobilized trypsin (Promega) for mass spectrometry analysis.

Cell cycle test
A cell cycle kit (Keygentec KGA511, China) was used in this study. According to the kit instructions, cells were digested using 0.1% trypsin without EDTA and centrifuged at 1000 rpm. Binding buffer was then used to suspend the cells, keeping the cell concentration at 1 × 10^6 cells/mL. The supernatant was removed and 500 μL of cold 70% ethanol was added to fix the cells (2 h to overnight) at 4 °C; the fixative was washed off with PBS before staining. Then 500 μL of PI/RNase A staining working solution was added and the cells were kept in the dark at room temperature for 30-60 min. After the incubation, the cell cycle was analyzed by flow cytometry within 1 h.

The physicochemical properties of SARS-COV-2 N protein
The chemical formula, number of amino acids, molecular weight, theoretical pI, number of charged residues, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY) were analyzed with the ProtParam tool [27]. A protein with GRAVY > 0 was defined as a hydrophobic protein and a protein with GRAVY < 0 as a hydrophilic protein. An aliphatic index < 70 was defined as poor heat resistance.
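As a rough illustration of these two definitions, the sketch below computes GRAVY as the mean Kyte-Doolittle hydropathy value per residue and the aliphatic index with the standard formula (mole percent of Ala + 2.9 × Val + 3.9 × (Ile + Leu)), then applies the cut-offs given above. It is a minimal re-implementation for clarity, not the ProtParam code itself; the function names are hypothetical, and the handling of a GRAVY of exactly 0 is an assumption because the text leaves that case undefined.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def aliphatic_index(seq: str) -> float:
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    def mole_pct(aa: str) -> float:
        return 100 * seq.count(aa) / len(seq)
    return mole_pct("A") + 2.9 * mole_pct("V") + 3.9 * (mole_pct("I") + mole_pct("L"))

def classify(seq: str) -> dict:
    """Apply the cut-offs from the text to an amino acid sequence."""
    seq = seq.upper()
    g, ai = gravy(seq), aliphatic_index(seq)
    if g > 0:
        hydropathy = "hydrophobic"
    elif g < 0:
        hydropathy = "hydrophilic"
    else:
        hydropathy = "undefined"  # assumption: GRAVY == 0 is not covered in the text
    heat = "poor" if ai < 70 else "not flagged"
    return {"gravy": g, "aliphatic_index": ai,
            "hydropathy": hydropathy, "heat_resistance": heat}

# Hypothetical toy peptide, not the N protein sequence.
print(classify("MKRSSGNQ"))
```

Only the 20 standard one-letter codes are supported; an ambiguous or non-standard residue raises a KeyError instead of being silently skipped.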

The structure of SARS-COV-2 N protein
The SOPMA tool was first applied to analyze the secondary structure of SARS-COV-2 N protein [28]. Next, we used Swiss-model to generate the tertiary structure [29]. For models with fewer than 100 residues, the sequence identity must be over 30%. For models with more than 100 residues, the QMEAN score must be greater than −5 [30]. QMEAN Z-scores around zero suggested a good agreement between the model and experimental structures of similar size. Finally, SAVES v5.0, which provides quality measures for protein crystal structures and uses five different online tools (WHATCHECK, PROCHECK, ERRAT, Verify3D and PROVE), was used to assess the quality of the predicted 3D model of SARS-COV-2 N protein [31].
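The two acceptance rules for a homology model can be written as a small check. This is a sketch under stated assumptions: the function name and arguments are hypothetical, the input values in the example are invented, and a model of exactly 100 residues is treated like the larger class because the text does not say which rule applies to it.

```python
def passes_model_criteria(n_residues: int,
                          sequence_identity_pct: float,
                          qmean: float) -> bool:
    """Apply the size-dependent acceptance rule described in the text.

    Models with fewer than 100 residues need > 30% sequence identity;
    larger models need a QMEAN score greater than -5.
    Assumption: exactly 100 residues falls under the QMEAN rule.
    """
    if n_residues < 100:
        return sequence_identity_pct > 30.0
    return qmean > -5.0

# Invented inputs for illustration only.
print(passes_model_criteria(80, 45.0, -6.2))    # short model, judged on identity
print(passes_model_criteria(250, 25.0, -1.3))   # long model, judged on QMEAN
print(passes_model_criteria(250, 90.0, -5.5))   # long model failing the QMEAN rule
```

Only one of the two criteria is consulted for any given model, which is why a long model with high identity can still fail.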

The SARS-COV-2 N protein phosphorylated sites prediction
The SARS-COV-2 N protein sequence was uploaded to the NetPhos3.1 Server to analyze the potential phosphorylation sites [32]. A prediction score (a value in the range [0.000-1.000]) above 0.500 indicated a positive prediction. Each prediction was reported with the active kinase or with the string "unsp", which denotes a non-specific prediction. Only phosphorylated sites with a specific predicted kinase were included in the study. PROVEAN PROTEIN was then applied to analyze whether the phosphorylated site variants would affect the structure and function of the N protein [33]. The default threshold is −2.5; that is, variants with a score equal to or below −2.5 are considered "Deleterious" and variants with a score above −2.5 are considered "Neutral".
Received: 24 September 2020 Accepted: 27 January 2021