Identifying novel host-based diagnostic biomarker panels for COVID-19: a whole-blood/nasopharyngeal transcriptome meta-analysis

Regardless of improvements in controlling the COVID-19 pandemic, the lack of comprehensive insight into SARS-COV-2 pathogenesis is still a sophisticated challenge. In order to deal with this challenge, we utilized advanced bioinformatics and machine learning algorithms to reveal more characteristics of SARS-COV-2 pathogenesis and introduce novel host response-based diagnostic biomarker panels. In the present study, eight published RNA-Seq datasets related to whole-blood (WB) and nasopharyngeal (NP) swab samples of patients with COVID-19, other viral and non-viral acute respiratory illnesses (ARIs), and healthy controls (HCs) were integrated. To define COVID-19 meta-signatures, Gene Ontology and pathway enrichment analyses were applied to compare COVID-19 with other similar diseases. Additionally, CIBERSORTx was executed in WB samples to detect the immune cell landscape. Furthermore, the optimum WB- and NP-based diagnostic biomarkers were identified via all the combinations of 3 to 9 selected features and the 2-phases machine learning (ML) method which implemented k-fold cross validation and independent test set validation. The host gene meta-signatures obtained for SARS-COV-2 infection were different in the WB and NP samples. The gene ontology and enrichment results of the WB dataset represented the enhancement in inflammatory host response, cell cycle, and interferon signature in COVID-19 patients. Furthermore, NP samples of COVID-19 in comparison with HC and non-viral ARIs showed the significant upregulation of genes associated with cytokine production and defense response to the virus. In contrast, these pathways in COVID-19 compared to other viral ARIs were strikingly attenuated. Notably, immune cell proportions of WB samples altered in COVID-19 versus HC. Moreover, the optimum WB- and NP-based diagnostic panels after two phases of ML-based validation included 6 and 8 markers with an accuracy of 97% and 88%, respectively. Based on the distinct gene expression profiles of WB and NP, our results indicated that SARS-COV-2 function is body-site-specific, although according to the common signature in WB and NP COVID-19 samples versus controls, this virus also induces a global and systematic host response to some extent. We also introduced and validated WB- and NP-based diagnostic biomarkers using ML methods which can be applied as a complementary tool to diagnose the COVID-19 infection from non-COVID cases.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-COV-2), known as 2019-nCoV, has spread worldwide, causing about 490 million cases and more than 6.15 million fatalities (5 April 2022). Coronavirus disease 2019 (COVID-19) has a wide range of clinical manifestations, from asymptomatic patients to severe inflammatory reactions, resulting in organ failure and death (Chen et al. 2020). It is not yet completely clarified whether severe outcomes of disease are related to the viral infection, the host's immunological response, host underlying diseases, or a combination of these variables (Williamson et al. 2020). Evidence indicates that the host responses to SARS-COV-2 are highly different than antiviral responses to other respiratory viruses like influenza and seasonal corona (Smith et al. 2021). According to the analysis of SARS-COV-2-specific immunological responses, SARS-COV-2 inhibits innate immune system activation like dendritic cells while increasing proinflammatory macrophage activation and IL-6 and tumor necrosis factor (TNF) production (Zhou et al. 2020;Schultze and Aschenbrenner 2021). Besides, studies have revealed that one of the important damaging factors, locally in the lungs and systemically in the circulation, is an increase in neutrophils (Vanderbeke et al. 2021). Despite the observed peripheral lymphopenia, SARS-COV-2-specific B cell responses and the number of plasma cells increase in COVID-19 patients (Smith et al. 2021).
Recent studies have demonstrated that the host transcriptome undergoes substantial changes upon SARS-COV-2 infection in a variety of tissues like respiratory epithelial cells, nasopharynx, colonocytes, and whole blood or plasma samples (Ong et al. 2020;Lioa et al. 2020). Therefore, RNA sequencing can be employed as a robust tool to identify host transcriptional signatures affected by SARS-COV-2, leading to the development of some novel diagnostic biomarkers and therapeutic strategies (Ng et al. 2021;Thair et al. 2020a;Mick et al. 2020). Furthermore, understanding the similarities and differences of the host response to SARS-COV-2 infection compared to other (viral or non-viral) respiratory infections is necessary to find common/virus-specific transcriptional signatures (Thair et al. 2020a).
The conventional diagnostic test for COVID-19, viral nucleic acid amplification tests (NAAT) using reverse transcription-polymerase chain reaction (RT-PCR), provides a considerable rate of false-negative results because the virus load in individuals can be low and significantly change during the course of the disease (Pan et al. 2020;Wölfel et al. 2020). Therefore, identification of the hostspecific biomarkers as a complementary tool is critical to accurately diagnose the COVID-19 infection from non-COVID cases (Mick et al. 2020). Several studies focused on finding effective repurposable drugs for the COVID-19 treatment and introduced some potential therapeutic targets by analyzing a single gene expression dataset of COVID-19 patients. In a study by Ahmed et al., a microarray dataset consisting of 10 and 4 PBMC samples of patients with SARS-CoV-1 infection and healthy controls, respectively was analyzed to find hub genes and remarkable signaling pathways in SARS-CoV-1 infection. The repurposable drugs for SARS-CoV-1 infections were then identified and validated for the treatment of SARS-CoV-2 infections (Ahmed et al. 2022). Furthermore, Mosharaf et al. analyzed an RNA-seq dataset consisting of 35 lung tissue samples infected with SARS-COV-2 (case) and 5 control samples to find differentially expressed genes (DEGs). By constructing a protein-protein interaction network, key genes and signaling pathways in SARS-COV-2 infection were determined to be used as targets for drug repurposing in COVID-19 (Mosharaf et al. 2022). However, recent developments in "omics" technologies, along with advances in computer sciences, have provided an opportunity to integrate and analyze multi-cohort datasets using systems biology approaches and machine learning (ML) methods, leading to decreased heterogeneity of publicly available single population-based transcriptome datasets (Tavassolifar et al. 2020). The multi-cohort analysis of published transcriptional data derived from whole blood (WB), peripheral blood mononuclear cells (PBMCs), epithelial cells, or cell lines that represented infections from 7 viruses (adenovirus, influenza, SARS, RSV, HRV, enterovirus, HHV6) and 4 bacteria (S. pneumonia, S. aureus, E. coli, Salmonella), proposed a common host signature across different respiratory viral infections. These findings could discriminate individuals with viral infections from those with bacterial infections and healthy controls (Andres-Terre et al. 2015).
Cross-platform normalization (CPN) as a data integration technique increases sample sizes, improves overall heterogeneity estimation, identifies more specific host response signatures, and reduces the effect of individual study-specific biases. It has been shown that novel diagnostic panels with more power, robustness, and generality can be introduced by using CPN (Hamid et al. 2009;Larsen et al. 2014;Irigoyen et al. 2018;Taminau et al. 2014;Maleknia et al. 2020). In this study, we developed an integrated, multi-cohort analysis framework that takes advantage of the heterogeneity seen in Gene Expression Omnibus (GEO) public data repositories to identify and validate robust, reproducible, and more specific host response signatures by CPN. This study involved the five gene expression datasets including 179 human WB samples from SARS-COV-2 infected patients and healthy controls (HC) and also four gene expression datasets including 1387 human nasopharyngeal (NP) samples from patients with SARS-COV-2 infection, other viral acute respiratory illnesses (ARIs), and non-viral ARIs, as well as HC individuals. Our methods were employed for two different hypotheses. In the first step, the gene expression profiles of 9 WB and NP datasets were analyzed to find out SARS-COV-2 pathogenesis and perturbed host immune response pathways which we termed 'COVID-19 meta-signature' (CMS). We investigated the similarities and differences of the host response to SARS-COV-2 infection compared to other (viral or non-viral) respiratory infections in NP samples. In the continuum, to introduce diagnostic biomarker panels as complementary tools along with the conventional diagnostic test, we leveraged feature selection and ML methods on the integrated datasets. In the first phase of the ML methods, the optimum combinations of features were designated by using k-fold cross validation on a train set (80% of the population), and then in the second phase, the best combinations based on accuracy parameter were selected to be validated on independent test sets (20% of the population). Finally, we could identify high-performance 3 to 9-biomarker panels related to WB and NP samples that could accurately distinguish COVID-19 patients from HCs and non-COVID individuals (Fig. 1).

Data collection and pre-processing
The entire analysis was accomplished with publicly available data. In order to reach proper gene expression profiles, the keyword "COVID-19" was searched in the GEO database restricted to "Homo sapiens" taxonomy and "Expression profiling by high throughput sequencing". Afterwards, data obtained from WB and NP samples were selected, while the data related to organoids, single-cell, cell lines, recurrent patients, and drug treatments were excluded from the results. Finally, by setting the sample count on more than 20 samples on 11/20/2021, three datasets with accession numbers GSE163151, GSE151161, and GSE152641 related to WB samples and four datasets with accession numbers GSE163151, GSE152075, GSE156063, and GSE188678 related to NP samples were used for this study. To raise the number of HCs related to WB, 27 HC samples were imported to the study from two non-COVID studies with accession numbers GSE169687 and GSE172450 (Table 1).
Rather than doing conventional biological experiments, we chose cohorts from various centers to incorporate technical variation and population-related biological heterogeneity in the study. The samples in these datasets characterized different biological conditions including viral infections (SARS-COV-2, influenza, other Coronaviruses, and other viruses causing ARIs), non-viral ARIs (types of bacterial infections), and HC. To add on the technological heterogeneity property in our study, datasets profiled by RNA-seq technology from several manufacturers were used. The Ensemble IDs that represented transcripts in each dataset were mapped to gene symbols (HGNC) to facilitate integrated analysis. If multiple Ensemble IDs were matched to one gene, the expression values for those IDs were averaged to one value using the aggregate function from the stats package version 4.0.3 (Fig. 1).
We selected 9 datasets (GSE163151(WB and NP), GSE151161, GSE152641, GSE169687, GSE172450, GSE152075, GSE156063, and GSE188678). The NP samples in GSE163151 related to "Influenza" and "other corona" viruses were labeled as other viruses in the CMS analysis. Cohorts related to NP samples contained samples from SARS-COV-2 infection and other conditions such as HC, other viral ARIs, and non-viral ARIs. Cohorts related to WB samples contained samples from SARS-COV-2 infection and HC individuals.

Data integration via cross-platform normalization
we performed the CPN method on the finalized cohorts and samples (Walsh et al. 2015). In order to gain a uniform dataset associated with each tissue while employing more heterogeneity in the population, the datasets related to each group were integrated by merge function from the base package, R 4.0.3. The genes with 0 read count in at least 60% of all classes (COVID-19, HC, non-viral, and other viral) were excluded from the datasets. Subsequently, the batch correction was performed using ComBat-seq from the sva package version 3.38  to minimize experimental variance. The parameter group was set to the vector of the disease condition (Pei et al. 2021).

Differential gene expression analysis between COVID-19 and non-COVID
Two uniform gene expression datasets included 179 human WB samples from COVID-19 patients and HC Fig. 1 The workflow of the study: The RNA-Seq datasets related to whole blood (WB) and nasopharyngeal (NP) samples from patients with COVID-19 infection and other similar disease conditions including viral and non-viral acute respiratory illnesses (ARI) as well as healthy controls were acquired from GEO database. Data were integrated and the batch effects were eliminated. Subsequently, the datasets were subjected to pathway enrichment and GO analyses. Furthermore, the candidate diagnostic biomarker panels were identified using machine learning methods on train datasets and validated on independent cohorts to introduce the best biomarker combinations. Besides, the RF-based generic prediction models were generated by using all combinations of 3 to 9 markers related to 23 common WB/NP DEGs was done. Finally, the results of two prediction models, including the LASSO feature-based prediction model and RF-based generic prediction model were compared. WB whole blood, NP nasopharyngeal, ARI acute respiratory illnesses, RF random forest individuals and 1387 human NP samples from patients with SARS-COV-2 infection, other viral ARIs, non-viral ARIs, and HC individuals were imported for differential gene expression analysis. Read counts were normalized by TMM scaling from the DEseq2 and edgeR packages version 1.30.1 and 3.32.1. Afterward, DEGs were identified by using the lmFit, and empirical Bayes (eBayes) functions within the limma package version 3.46.0 (Ritchie et al. 2015). The P values were corrected for multiple comparisons by the Benjamini-Hochberg false discovery rate (FDR) method (Benjamini and Hochberg 1995). The significant DEGs were defined by adjusted P-value < 0.05, and the absolute value of the log2-transformed fold-change in the expression level was set to more than 1. The volcano plots related to each set of DEGs were plotted by ggplot2 package version 3.3.3.

Identifying CMS by gene set enrichment and gene ontology analyses in WB and NP samples
To identify enriched biological processes and pathways associated with each set of DEGs in WB and NP samples, we utilized EnrichR (Kuleshov et al. 2016) and ToppGene (Chen et al. 2009) to explore the most significant Gene Ontology Biological processes (BPs), Molecular functions (MFs), Cellular components (CCs), and signaling pathways. Due to the popularity of the EnrichR tool in scientific resources (Stephenson et al. 2021;Sajuthi et al. 2020;Unterman et al. 2022;Müller et al. 2021), the results and discussion of this study are mostly based on EnrichR findings. We illustrated the results by dot-plot and bar plot functions in the ggplot2 package.

Estimation of the immune cell type proportion in WB samples by CIBERSORT
In this step, WB row counts achieved from batch correction were normalized using the counts per million (CPM). A computational method called CIBERSORTx (https:// ciber sort. stanf ord. edu/) method was applied to quantify immune cell-type proportions from Gene Expression Profiles (GEPs) (Newman et al. 2015). As reference gene expression signatures, the standard LM22 signature matrix was leveraged to estimate the relative proportions of each cell type (Chen et al. 2018). The signature matrix consists of 547 genes that precisely recognize 22 functionally defined human immune subsets, including seven T cell types, naïve and memory B cells, plasma cells, NK cells, and myeloid subsets. By applying the independent samples T-test, the differences of each cell type between two considering groups of COVID-19 patients and HC individuals were tested.

Comparison of SARS-COV-2-infected WB and NP transcriptome profiles
In order to detect similarities and differences in gene expression patterns between two types of SARS-COV-2-infected tissues, we compared two sets of DEGs between COVID-19 patients versus HC in WB and NP samples. We first explored the DEGs sets via the venn diagram by venn package version 1.10 to recognize the overlapping genes between two DEGs sets. Then, enriched biological processes and pathways associated with each set of the common DEGs were obtained by Enrichr and ToppGene databases.

In-silico discovery of diagnostic biomarkers in WB and NP samples
To discriminate COVID-19 patients from HC individuals in WB samples, an ML method was used to recognize Table 1 The number of WB and NP samples applied in this study a Genomic Spatial Event (GSE) database (Danford et al. 2008) b The samples related to "Influenza" and "other corona" were labeled as other viruses in the analysis the diagnostic biomarker panel. The same approach was applied to distinguish patients with COVID-19 and non-COVID in NP samples. Non-COVID samples in NP mean the total samples of non-viral, other viral, and the not-COVID samples in GSE152075. Samples were stratified randomly but proportionally assigned into a training set (80%) or an independent test set (20%). The methods used are as follows:

Feature selection in the train sets of WB and NP samples
Diagnostic features were selected in two ways. Initially, the DEGs were found by pairwise comparisons among considering groups in the train sets of WB and NP samples. These genes were exploited as input features to fit a Least Absolute Shrinkage and Selection Operator (LASSO) regression model for feature selection (Tibshirani 1997; L'Heureux 2017). The glmnet package version 4-1.1 was used plus the cross-validation method to estimate lambda's regularization parameter by cv.glmnet function. The eligibility criteria to enter the features into the classification step were as follows: absolute LASSO coefficient more than 0.1 OR the non-zero LASSO coefficient and the absolute value of logFC more than 1.3. Subsequently, EnrichR and ToppGene databases were used to explore enriched biological processes and pathways associated with these features. In parallel, we also considered the common DEGs between NP and WB to check whether they could also be applied as diagnostic features.

Discrimination of COVID-19 patients from non-COVID individuals by optimal biomarker panels
To detect the optimal, powerful and robust diagnostic biomarker panel obtained by the LASSO method, a two-phase ML platform was done. The Random Forest (RF) classifier (Breiman 2001) by randomForest R package version 4.6-14 was employed in both phases. In the first phase, the classifier was performed on the training set (80%) of the WB and NP datasets using fivefold and tenfold cross validations, respectively. The input features of the first phase were all combinations of 3 to 9 selected features based on LASSO. The statistical parameters; sensitivity, specificity, and accuracy, were estimated by confusion Matrix function from caret package version 6.0-86. The best combinations considering the highest sensitivity and specificity were picked out for the next phase. In the second phase, the best combinations of features in the training set (80%), were validated once more, based on the independent test set (20%). The sensitivity, specificity, and accuracy obtained from the two phases were displayed by line plots. Likewise, the receiver operating characteristic (ROC) curves were plotted, and the Area Under Curve (AUC) was computed through pROC package version 1.17.0.1.
In parallel, to find the best diagnostic panels among common DEGs between WB and NP samples of COVID-19 patients compared to HC, we used RF classifier on all combinations of 3 to 9 common features in the training set (80%), and then they were validated based on the independent test set (20%). Finally, the results of the LASSO feature-based prediction model and common WB/NP feature-based model were compared.

Data integration and batch effect correction in WB and NP datasets
The statistics of the samples used for this study is demonstrated in Table 1. The number of WB samples for COVID-19 patients and HC individuals was 108 and 71 samples, respectively. Meanwhile, the number of NP samples for COVID-19 patients, HC, and AIRs were751, 11, and 571, respectively.
Both genders were included in these datasets, and all individuals were adults. The gene expression datasets related to each tissue were integrated, and the effect of batches was corrected to obtain a uniform dataset in WB and NP samples, individually. To check the uniformity of the datasets, principal components analysis (PCA) was performed before and after removing the batch effect (Additional file 1: Figs. S1 and S2).

WB transcriptome analysis of patients with COVID-19
The differential gene expression analysis between COVID-19 and HC groups in WB final dataset resulted in the identification of 345 DEGs, of which 309 genes were upregulated, and 36 genes were downregulated in COVID-19 ( Fig. 2A). The 20 non-redundant significant BPs and all seven significant hallmark pathways with adjusted P-value < 0.05 based on EnrichR and ToppGene findings were depicted in Fig. 2B, C. The significant BPs were enhancement in "neutrophil activation", "cell cycle", "inflammatory host response", "interferon signature", and reduction in "gas and oxygen transport". Interestingly, GO functional analyses of DEGs using the ToppGene database also confirmed that BPs listed above are highly linked with the SARS-COV-2 infection (Additional file 2: Table S1). Furthermore, downregulated genes were significantly involved in "ferrous iron binding", "heme binding", and "hemoglobin alpha binding" and belonged to the endocytic vesicle lumen. The pathway analysis of WB samples from patients with COVID-19 compared with HC showed notable enrichment of genes in multiple pathways associated with "inflammatory response", "interferon-alpha", "G2M checkpoint", "mitotic spindle", and "interferon-gamma response" as well as "KRAS Signaling Up". The complete list of DEGs, significant BPs, MF, and CCs, and enriched signaling pathways based on EnrichR and ToppGene databases as well as their commonalities are available in Additional file 2: Table S1.

COVID-19 immune cell landscape
Specific alterations in gene expression between SARS-COV-2 and HC individuals might be derived from changes in tissue cellular composition such as immune cell types. To examine this, the CPM of WB samples was imported to the CIBERSORTx tool and cell-type proportions were estimated. The results of the independent T-test between different immune cell types of COVID-19 and HC groups were presented in Additional file 3: Table S2. Remarkably, SARS-COV-2 infection increased the proportion of T regulatory cells (Tregs) while decreasing the proportions of CD8, CD4 naïve, and CD4 memory resting cells. Correspondingly, the proportions of neutrophils, B cells naïve, plasma cells, and macrophages (M0 and M1) were augmented in COVID-19 vs HC (Fig. 3).

NP transcriptome analysis of patients with COVID-19
To delineate the molecular pathogenesis of SARS-COV-2 based on host NP gene expression, differential gene expression, GO and hallmark gene set analyses were performed in three paired conditions including; COVID-19 vs HC, COVID-19 vs non-viral ARIs, and COVID-19 vs other viral ARIs to highlight the contrasts. The volcano plot to demonstrate differential expressed genes which had adjusted P-value < 0.05, |Log 2 FC|> 1. Red and green show up and downregulated genes, respectively (A). Dot plot to show BPs (GO) according to significantly upregulated and downregulated genes. The size of the dots is proportional to the gene ratio in considering process and the color corresponds to the -log10 of the adjusted P-value. Selected top and not-redundant terms are visualized (B). Bar plot to depict hallmark gene set enrichment analysis. The size of the bars is proportional to the gene ratio in considering pathway and the color corresponds to the -log10 of the adjusted P-value (C). BP biological process, GO gene ontology The number of up and downregulated genes were depicted in volcano plots (Fig. 4A). 341 upregulated and 1598 downregulated genes were found when the NP transcriptome profile of COVID-19 patients was compared to that of healthy individuals. The expression of 64 and 1344 genes was also significantly increased and decreased in COVID-19 compared to non-viral ARIs, respectively. Additionally, 1722 and 358 genes were up and downregulated in COVID-19 vs other viral ARIs, respectively. A striking contrast highlighted by differential gene expression analysis between the first two groups and the third group emerged in the enrichment of interferon signature. According to enrichment analysis using two databases, SARS-COV-2 infection in comparison with HC and non-viral ARIs leads to the significant upregulation of genes associated with regulation of cytokine production, cellular response to cytokines, defense response to the virus, innate immune response, and inflammatory response. However, these BPs in SARS-COV-2 infection were noticeably attenuated compared to other viral ARIs. Likewise, the genes relevant to neutrophil functions had distinct expression patterns among study groups. The expression of genes involved in neutrophil-mediated immunity, neutrophil activation involved in immune response, and neutrophil degranulation was reduced in COVID-19 compared to other-viral ARIs. However, neutrophil chemotaxis and neutrophil migration were increased in COVID-19 vs non-viral ARIs (Table. S3). Furthermore, the results of enriched MFs indicated that genes with CXCR3 chemokine receptor binding activity which play important role in recruiting pro-inflammatory cells such as neutrophils were upregulated in COVID-19. The regulation of the RIG-I signaling pathway, viral genome replication, and regulation of ribonuclease activity were incremented in the COVID-19 group compared to HC and non-viral groups. Moreover, the CCs related to the nucleus and cytosolic ribosome were overrepresented in the COVID-19 patients compared to HC. According to the results, the NADH dehydrogenase complex which has an important role in the redox system by producing reactive oxygen species was increased in COVID-19 compared to non-viral ARIs. Furthermore, few genes involved in "protein polyubiquitination" were specifically upregulated in COVID-19 patients compared to other viral patients and the HC group ( Fig. 4B and Additional file 4: Table S3).
On the other hand, the TNF-alpha signaling pathway was enriched in COVID-19 relative to other-viral ARIs. Moreover, some genes related to protein secretion and reactive oxygen species pathways like MAPK1, RAB14, RAB22A, SOD1, and CAT were dysregulated in COVID-19 versus other-viral diseases. All of these four sets of DEGs, BPs, MF, CCs, and enriched signaling pathways are accessible in Additional file 4: Table S3.

Common and distinct gene signatures associated with COVID-19 in WB and NP samples
To investigate the overlapped molecular mechanisms involved in SARS-COV-2 host responses between WB and NP samples, common genes which were differentially expressed in WB and NP samples of COVID-19 patients compared to HC were determined. Using the Venn diagram, the common DEGs indicated in Fig. 5A were partitioned into three groups: upregulated in WB and upregulated in NP (UB-UN), downregulated in WB and downregulated in NP (DB-DN), and upregulated in WB and downregulated in NP (UB-DN).
The first group had 19 genes, and the next two groups contained 4 and 43 genes, respectively. The BPs related to innate immune response such as cellular response to IFNI, IFN-b production, defense response to virus, and cytokine signaling pathways, as well as alpha-beta T cell differentiation, were increased in both WB and NP groups, while the gas transportation, hydrogen peroxide metabolism, and cell death were reduced in both WB and NP groups (Fig. 5B). Also, consistent with the enriched "gas transportation" BP, MFs such as "heme binding" and "haptoglobin binding" were decreased in both WB and NP groups. Interestingly, the cell cycle and ubiquitinprotein ligase activity were enriched BPs in the UB-DN group. These findings were in line with functional enrichment analyses performed independently in the previous sections on WB and NP transcriptomes of patients with COVID-19. Most of the enriched BPs by EnrichR (Fig. 5B) were also validated by the ToppGene database (Additional file 5: Table S4).
Likewise, CCs validated by both EnrichR and ToppGene databases for DB-DN genes were related to endocytic vesicle lumen, cytosolic small ribosomal subunit, and small ribosomal subunit, which were in line with the previous enriched CCs obtained independently from SARS-COV-2 infected NP and WB transcriptome analyses. The common overexpressed genes of NP and WB samples, including SERPING1, OLR1, IFITM3, CXCL10, IFI27, IFI44, IFI44L, IFIT3, and ISG15 were mostly members of complement, IL-2/STAT5, inflammatory response, interferon alpha, and interferon gamma responses, and TNF-alpha signaling pathways (Fig. 5C). Furthermore, we found that the CDC20, PLK1, and BIRC5 genes lead to enriching E2F targets and the G2-M checkpoint pathway in the UB-DN group. Intriguingly, downregulation of KRAS signaling was observed in both UB-UN and DB-DN groups (Fig. 5C). The complete list of common DEGs in WB and NP samples of COVID-19 patients compared to the control group, significant BPs, MF, and CCs, and enriched signaling pathways based on EnrichR and ToppGene databases as well as their commonalities are available in Additional file 6: Table S5.

Identification of high-performance diagnostic biomarker panels for COVID-19 in WB and NP samples
To determine the best diagnostic biomarker panel, in the first phase, the DEGs associated with train sets of WB and NP samples were applied as inputs to the LASSO regression method for feature selection. We selected 80% of each group of WB and NP samples as train sets using a stratified random sampling method. The genes with an absolute LASSO coefficient more than 0.1 OR the genes with a non-zero LASSO coefficient and an absolute value of logFC more than 1.3 were selected (described in Table 2), and used in the RF classifier. According to the mentioned criteria, 22 and 23 markers related to WB and NP datasets were detected and almost all of them have previously been implicated in COVID-19 or other viral or inflammatory illnesses.
The RF classification was implemented on all combinations of 3 to 9 features based on 5/tenfold cross validation in WB/NP training sets. The biomarker panels were designated to enter the next phase based on different values of Accuracy. Considering that the number of combinations for 3 to 9 features varied from 4620 to 4,476,780, in order to keep the calculation cost-effective, it was desirable that as few biomarker panels as possible be included in subsequent evaluations. Consequently, the biomarker panels with the accuracy of 75% for 3-marker panels to 85% for 9-marker panels were selected in WB datasets to enter the next phase. These criteria in the NP dataset were 80% for 3-marker panels to 85% for 9-marker panels, as well. 5272 and 1259 combinations for WB and NP samples were selected to be applied in the second phase.
In the next phase, the RF classifier was applied on independent test sets to validate the best selected biomarker panels. The employed methodology gained the final panels based on more data than any of the previous studies and also was validated twice. Therefore, these biomarker combinations could be more generalized, robust, and powerful. The parameters related to the first and second phases for the best combinations of markers were illustrated in Additional file 6: Table S5 and Table 3. The best final 3-to 9-marker panels to diagnose COVID-19 in WB samples had an accuracy of 88% to 98% in the first phase and 91% to 98% in the second phase. Similarly, the best final 3-to 9-marker panels for classifying NP samples had an accuracy of 82% to 88% in the first phase and 80% to 88% in the second one. Correspondingly, line The Venn diagram to display the distribution of genes in four desired groups (UB upregulated genes in blood, DB downregulated genes in blood, UN upregulated genes in nasal, and DN downregulated genes in nasal) (A). Dot plot to show BPs according to common genes of each paired group. The size of the dots is proportional to the gene ratio in considering process and the color corresponds to the -log10 of the adjusted P-value. Selected top and not-redundant terms are visualized (B). Bar plot to depict hallmark gene set enrichment analysis. The size of the bars is proportional to the gene ratio in considering pathway and the color corresponds to paired groups whose common genes were studied. The "KRAS Signaling Dn" pathway was enriched in two groups (C). BP biological process plots of the sensitivity, specificity, and accuracy as well as ROC curves obtained from two phases for NP and WB samples were presented in Figs. 6 and 7.
The best biomarker panels obtained from the LASSO method, as subsets of DEGs, were part of CMS achieved by gene set enrichment and GO analyses of WB and NP transcriptomes. method. Functional enrichment analysis based on EnrichR and ToppGene databases showed that WB-based diagnostic markers were mostly involved in cell cycle-related BPs and pathways. Besides, in agreement with previous results, the "hemoglobin biosynthetic process" and "hemoglobin metabolic process" were decreased MFs, indicating the important role of candidate biomarkers in COVID-19 pathogenesis. Functional annotation of NP biomarkers indicated that the MFs of "antiviral innate immune response, " "cellular response to cytokine stimulus, " and "positive regulation of immune system process" were enhanced. They significantly contributed to interferon signaling pathways (Additional file 7: Table S6). In the next step, we investigated whether 23 common DEGs in WB and NP samples of COVID-19 patients compared to the control group could be used as robust features to classify COVID-19 from non-COVID cases using the RF algorithm. The total number of combinations with 3 to 9-markers related to 23 common DEGs was 1,697,883. we selected the first 100 top panels with the highest accuracy and sensitivity for each class (Additional file 8: Table S7).
We then compared the results of the LASSO featurebased prediction model and the common WB/NP feature-based prediction model (an RF-based generic prediction model) ( Table 4). The genes IFI44L, IFI6, and CXCL10 were observed in the best NP-based diagnostic panels of both prediction models. Likewise, GTF2H2C as a novel gene related to SARS-COV-2 infection was presented in the best WB-based diagnostic panels of the LASSO-and common feature-based prediction models.
Our findings indicated that the accuracy of the best panels in the LASSO feature-based prediction model was higher than that in the RF-based generic prediction model. In addition, as represented in Additional file 6: Table S5 and Additional file 8: Table S7, LASSObased panels have higher performance in comparison to   Fig. 6 The criteria of classifiers: The Line plots to indicate the value of the sensitivity, specificity, and accuracy of the classifiers for whole blood (WB) and nasopharyngeal (NP) samples in the first and second phases based on the number of features biomarker panels which were achieved using common WB/NP DEGs.

Discussion
Understanding the pathogenesis of COVID-19 plays a key role in drug design and treatment. Besides, identifying the host responses in different tissues can give us a more comprehensive conception. Fig. 7 The ROC curves: These ROC curves illustrate the sensitivity, 1-specificity, and AUC associated to phase I (A and C) and phase II (B and D) for whole blood and nasopharyngeal samples among the top 3 to 9 features, respectively In the present study, we implemented a meta-analysis and integrated 9 public datasets from different populations to improve individual study-specific biases. Using this approach, we notably increased the sample size and decreased the heterogeneity of the samples compared to previous studies. we applied GO and enrichment analyses to scrutinize more pathophysiological aspects of this novel pandemic virus compared to other ARIs and physiological conditions and developed the diagnostic panels based on host WB and NP transcriptomes using the ML methods.
The results revealed the distinct gene expression patterns of WB and NP samples in patients with COVID-19 versus other individuals. In agreement with previous studies (Thair et al. 2020a;Aschenbrenner et al. 2020), WB transcriptomics followed by the analysis of immune cell type proportion showed an increase in the number and activity of neutrophils in COVID-19 patients compared to healthy individuals.
T cells have been found to play a key role in viral infections and immunological homeostasis. The observed decline in CD4 + and CD8 + T cells could be attributed to SARS-COV-2 replication and dissemination, and the associated immunopathology damage (Huang et al. 2021). Despite the reduction of CD4 + and CD8 + T lymphocyte subsets in COVID-19-infected patients, our deconvolution analysis demonstrated that the Treg population was highly increased. These findings imply that Tregs may exert a negative effect on COVID-19 patients via inhibiting antiviral T-cell responses during the infection (Galván-Peña et al. 2021). In particular, our research showed that naïve B cells and plasma cells were highly increased in COVID-19 whose antibody production may prevent COVID-19 patients from deteriorating) (Huang et al. 2021;Wu et al. 2020). Transcriptome analysis revealed that the oxygen transport process was decremented in WB samples of COVID-19 patients, corresponding to the clinical features of hypoxia. Hypoxia not only plays an essential role in inflammation by increasing the release of pro-inflammatory cytokines but may also affect viral replication (Tavassolifar et al. 2020(Tavassolifar et al. , 2021Zhou et al. 2021;Maras et al. 2021), an enriched BP which was also observed in this study. Impaired regulation of host antiviral immune responses, such as IFN pathway activation and chemokine production, is part of the characteristics of viral infections (Salazar-Mather and Hokeness 2006;McNab et al. 2015). In line with previous studies (Kim and Shin 2021;Hadjadj et al. 2020), the IFN-mediated antiviral signature was obtained in the WB and NP samples of SARS-COV-2 infected patients as a crucial immune system response. Serum IFN-I levels in COVID-19 patients are higher in comparison to HC, proposing a beneficial effect of enhanced IFN-I in the blood for killing SARS-COV-2 (Huang et al. 2021). However, IFNmediated antiviral pathways, innate immune responses, and response to virus pathways were less activated in NP samples infected by SARS-COV-2 compared to other respiratory viruses. Further, it has been mentioned that specific expression patterns of IFN pathway activation in SARS-COV-2 differ from those found in other respiratory viruses (Ng et al. 2021). A specific immune response is necessary during the incubation and non-severe phases to eradicate the virus and prevent disease progression to severe stages. SARS-COV-2 replicates during the incubation period and severely damages the targeted tissues, particularly in organs with high ACE2 levels, by impairing the protective immune response (Shi et al. 2020). Consequently, the lower innate immune response and inflammation compared to other viruses during the incubation period may lead to enhance viral replication and propagation and subsequent enormous destruction of the affected tissues. In general, transcriptome analysis of WB and NP samples illustrated that the expression pattern of some genes is altered in SARS-COV-2 compared to HCs. Some of them like genes involved in the cell cycle showed an opposite expression pattern in SARS-COV-2-infected NP and WB samples which means that the pathogenesis of COVID-19 is body-site-specific. Albeit, due to the common DEGs in WB and NP samples of COVID-19 patients versus HC, such as those involved in the INF and cytokine pathways, we showed that SARS-COV-2 could also induce a global and systematic host response. RT-PCR, as the gold standard method of SARS-COV-2 detection, has clinical sensitivity ranging from 66 to 80% (Nag et al. 2020). Hence, host transcriptional biomarkers could be employed alone or combined with molecular detection of SARS-COV-2 to reduce false-negative and false-positive results, such as those caused by insufficient viral load or cross-contamination (Mick et al. 2020). We attempted to address this problem by using two meta-datasets from WB and NP samples to develop COVID-19 diagnostic biomarker panels that are precise and clinically practical and can overcome the constraints of direct viral genetic material detection. The differential gene expression analysis followed by the LASSO method led to the detection of 22 and 23 markers for WB and NP, respectively. The high-performance 3 to 9-gene sets of WB samples were determined from 22 features. We also applied 23 common WB/NP DEGs as input features to construct RF-based prediction models and observed that the accuracy, sensitivity, and specificity of these models were less than LASSO feature-based prediction models. Intriguingly, both prediction models included some common markers, such as the GTF2H2C gene, which was found in the best WB-based diagnostic panels in both prediction models.
Most of the 22 features of WB which were achieved by the LASSO method introduced as key factors in the pathogenesis of SARS-COV-2 or other viral infections in previous studies. For instance, ABCC11 is one of the ABC transporter genes whose genetic polymorphism has been associated with SARS-COV-2 infection (Yamamoto et al. 2021). Likewise, transcription of complement genes such as C1QC has been reported to be induced in response to SARS-COV-2, while higher expression has been observed in the lung relative to blood (Daamen et al. 2021;Lazara et al. 2021). A protein-protein interaction analysis also showed that CDCA5, a gene involved in the cell cycle, could be one of the significant characteristics of SARS-COV-2 infection (Mo et al. 2021). The GSTM1 gene, another introduced biomarker increases the risk of various oxidative stress-related multifactorial disorders, it might therefore play a role in susceptibility to SARS-COV-2 infection, as well (Abbas et al. 2021). Although the association of some biomarkers such as SCN5A and BEGAIN with SARS-COV-2 infection has been previously reported (Guidicessai et al. 2020;Thair et al. 2020b), their specific function in SARS-COV-2 pathogenesis needs to be experimentally clarified. Furthermore, the importance of some features like GOL-GA8M and SLC24A5 has been stated in the other viral infections or autoimmune diseases. GOLGA8M, which encodes Golgin A8 family member M, contributes to the development of HBV-related HCC (Jiang et al. 2020). The researchers have shown that the transport protein SLC24A5 induces a significantly higher frequency of CD8 T cell activation in an autoimmune disease like Alopecia areata (Wang et al. 2019).
The NP-based diagnostic biomarker panels that could discriminate COVID-19 patients from non-COVID individuals were obtained from 23 features. Among them, the overexpression of pro-inflammatory chemokines and cytokines is a key hallmark of SARS-COV-2-induced pulmonary complications, which contribute to a cytokine storm (Callahan et al. 2021). The gene CXCL10 as a pro-inflammatory chemokine, which was presented in the best panels of both prediction models, could be one of the main mediators involved in the SARS-COV-2-related cytokine storm. The expression level of CXCL10 has shown a strongly significant positive correlation with viral load and progression of COVID-19, thus it could be also used as a biomarker for COVID-19 acute respiratory distress syndrome (ARDS) patients (Oliviero et al. 2021;Hemida et al. 2010). Furthermore, the significant overexpression of CXCL11 transcripts has been demonstrated in patients with mild to severe disease, which may be related to different T cell responses in COVID-19 patients . Macrophage migration inhibitory factor (MIF), another biomarker candidate, has an important role in the inflammatory response to SARS-COV-2 infection by inducing the pulmonary inflammatory cytokines (Dheir et al. 2021). In addition, it has been observed that the level of MIF is higher in severe patients with COVID-19 (Aksakal et al. 2021). Therefore, MIF was also identified as a biomarker for determining the patients with COVID-19 ARDS (Bleilevens et al. 2021). We also identified that the interferon-stimulated genes (ISGs) such as IFIT2, IFI6, and IFI44L were upregulated in COVID-19 patients. Studies have shown that IFIT2 and IFI44L are strongly induced in bronchoalveolar lavage (BAL) cells and peripheral blood mononuclear cells (PBMCs) of COVID-19 patients, leading to enhance antiviral and immune modulatory functions (Zhu et al. 2020;Shaath et al. 2020). Transcriptome analysis of PBMCs also highlighted the potential role of IFI6 in responses to SARS-COV-2 in comparison to other respiratory viruses (Shaath et al. 2020;Qi et al. 2021). According to our findings, IL1R2 is a downregulated NP marker for COVID-19 patients. IL1R2, known as anti-inflammatory cytokines, is highly expressed in monocyte-macrophages from BALF and follicular regulatory T (TFR) cells; however, TFR cells have been reported to be significantly lower in hospitalized COVID-19 patients (Meckiff et al. 2020;Xu et al. 2020). LAMB3 which is present in anchoring junctions of epithelial cells was among NP-based diagnostic biomarkers. The human papillomavirus penetrates the epithelial barrier by inducing some changes in the LAMB3 of anchoring junctions ). Evidence has shown that blockers of anchoring junction proteins such as LAMB3 could prevent COVID-19 infection (Doehn et al. 2021). SAMHD1 is the molecule that controls the cellular deoxyribonucleoside triphosphates (dNTP) pool and has an inhibitory effect on HIV-1 replication by reducing the concentration of intracellular dNTP pools (Monit et al. 2019;Buffone et al. 2019). It has been suggested that SAMHD1 may be associated with neurological complications of COVID-19 (Khan and Sergi 2020). XIAP-associated factor 1 (XAF1), as a novel binding ligand of XIAP, can reverse XIAP's antiapoptotic activity (Gao et al. 2021). Zhu et al. reported that the expression of XAF1 in T, B, NK, and DC cells from patients with COVID-19 and influenza was upregulated, thus it may increase apoptosis in T-cells of COVID-19 patients (Zhu et al. 2020).
The best final host response-based markers were acquired using a 2-phase ML approach which employed k-fold cross validation on discovery sets (80% of the population) in the first step by all 3 to 9 combinations of selected features. The best results were then validated on independent test datasets (20% of the population), as well. Data integration from several different laboratories followed by a 2-phase ML method made the final host response-based biomarker panels more powerful, robust, and generalized. The optimal WBand NP-based diagnostic panels which have the minimum number of markers with maximum accuracy included 6 and 8 markers with an accuracy of 97% and 88% in both two phases.

Conclusion
Distinct transcriptome profiles of different SARS-COV-2-infected tissues relative to HCs and other pathophysiological conditions shed the light on the necessity of SARS-COV-2-specific drug design. Comparison of gene expression profiles of WB and NP samples with HCs demonstrated that expression of some genes is exclusively altered in WB or NP samples. Intriguingly, some of them like genes involved in the cell cycle showed a remarkably opposite expression pattern which means that the pathogenesis of COVID-19 may be body-site-specific. On the other hand, SARS-COV-2 induces a global and systematic host response according to common gene signatures in WB and NP e.g. the genes involved in INF and cytokine pathways, demonstrating the disease's complexity, as well. We introduced and validated host-responsebased diagnostic biomarkers using ML methods which could be applied as a complementary tool to diagnose the COVID-19 infection from non-COVID cases.