MarkerGeneBERT: A Natural Language Processing System for Cell Marker Extraction


Summary

MarkerGeneBERT is a natural language processing (NLP) system designed to extract cell markers from the literature associated with single-cell RNA sequencing (scRNA-seq) studies. The system identifies species, tissues, cell types, and marker genes. Its sentence classifier was trained on 27,323 manually reviewed sentences, and the extracted markers reached 76% completeness and 75% accuracy against manually curated databases. The findings highlight the effectiveness of NLP techniques in accelerating data annotation in scRNA-seq research.

Full Transcript


A natural language processing system for the efficient extraction of cell markers

Peng Cheng 1, Yan Peng 1, Xiao-Ling Zhang 1, Sheng Chen 1, Bin-Bin Fang 1, Yan-Ze Li 1* & Yi-Min Sun 1,2*

1 Marketing and Management Department, CapitalBio Technology, Beijing 100176, China. 2 National Engineering Research Center for Beijing Biochip Technology, Beijing 102206, China. *email: [email protected]; [email protected]

Scientific Reports (2024) 14:21183 | https://doi.org/10.1038/s41598-024-72204-6

Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes and relies heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract critical information from the literature regarding species, tissues, cell types, and cell marker genes in the context of single-cell sequencing studies. Leveraging MarkerGeneBERT, we systematically parsed full-text articles from 3702 single-cell sequencing-related studies, yielding a comprehensive collection of 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparative analysis against manually curated databases demonstrated that our approach achieved 76% completeness and 75% accuracy, while also unveiling 89 cell types and 183 marker genes absent from existing databases. Furthermore, we successfully applied the brain tissue marker gene list compiled by MarkerGeneBERT to annotate scRNA-seq data, yielding results consistent with the original studies. Conclusions: Our findings underscore the efficacy of NLP-based methods in expediting and augmenting the annotation and interpretation of scRNA-seq data, providing a systematic demonstration of the transformative potential of this approach. The 27,323 manually reviewed sentences used for training MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.

Keywords: scRNA-seq, natural language processing, cell marker

Single-cell sequencing technology has pioneered a burgeoning field of research across numerous species and tissues due to its exceptional resolution at the single-cell level [1]. This advancement has laid the foundation for comprehensive exploration of cellular landscapes, allowing for precise delineation of all cell types within distinct tissues and organs. Achieving a thorough annotation of diverse cell types necessitates identifying potential cell types within tissues and then aggregating the corresponding cell type marker genes via comprehensive literature review or by referencing existing databases. Notably, existing tools such as CellAssign [2] and scCATCH [3] provide coarse-grained annotation by leveraging such databases [4-6]. Additionally, various databases, including CellMarker2.0 [7], PanglaoDB [8], singleCellBase [9], PCMDB [10], and CancerSEA [11], have been established, offering extensive collections of cell markers for different species and tissue types. These databases are predominantly sourced through manual review and curation of scientific articles, enabling the acquisition of highly accurate marker genes; however, this approach demands substantial human effort and time.

Numerous text-mining-based methodologies have been implemented in various research fields to identify entities of interest and discern the relationships between these entities by parsing syntactic dependencies within the text.
For instance, Shetty et al. developed a language model called MaterialsBERT, trained on 2.4 million abstracts from the polymer literature, to autonomously extract various properties of organic and polymer materials from literature abstracts [12]. Gu et al. employed a pretrained NLP text mining system called MarkerGenie to identify entities of interest, such as diseases, microbiomes, genes, and metabolites, mentioned in texts; after entity identification, the system parses the syntactic structure of the text and extracts contextual features for each word, thereby distinguishing between the types of relationships (diagnostic, predictive, prognostic, predisposing, or treatment related) among diseases, microbiomes, genes, and metabolites [13]. Naseri et al. utilized an NLP pipeline to identify pain-related medical terms from largely unstructured and non-standardized clinical consultation notes, subsequently predicting pain scores based on the recognized pain terms [14]. Doddahonnaiah et al. utilized a precompiled cell type and gene vocabulary to assess the correlation between gene and cell type entities by calculating their co-occurrence frequency within more than 26 million biomedical documents [15]. In conclusion, these published methods provided a more efficient and comprehensive analysis of research articles than manual curation by aiding in the identification of rare or novel entities of interest along with their interrelations.

In this study, we present MarkerGeneBERT, an NLP-based system designed for the automatic extraction of cell marker genes from single-cell sequencing studies. Leveraging biomedical corpora such as CRAFT [16], JNLPBA [17], and BIONLP13CG [18], along with a text classification model trained on a manually curated dataset of 27,323 sentences, MarkerGeneBERT automatically identifies cell and gene entities while removing false positive associations. We collected 3702 single-cell sequencing articles published from January 2017 to June 2023 from free-text PubMed and PubMed Central, fed them into MarkerGeneBERT to extract cell marker genes, and validated the findings against manually curated databases. Moreover, we applied our marker gene list with scCATCH for cell cluster annotation in brain tissue samples, yielding results consistent with prior studies. An overview of MarkerGeneBERT is given in Fig. 1; it consists of four main components: literature retrieval, extraction of marker-related sentences, establishment of cell-marker associations, and inference of species, tissue, and disease information within the articles.

Methods

Data collection

The main texts of single-cell RNA sequencing studies were downloaded and parsed from free-text PubMed and PubMed Central. Specifically, we employed the R package "RISmed" [19] to retrieve literature using the search terms "Animals"[MeSH Terms] AND "Single-Cell Analysis"[MeSH Terms] OR "single-cell" AND "expression" within a specified time frame. These rigorous rules enabled us to obtain a comprehensive collection of PMIDs from single-cell research-related studies.
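The paper performs this retrieval step with the R package "RISmed". Purely as an illustration of the query logic, the following is a minimal Python sketch using Biopython's Entrez E-utilities wrapper instead; the email address, date window, and retmax value are placeholders, not details taken from the paper.

```python
# Rough Python stand-in for the PMID retrieval step described above.
# The paper uses the R package "RISmed"; this sketch substitutes Biopython's
# Entrez E-utilities wrapper and only illustrates the query logic.
from Bio import Entrez

Entrez.email = "your.name@example.org"  # placeholder; NCBI requires a contact address

QUERY = (
    '"Animals"[MeSH Terms] AND "Single-Cell Analysis"[MeSH Terms] '
    'OR "single-cell" AND "expression"'
)

def fetch_pmids(query: str, mindate: str, maxdate: str, retmax: int = 100000) -> list[str]:
    """Return PMIDs matching the query within a publication-date window."""
    handle = Entrez.esearch(
        db="pubmed",
        term=query,
        datetype="pdat",
        mindate=mindate,
        maxdate=maxdate,
        retmax=retmax,
    )
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

if __name__ == "__main__":
    pmids = fetch_pmids(QUERY, mindate="2017/01/01", maxdate="2023/06/30")
    print(f"{len(pmids)} candidate single-cell studies")
```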
Subsequently, using the R package "easyPubMed" [20], we acquired basic information such as titles, abstracts, and literature sources for each PMID. For literature sourced from PMC, we utilized the R package "europepmc" to retrieve the main-text documents and systematically extracted sections including the introduction, methods, and results. For other manually collected literature in PDF format, we employed the Python library "scipdf_parser" to parse the PDF files and extract the pertinent sections, such as the introduction, methods, and results, based on the parsed outcomes.

Fig. 1. The pipeline of MarkerGeneBERT for extracting cell marker genes from the literature.

Marker-related sentence classification model

Supervised training data generation for marker-related sentence classification

To identify marker-related sentences in the main text of the literature, specifically those containing both cell and gene names with a particular syntactic structure, such as "Gene A is a marker of Cell B" or "Gene A (specific to Cell B)", we constructed a text classification model based on spaCy [21] and its "textcat" component, trained on a manually annotated marker-related dataset curated by our team.

Specifically, a total of 62,000 main-text sentences were initially collected from approximately 900 single-cell RNA sequencing studies. More than ten bioinformatics engineers with expertise in single-cell research then manually screened these sentences to isolate those containing cell marker genes from the raw sentences encompassing both cells and genes, narrowing the 62,000 initial sentences down to 27,323. Following this, the 27,323 sentences were randomly shuffled and redistributed to the aforementioned bioinformatics engineers, according to predefined rules (Table 1) and their personal expertise, for manual labeling. Collation and review of the annotated sentences were conducted by two senior bioinformatics engineers, and any sentences with disputed annotations were discussed and, where necessary, re-annotated.

Table 1. Marker-related sentence annotation rules
Example | Rule | Label
Differential expression of gene A between interneurons derived from culture A and culture B | Genes and cells are associated, but differential expression is not equivalent to specific expression | 0
Cell A expressed mesenchymal markers gene B | Genes and cells are associated; gene B is a marker of cell A | 1

Text preprocessing of marker-related sentences

Text preprocessing is a traditionally important step in NLP tasks: it transforms text into a more digestible form so that machine learning algorithms can perform better. Specifically, the 27,323 sentences were input into the SciBERT model [22], and the tokenizer and parser components of the SciBERT model were used for part-of-speech tagging and syntactic dependency parsing of the sentences. This process generated contiguous spans of tokens, including words, punctuation symbols, and whitespace. Non-gene-entity tokens were subsequently lemmatized and converted to lowercase. Additionally, tokens classified as stop words, punctuation marks (excluding parentheses), or numerical values were filtered out. Lastly, based on prior knowledge, we selectively retained parentheses only when the token inside or preceding the parentheses constituted a gene name. We performed this preprocessing on each sentence, and only the cleaned sentences were used for training the text classification model.
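As a rough sketch of this cleaning step, assuming a generic spaCy pipeline in place of the SciBERT tokenizer/parser described above and a small hypothetical gene vocabulary, the filtering rules could look like this:

```python
# Minimal sketch of the sentence-cleaning step, assuming a generic spaCy
# pipeline as a stand-in for the SciBERT tokenizer/parser and a precompiled
# gene vocabulary (gene_vocab) for the gene-token check.
import spacy

nlp = spacy.load("en_core_web_sm")          # stand-in model
gene_vocab = {"AQP4", "GFAP", "THY1"}       # hypothetical gene symbol set

def clean_sentence(sentence: str) -> str:
    doc = nlp(sentence)
    kept = []
    for i, tok in enumerate(doc):
        if tok.text in gene_vocab:
            kept.append(tok.text)           # keep gene symbols verbatim
            continue
        if tok.text in "()":
            # keep parentheses only when a gene name sits inside or just before them
            prev_is_gene = i > 0 and doc[i - 1].text in gene_vocab
            next_is_gene = i + 1 < len(doc) and doc[i + 1].text in gene_vocab
            if prev_is_gene or next_is_gene:
                kept.append(tok.text)
            continue
        if tok.is_stop or tok.is_punct or tok.like_num:
            continue                        # drop stop words, punctuation, numbers
        kept.append(tok.lemma_.lower())     # lemmatize and lowercase everything else
    return " ".join(kept)

print(clean_sentence("We detected high levels of AQP4 and GFAP (astrocyte specific)."))
```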
Marker-related sentence classification model construction

The TextCategorizer ("textcat") component of the spaCy natural language processing library served as the basis of our marker-related sentence classification model. The model combines a bag-of-words approach with a neural network; we set the "vectors" parameter to "en_core_web_trf", a transformer pipeline for English text, while keeping default values for the other parameters. Using the 27,323 retained sentences as the training dataset, the trained model outputs a probability value for each sentence, which is used to assess how credibly the sentence links cell types and marker genes.

To determine an appropriate probability threshold for distinguishing marker-related sentences, the training dataset was evenly divided into 10 parts, ensuring a 1:1 ratio of label 0 and label 1 in each subset. A tenfold cross-validation approach was then employed, in which 9 parts were used to train the text classification model while the remaining part served as the validation set for evaluating the model's performance. The sentences from the validation set were input into the model, yielding a predicted probability corresponding to the likelihood of each sentence being marker-related. We evaluated the precision and recall of the model at different probability thresholds and calculated the F1 score; based on the variation of the F1 score across thresholds, an appropriate threshold was selected.

Entity extraction

Named entity recognition (NER)

As shown in Table 2, each scispacy [23] NER model was originally built to identify distinct entity types. Employing the spaCy Python package, we integrated these four NER models with default parameters to comprehensively extract cell, species, tissue, and disease entities. Our primary focus was not on optimizing text-to-token conversion efficiency; for instance, when extracting cell entities, we amalgamated the tokenization results from the various NER models rather than relying on partial outputs, in order to improve the completeness of entity extraction. Consequently, we disabled the "tagger", "parser", "attribute_ruler", and "lemmatizer" components within the NER models to improve processing speed. On average, processing a single sentence required approximately 4 s. The total runtime scaled linearly with the number of articles, and peak memory consumption during execution was approximately 21 GB.

Table 2. NER models for identifying different entity types
NER model | Cell type | Species | Tissue | Disease
en_ner_jnlpba_md | √ | – | – | –
en_ner_craft_md | √ | √ | – | –
en_ner_bionlp13cg_md | √ | – | √ | –
en_ner_bc5cdr_md | – | – | – | √
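A minimal sketch of how the four scispacy models in Table 2 might be loaded and their entity predictions pooled is given below; the disabled-component list simply mirrors the paper's description and is applied only to pipes actually present in the installed model versions.

```python
# Sketch of pooling entity predictions from the four scispacy NER models in
# Table 2. The model names are real scispacy packages (installed separately);
# the component-disabling mirrors the paper's speed optimization.
import spacy

MODEL_NAMES = [
    "en_ner_jnlpba_md",
    "en_ner_craft_md",
    "en_ner_bionlp13cg_md",
    "en_ner_bc5cdr_md",
]
SLOW_PIPES = ("tagger", "parser", "attribute_ruler", "lemmatizer")

models = {}
for name in MODEL_NAMES:
    nlp = spacy.load(name)
    to_disable = [p for p in SLOW_PIPES if p in nlp.pipe_names]
    if to_disable:
        nlp.select_pipes(disable=to_disable)   # keep only the NER-relevant pipes running
    models[name] = nlp

def extract_entities(sentence: str) -> list[tuple[str, str, str]]:
    """Return (model name, entity text, entity label) triples from all models."""
    hits = []
    for name, nlp in models.items():
        for ent in nlp(sentence).ents:
            hits.append((name, ent.text, ent.label_))
    return hits

for hit in extract_entities("The stem cell clusters were marked by enrichment of Lgr5 and Olfm4."):
    print(hit)
```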
Generation of gene vocabulary

The complete set of human and mouse protein-coding genes was obtained from the GTF file shipped with Cell Ranger v5.0.1, and gene entities were extracted by exact string matching against this vocabulary.

Cell entity recognition

First, each sentence was parsed and cell names were extracted using three NER models independently (Fig. 2). Specifically, the "en_ner_craft_md" model identified entities with the entity type "CL" as cell names, the "en_ner_jnlpba_md" model recognized entities with the entity types "CELL_TYPE" and "CELL_LINE" as cell names, and the "en_ner_bionlp13cg_md" model identified entities with the entity type "CELL" as cell names. Subsequently, we performed exact string matching on the same sentence using the comprehensive cell names obtained from the Cell Ontology database [24]. Finally, we retained the cell names extracted by at least two of these sources as the cell names present in the respective sentence.

To alleviate potential limitations in capturing cell names comprehensively, specifically for instances such as "CD4+ T cell" where the three models may extract disparate cell entities, we compared and completed the cell names identified by different models at the same position within the text. For example, where two models extracted "CD4+ T cell" and "T cell" as cell entities at the same position, we completed "T cell" to "CD4+ T cell".

Fig. 2. An example of NER output identifying and classifying named entities in the sentence "The stem cell [CL] clusters were marked by enrichment of Lgr5 [GGP], Olfm4 [GGP], and Ascl2 [GGP]." CL indicates the cell line and GGP indicates the gene or gene product.
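A minimal sketch of the two-source consensus and span-completion logic described above follows; the Span structure and source names are illustrative assumptions, with the spans presumed to come from the NER models and the Cell Ontology matcher.

```python
# Minimal sketch of the cell-name consensus step: keep a mention only if at
# least two sources report an overlapping span, then widen nested mentions
# (e.g. "T cell" inside "CD4+ T cell") to the longest overlapping span.
from collections import namedtuple

Span = namedtuple("Span", "start end text source")

def consensus_cells(spans: list[Span], min_sources: int = 2) -> list[str]:
    kept = []
    for s in spans:
        overlapping = [o for o in spans if not (o.end <= s.start or o.start >= s.end)]
        sources = {o.source for o in overlapping}
        if len(sources) >= min_sources:
            # complete the name to the longest overlapping mention at this position
            longest = max(overlapping, key=lambda o: o.end - o.start)
            if longest.text not in kept:
                kept.append(longest.text)
    return kept

spans = [
    Span(0, 11, "CD4+ T cell", "en_ner_jnlpba_md"),
    Span(5, 11, "T cell", "en_ner_bionlp13cg_md"),
    Span(5, 11, "T cell", "cell_ontology_match"),
]
print(consensus_cells(spans))   # ['CD4+ T cell']
```

Requiring agreement from at least two sources trades a little recall for precision, which matches the paper's emphasis on removing false positive cell mentions.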
A full-text-based strategy for extracting species and tissue entities

We employed a full-text-based strategy in which the literature was divided into sections such as the abstract, methods, and results; entity recognition was performed with the NER models on each section, followed by comprehensive analysis and judgment.

Species entity recognition

The extraction of species entities primarily relied on MeSH (Medical Subject Headings) terms, the controlled vocabulary thesaurus used by the National Library of Medicine (NLM) for indexing articles in PubMed. For each study, we utilized the "en_ner_craft_md" model to identify species entities from the MeSH terms. If no species entities were identified from the MeSH term text provided by PubMed, we performed species entity recognition based on the overall structure of the full text: the "en_ner_craft_md" model was applied separately to the title, the Methods section, and the first paragraph of the Results section, and the most frequently occurring species entity was selected as the species studied in the respective literature.

Tissue entity recognition

We utilized the "en_ner_bionlp13cg_md" model for the recognition of tissue entities. Specifically, for each study, we identified tissue entities separately from the MeSH term text, the title, and the sentences within the full text containing keywords related to single-cell sequencing, such as "single-cell" and "dissociation". If a tissue entity was identified in all three of these text sections, it was considered the correct tissue type. Otherwise, we supplemented the recognition of tissue entities by analyzing the first paragraph of the Results section and the Methods section of the article. We calculated the frequency of each entity extracted from the different text sources and ranked them accordingly. Additionally, we determined the frequency of co-occurrence between each entity and keywords related to single-cell sequencing in the same sentence, as well as the frequency of co-occurrence between each entity and all cell entities identified in the literature. The top two tissue types based on the cumulative ranks from these three ranking results were considered the candidate tissue types for the literature.

Disease entity recognition

We employed the "en_ner_bc5cdr_md" model to identify disease entities from the title, which were considered the disease types studied in the respective literature. If no disease entities were detected, the literature was assumed to be "normal" by default.

Cell type-gene relation classification

To extract cell marker genes from marker-related sentences, we began by retaining sentences that contained both cell and gene names, as identified through entity recognition. These sentences then underwent text preprocessing before being input into the text classification model. Upon surpassing the predetermined probability threshold, the sentences were further classified into two types: those conducive to extracting cell-gene relationship pairs based on predefined rules, and those necessitating manual extraction of cell-gene relationship pairs (Table 3).

Table 3. Classification of sentence extraction methods
Relation classification | Keyword | Example
Extracting cell-gene relationship pairs based on predefined rules | – | We noticed that in this group of cells we are indeed detecting high levels of genes such as AQP4 and GFAP (astrocyte specific) as well as THY1, SYT1, and STMN2 (neuron specific)
Manual extraction | Small, non, not, non, respectively, neither, Neither, distinguish, nor, no, decreased, decrease, downregulated, downregulate, weak, absent, absented, low, lack, lacks, lower, few | The differentiated subcutaneous adipocytes from Fam13a KO mice also expressed a marginally but not significantly higher level of beige adipocyte markers (e.g. Pgc1a and Ucp1)

For sentences meeting the criteria for rule-based extraction, the tagger and parser components of the SciBERT model were employed to parse the syntactic structure of the sentences and generate a syntactic dependency tree (Fig. 3). Each syntactic dependency tree consists of numerous subtrees, where a subtree is defined as a token together with all of its syntactic descendants. This delineates the relationships between tokens and allows cell-gene relationship pairs located within the same subtree to be extracted. Additionally, sentence structures conforming to the pattern "cell name (gene name)" were directly selected for the extraction of cell-gene pairs.

Fig. 3. An example of a syntactic dependency tree for the sentence "A cell have high express of pericyte markers Abcc9 and Kcnj8, while B cell expressed the smooth muscle cell marker Acta2", showing dependency subtrees and part-of-speech tags.
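An illustrative sketch of the same-subtree rule is shown below, using a plain spaCy English parser as a stand-in for the SciBERT tagger/parser and assuming the cell and gene mentions have already been identified by the earlier steps; pairing each gene with the cells in its smallest enclosing subtree is one simple reading of the rule, not the authors' exact implementation.

```python
# Illustrative sketch of the same-subtree rule described above. A plain spaCy
# English parser stands in for the SciBERT tagger/parser, and the cell/gene
# mentions are assumed to come from the earlier NER and vocabulary-matching steps.
import spacy

nlp = spacy.load("en_core_web_sm")

def subtree_pairs(sentence: str, cells: set[str], genes: set[str]) -> set[tuple[str, str]]:
    doc = nlp(sentence)
    # every subtree = a token together with all of its syntactic descendants
    subtrees = [doc[t.left_edge.i : t.right_edge.i + 1].text for t in doc]
    pairs = set()
    for gene in genes:
        # smallest subtree containing this gene together with at least one cell name
        candidates = [s for s in subtrees if gene in s and any(c in s for c in cells)]
        if candidates:
            smallest = min(candidates, key=len)
            pairs.update((c, gene) for c in cells if c in smallest)
    return pairs

sent = "The stem cell clusters were marked by enrichment of Lgr5, Olfm4, and Ascl2."
print(subtree_pairs(sent, cells={"stem cell"}, genes={"Lgr5", "Olfm4", "Ascl2"}))
# pairs each gene with "stem cell"
```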
Statistics

All statistical analyses were performed in R (version 4.1). The performance of the marker-related sentence classification model was evaluated using the precision, recall, and F1 score of the predicted labels compared with the ground-truth labels.

Results

Identification of gene and cell entities using MarkerGeneBERT

Pretrained NER models for entity extraction have proven effective in various research fields. MarkerGeneBERT integrates three pretrained NER models built on diverse biomedical corpora. Additionally, we incorporated cell names curated from the Cell Ontology database for exact string matching. Because gene names are standardized, MarkerGeneBERT used gene symbols sourced exclusively from the GTF file in Cell Ranger for gene entity recognition. Further details can be found in the Methods section.

As detailed in the Methods section, the 27,323 sentences labeled with cell and gene names and manually annotated by our team for the marker-related sentence classification model were used to validate the performance of "en_ner_bionlp13cg_md", "en_ner_craft_md", "en_ner_jnlpba_md", and MarkerGeneBERT in identifying cell and gene entities. Compared with the three pretrained NER models used individually, MarkerGeneBERT demonstrated higher precision and recall in the extraction of cell and gene names (Table 4). Specifically, for gene name identification, MarkerGeneBERT achieved an F1 score of 87% (precision: 78%, recall: 99%), surpassing the second-best model by 20 percentage points. For cell name identification, MarkerGeneBERT obtained an F1 score of 92% (precision: 86%, recall: 98%), outperforming the second-best model by 8 percentage points and thus representing the best trade-off between precision and recall.

Table 4. Performance of various NER models in identifying gene and cell entities
NER model | Gene P (%) | Gene R (%) | Gene F1 (%) | Cell type P (%) | Cell type R (%) | Cell type F1 (%)
en_ner_bionlp13cg_md | 58 | 80 | 67 | 84 | 82 | 83
en_ner_craft_md | 64 | 70 | 67 | 93 | 58 | 72
en_ner_jnlpba_md | – | – | – | 90 | 79 | 84
MarkerGeneBERT | 78 | 99 | 87 | 86 | 98 | 92
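For reference, entity-level precision, recall, and F1 of the kind reported in Table 4 can be computed from per-sentence gold and predicted mention sets; the following is a generic illustration with made-up data, not the authors' evaluation script.

```python
# Illustrative scoring of predicted entity mentions against gold annotations,
# of the kind summarized in Table 4. The per-sentence sets are hypothetical.
def precision_recall_f1(gold: list[set[str]], pred: list[set[str]]) -> tuple[float, float, float]:
    tp = sum(len(g & p) for g, p in zip(gold, pred))   # correctly predicted mentions
    fp = sum(len(p - g) for g, p in zip(gold, pred))   # spurious predictions
    fn = sum(len(g - p) for g, p in zip(gold, pred))   # missed gold mentions
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"Lgr5", "Olfm4"}, {"Acta2"}]
pred = [{"Lgr5", "Olfm4", "Ascl2"}, {"Acta2"}]
print(precision_recall_f1(gold, pred))   # (0.75, 1.0, ~0.857)
```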
Cell-biomarker associative binary classification

We introduced a supervised marker-related text classification model to determine which sentences include not only cell and gene entities but also specific syntactic patterns indicating that a gene is a marker of a cell. More details about the model and the construction of the training dataset are available in the Methods section.

To evaluate the performance of the marker-related text classification model in distinguishing such syntactic patterns, we partitioned the training dataset into 10 subsets, randomly selecting 9 subsets for model training and reserving one subset for validation. The evaluation results depicted in Fig. 4A showed a mean average precision (mAP) of 0.876 (range 0.84-0.91), a mean precision of 0.844 (range 0.80-0.90), and a mean recall of 0.734 (range 0.56-0.78). After processing by the model, each sentence receives a predicted probability value, and a sentence is classified as marker-related if this probability exceeds the threshold; the threshold setting is therefore critical for the performance of the model. We calculated the F1 score for different thresholds, as illustrated in Fig. 4B, and selected a threshold of 0.7, at which the F1 score achieved optimal performance across the different validation sets.

Fig. 4. Evaluation of the marker-related text classification model. (A) Precision-recall curves on the validation set for the ten cross-validation iterations (average precision 0.84-0.91). (B) F1 scores at different probability cutoff values; the mean F1 score is greatest at the vertical red line.

For the marker-related sentences whose predicted probability was greater than 0.7, we employed syntactic structure-based analysis within each sentence to identify and extract reliable cell-marker relationship pairs; the extraction criteria are described in detail in the Methods section. In addition, we employed the appropriate NER models, as shown in Table 2, to assess the species, organ, and disease information in each study. Further details are provided in the Methods section.

Statistics of the NLP system extraction results

We employed MarkerGeneBERT to extract 3280 cell types and 16,124 genes from 3702 literature sources (Supplemental Table 1). Compared with existing databases manually curated by domain experts over the years, our model achieved competitive retrieval results (Table 5). The maximum memory footprint of our system, including all scripts and models, was 21 GB, and the parsing and entity extraction of one paper could be completed in about 7 min.

Table 5. Comparison of inclusion results across different databases
Database | Literature | Source | Type | Date | Species | Tissue | Disease | Cell | Marker
CellMarker2.0 | scRNA-seq (1945) & Experiment (4040) & Review (354) | Full text | Manually curated | 2022 | Human & Mouse | 828 | 374 (cancer type) | 3149 | 27,122
PanglaoDB | scRNA-seq (1054) | Full text | Manually curated | 2020.3 | Human & Mouse | 29 | – | 178 | 8286
scALE | 26 million biomedical documents | Main text | NLP | 2021 | Human & Mouse | Pan tissue | – | 556 | –
MarkerGeneBERT | scRNA-seq (3702) | Main text | NLP | 2023.6 | Human & Mouse | 729 | 721 | 3280 | 16,124

Concordance between MarkerGeneBERT and manually curated databases

To validate the accuracy of the system in detecting cell entities, gene entities, cell-marker pairs, species, tissue, and disease information, we conducted a comparison with CellMarker2.0, widely recognized as the gold standard for manual curation. As our methodology chiefly extracts gene markers from the main text, we specifically compared gene markers from the 1027 articles present in both CellMarker2.0 and our database. Other articles were excluded for reasons such as unavailability for download or because the markers were sourced from supplemental materials; additional details are available in Supplemental Fig. 1.
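As a simple illustration of the per-study comparison just described, completeness and accuracy can be computed from the overlap between extracted and curated (cell type, marker gene) pairs; the pair sets below are hypothetical, and the real comparison also matches species and tissue.

```python
# Illustrative per-study concordance check against a curated database, in the
# spirit of the CellMarker2.0 comparison described above. The pair sets are
# hypothetical examples only.
def concordance(extracted: set[tuple[str, str]], curated: set[tuple[str, str]]) -> dict[str, float]:
    shared = extracted & curated
    return {
        "completeness": len(shared) / len(curated) if curated else 0.0,  # curated pairs recovered
        "accuracy": len(shared) / len(extracted) if extracted else 0.0,  # extracted pairs confirmed
    }

curated = {("astrocyte", "GFAP"), ("astrocyte", "AQP4"), ("neuron", "SYT1")}
extracted = {("astrocyte", "GFAP"), ("astrocyte", "AQP4"), ("neuron", "STMN2")}
print(concordance(extracted, curated))  # completeness 0.67, accuracy 0.67
```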
MarkerGeneBERT identifies most cell and gene entities recorded in databases

In these 1027 studies, CellMarker2.0 manually curated a total of 4646 cell types with 12,874 marker genes, of which the main text covered 3185 cell types and 8683 marker genes; approximately 84% of the valuable information was derived from the main text (Supplemental Fig. 2). MarkerGeneBERT identified 90.8% of the marker gene entities (7890/8683) and 92.7% of the cell type entities (2954/3185) in these common studies (Fig. 5A).

Through a systematic comparison of the results extracted from each literature source with those of CellMarker2.0, MarkerGeneBERT revealed an additional 1764 cell types associated with marker genes (Fig. 5B). Among these 1764 newly identified cell types, 1344 were initially excluded by CellMarker2.0 in the corresponding literature but were reported in other studies of the same tissue. Notably, 89 cell types were not cataloged in CellMarker2.0 at all, primarily comprising tissue-specific cell types; these cells, including enteric mesothelial fibroblasts from the intestine and retinal progenitor cells from ocular tissue, exhibited low frequencies. Additionally, 302 cell types were present in CellMarker2.0 but not associated with the corresponding tissues. We categorized these 89 newly recorded cell types and 302 reported cell types according to their tissue information (Fig. 6). These cell types primarily represent functional cells distributed across different tissues. For instance, in the literature related to human gastric tissue, cancer-associated fibroblasts (CAFs), as central components of the tumor microenvironment in primary and metastatic tumors, profoundly influence the behavior of cancer cells and are involved in cancer progression through extensive interactions with cancer cells and other stromal cells [25]; our method directly records CAFs in both cancer and gastric tissues. The detailed cell marker information is available in Supplemental Table 2, and the additional cell types and marker genes identified by MarkerGeneBERT have been manually reviewed.

High consistency of the marker gene lists between MarkerGeneBERT and the database

For each study, we assessed the consistency of the cell marker genes identified by CellMarker2.0 and MarkerGeneBERT. As illustrated in Fig. 7, for approximately 47% of the cell types, the corresponding marker gene pairs were identical between the CellMarker2.0 database and MarkerGeneBERT. Additionally, for approximately 23% of the cell types, the marker genes extracted by MarkerGeneBERT were present in CellMarker2.0 and accounted for 87% of the corresponding marker genes recorded there. The extraction results fell short of 100% primarily because some cell types have multiple marker genes recorded within a single document, and MarkerGeneBERT may have filtered out some of these marker genes based on preset conditions (Supplemental Fig. 3). Even so, most such cell markers still showed a high level of precision, often reaching 100%. Overall, MarkerGeneBERT exhibited a high percentage of true positives, and there was a high level of consistency between the results extracted by MarkerGeneBERT and those in CellMarker2.0.

Fig. 5. Comparison of entity recognition between the CellMarker2.0 database and MarkerGeneBERT: (A) percentage of entities recognized per study; (B) cell types shared between the two resources (2954) versus newly identified by the model (1764).