Deconvoluting Multi-Person Biological Mixtures PDF

Forensic Science International: Genetics 71 (2024) 103030

Deconvoluting multi-person biological mixtures and accurate characterization and identification of separated contributors using non-targeted single-cell DNA sequencing

Lucie Kulhankova a, Eric Bindels c, Manfred Kayser a, *, 1,

a Department of Genetic Identification, Erasmus MC University
b Department of Cell Biology, Erasmus MC University Medical
c Department of Haematology, Erasmus MC University Medical

ARTICLE INFO

Keywords: Single-cell DNA sequencing, Single-cell ATAC sequencing, Mixture, Deconvolution, Genetic identification, Bio-geographic ancestry, Forensics

* Correspondence to: Erasmus MC University Medical Center Netherlands.
** Correspondence to: Erasmus MC University Medical Center
E-mail addresses: [email protected] (M. Kayser), [email protected] (E. Mulugeta).

2024; Accepted 4 March 2024

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

[...] two (or more) autosomal SNPs, were also explored for the purpose of forensic mixture deconvolution, especially when analyzed with targeted MPS. A general limitation is that the forensic application of these compound markers requires reference samples of the suspects to generate the needed reference data, since these alternative DNA markers are not included in criminal offender DNA databases. In addition, a major drawback of these methods is that they are generally suitable for two-person mixtures only.

A completely different approach to forensic mixture deconvolution is to separate the biological mixture into individual cells prior to forensic DNA profiling. Various technologies such as FACS [25,26], laser microdissection [27,28], DEPArray [29–33], and optical tweezers have been previously tested for this purpose, all with limited success for various reasons [26, 33, 35–38].
The major limitation of all these approaches is that they attempt to obtain forensic DNA profiles from the DNA of single cells in a targeted way, either by standard genotyping or by MPS-based genotyping, i.e., by analysing only the specific DNA markers finally used in forensic DNA profiling from very limited DNA. This leads to the typical problems of low-template DNA analysis, namely allele drop-outs and drop-ins, resulting in unreliable or incomplete forensic DNA profiles. Furthermore, with these cell separation techniques, the selection of cells from a mixture is often guided by the investigator, which introduces a significant bias in the final forensic DNA profiling outcome.

In principle, the challenges faced in previous attempts to deconvolute human biological mixtures in forensic applications could be overcome with a technique that separates a mixture into its individual-specific cells prior to downstream forensic genetic analyses, and does so in an unbiased and non-targeted manner. In recent years, such a suite of technologies has emerged and is typically referred to as single cell sequencing. In single cell sequencing, the RNA or DNA of each cell is first labelled with barcodes, followed by non-targeted high-throughput sequencing of large parts of the transcriptome or genome of a cell using next generation sequencing technologies. Notably, these non-targeted single cell RNA or DNA sequencing techniques are substantially different from the targeted MPS analysis of forensic DNA markers previously applied to the problem of forensic mixture deconvolution. The distinction arises from their ability to generate sequencing data from large parts of the human transcriptome or genome, from which a large number of single nucleotide polymorphisms (SNPs) can be obtained. These SNPs are subsequently utilised for genetic mixture deconvolution and downstream forensic genetic analyses.
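To make the role of cell barcodes concrete, the following is a minimal Python sketch of how barcode-tagged allele observations could be collected into a per-cell SNP table and compared between two cells. It is an illustration only and not the De-goulash implementation: the record layout, the toy barcodes and SNP identifiers, and the function names `build_cell_snp_matrix` and `shared_snp_mismatch` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical input: one record per read overlapping a SNP position,
# given as (cell_barcode, snp_id, observed_allele). Toy values only.
observations = [
    ("AAAC-1", "chr1:1000", "A"),
    ("AAAC-1", "chr1:1000", "A"),
    ("AAAC-1", "chr2:500", "G"),
    ("TTTG-1", "chr1:1000", "C"),
    ("TTTG-1", "chr2:500", "G"),
]


def build_cell_snp_matrix(records):
    """Count observed alleles per (cell barcode, SNP)."""
    matrix = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for barcode, snp_id, allele in records:
        matrix[barcode][snp_id][allele] += 1
    return matrix


def shared_snp_mismatch(matrix, cell_a, cell_b):
    """Fraction of SNPs covered in both cells whose majority alleles differ.

    Returns None when the two cells share no covered SNP.
    """
    shared = set(matrix[cell_a]) & set(matrix[cell_b])
    if not shared:
        return None

    def majority(cell, snp):
        counts = matrix[cell][snp]
        # Ties are broken alphabetically so the result is deterministic.
        return max(sorted(counts), key=counts.get)

    differing = sum(majority(cell_a, s) != majority(cell_b, s) for s in shared)
    return differing / len(shared)


matrix = build_cell_snp_matrix(observations)
print(shared_snp_mismatch(matrix, "AAAC-1", "TTTG-1"))
```

In this toy input the two cells differ at one of their two shared SNPs. Cells from the same contributor would be expected to show low mismatch, which is the intuition behind clustering cells by their SNP genotypes.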
In the past years, several single-cell RNA and DNA sequencing techniques have been developed and applied in different biological fields for understanding cellular processes, cellular heterogeneity and the molecular mechanisms that drive this heterogeneity [40–43]. Until very recently, such single cell sequencing technologies had not been applied to the forensic problem of mixture deconvolution. In 2023, we presented a proof-of-principle study demonstrating the feasibility of non-targeted transcriptome-wide single-cell RNA sequencing (scRNA-seq) and our bioinformatics pipeline De-goulash for mixture deconvolution and downstream forensic genetic analyses [44, 45]. We showed that the genotype data of different sets of carefully selected SNPs, which De-goulash automatically selected from the mixture's scRNA-seq data, allowed the correct separation of the mixtures according to the individual contributors in up to 9-person balanced and imbalanced mixtures. We further demonstrated that different sets of SNPs automatically selected from the scRNA-seq data of the separated cell clusters allowed for accurate genetic characterization of the separated mixture contributors regarding biological sex and bio-geographic ancestry (maternal, paternal, and bi-parental), as well as their correct individual genetic identification with statistical certainty. Furthermore, this approach provided accurate identification of the cellular / tissue origin of the analysed mixture.

In our previous study, we pioneered the use of non-targeted single cell transcriptome sequencing for mixture deconvolution and [...]. However, sequencing RNA has limitations, particularly when [...]

[...] diluted with 1 vol of PBS with 2% FBS. Next, the sample was layered on Lymphoprep™ and centrifuged without brake for 20 min. The PBMC layer was extracted into PBS with 2% FBS and washed twice before filtering through a 40 µm cell strainer.
Cell integrity and count were assessed using a Countess 3 FL cell counter. Cells were then mixed in an even ratio and nuclei isolation was performed as recommended by 10x Genomics (CG000169revD) with a 3-minute lysis step and DNase treatment. The recovered nuclei were counted on the Countess and the appropriate number of nuclei was selected for library preparation depending on the requirements of the experiment.

2.2. Library preparation

scATAC-seq libraries were generated using the Chromium Next GEM Single Cell ATAC reagent kit v1.1 (10x Genomics). Quality and number of nuclei were verified by counting trypan blue positive nuclei using a haemocytometer and a Countess 3 FL (Thermo Fisher). Transposase incubation (Tn5) and subsequent library preparation were done with the Chromium Next GEM Single Cell ATAC reagent kit v1.1 (10x Genomics). Libraries were sequenced on a NovaSeq 6000 platform (Illumina), with a 50–8–16–50 cycle setting.

2.3. Data processing

The scATAC-seq mixture data were de-multiplexed using cellranger-atac mkfastq (version 2.0.0). Sequencing reads were aligned to the human genome (GRCh38) with the STAR aligner that is part of the Cell Ranger ATAC pipeline (version 2.0.0) using the cellranger-atac count command. The quality of the generated data was assessed based on several parameters: fragment distribution, fraction of high-quality fragments overlapping peaks, and estimated number of cells. Regions previously reported to result in poor alignment (blacklist regions) were removed using the list provided by ENCODE (the ENCODE blacklist regions). The resulting aligned bam file, as well as the cell barcodes from cell calling that contain true cells (filtered cell barcodes), were used for downstream analyses.

2.4. Deconvolution of biological mixtures

Mixture deconvolution was performed using the two-step deconvolution tool of the previously described De-goulash pipeline.
SNP calling, generation of the cell matrix and clustering, and variant calling were all performed as described previously, with the notable deviation of a reduced requirement for coverage per SNP (from 2 to 1), allowing for inclusion of the residual mitochondrial DNA (mtDNA) SNP information available with scATAC-seq data. A minimum of 10 mtDNA SNPs per cell (4 for scRNA-seq previously) was used for the first iteration and 60 whole-genome SNPs (20 for scRNA-seq previously) in the second iteration. For the first iteration, the mtDNA SNPs used for the deconvolution analysis were selected from the bulk data mtDNA SNPs freebayes call (freebayes -iXu -C 2 -q 1) with SNP quality of 80 and depth of 20. For UMAP, 300 neighbours were selected and the number of clusters was determined using NbClust. For the second iteration, the whole-genome SNPs were selected by using the SNP lists of each cluster established from iteration 1. Non-unique variants were removed using bcftools norm (version 1.9) [55,56]. The selected SNPs were then filtered for depth and quality (QUAL > 80, DP > 20) and used for clustering as described for iteration 1. Detailed information regarding the separation process can be found in the methods section of our previous study and on the tool's GitHub page. Since cluster numbers are assigned arbitrarily by the software, SNPs obtained from clusters in the different datasets were compared in order to match clusters in different datasets that are generated from similar individuals (S14 Table).

[...] model assumes linkage equilibrium and therefore the De-goulash pipeline performs a pruning (minimum of 0.5 cM between any included markers) of the overlapping markers prior to LR calculations. We further note that genetic linkage is not an issue for random match probability calculations.

2.7. In silico mixture preparation and analysis

i) Testing the limits of scATAC-seq in balanced mixtures

In silico balanced mixtures based on two publicly available single-individual scATAC-seq datasets (A1, A2, details in the data availability section of this manuscript) from 10x Genomics were created in various increments. Equal numbers of barcodes, ranging from 100 down to 5 cells for each of the two individuals, were selected randomly. Each dataset was filtered for reads containing the selected barcodes. The resulting datasets were then merged into two-person balanced mixtures, and mixture deconvolution was performed using the De-goulash pipeline. Due to the low number of cells used, the requirement for depth of SNPs and the number of SNPs observed per cell were adjusted (depth of 10; 4 SNPs required for mtDNA SNPs, 10 for autosomal SNPs). The UMAP neighbours were also adjusted in each run to match the number of cells available for UMAP (100 for the 100:100 dataset, 50 for the 50:50 dataset, 20 for the 20:20 dataset, 10 for the 10:10 dataset). The cluster assignment of each separated dataset was matched to the original single-individual data source using the original barcodes to establish that the obtained clusters are individual-specific.

For the subsequent genetic characterisation analysis based on the obtained individual-specific cell clusters, we adjusted the following De-goulash parameters to accommodate low cell numbers: the depth of SNPs was lowered to 10, the quality of SNPs was lowered to 50, and STRUCTURE was run on all available SNPs in a cluster. Biological sex, Y chromosome and mtDNA haplogroups were assessed as described above.

To obtain the true biological sex and biogeographic ancestry of the individuals used in the in silico mixtures, the initial individual scATAC-seq data, prior to their use in the in silico mixture, were used. For this, variants were called on the unmixed datasets using freebayes variant calling (freebayes -iXu -C 2 -q 1).
The SNPs were filtered for quality of 80 and depth of 20, and the modified De-goulash pipeline for analysis was run. All available SNPs were used for STRUCTURE with the downstream distance and allele frequency filter applied. MtDNA and Y chromosomal haplogroups were assigned using Haplogrep2 and Y-leaf as described above. The resulting ancestry and sex assignments were then compared to the corresponding findings from the bulk single-source data to establish the limits of the approach in assessing the true sex and ancestry after mixture deconvolution. Individual genetic identification could not be performed from the in silico mixtures due to the lack of reference data. However, the correct individual-specific mixture deconvolution was assessed based on the barcodes known to belong to one individual or the other.

ii) Testing the limits of scATAC-seq in imbalanced mixtures

Two publicly available single-individual scATAC-seq datasets from 10x Genomics (A1, A2, details in the data availability section of this manuscript) were used to create imbalanced mixtures with minor and major components of variable degrees. In total, 1000 cells were selected from both individual datasets together. For the minor component, cells with a high number of reads were selected to increase reliability. Cells from the two individual datasets were mixed in ratios ranging from 1:10 down to 1:90. From each dataset, the reads containing the chosen cell barcodes were filtered and the resulting subsets were then merged to create a mixed dataset. Subsequently, De-goulash was used to deconvolute these imbalanced mixtures. Taking into account the [...]

Fig. 1 (caption partially recovered). [...] genetic analyses with De-goulash analysis of scATAC-seq data. A) Experiment design. B) 3D [...] iteration. C) Percentage of sequencing reads mapping to the X-chromosomal XIST gene and D) to the [...] colour indications of inferred sex. C and D include a colour guide with the determined sex of each cluster: red [...] presumed biological male cluster. E) STRUCTURE results showing bi-parental biogeographic [...] for each of the 5 separated cell clusters together with the population reference data used (EUR: Eu[...], AFR: Sub-Saharan Africans). F) Percentage of selected human identity SNPs matching between [...] database for each of the 5 cell clusters / individuals. G) LR of individual genetic identification for [...] data; the green line represents the certainty threshold of 10E9. H) Percentage of selected human identity [...] based on the GSA reference database for each of the 5 cell clusters / individuals. I) LR of individual [...] based on the GSA reference database; the green line represents the certainty threshold of 10E9. Fig. 1A was [...]

Table 1 (partially recovered; dataset labels missing). [...] from two complex blood mixtures involving the same 5 individuals, respectively.

Fraction of high-quality fragments in cells | Mean raw read pairs per cell | Sequencing saturation
0.6456 | 74951.26 | 0.5213
0.5624 | 253377.5 | 0.6661

[...] used in a previous study. To this end, we removed the ENCODE blacklist region from the scATAC-seq data. As the previous analysis successfully demonstrated the accurate separation of cells into individual clusters, we could utilize the resulting separation data as individual data. Hence, we employed the distinct lists of cells from each dataset, which were assigned individually by De-goulash. Next, we randomly selected an increasing number of barcodes from each cluster-barcodes subset. We selected cell numbers between 10 and 400, and each selection was repeated 10 times. For each barcode selection we extracted the barcodes corresponding to the selected cells and called SNPs using FreeBayes v1.3.1 ("-iXu -C 2 -q 1"). We then filtered the SNPs using bcftools filter (QUAL > 80 and INFO/DP > 20) and counted the number of SNPs.
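The subsampling and filtering logic described here can be sketched in Python as follows. This is a hedged stand-in and not the authors' pipeline, which used FreeBayes and bcftools filter: the function names (`draw_barcode_subsets`, `passes_filter`, `count_passing_snps`), the fixed random seed, and the toy VCF lines are assumptions made for illustration. Only the thresholds (QUAL > 80, INFO/DP > 20) and the 10 repeats per cell number come from the text.

```python
import random


def draw_barcode_subsets(barcodes, sizes, repeats=10, seed=0):
    """Yield (size, repeat_index, subset) for random draws without replacement."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    for size in sizes:
        if size > len(barcodes):
            continue  # cannot draw more cells than the cluster contains
        for rep in range(repeats):
            yield size, rep, rng.sample(barcodes, size)


def passes_filter(vcf_line, min_qual=80.0, min_depth=20):
    """True if a VCF data line has QUAL > min_qual and INFO/DP > min_depth."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]  # VCF column 6: QUAL
    if qual == ".":
        return False
    # VCF column 8: INFO, a semicolon-separated list of KEY=VALUE entries
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    depth = info.get("DP")
    if depth is None:
        return False
    return float(qual) > min_qual and int(depth) > min_depth


def count_passing_snps(vcf_lines, min_qual=80.0, min_depth=20):
    """Count data lines (non-header) that pass the quality and depth filter."""
    return sum(
        1
        for line in vcf_lines
        if not line.startswith("#") and passes_filter(line, min_qual, min_depth)
    )


# Toy VCF lines (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); invented values.
example_vcf = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t95.2\t.\tDP=25;AF=0.5",  # passes both thresholds
    "chr1\t200\t.\tC\tT\t60.0\t.\tDP=40",         # fails QUAL
    "chr2\t300\t.\tG\tA\t120\t.\tDP=12",          # fails depth
]
print(count_passing_snps(example_vcf))
```

In a real run, each subset returned by `draw_barcode_subsets` would be used to extract the matching reads and call variants before counting. That step is omitted here because it depends on external tools.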
The results from each cell number point were then averaged for the corresponding sequencing and cluster.

2.9. Genomic localization of SNPs

To determine the genomic location of SNPs in scRNA-seq and scATAC-seq, we used a scRNA-seq dataset with a mixture of four individuals (dataset M4 in our previous study), the two scATAC-seq mixtures generated for this study (S1 Table), and the two references (WES and GSA data). The scATAC-seq datasets were processed by excluding ENCODE blacklist regions. In all datasets, SNPs were called using FreeBayes v1.3.1 with the arguments "-iXu -C 2 -q 1". The variant/SNP list was further filtered using bcftools filter (QUAL > 80 and INFO/DP > 20). SnpEff v4.3 was used to annotate and analyse the resulting variant list (java -Xmx8g -jar snpEff.jar hg38). Finally, the count of variant types and locations in the different datasets was recorded and compared.

3. Results

3.1. Deconvolution of complex mixture and down-stream forensic DNA profiling and ancestry

In order to investigate the suitability of scDNA-seq in general, and scATAC-seq in particular, for deconvoluting complex biological mixtures on the individual-specific level, we first generated a balanced blood mixture comprising five individuals with an equal contribution. The five mixture contributors were selected to cover both sexes and diverse biogeographic origins from the same and different continental ancestries, i.e., two European females, one European male, one African-European admixed male, and one East-African male. After nuclei isolation, the mixture of these 5 individuals was subjected to scATAC-seq using 10x Genomics library preparation followed by sequencing (Fig. 1A), generating the dataset scATAC-H.
After sequencing, 3256 nuclei were recovered and analysed (Table 1; for extended information refer to S1 Table) with our recently developed De-goulash bioinformatic pipeline [44,45] (for a comparison of De-goulash with other available separation pipelines see S1 Text). As described previously, De-goulash performs mixture deconvolution via a two-step approach, with the first iteration step based on mtDNA SNPs and the subsequent second iteration step based on genomic SNPs. Due to the expectedly low mtDNA coverage of scATAC-seq when using nuclei as starting material, the first deconvolution step separated only 25.33% of cells, which were grouped into four discrete clusters (S1A Fig, S2 Table). After the second iteration step, however, 100% of the cells were separated into 5 discrete cell [...]

[...] with 98.7%, 88.4%, and 89.3% European ancestry quantification, respectively (Fig. 1E, S2 Fig, S6 Table). The remaining two cell clusters 1 and 2 are both of inferred male sex and inferred African-European admixed ancestry, with cluster 2 showing 73% African together with 24.4% European ancestry, while for cluster 1 similarly high proportions of African and European ancestry were obtained with 49% each (Fig. 1E, S2 Fig, S6 Table). These results obtained from the five separated cell clusters agree with the a priori ancestry knowledge of the five mixture donors and with their bi-parental ancestry inferred from their WES reference data.

Overall, the genetic characterisation results obtained from the five separated cell clusters were consistent with the prior knowledge of biological sex and biogeographic ancestry of the five individuals used as mixture donors. Additionally, the results are in agreement with the genetic sex and ancestry information obtained from the WES reference data of the five donors.
On one hand, these results show that biological sex and at least bi-parental and to a large degree also paternal biogeographic ancestry can be reliably inferred from the separated cell clusters. [...] for three cell clusters 1, 2, and 4, implying that [...] blood mixture.
