Deconvoluting Multi-Person Biological Mixtures

Erasmus University Medical Center Rotterdam

2024

Lucie Kulhankova, Eric Bindels, Manfred Kayser, Eskeatnaf Mulugeta

Tags

single-cell DNA sequencing, forensics, biological mixtures, genetic identification

Summary

This article explores a novel method for deconvoluting multi-person biological mixtures using non-targeted single-cell DNA sequencing (scDNA-seq), specifically scATAC-seq. The authors present a comprehensive approach to separate and characterize individual contributors to complex mixtures in forensic contexts, examining aspects of sex and biogeographic ancestry determination.

Full Transcript

Forensic Science International: Genetics 71 (2024) 103030
Journal homepage: www.elsevier.com/locate/fsigen

Deconvoluting multi-person biological mixtures and accurate characterization and identification of separated contributors using non-targeted single-cell DNA sequencing

Lucie Kulhankova a, Eric Bindels c, Manfred Kayser a,*,1, Eskeatnaf Mulugeta b,**,1

a Department of Genetic Identification, Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands
b Department of Cell Biology, Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands
c Department of Haematology, Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands

Keywords: Single-cell DNA sequencing; Single-cell ATAC sequencing; Mixture deconvolution; Genetic identification; Bio-geographic ancestry; Forensics

Abstract

The genetic characterization and identification of individuals who contributed to biological mixtures are complex and mostly unresolved tasks. These tasks are relevant in various fields, particularly in forensic investigations, which frequently encounter crime scene stains generated by more than one person. Currently, forensic mixture deconvolution is mostly performed subsequent to forensic DNA profiling at the level of the mixed DNA profiles, which comes with several limitations. Some previous studies attempted to separate single cells prior to forensic DNA profiling. However, these approaches are biased in the selection of cells and, due to their targeted DNA analysis on low template DNA, provide incomplete and unreliable forensic DNA profiles. We recently demonstrated the feasibility of performing mixture deconvolution prior to forensic DNA profiling through non-targeted single-cell transcriptome sequencing (scRNA-seq). In addition to individual-specific mixture deconvolution, this approach also allowed accurate characterisation of biological sex, biogeographic ancestry and individual identification of the separated mixture contributors. However, RNA has the forensic disadvantage of being prone to degradation, and sequencing RNA, which focuses on coding regions, limits the number of single nucleotide polymorphisms (SNPs) utilized for genetic mixture deconvolution, characterization, and identification. These limitations can be overcome by performing single-cell sequencing at the level of DNA instead of RNA. Here, for the first time, we applied non-targeted single-cell DNA sequencing (scDNA-seq), using the scATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) technique, to address the challenges of mixture deconvolution in the forensic context. We demonstrated that scATAC-seq, together with our recently developed De-goulash data analysis pipeline, is capable of deconvoluting complex blood mixtures of five individuals from both sexes with varying biogeographic ancestries. We further showed that our approach achieved correct genetic characterization of the biological sex and the biogeographic ancestry of each of the separated mixture contributors and established their identity. Furthermore, by analysing in-silico generated scATAC-seq data mixtures, we demonstrated successful individual-specific mixture deconvolution of i) highly complex mixtures of 11 individuals, ii) balanced mixtures containing as few as 20 cells (10 per individual), and iii) imbalanced mixtures with a ratio as low as 1:80. Overall, our proof-of-principle study demonstrates the feasibility of scDNA-seq in general, and scATAC-seq in particular, for mixture deconvolution, genetic characterization and individual identification of the separated mixture contributors.
Furthermore, it shows that, compared to scRNA-seq, scDNA-seq detects more SNPs from fewer cells, providing higher sensitivity, which is valuable in forensic genetics.

* Correspondence to: Erasmus MC University Medical Center Rotterdam, Department of Genetic Identification, Wytemaweg 80, Rotterdam 3015CN, the Netherlands.
** Correspondence to: Erasmus MC University Medical Center Rotterdam, Department of Cell Biology, Wytemaweg 80, Rotterdam 3015CN, the Netherlands.
E-mail addresses: [email protected] (M. Kayser), [email protected] (E. Mulugeta).
1 These authors contributed equally.

https://doi.org/10.1016/j.fsigen.2024.103030
Received 22 August 2023; Received in revised form 16 February 2024; Accepted 4 March 2024; Available online 13 March 2024
1872-4973/© 2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The individual-specific separation of cells from biological mixtures generated by multiple contributors is important in various areas of fundamental and translational research as well as in societally relevant practical applications. Cellular mixtures are generated in different natural processes and under various non-natural conditions. Naturally induced biological mixtures are cellular heterogeneities generated during development and disease such as cancer. Non-naturally induced biological mixtures include those i) generated for experimental purposes, e.g., pooling samples from different sources, ii) observed after disease treatment, e.g., bone marrow transplantation, iii) established by mistake, e.g., human contaminations in cell and tissue cultures, or iv) retrieved from certain locations, e.g., crime scenes. The individual-specific separation of such cellular mixtures, together with downstream analyses to genetically characterize and/or individually identify the separated cell donors, therefore plays a crucial role in different parts of science and society.

One of the most societally relevant areas where mixture deconvolution is required is forensic genetics. While genetically characterizing and individually identifying perpetrators using DNA obtained from crime scene traces produced by a single person is well established [1,2], this is challenging and often impossible from multi-person mixtures [3–6], which are often collected at crime scenes. Increased sensitivity of the reagents and machines used for forensic DNA profiling has resulted in increased observation of multi-person mixtures in crime scene samples. Therefore, over the last years, the limitations in identifying perpetrators from DNA mixtures have become an increasing problem and currently represent one of the major roadblocks in forensic DNA analysis [7–14].

Currently, forensic mixture deconvolution is typically attempted subsequent to forensic DNA analysis, i.e., a mixed forensic DNA profile is obtained from a crime scene sample (containing mixtures), and the mixture deconvolution is performed at the level of the mixed DNA profile, presenting various challenges. To address these challenges, alternative DNA analysis, extraction methods and markers were proposed. These include the exploration of alternative genotyping technologies, such as targeted massively parallel sequencing (MPS), which allow the quantification of genotyping outcomes with beneficial effects on mixture deconvolution. Alternative DNA extraction methods, such as differential lysis, were employed to enrich for the DNA of a male contributor. Alternative DNA markers, such as Y chromosome STRs, which only exist in male contributors, were proposed. Although these alternatives offer potential benefits, it is important to note that they also come with general and specific limitations. For instance, DNA extraction involving differential lysis is only applicable in mixtures that involve sperm cells in cases where the male contributor is the perpetrator. Although this leads to an enrichment of the male DNA, the resulting forensic STR profile often represents a mixed profile, leading to problems in the deconvolution. The analysis of male-specific Y-STRs allows complete mixture deconvolution for the male contributor in male-female mixtures, which is suitable in cases where the male contributor is the perpetrator. However, because of the non-recombining nature of Y-STRs, male relatives usually share the same Y-STR profile, thereby typically not allowing individual-specific conclusions. More recent developments in alternative biomarkers for forensic mixture deconvolution are compound biomarkers composed of two different polymorphic DNA markers that are located in close physical proximity so that they are inherited as haplotypes. Most of these compound markers take advantage of combining different types of DNA markers with different underlying mutation rates, such as: DIP-STRs [18,19], a combination of a deletion-insertion polymorphism (DIP) and a nearby STR; SNP-STRs, from both autosomes [20,21] and the Y-chromosome; and DIP-SNPs. Compound biomarkers consisting of the same type of DNA markers, such as so-called microhaplotypes, typically consisting of two (or more) autosomal SNPs, were also explored for the purpose of forensic mixture deconvolution, especially when analyzed with targeted MPS. In addition to the general limitation that the forensic application of these compound markers requires the availability of reference samples of the suspects for generating the needed reference data, since these alternative DNA markers are not included in criminal offender DNA databases, a major drawback of these methods is that they are generally suitable for two-person mixtures only.

A completely different approach to forensic mixture deconvolution is to separate the biological mixture into individual cells prior to forensic DNA profiling. Various technologies such as FACS [25,26], laser microdissection [27,28], DEPArray [29–33], and optical tweezers have been previously tested for this purpose, all with limited success for various reasons [26,33,35–38]. The major limitation of all these approaches is that they attempt to obtain forensic DNA profiles from the DNA of single cells in a targeted way, either by standard genotyping or MPS-based genotyping, i.e., only analysing the specific DNA markers finally used in forensic DNA profiling from very limited DNA. This consequently leads to the typical problems in low template DNA analysis with allele drop-outs and drop-ins, resulting in unreliable or incomplete forensic DNA profiles. Furthermore, with these cell separation techniques, the selection of cells from a mixture is often guided by the investigator, which introduces a significant bias in the final forensic DNA profiling outcome.

In principle, challenges faced in previous attempts to deconvolute human biological mixtures in forensic applications could be overcome with a technique that allows the separation of a mixture into its individual-specific cells prior to downstream forensic genetic analyses, and in an unbiased and non-targeted manner. In recent years, such a suite of technologies has emerged and is typically referred to as single cell sequencing. In single cell sequencing, the RNA or DNA of each cell is first labelled with barcodes, followed by non-targeted high-throughput sequencing of large parts of the transcriptome or genome of a cell using next generation sequencing technologies. Notably, these non-targeted single cell RNA or DNA sequencing techniques are substantially different from the targeted MPS analysis of forensic DNA markers previously applied to the problem of forensic mixture deconvolution. This distinction arises from their ability to generate sequencing data from large parts of the human transcriptome or genome, from which large numbers of single nucleotide polymorphisms (SNPs) can be obtained. These SNPs are subsequently utilised for genetic mixture deconvolution and downstream forensic genetic analyses. In the past years, several single-cell RNA and DNA sequencing techniques have been developed and applied in different biological fields for understanding cellular processes, cellular heterogeneity and the molecular mechanisms that drive this heterogeneity [40–43]. Until very recently, such single cell sequencing technologies had not been applied to the forensic problem of mixture deconvolution.

In 2023, we presented a proof-of-principle study demonstrating the feasibility of non-targeted transcriptome-wide single-cell RNA sequencing (scRNA-seq) and our bioinformatics pipeline De-goulash for mixture deconvolution and downstream forensic genetic analyses [44,45]. We showed that the genotype data of different sets of carefully selected SNPs that De-goulash automatically selected from the mixture's scRNA-seq data allowed the correct separation of the mixtures according to the individual contributors in up to 9-person balanced and imbalanced mixtures. We further demonstrated that different sets of SNPs automatically selected from the scRNA-seq data of the separated cell clusters allowed for accurate genetic characterization of the separated mixture contributors regarding biological sex and bio-geographic ancestry (maternal, paternal, and bi-parental) as well as their correct individual genetic identification with statistical certainty. Furthermore, this approach provided accurate identification of the cellular/tissue origin of the analysed mixture.

In our previous study, we pioneered the use of non-targeted single cell transcriptome sequencing for mixture deconvolution and subsequent forensic genetic analyses and delivered promising results. However, sequencing RNA has limitations, particularly when dealing with biological material of low quality and quantity, which is often encountered in areas where mixture deconvolution is relevant, particularly in the field of forensic genetics. One of these issues is the inherent instability of RNA molecules, which are prone to in vivo and in vitro degradation processes. This presents practical limitations when working with low quality mixture material, a common source in forensic genetics. Another challenge is the relatively limited number of SNPs that can be captured by scRNA-seq, which focuses on coding regions that are relatively conserved compared to the non-coding part of the genome [46,47], requiring a higher number of cells for successful analyses. This poses practical constraints when dealing with low quantity mixtures that are often encountered in forensic genetics. An alternative strategy to overcome both challenges is to shift non-targeted single-cell sequencing from RNA to DNA. DNA is much more stable and resistant to degradation compared with RNA. Sequencing DNA delivers SNPs from coding and non-coding regions and thus more SNPs than available from coding regions when sequencing RNA.

We therefore hypothesize that non-targeted single-cell DNA sequencing (scDNA-seq) is a more viable option for analyzing biological mixtures, including those of low quality as relevant in forensic genetics. Moreover, we hypothesize that moving from scRNA-seq to scDNA-seq will increase the number of available SNPs, which should decrease the number of cells required for successful analyses, thereby providing benefits when dealing with biological mixtures of limited quantity such as in forensic genetics. Although performing scDNA-seq on the whole human genome (WGS) would be the most desirable strategy, it is currently very expensive and will not be feasible for many practical applications, which calls for a more affordable and simpler scDNA-seq approach. Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) is a less expensive and established scDNA-seq technology, which was initially developed to study genome-wide chromatin accessibility and has been successfully applied for understanding chromatin dynamics and gene regulation at the single-cell level [49,50], but had never been applied to the problem of mixture deconvolution in a forensic context.

In this proof-of-principle study, we demonstrate for the first time the feasibility of scDNA-seq in general, and scATAC-seq in particular, for genetically separating, characterizing and individually identifying donors of multi-person biological mixtures. Moreover, we provide empirical proof for the advantage of scDNA-seq over the scRNA-seq approach we had recently introduced for this purpose. We conducted scATAC-seq on experimentally established multi-person blood mixtures of different quantities that contained individuals of both biological sexes and with varying biogeographic ancestries. We additionally used in-silico mixtures generated from publicly available single-individual scATAC-seq datasets to explore the limitations of our novel approach. Our results demonstrate the capability of scATAC-seq to correctly separate balanced and imbalanced mixtures according to the level of individual contributors and to allow accurate determination of the biological sex and the biogeographic ancestry of the separated donors as well as their individual genetic identification. Our results highlight the benefits of scDNA-seq in general, and scATAC-seq in particular, for successful mixture analysis relevant in various fields, including forensic genetics.

2. Materials and Methods

2.1. Sample preparation

Blood samples of 5 volunteers were collected after written informed consent via venipuncture into a 10 mL EDTA tube by a trained phlebotomist. The study was approved by the Medical Ethics Board (METC) of Erasmus MC (MEC-2020–0528). The PBMCs were isolated using a Lymphoprep™ (Stemcell, #07811) protocol. The whole blood was centrifuged for 5 min. The plasma was removed and the sample was diluted with 1 vol of PBS with 2% FBS. Next, the sample was layered on Lymphoprep™ and centrifuged without brake for 20 min. The PBMC layer was extracted into PBS with 2% FBS and washed twice before filtering through a 40 µm cell strainer. Cell integrity and count were assessed using a Countess 3 FL cell counter. Cells were then mixed in an even ratio and nuclei isolation was performed as recommended by 10x Genomics (CG000169revD) with a 3-minute lysis step and DNase treatment. The recovered nuclei were counted on the Countess and the appropriate number of nuclei was selected for library preparation depending on the requirements of the experiment.

2.2. Library preparation

scATAC-seq libraries were generated using the Chromium Next GEM Single Cell ATAC reagent kit v1.1 (10x Genomics). Quality and number of nuclei were verified by counting trypan blue positive nuclei using a haemocytometer and a Countess 3 FL (Thermo Fisher). Transposase incubation (Tn5) and subsequent library preparation were done with the Chromium Next GEM Single Cell ATAC reagent kit v1.1 (10x Genomics). Libraries were sequenced on a NovaSeq 6000 platform (Illumina), with a 50–8–16–50 cycle setting.

2.3. Data processing

The scATAC-seq mixture data were de-multiplexed using cellranger-atac mkfastq (version 2.0.0). Sequencing reads were aligned to the human genome (GRCh38) with the STAR aligner that is part of the Cell Ranger ATAC pipeline (version 2.0.0) using the cellranger-atac count command. The quality of the generated data was assessed based on several parameters: fragment distribution, fraction of high-quality fragments overlapping peaks, and estimated number of cells. Regions that were previously reported to result in poor alignment (blacklist regions) were removed by using the list provided by ENCODE (the ENCODE blacklist regions). The resulting aligned bam file as well as the cell barcodes resulting from the cell calling containing true cells (filtered cell barcodes) were used for downstream analyses.

2.4. Deconvolution of biological mixtures

Mixture deconvolution was performed using the two-step deconvolution tool of the previously described De-goulash pipeline. SNP calling, generation of the cell matrix and clustering, and variant calling were all performed as described previously, with the notable deviation of a reduced requirement for coverage per SNP (from 2 to 1), allowing for inclusion of the residual mitochondrial DNA (mtDNA) SNP information available with scATAC-seq data. A minimum of 10 mtDNA SNPs per cell (4 for scRNA-seq previously) were used for the first iteration and 60 whole-genome SNPs (20 for scRNA-seq previously) in the second iteration. For the first iteration, the mtDNA SNPs used for the deconvolution analysis were selected from the bulk data mtDNA SNPs freebayes call (freebayes -iXu -c 2 -q 1) with SNP quality of 80 and depth of 20. For UMAP, 300 neighbours were selected and the number of clusters was determined using NbClust. For the second iteration, the whole-genome SNPs were selected by using SNP lists of each cluster established from iteration 1. Non-unique variants were removed using bcftools norm (version 1.9) [55,56]. The selected SNPs were then filtered for depth and quality (QUAL > 80, DP > 20) and used for clustering as described for iteration 1. Detailed information regarding the separation process can be found in the methods section of our previous study and the tool's GitHub page. Since cluster numbers are assigned arbitrarily by the software, SNPs obtained from clusters in the different datasets were compared in order to match clusters in different datasets that are generated from similar individuals (S14 Table).

2.5. Genetic characterization of mixture donors from separated cell clusters

For each separated cell cluster, the cluster variant files and sequencing files were further analysed using the De-goulash analysis pipeline. Bi-parental bio-geographic ancestry using autosomal ancestry-informative SNPs was determined using STRUCTURE (v2.3.4) and the 1000 Genomes database with five major populations (K = 5): European, African, South Asian, East Asian, and Native American. Y chromosome haplogroups were determined from Y-SNP data using Y-leaf 2.1 with the basic setting as described in the manual (-r 1 -q 20 -b 90 -t 1), and paternal biogeographic ancestry was inferred from the geographic distribution of the identified Y haplogroups using literature resources. MtDNA haplogroups were determined from mtDNA SNP data using Haplogrep2, and maternal bio-geographic ancestry was inferred from the identified mtDNA haplogroups using population data stored in the EMPOP database [60,61]. Likelihood ratios (LRs) for individual genetic identification were calculated using the respective tool in De-goulash. Biological sex was determined by first considering the sequencing reads aligned to the Y-chromosome, with alignment percentages lower than 0.1 interpreted as female sex and higher than 0.1 as male sex. Secondly, the percentage of X-chromosome reads aligned to the XIST gene was recorded using the genomic coordinates obtained from Ensembl (Chromosome X: 73,820,649–73,852,723), with alignment percentages lower than 0.04 interpreted as male sex and over 0.05 as female sex. This threshold was established empirically based on observations we made with samples of known biological sex (S16 Table).

2.6. Individual genetic identification of mixture donors from separated cell clusters

For each of the five individuals in the experimentally generated mixtures, a previously created whole exome sequencing (WES) reference dataset was used for matching for the purpose of individual identification as described elsewhere. As an alternative reference dataset, we newly generated SNP microarray data for each individual used in the experimental mixtures, because generating SNP array data is much less costly than generating WES data. For this, after obtaining written informed consent, a buccal swab was taken by rubbing a sterile OmniSwab (Qiagen) on the inside of the cheek for 15 s. The swabs were immediately processed and DNA was extracted using the QIAamp DNA Investigator Kit (Qiagen). DNA concentration was determined using NanoDrop and diluted to the 50 ng/µL concentration used as input for SNP microarray analysis using the GSA-MD version 3 array from Illumina. The obtained SNP array data were processed in Genome Studio 2.0 using a pre-existing cluster file based on 4405 high-quality DNA samples. A total of 5574 DNA variants with missing genotypes were removed from the data, leaving 719,923 SNPs for analysis. The reference alleles were set to the GRCh38 reference genome in PLINK v2.00. Indels as well as variants on sex chromosomes and mtDNA variants were all removed before further analysis. The identity SNP lists obtained from the scATAC-seq data for each separated cell cluster were overlapped with the WES and GSA reference data from the reference database, including all individuals used in the mixture, by using bcftools isec (v1.9). The overlapping positions were compared by genotype and the match percentage was calculated as the percentage of SNPs with matching genotypes. The criterion for confirming a match with an individual in the reference dataset was set at 85% agreement in genotypes across overlapping positions. De-goulash was next used to calculate a statistical weight of evidence pertaining to individual identification. To this end, we employed the widely used likelihood ratio (LR) framework. Briefly, in our setting, the LR compares the data given two competing hypotheses, H1: the obtained SNP profile and the reference profile are from the same individual, versus H2: the obtained SNP profile and the reference profile have two unrelated donors. The LR becomes the inverse of the random match probability for a match between the two profiles. We note that the model assumes linkage equilibrium and therefore the De-goulash pipeline performs a pruning (minimum of 0.5 cM between any included markers) of the overlapping markers prior to LR calculations. We further note that genetic linkage is not an issue for random match probability calculations.

2.7. In silico mixture preparation and analysis

i) Testing the limits of scATAC-seq in balanced mixtures

In silico balanced mixtures based on two publicly available single-individual scATAC-seq datasets (A1, A2, details in the data availability section of this manuscript) from 10x Genomics were created in various increments. Equal numbers of barcodes, ranging from 100 down to 5 cells for each of the two individuals, were selected randomly. Each dataset was filtered for reads containing the selected barcodes. The resulting datasets were then merged into two-person balanced mixtures, and mixture deconvolution was performed using the De-goulash pipeline. Due to the low number of cells used, the requirement for depth of SNPs and the number of SNPs observed per cell were adjusted (depth of 10; 4 SNPs required for mtDNA SNPs, 10 for autosomal SNPs), and the UMAP neighbours were adjusted in each run to match the number of cells available for UMAP (100 for the 100:100 dataset, 50 for the 50:50 dataset, 20 for the 20:20 dataset, 10 for the 10:10 dataset). The cluster assignment of each separated dataset was matched to the original single-individual data source using the original barcodes to establish that the obtained clusters are individual-specific. For the subsequent genetic characterisation analysis based on the obtained individual-specific cell clusters, we adjusted the following De-goulash parameters to accommodate low cell numbers: the depth of SNPs was lowered to 10, the quality of SNPs was lowered to 50, and STRUCTURE was run on all available SNPs in a cluster. Biological sex, Y chromosome and mtDNA haplogroups were assessed as described above. To obtain the true biological sex and biogeographic ancestry of the individuals used in the in-silico mixtures, the initial individual scATAC-seq data, prior to their use in the in-silico mixture, were used. For this, variants were called on the unmixed datasets using freebayes variant calling (freebayes -iXu -c 2 -q 1). The SNPs were filtered for quality of 80 and depth of 20, and the modified De-goulash pipeline for analysis was run. All available SNPs were used for STRUCTURE with the downstream distance and allele frequency filter applied. MtDNA and Y chromosomal haplogroups were assigned using Haplogrep2 and Y-leaf as described above. The resulting ancestry and sex assignments were then compared to the corresponding findings from the bulk single-source data to establish the limits of the approach in assessing the true sex and ancestry after mixture deconvolution. Individual genetic identification could not be performed from in-silico mixtures due to the lack of reference data. However, the correct individual-specific mixture deconvolution was assessed based on the barcodes known to belong to one individual or the other.

ii) Testing the limits of scATAC-seq in imbalanced mixtures

Two publicly available single-individual scATAC-seq datasets from 10x Genomics (A1, A2, details in the data availability section of this manuscript) were used to create imbalanced mixtures with minor and major components of variable degrees. In total, 1000 cells were selected from both individual datasets together. For the minor component, cells with a high number of reads were selected to increase reliability. Cells from the two individual datasets were mixed in ratios ranging from 1:10 down to 1:90. From each dataset, the reads containing the chosen cell barcodes were filtered and the resulting subsets were then merged to create a mixed dataset. Subsequently, De-goulash was used to deconvolute these imbalanced mixtures. Taking into account the reduced number of cells, the parameters used for deconvolution with De-goulash were lowered from the default with number of required SNPs in iteration from 1 to 3, UMAP neighbours to 50, sequencing depth per SNP to 10 and SNP quality to 50. The assignment of the barcodes was analysed and compared to each cell's original data source. Individual genetic identification could not be performed from in-silico mixtures because of the missing reference data. However, the correct individual-specific assignment of cells was assessed by comparing the cell barcodes between the separated clusters and the original single-source data. Determination of biological sex and biogeographic ancestry of the minor component cell cluster was done as described in the previous section.

Fig. 1. Deconvolution of 5-person balanced blood mixture and genetic analyses with De-goulash analysis of scATAC-seq data. A) Experiment design. B) 3D UMAP cell separation graph after first and second deconvolution iteration. C) Percentage of sequencing reads mapping to the X-chromosomal XIST gene and D) to the Y-chromosome in each of the 5 separated cell clusters with colour indications of inferred sex. C and D include a colour guide with the determined sex of each cluster, with black signifying presumed biological female clusters and red presumed biological male clusters. E) STRUCTURE results showing bi-parental biogeographic ancestry using selected autosomal ancestry-informative SNPs for each of the 5 separated cell clusters together with the population reference data used (EUR: Europeans, EAS: East Asians, SAS: South Asians, AMR: Native Americans, AFR: Sub-Saharan Africans). F) Percentage of selected human identity SNPs matching between cell cluster and the best matching individual in the WES reference database for each of the 5 cell clusters / individuals. G) LR of individual genetic identification for the 5 cell clusters / individuals based on the WES reference data; the green line represents the certainty threshold of 10E9. H) Percentage of selected human identity SNPs matching between cell cluster and the best matching individual based on the GSA reference database for each of the 5 cell clusters / individuals. I) LR of individual genetic identification for the 5 cell clusters / individuals based on the GSA reference database; the green line represents the certainty threshold of 10E9. Fig. 1A was made using Biorender and some images are adapted from 10X genomics.

Table 1
Summary of the scATAC sequencing quality and cell recovery from two complex blood mixtures involving the same 5 individuals, respectively (a).

Sample / dataset | Estimated number of cells | Fraction of genome in peaks | Fraction of high-quality fragments in cells | Mean raw read pairs per cell | Sequencing saturation
scATAC-H | 3256 | 0.0338 | 0.6456 | 74951.26 | 0.5213
scATAC-L | 1306 | 0.0312 | 0.5624 | 253377.5 | 0.6661

(a) For additional information, see S1 Table.

iii) Testing limits of scATAC-seq in highly complex mixture

We utilised 11 publicly available single-individual scATAC-seq datasets to create a highly complex mixture containing 11 individuals. This included the two datasets used in the previous sections from 10x Genomics (A1, A2), 9 datasets from the study with accession PRJNA718009, and one dataset from study PRJNA658078 (dataset SRP278094). From each of the 11 datasets, 500 barcodes were randomly selected and filtered for reads that corresponded to the selected cells. The resulting selected reads were mixed and deconvoluted using the De-goulash pipeline. The number of expected clusters was set to 11 after repeatedly running the NbClust pipeline, which allows estimation of the number of cell clusters. Evidence for correct individual separation was obtained by comparing the clustered barcodes to the original single-source data. The analysis pipeline of De-goulash was run for bi-parental biogeographical ancestry using the STRUCTURE algorithm and software. For STRUCTURE analysis, all available SNPs over the [...]

[...] used in a previous study. To this end, we removed the ENCODE blacklist region from the scATAC-seq data. As the previous analysis successfully demonstrated the accurate separation of cells into individual clusters, we could utilize the resulting separation data as individual data. Hence, we employed the distinct lists of cells from each dataset, which were assigned individually by De-goulash. Next, we randomly selected an increasing number of barcodes from each cluster-barcodes subset. We selected cell numbers between 10 and 400, and each selection was repeated 10 times. For each barcode selection we extracted the barcodes corresponding to the selected cells and called SNPs using FreeBayes v1.3.1 ("-iXu -C 2 -q 1"). We then filtered the SNPs using bcftools filter (QUAL > 80 and INFO/DP > 20) and counted the number of SNPs. The results from each cell number point were then averaged for the corresponding sequencing and cluster.

2.9. Genomic localization of SNPs

To determine the genomic location of SNPs in scRNA-seq and scATAC-seq, we used a scRNA-seq dataset with a mixture of four individuals (dataset M4 in our previous study), the two scATAC-seq mixtures generated for this study (S1 Table), and the two references (WES and GSA data). The scATAC-seq datasets were processed by excluding ENCODE blacklist regions. In all datasets, SNPs were called using FreeBayes v1.3.1 with the arguments "-iXu -C 2 -q 1". The variant/SNP list was further filtered using bcftools filter (QUAL > 80 and INFO/DP > 20). SnpEff v4.3 was used to annotate and analyse the resulting variants list (java -Xmx8g -jar snpEff.jar hg38). Finally, the count of variant types and locations in the different datasets was recorded and compared.
standard threshold (Depth > 20, Quality > 80) were preselected before further filtering for distance and minor allele frequency. Haplogrep2 3. Results and Y-leaf scripts implemented in De-goulash were used for mtDNA and Y-chromosome haplogrouping, respectively. Reads from the 3.1. Deconvolution of complex mixture and down-stream forensic DNA Y-chromosome and the X-chromosomal XIST gene were recorded to profiling genetically assess biological sex. Testing for correct sex and ancestry inference was done by comparing the genetic outcomes from analysing In order to investigate the suitability of scDNA-seq in general, and the separated cell clusters with the original single-source data. scATAC-seq in particular, for deconvoluting complex biological mix­ tures on the individual-specific level, we first generated a balanced 2.8. Comparison scRNA-seq and scATAC-seq data blood mixture comprising five individuals with an equal contribution. The five mixture contributors were selected to cover both sexes and i) SNP comparison whole datasets diverse biogeographic origin from the same and different continental To determine the number of quality SNPs in each scATAC-seq ancestries i.e., two European females, one European male, one African- dataset, we called SNPs on the whole data without the ENCODE European admixed male, and one East-African male. After nuclei isola­ blacklist region using FreeBayes v1.3.1 with arguments “-iXu -C 2 -q tion, the mixture of these 5 individuals was subjected to scATAC-seq 1″. After SNP calling, we filtered for quality SNPs using the using 10x Genomics library preparation followed by sequencing bcftools filter (QUAL > 80 and INFO/DP > 20) and counted the (Fig. 1A), generating the dataset scATAC-H. After sequencing, 3256 number of SNPs in the resulting SNP list. 
To create a comparison nuclei were recovered and analysed (Table 1, for extended information subset of scRNA-seq data, we randomly selected the corresponding refer to S1 Table) with our recently developed De-goulash bioinformatic number of cells (3256 and 1306) and created a subset of mixture of pipeline [44,45] (for comparison of De-goulash with other available four scRNA-seq data used in a previous study (M4). For these separation pipelines see S1 Text). As described previously, De-goulash two resulting datasets we called SNPs and filtered the SNPs for performs mixture deconvolution via a two-step approach, with the quality and depth as described above. We then quantified the num­ first iteration step based on mtDNA SNPs and the subsequent second ber of SNPs in the resulting SNP list. iteration step based on genomic SNPs. Due to scATAC-seq’s expectedly ii) SNP comparison separated cell clusters low mtDNA coverage when using nuclei as starting material, the first deconvolution step only separated 25.33% of cells, which were grouped To create comparison of scATAC-seq and scRNA-seq SNPs, we used into four discrete clusters (S1A Fig. S2 Table). After the second iteration the scATAC-H, scATAC-L and a mixture of four scRNA-seq dataset (M4) step however, 100% of the cells were separated into 5 discrete cell 6 L. Kulhankova et al. Forensic Science International: Genetics 71 (2024) 103030 clusters (Fig. 1B, S2 Table), which agree with the number of individuals with 98.7%, 88.4%, and 89.3% European ancestry quantification, present in the mixture. It is noteworthy that the number of cells obtained respectively (Fig. 1E, S2 Fig. S6 Table). The remaining two cell clusters 1 per each of the five separated clusters was not proportional to the equal and 2 are both of inferred male sex and inferred African-European number of cells used as input in the preparation of the balanced blood admixed ancestry with cluster 2 showing 73% African together with mixture (S3 Table). 
We observed a roughly two-fold difference between 24.4% European ancestry, while for cluster 1 similarly high proportions the cluster with the largest number of cells and the one with the smallest of African and European ancestry were obtained with 49% each (Fig. 1E, cell number (S3 Table). S2 Fig, S6 Table). These results obtained from the five separated cell Aiming to demonstrate that the mixture deconvolution occurred on clusters agree with the a priory ancestry knowledge of the five mixture the individual-specific level i.e., the cells separated in the five cell donors and with their bi-parental ancestry inferred from their WES clusters correspond to the five individual donors of the blood mixture, reference data. we first performed genetic characterization analyses regarding biolog­ Overall, the genetic characterisation results obtained from the five ical sex and biogeographic ancestry for each of the 5 separated cell separated cell clusters were consistent with the prior knowledge of clusters. For genetic determination of the biological sex, we first looked biological sex and biogeographic ancestry of the five individuals used as at the number of sequencing reads originating from the XIST gene. mixture donors. Additionally, the results are in agreement with the ge­ Expression of XIST inactivates one of the two X chromosomes in females netic sex and ancestry information obtained from the WES reference and can thus be used as genetic marker for biological sex. This analysis data of the five donors. On one hand, these results show that biological showed high percentage (>0.05) in the two cell clusters 3 and 5 (Fig. 1C, sex and at least bi-parental and to a large degree also paternal biogeo­ S4 Table), indicating that they derive from females. The remaining three graphic ancestry can be reliably inferred from the separated cell clusters. cell clusters 1, 2, and 4 presented low percentages (0.1%) for three cell clusters 1, 2, and 4, implying that blood mixture. 
they are derived from males, while low percentages (
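The identification step described in the Methods (a genotype match percentage across overlapping positions, followed by an LR taken as the inverse of the random match probability) can be illustrated with a minimal Python sketch. This is not the De-goulash implementation. The function names, the encoding of genotypes as alternate-allele counts, the use of Hardy-Weinberg genotype frequencies with independent SNPs, and reading the 85% criterion as "at least 85%" are all assumptions made here for illustration only.

```python
import math


def match_percentage(profile, reference):
    """Return (percent matching genotypes, number of overlapping sites).

    Both arguments map a site key, e.g. ("chr1", 12345), to a genotype
    encoded as the number of alternate alleles (0, 1 or 2).
    """
    shared = profile.keys() & reference.keys()
    if not shared:
        return 0.0, 0
    matches = sum(1 for site in shared if profile[site] == reference[site])
    return 100.0 * matches / len(shared), len(shared)


def genotype_frequency(alt_count, alt_freq):
    """Population genotype frequency, ASSUMING Hardy-Weinberg equilibrium."""
    if alt_count not in (0, 1, 2):
        raise ValueError("alt_count must be 0, 1 or 2")
    if not 0.0 < alt_freq < 1.0:
        raise ValueError("alt_freq must lie strictly between 0 and 1")
    ref_freq = 1.0 - alt_freq
    return (ref_freq ** 2, 2 * ref_freq * alt_freq, alt_freq ** 2)[alt_count]


def log10_likelihood_ratio(matching_genotypes):
    """log10 of LR = 1 / random match probability.

    matching_genotypes is an iterable of (alt_count, alt_freq) pairs for
    the matching SNPs. SNPs are ASSUMED independent, so the random match
    probability is a product; summing logs avoids numerical underflow.
    """
    return -sum(math.log10(genotype_frequency(g, f)) for g, f in matching_genotypes)


# Toy data (hypothetical): three overlapping sites, two of which agree.
cluster = {("chr1", 100): 0, ("chr1", 200): 1, ("chr2", 50): 2, ("chr3", 7): 1}
reference = {("chr1", 100): 0, ("chr1", 200): 1, ("chr2", 50): 1, ("chrX", 9): 0}

pct, n_shared = match_percentage(cluster, reference)
is_match = pct >= 85.0  # assumed reading of the 85% criterion

log10_lr = log10_likelihood_ratio([(0, 0.5), (1, 0.5), (2, 0.5)])
```

Working in log10 keeps the calculation stable for profiles with thousands of SNPs, where the raw product of genotype frequencies would underflow to zero.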

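The SNP quality filter stated in the Methods (QUAL > 80 and INFO/DP > 20) can also be expressed outside bcftools. The sketch below is a plain-Python analogue for uncompressed VCF text, not the command the authors ran. The helper names are hypothetical, and the restriction to single-base REF/ALT alleles is an assumption added here so that the count covers SNPs only.

```python
def parse_info(info_field):
    """Turn 'DP=35;AF=0.5;FLAG' into a dict; keys without '=' map to True."""
    parsed = {}
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def is_snp(ref, alt):
    """True when REF and every ALT allele are single bases (assumed SNP definition)."""
    return len(ref) == 1 and all(len(a) == 1 and a in "ACGT" for a in alt.split(","))


def passes_filter(vcf_line, min_qual=80.0, min_depth=20):
    """Apply QUAL > min_qual and INFO/DP > min_depth to one VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
    ref, alt, qual, info = fields[3], fields[4], fields[5], parse_info(fields[7])
    if qual == "." or "DP" not in info or not is_snp(ref, alt):
        return False
    return float(qual) > min_qual and int(info["DP"]) > min_depth


def count_quality_snps(vcf_lines):
    """Count records passing the filter, skipping header and blank lines."""
    return sum(
        1
        for line in vcf_lines
        if line.strip() and not line.startswith("#") and passes_filter(line)
    )


# Toy records (hypothetical): only the first passes both thresholds.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t95.2\t.\tDP=35;AF=0.5",  # passes
    "chr1\t200\t.\tC\tT\t60.0\t.\tDP=40",         # QUAL too low
    "chr2\t300\t.\tG\tA\t120\t.\tDP=12",          # depth too low
    "chr2\t400\t.\tT\tC\t80\t.\tDP=20",           # thresholds are strict
    "chr3\t500\t.\tAT\tA\t200\t.\tDP=50",         # indel, not a SNP
]

n_quality_snps = count_quality_snps(example_vcf)
```

Both comparisons are strict, mirroring the "greater than" wording of the stated thresholds, so a record with QUAL exactly 80 or depth exactly 20 is excluded.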