Phasing Singletons: Pushing the Limits of Statistical Haplotype Estimation
Population-based phasing methods have long struggled with singleton variants, where limited sharing of haplotype information makes accurate phase inference very difficult. SHAPEIT5 addresses this by leveraging the local haplotype context rather than relying solely on allele sharing.
Accurate statistical haplotype phasing depends on observing shared haplotype segments between individuals. Common variants pose little difficulty: there are many carriers to learn from, and phasing is generally reliable. Singleton variants, by contrast, are found in only one person in the dataset. Without other carriers to compare against, conventional methods have little information to work with and tend to phase these variants close to randomly. Their switch error rate, the fraction of consecutive heterozygous sites whose relative phase is inferred incorrectly, sits around 50%, which is what a coin flip would give.
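To make the switch error metric concrete, here is a minimal sketch of how it could be computed for one individual when the true phase is known (for example from family data). The function name and the input encoding are assumptions made for illustration; they are not taken from SHAPEIT5 or its benchmarking code.

```python
def switch_error_rate(true_hap, inferred_hap):
    """Fraction of consecutive heterozygous-site pairs with wrong relative phase.

    Both arguments list the allele (0 or 1) placed on the *first* haplotype at
    each heterozygous site, in genomic order. This encoding is an assumption
    made for the example.
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("haplotypes must cover the same heterozygous sites")

    # True where the inferred orientation agrees with the truth at that site.
    agrees = [t == p for t, p in zip(true_hap, inferred_hap)]

    # A switch error is a change of orientation between neighbouring sites.
    switches = sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)
    pairs = len(agrees) - 1
    return switches / pairs if pairs > 0 else 0.0


truth = [0, 1, 1, 0, 1, 0]
guess = [0, 1, 0, 1, 0, 0]
print(switch_error_rate(truth, guess))  # 2 switches over 5 site pairs -> 0.4
```

Note that an inference with every site flipped relative to the truth scores 0: the labels "first" and "second" haplotype are arbitrary, so only changes in orientation count as errors.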
This limitation has grown more pressing as sequencing studies expand. Rare and ultra-rare variants, including many singletons, are often the most relevant for disease gene discovery. Compound heterozygous analysis — identifying cases where both copies of a gene carry distinct pathogenic variants — requires correct phase assignments to be informative at all.
Why singletons are hard to phase
Standard phasing approaches build on Identity-by-Descent (IBD): segments of the genome shared between individuals because they descend from a common ancestor. For a variant present in many people, there are many IBD segments to inform phase. For a singleton, there is, by definition, no other carrier — so IBD-based methods have essentially no signal for that variant directly.
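As a small illustration of the definition, the sketch below finds singletons in a toy genotype matrix. The matrix layout (one row per sample, alternate-allele counts of 0, 1 or 2) and the function name are assumptions made for this example; real pipelines would read these counts from a VCF/BCF file.

```python
def find_singletons(genotypes):
    """Return {variant_index: carrier_sample_index} for singleton variants.

    genotypes[s][v] is the alternate-allele count (0, 1 or 2) of sample s at
    variant v. A singleton has exactly one alternate allele in the whole
    dataset, so a variant seen only in one homozygous carrier is not counted.
    """
    if not genotypes:
        return {}

    singletons = {}
    for v in range(len(genotypes[0])):
        total_alt = sum(row[v] for row in genotypes)
        if total_alt == 1:
            # Exactly one sample has a non-zero count at this variant.
            carrier = next(s for s, row in enumerate(genotypes) if row[v] == 1)
            singletons[v] = carrier
    return singletons


toy = [
    [0, 1, 1, 0],  # sample 0
    [1, 0, 1, 0],  # sample 1
    [0, 0, 0, 2],  # sample 2
]
print(find_singletons(toy))  # {0: 1, 1: 0}
```

For each variant returned here there is no second carrier, so no other sample can tell a phasing method which of the carrier's two haplotypes holds the allele.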
As a result, most tools handle singletons by assigning phase at random or leaving them unphased. For studies focused on common variation this is often acceptable, but it becomes a real problem when rare variant phasing matters.
The SHAPEIT5 approach
SHAPEIT5 builds on a coalescent-inspired framework in which each haplotype is modelled as a mosaic of segments inherited from other haplotypes in the panel. Rather than relying exclusively on direct allele sharing at the singleton site, SHAPEIT5 uses the surrounding haplotype context: even if nobody else carries the variant, many individuals share the genomic background around it. By conditioning on that background, the method can recover useful phase information that would otherwise be discarded.
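The mosaic idea can be sketched with a textbook Li and Stephens style copying model solved by the Viterbi algorithm. This is a deliberately simplified illustration, not SHAPEIT5's implementation: the constant `switch_prob` and `mismatch_prob` values, the function name, and the toy panel are all assumptions.

```python
import math


def viterbi_copying_path(target, panel, switch_prob=0.01, mismatch_prob=0.001):
    """Most likely sequence of panel haplotypes that the target copies from."""
    n_ref, n_sites = len(panel), len(target)
    log_stay = math.log(1.0 - switch_prob + switch_prob / n_ref)
    log_jump = math.log(switch_prob / n_ref)
    log_match, log_miss = math.log(1.0 - mismatch_prob), math.log(mismatch_prob)

    def log_emit(k, site):
        return log_match if panel[k][site] == target[site] else log_miss

    # Uniform prior over which panel haplotype is copied at the first site.
    score = [-math.log(n_ref) + log_emit(k, 0) for k in range(n_ref)]
    back = []
    for site in range(1, n_sites):
        best_prev = max(range(n_ref), key=lambda k: score[k])
        new_score, pointers = [], []
        for k in range(n_ref):
            stay = score[k] + log_stay
            jump = score[best_prev] + log_jump
            if stay >= jump:
                new_score.append(stay + log_emit(k, site))
                pointers.append(k)
            else:
                new_score.append(jump + log_emit(k, site))
                pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_ref), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


panel = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
# Site 2 of the target carries an allele that no panel haplotype has.
target = [1, 1, 1, 1, 1]
print(viterbi_copying_path(target, panel))  # [1, 1, 1, 1, 1]
```

In this toy model, an allele absent from the panel penalizes every copying state equally at that site, so the inferred path is determined entirely by the flanking sites. That is the sense in which the surrounding background still carries information even when the singleton site itself does not.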

Figure 1: Common and rare variant phasing using SHAPEIT5 and overview of the work.
In benchmarks on large sequencing cohorts, SHAPEIT5 achieved:
- < 5% switch error rate for rare variants carried by as few as 1 in 100,000 samples.
- Better-than-random phasing of singletons without family data, although with lower accuracy than for variants that have more than one carrier.
- < 1% switch error rate for common variants.
These figures will vary depending on sample size and panel composition, but represent consistent gains over previous methods in the settings we evaluated.
Implications for compound heterozygous analysis
One area where phasing quality is directly consequential is the detection of compound heterozygous loss-of-function (CH-LOF) events — pairs of variants affecting both copies of a gene, on opposite haplotypes. Misphased singletons can either mask genuine CH-LOF events or generate false positives.
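The dependence on phase is easy to see in code. The sketch below checks one individual and one gene for a compound heterozygous configuration, given phased genotypes at heterozygous LoF variants. The tuple encoding and the function name are assumptions for illustration and do not reflect the screening pipeline used in the paper.

```python
def has_compound_het(phased_genotypes):
    """True if LoF alleles sit on both haplotypes at different het variants.

    phased_genotypes is a list of (allele_on_hap0, allele_on_hap1) tuples, one
    per LoF variant in the gene, with 1 marking the LoF allele. Homozygous
    variants (1, 1) are ignored here: they knock out both copies on their own
    and are not compound heterozygous events.
    """
    hit_hap0 = any(a == 1 and b == 0 for a, b in phased_genotypes)
    hit_hap1 = any(a == 0 and b == 1 for a, b in phased_genotypes)
    return hit_hap0 and hit_hap1


trans = [(1, 0), (0, 1)]  # variants on opposite haplotypes: both copies hit
cis = [(1, 0), (1, 0)]    # variants on the same haplotype: one copy intact

print(has_compound_het(trans))  # True
print(has_compound_het(cis))    # False
```

The two inputs differ only in the phase of the second variant. If that variant is a singleton phased at random, the call is effectively a coin flip between a knockout and a carrier with one working copy.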
Applying SHAPEIT5 to UK Biobank data, we identified 549 genes with CH-LOF patterns that would have been classified differently under prior phasing methods. This suggests that improved singleton phasing has practical consequences for rare disease analyses, though the clinical significance of any individual finding still requires careful follow-up.

Figure 2: Statistical analysis of compound heterozygous events across different variant annotations.
Outlook
SHAPEIT5 is publicly available and designed to work with large whole-genome sequencing datasets. As cohorts continue to grow, the proportion of rare and singleton variants in analyses will increase, and getting their phase right matters for a growing range of applications — from fine-mapping to structural variant interpretation to rare disease diagnosis.
Full paper: Hofmeister, Ribeiro, Rubinacci, Delaneau, Nature Genetics 2023. doi:10.1038/s41588-023-01415-w