Foundation Models for Single-Cell Biology: scGPT, Geneformer, and What They Mean for Your Analyses
A practitioner's guide to single-cell foundation models — what scGPT, Geneformer, and UCE actually do well in 2026, where they fall short of classic Scanpy/Seurat workflows, and how to wire them into your pipeline without losing biological interpretability.
For the last decade, the single-cell stack has felt remarkably stable. You load counts into Scanpy or Seurat, you normalize, you find variable genes, you run PCA + Harmony + UMAP, and you annotate with a handful of marker genes or SingleR. It is a workflow built on linear algebra, careful biology, and a generation of curated marker tables.
Foundation models do not replace that workflow. They sit alongside it — and in 2026, they have finally crossed the threshold where ignoring them costs you something. This post is what I tell mentees who ask whether scGPT, Geneformer, or UCE belong in their PhD project: yes, with caveats, and here is how to use them without lying to yourself.
What a single-cell foundation model actually is
All three of the leading models (scGPT, Geneformer, and UCE) are transformers pre-trained on tens of millions of cells from public atlases. Instead of words, they tokenize genes; instead of next-token prediction over text, they learn by reconstructing masked expression values (scGPT), predicting masked genes in a rank-ordered gene sequence (Geneformer), or predicting which genes are expressed in a cell from protein-language-model representations of those genes (UCE). The pretraining objective varies, but the output is the same: a continuous embedding for every cell that, in principle, captures cell state in a way that generalizes across tissues, donors, and technologies.
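To make "tokenizing genes" concrete, here is a minimal sketch of rank-ordered tokenization in the spirit of Geneformer. It is a simplification, not the library's tokenizer: the function name `rank_value_tokens`, the toy counts, and the per-gene median values are all assumptions made up for illustration.

```python
import numpy as np

def rank_value_tokens(counts, gene_names, gene_medians, max_len=2048):
    """Return gene names ordered by median-scaled expression, highest first.

    Simplified illustration of rank-ordered tokenization. `gene_medians`
    stands in for corpus-wide per-gene medians (hypothetical values here).
    """
    counts = np.asarray(counts, dtype=float)
    medians = np.asarray(gene_medians, dtype=float)
    # Scaling by a corpus-wide median down-weights ubiquitously high genes.
    scaled = counts / medians
    # Only genes detected in this cell become tokens.
    expressed = np.flatnonzero(counts > 0)
    # Stable sort on negated values: ties keep the original gene order.
    order = expressed[np.argsort(-scaled[expressed], kind="stable")]
    return [gene_names[i] for i in order[:max_len]]

genes = ["ACTB", "CD3E", "GATA1", "MALAT1"]
cell = [50, 4, 0, 200]               # toy raw counts for one cell
medians = [40.0, 2.0, 1.0, 400.0]    # hypothetical corpus medians
tokens = rank_value_tokens(cell, genes, medians)
print(tokens)  # ['CD3E', 'ACTB', 'MALAT1']
```

Note how MALAT1 has the highest raw count but ends up last: the ranking rewards genes that are high relative to their usual level, which is the point of this encoding.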
What you get back is a vector. What you do with that vector — clustering, annotation, perturbation prediction, batch integration — is mostly orthogonal to the model itself. Treat the embedding as a feature representation, not as an oracle.
Where they genuinely help
Cell type annotation is the clearest win. Zero-shot annotation against a reference atlas (Tabula Sapiens, HCA, CZ CELLxGENE) using scGPT or UCE embeddings often beats marker-gene methods on rare populations and on tissues where canonical markers are weak, such as early progenitors, exhausted T cell states, and stromal subpopulations. It is also faster: you embed once and query thousands of references in seconds.
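The lookup step itself is simple once both datasets are embedded. The sketch below assumes you already have query and reference embeddings from the same model as plain arrays; `knn_transfer` is a hypothetical helper, and the two-dimensional toy vectors stand in for real embeddings with hundreds of dimensions.

```python
import numpy as np
from collections import Counter

def knn_transfer(query, reference, ref_labels, k=3):
    """Majority-vote label transfer using cosine similarity in embedding space."""
    q = np.asarray(query, dtype=float)
    r = np.asarray(reference, dtype=float)
    # L2-normalize rows so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    sims = q @ r.T
    # Indices of the k most similar reference cells for each query cell.
    top = np.argsort(-sims, axis=1)[:, :k]
    labels, confidence = [], []
    for row in top:
        votes = Counter(ref_labels[i] for i in row)
        label, n = votes.most_common(1)[0]
        labels.append(label)
        confidence.append(n / k)  # fraction of neighbors that agree
    return labels, confidence

# Toy embeddings: three reference cells per type.
reference = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
             [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
ref_labels = ["T cell"] * 3 + ["B cell"] * 3
query = [[0.95, 0.05], [0.05, 0.95]]

labels, confidence = knn_transfer(query, reference, ref_labels, k=3)
print(labels, confidence)  # ['T cell', 'B cell'] [1.0, 1.0]
```

The vote fraction is a crude confidence score, but it is useful for flagging cells whose neighbors disagree. For atlas-scale references you would likely swap the dense similarity matrix for an approximate nearest-neighbor index.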
Batch integration is the second win. Foundation-model embeddings are often robust to technical batch effects without Harmony or scVI being re-trained per dataset. On reasonably similar tissues, you can often skip a dedicated integration step and cluster directly on the embedding, provided you confirm that batches actually mix within each cluster.
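Whether you can skip integration is something to measure rather than assume. One rough diagnostic, sketched here with a hypothetical helper `neighbor_batch_mixing`, is the fraction of each cell's nearest neighbors that come from a different batch. This is a simplified stand-in for established mixing metrics, and the dense distance matrix only suits small examples.

```python
import numpy as np

def neighbor_batch_mixing(embedding, batches, k=1):
    """Per-cell fraction of the k nearest neighbors that belong to another batch."""
    x = np.asarray(embedding, dtype=float)
    b = np.asarray(batches)
    # Dense pairwise squared Euclidean distances (O(n^2) memory: toy scale only).
    d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a cell is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]
    other_batch = b[nn] != b[:, None]
    return other_batch.mean(axis=1)

# Batches interleaved in embedding space: neighbors come from the other batch.
mixed = neighbor_batch_mixing(
    [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]], ["A", "B", "A", "B"]
)
# Batches far apart: every neighbor shares the cell's own batch.
separated = neighbor_batch_mixing(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]], ["A", "A", "B", "B"]
)
print(mixed.mean(), separated.mean())  # 1.0 0.0
```

On real data, values near zero within a cluster suggest the cluster is driven by batch rather than biology, and a dedicated integration step is still worth running.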
Perturbation prediction — predicting transcriptional responses to gene knockouts or drugs from a fine-tuned model — is the most exciting capability and the most fragile. It works on perturbations within the distribution of the training data (common cancer cell lines, well-studied transcription factors) and degrades quickly outside it. Useful for hypothesis generation, dangerous as a stand-in for wet-lab validation.
Where they fall short (and what to do about it)
Differential expression is still a Scanpy job. Foundation-model embeddings collapse the gene-level signal you need for DEG analysis, pathway enrichment, and biological interpretation. Do annotation in embedding space, then go back to the count matrix for DE.
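As a reminder of what "going back to the count matrix" looks like, here is a deliberately naive sketch: a per-gene log2 fold change between two annotated groups, computed on library-size-normalized counts. It is not a statistical test and not a substitute for the DE tools you already trust. The function name `log2_fold_change` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def log2_fold_change(counts, labels, group, reference, pseudocount=1.0):
    """Naive per-gene log2 fold change of `group` versus `reference` cells."""
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    # Normalize each cell to counts per 10,000 so library size does not dominate.
    size = counts.sum(axis=1, keepdims=True)
    norm = counts / size * 1e4
    mean_group = norm[labels == group].mean(axis=0)
    mean_ref = norm[labels == reference].mean(axis=0)
    # The pseudocount avoids division by zero for undetected genes.
    return np.log2((mean_group + pseudocount) / (mean_ref + pseudocount))

# Toy matrix: 4 cells x 3 genes; labels come from the embedding-space annotation.
counts = [[10, 0, 10],
          [20, 0, 20],
          [0, 10, 10],
          [0, 20, 20]]
labels = ["T cell", "T cell", "B cell", "B cell"]
lfc = log2_fold_change(counts, labels, group="T cell", reference="B cell")
print(lfc)  # gene 0 up, gene 1 down, gene 2 unchanged
```

The labels can come from the embedding, but every number in the result traces back to a gene in the count matrix, which is what keeps the downstream interpretation honest.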
Rare disease and non-human samples are a known weak spot. Pretraining corpora skew toward healthy human tissue and common cancer lines. If you work on parasites, plants, or rare pediatric tumors, the zero-shot embedding will still cluster your cells — but the biological priors baked into the model are not yours, and confidence in cross-dataset transfer drops sharply.
Benchmarks lie if you do not control for data leakage. Several headline results from 2023–2024 turned out to be inflated by pretraining-set contamination, where evaluation cells had already been seen during pretraining. The 2024 benchmark from Boiarsky et al. is the right reference point: on properly held-out data, simpler baselines (PCA + Harmony + logistic regression on marker genes) remain competitive for many tasks. Use foundation models where they clearly help; do not assume they help everywhere.
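For your own evaluations, the part you can control directly is the split. A minimal sketch, assuming each cell carries a donor (or dataset) identifier: hold out whole donors so that no donor contributes cells to both sides. `split_by_group` is a hypothetical helper. It does not address contamination of the pretraining corpus itself; for that you still need to check which datasets the checkpoint was trained on.

```python
import random

def split_by_group(groups, test_fraction=0.25, seed=0):
    """Split cell indices so every group (e.g. donor) lands wholly in train or test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out at least one whole group.
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx, test_groups

# Toy donor assignment for eight cells.
donors = ["d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4"]
train_idx, test_idx, held_out = split_by_group(donors)

train_donors = {donors[i] for i in train_idx}
test_donors = {donors[i] for i in test_idx}
print(sorted(train_donors), sorted(test_donors))
```

A random cell-level split would scatter each donor across both sides and let a classifier score well by recognizing donors, which is one way a benchmark flatters a model.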
A pragmatic workflow for 2026
Start with the classic Scanpy/Seurat preprocessing — QC, doublet removal, normalization. Do not skip these. A foundation model on garbage cells gives you confident garbage embeddings.
Embed twice. Compute a foundation-model embedding (scGPT or UCE for human data, Geneformer for transcription-factor and gene-network questions) alongside a PCA + Harmony embedding. Cluster on both. Where they disagree, you have a story worth investigating, usually rare populations or technical artifacts.
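Finding where the two clusterings disagree can be as simple as a purity table. The sketch below assumes you have one cluster label per cell from each embedding; `cluster_purity` and the 0.8 threshold are illustrative choices, not a standard.

```python
from collections import Counter, defaultdict

def cluster_purity(labels_a, labels_b):
    """For each cluster in A, report its dominant cluster in B and the share it holds."""
    members = defaultdict(Counter)
    for a, b in zip(labels_a, labels_b):
        members[a][b] += 1
    report = {}
    for a, counts in members.items():
        best_b, n = counts.most_common(1)[0]
        report[a] = (best_b, n / sum(counts.values()))
    return report

# Toy labels: one cluster per cell from each embedding.
fm_clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
pca_clusters = ["a", "a", "a", "a", "b", "b", "c", "c"]

report = cluster_purity(fm_clusters, pca_clusters)
# Low purity means the other embedding splits this cluster: look there first.
flagged = [c for c, (_, purity) in report.items() if purity < 0.8]
print(flagged)  # ['1']
```

Run it in both directions, since a cluster can be intact in one view and split in the other. The flagged clusters are the short list to inspect for rare populations or technical artifacts.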
Annotate in embedding space, validate with markers in count space. Use the foundation model to propose labels by nearest-neighbor lookup against a reference atlas; confirm with a small marker-gene panel and visualize on the original UMAP. The two coordinate systems do not need to agree perfectly; they need to tell a consistent biological story.
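The marker check can be sketched in a few lines. For each proposed label, this computes the fraction of labelled cells in which at least one panel marker is detected in the raw counts. `marker_support` is a hypothetical helper and the two-gene panel is a toy; a real panel would have several markers per type.

```python
import numpy as np

def marker_support(counts, gene_names, proposed_labels, marker_panel):
    """Fraction of cells per proposed label with at least one panel marker detected."""
    counts = np.asarray(counts)
    labels = np.asarray(proposed_labels)
    col = {g: i for i, g in enumerate(gene_names)}
    support = {}
    for label, markers in marker_panel.items():
        cols = [col[g] for g in markers if g in col]
        in_label = labels == label
        if not cols or not in_label.any():
            continue  # no measured markers or no cells: nothing to check
        detected = (counts[in_label][:, cols] > 0).any(axis=1)
        support[label] = float(detected.mean())
    return support

genes = ["CD3E", "MS4A1", "ACTB"]
counts = [[3, 0, 10],
          [0, 0, 8],
          [0, 5, 9],
          [0, 2, 7]]
proposed = ["T cell", "T cell", "B cell", "B cell"]  # labels from embedding-space lookup
panel = {"T cell": ["CD3E"], "B cell": ["MS4A1"]}

support = marker_support(counts, genes, proposed, panel)
print(support)  # {'T cell': 0.5, 'B cell': 1.0}
```

Low support does not automatically mean the label is wrong, since dropout hides markers in individual cells, but a label with little marker support across a whole cluster deserves a second look.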
Do differential expression, pathway enrichment, and trajectory analysis on the count matrix with the tools you trust. Foundation models are upstream feature extractors, not replacements for the interpretable end of the pipeline.
Compute, licensing, and the practical reality
Inference with scGPT or UCE on a 100k-cell dataset runs comfortably on a single consumer GPU (16–24 GB VRAM) in minutes. Fine-tuning requires more — a 40–80 GB A100 or H100 for non-trivial datasets — but most users will never need to fine-tune. The pretrained checkpoints are good enough.
Licensing matters for clinical work. Most foundation-model checkpoints are released under permissive academic licenses; commercial and clinical use may require separate agreements. Read the license before you write the IRB protocol.
Where this is going
Multi-modal foundation models — joint embeddings of transcriptomes, spatial coordinates, chromatin accessibility, and protein abundance — are the next frontier. Early 2025–2026 work from Yang et al., Boiarsky et al., and the Universal Cell Embeddings group all point to the same direction: a unified cell-state representation that can be queried across modalities. We are not there yet, but the architecture is in place.
If you are starting a single-cell project today, my advice is the same as it has been for the last two years: keep your classic pipeline, add foundation-model embeddings as a parallel track, and let your biology decide which one you trust for each question. The interesting science still lives in the disagreement between methods.