CROP-Seq training datasets for AI-based foundation models of human cell biology
A foundation model of the human cell is a digital representation of the biology of a human cell. The amount of publicly available training data for such models is very limited, and Myllia's CROP-Seq perturbation datasets are a powerful source of information to fuel AI/ML-based engines accelerating drug target discovery.

Figure 1: Experimental workflow of CROP-Seq for AI training datasets at scale
CROP-Seq data – the ideal source of data to train a Foundation Model of the Human Cell
Single-cell RNA sequencing data can be used to train such a model because these data are information-rich and yet available at single-cell resolution. Existing models have been trained on public datasets, such as the data gathered by the Human Cell Atlas. However, this type of data is mostly descriptive.
The most powerful data for such a model are CROP-seq data, in which CRISPR perturbation is linked to single-cell RNA sequencing. This approach establishes a causal link between the CRISPR perturbation and its downstream effects and provides insight into the molecular mechanisms underlying cellular processes. However, as of today, very few large-scale CROP-seq datasets are available.
Here, Myllia presents a unique CROP-seq dataset in which the same set of 218 genes has been perturbed across six cell lines (THP-1, Jurkat, K562, A549, U2OS and K562 cells; see Figure 1). In addition, THP-1 cells were differentiated into macrophage-like cells using PMA (M0) or further differentiated using LPS treatment (M1). Following perturbation, a transcriptomic snapshot was recorded using unbiased single-cell RNA sequencing.
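To illustrate how a perturbation can be linked to its transcriptomic effect, here is a minimal sketch, not Myllia's analysis pipeline. It assumes each cell carries one assigned sgRNA target and a small expression vector, and it compares the mean profile of each knockout with non-targeting control cells. The gene names, values and the "non-targeting" label are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: one sgRNA target and one expression vector per cell.
# Gene names and values are illustrative only, not taken from the dataset.
GENES = ["GENE_A", "GENE_B", "GENE_C"]
CELLS = [
    {"target": "non-targeting", "expr": [10.0, 5.0, 2.0]},
    {"target": "non-targeting", "expr": [12.0, 5.0, 4.0]},
    {"target": "GENE_A", "expr": [1.0, 9.0, 3.0]},
    {"target": "GENE_A", "expr": [3.0, 11.0, 3.0]},
    {"target": "GENE_B", "expr": [11.0, 1.0, 3.0]},
]


def mean_profiles(cells):
    """Average the expression vectors of all cells sharing an sgRNA target."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[cell["target"]].append(cell["expr"])
    # zip(*rows) iterates gene by gene across the cells of one group.
    return {t: [mean(col) for col in zip(*rows)] for t, rows in grouped.items()}


def perturbation_effects(cells, control="non-targeting"):
    """Per-gene difference between each knockout profile and the control profile."""
    profiles = mean_profiles(cells)
    baseline = profiles[control]
    return {
        target: [value - base for value, base in zip(profile, baseline)]
        for target, profile in profiles.items()
        if target != control
    }


effects = perturbation_effects(CELLS)
for target, deltas in effects.items():
    print(target, dict(zip(GENES, deltas)))
```

A real analysis would work on normalized counts for thousands of genes and would use statistical testing rather than plain mean differences. The grouping step is the core idea: cells are labeled by the perturbation they received.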
Analysis of the single-cell RNA sequencing data revealed that the different cell lines clustered by cell identity (Figure 2). Of note, M0 and M1 macrophages clustered near their THP-1 monocyte “parents”.
Figure 2: UMAP plot of all experimental conditions
A detailed analysis of one of the conditions (M1) revealed the clustering of single CRISPR knockouts in distinct areas of the UMAP plot (Figure 3), suggesting that these gene knockouts had a significant impact on the transcriptome of these cells.
Figure 3: Transcriptomic phenotypes of distinct gene knockouts
Does this example trigger your curiosity? If so, please
• Download the introductory slide deck about the comparative CROP-Seq screen conducted across 8 different cancer cell lines, all performed using a single sgRNA library targeting the very same set of 218 genes in parallel
• Contact us at info@myllia.com if you are interested in gaining access to these existing datasets
• Reach out to info@myllia.com if you would like to gather additional datasets to train, fine-tune or test your Foundation Model
Myllia is your preferred partner for CROP-seq and can create such datasets at unprecedented scale and quality.