.karayel_2020
- proteopy.datasets.karayel_2020(fill_na=None)
Load Karayel 2020 erythropoiesis proteomics dataset.
Download and process the protein-level DIA-MS dataset from Karayel et al. [1] studying dynamic phosphosignaling networks during human erythropoiesis. The study quantified ~7,400 proteins from CD34+ hematopoietic stem/progenitor cells (HSPCs) isolated from healthy donors, across five sequential erythroid differentiation stages with four biological replicates each (20 samples total). Cells were FACS-sorted using CD235a, CD49d, and Band 3 surface markers. The differentiation stages are:
Progenitor: CFU-E progenitor cells (CD34+ HSPCs, negative fraction)
ProE&EBaso: Proerythroblasts and early basophilic erythroblasts
LBaso: Late basophilic erythroblasts
Poly: Polychromatic erythroblasts
Ortho: Orthochromatic erythroblasts
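The experimental design above (five stages, four replicates each) can be sketched in a few lines. This is only an illustration: the names STAGES, design, and stage_rank are hypothetical, and numbering the replicates 1 to 4 is an assumption rather than the labelling used in the dataset.

```python
from itertools import product

# Stage abbreviations in differentiation order, as listed above.
STAGES = ["Progenitor", "ProE&EBaso", "LBaso", "Poly", "Ortho"]
N_REPLICATES = 4  # four biological replicates per stage

# Assumption: replicates are numbered 1..4; the real labels may differ.
design = list(product(STAGES, range(1, N_REPLICATES + 1)))

# Rank lookup so stage labels can be sorted in biological order
# instead of alphabetically.
stage_rank = {stage: rank for rank, stage in enumerate(STAGES)}

shuffled = ["Poly", "Progenitor", "Ortho", "LBaso"]
ordered = sorted(shuffled, key=stage_rank.get)
```

A rank lookup like this may be useful when plotting, since an alphabetical sort would place LBaso before Progenitor.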
Data are sourced from the PRIDE archive (PXD017276). Protein quantities marked as Filtered in the original data are converted to np.nan. Samples collected at day 7 are excluded.
Sample annotation (.obs) includes:
sample_id: Unique identifier (cell_type_replicate).
cell_type: Differentiation stage abbreviation.
replicate: Replicate identifier (one of the four biological replicates per stage).
Variable annotation (.var) includes:
protein_id: Protein group identifier (matches .var_names).
gene_id: Associated gene name(s).
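The conversion of Filtered entries to missing values described above might look roughly like the following. This is a minimal sketch, assuming the raw table cells arrive as strings; parse_quantity and the example values are hypothetical and are not part of proteopy.

```python
import math


def parse_quantity(raw):
    """Convert one raw table cell to a float, mapping 'Filtered' to NaN."""
    if raw.strip() == "Filtered":
        return math.nan
    return float(raw)


# Hypothetical row of raw protein quantities for one sample.
row = ["1532.7", "Filtered", "88.25"]
values = [parse_quantity(cell) for cell in row]
```

Because NaN never compares equal to itself, missing entries should be detected with math.isnan (or np.isnan on arrays) rather than with ==.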
- Parameters:
fill_na (float | int | None, optional) – If not None, replace np.nan in .X with this value.
- Returns:
Protein-level quantification data. .X contains protein intensities (samples x proteins).
- Return type:
AnnData
- Raises:
urllib.error.URLError – If the download from the PRIDE archive fails.
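The behaviour of the fill_na parameter can be illustrated on a plain nested list. This is a sketch of the semantics documented above, not the library's implementation; fill_missing is a hypothetical helper, and the real .X is an array rather than a list of lists.

```python
import math


def fill_missing(matrix, fill_na=None):
    """Return a copy of matrix with NaN replaced by fill_na.

    With fill_na=None the NaN entries are left in place.
    """
    if fill_na is None:
        return [row[:] for row in matrix]
    return [
        [fill_na if math.isnan(value) else value for value in row]
        for row in matrix
    ]


intensities = [[1.0, math.nan], [math.nan, 4.0]]
filled = fill_missing(intensities, fill_na=0)
untouched = fill_missing(intensities)
```

On a NumPy array the equivalent replacement could be written as np.where(np.isnan(X), fill_na, X).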
Examples
>>> import proteopy as pr
>>> adata = pr.datasets.karayel_2020()
>>> adata
AnnData object with n_obs × n_vars
    obs: 'sample_id', 'cell_type', 'replicate'
    var: 'protein_id', 'gene_id'
>>> adata.obs['cell_type'].unique()
['Progenitor', 'ProE&EBaso', 'LBaso', 'Poly', 'Ortho']
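Since the loader raises urllib.error.URLError when the PRIDE download fails, callers may want to retry before giving up. The wrapper below is one possible approach, not part of proteopy: load_with_retry and its retries and delay parameters are hypothetical, and flaky_loader merely stands in for a real loader so the sketch runs offline.

```python
import time
import urllib.error


def load_with_retry(loader, retries=3, delay=1.0):
    """Call loader(), retrying on URLError up to `retries` attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return loader()
        except urllib.error.URLError as err:
            last_error = err
            # Wait before the next attempt, but not after the last one.
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error


# Stand-in loader that fails twice and then succeeds.
calls = {"n": 0}


def flaky_loader():
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.URLError("simulated network failure")
    return "dataset"


result = load_with_retry(flaky_loader, retries=3, delay=0)
```

In practice the loader argument would be the dataset function itself, for example load_with_retry(pr.datasets.karayel_2020).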
References