proteopy.datasets.karayel_2020

proteopy.datasets.karayel_2020(fill_na=None)

Load Karayel 2020 erythropoiesis proteomics dataset.

Download and process the protein-level DIA-MS dataset from Karayel et al. [1] studying dynamic phosphosignaling networks during human erythropoiesis. The study quantified ~7,400 proteins from CD34+ hematopoietic stem/progenitor cells (HSPCs) isolated from healthy donors, across five sequential erythroid differentiation stages with four biological replicates each (20 samples total). Cells were FACS-sorted using CD235a, CD49d, and Band 3 surface markers. The differentiation stages are:

  • Progenitor: CFU-E progenitor cells (CD34+ HSPCs, negative fraction)

  • ProE&EBaso: Proerythroblasts and early basophilic erythroblasts

  • LBaso: Late basophilic erythroblasts

  • Poly: Polychromatic erythroblasts

  • Ortho: Orthochromatic erythroblasts

Data are sourced from the PRIDE archive (PXD017276). Protein quantities marked as Filtered in the original data are converted to np.nan. Samples collected at day 7 are excluded.
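The two cleaning steps described above can be pictured roughly as follows. This is a minimal sketch using pandas, not the loader's actual code: the table layout, the protein and sample labels, and the `day7` prefix are hypothetical. Only the `Filtered` marker, its conversion to np.nan, and the exclusion of day-7 samples come from the description.

```python
import numpy as np
import pandas as pd

# Hypothetical raw export: rows are samples, columns are proteins.
raw = pd.DataFrame(
    {
        "P1": ["1200.5", "Filtered", "950.0"],
        "P2": ["Filtered", "88.0", "91.5"],
    },
    index=["LBaso_1", "LBaso_2", "day7_1"],
)

# Drop samples whose (assumed) label marks them as day-7 collections.
kept = raw.loc[~raw.index.str.startswith("day7")]

# Mask the "Filtered" marker so it becomes NaN, then cast to float.
quant = kept.mask(kept == "Filtered").astype(float)
```

Masking before the cast keeps the conversion explicit: only the `Filtered` marker is turned into a missing value, and any other unexpected string would raise an error instead of being dropped silently.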

Sample annotation (.obs) includes:

  • sample_id: Unique identifier (cell_type_replicate).

  • cell_type: Differentiation stage abbreviation.

  • replicate: Biological replicate identifier.
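Because sample_id is described as cell_type_replicate, the two parts can be recovered from the identifier. The sketch below assumes an underscore separator and uses a made-up identifier; `split_sample_id` is a hypothetical helper, not part of proteopy.

```python
def split_sample_id(sample_id: str) -> tuple[str, str]:
    """Split an identifier of the assumed form 'cell_type_replicate'."""
    # Split at the last underscore only, so stage names that contain
    # other punctuation (for example 'ProE&EBaso') stay intact.
    cell_type, replicate = sample_id.rsplit("_", 1)
    return cell_type, replicate


print(split_sample_id("ProE&EBaso_3"))  # ('ProE&EBaso', '3')
```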

Variable annotation (.var) includes:

  • protein_id: Protein group identifier (matches .var_names).

  • gene_id: Associated gene name(s).

Parameters:

fill_na (float | int | None, optional) – If not None, replace np.nan in .X with this value.
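The effect of fill_na can be illustrated on a plain NumPy array. This is a sketch of the documented behaviour under the assumption that .X is a dense float matrix; `fill_missing` is a hypothetical stand-in, not the function's internal implementation.

```python
import numpy as np


def fill_missing(X, fill_na=None):
    """Return X with NaN replaced by fill_na, or unchanged if fill_na is None."""
    X = np.asarray(X, dtype=float)
    if fill_na is None:
        return X
    # Replace only NaN entries; measured intensities are left untouched.
    return np.where(np.isnan(X), fill_na, X)


X = np.array([[1.0, np.nan], [np.nan, 4.0]])
filled = fill_missing(X, fill_na=0)
```

Note that filling with a constant such as 0 makes missing values indistinguishable from true measurements of that value, so leaving the default None is often the safer choice before imputation.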

Returns:

Protein-level quantification data. .X contains protein intensities (samples x proteins).
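With the samples x proteins orientation, row-wise reductions summarise samples and column-wise reductions summarise proteins. The following sketch uses a small made-up matrix in place of .X to show how missing values might be counted along each axis.

```python
import numpy as np

# Toy stand-in for .X: 2 samples (rows) x 3 proteins (columns).
X = np.array(
    [
        [10.0, np.nan, 3.0],
        [np.nan, np.nan, 7.5],
    ]
)

# Number of quantified (non-NaN) proteins in each sample: reduce over columns.
quantified_per_sample = (~np.isnan(X)).sum(axis=1)

# Fraction of samples in which each protein is missing: reduce over rows.
missing_fraction_per_protein = np.isnan(X).mean(axis=0)
```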

Return type:

AnnData

Raises:

urllib.error.URLError – If the download from the PRIDE archive fails.
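Since the download can fail with urllib.error.URLError, callers may want to retry transient network errors. The helper below is a hypothetical sketch, not part of proteopy: it accepts any zero-argument loader, so it could be called as `load_with_retry(pr.datasets.karayel_2020)`.

```python
import time
import urllib.error


def load_with_retry(loader, attempts=3, delay=0.0):
    """Call loader(), retrying on URLError up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return loader()
        except urllib.error.URLError:
            # Re-raise once the final attempt has failed.
            if attempt == attempts:
                raise
            time.sleep(delay)
```

Only URLError is caught, so unrelated failures such as parsing errors still surface immediately.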

Examples

>>> import proteopy as pr
>>> adata = pr.datasets.karayel_2020()
>>> adata
AnnData object with n_obs × n_vars
    obs: 'sample_id', 'cell_type', 'replicate'
    var: 'protein_id', 'gene_id'
>>> adata.obs['cell_type'].unique().tolist()
['Progenitor', 'ProE&EBaso', 'LBaso', 'Poly', 'Ortho']

References