`.karayel_2020`

proteopy.datasets.karayel_2020(fill_na=None)

Load Karayel 2020 erythropoiesis proteomics dataset.

Download and process the protein-level DIA-MS dataset from Karayel et al. [1], a study of dynamic phosphosignaling networks during human erythropoiesis. The study quantified ~7,400 proteins from CD34+ hematopoietic stem/progenitor cells (HSPCs) isolated from healthy donors, across five sequential erythroid differentiation stages with four biological replicates each (20 samples total). Cells were FACS-sorted using the surface markers CD235a, CD49d, and Band 3. The differentiation stages are:

Progenitor: CFU-E progenitor cells (CD34+ HSPCs, negative fraction)
ProE&EBaso: Proerythroblasts and early basophilic erythroblasts
LBaso: Late basophilic erythroblasts
Poly: Polychromatic erythroblasts
Ortho: Orthochromatic erythroblasts

Data are sourced from the PRIDE archive (PXD017276). Protein quantities marked as Filtered in the original data are converted to np.nan. Samples collected at day 7 are excluded.
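The conversion of Filtered entries to np.nan can be illustrated with a minimal sketch. This is not the loader's actual implementation; the helper name to_float_or_nan and the toy values are hypothetical, and the only detail taken from the description above is that the placeholder string is Filtered.

```python
import numpy as np


def to_float_or_nan(value):
    """Return value as a float, or np.nan for the 'Filtered' placeholder.

    Hypothetical helper for illustration; the real loader may parse
    the report differently.
    """
    if isinstance(value, str) and value.strip() == "Filtered":
        return np.nan
    return float(value)


# Toy rows standing in for protein quantities read as text.
raw_rows = [
    ["1520.5", "Filtered", "980.0"],
    ["Filtered", "2210.1", "1034.7"],
]

# Build a numeric matrix where every 'Filtered' cell becomes np.nan.
X = np.array([[to_float_or_nan(v) for v in row] for row in raw_rows])
n_missing = int(np.isnan(X).sum())
```

After conversion the matrix is purely numeric, so missing values can be counted or masked with np.isnan.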

Sample annotation (.obs) includes:

sample_id: Unique identifier (cell_type_replicate).
cell_type: Differentiation stage abbreviation.
replicate: Replicate identifier (one of the four biological replicates per stage).
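As a rough illustration of how the three .obs columns relate, the sketch below builds one record per sample from the five stages and four replicates. The replicate labels (1 to 4) and the exact sample_id string format are assumptions made for this example; the values in the loaded dataset may differ.

```python
from itertools import product

# Stage abbreviations as listed in the description above.
stages = ["Progenitor", "ProE&EBaso", "LBaso", "Poly", "Ortho"]

# Hypothetical replicate labels; the dataset may label them differently.
replicates = [1, 2, 3, 4]

# One record per sample, with sample_id composed as cell_type_replicate.
obs = [
    {"sample_id": f"{stage}_{rep}", "cell_type": stage, "replicate": rep}
    for stage, rep in product(stages, replicates)
]
```

Five stages times four replicates gives the 20 samples stated in the description, each with a distinct sample_id.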

Variable annotation (.var) includes:

protein_id: Protein group identifier (matches .var_names).
gene_id: Associated gene name(s).

Parameters: fill_na (float | int | None, optional) – If not None, replace np.nan in .X with this value.
Returns: Protein-level quantification data. .X contains protein intensities (samples x proteins).
Return type: AnnData
Raises: urllib.error.URLError – If the download from the PRIDE archive fails.
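The effect of fill_na can be sketched on a plain NumPy array. This is an assumption about the behavior based only on the parameter description, not the library's code; fill_missing is a hypothetical helper name.

```python
import numpy as np


def fill_missing(X, fill_na=None):
    """Mimic the documented fill_na semantics on a plain array.

    Hypothetical helper: returns X unchanged when fill_na is None,
    otherwise a new array with every np.nan replaced by fill_na.
    """
    X = np.asarray(X, dtype=float)
    if fill_na is None:
        return X
    return np.where(np.isnan(X), fill_na, X)


X_demo = np.array([[1.0, np.nan], [np.nan, 4.0]])
kept = fill_missing(X_demo)        # NaNs preserved
filled = fill_missing(X_demo, 0)   # NaNs replaced by 0
```

Leaving fill_na as None keeps missing values explicit. Filling with a constant such as 0 is convenient for tools that reject NaN, but it makes filtered measurements indistinguishable from true zero intensities.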

Examples

>>> import proteopy as pr
>>> adata = pr.datasets.karayel_2020()
>>> adata
AnnData object with n_obs × n_vars
    obs: 'sample_id', 'cell_type', 'replicate'
    var: 'protein_id', 'gene_id'

>>> adata.obs['cell_type'].unique()
['Progenitor', 'ProE&EBaso', 'LBaso', 'Poly', 'Ortho']

References

.karayel_2020