.long

proteopy.read.long(intensities, level=None, *, sample_annotation=None, var_annotation=None, column_map=None, sep=None, fill_na=None, zero_to_na=False, sort_obs_by_annotation=False, verbose=False)

Read long-format peptide or protein tabular data into an AnnData container.

The intensities table must be in long format with one row per (sample, feature) measurement. Required columns differ by level:

  • Peptide level: sample_id, intensity, and peptide_id must be present. protein_id may come from the intensities table or from var_annotation; see below.

  • Protein level: sample_id, intensity, and protein_id must all be present.

At peptide level, protein_id is resolved from one of two sources, checked in order. If the intensities table already contains protein_id, that column is used directly. Otherwise, var_annotation must be supplied and must contain both peptide_id and protein_id.
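
The column requirements above can be summarised as a small check. The following is a plain-Python sketch of that logic under the rules stated here, not proteopy's implementation; the helper name protein_id_source and its error messages are hypothetical and exist only for illustration.

```python
# Illustrative sketch only; `protein_id_source` is a hypothetical helper,
# not part of proteopy.
REQUIRED = {
    "peptide": {"sample_id", "intensity", "peptide_id"},
    "protein": {"sample_id", "intensity", "protein_id"},
}


def protein_id_source(intensity_cols, level, var_annotation_cols=None):
    """Return which table would supply protein_id, or raise ValueError."""
    if level not in REQUIRED:
        raise ValueError("level must be 'peptide' or 'protein'")
    cols = set(intensity_cols)
    missing = REQUIRED[level] - cols
    if missing:
        raise ValueError(f"intensities lacks columns: {sorted(missing)}")
    # Protein level has already been checked for protein_id above.
    # Peptide level prefers protein_id from the intensities table.
    if "protein_id" in cols:
        return "intensities"
    # Otherwise var_annotation must map peptide_id to protein_id.
    needed = {"peptide_id", "protein_id"}
    if var_annotation_cols is None or not needed <= set(var_annotation_cols):
        raise ValueError(
            "protein_id must be in intensities or in var_annotation "
            "(together with peptide_id)"
        )
    return "var_annotation"


print(protein_id_source(
    ["sample_id", "peptide_id", "intensity"],
    "peptide",
    ["peptide_id", "protein_id"],
))
```

The two peptide-level cases correspond to Example 1 (protein_id in the intensities table) and Example 2 (protein_id taken from var_annotation).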

sample_annotation, when supplied, must contain a sample_id column and is merged into adata.obs.

var_annotation, when supplied, must contain a peptide_id column (peptide level) or a protein_id column (protein level) and is merged into adata.var.

Column names that differ from the defaults above can be mapped to the canonical names via column_map.
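
As Example 3 shows, column_map is keyed by the canonical name and gives the custom column name as the value. A minimal plain-Python sketch of what such a remapping amounts to is given here; to_canonical is a hypothetical helper used only for illustration and is not how proteopy necessarily performs the rename.

```python
# Illustrative sketch only; `to_canonical` is a hypothetical helper,
# not part of proteopy.
def to_canonical(rows, column_map):
    """Rename custom column names in each row to the canonical names."""
    # column_map goes canonical -> custom, so invert it for renaming.
    rename = {custom: canonical for canonical, custom in column_map.items()}
    return [
        {rename.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


rows = [{"run": "S1", "seq": "PEP1", "prot": "PROT1", "quant": 12450.0}]
column_map = {
    "sample_id": "run",
    "peptide_id": "seq",
    "protein_id": "prot",
    "intensity": "quant",
}
print(to_canonical(rows, column_map))
```

Columns that do not appear in column_map keep their original names in this sketch.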

Parameters:
  • intensities (str | Path | pd.DataFrame) – Long-form intensities data. Accepts a file path (str or Path) or a pandas.DataFrame.

  • level ({"peptide", "protein"}, default None) – Whether to process peptide- or protein-level input. The signature default is None, but a value must be supplied.

  • sample_annotation (str | Path | pd.DataFrame, optional) – Sample (obs) annotations. Accepts a file path or DataFrame. Must contain a sample_id column.

  • var_annotation (str | Path | pd.DataFrame, optional) – Feature (var) annotations. Accepts a file path or DataFrame. Interpreted as peptide annotations when level="peptide" and as protein annotations when level="protein".

  • column_map (dict, optional) – Mapping from the canonical keys (peptide_id, protein_id, sample_id, intensity) to the column names actually used in the input, for example {"sample_id": "run"}.

  • sep (str, optional) – Delimiter passed to pandas.read_csv. If None (the default), the separator is auto-detected from the file extension. Ignored when input is a DataFrame.

  • fill_na (float, optional) – Replacement value for missing intensity entries.

  • zero_to_na (bool, default False) – If True, zeros in the AnnData X matrix are replaced with np.nan.

  • sort_obs_by_annotation (bool, default False) – When True, reorder observations to match the order of samples in the annotation (if supplied) or the original intensity table.

  • verbose (bool, default False) – If True, print status messages.
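
The two missing-value options, zero_to_na and fill_na, can be pictured as simple element-wise replacements. The sketch below uses plain Python lists in place of the X matrix; both helper names are hypothetical, and it makes no claim about the order in which proteopy applies the two options when both are set.

```python
import math


# Illustrative sketch only; these helpers are hypothetical and operate on
# nested lists rather than on an AnnData X matrix.
def zeros_to_nan(matrix):
    """Replace exact zeros with NaN (the idea behind zero_to_na=True)."""
    return [[math.nan if value == 0 else value for value in row]
            for row in matrix]


def fill_missing(matrix, fill_value):
    """Replace NaN entries with fill_value (the idea behind fill_na)."""
    return [[fill_value if math.isnan(value) else value for value in row]
            for row in matrix]


print(zeros_to_nan([[0.0, 8730.0], [15320.0, 0.0]]))
print(fill_missing([[math.nan, 8730.0], [15320.0, 6890.0]], 0.0))
```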

Returns:

Structured representation of the long-form intensities ready for downstream analysis.

Return type:

AnnData
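
To see why four long-format rows produce a 2 × 2 object in the examples that follow, it can help to pivot the rows by hand. This is a conceptual plain-Python sketch, not proteopy's code: pivot_long is a hypothetical helper, and filling unmeasured (sample, feature) pairs with NaN is an assumption of the sketch.

```python
import math


# Illustrative sketch only; `pivot_long` is a hypothetical helper.
def pivot_long(rows, feature_key="peptide_id"):
    """Pivot long-format rows into (samples, features, matrix)."""
    samples, features = [], []
    # Collect sample and feature labels in order of first appearance.
    for row in rows:
        if row["sample_id"] not in samples:
            samples.append(row["sample_id"])
        if row[feature_key] not in features:
            features.append(row[feature_key])
    # Assumption: pairs with no measurement are left as NaN.
    matrix = [[math.nan] * len(features) for _ in samples]
    for row in rows:
        i = samples.index(row["sample_id"])
        j = features.index(row[feature_key])
        matrix[i][j] = row["intensity"]
    return samples, features, matrix


rows = [
    {"sample_id": "S1", "peptide_id": "PEP1", "intensity": 12450.0},
    {"sample_id": "S1", "peptide_id": "PEP2", "intensity": 8730.0},
    {"sample_id": "S2", "peptide_id": "PEP1", "intensity": 15320.0},
    {"sample_id": "S2", "peptide_id": "PEP2", "intensity": 6890.0},
]
samples, features, matrix = pivot_long(rows)
print(samples, features, matrix)
```

Samples become observations (rows) and peptides become variables (columns), which matches the n_obs × n_vars shape reported by AnnData.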

Examples

Example 1: Minimal peptide-level read with protein_id in the intensities DataFrame.

>>> import pandas as pd
>>> import proteopy as pr
>>> intensities = pd.DataFrame({
...     "sample_id": [
...         "S1", "S1", "S2", "S2",
...     ],
...     "peptide_id": [
...         "PEP1", "PEP2", "PEP1", "PEP2",
...     ],
...     "protein_id": [
...         "PROT1", "PROT1", "PROT1", "PROT1",
...     ],
...     "intensity": [
...         12450.0, 8730.0, 15320.0, 6890.0,
...     ],
... })
>>> adata = pr.read.long(
...     intensities, level="peptide",
... )
>>> adata
AnnData object with n_obs × n_vars = 2 × 2
    obs: 'sample_id'
    var: 'peptide_id', 'protein_id'

Example 2: Peptide-level read with protein_id supplied via var_annotation instead of the intensities DataFrame.

>>> intensities = pd.DataFrame({
...     "sample_id": [
...         "S1", "S1", "S2", "S2",
...     ],
...     "peptide_id": [
...         "PEP1", "PEP2", "PEP1", "PEP2",
...     ],
...     "intensity": [
...         12450.0, 8730.0, 15320.0, 6890.0,
...     ],
... })
>>> var_ann = pd.DataFrame({
...     "peptide_id": ["PEP1", "PEP2"],
...     "protein_id": ["PROT1", "PROT1"],
... })
>>> adata = pr.read.long(
...     intensities,
...     level="peptide",
...     var_annotation=var_ann,
... )
>>> adata
AnnData object with n_obs × n_vars = 2 × 2
    obs: 'sample_id'
    var: 'peptide_id', 'protein_id'

Example 3: Peptide-level read with non-standard column names remapped via column_map.

>>> intensities = pd.DataFrame({
...     "run": ["S1", "S1", "S2", "S2"],
...     "seq": [
...         "PEP1", "PEP2", "PEP1", "PEP2",
...     ],
...     "prot": [
...         "PROT1", "PROT1", "PROT1", "PROT1",
...     ],
...     "quant": [
...         12450.0, 8730.0, 15320.0, 6890.0,
...     ],
... })
>>> adata = pr.read.long(
...     intensities,
...     level="peptide",
...     column_map={
...         "sample_id": "run",
...         "peptide_id": "seq",
...         "protein_id": "prot",
...         "intensity": "quant",
...     },
... )
>>> adata
AnnData object with n_obs × n_vars = 2 × 2
    obs: 'sample_id'
    var: 'peptide_id', 'protein_id'