Proteoform inference: reproducing Bludau et al. (2021)

This notebook shows how to use ProteoPy for proteoform group inference following the approach of Bludau et al. (2021):

Bludau I, Frank M, Dörig C, Cai Y, Heusel M, Rosenberger G, Picotti P, Collins BC, Röst H, and Aebersold R. Systematic detection of functional proteoform groups from bottom-up proteomic datasets. Nature Communications, 12:3810, 2021. doi:10.1038/s41467-021-24030-x

The authors introduced COPF (COrrelation-based functional ProteoForm assessment), a data-driven strategy that assigns peptides to co-varying proteoform groups based on peptide correlation patterns. They applied it to SWATH-MS (DIA) data from five mouse tissues (brain, brown adipose tissue, heart, liver, and quadriceps) across eight BXD mice, originally published by Williams et al. (2018):

Williams EG, Wu Y, Wolski W, Kim JY, Lan J, Hasan M, Halter C, Jha P, Ryu D, Auwerx J, and Aebersold R. Quantifying and Localizing the Mitochondrial Proteome Across Five Tissues in A Mouse Population. Molecular & Cellular Proteomics, 17(9):1766–1777, 2018. doi:10.1074/mcp.RA118.000554

Here, we show that the COPF implementation in ProteoPy can reproduce the core proteoform inference workflow and selected results (Figures 6B and 7A/C/D/F).

Setup

[1]:
# Jupyter autoreload for development
%load_ext autoreload
%autoreload 2
[2]:
import os
import random
from pathlib import Path
import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context

import proteopy as pr  # Convention: import proteopy as pr
from proteopy.utils import is_proteodata

random.seed(42)

cwd = Path(".").resolve()
root = cwd.parents[1]
os.chdir(root)
/home/ifichtner/miniforge3/envs/proteopy-usage2/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Reading in the data

The Williams et al. (2018) dataset contains peptide-level SWATH-MS intensities from five tissues of eight BXD mice. ProteoPy provides a built-in download function (pr.download.williams_2018()) that fetches the dataset from the original supplementary archive, processes it, and saves it as three separate files:

  • Intensities — long-format table with columns sample_id, peptide_id, and intensity (one row per measurement)

  • Sample annotation — maps each sample_id to its tissue and mouse_id

  • Peptide annotation — maps each peptide_id to its protein_id and gene_id

This separation into three files mirrors a common data delivery format in proteomics. The files are then read into an AnnData object using pr.read.long().

[3]:
intensities_path = "data/williams-2018_mouse-tissue_intensities.tsv"
sample_annotation_path = "data/williams-2018_mouse-tissue_sample_annotation.tsv"
peptide_annotation_path = "data/williams-2018_mouse-tissue_peptide_annotation.tsv"

pr.download.williams_2018(
    intensities_path=intensities_path,
    sample_annotation_path=sample_annotation_path,
    var_annotation_path=peptide_annotation_path,
    force=True,  # Overwrite files if already exist
)
Downloading data from 'https://ars.els-cdn.com/content/image/1-s2.0-S1535947620320569-mmc1.zip' to file '/home/ifichtner/.cache/proteopy/williams_2018_mmc1.zip'.

A quick look at the first rows of each file confirms the expected structure:

[4]:
!head -n5 {intensities_path}
sample_id       peptide_id      intensity
C57_Brain       AAAAAAAAAAAAAAAGAAGK    26302.0
DBA_Brain       AAAAAAAAAAAAAAAGAAGK    15460.0
45_Brain        AAAAAAAAAAAAAAAGAAGK    24631.0
66_Brain        AAAAAAAAAAAAAAAGAAGK    26554.0
[5]:
!head -n5 {sample_annotation_path}
sample_id       tissue  mouse_id
C57_Brain       Brain   C57
DBA_Brain       Brain   DBA
45_Brain        Brain   45
66_Brain        Brain   66
[6]:
!head -n5 {peptide_annotation_path}
peptide_id      protein_id      gene_id
AAAAAAAAAAAAAAAGAAGK    P55012  Slc12a2
AAAAADLANR      Q9JHS4  Clpx
AAAADGEPLHNEEER Q80WW9  Ddrgk1
AAAAEGARPLER    Q80UM7  Mogs

With the three files on disk, pr.read.long() reads the long-format intensities, pivots them into a dense matrix, and merges the sample and peptide annotations into a single AnnData object. Setting fill_na=0 replaces any missing intensity values with zero.

[7]:
adata = pr.read.long(
    intensities=intensities_path,
    sample_annotation=sample_annotation_path,
    var_annotation=peptide_annotation_path,
    level="peptide",
    fill_na=0,
)

adata
[7]:
AnnData object with n_obs × n_vars = 40 × 32690
    obs: 'sample_id', 'tissue', 'mouse_id'
    var: 'peptide_id', 'protein_id', 'protein_id_annotation', 'gene_id'
[8]:
print("Is proteodata: ", is_proteodata(adata))
Is proteodata:  (True, 'peptide')
[9]:
adata.obs.head(n=10)
[9]:
sample_id tissue mouse_id
101_Brain 101_Brain Brain 101
45_Brain 45_Brain Brain 45
66_Brain 66_Brain Brain 66
68_Brain 68_Brain Brain 68
73_Brain 73_Brain Brain 73
80_Brain 80_Brain Brain 80
BAT_101 BAT_101 BAT 101
BAT_45 BAT_45 BAT 45
BAT_66 BAT_66 BAT 66
BAT_68 BAT_68 BAT 68
[10]:
adata.var.head(n=10)
[10]:
peptide_id protein_id protein_id_annotation gene_id
AAAAAAAAAAAAAAAGAAGK AAAAAAAAAAAAAAAGAAGK P55012 P55012 Slc12a2
AAAAADLANR AAAAADLANR Q9JHS4 Q9JHS4 Clpx
AAAADGEPLHNEEER AAAADGEPLHNEEER Q80WW9 Q80WW9 Ddrgk1
AAAAEGARPLER AAAAEGARPLER Q80UM7 Q80UM7 Mogs
AAAAKEEAPK AAAAKEEAPK Q91XV3 Q91XV3 Basp1
AAAANLC(UniMod:4)PGDVILAIDGFGTESMTHADAQDR AAAANLC(UniMod:4)PGDVILAIDGFGTESMTHADAQDR O70209 O70209 Pdlim3
AAAAYALGR AAAAYALGR Q6ZQ73 Q6ZQ73 Cand2
AAADLM(UniMod:35)AYC(UniMod:4)EAHAKEDPLLTPVPASENPFR AAADLM(UniMod:35)AYC(UniMod:4)EAHAKEDPLLTPVPAS... P63213 P63213 Gng2
AAADLMAYC(UniMod:4)EAHAK AAADLMAYC(UniMod:4)EAHAK P63213 P63213 Gng2
AAADLMAYC(UniMod:4)EAHAKEDPLLTPVPASENPFR AAADLMAYC(UniMod:4)EAHAKEDPLLTPVPASENPFR P63213 P63213 Gng2

Define consistent color schemes for sample annotation metadata

[11]:
# Tissues
n_tissues = adata.obs["tissue"].nunique()
cmap = mpl.colormaps["Set2"]
adata.uns["colors_tissue"] = cmap(range(n_tissues)).tolist()

tissue_order = ["Brain", "BAT", "Heart", "Liver", "Quad"]

# BXD mice
n_mice = adata.obs["mouse_id"].nunique()
cmap = mpl.colormaps["tab10"]
adata.uns["colors_mouse_id"] = cmap(range(n_mice)).tolist()

Quality control and preprocessing

By study design, the dataset includes 5 tissues derived from 8 BXD mice:

[12]:
pr.pl.n_samples_per_category(
    adata,
    category_key="tissue",
    order=tissue_order,
    color_scheme=adata.uns["colors_tissue"],
)
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_18_0.png
[13]:
pr.pl.n_samples_per_category(
    adata,
    category_key="mouse_id",
    color_scheme=adata.uns["colors_mouse_id"],
)
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_19_0.png

Similar to the original analysis, we remove peptides from the iRT protein, which was used for calibration, and A2ASS6, which has extraordinarily many peptides and distorts the analysis.

[14]:
irt_mask = (adata.var["protein_id"] == "iRT_protein")
adata = adata[:, ~irt_mask]
[15]:
A2ASS6_mask = (adata.var["protein_id"] == "A2ASS6")
print(f"N peptides for protein A2ASS6: {A2ASS6_mask.sum()}")
N peptides for protein A2ASS6: 919
[16]:
adata = adata[:, ~A2ASS6_mask]

Next, peptides that differ only by technical modifications or derived from missed cleavages are summarized by summing up their intensities.

[17]:
pr.pp.summarize_modifications(adata, method="sum", verbose=True)
pr.pp.summarize_overlapping_peptides(adata)

Stripping modifications: 31760 peptides -> 29862 unique stripped sequences (method='sum').

After these preprocessing steps, the dataset includes approximately 25,000 peptides and 3,800 proteins per sample.

[18]:
pr.pl.n_peptides_per_sample(
    adata,
    zero_to_na=True,
    order_by="tissue",
    order=tissue_order,
    color_scheme=adata.uns["colors_tissue"],
    print_stats=True,
)
Global:
 mean_count  std_count  median_count  min_count  max_count  mean_pct  std_pct  median_pct  min_pct  max_pct
    25184.2      199.1       25258.5      24623      25418      98.8      0.8        99.1     96.6     99.7

Per tissue:
tissue  mean_count  std_count  median_count  min_count  max_count  mean_pct  std_pct  median_pct  min_pct  max_pct
   BAT     25181.2      191.5       25077.5      24997      25418      98.8      0.8        98.4     98.0     99.7
 Brain     25269.4        7.2       25268.5      25261      25280      99.1      0.0        99.1     99.1     99.1
 Heart     24860.5      110.9       24858.0      24623      24996      97.5      0.4        97.5     96.6     98.0
 Liver     25369.6       11.8       25372.0      25352      25386      99.5      0.0        99.5     99.4     99.6
  Quad     25240.0       27.7       25240.0      25211      25289      99.0      0.1        99.0     98.9     99.2
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_27_1.png
[19]:
pr.pl.n_proteins_per_sample(
    adata,
    zero_to_na=True,
    order_by="tissue",
    order=tissue_order,
    color_scheme=adata.uns["colors_tissue"],
    print_stats=True,
)
Global:
 mean_count  std_count  median_count  min_count  max_count  mean_pct  std_pct  median_pct  min_pct  max_pct
     3808.8        5.8        3811.5       3790       3815      99.8      0.2        99.9     99.3     99.9

Per tissue:
tissue  mean_count  std_count  median_count  min_count  max_count  mean_pct  std_pct  median_pct  min_pct  max_pct
   BAT      3807.1        5.3        3805.0       3801       3814      99.7      0.1        99.7     99.6     99.9
 Brain      3812.1        0.6        3812.0       3811       3813      99.9      0.0        99.9     99.8     99.9
 Heart      3800.1        4.6        3801.0       3790       3806      99.6      0.1        99.6     99.3     99.7
 Liver      3814.0        0.5        3814.0       3813       3815      99.9      0.0        99.9     99.9     99.9
  Quad      3810.8        1.5        3811.0       3808       3813      99.8      0.0        99.8     99.8     99.9
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_28_1.png
[20]:
pr.pl.n_peptides_per_protein(adata, print_stats=True)
    mean  median  mode  variance  min  max
6.680115     4.0     1 70.811955    1  118
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_29_1.png
[20]:
<Axes: xlabel='Number of peptide_id per protein_id', ylabel='# protein_id'>

Note that no log transformation, data normalization or missing value imputation is performed here. This is because of the overall very high data consistency across the dataset and because COPF is based on covariation rather than direct intensity differences across conditions. This is also in agreement with the analysis by Bludau et al. (2021). However, other datasets might require additional preprocessing steps.

Exploratory analysis of peptide-level data

Before proteoform inference, ProteoPy can be used to perform classical exploratory data analysis steps, for example to investigate the primary sources of variation in the dataset.

Sample correlation matrix

High intra-tissue correlation and lower inter-tissue correlation show that peptide intensities are strongly driven by the tissue of origin and not by the mouse strain.

[21]:
pr.pl.sample_correlation_matrix(
    adata,
    method="pearson",
    margin_color="tissue",
    color_scheme=adata.uns["colors_tissue"],
)
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_33_0.png

This is further confirmed by both PCA and UMAP, where the tissue is clearly shown as the main source of variation.

PCA

[22]:
sc.tl.pca(adata)

with rc_context({"figure.figsize": (5, 3)}):
    sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)

../_images/tutorials_proteoform-inference_bludau-2021_reproduction_36_0.png
[23]:
sc.pl.pca(
    adata,
    color=["tissue", "tissue", "tissue"],
    dimensions=[(0, 1), (1, 2), (2, 3)],
    ncols=3,
    size=90,
    palette=adata.uns["colors_tissue"],
)

sc.pl.pca(
    adata,
    color=["mouse_id", "mouse_id", "mouse_id"],
    dimensions=[(0, 1), (1, 2), (2, 3)],
    ncols=3,
    size=90,
    palette=adata.uns["colors_mouse_id"],
)
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_37_0.png
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_37_1.png

UMAP

[24]:
sc.pp.neighbors(adata, n_neighbors=4)
sc.tl.umap(adata)

sc.pl.umap(
    adata,
    color=["tissue"],
    size=100,
    palette=adata.uns["colors_tissue"],
)

sc.pl.umap(
    adata,
    color=["mouse_id"],
    size=100,
    palette=adata.uns["colors_mouse_id"],
)
/home/ifichtner/miniforge3/envs/proteopy-usage2/lib/python3.10/site-packages/sklearn/manifold/_spectral_embedding.py:328: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn(
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_39_1.png
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_39_2.png

Proteoform inference

ProteoPy includes the core functionality of the COPF workflow: (1) compute pairwise peptide correlations between all peptides of a protein, (2) perform hierarchical clustering on the correlation distances, (3) cut each dendrogram into two clusters representing proteoform groups, and (4) score the between- vs. within-cluster correlation difference. Proteins require at least 4 peptides for meaningful clustering.

[25]:
# Require ≥4 peptides per protein for COPF analysis
pr.pp.filter_proteins_by_peptide_count(adata, min_count=4)
pr.pp.remove_zero_variance_vars(adata)
Removed 1770 proteins and 2984 peptides.
Removed 12 variables.
[26]:
# COPF pipeline: correlate, cluster, score
pr.tl.pairwise_peptide_correlations(adata)
pr.tl.peptide_dendograms_by_correlation(adata, method="agglomerative-hierarchical-clustering")
pr.tl.peptide_clusters_from_dendograms(adata, n_clusters=2, min_peptides_per_cluster=2)
pr.tl.proteoform_scores(adata, min_score=0.1, min_pval_adj=0.1)
[27]:
# Remove COPF outlier peptides (cluster_id == 1000000)
copf_outliers = (adata.var["cluster_id"] == 1000000)
adata = adata[:, ~copf_outliers]

Proteoform scores

The pseudo-volcano plot shows proteoform scores vs. adjusted p-values (related to Figure 6B in Bludau et al. (2021)). The two highlighted proteins — Ldb3 (Q9JKS4) and Sorbs2 (Q3UTJ2) — are examined in detail below.

[28]:
bludauI_proteoforms = ["Q9JKS4", "Q3UTJ2"]
pr.pl.proteoform_scores(
    adata,
    adj=True,
    pval_threshold=0.1,
    score_threshold=0.1,
    log_scores=True,
    highlight_prots=bludauI_proteoforms,
)
Looks like you are using a tranform that doesn't support FancyArrowPatch, using ax.annotate instead. The arrows might strike through texts. Increasing shrinkA in arrowprops might help.
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_45_1.png
[28]:
<Axes: xlabel='Proteoform Score', ylabel='-log10(adj. p-value)'>
[29]:
pf_df = pr.get.proteoforms_df(adata, only_proteins=True)
n_proteoforms = pf_df[pf_df["is_proteoform"] == 1].shape[0]
print(f"{n_proteoforms} significant proteoforms inferred.")
65 significant proteoforms inferred.

Proteoform exploration

ProteoPy also recovered the two exemplary proteins with proteoform groups highlighted in Bludau et al (2021): LIM domain-binding protein 3 (Ldb3, Q9JKS4) and Sorbin and SH3 domain-containing protein 2 (Sorbs2, Q3UTJ2).

LIM domain-binding protein 3 — Ldb3 (Q9JKS4)

Ldb3 is a muscle-specific protein. COPF assigns its peptides to two tissue-specific proteoform groups, consistent with known alternative splice variants (reproduces Figure 7A).

[30]:
pr.pl.proteoform_intensities(
    adata,
    protein_ids="Q9JKS4",
    order_by="tissue",
    order=tissue_order,
    xlab_rotation=45,
)
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_49_0.png
[31]:
pr.get.proteoforms_df(adata, proteins="Q9JKS4")
[31]:
protein_id peptide_id cluster_id proteoform_score proteoform_score_pval proteoform_score_pval_adj is_proteoform
0 Q9JKS4 QYNNPIGLYSAETLR 0.0 0.529813 0.002771 0.062994 1.0
1 Q9JKS4 ASSEGAQGSVSPK 0.0 0.529813 0.002771 0.062994 1.0
2 Q9JKS4 VLPGPSQPR 0.0 0.529813 0.002771 0.062994 1.0
3 Q9JKS4 EMAQMYQMSLR 0.0 0.529813 0.002771 0.062994 1.0
4 Q9JKS4 ILAQMTGTEYMQDPDEEALRR 1.0 0.529813 0.002771 0.062994 1.0
5 Q9JKS4 IMGEVMHALR 1.0 0.529813 0.002771 0.062994 1.0
6 Q9JKS4 QTWHTTCFVCAACK 1.0 0.529813 0.002771 0.062994 1.0
7 Q9JKS4 SKRPIPISTTAPPIQSPLPVIPHQK 1.0 0.529813 0.002771 0.062994 1.0
8 Q9JKS4 SASYNLSLTLQK 1.0 0.529813 0.002771 0.062994 1.0
9 Q9JKS4 SWHPEEFNCAYCK 1.0 0.529813 0.002771 0.062994 1.0
10 Q9JKS4 ASGAGLLGGSLPVKDLAVDSASPVYQAVIK 1.0 0.529813 0.002771 0.062994 1.0
11 Q9JKS4 TPLCGHCNNVIR 1.0 0.529813 0.002771 0.062994 1.0
12 Q9JKS4 TQSKPEDEADEWARR 1.0 0.529813 0.002771 0.062994 1.0
13 Q9JKS4 TSLADVCFVEEQNNVYCER 1.0 0.529813 0.002771 0.062994 1.0
14 Q9JKS4 AAQSQLSQGDLVVAIDGVNTDTMTHLEAQNK 1.0 0.529813 0.002771 0.062994 1.0
15 Q9JKS4 CYEQFFAPICAK 1.0 0.529813 0.002771 0.062994 1.0
16 Q9JKS4 LQGGKDFNMPLTISR 1.0 0.529813 0.002771 0.062994 1.0
17 Q9JKS4 GAPAYNPTGPQVTPLAR 1.0 0.529813 0.002771 0.062994 1.0
18 Q9JKS4 GPFLVAMGR 1.0 0.529813 0.002771 0.062994 1.0

Ldb3 peptide sequence map (reproduces Figure 7C)

Mapping detected peptides onto the canonical protein sequence and annotated UniProt alternative sequences confirms that the two proteoform groups correspond to known splice variants.

[32]:
# Canonical sequence and UniProt alternative sequences for Q9JKS4
# Source: https://www.uniprot.org/uniprotkb/Q9JKS4/entry#sequences
seq = "MSYSVTLTGPGPWGFRLQGGKDFNMPLTISRITPGSKAAQSQLSQGDLVVAIDGVNTDTMTHLEAQNKIKSASYNLSLTLQKSKRPIPISTTAPPIQSPLPVIPHQKDPALDTNGSLATPSPSPEARASPGALEFGDTFSSSFSQTSVCSPLMEASGPVLPLGSPVAKASSEGAQGSVSPKVLPGPSQPRQYNNPIGLYSAETLREMAQMYQMSLRGKASGAGLLGGSLPVKDLAVDSASPVYQAVIKTQSKPEDEADEWARRSSNLQSRSFRILAQMTGTEYMQDPDEEALRRSSTPIEHAPVCTSQATSPLLPASAQSPAAASPIAASPTLATAAATHAAAASAAGPAASPVENPRPQASAYSPAAAASPAPSAHTSYSEGPAAPAPKPRVVTTASIRPSVYQPVPASSYSPSPGANYSPTPYTPSPAPAYTPSPAPTYTPSPAPTYSPSPAPAYTPSPAPNYTPTPSAAYSGGPSESASRPPWVTDDSFSQKFAPGKSTTTVSKQTLPRGAPAYNPTGPQVTPLARGTFQRAERFPASSRTPLCGHCNNVIRGPFLVAMGRSWHPEEFNCAYCKTSLADVCFVEEQNNVYCERCYEQFFAPICAKCNTKIMGEVMHALRQTWHTTCFVCAACKKPFGNSLFHMEDGEPYCEKDYINLFSTKCHGCDFPVEAGDKFIEALGHTWHDTCFICAVCHVNLEGQPFYSKKDKPLCKKHAHAINV"
alt_seqs = {
    "Q9JKS4-2_0": {"seq_coord": (107, 227), "group": "alt_2"},
    "Q9JKS4-3_0": {"seq_coord": (295, 357), "group": "alt_3"},
    "Q9JKS4-4_0": {"seq_coord": (107, 227), "group": "alt_4"},
    "Q9JKS4-4_1": {"seq_coord": (295, 357), "group": "alt_4"},
    "Q9JKS4-5_0": {"seq_coord": (295, 327), "group": "alt_5"},
    "Q9JKS4-5_1": {"seq_coord": (328, 723), "group": "alt_5"},
    "Q9JKS4-6_0": {"seq_coord": (107, 227), "group": "alt_6"},
    "Q9JKS4-6_1": {"seq_coord": (295, 327), "group": "alt_6"},
    "Q9JKS4-6_2": {"seq_coord": (328, 723), "group": "alt_6"},
}

pr.pl.peptides_on_prot_sequence(
    adata,
    protein_id="Q9JKS4",
    group_by="proteoform_id",
    ref_sequence=seq,
    add_sequences=alt_seqs,
    title="LIM domain binding protein 3 (Q9JKS4)",
    figsize=(8, 4),
)
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_52_0.png
[32]:
<Axes: title={'center': 'LIM domain binding protein 3 (Q9JKS4)'}, xlabel='Position'>

Sorbin and SH3 domain-containing protein 2 — Sorbs2 (Q3UTJ2)

Sorbs2 peptides are assigned to two proteoform groups with distinct tissue expression patterns: one group is abundant across brain, heart, and liver, while the other is brain-specific (reproduces Figure 7D).

[33]:
pr.pl.proteoform_intensities(
    adata,
    protein_ids="Q3UTJ2",
    order_by="tissue",
    order=tissue_order,
    xlab_rotation=45,
)
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_54_0.png
[34]:
pr.get.proteoforms_df(adata, proteins="Q3UTJ2")
[34]:
protein_id peptide_id cluster_id proteoform_score proteoform_score_pval proteoform_score_pval_adj is_proteoform
0 Q3UTJ2 LAFLVSPVPFR 0.0 0.619637 0.000343 0.01393 1.0
1 Q3UTJ2 ASVVEALDSALKDICDQIK 0.0 0.619637 0.000343 0.01393 1.0
2 Q3UTJ2 QGIFPVSYVEVVKR 1.0 0.619637 0.000343 0.01393 1.0
3 Q3UTJ2 APHYPGIGPVDESGIPTAIR 1.0 0.619637 0.000343 0.01393 1.0
4 Q3UTJ2 SFISSSPSSPSR 1.0 0.619637 0.000343 0.01393 1.0
5 Q3UTJ2 SIFEYEPGK 1.0 0.619637 0.000343 0.01393 1.0
6 Q3UTJ2 AQPARPPPPVQPGEIGEAIAK 1.0 0.619637 0.000343 0.01393 1.0
7 Q3UTJ2 SYSSTLTDLGR 1.0 0.619637 0.000343 0.01393 1.0
8 Q3UTJ2 VGIFPISYVEK 1.0 0.619637 0.000343 0.01393 1.0
9 Q3UTJ2 ADLPGSSSTFTK 1.0 0.619637 0.000343 0.01393 1.0
10 Q3UTJ2 GLGDQSSSR 1.0 0.619637 0.000343 0.01393 1.0

Sorbs2 peptide sequence map (reproduces Figure 7F)

The brain-specific proteoform group maps to a region covered by alternative sequences 3, 4 and 5 in UniProt (10 in Bludau et. al. Figure 7F), a brain-specific splice variant.

[35]:
# Canonical sequence and UniProt alternative sequences for Q3UTJ2
# Source: https://www.uniprot.org/uniprotkb/Q3UTJ2/entry#sequences
seq = "MNTDSGGCARKRAAMSVTLTSVKRVQSSPNLLAAGRESQSPDSAWRSYNDRNPETLNGDATYSSLAAKGFRSVRPNLQDKRSPTQSQITINGNSGGAVSPVSYYQRPFSPSAYSLPASLNSSIIMQHGRSLDSAETYSQHAQSLDGTMGSSIPLYRSSEEEKRVTVIKAPHYPGIGPVDESGIPTAIRTTVDRPKDWYKTMFKQIHMVHKPGLYNSPYSAQSHPAAKTQTYRPLSKSHSDNGTDAFKEVPSPVPPPHVPPRPRDQSSTLKHDWDPPDRKVDTRKFRSEPRSIFEYEPGKSSILQHERPVSIYQSSIDRSLERPSSSASMAGDFRKRRKSEPAVGPLRGLGDQSSSRTSPGRADLPGSSSTFTKSFISSSPSSPSRAQGGDDSKMCPPLCSYSGLNGTPSGELECCNAYRQHLDVPGDSQRAITFKNGWQMARQNAEIWSSTEETVSPKIKSRSCDDLLNDDCDSFPDPKTKSESMGSLLCEEDSKESCPMTWASPYIQEVCGNSRSRLKHRSAHNAPGFLKMYKKMHRINRKDLMNSEVICSVKSRILQYEKEQQHRGLLHGWSQSSTEEVPRDVVPTRISEFEKLIQKSKSMPNLGDEMLSPITLEPPQNGLCPKRRFSIESLLEEETQVRHPSQGQRSCKSNTLVPIHIEVTSDEQPRTHMEFSDSDQDGVVSDHSDYVHVEGSSFCSESDFDHFSFTSSESFYGSSHHHHHHHHHHRHLISSCKGRCPASYTRFTTMLKHERAKHENMDRPRRQEMDPGLSKLAFLVSPVPFRRKKILTPQKQTEKAKCKASVVEALDSALKDICDQIKAEKRRGSLPDNSILHRLISELLPQIPERNSSLHALKRSPMHQPFHPLPPDGASHCPLYQNDCGRMPHSASFPDVDTTSNYHAQDYGSALSLQDHESPRSYSSTLTDLGRSASRERRGTPEKEKLPAKAVYDFKAQTSKELSFKKGDTVYILRKIDQNWYEGEHHGRVGIFPISYVEKLTPPEKAQPARPPPPVQPGEIGEAIAKYNFNADTNVELSLRKGDRIILLKRVDQNWYEGKIPGTNRQGIFPVSYVEVVKRNAKGAEDYPDPPLPHSYSSDRIYTLSSNKPQRPGFSHENIQGGGEPFQALYNYTPRNEDELELRESDVVDVMEKCDDGWFVGTSRRTKFFGTFPGNYVKRL"
alt_seqs = {
    "Q3UTJ2-2_0": {"seq_coord": (210, 211), "group": "alt_2"},
    "Q3UTJ2-2_1": {"seq_coord": (307, 308), "group": "alt_2"},
    "Q3UTJ2-3_0": {"seq_coord": (3, 34), "group": "alt_3"},
    "Q3UTJ2-3_1": {"seq_coord": (210, 211), "group": "alt_3"},
    "Q3UTJ2-3_2": {"seq_coord": (387, 914), "group": "alt_3"},
    "Q3UTJ2-3_3": {"seq_coord": (1125, 1180), "group": "alt_3"},
    "Q3UTJ2-4_0": {"seq_coord": (3, 34), "group": "alt_4"},
    "Q3UTJ2-4_1": {"seq_coord": (387, 914), "group": "alt_4"},
    "Q3UTJ2-5_0": {"seq_coord": (44, 67), "group": "alt_5"},
    "Q3UTJ2-5_1": {"seq_coord": (210, 211), "group": "alt_5"},
    "Q3UTJ2-5_2": {"seq_coord": (307, 308), "group": "alt_5"},
    "Q3UTJ2-5_3": {"seq_coord": (387, 914), "group": "alt_5"},
    "Q3UTJ2-5_4": {"seq_coord": (1125, 1135), "group": "alt_5"},
    "Q3UTJ2-6_0": {"seq_coord": (44, 67), "group": "alt_6"},
    "Q3UTJ2-6_1": {"seq_coord": (210, 211), "group": "alt_6"},
    "Q3UTJ2-6_2": {"seq_coord": (308, 316), "group": "alt_6"},
    "Q3UTJ2-6_3": {"seq_coord": (316, 1180), "group": "alt_6"},
    "Q3UTJ2-7_0": {"seq_coord": (310, 313), "group": "alt_7"},
    "Q3UTJ2-7_1": {"seq_coord": (313, 1180), "group": "alt_7"},
}

pr.pl.peptides_on_prot_sequence(
    adata,
    protein_id="Q3UTJ2",
    group_by="proteoform_id",
    ref_sequence=seq,
    add_sequences=alt_seqs,
    title="Sorbin and SH3 domain-containing protein 2 (Q3UTJ2)",
    figsize=(8, 4),
)
../_images/tutorials_proteoform-inference_bludau-2021_reproduction_57_0.png
[35]:
<Axes: title={'center': 'Sorbin and SH3 domain-containing protein 2 (Q3UTJ2)'}, xlabel='Position'>

Proteoform quantification and statistical analysis

To test if the inferred proteoform groups are indeed tissue-specific, a one-way ANOVA analysis can be performed directly using ProteoPy functions. In strong agreement with Bludau et al. (2021), our analysis revealed that 59 out of 65 proteoform-containing proteins (90.8%) are expressed in a tissue-specific manner.

[36]:
# Aggregate peptide intensities to proteoform level
adata_pfs = adata.copy()
pr.pp.quantify_proteoforms(adata_pfs, group_by="proteoform_id")
[37]:
# Retain only significant proteoforms (score >= 0.1, adj. p-value <= 0.1)
pf_mask = (
    (adata_pfs.var["proteoform_score"].astype(float) >= 0.1)
    & (adata_pfs.var["proteoform_score_pval_adj"].astype(float) <= 0.1)
)
adata_pfs = adata_pfs[:, pf_mask].copy()
[38]:
# Log2 transform for statistical testing
adata_pfs.layers["raw"] = adata_pfs.X
adata_pfs.X[adata_pfs.X == 0] = np.nan
adata_pfs.X = np.log2(adata_pfs.X)
adata_pfs.X[np.isnan(adata_pfs.X)] = 0
[39]:
# One-way ANOVA across tissues
pr.tl.differential_abundance(
    adata_pfs,
    method="anova_oneway",
    multitest_correction="bonferroni",
    group_by="tissue",
    alpha=0.01,
    space="log",
)
Saved test results in .varm['anova_oneway;tissue;all']
[40]:
anova_results = pr.get.differential_abundance_df(adata_pfs, keys="anova_oneway;tissue;all")
anova_results.rename(columns={"var_id": "proteoform_id"}, inplace=True)
anova_results
[40]:
proteoform_id test_type group_by design fstat pval mean_Brain mean_BAT mean_Heart mean_Liver mean_Quad pval_adj is_diff_abundant
0 O08601_0 anova_oneway tissue all 29.998692 7.135979e-11 15.447875 16.438822 13.642355 15.286296 13.226680 9.276772e-09 True
1 O08601_1 anova_oneway tissue all 309.316738 8.846917e-27 17.173194 16.274997 16.297549 19.401295 16.539176 1.150099e-24 True
2 O54724_0 anova_oneway tissue all 12.603239 1.878084e-06 15.879478 13.639894 8.489357 10.545068 11.828292 2.441509e-04 True
3 O54724_1 anova_oneway tissue all 184.250825 5.444913e-23 16.359031 19.356209 18.207868 15.845380 17.226098 7.078387e-21 True
4 O54749_0 anova_oneway tissue all 5.288221 1.938883e-03 13.162980 12.879554 14.428646 13.568651 13.071253 2.520548e-01 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
125 Q9JKS4_1 anova_oneway tissue all 66.936903 6.594055e-16 17.255804 17.226273 19.196674 17.040296 19.944539 8.572271e-14 True
126 Q9QYR6_0 anova_oneway tissue all 14.856602 3.441306e-07 15.506743 12.560879 13.676294 13.230452 13.061866 4.473698e-05 True
127 Q9QYR6_1 anova_oneway tissue all 646.764070 2.861598e-32 20.504493 16.481401 16.861090 17.344528 16.848407 3.720078e-30 True
128 Q9Z204_0 anova_oneway tissue all 75.275766 1.071660e-16 16.117296 12.403909 12.737527 13.374945 12.716591 1.393159e-14 True
129 Q9Z204_1 anova_oneway tissue all 152.715690 1.224770e-21 15.790335 14.611674 13.502690 16.977432 13.218752 1.592201e-19 True

130 rows × 13 columns

[41]:
# Count proteins with all proteoforms being significantly tissue-specific
protein_id_map = adata_pfs.var[["protein_id", "protein_id_old"]].set_index("protein_id")["protein_id_old"]
anova_results["protein_id"] = anova_results["proteoform_id"].map(protein_id_map)
n_tissue_specific_pfs = anova_results.groupby("protein_id")["is_diff_abundant"].all().sum()

print(f"{n_tissue_specific_pfs} tissue-specific proteoform groups found via ANOVA.")
59 tissue-specific proteoform groups found via ANOVA.

Summary

This notebook reproduced the core proteoform inference workflow from Bludau et al. (2021) using ProteoPy. Key results:

  • Proteoform detection: COPF identified 65 proteins with proteoform groups, comparable to the 63 reported in the original study (Figure 6B).

  • Biological verification: The tissue-specific proteoform assignments for Ldb3 (Q9JKS4) and Sorbs2 (Q3UTJ2) match the published findings (Figures 7A, 7C, 7D, 7F), with peptide-to-sequence mappings consistent with known alternative splice variants.

  • Tissue specificity: ANOVA analysis recovered 59 tissue-specific proteoform groups (90.8% of the inferred proteoform groups) — similar to the 56 tissue-specific proteoform groups (88.9% of the inferred proteoform groups) reported by Bludau et al.

Minor discrepancies between reported numbers are expected due to implementation details in preprocessing (e.g., peptide modification summarization, overlapping peptide handling) and statistical testing.

[42]:
!pip freeze
adjustText==1.3.0
anndata==0.11.4
anyio==4.12.1
argon2-cffi==25.1.0
argon2-cffi-bindings==25.1.0
array-api-compat==1.13.0
arrow==1.4.0
asttokens==3.0.1
async-lru==2.2.0
attrs==25.4.0
babel==2.18.0
beautifulsoup4==4.14.3
biopython==1.86
bleach==6.3.0
certifi==2026.1.4
cffi==2.0.0
charset-normalizer==3.4.4
comm==0.2.3
contourpy==1.3.2
cycler==0.12.1
debugpy==1.8.20
decorator==5.2.1
defusedxml==0.7.1
et_xmlfile==2.0.0
exceptiongroup==1.3.1
executing==2.2.1
fastjsonschema==2.21.2
fonttools==4.61.1
fqdn==1.5.1
h11==0.16.0
h5py==3.15.1
httpcore==1.0.9
httpx==0.28.1
idna==3.11
igraph==1.0.0
ipykernel==7.2.0
ipython==8.38.0
isoduration==20.11.0
jedi==0.19.2
Jinja2==3.1.6
joblib==1.5.3
json5==0.13.0
jsonpointer==3.0.0
jsonschema==4.26.0
jsonschema-specifications==2025.9.1
jupyter-events==0.12.0
jupyter-lsp==2.3.0
jupyter_client==8.8.0
jupyter_core==5.9.1
jupyter_server==2.17.0
jupyter_server_terminals==0.5.4
jupyterlab==4.5.4
jupyterlab_pygments==0.3.0
jupyterlab_server==2.28.0
kiwisolver==1.4.9
lark==1.3.1
legacy-api-wrap==1.5
llvmlite==0.46.0
MarkupSafe==3.0.3
matplotlib==3.10.8
matplotlib-inline==0.2.1
mistune==3.2.0
natsort==8.4.0
nbclient==0.10.4
nbconvert==7.17.0
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.2
notebook_shim==0.2.4
numba==0.64.0
numpy==2.2.6
openpyxl==3.1.5
overrides==7.7.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_packaging_1769093650/work
pandas==2.3.3
pandocfilters==1.5.1
parso==0.8.6
patsy==1.0.2
pexpect==4.9.0
pillow==12.1.1
platformdirs==4.9.2
pooch==1.9.0
prometheus_client==0.24.1
prompt_toolkit==3.0.52
-e git+https://github.com/UKHD-NP/proteopy.git@05b165881b1465933b5ffe86e3acbb30abc2b6c0#egg=proteopy
psutil==7.2.2
ptyprocess==0.7.0
pure_eval==0.2.3
pycparser==3.0
Pygments==2.19.2
pynndescent==0.6.0
pyparsing==3.3.2
python-dateutil==2.9.0.post0
python-igraph==1.0.0
python-json-logger==4.0.0
pytz==2025.2
PyYAML==6.0.3
pyzmq==27.1.0
referencing==0.37.0
requests==2.32.5
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rfc3987-syntax==1.1.0
rpds-py==0.30.0
scanpy==1.11.5
scikit-learn==1.7.2
scipy==1.15.3
seaborn==0.13.2
Send2Trash==2.1.0
session-info2==0.4
six==1.17.0
soupsieve==2.8.3
stack-data==0.6.3
statsmodels==0.14.6
terminado==0.18.1
texttable==1.7.0
threadpoolctl==3.6.0
tinycss2==1.4.0
tomli==2.4.0
tornado==6.5.4
tqdm==4.67.3
traitlets==5.14.3
typing_extensions==4.15.0
tzdata==2025.3
umap-learn==0.5.11
uri-template==1.3.0
urllib3==2.6.3
wcwidth==0.6.0
webcolors==25.10.0
webencodings==0.5.1
websocket-client==1.9.0