Hotspot driven peptide design

AI Summary

Bulleted

Text

Key Insights

The document introduces PepHAR, a novel hot-spot-driven autoregressive generative model for designing peptides that target specific proteins.
PepHAR addresses key challenges in peptide design by focusing on unequal residue contributions, peptide bond geometries, and practical benchmarking scenarios.
The model combines energy-based hot spot sampling, autoregressive fragment extension using dihedral angles, and an optimization process to assemble valid peptides.
Experiments demonstrate PepHAR's effectiveness in peptide generation and scaffold-based design, highlighting its potential for advancing drug discovery.
The model breaks the process into three stages; energy model based residue distrubution around target, extending fragments step-by-step using dihedral angles, then optimization function based fragments are assembled

#Peptide

1/33

Preprint, Under review
Figure 1: Left: Hot spot residues are a small number of critical residues in the peptide-target binding
interface, while the remaining residues act as scaffolds. In peptide design, we first sample the hot
spots and then use scaffold residues to link them. Right: Each residue in the protein consists of
backbone heavy atoms and side-chain groups. Adjacent residues are connected by peptide bonds,
which establish a planar conformation around neighboring atoms. The backbone structure of adjacent residues can be reconstructed using dihedral angles through the operations Left and Right.
Although these methods have achieved initial success in generating peptide binders with native-like
structures and high affinities, several challenges remain. First, not all peptide residues contribute
equally to binding. As shown in Figure 1, some residues establish key functional interactions with
the target, possessing high stability and affinity. These are referred to as hot spot residues, and
are critical in drug discovery (Bogan & Thorn, 1998; Keskin et al., 2005; Moreira et al., 2007).
Other residues, known as scaffolds, help position the hot spots in the binding region and stabilize
the peptide (Matson & Stupp, 2012; Hosseinzadeh et al., 2021). Considering the different roles of
these residues, generating all of them in one step may be inefficient. Second, the generated peptides
must respect the constraints imposed by peptide bonds, which are non-rotatable and enforce fixed
bond lengths and planar structures (Fisher, 2001). As illustrated in Figure 1, adjacent residues
must maintain specific relative positions to form proper peptide bonds. A model that represents
peptide backbone structures independently as local frames (Jumper et al., 2021) may neglect these
geometric constraints. Third, in practical drug discovery, peptides are not always designed from
scratch. Often, initial peptide candidates are optimized, or key hot spot residues are linked via
scaffold residues (Zhang et al., 2009; Yu et al., 2023). Thus, more realistic in-silico benchmarks are
needed to simulate these scenarios.
To tackle these challenges, we propose PepHAR, a hot-spot-driven autoregressive generative model.
By distinguishing between hot spot and scaffold residues, we break down the generation process into
three stages. First, we use an energy-based density model to capture the residue distribution around
the target, and apply use Langevin dynamics to sample statistically favorable and feasible hot spots.
Next, instead of generating all residues simultaneously, we autoregressively extend fragments step
by step, modeling dihedral angles parameterized by a von Mises distribution to maintain peptide
bond geometry. Finally, since the generated fragments may not align perfectly, we apply a hybrid
optimization function to assemble the fragments into a complete peptide. To simulate practical
peptide drug discovery scenarios, we evaluate our method not only in de novo peptide design but
also in scaffold generation, where the model scaffolds known hot spot residues into a functional
peptide, akin to peptide design based on prior knowledge.
In summary, our key contributions are:
• We introduce PepHAR, an autoregressive generative model based on hot spot residues for
peptide binder design;
• We address current challenges in peptide design by using an energy-based model for hot
spot identification, autoregressive fragment extension for maintaining peptide geometry,
and an optimization step for fragment assembly;
• We propose a new experimental setting, scaffold generation, to mimic practical scenarios,
and demonstrate the competitive performance of our method in both peptide binder design
and scaffold generation tasks.
2

2/33

Preprint, Under review
2 RELATED WORK
Generative Models for Protein Design Generative models have shown significant promise in designing functional proteins (Yeh et al., 2023; Dauparas et al., 2023; Zhang et al., 2023c; Wang et al.,
2021; Trippe et al., 2022; Yim et al., 2024). Some approaches focus on generating protein sequences
using protein language models (Madani et al., 2020; Verkuil et al., 2022; Nijkamp et al., 2023) or
through methods like directed evolution (Jain et al., 2022; Ren et al., 2022; Khan et al., 2022; Stanton et al., 2022). Others aim to design sequences based on backbone structures (Ingraham et al.,
2019; Jing et al., 2020; Hsu et al., 2022; Li et al., 2022; Gao et al., 2022; Dauparas et al., 2022). For
protein structures, which are crucial for determining protein function, diffusion-based (Luo et al.,
2022; Watson et al., 2022; Yim et al., 2023b) and flow-based models (Yim et al., 2023a; Bose et al.,
2023; Li et al., 2024a; Cheng et al., 2024) have been successfully applied to both unconditional
(Campbell et al., 2024) and conditional protein design (Yim et al., 2024). However, these generative
models typically treat all residues as equal, generating them simultaneously and overlooking the
distinct roles of residues, such as those involved in catalytic sites (Giessel et al., 2022) or binding
regions (Li et al., 2024a).
Computational Peptide Design The earliest methods for peptide design rely on protein or peptide
templates for design (Bhardwaj et al., 2016; Hosseinzadeh et al., 2021; Swanson et al., 2022). These
approaches use heuristic rules to search for similar sequences or structures in the PDB database as
seeds for peptide design. A more prevalent class of models focuses on optimizing hand-crafted or
statistical energy functions for peptide design (Cao et al., 2022; Bryant & Elofsson, 2023). While
effective, these methods are computationally expensive and tend to get stuck in local minima (Raveh
et al., 2011; Alford et al., 2017). Recently, deep generative models, such as GANs (Xie et al., 2023),
diffusion models (Xie12 et al.; Wang et al., 2024), and flow models (Li et al., 2024a; Lin et al.,
2024), have been applied to design peptide structures and sequences, conditioned on target protein
information, offering more flexibility and efficiency in the design process.
3 PRELIMINARY
Protein Composition A protein or peptide is composed of multiple amino acid residues, each characterized by its type and backbone structure, which includes both position and orientation (Jumper
et al., 2021). For the i-th residue, denoted as Ri = (ci, xi, Oi), its type ci ∈ {1...20} refers to the
class of its side-chain R group. The backbone position xi ∈ R
3
represents the coordinates of the
central Cα atom, while the backbone orientation Oi ∈ SO(3) is defined by the spatial configuration
of the heavy backbone atoms (N-Cα-C). In this way, a protein can be represented as a sequence of
N residues: [R1, . . . , RN ].
Problem Formation The goal of this work is to generate a peptide D = [R1, . . . , RN ], consisting
of N residues, based on a target protein T = [R1, . . . , RM] of length M. We also define fragments,
where the k-th fragment is denoted as F(k,ik,lk) = [Rik
, . . . , Rik+lk−1], a contiguous subset of
residues. Fragments are sequentially connected within the protein, where ik indicates the N-terminal
residue’s index of the fragment in the original peptide, and lk represents the fragment’s length.
Multiple fragments can be assembled into a complete protein based on their residue indices.
Directional Relations The sequential ordering from the N-terminal to the C-terminal residue, along
with the covalent bonds between adjacent residues, is fundamental in our approach. As illustrated in
Fig 1, residues are linked via covalent peptide bonds (CO-NH), with each residue Ri connecting to
its neighboring residues Ri−1 and Ri+1. These peptide bonds are partially double bonds, limiting
their rotational freedom and resulting in a planar configuration for atoms between adjacent residues
(Cα, Cβ, O, and H atoms). The backbone structure of a protein can thus be described using dihedral
angles, which define the spatial relations between these planes in 3D space. Each residue has three
associated dihedrals: ψi, ϕi, and ωi. The first two angles determine the geometric relationship
between adjacent residues, while the third controls the position of the O atom. Given a protein’s
backbone structure, we can calculate the dihedral angles for each residue. Conversely, the backbone
structure of neighboring residues can also be derived from the dihedral angles, which serve as the
building blocks in our model. Specifically, given the backbone position xi and orientation Oi, we
can approximate the backbone structures of neighboring residues Ri−1 and Ri+1 using coordinate
3

3/33

Preprint, Under review
Figure 2: Overview of our three-stage approach: In the first foudning stage, k hot-spot residues
are generated (k = 2 in this example) from learned residue distribution around the target. In the
second stage, new residues are extended to the fragments’ left or right based on dihedral angle
distributions. Finally, in the correction stage, gradients from the objective functions are applied to
refine the complete peptide.
transformations. Details are included in the Appendix B.
(xi−1, Oi−1) =Left(xi, Oi, ψi−1, ϕi), (1)
(xi+1, Oi+1) =Right(xi, Oi, ψi, ϕi+1). (2)
4 METHODS
To tackle the challenges in peptide design, we propose a three-stage approach to generate peptides D
based on their target protein T. Our method involves generating hot-spot residues, extending fragments, and correcting peptide structures. As shown in Figure 2 and Algorithm 1, the first stage, the
Founding Stage, independently generates a small number k of hot-spot residues Ri1, ..., Rikfrom
the learned residue distribution P(R | T). In the second stage, the Extension Stage, these hot-spot
residues are used as starting points to progressively extend k fragments F1, ..., Fk by adding new
residues to the Left or Right in an autoregressive manner, until the total peptide reaches the desired
length L. Finally, since each fragment is extended independently, the third stage, the Correction
Stage, adjusts the sequences and structures of the fragments, refining them based on the gradients
of the objective function to ensure valid geometries and meaningful peptide sequences.
4.1 FOUNDING STAGE
The founding stage first generates k hot-spot residues based on the target protein T. By introducing
the residue distribution P(R | T), hot-spot residues represent those that have higher probabilities
(i.e., lower energies) of appearing near the binding pocket compared to other backbone structures
and residue types. The generation of hot-spot residues focuses on finding regions with high probabilities, where residues are more likely to interact directly with the target. Conversely, regions
with low probabilities (i.e., high energies) will have fewer peptide residues. For example, peptide
residues should neither occur too far from nor too close to the pocket (Cao et al., 2022; Li et al.,
2024a), as such regions may have near-zero probabilities (high energies) of residue occurrence. We
parameterize P(R | T) using an energy-based model, which is a conditional joint distribution of
backbone position x, orientation O, and residue type c:
Pθ(c, x, O | T) = 1
Z
exp (gθ,c(x, O | T)). (3)
here gθ is a scoring function parameterized by an equivariant network, which quantifies the score
of residue type c occuring at a given backbone structure with position x and orientation O In other
words, gθ,c is the unnormalized probability of type c. And Z is the normalizing constant associated
with sequence and structure information, which we do not explicitly estimate.
4

4/33

Preprint, Under review
Algorithm 1: Peptide Sampling Outline
Data: Target protein T, peptide length N, hot-spot residue count k, and indices [i1, ..., ik]
1 Founding Stage
2 for j ← 1 to k do
3 Sample a hot-spot residue Rij ∼ Pθ(c, x, O | T) based on Eq. 6, 7, and 8;
4 Initialize fragment F(j,ij ,lj=1) ←

Rij

;
5 Extension Stage
6 while l1 + ... + lk < N do
7 Randomly choose a fragment index i ∈ 1, ..., k and direction d ∈ {L, R};
8 Set the starting residue as either the N-terminal Rij or the C-terminal Rij+lj−1;
9 Sample a new residue on the left Rij−1 or on the right Rij+ljbased on Eq. 15 and 16;
10 Add the new residue to fragment Fj ;
11 Merge fragments into the peptide D ← F1 + ... + Fk
12 Correction Stage
13 for t ← 1, ... do
14 Calculate the objective J based on the current peptide using Eq.22;
15 Update the peptide using gradients from Eq.23 and 24;
16 return D = [R1, ..., RN ]
Network Implementation The density model gθ is parameterized by the Invariant Point Attention
backbone network (Jumper et al., 2021; Luo et al., 2022; Yim et al., 2023b), which is SE(3) invariant. It takes positive residues (peptide residues) and negative residues (perturbed residues) along
with the target protein as input, encoding them into hidden representations. A shallow Multi-Layer
Perceptron (MLP) is then used to classify residue types for likelihood evaluation.
Training We use the Noise Contrastive Estimation (NCE) to train this parameterized energy-based
model (Gutmann & Hyvarinen, 2010). NCE distinguish between samples from the true data ¨
distribution (positive points) and samples from a noise distribution (negative points). The positive distribution corresponds to the ground truth residue distribution of the peptide over the target (c, x, O) ∼ p(R | T), while the negative samples are drawn from the disturbed distribution
(cneg, x
−, O−) ∼ p(R˜ | T) by adding large spatial noises to the ground truth positions and orientations, labeled as type cneg. As positive and negative data are sampled equally, the NCE objective
for a single positive data point is:
l(c, x, O, | T) = log exp gθ,c(x, O | T)
P
c
′ exp gθ,c′ (x, O | T) + p(cneg, x, O | T)
. (4)
As a common practice (Gutmann & Hyvarinen, 2012), we fix the negative probability ¨ p(cneg, x, O |
T) as a constant, simplifying the evaluation of log likelihoods for negative samples. The final loss
function is given by:
L
NCE = −E+ [l(c, x, O | T)] − E−

l(cneg, x
−, O− | T)

. (5)
Sampling In the founding stage, we sample k hot-spot residues from the learned energy-based
distribution, where k is kept small relative to the peptide length (e.g., k = 1 ∼ 3 in our experiments).
Since hot-spots are assumed to be sparsely distributed along the peptide (Bogan & Thorn, 1998), we
approximately sample them independently. For each hot-spot residue, we employ the Langevin
MCMC Sampling algorithm (Welling & Teh, 2011), starting from an initial guessed position x
0
and
orientation O0, and iteratively update them using the following gradients:
x
t+1 ←xt +
ϵ
2
2
X
c
′
∇xgθ,c′ (x
t
, Ot| T) + ϵz
t
x
, z
t
x ∼ N (0,I3), (6)
Ot+1 ←expOt (
ϵ
2
2
X
c
′
∇Ogθ,c′ (x
t
, Ot| T) + ϵZ
t
O), Z
t
O ∼ T N Ot (0,I3), (7)
c
t+1 ∼softmax gθ(xt
, Ot| T). (8)
Since orientation lies in the SO(3) space, we employ the exponential map and a Riemannian random
walk on the tangent space for updates (De Bortoli et al., 2022). The summation over all possible
5

5/33

Preprint, Under review
residue types ensures that we transition from regions of low occurrence probability to regions of
higher probability. Finally, after each iteration, the residue type c is sampled conditioned on the
updated position and orientation.
4.2 EXTENSION STAGE
The extension stage expands fragments into longer sequences, starting from the sampled hot-spot
residues. At each extension step, we adds a new residue to either the left or right of fragment F.
Based on the relationship between adjacent residues (Eq.1,2), the backbone structure of the new
residue is inferred from its dihedral angles and the structure of the adjacent residue, which is either
the N-terminal or C-terminal residue of the fragment. Specifically, when connecting a new residue
to residue Rij
in the j th fragment Fj , we model the dihedral angle distribution P(ψ, ϕ | d, Rij
, E),
where d ∈ {L, R} indicates the extension direction, and E represents the surrounding residues,
including target T and other residues in the currently generated fragments.
P(ψ, ϕ | d, Rij
, E) =
P(ψij−1, ϕij
), d = L,
P(ψij, ϕij+1), d = R.
(9)
Since multiple angles are involved, the dihedral angle distribution is modeled as a product of parameterized von Mises distributions (Lennox et al., 2009), which use cosine distance instead of L2
distance to measure the difference between angles, behaving like circular normal distributions. For
example, when d = L, we have:
P(ψij−1, ϕij) =fVM(ψ; µψij−1, κψij−1)fVM(ϕij; µϕij, κϕij), (10)
fVM(ψij−1; µψij−1, κψij−1) = 1
2πI0(κψij−1)
exp κψij−1· cos(µψij−1 − ψij−1)

, (11)
fVM(ϕij; µϕij, κϕij) = 1
2πI0(κϕij)
exp κϕij· cos(µϕij− ϕij)

. (12)
Here, I0(·) denotes the modified Bessel function of the first kind of order 0. The four distribution
parameters are predicted by a neural network hθ, referred to as the prediction network. Similarly,
for d = R, the network predicts another set of four parameters:
hθ(d, Rij, E) =
(µψij−1, κψij−1, µϕij, κϕij), d = L,
(µψij, κψij, µϕij+1 , κϕij+1 ), d = R.
(13)
Network Implementation The prediction network hθ uses the same IPA backbone to extract features. However, to avoid data leakage during training, since the neighboring backbone structures are
known and dihedral angles can be derived analytically, we employ directional masks in the Attention
module. For instance, if the direction is Left, residues can only attend to their neighbors on the right
during attention updates, and vice versa for Right.
Training We optimize the network parameters using Maximum Likelihood Estimation (MLE) over
directions d ∼ {L, R} and peptides in the peptide-target complex dataset. The MLE objective is
given by:
L
MLE = −E

log P(ψ, ϕ | d, Rij, E)

. (14)
Sampling During the extension stage, we generate k fragments corresponding to k hot-spot residues
from the founding stage. The extension process is iterative, where fragments are autoregressively
extended until the total peptide length (the sum of fragment lengths) reaches a predefined value
(e.g., the length of the native peptide). Consider a one-step extension of fragment F in direction
d. The starting residue Rij depends on the direction: d = L implies adding a residue to the left of
the fragment, making Rijthe N-terminal residue (first residue); d = R implies adding to the right,
making Rij
the C-terminal residue (last residue). The other residues in the fragment and target form
the environment E. We then sample the dihedral angles for the new residue in the chosen direction
from the predicted distribution, using hθ. For example, when d = L:
ψij−1 ∼fVM(ψij−1; hθ(d = L, Rij
, E)), (15)
ϕij ∼fVM(ϕij; hθ(d = R, Rij, E)). (16)
6

6/33

Preprint, Under review
Next, the backbone structure of the newly added residue Rij−1 is reconstructed using the transformations in Eq 1. The residue type is then estimated by the density model gθ used during the founding
stage:
(xij−1, Oij−1) =Left(xij, Oij, ψij−1, ϕij), (17)
cij−1 ∼softmax gθ(xij−1, Oij−1 | E). (18)
Finally, the process is repeated for another randomly selected fragment and direction.
4.3 CORRECTION STAGE
Although we autoregressively extended each fragment, the resulting fragments may not form a valid
peptide with accurate geometry. For example, some fragments may not maintain proper distances
between each other, leading to broken peptide bonds, while others may have incorrect dihedrals
or residue types in relation to the whole peptide and the target protein. Some fragments may also
exhibit atom clashes within the target protein. Inspired by traditional methods using hand-crafted
energy functions (Alford et al., 2017), we introduce a correction stage as a post-processing step
to refine the generated peptides. Rather than relying on empirical functions, we use the learned,
network-parameterized distributions from the first two stages to regularize the peptides.
For a generated peptide D = [(c1, x1, O1), ...,(cN , xN , ON )], the dihedrals of each residue are
derived based on the backbone structures of adjacent residues. To ensure self-consistency between
dihedrals and backbone structures, we use the dihedrals to estimate new backbone structures and
compare them with the original ones. The distance between these backbone structures reflects the
validity of the generated peptide with respect to peptide bond properties and planarity. We define
the distance between two residues’ backbone structures separately for position and orientation, and
derive the backbone objective considering both directions:
d((xi, Oi),(xj , Oj )) = ∥xi − xj∥
2+∥log(Oi) − log(Oj )∥2
, (19)
Jbb = −
X
N
i=2
d(Left(xi, Oi, ψi−1, ϕi) − (xi−1, Oi−1))−
N
X−1
i=1
d(Right(xi, Oi, ψi, ϕi+1) − (xi+1, Oi+1)).
(20)
Additionally, the dihedral angles must conform to the learned distribution P(ψ, ϕ) to ensure correct
geometric relationships between neighboring residues. This leads to the dihedral objective, which
is similar to Eq. 14. However, in Eq. 14, we optimize the network parameters to fit the angle
distribution, whereas here, we keep the learned networks fixed and update the dihedrals instead:
Jang = −
X
N
i=2
logP(ψi−1, ϕi) −
N
X−1
i=1
logP(ψi, ϕi+1). (21)
The final optimization objective is a weighted sum of the backbone and dihedral objectives. We
iteratively update the peptide’s backbone structures by taking gradients, similar to the founding
stage, but here we optimize the entire peptide at each timestep. The residue types are predicted by
the density model gθ at the end of each update step. Unlike the founding stage, where we started
from random structures, the correction stage begins with the complete peptide.
Jcorr =λbbJbb + λangJang, (22)
(x
t+1
i
, O
t+1
i
) ←update(x
t
i
, Ot
i
, ∇xiJ , ∇OiJ ,), (23)
c
t+1 ∼softmax(xt
, Ot| E). (24)
5 EXPERIMENTS
Overview We evaluate PepHAR and several baseline methods on two main tasks: (1) Peptide Binder
Design and (2) Peptide Scaffold Generation. In Peptide Design, we co-generate both the structure
and sequence of peptides based on their binding pockets within the target protein. However, in realworld drug discovery, prior knowledge—such as key interaction residues at the binding interface or
7

7/33

Preprint, Under review
initial peptides for optimization—is often available. Therefore, we introduce Peptide Scaffold Generation to assess how well models can scaffold and link these key residues into complete peptides,
reflecting more practical applications. Details are included in the Appendix E.
Dataset Following Li et al. (2024a), we construct our training and test datasets. This moderatelength benchmark is derived from PepBDB (Wen et al., 2019) and Q-BioLip (Wei et al., 2024), with
duplicates and low-quality entries removed. The binding pocket is defined as the residues in the
target protein which has heavy atoms lying in the 10Aradius of any heavy atom in the peptide. The ˚
dataset consists of 158 complexes across 10 clusters from mmseqs2 (Steinegger & Soding, 2017), ¨
with an additional 8, 207 non-homologous examples used for training and validation.
5.1 PEPTIDE BINDER DESIGN
In Peptide Binder Design, we co-generate peptide sequences and structures conditioned on the binding pockets of their target proteins. The models are provided with both the sequence and structure
of the target protein pockets and tasked with generating bound-state peptides.
Table 1: Evaluation of methods in the peptide design task. K = 1, 2, 3 is the number of hot spots.
Valid % ↑ RMSD A˚ ↓ SSR % ↑ BSR % ↑ Stability % ↑ Affinity % ↑ Novelty % ↑ Diversity % ↑ Success % ↑
RFDiffusion 66.04 4.17 63.86 26.71 26.82 16.53 53.74 25.39 25.38
ProteinGenerator 65.88 4.35 29.15 24.62 23.84 13.47 52.39 22.57 24.43
PepFlow 40.27 2.07 83.46 86.89 18.15 21.37 50.26 20.23 27.96
PepGLAD 55.20 3.83 80.24 19.34 20.39 10.47 75.07 32.10 14.05
PepHAR (K = 1) 57.99 3.73 79.93 84.17 15.69 18.56 81.21 32.69 23.00
PepHAR (K = 2) 55.67 3.19 80.12 84.57 15.91 19.82 79.07 31.57 22.85
PepHAR (K = 3) 59.31 2.68 84.91 86.74 16.62 20.53 79.11 29.58 25.54
Figure 3: RMSD of generated peptides, considering different tasks and
numbers of hotspots. More hotspot
residues lead to better results.
Figure 4: Two examples of generated peptides,
along with RMSD and binding energy. PepHAR
can generate native-like peptides with better binding affinities.
Metrics A robust generative model should produce diverse, valid peptides with favorable stability
and affinity. The following metrics are employed: (1) Valid measures whether the distance between adjacent residues is consistent with peptide bond formation, considering Cα atoms within
3.8A˚ as valid (Chelvanayagam et al., 1998; Zhang et al., 2012). (2) RMSD (Root-Mean-Square
Deviation) compares the generated peptide structures to the native ones based on Cα distances after alignment. (3) SSR (Secondary Structure Ratio) evaluates the proportion of shared secondary
structures between the generated and native peptides labeled by DSSP (Kabsch & Sander, 1983).
(4) BSR (Binding Site Rate) assesses the similarity of peptide-target interactions by measuring the
overlap of binding sites. (5) Stability calculates the percentage of generated complexes that are
more stable (lower total energy) than their native counterparts, based on rosetta energy functions
(Chaudhury et al., 2010; Alford et al., 2017). (6) Affinity measures the percentage of peptides with
higher binding affinities (lower binding energies) than the native peptide. Beyond geometric and energetic factors, the model should also exhibit strong generalizability in discovering novel peptides.
(7) Novelty is the ratio of novel peptides, defined by two criteria: (a) TM-score ≤ 0.5 (Zhang &
Skolnick, 2005) and (b) sequence identity ≤ 0.5. (8) Diversity quantifies structural and sequence
8

8/33

Preprint, Under review
variability, calculated as the product of pairwise (1-TM-score) and (1-sequence identity) across all
generated peptides for a given target. (9) Success rate evaluates the proportion of AF2-predicted
complex structures with a whole ipTM value higher than 0.6.
Baselines We compare PepHAR against three state-of-the-art peptide design models. RFDiffusion
(Watson et al., 2022) uses pre-trained weights from RoseTTAFold (Baek et al., 2021) and generates
protein backbone structures through a denoising diffusion process. Peptide sequences are then recovered using ProteinMPNN (Dauparas et al., 2022). ProteinGenerator augument RFDiffusion with
sequence-structure jointly generation (Lisanza et al., 2023). PepFlow (Li et al., 2024a) models fullatom peptide and samples peptides using a flow-matching framework on a Riemannian manifold.
PepGLAD (Kong et al., 2024) employs equivariant latent diffusion networks to generate full-atom
peptide structures.
Results As shown in Table 1, PepHAR effectively generates peptides that exhibit valid geometries, native-like structures, and improved energies. While RFDiffusion produces valid peptides
due to its pretrained protein folding weights, PepFlow, which is trained solely on peptide datasets,
struggles with generating valid peptides, making it challenging for practical applications. In contrast, PepHAR’s autoregressive generation based on dihedral angles not only ensures the production of valid peptides but also allows for precise placement at the binding site with accurate secondary structures. Similar to previous work (Li et al., 2024a), RFDiffusion excels at generating
stable peptide-target structures, while PepHAR demonstrates competitive performance compared to
PepFlow. Additionally, PepHAR shows impressive results in terms of novelty and diversity, highlighting its potential for exploring peptide distributions and designing a wide range of peptides for
real-world applications. Figure 3 illustrates two examples of peptides generated by PepHAR, which
closely resemble the structures and binding sites of native peptides while exhibiting low binding
energies, indicating high affinities for the target.
5.2 PEPTIDE SCAFFOLD GENERATION
Compared to designing peptides from scratch, a more practical approach involves leveraging prior
knowledge, such as key interaction residues. We introduce this as the task of scaffold generation,
where certain hot spot residues in the peptide are fixed, and the model must generate a complete
peptide by connecting these residues. In this context, the generated peptide should incorporate
the hot spot residues in the correct positions, effectively scaffolding them. Hot spot residues are
selected based on their higher potential for interacting with the target protein. To identify these,
we first calculate the energy of each residue using an energy function (Alford et al., 2017), then
manually select residues that are both energetically favorable and sparsely distributed along the
peptide sequence. These selected residues are fixed as the condition for scaffold generation.
Table 2: Evaluation of methods in the scaffold generation task. K = 1, 2, 3 is the number of hot
spots.
Valid % ↑ RMSD A˚ ↓ SSR % ↑ BSR % ↑ Stability % ↑ Affinity % ↑ Novelty % ↑ Diversity % ↑ Success % ↑
RFDiffusion (K = 3) 69.88 4.09 63.66 26.83 20.07 21.26 55.03 26.67 23.15
ProteinGenerator (K = 3) 68.52 3.95 65.86 24.17 20.40 22.80 50.73 20.82 20.42
PepFlow (K = 3) 42.68 2.45 81.00 82.76 11.17 18.27 50.93 16.97 24.54
PepGLAD (K = 3) 53.51 3.84 76.26 19.61 12.22 18.27 50.93 30.99 14.85
PepHAR (K = 1) 56.01 3.72 80.61 78.18 17.89 19.94 80.61 29.79 20.43
PepHAR (K = 2) 55.36 2.85 82.79 85.80 19.18 19.17 74.76 25.32 22.09
PepHAR (K = 3) 55.41 2.15 83.02 88.02 20.50 20.65 72.56 19.68 21.45
Baselines and Metrics We use the same baselines and metrics as in the Peptide Design task. Specifically, for RFDiffusion and Protein Generator, the known hot spot residues are provided as an additional condition, along with the target. For PepFlow, we modify the ODE sampling process by
initializing it with the ground truth hot spot residues and ensure the model to only modify the remaining residues. In our method, we replace the sampled hot spot residues with the known ground
truth residue.
Results As shown in Table 2, PepHAR demonstrates excellent performance in scaffolding hot spot
residues into complete peptides. Given that the hot spot residues have functional binding capabilities, while the scaffold residues contribute primarily to structural integrity, the generated peptides are
expected to possess valid, native structures with high stability. PepHAR successfully generates valid
and native-like structures, ensuring that scaffold residues do not disrupt interactions between hot
spot residues, achieving the best scores in SSR and BSR. Moreover, PepHAR achieves competitive
stability results compared to RFDiffusion, which is trained on a larger PDB dataset. Additionally,
9

9/33

Preprint, Under review
Figure 5: Examples of generated scaffolded peptides by PepHAR. PepHAR can scaffold hotspot
residues, leading to more stable complexes with native-like valid geometries
PepHAR produces novel and diverse scaffolds. Figure 5 presents two examples of scaffolded peptides alongside native peptides and given residues. The generated scaffolds exhibit similar structures
to the native ones, while displaying variations in geometry and orientation at the midpoints and ends
of the peptides, indicating flexibility in the scaffolding regions. Furthermore, the generated scaffolds
often have lower total energy than the native peptides, suggesting enhanced stability of the complex
and improved interaction potential.
5.3 ANALYSIS
Table 3: Ablation results of PepHAR in peptide design task.
Valid % ↑ RMSD A˚ ↓ SSR % ↑ BSR % ↑ Stability % ↑ Affinity % ↑ Novelty % ↑ Diversity % ↑
PepHAR (K = 3) 59.31 2.68 84.91 86.74 16.62 20.53 79.11 29.58
PepHAR w/o Von Mosies 56.21 3.10 80.86 82.21 17.24 15.68 79.44 29.65
PepHAR w/o Hot Spot 55.67 3.99 79.93 74.17 11.23 12.21 81.51 37.03
PepHAR w/o Correction 53.66 3.41 80.46 81.43 15.72 14.85 82.75 37.87
Effect of Hot Spots Comparing Tables 1 and 2, we observe that introducing hot spots as prior
knowledge significantly boosts PepHAR’s performance, while providing little benefit to RFDiffusion and PepFlow. This highlights PepHAR’s versatility across different design tasks. We also
investigate the effect of varying the number of hot spots, denoted as K = 1, 2, 3. As shown in
Tables and Figure 3, increasing the number of hot spots improves geometries and energies, regardless of whether the hotspots are estimated by density models or provided as ground truth; however, it
negatively impacts novelty and diversity. This illustrates a trade-off between designing low-diversity
but high-quality peptides (in comparison to the native) and high-diversity but varied peptides (Luo
et al., 2022; Li et al., 2024a).
Ablation Study Table 3 presents our ablation study, which assesses the effectiveness of different
components in PepHAR. ”PepHAR w/o Hot Spot” refers to the model where hot spots sampled
from the density model are replaced with randomly positioned and typed residues. ”PepHAR w/o
Von Mises” indicates the use of direct angle predictions instead of modeling angle distributions. We
also remove the correction stage in ”PepHAR w/o Correction.” Our findings reveal that generated
hot spots are crucial for Valid, RMSD, SSR, and BSR metrics, underscoring their importance for
achieving valid geometries and interactions. Modeling angle distributions also contributes positively
by accounting for the flexibility of dihedral angles. Lastly, the final correction stage plays a vital
role in enhancing fragment assembly, leading to peptides with higher affinity and stability, which
are essential for effective protein binding.
10

10/33

Preprint, Under review
6 CONCLUSION
In this work, we presented PepHAR, a hot-spot based autoregressive generative model designed
for efficient and precise peptide design targeting specific proteins. By addressing key challenges in
peptide design—such as the unequal contribution of residues, the geometric constraints imposed by
peptide bonds, and the need for practical benchmarking scenarios—PepHAR provides a comprehensive approach for generating peptides from scratch or assembling peptides around key hot spot
residues. Our method leverages energy-based hot spot sampling, autoregressive fragment extension
through dihedral angles, and an optimization process to ensure valid peptide assembly. Through
extensive experiments on both peptide generation and scaffold-based design, we demonstrated the
effectiveness of PepHAR in computational peptide design, highlighting its potential for advancing
drug discovery and therapeutic development.
REFERENCES
Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
Rebecca F Alford, Andrew Leaver-Fay, Jeliazko R Jeliazkov, Matthew J O’Meara, Frank P DiMaio,
Hahnbeom Park, Maxim V Shapovalov, P Douglas Renfrew, Vikram K Mulligan, Kalli Kappel,
et al. The rosetta all-atom energy function for macromolecular modeling and design. Journal of
chemical theory and computation, 13(6):3031–3048, 2017.
Namrata Anand and Tudor Achim. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv preprint arXiv:2205.15019, 2022.
J Atchison and Sheng M Shen. Logistic-normal distributions: Some properties and uses. Biometrika,
67(2):261–272, 1980.
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured
denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing
Systems, 34:17981–17993, 2021.
Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie
Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of
protein structures and interactions using a three-track neural network. Science, 373(6557):871–
876, 2021.
Heli Ben-Hamu, Samuel Cohen, Joey Bose, Brandon Amos, Maximillian Nickel, Aditya Grover,
Ricky TQ Chen, and Yaron Lipman. Matching normalizing flows and probability paths on manifolds. In International Conference on Machine Learning, pp. 1749–1763. PMLR, 2022.
Nathaniel R Bennett, Brian Coventry, Inna Goreshnik, Buwei Huang, Aza Allen, Dionne Vafeados,
Ying Po Peng, Justas Dauparas, Minkyung Baek, Lance Stewart, et al. Improving de novo protein
binder design with deep learning. Nature Communications, 14(1):2625, 2023.
Gaurav Bhardwaj, Vikram Khipple Mulligan, Christopher D Bahl, Jason M Gilmore, Peta J Harvey,
Olivier Cheneval, Garry W Buchko, Surya VSRK Pulavarti, Quentin Kaas, Alexander Eletsky,
et al. Accurate de novo design of hyperstable constrained peptides. Nature, 538(7625):329–335,
2016.
Suhaas Bhat, Kalyan Palepu, Vivian Yudistyra, Lauren Hong, Venkata Srikar Kavirayuni, Tianlai
Chen, Lin Zhao, Tian Wang, Sophia Vincoff, and Pranam Chatterjee. De novo generation and
prioritization of target-binding peptide motifs from sequence alone. bioRxiv, pp. 2023–06, 2023.
Jose Luis Blanco-Claraco. A tutorial on se(3) transformation parameterizations and on-manifold ´
optimization. arXiv preprint arXiv:2103.15980, 2021.
Miklos Bodanszky. Peptide chemistry. A Practical Textbook,, 1988.
Eric T Boder and K Dane Wittrup. Yeast surface display for screening combinatorial polypeptide
libraries. Nature biotechnology, 15(6):553–557, 1997.
11

11/33

Preprint, Under review
Andrew A Bogan and Kurt S Thorn. Anatomy of hot spots in protein interfaces. Journal of molecular
biology, 280(1):1–9, 1998.
Avishek Joey Bose, Tara Akhound-Sadegh, Kilian Fatras, Guillaume Huguet, Jarrid Rector-Brooks,
Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, and Alexander Tong. Se (3)-stochastic flow matching for protein backbone generation. arXiv preprint
arXiv:2310.02391, 2023.
Patrick Bryant and Arne Elofsson. Evobind: in silico directed evolution of peptide binders with
alphafold. bioRxiv, pp. 2022–07, 2022.
Patrick Bryant and Arne Elofsson. Peptide binder design with inverse folding and protein structure
prediction. Communications Chemistry, 6(1):229, 2023.
Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative
flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design.
arXiv preprint arXiv:2402.04997, 2024.
Longxing Cao, Brian Coventry, Inna Goreshnik, Buwei Huang, William Sheffler, Joon Sung Park,
Kevin M Jude, Iva Markovic, Rameshwar U Kadam, Koen HG Verschueren, et al. Design of ´
protein-binding proteins from the target structure alone. Nature, 605(7910):551–560, 2022.
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative
image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 11315–11325, 2022.
Sidhartha Chaudhury, Sergey Lyskov, and Jeffrey J Gray. Pyrosetta: a script-based interface for implementing molecular modeling algorithms using rosetta. Bioinformatics, 26(5):689–691, 2010.
Gareth Chelvanayagam, Lukas Knecht, Thomas Jenny, Steven A Benner, and Gaston H Gonnet.
A combinatorial distance-constraint approach to predicting protein tertiary models from known
secondary structure. Folding and Design, 3(3):149–160, 1998.
Ricky TQ Chen and Yaron Lipman. Riemannian flow matching on general geometries. arXiv
preprint arXiv:2302.03660, 2023.
Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary
differential equations. Advances in neural information processing systems, 31, 2018.
Tianlai Chen, Sarah Pertsemlidis, Venkata Srikar Kavirayuni, Pranay Vure, Rishab Pulugurta, Ashley Hsu, Sophia Vincoff, Vivian Yudistyra, Lauren Hong, Tian Wang, et al. Pepmlm: Target
sequence-conditioned generation of peptide binders via masked language modeling. ArXiv, 2023.
Chaoran Cheng, Jiahan Li, Jian Peng, and Ge Liu. Categorical flow matching on statistical manifolds. arXiv preprint arXiv:2405.16441, 2024.
Alexander E Chu, Lucy Cheng, Gina El Nesr, Minkai Xu, and Po-Ssu Huang. An all-atom protein
generative model. bioRxiv, pp. 2023–05, 2023.
David J Craik, David P Fairlie, Spiros Liras, and David Price. The future of peptide-based drugs.
Chemical biology & drug design, 81(1):136–147, 2013.
Onur Dagliyan, Elizabeth A Proctor, Kevin M D’Auria, Feng Ding, and Nikolay V Dokholyan.
Structural and dynamic determinants of protein-peptide recognition. Structure, 19(12):1837–
1845, 2011.
Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles,
Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–
based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.
Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, and David Baker. Atomic context-conditioned protein sequence design using ligandmpnn.
Biorxiv, pp. 2023–12, 2023.
12

12/33

Preprint, Under review
Oscar Davis, Samuel Kessler, Mircea Petrache, Avishek Joey Bose, et al. Fisher flow matching for
generative modeling over discrete data. arXiv preprint arXiv:2405.14664, 2024.
Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and
Arnaud Doucet. Riemannian score-based generative modelling. Advances in Neural Information
Processing Systems, 35:2406–2422, 2022.
Guy D Duffaud, Susan K Lehnhardt, Paul E March, and Masayori Inouye. Structure and function
of the signal peptide. In Current topics in membranes and transport, volume 24, pp. 65–104.
Elsevier, 1985.
Richard Evans, Michael O’Neill, Alexander Pritzel, Natasha Antropova, Andrew Senior, Tim Green,
Augustin Zˇ´ıdek, Russ Bates, Sam Blackwell, Jason Yim, et al. Protein complex prediction with
alphafold-multimer. biorxiv, pp. 2021–10, 2021.
Noelia Ferruz, Steffen Schmidt, and Birte Hocker. Protgpt2 is a deep unsupervised language model ¨
for protein design. Nature communications, 13(1):4348, 2022.
Matthew Fisher. Lehninger principles of biochemistry, ; by david l. nelson and michael m. cox. The
Chemical Educator, 6:69–70, 2001.
Griffin Floto, Thorsteinn Jonsson, Mihai Nica, Scott Sanner, and Eric Zhengyu Zhu. Diffusion on
the probability simplex. In ICML 2023 Workshop: Sampling and Optimization in Discrete Space,
2023.
Keld Fosgerau and Torsten Hoffmann. Peptide therapeutics: current status and future directions.
Drug discovery today, 20(1):122–128, 2015.
Pablo Gainza, Freyr Sverrisson, Frederico Monti, Emanuele Rodola, Davide Boscaini, Michael M
Bronstein, and Bruno E Correia. Deciphering interaction fingerprints from protein molecular
surfaces using geometric deep learning. Nature Methods, 17(2):184–192, 2020.
Pablo Gainza, Sarah Wehrle, Alexandra Van Hall-Beauvais, Anthony Marchand, Andreas Scheck,
Zander Harteveld, Stephen Buckley, Dongchun Ni, Shuguang Tan, Freyr Sverrisson, et al. De
novo design of protein interactions with learned surface fingerprints. Nature, 617(7959):176–
184, 2023.
Zhangyang Gao, Cheng Tan, Pablo Chacon, and Stan Z Li. Pifold: Toward effective and efficient ´
protein inverse folding. arXiv preprint arXiv:2209.12643, 2022.
Andrew Giessel, Athanasios Dousis, Kanchana Ravichandran, Kevin Smith, Sreyoshi Sur, Iain McFadyen, Wei Zheng, and Stuart Licht. Therapeutic enzyme engineering using a generative neural
network. Scientific Reports, 12(1):1536, 2022.
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz, and Gabriel Syn- `
naeve. Better & faster large language models via multi-token prediction. arXiv preprint
arXiv:2404.19737, 2024.
Christoph Grathwohl and Kurt Wuthrich. The x-pro peptide bond as an nmr probe for conforma- ¨
tional studies of flexible linear peptides. Biopolymers: Original Research on Biomolecules, 15
(10):2025–2041, 1976.
Jiaqi Guan, Xiangxin Zhou, Yuwei Yang, Yu Bao, Jian Peng, Jianzhu Ma, Qiang Liu, Liang Wang,
and Quanquan Gu. Decompdiff: diffusion models with decomposed priors for structure-based
drug design. arXiv preprint arXiv:2403.07902, 2024.
Ruihan Guo, Rui Wang, Ruidong Wu, Zhizhou Ren, Jiahan Li, Shitong Luo, Zuofan Wu, Jian Peng,
Jianzhu Ma, et al. Enhancing protein mutation effect prediction through a retrieval-augmented
framework. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
Michael Gutmann and Aapo Hyvarinen. Noise-contrastive estimation: A new estimation principle ¨
for unnormalized statistical models. In Proceedings of the thirteenth international conference on
artificial intelligence and statistics, pp. 297–304. JMLR Workshop and Conference Proceedings,
2010.
13

13/33

Preprint, Under review
Michael U Gutmann and Aapo Hyvarinen. Noise-contrastive estimation of unnormalized statistical ¨
models, with applications to natural image statistics. Journal of machine learning research, 13
(2), 2012.
Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. SSD-LM: Semi-autoregressive simplex-based
diffusion language model for text generation and modular control. In Anna Rogers, Jordan BoydGraber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp. 11575–11596, Toronto, Canada, July
2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.647. URL
https://aclanthology.org/2023.acl-long.647.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in
neural information processing systems, 33:6840–6851, 2020.
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P
Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition
video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forre, and Max Welling. Argmax flows ´
and multinomial diffusion: Learning categorical distributions. Advances in Neural Information
Processing Systems, 34:12454–12465, 2021.
Emiel Hoogeboom, Vıctor Garcia Satorras, Clement Vignac, and Max Welling. Equivariant diffu- ´
sion for molecule generation in 3d. In International conference on machine learning, pp. 8867–
8887. PMLR, 2022.
Parisa Hosseinzadeh, Paris R Watson, Timothy W Craven, Xinting Li, Stephen Rettie, Fatima Pardo- ´
Avila, Asim K Bera, Vikram Khipple Mulligan, Peilong Lu, Alexander S Ford, et al. Anchor
extension: a structure-guided approach to design cyclic peptides targeting enzyme active sites.
Nature Communications, 12(1):3384, 2021.
Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, and Alexander Rives. Learning inverse folding from millions of predicted structures. In International Conference on Machine Learning, pp. 8946–8970. PMLR, 2022.
Chin-Wei Huang, Milad Aghajohari, Joey Bose, Prakash Panangaden, and Aaron C Courville. Riemannian diffusion models. Advances in Neural Information Processing Systems, 35:2750–2761,
2022.
Po-Ssu Huang, Scott E Boyken, and David Baker. The coming of age of de novo protein design.
Nature, 537(7620):320–327, 2016.
John Ingraham, Vikas Garg, Regina Barzilay, and Tommi Jaakkola. Generative models for graphbased protein design. Advances in neural information processing systems, 32, 2019.
John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang, Ahmed Ismail, Vincent
Frappier, Dana M Lord, Christopher Ng-Thow-Hing, Erik R Van Vlack, et al. Illuminating protein
space with a programmable generative model. Nature, pp. 1–9, 2023.
Matthew P Jacobson, Richard A Friesner, Zhexin Xiang, and Barry Honig. On the role of the crystal
environment in determining protein side-chain conformations. Journal of molecular biology, 320
(3):597–608, 2002.
Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure FP
Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, et al.
Biological sequence design with gflownets. In International Conference on Machine Learning,
pp. 9786–9801. PMLR, 2022.
Wengong Jin, Jeremy Wohlwend, Regina Barzilay, and Tommi Jaakkola. Iterative refinement graph
neural network for antibody sequence-structure co-design. arXiv preprint arXiv:2110.04624,
2021.
Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael JL Townshend, and Ron Dror. Learning
from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411, 2020.
14

14/33

$Preprint, Under review Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, and Tommi Jaakkola. Torsional diffusion for molecular conformer generation. Advances in Neural Information Processing Systems, 35:24240–24253, 2022. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zˇ´ıdek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021. Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 32 (5):922–923, 1976. Wolfgang Kabsch and Christian Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules, 22(12):2577–2637, 1983. Allan A Kaspar and Janice M Reichert. Future directions for peptide therapeutics development. Drug discovery today, 18(17-18):807–817, 2013. Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of naacL-HLT, volume 1, pp. 2. Minneapolis, Minnesota, 2019. Ozlem Keskin, Buyong Ma, and Ruth Nussinov. Hot regions in protein–protein interactions: the organization and contribution of structurally conserved hot spot residues. Journal of molecular biology, 345(5):1281–1294, 2005. Asif Khan, Alexander I Cowen-Rivers, Antoine Grosnit, Derrick-Goh-Xin Deik, Philippe A Robert, Victor Greiff, Eva Smorodina, Puneet Rawat, Kamil Dreczkowski, Rahmad Akbar, et al. Antbo: Towards real-world automated antibody design with combinatorial bayesian optimisation. arXiv preprint arXiv:2201.12570, 2022. Xiangzhe Kong, Wenbing Huang, and Yang Liu. Conditional antibody design as 3d equivariant graph translation. arXiv preprint arXiv:2208.06073, 2022. Xiangzhe Kong, Wenbing Huang, and Yang Liu. End-to-end full-atom antibody design. arXiv preprint arXiv:2302.00203, 2023. Xiangzhe Kong, Yinjun Jia, Wenbing Huang, and Yang Liu. Full-atom peptide design with geometric latent diffusion. arXiv preprint arXiv:2402.13555, 2024. Shohei Konno, Takao Namiki, and Koichiro Ishimori. Quantitative description and classification of protein structures by a novel robust amino acid network: interaction selective network (isn). Scientific Reports, 9(1):16654, 2019. Rohith Krishna, Jue Wang, Woody Ahern, Pascal Sturmfels, Preetham Venkatesh, Indrek Kalvet, Gyu Rie Lee, Felix S Morey-Burrows, Ivan Anishchenko, Ian R Humphreys, et al. Generalized biomolecular modeling and design with rosettafold all-atom. bioRxiv, pp. 2023–10, 2023. Georgii G Krivov, Maxim V Shapovalov, and Roland L Dunbrack Jr. Improved prediction of protein side-chain conformations with scwrl4. Proteins: Structure, Function, and Bioinformatics, 77(4): 778–795, 2009. Kit S Lam. Mini-review. application of combinatorial library methods in cancer research and drug discovery. Anti-cancer drug design, 12(3):145–167, 1997. Jolene L Lau and Michael K Dunn. Therapeutic peptides: Historical perspectives, current development trends, and future directions. Bioorganic & medicinal chemistry, 26(10):2700–2707, 2018. Ernest Y Lee, Gerard CL Wong, and Andrew L Ferguson. Machine learning-enabled discovery and design of membrane-active peptides. Bioorganic & medicinal chemistry, 26(10):2708–2718, 2018. John M Lee. Introduction to Riemannian manifolds, volume 2. Springer, 2018. 15$

15/33

Preprint, Under review
Julia Koehler Leman, Brian D Weitzner, Steven M Lewis, Jared Adolf-Bryfogle, Nawsad Alam, Rebecca F Alford, Melanie Aprahamian, David Baker, Kyle A Barlow, Patrick Barth, et al. Macromolecular modeling and design in rosetta: recent methods and frameworks. Nature methods, 17
(7):665–680, 2020.
Kristin P. Lennox, David B. Dahl, Marina Vannucci, and Jerry W. Tsai. Density Estimation for
Protein Conformation Angles Using a Bivariate von Mises Distribution and Bayesian Nonparametrics. Journal of the American Statistical Association, 104(486):586–596, June 2009. ISSN
0162-1459. doi: 10.1198/jasa.2009.0024.
Jiahan Li, Shitong Luo, Congyue Deng, Chaoran Cheng, Jiaqi Guan, Leonidas Guibas, Jianzhu
Ma, and Jian Peng. Orientation-aware graph neural networks for protein structure representation
learning. 2022.
Jiahan Li, Chaoran Cheng, Zuofan Wu, Ruihan Guo, Shitong Luo, Zhizhou Ren, Jian Peng, and
Jianzhu Ma. Full-atom peptide design based on multi-modal flow matching. arXiv preprint
arXiv:2406.00735, 2024a.
Qiuzhen Li, Efstathios Nikolaos Vlachos, and Patrick Bryant. Design of linear and cyclic peptide
binders of different lengths only from a protein target sequence. bioRxiv, pp. 2024–06, 2024b.
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image
generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024c.
Wuchen Li. Geometry of probability simplex via optimal transport. arXiv preprint
arXiv:1803.06360, 2(4):13, 2018.
Haitao Lin, Odin Zhang, Huifeng Zhao, Dejun Jiang, Lirong Wu, Zicheng Liu, Yufei Huang, and
Stan Z Li. Ppflow: Target-aware peptide design with torsional flow matching. bioRxiv, pp. 2024–
03, 2024.
Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin,
Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level
protein structure with a language model. Science, 379(6637):1123–1130, 2023.
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching
for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
Sidney Lyayuga Lisanza, Jacob Merle Gershon, Sam Wayne Kenmore Tipps, Lucas Arnoldt, Samuel
Hendel, Jeremiah Nelson Sims, Xinting Li, and David Baker. Joint generation of protein sequence
and structure with rosettafold sequence space diffusion. bioRxiv, pp. 2023–05, 2023.
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and
transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast
ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural
Information Processing Systems, 35:5775–5787, 2022.
Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, and Jianzhu Ma. Antigen-specific
antibody design and optimization with diffusion-based generative models for protein structures.
Advances in Neural Information Processing Systems, 35:9754–9767, 2022.
Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi,
Po-Ssu Huang, and Richard Socher. Progen: Language modeling for protein generation. arXiv
preprint arXiv:2004.03497, 2020.
Negin Manshour, Fei He, Duolin Wang, and Dong Xu. Integrating protein structure prediction and
bayesian optimization for peptide design. In NeurIPS 2023 Generative AI and Biology (GenBio)
Workshop, 2023.
Karolis Martinkus, Jan Ludwiczak, Kyunghyun Cho, Wei-Ching Lian, Julien Lafrance-Vanasse,
Isidro Hotzel, Arvind Rajpal, Yan Wu, Richard Bonneau, Vladimir Gligorijevic, et al. Abdiffuser:
Full-atom generation of in-vitro functioning antibodies. arXiv preprint arXiv:2308.05027, 2023.
16

16/33

Preprint, Under review
John B Matson and Samuel I Stupp. Self-assembling peptide scaffolds for regenerative medicine.
Chemical communications, 48(1):26–33, 2012.
Matt McPartlon and Jinbo Xu. An end-to-end deep learning method for rotamer-free protein sidechain packing. bioRxiv, pp. 2022–03, 2022.
Mikita Misiura, Raghav Shroff, Ross Thyer, and Anatoly B Kolomeisky. Dlpacker: deep learning
for prediction of amino acid side chain conformations in proteins. Proteins: Structure, Function,
and Bioinformatics, 90(6):1278–1290, 2022.
Irina S Moreira, Pedro A Fernandes, and Maria J Ramos. Hot spots—a review of the protein–protein
interface determinant amino-acid residues. Proteins: Structure, Function, and Bioinformatics, 68
(4):803–812, 2007.
Markus Muttenthaler, Glenn F King, David J Adams, and Paul F Alewood. Trends in peptide drug
discovery. Nature reviews Drug discovery, 20(4):309–325, 2021.
Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, and Ali Madani. Progen2: exploring
the boundaries of protein language models. Cell Systems, 14(11):968–978, 2023.
Martin Pacesa, Lennart Nickel, Joseph Schmidt, Ekaterina Pyatova, Christian Schellhaas, Lucas
Kissling, Ana Alcaraz-Serna, Yehlin Cho, Kourosh H Ghamary, Laura Vinue, et al. Bindcraft:
one-shot design of functional protein binders. bioRxiv, pp. 2024–09, 2024.
Marco Pasini, Javier Nistal, Stefan Lattner, and George Fazekas. Continuous autoregressive models with noise augmentation avoid error accumulation. In Audio Imagination: NeurIPS 2024
Workshop AI-Driven Speech, Music, and Sound Generation.
Evangelia Petsalaki and Robert B Russell. Peptide-mediated interactions in biological systems: new
discoveries and applications. Current opinion in biotechnology, 19(4):344–350, 2008.
Weizhen Qi, Yu Yan, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and
Ming Zhou. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv
preprint arXiv:2001.04063, 2020.
Barak Raveh, Nir London, Lior Zimmerman, and Ora Schueler-Furman. Rosetta flexpepdock abinitio: simultaneous folding, docking and refinement of peptides onto their receptors. PloS one, 6
(4):e18934, 2011.
Zhizhou Ren, Jiahan Li, Fan Ding, Yuan Zhou, Jianzhu Ma, and Jian Peng. Proximal exploration
for model-guided protein sequence design. In International Conference on Machine Learning,
pp. 18520–18536. PMLR, 2022.
Pierre H Richemond, Sander Dieleman, and Arnaud Doucet. Categorical sdes with simplex diffusion. arXiv preprint arXiv:2210.14784, 2022.
Jorge Roel-Touris, Marta Nadal, and Enrique Marcos. Single-chain dimers from de novo immunoglobulins as robust scaffolds for multiple binding loops. Nature Communications, 14(1):
5939, 2023.
Stefan Rose-John, Georg H Waetzig, Jurgen Scheller, Joachim Gr ¨ otzinger, and Dirk Seegert. The ¨
il-6/sil-6r complex as a novel target for therapeutic approaches. Expert opinion on therapeutic
targets, 11(5):613–624, 2007.
Quentin J Sattentau and John P Moore. The role of cd4 in hiv binding and entry. Philosophical
Transactions of the Royal Society of London. Series B: Biological Sciences, 342(1299):59–66,
1993.
Gisbert Schneider, Werner Neidhart, Thomas Giller, and Gerard Schmid. “scaffold-hopping” by
topological pharmacophore search: a contribution to virtual screening. Angewandte Chemie International Edition, 38(19):2894–2896, 1999.
Georg E Schulz and R Heiner Schirmer. Principles of protein structure. Springer Science & Business
Media, 2013.
17

17/33

Preprint, Under review
Maxim V Shapovalov and Roland L Dunbrack. A smoothed backbone-dependent rotamer library
for proteins derived from adaptive kernel density estimates and regressions. Structure, 19(6):
844–858, 2011.
Wataru Shihoya, Tamaki Izume, Asuka Inoue, Keitaro Yamashita, Francois Marie Ngako Kadji,
Kunio Hirata, Junken Aoki, Tomohiro Nishizawa, and Osamu Nureki. Crystal structures of human
etb receptor provide mechanistic insight into receptor activation and partial activation. Nature
communications, 9(1):4711, 2018.
Jung-Eun Shin, Adam J Riesselman, Aaron W Kollasch, Conor McMahon, Elana Simon, Chris
Sander, Aashish Manglik, Andrew C Kruse, and Debora S Marks. Protein design and variant
prediction using autoregressive generative models. Nature communications, 12(1):2403, 2021.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised
learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv
preprint arXiv:2010.02502, 2020a.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.
Advances in neural information processing systems, 32, 2019.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben
Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint
arXiv:2011.13456, 2020b.
Yuxuan Song, Jingjing Gong, Minkai Xu, Ziyao Cao, Yanyan Lan, Stefano Ermon, Hao Zhou,
and Wei-Ying Ma. Equivariant flow matching with hybrid probability transport. arXiv preprint
arXiv:2312.07168, 2023.
Robyn L Stanfield and Ian A Wilson. Protein-peptide interactions. Current opinion in structural
biology, 5(1):103–113, 1995.
Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, and Andrew Gordon Wilson. Accelerating bayesian optimization for biological sequence
design with denoising autoencoders. In International Conference on Machine Learning, pp.
20459–20478. PMLR, 2022.
Hannes Stark, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. Harmonic self-conditioned flow ¨
matching for multi-ligand docking and binding site design. arXiv preprint arXiv:2310.05764,
2023.
Martin Steinegger and Johannes Soding. Mmseqs2 enables sensitive protein sequence searching for ¨
the analysis of massive data sets. Nature biotechnology, 35(11):1026–1028, 2017.
Sebastian Swanson, Venkatesh Sivaraman, Gevorg Grigoryan, and Amy E Keating. Tertiary motifs
as building blocks for the design of protein-binding peptides. Protein Science, 31(6):e4322, 2022.
Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick
Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point
clouds. arXiv preprint arXiv:1802.08219, 2018.
Brian L Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, and
Tommi Jaakkola. Diffusion probabilistic modeling of protein backbones in 3d for the motifscaffolding problem. arXiv preprint arXiv:2206.04119, 2022.
Luc F Van Gaal, Ilse L Mertens, and Christophe E De Block. Mechanisms linking obesity with
cardiovascular disease. Nature, 444(7121):875–880, 2006.
Peter Vanhee, Almer M van der Sloot, Erik Verschueren, Luis Serrano, Frederic Rousseau, and Joost
Schymkowitz. Computational design of peptide ligands. Trends in biotechnology, 29(5):231–239,
2011.
18

18/33

Preprint, Under review
Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina
Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, et al. Alphafold protein
structure database: massively expanding the structural coverage of protein-sequence space with
high-accuracy models. Nucleic acids research, 50(D1):D439–D444, 2022.
Lyubomir T Vassilev. Small-molecule antagonists of p53-mdm2 binding: research tools and potential therapeutics. Cell cycle, 3(4):417–419, 2004.
Robert Verkuil, Ori Kabeli, Yilun Du, Basile IM Wicky, Lukas F Milles, Justas Dauparas, David
Baker, Sergey Ovchinnikov, Tom Sercu, and Alexander Rives. Language models generalize beyond natural proteins. bioRxiv, pp. 2022–12, 2022.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
Patrick Vlieghe, Vincent Lisowski, Jean Martinez, and Michel Khrestchatisky. Synthetic therapeutic
peptides: science and market. Drug discovery today, 15(1-2):40–56, 2010.
Jue Wang, Sidney Lisanza, David Juergens, Doug Tischer, Ivan Anishchenko, Minkyung Baek,
Joseph L Watson, Jung Ho Chun, Lukas F Milles, Justas Dauparas, et al. Deep learning methods
for designing proteins scaffolding functional sites. BioRxiv, pp. 2021–11, 2021.
Lei Wang, Nanxi Wang, Wenping Zhang, Xurui Cheng, Zhibin Yan, Gang Shao, Xi Wang, Rui
Wang, and Caiyun Fu. Therapeutic peptides: Current applications and future directions. Signal
Transduction and Targeted Therapy, 7(1):48, 2022.
Weiran Wang and Miguel A Carreira-Perpinan. Projection onto the probability simplex: An efficient ´
algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.
Yi Wang, Hui Tang, Lichao Huang, Lulu Pan, Lixiang Yang, Huanming Yang, Feng Mu, and Meng
Yang. Self-play reinforcement learning guides protein engineering. Nature Machine Intelligence,
5(8):845–860, 2023.
Yongkang Wang, Xuan Liu, Feng Huang, Zhankun Xiong, and Wen Zhang. A multi-modal contrastive diffusion model for therapeutic peptide generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 3–11, 2024.
Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. Broadly applicable
and accurate protein design by integrating structure prediction networks and diffusion generative
models. BioRxiv, pp. 2022–12, 2022.
Hong Wei, Wenkai Wang, Zhenling Peng, and Jianyi Yang. Q-biolip: A comprehensive resource for
quaternary structure-based protein–ligand interactions. Genomics, Proteomics & Bioinformatics,
pp. qzae001, 2024.
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In
Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681–688.
Citeseer, 2011.
Zeyu Wen, Jiahua He, Huanyu Tao, and Sheng-You Huang. Pepbdb: a comprehensive structural
database of biological peptide–protein interactions. Bioinformatics, 35(1):175–177, 2019.
Camille Georges Wermuth. The practice of medicinal chemistry. Academic Press, 2011.
James C Whisstock and Arthur M Lesk. Prediction of protein function from protein sequence and
structure. Quarterly reviews of biophysics, 36(3):307–340, 2003.
Chien-Hsun Wu, I-Ju Liu, Ruei-Min Lu, and Han-Chung Wu. Advancement and applications of
peptide phage display technology in biomedical science. Journal of biomedical science, 23:1–14,
2016.
Kevin E Wu, Kevin K Yang, Rianne van den Berg, James Y Zou, Alex X Lu, and Ava P Amini.
Protein structure generation via folding diffusion. arXiv preprint arXiv:2209.15611, 2022.
19

19/33

Preprint, Under review
Kevin E Wu, Kevin K Yang, Rianne van den Berg, Sarah Alamdari, James Y Zou, Alex X Lu, and
Ava P Amini. Protein structure generation via folding diffusion. Nature communications, 15(1):
1059, 2024a.
Ruidong Wu, Ruihan Guo, Rui Wang, Shitong Luo, Yue Xu, Jiahan Li, Jianzhu Ma, Qiang Liu,
Yunan Luo, and Jian Peng. Fafe: Immune complex modeling with geodesic distance loss on
noisy group frames. arXiv preprint arXiv:2407.01649, 2024b.
Xuezhi Xie, Pedro A Valiente, and Philip M Kim. Helixgan a deep-learning methodology for conditional de novo design of α-helix structures. Bioinformatics, 39(1):btad036, 2023.
Xuezhi Xie12, Pedro A Valiente, Jisun Kim, and Philip M Kim123. Helixdiff: Hotspot-specific
full-atom design of peptides using diffusion models.
Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923,
2022.
Andy Hsien-Wei Yeh, Christoffer Norn, Yakov Kipnis, Doug Tischer, Samuel J Pellock, Declan
Evans, Pengchen Ma, Gyu Rie Lee, Jason Z Zhang, Ivan Anishchenko, et al. De novo design of
luciferases using deep learning. Nature, 614(7949):774–780, 2023.
Jason Yim, Andrew Campbell, Andrew YK Foong, Michael Gastegger, Jose Jim ´ enez-Luna, Sarah ´
Lewis, Victor Garcia Satorras, Bastiaan S Veeling, Regina Barzilay, Tommi Jaakkola, et al. Fast
protein backbone generation with se (3) flow matching. arXiv preprint arXiv:2310.05297, 2023a.
Jason Yim, Brian L Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay,
and Tommi Jaakkola. Se (3) diffusion model with application to protein backbone generation.
arXiv preprint arXiv:2302.02277, 2023b.
Jason Yim, Andrew Campbell, Emile Mathieu, Andrew YK Foong, Michael Gastegger, Jose´
Jimenez-Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S Veeling, Frank No ´ e, et al. Im- ´
proved motif-scaffolding with se (3) flow matching. arXiv preprint arXiv:2401.04082, 2024.
Lei Yu, Stephanie A Barros, Chengzao Sun, and Sandeep Somani. Cyclic peptide linker design and
optimization by molecular dynamics simulations. Journal of Chemical Information and Modeling,
63(21):6863–6876, 2023.
Vinicius Zambaldi, David La, Alexander E Chu, Harshnira Patani, Amy E Danson, Tristan OC
Kwan, Thomas Frerix, Rosalia G Schneider, David Saxton, Ashok Thillaisundaram, et al. De
novo design of high-affinity protein binders with alphaproteo. arXiv preprint arXiv:2409.08022,
2024.
Chengxin Zhang, Xi Zhang, Peter L Freddolino, and Yang Zhang. Biolip2: an updated structure
database for biologically relevant ligand–protein interactions. Nucleic Acids Research, 52(D1):
D404–D412, 2024.
Jianhua Zhang, Jun Yun, Zhigang Shang, Xiaohui Zhang, and Borong Pan. Design and optimization
of a linker for fusion protein construction. Progress in Natural Science, 19(10):1197–1200, 2009.
Jingfen Zhang, Zhiquan He, Qingguo Wang, Bogdan Barz, Ioan Kosztin, Yi Shang, and Dong Xu.
Prediction of protein tertiary structures using mufold. Functional Genomics: Methods and Protocols, pp. 3–13, 2012.
Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan
Wang, Silvio Savarese, Stefano Ermon, et al. Hive: Harnessing human feedback for instructional
visual editing. arXiv preprint arXiv:2303.09618, 2023a.
Yang Zhang and Jeffrey Skolnick. Tm-align: a protein structure alignment algorithm based on the
tm-score. Nucleic acids research, 33(7):2302–2309, 2005.
Yangtian Zhang, Zuobai Zhang, Bozitao Zhong, Sanchit Misra, and Jian Tang. Diffpack: A torsional
diffusion model for autoregressive protein side-chain packing. arXiv preprint arXiv:2306.01794,
2023b.
20

20/33

Preprint, Under review
Zaixi Zhang, Zepu Lu, Zhongkai Hao, Marinka Zitnik, and Qi Liu. Full-atom protein pocket design
via iterative refinement. arXiv preprint arXiv:2310.02553, 2023c.
Tongqing Zhou, Ling Xu, Barna Dey, Ann J Hessell, Donald Van Ryk, Shi-Hua Xiang, Xinzhen
Yang, Mei-Yun Zhang, Michael B Zwick, James Arthos, et al. Structural definition of a conserved
neutralization epitope on hiv-1 gp120. Nature, 445(7129):732–737, 2007.
21

21/33

Preprint, Under review
A HOT SPOT RESIDUES
In the context of protein interactions, hot spots refer to specific amino acid residues within a proteinprotein interface that significantly contribute to the binding affinity and stability of the complex
(Bogan & Thorn, 1998; Moreira et al., 2007; Keskin et al., 2005; Guo et al.). These residues are
often characterized by their energetic contributions, with a few critical interactions having a disproportionate effect on the overall binding energy. These regions typically have a higher likelihood
of binding events and are characterized by favorable structural and energetic features. Identifying
and understanding these hot spots is crucial for the design of effective inhibitors or modulators that
selectively target protein-protein interactions (PPIs). Several classic examples illustrate the concept
of protein interaction hot spots:
• p53 and MDM2: The interaction between the tumor suppressor protein p53 and its negative
regulator MDM2 is a well-studied example. Key residues in p53, such as Leu-22, Trp-23,
and Phe-19, have been identified as hot spots for binding to MDM2. Disruption of this
interaction is a promising strategy for cancer therapy (Vassilev, 2004).
• Antibody-Antigen Interactions: The binding of antibodies to their specific antigens often
involves hot spots that are essential for the specificity and affinity of the interaction. For
example, the interaction between the antibody 1G12 and the HIV-1 envelope glycoprotein
gp120 features specific residues on both molecules that contribute significantly to binding
(Zhou et al., 2007).
• Receptor-Peptide Interactions: The interaction between the CD4 receptor and the HIV-1
gp120 envelope protein is another notable example. Specific residues on CD4, such as
Asp368 and Tyr371, serve as hot spots that facilitate binding, leading to viral entry into
host cells (Sattentau & Moore, 1993).
• Cytokine-Receptor Binding: The interaction between cytokines and their receptors is critical for immune signaling. For instance, the binding of interleukin-6 (IL-6) to its receptor
IL-6R involves hot spot residues that are crucial for signal transduction and biological activity (Rose-John et al., 2007).
In addition to hot-spot residues, which directly mediate protein-protein interactions, the remaining
residues in the interface are referred to as scaffold residues. These scaffold residues play a crucial
role in providing structural support to maintain the stability and conformation of the interface. While
scaffold residues do not typically contribute significantly to the binding free energy, they ensure
that the hot spots are properly positioned to interact with their binding partners. This structural
framework is essential for maintaining the overall architecture of the protein complex and facilitating
specific interactions at the hot-spot regions. For example, scaffold residues often help to shield hot
spots from solvent exposure, thereby preserving their high binding affinities.
In our context, we focus on the hot-spot residues on the peptides that are crucial for key interactions
with their target, as they are thought to be structurally conserved. While there are also hot spots on
the target proteins, we do not model them here. Some works, however, focus on designing binders
that specifically target these hot-spot residues on the proteins, treating them as critical interaction
points (Watson et al., 2022; Zambaldi et al., 2024).
In our peptide design task, since we aim to design peptides from scratch, we lack prior knowledge
about the ground-truth hot-spot residues, which are typically defined by energy functions and conserved interactions. Instead, we employ a density model to identify statistically favorable residues
as hot spots. Essentially, our energy function is represented by a neural network. For the scaffold
generation task, we first use the Rosetta energy function (Raveh et al., 2011; Alford et al., 2017) to
compute the factorized binding energies of each peptide residue at the binding interface. We then
manually select key residues with lower binding energy contributions, ensuring they are uniformly
distributed along the peptide. This approach is important because, in scaffold generation, we aim to
link these hot spots with a feasible and evenly distributed scaffold structure, ensuring the stability
and functionality of each fragment.
22

22/33

Preprint, Under review
B ADJACENT RESIDUE RECONSTRUCTION
B.1 PEPTIDE BOND AND PLANAR
Peptide bonds are the key linkages between amino acids in proteins, formed through a condensation
reaction between the carboxyl group of one amino acid and the amino group of another. This bond is
an amide bond, specifically between the carbonyl carbon (C=O) of one amino acid and the nitrogen
(N-H) of another. The nature of the peptide bond plays a crucial role in determining the structure
and stability of peptides and proteins.
One significant characteristic of the peptide bond is its partial double bond nature. Although it’s
formally a single bond, resonance between the lone pair of electrons on the nitrogen and the carbonyl group gives the bond partial double-bond character. This resonance limits rotation around
the peptide bond, which leads to a rigid and planar structure between the alpha carbon atoms of the
amino acids involved in the bond. As a result, the peptide bond creates a flat, coplanar arrangement
of the six atoms involved: the nitrogen, hydrogen, carbon, oxygen, and the two alpha carbons (one
from each amino acid).
This planarity is crucial because it helps constrain the protein’s overall folding pattern. The rigid
planes of successive peptide bonds are connected by flexible single bonds at the alpha carbons,
allowing the polypeptide chain to fold into its specific secondary structures like alpha helices and
beta sheets. The restricted rotation around the peptide bond imposes dihedral angles, ϕ and ψ,
which define the conformation of the polypeptide backbone and are critical in shaping the overall
three-dimensional structure of proteins.
In summary, the partial double-bond character of the peptide bond is a key factor in maintaining the
planarity and rigidity of the peptide backbone, which in turn plays a fundamental role in determining
the folding and function of proteins.
B.2 DIHEDRAL ANGLES
In proteins, the dihedral angles—ϕ and ψ—define the conformation of the protein backbone by
describing the rotation around specific bonds in the peptide chain. These angles are crucial for
understanding the three-dimensional structure of a protein. Below is a detailed explanation of how
these angles are calculated:
General Formula for Dihedral Angle Calculation To calculate a dihedral angle between four
consecutive atoms (A, B, C, D), the steps are:
1. Compute the bond vectors:
AB⃗ = B − A, BC⃗ = C − B, CD⃗ = D − C
2. Calculate the normal vectors of the two planes formed by the atoms:
⃗n1 = AB⃗ × BC, ⃗n ⃗
2 = BC⃗ × CD⃗
3. Use the dot product to find the angle between the two planes:
angle = arctan 2 BC⃗ · (⃗n1 × ⃗n2), ⃗n1 · ⃗n2

(25)
This general method can be applied to calculate ϕ and ψ dihedral angles across a protein’s backbone.
Phi (ϕi) Angle The ϕi angle is the dihedral angle around the bond between the nitrogen (N) and
alpha carbon (Cα) of residue i. It is defined by the following four atoms:
• Ci−1: Carbonyl carbon of the previous residue
• Ni: Amide nitrogen of the current residue
• Cα,i: Alpha carbon of the current residue
• Ci: Carbonyl carbon of the current residue
23

23/33

Preprint, Under review
The dihedral angle ϕiis calculated as the angle between the planes formed by the atoms
(Ci−1, Ni, Cα,i) and (Ni, Cα,i, Ci):
ϕi = angle between planes (Ci−1, Ni, Cα,i) and (Ni, Cα,i, Ci) (26)
Psi (ψi) Angle The ψi angle describes the dihedral angle around the bond between the alpha
carbon (Cα) and the carbonyl carbon (C) of residue i. It is defined by the following four atoms:
• Ni: Amide nitrogen of the current residue
• Cα,i: Alpha carbon of the current residue
• Ci: Carbonyl carbon of the current residue
• Ni+1: Amide nitrogen of the next residue
The dihedral angle ψiis calculated as the angle between the planes formed by the atoms
(Ni, Cα,i, Ci) and (Cα,i, Ci, Ni+1):
ψi = angle between planes (Ni, Cα,i, Ci) and (Cα,i, Ci, Ni+1) (27)
Psi (ψi−1) of Previous Residue The ψi−1 angle is calculated similarly to ψi but for the previous
residue. It involves the following four atoms:
• Ni−1: Amide nitrogen of the previous residue
• Cα,i−1: Alpha carbon of the previous residue
• Ci−1: Carbonyl carbon of the current residue
• Ni: Amide nitrogen of the current residue
The dihedral angle ψi−1 is calculated as:
ψi−1 = angle between planes (Ni−1, Cα,i−1, Ci−1) and (Cα,i−1, Ci−1, Ni) (28)
Phi (ϕi+1) of Next Residue The ϕi+1 angle is similar to ϕi but for the next residue. It involves
the following four atoms:
• Ci: Carbonyl carbon of the current residue
• Ni+1: Amide nitrogen of the next residue
• Cα,i+1: Alpha carbon of the next residue
• Ci+1: Carbonyl carbon of the next residue
The dihedral angle ϕi+1 is calculated as:
ϕi+1 = angle between planes (Ci, Ni+1, Cα,i+1) and (Ni+1, Cα,i+1, Ci+1) (29)
C ADJACENT STRUCTURE RECONSTRUCTION
Reconstructing the backbone structures of adjacent residues Ri−1 and Ri+1 involves two main steps.
First, we use dihedral angles to rotate the standard residue coordinates in the local frame. Then, we
transform the local frame coordinates back to the global frame based on the position and orientation
of the given residue. As an example, let’s consider the Right operation, where we use the coordinates
of Ri and the associated dihedral angles ψi and ϕi+1 to compute the structure of Ri+1.
C.1 CALCULATING xi AND Oi
To convert local coordinates into global coordinates, we first need to calculate the translation vector
xi and the orientation matrix Oi for residue Ri.
24

24/33

Preprint, Under review
Translation Vector xi The translation vector xi defines the global position of residue Ri, and it is
typically chosen as the position of a key atom within the residue. A common choice is the α-carbon
(CAi), which is centrally located in the backbone of the residue, providing a stable reference point
for the rest of the structure.
xi = CAi (30)
This ensures that the relative coordinates of other atoms in Ri can be translated into the global
coordinate system.
Orientation Matrix Oi The orientation matrix Oi defines the local frame of reference for residue
Ri and allows us to rotate local coordinates into the global frame. To construct Oi, we use the
positions of three key atoms: Ci (carbonyl carbon), CAi (alpha carbon), and Ni (amide nitrogen).
These atoms form the backbone of the residue, and their relative positions define the orientation of
the local coordinate system.
The steps to compute Oi are as follows:
1. Compute the vector from Cito CAi, and normalize it to obtain the first basis vector e1:
v1 = Ci − CAi (31)
e1 =
v1
||v1|| (32)
2. Compute the vector from Nito CAi, and remove the projection of this vector onto e1 to get the
component orthogonal to e1. Normalize this to obtain the second basis vector e2:
v2 = Ni − CAi (33)
u2 = v2 −

v2 · e1
||e1||2

e1 (34)
e2 =
u2
||u2|| (35)
3. Compute the third orthogonal vector e3 as the cross product of e1 and e2:
e3 = e1 × e2 (36)
4. The orientation matrix Oiis formed by combining these three orthonormal vectors into a matrix:
Oi = [e1, e2, e3] (37)
This orientation matrix Oi defines the local coordinate system for residue Ri and can be used to
transform the local coordinates of neighboring residues into the global frame.
C.2 LOCAL COORDINATE RECONSTRUCTION
Given a reference peptide structure, we apply dihedral angle transformations to compute the new
coordinates based on the angles ψi and ϕi+1.
Rotation Matrix Definition The rotation matrix for an arbitrary axis d = (dx, dy, dz) and angle
θ is given by the Rodrigues’ rotation formula:
R(d, θ) = I + sin(θ)K + (1 − cos(θ))K2(38)
where K is the skew-symmetric matrix of d:
K =
"
0 −dz dy
dz 0 −dx
−dy dx 0
#
(39)
25

25/33

Preprint, Under review
Initial Reference Coordinates The initial reference coordinates for the peptide are as follows:
N1 = (−0.572, 1.337, 0.000), CA1 = (0.000, 0.000, 0.000), C1 = (1.517, 0.000, 0.000)
N2 = (2.1114, 1.1887, 0.0000), CA2 = (3.5606, 1.3099, 0.0000), C2 = (4.0913, −0.1112, 0.0000)
Note: The above coordinates are derived from the standard structure of glycine (GLY), one of the
simplest amino acids. GLY is chosen because it lacks a side chain (only a hydrogen atom as a side
chain), making it an ideal reference for constructing peptide backbone geometries. Additionally,
the peptide bond between C1 and N2 is assumed to be planar, with ψ1 = ψ2 = 0◦, reflecting the
typical rigidity of peptide bonds. This simplifies the geometric model by aligning the peptide bond
in a 0/180-degree plane.
Rotate C2 around CA2 → N2 axis (ϕi+1) We first rotate C2 around the axis from CA2 to
N2 by the dihedral angle ϕi+1. The axis is defined as:
dN2−CA2 =
N2 − CA2
||N2 − CA2|| (40)
The new position of C2 after applying the rotation matrix R(ϕi+1) is:
C2′ = CA2 + R(dN2−CA2, ϕi+1) · (C2 − CA2) (41)
Rotate C2 around CA1 → C1 axis (ψi) Next, we rotate C2 around the axis from CA1 to C1
by the dihedral angle ψi. The axis is defined as:
dC1−CA1 =
C1 − CA1
||C1 − CA1|| (42)
The new positions of C2, CA2, and N2 after applying the rotation matrix R(ψi) are:
C2rel = CA1 + R(dC1−CA1, ψi) · (C2′ − CA1) (43)
CA2rel = CA1 + R(dC1−CA1, ψi) · (CA2 − CA1) (44)
N2rel = CA1 + R(dC1−CA1, ψi) · (N2 − CA1) (45)
C.3 CONVERTING RELATIVE COORDINATES TO GLOBAL COORDINATES
Once the relative coordinates have been calculated, we transform them back into global coordinates
using a combination of rotation Oi and translation xi. We use the calculated relative coordinates of
CA2, C2, and N2 and transform them into the global coordinate system.
This transformation is applied to each of the atoms CA2, C2, and N2:
CAi+1 = CA2global = xi + Oi· CA2rel, (46)
Ci+1 = C2global = xi + Oi· C2rel, (47)
Ni+1 = N2global = xi + Oi· N2rel. (48)
After getting backbone atoms, the side-chain atoms are reconstructed by side-chain packing algorithms, in our implementation, we use PackRotamersMover in Pyrosetta (Chaudhury et al., 2010) to
pack side-chains for the newly added residue, which is based on rotamer library and rosetta energy
function (Shapovalov & Dunbrack, 2011; Alford et al., 2017).
D VON MOSIES DISTRIBUTION
The von Mises distribution is a continuous probability distribution defined on the circle, commonly
used to model angular or directional data. It is sometimes referred to as the ”circular normal distribution” due to its similarity to the normal distribution on a plane, but it is defined for angles
or directions. The von Mises distribution is often referred to as the circular normal distribution
because it shares several properties with the normal distribution, such as being unimodal and symmetric around the mean. As κ approaches infinity, the von Mises distribution approaches a normal
distribution in terms of angular deviation.
26

26/33

Preprint, Under review
D.1 PROBABILITY DENSITY FUNCTION (PDF)
The probability density function (PDF) of the von Mises distribution is given by:
f(θ | µ, κ) = 1
2πI0(κ)
exp (κ cos(θ − µ))
where:
• θ ∈ [0, 2π) is the random variable ((an angle measured in radians),
• µ is the mean direction of the distribution, or the central angle around which the data are
clustered,
• κ ≥ 0 is the concentration parameter,analogous to the inverse of the variance in the normal
distribution, which controls how tightly the data are clustered around the mean direction.
When κ = 0, the distribution is uniform around the circle, while larger values of κ indicate
that the data are more tightly concentrated around µ,
• I0(κ) is the modified Bessel function of the first kind of order 0, which serves as a normalization constant to ensure that the total probability integrates to 1 over the circle, defined
as:
I0(κ) = 1
π
Z π
0
e
κ cos(ϕ)
dϕ
The concentration parameter κ controls the spread of the distribution around the mean direction µ:
• When κ = 0, the von Mises distribution becomes a uniform distribution on the circle.
• As κ → ∞, the distribution becomes increasingly concentrated around µ and approaches a
normal distribution for small angular deviations.
D.2 CUMULATIVE DISTRIBUTION FUNCTION (CDF)
The cumulative distribution function (CDF) of the von Mises distribution does not have a simple
closed-form expression. However, it can be computed numerically as:
F(θ | µ, κ) = 1
2πI0(κ)
Z θ
−π
exp (κ cos(t − µ)) dt
where the integral is typically evaluated numerically due to the complexity introduced by the Bessel
function and the circular nature of the distribution.
D.3 MEAN AND VARIANCE
The mean direction µ is the central tendency of the von Mises distribution, and the concentration
parameter κ influences how closely the data are clustered around µ. The variance of the distribution
is related to κ as follows:
Var(θ) = 1 −
I1(κ)
I0(κ)
where I1(κ) is the modified Bessel function of the first kind of order 1.
In our implementation, we use VonMises distribution implemented in pytorch package for sampling
angles and evluating likelihood.
27

27/33

Preprint, Under review
E PEPHAR IMPLEMENTATIONS
E.1 NETWORK DETAILS
The network used in our method includes the density model gθ in the founding stage and the prediction model hθ in the extension stage. Both networks use an Invariant Point Attention (IPA)-based
encoder to encode hidden representations of peptide residues (Yim et al., 2023a; Luo et al., 2022;
Li et al., 2024a; Wu et al., 2024b). The input to the network consists of the backbone coordinates
and residue types of the peptide-target complex. Several shallow MLPs (multi-layer perceptrons)
are applied to transform these inputs into initial feature representations. The network outputs node
embeddings and node-pair embeddings.
For the node (residue) embeddings, we use a combination of the following features:
• Residue type: A learnable embedding is applied.
• Atom coordinates: This includes both backbone and side-chain atoms.
These features are processed by separate MLPs. The concatenated features are then transformed by
another MLP to produce the final node embedding.
For the edge (residue-pair) embeddings, we use a combination of the following features:
• Residue-type pair: A learnable embedding matrix of size 20 × 20 is applied.
• Relative sequential positions: A learnable embedding of the relative position between the
two residues is applied.
• Distance between two residues.
• Relative orientation between two residues: The inter-residue backbone dihedral angles are
calculated to represent the relative orientation, and sinusoidal embeddings are applied.
Similarly, these features are processed by separate MLPs, concatenated, and then transformed by
another MLP to produce the final edge embedding.
Starting from the node and edge embeddings, the IPA encoder further encodes these representations.
The density model then applies a classification layer to classify residue types, while the prediction
model uses a regression head to predict parameters of the angle distributions. Since we use two
separate models instead of a single model, we set the embedding size to 128 for node embeddings
and 16 for edge embeddings, ensuring comparable parameters to our baseline models (Li et al.,
2024a). The IPA encoder consists of 4 layers, each with 8 query heads and a hidden dimension of
32.
E.2 TRAINING DETAILS
Both the density and prediction models are trained on 8 NVIDIA A100 GPUs. As we found these
two models prone to overfitting, we employed an early stopping strategy, training the density model
for 1400 iterations and the prediction model for 2400 iterations. This results in a training time that
is significantly shorter than the baseline models. We set the batch size to 64, using Adam as the
optimizer with a learning rate of 3 × 10−4. To prevent overfitting, we also applied a dropout rate of
0.5 at each layer in the IPA.
E.3 SAMPLING DETAILS
For the sampling process, we use 10 iterations with an update rate of 0.01 in the founding stage
to sample anchors by default, followed by 100 fine-tuning steps with an update rate of 0.1 in the
correction stage. In the scaffold design task, we replace the sampled hotspot residues with predefined
ground truth residues.
28

28/33

$Preprint, Under review F EXPERIMENTAL DETAILS In each task, for every method, we generate 64 peptides for each target protein, and the evaluation metrics are averaged across all generated peptides. F.1 METRIC DETAILS Valid This metric checks whether the distance between adjacent residues is consistent with peptide bond formation. Specifically, the distance between the Cα atoms of adjacent residues must be within 3.8A˚ for the peptide to be considered valid (Chelvanayagam et al., 1998; Zhang et al., 2012). RMSD The Root-Mean-Square Deviation (RMSD) is a widely used metric to assess structural similarity between two protein conformations. In our evaluation, the generated peptide is aligned with the native peptide using the Kabsch algorithm Kabsch (1976), focusing only on the peptide portion of the complex for superposition. After alignment, we compute the RMSD by calculating the normalized distance between corresponding Cα atoms in the generated and native peptides. A lower RMSD value indicates a closer alignment to the native structure, with 0 implying perfect alignment. SSR The Secondary Structure Ratio (SSR) measures the similarity between the secondary structures of the generated peptide and the reference peptide. It is computed by comparing the secondary structure labels of the two peptides. These labels are assigned using the DSSP software Kabsch & Sander (1983). The ratio of matching secondary structure assignments between the generated and native peptides is then calculated. A higher SSR value indicates that the generated peptide preserves the secondary structure of the native peptide more closely. BSR Binding Site Rate (BSR) quantifies how well the binding interactions of the generated peptide with the target protein resemble those of the native peptide. We define a residue as part of the binding site if its Cβ atom is within a 6Aradius of any atom in the peptide. The BSR is calculated as the ˚ overlap ratio of binding site residues between the generated peptide and the native peptide. A higher BSR indicates that the generated peptide interacts with the protein target in a manner similar to the native peptide, which could suggest similar functional properties. Stability Stability refers to the proportion of designed peptides that achieve a lower energy score than the native peptide-protein complex. A lower energy score generally implies greater structural stability. We use the FastRelax protocol in PyRosetta Chaudhury et al. (2010) to relax each complex, followed by evaluation using the REF2015 scoring function. It is then computed as the fraction of complexes where the designed peptide leads to a lower total energy compared to the native complex, indicating improved stability. Affinity Binding Affinity evaluates the percentage of designed peptides that exhibit stronger binding interactions with the target protein compared to the native peptide, as determined by their binding energy. Higher binding affinity usually suggests enhanced peptide functionality. Using PyRosetta’s InterfaceAnalyzerMover Chaudhury et al. (2010), we calculate the binding energy after relaxing the complex and defining the interaction interface. Affinity is the percentage of peptides that show lower binding energy (and thus higher affinity) relative to the native peptide. Novelty Novelty measures the proportion of generated peptides that are structurally and sequentially distinct from the native peptide. A peptide is considered novel if it satisfies two conditions: (a) the TM-score (a structural similarity score) between the generated and native peptides is less than or equal to 0.5 (Zhang & Skolnick, 2005), and (b) the sequence identity between the two peptides is less than or equal to 0.5. A higher novelty score indicates that more generated peptides are different from the native peptide. Diversity Diversity assesses the variability among the generated peptides in terms of both structure and sequence. It is calculated as the product of pairwise (1 - TM-score) and (1 - sequence identity) 29$

29/33

Preprint, Under review
across all generated peptides for a given target. A higher diversity score indicates greater structural and sequence diversity among the generated peptides, suggesting a broader range of potential
functional variations.
Success Success rate evaluates the quality of predicted complex structures using the confidence
score from AlphaFold2. Specifically, we employ AlphaFold2 Multimer (Evans et al., 2021) to predict the structures of peptides and their full-length receptors. The success rate is calculated as the
proportion of complexes with an overall ipTM score greater than 0.6.
F.2 BASELINE DETAILS
In the scaffold geneartion task, we set all baseline hot spot residues as K = 3, as the highest hot
spot numbers in our experiment.
RFDiffusion For RFDiffusion (Watson et al., 2022), we use the official implementation along
with the pretrained weights for the complex version. Specifically, we apply 200 discrete timesteps
during the diffusion process, generating 64 peptides for each target protein. In the peptide design
task, we use the official binder design scripts for sampling. For the scaffold design task, we provide
the ground-truth hotspot residues as additional information, instructing the model to inpaint the
remaining peptide structure. After sampling the peptide backbone, we use ProteinMPNN (Dauparas
et al., 2022) to generate the corresponding peptide sequences.
ProteinGenerator ProteinGenerator jointly samples protein backbone structures and sequences
(Lisanza et al., 2023). We utilize the official inference scripts, employing 200 diffusion timesteps to
design 64 peptides for each target protein.
PepFlow We employ the released Model No.1 from PepFlow (Li et al., 2024a), which is a fullatom peptide design model based on multi-modal flow matching. In the peptide design task, we
utilize the official implementation for sampling. For the scaffold generation task, we modify the
sampling process to ensure that the hotspot residues are constrained to their ground-truth values
during each update step. For each generation, we perform 200 discrete time steps for sampling 64
peptides of each target.
PepGLAD We utilize the released inference script and co-design model from PepGLAD (Kong
et al., 2024), which leverages latent diffusion models to generate full-atom peptide structures. For
the scaffold generation task, we modify the inference script so that the input peptide includes certain
known masked blocks instead of all blocks being masked. We maintain the same hyperparameter
settings as described in the original paper.
G ADDITIONAL RESULT
G.1 TM-SCORE AND AAR
Table 4: TM-Score and Sequence Recovery Rate in peptide design and scaffold generation tasks.
Table 5: Peptide Design Task
Method TM AAR
RFDiffusion 0.44 40.14
ProteinGenerator 0.43 45.82
PepFlow 0.38 51.25
PepGLAD 0.29 20.59
PepHAR (K=1) 0.33 32.32
PepHAR (K=2) 0.32 39.91
PepHAR (K=3) 0.34 34.36
Table 6: Scaffold Generation Task
Method TM AAR
RFDiffusion (K=3) 0.46 31.14
ProteinGenerator (K=3) 0.48 32.05
PepFlow (K=3) 0.37 51.90
PepGLAD 0.30 21.48
PepHAR (K=1) 0.33 32.90
PepHAR (K=2) 0.35 35.06
PepHAR (K=3) 0.38 35.34
30

30/33

Preprint, Under review
G.2 CUMULATIVE ERROS
Figure 6: Scatter Plot of peptide length and RMSD value.
We analyzed the relationship between the length of the generated peptides and their corresponding
RMSD values compared to native structures. As shown in the figure, peptide length demonstrates a
strong positive correlation with RMSD, indicating that the autoregressive generation process accumulates errors as peptide length increases. Specifically, during the extension stage, dihedral angle
predictions are used to iteratively add residues to the peptide structure. However, these predictions
are not always precise. For example, when predicting the dihedral angles for a new residue, small
deviations from the ground truth can occur. These deviations lead to slight inaccuracies in the reconstructed residue’s backbone conformation. Once a residue is added with structural bias, the error
propagates to subsequent extension steps. For instance, if the incorrect backbone conformation positions the residue slightly off from its ideal location, the next prediction step will use this biased
structure as input. This can result in further deviation in the predicted dihedral angles for the next
residue, introducing additional distortions to the growing peptide. Over multiple iterations, these
small inaccuracies accumulate, leading to increasing structural deviations from the native conformation. This accumulation of structural errors explains the observed increase in RMSD with longer
peptide sequences. The inherent dependency of the autoregressive process on previously generated
fragments makes it particularly susceptible to error propagation during residue-by-residue extension. Addressing this limitation is crucial to improving the accuracy of peptide structure generation
in future work.
Future work could address this accumulation issue in our autoregressive extension model. First, we
can improve the precision of each prediction step by employing more accurate dihedral prediction
models. Given the limited peptide datasets, leveraging pretrained models trained on larger datasets
or incorporating traditional energy relaxation methods at each extension step to correct backbone
bias could be beneficial. Second, advancements in autoregressive language modeling could be applied, such as predicting multiple dihedral angles simultaneously (Qi et al., 2020; Gloeckle et al.,
2024), injecting noise during training (Pasini et al.), or using diffusion models to refine the predicted
posterior distribution (Li et al., 2024c). Third, non-autoregressive approaches, such as generative
modeling (Kenton & Toutanova, 2019; Chang et al., 2022) or diffusion models (Wu et al., 2024a),
could be explored to generate all dihedral angles simultaneously and refine predictions iteratively.
G.3 DIFFERENT K VALUES OF BASELINES
As shown in Table 7, unlike PepHAR, which benefits from increasing K in terms of geometry
and energy metrics, the baseline models show only marginal improvements or even performance
degradation. We believe this is due to their training schemes, which do not explicitly condition on
known hotspots. Optimizing their training and inference processes for the scaffold setting could
potentially improve their performance.
31

31/33

Preprint, Under review
Table 7: Evaluation of methods in the scaffold generation task. K = 1, 2, 3 is the number of hot
spots.
Valid % ↑ RMSD A˚ ↓ SSR % ↑ BSR % ↑ Stability % ↑ Affinity % ↑ Novelty % ↑ Diversity % ↑ Success % ↑
RFDiffusion (K = 1) 66.80 3.51 63.19 23.56 17.73 20.58 66.17 22.46 19.19
RFDiffusion (K = 2) 68.68 2.85 65.68 31.14 18.69 22.34 45.53 24.14 21.59
RFDiffusion (K = 3) 69.88 4.09 63.66 26.83 20.07 21.26 55.03 26.67 23.15
ProteinGenerator (K = 1) 68.00 3.79 64.53 25.52 18.59 20.55 60.39 21.28 20.04
ProteinGenerator (K = 2) 69.20 3.92 66.79 30.61 19.08 21.54 55.24 23.29 20.42
ProteinGenerator (K = 3) 68.52 3.95 65.86 24.17 20.40 22.80 50.73 20.82 24.90
PepFlow (K = 1) 40.35 2.51 79.58 86.40 10.55 18.13 50.46 16.29 26.46
PepFlow (K = 2) 49.29 2.82 79.23 85.05 10.19 18.20 54.74 19.52 24.03
PepFlow (K = 3) 42.68 2.45 81.00 82.76 11.17 13.64 50.93 16.97 24.54
PepGLAD (K = 1) 53.45 3.87 76.59 20.15 14.54 12.03 50.46 30.84 13.56
PepGLAD (K = 2) 52.96 3.93 75.06 20.02 10.29 15.72 54.74 30.36 14.03
PepGLAD (K = 3) 53.51 3.84 76.26 19.61 12.22 18.27 50.93 30.99 14.85
PepHAR (K = 1) 56.01 3.72 80.61 78.18 17.89 19.94 80.61 29.79 20.43
PepHAR (K = 2) 55.36 2.85 82.79 85.80 19.18 19.17 74.76 25.32 22.09
PepHAR (K = 3) 55.41 2.15 83.02 88.02 20.50 20.65 72.56 19.68 21.45
Figure 7: Upper Left: Human Endothelin type B receptor in complex with the ET1 peptide binder.
Upper Right: Receptor in complex with PepHAR-generated peptide binder. Lower Left: HIS16
in the ET1 peptide is identified as a hotspot residue. Lower Right: PepHAR recovers this hotspot
residue in the peptide design task.
32

32/33

Preprint, Under review
G.4 CASE STUDY ON GPCR-PEPTIDE INTERACTION
Here, we present a case study of generating a peptide binder for the human Endothelin type B
receptor (PDB: 5GLH) (Shihoya et al., 2018). As shown in Fig. 7, the designed peptide exhibits
secondary structures (helix) similar to the native peptide binder (ET1 peptide), interacting with
the receptor residues at the top and within the receptor. Through per-residue energy calculations
and manual inspection, the HIS16 residue in the ET1 peptide is identified as a hotspot residue,
playing a crucial role in contacting the receptor’s extracellular loop regions. Remarkably, PepHAR
recovers this HIS16 residue in the generated peptide. The orientation of the functional group in the
generated HIS16 closely resembles that of the native hotspot residue, demonstrating that PepHAR
can effectively recover hotspot interactions during the design process.
G.5 SECONDARY STRUCTURE ANALYSIS
Table 8: Secondary structure composition evaluation.
Method Coil % Helix % Strand %
Native 75.20 11.47 13.15
PepHAR (K = 3) 89.42 10.36 0.21
Hotspots 90.13 9.48 0.22
We analyzed the secondary structure proportions of native peptides, generated peptides, and hotspot
residues in the test set, as shown in Table 8.
Compared to native peptides, PepHAR-generated peptides and hotspots tend to exhibit a higher
proportion of coil regions (bonded turns, bends, or loops) while maintaining similar proportions of
helix regions. However, the proportion of strand regions is significantly lower compared to native
peptides, indicating that strand regions in the native structure are often replaced by coil regions in
the generated peptides. This could be due to subtle differences in structural parameters required for
forming strands. We believe that further energy relaxation using tools like Rosetta, OpenMM, or
FoldX could help refine the peptide structures to form more accurate secondary structures.
Regarding hotspot occurrences, we observe that hotspots are predominantly located in coil and helix
regions, with almost no presence in strand regions. This aligns with interaction principles, where
strand regions may represent structural transitions, while the irregular coil regions or the relatively
stable helix regions are more likely to serve as functional interaction sites.
H CURRENT WORKS ON PEPTIDE DESIGN METHOD
Recent advancements in computational peptide binder design have significantly benefited from deep
learning. Bryant & Elofsson (2022) introduced EvoBind, an in silico directed evolution platform that
utilizes AlphaFold to design peptide binders targeting specific protein interfaces using only sequence
information. Building on this, Li et al. (2024b) developed EvoBind2, which extends EvoBind’s
capabilities by enabling the design of both linear and cyclic peptide binders of varying lengths
solely from a protein target sequence, without requiring the specification of binding sites or binder
sizes. Using MCTS simulations and reinforcement learning,Wang et al. (2023) engineers peptide
binders and achieves comparable result to EvoBind. Chen et al. (2023) proposed PepMLM, a targetsequence-conditioned generator for linear peptide binders based on masked language modeling. By
employing a novel masking strategy, PepMLM effectively reconstructs binder regions, achieving
low perplexities and demonstrating efficacy in cellular models. Leveraging geometric convolutional
neural networks to decipher interaction fingerprints from protein interaction surfaces, Gainza et al.
(2020) developed MaSIF. Subsequently, BindCraft (Pacesa et al., 2024) combined AF2-Multimer
sampling with ProteinMPNN sequence design to optimize protein-protein interaction surfaces. Similar to our approach of defining key hotspot residues, AlphaProteo (Zambaldi et al., 2024) also aims
to generate high-affinity protein binders that specifically interact with designated residues on the
target protein. PepGLAD (Kong et al., 2024) encodes full-atom peptide structures and sequences
using an equivariant graph neural network, and applies latent diffusion on the peptide embedding
space to explore peptide generation.
33

33/33