HOTSPOT-DRIVEN PEPTIDE DESIGN VIA MULTI-FRAGMENT AUTOREGRESSIVE EXTENSION
Jiahan Li 1†, Tong Chen 2†, Shitong Luo 3, Chaoran Cheng 4, Jiaqi Guan 5, Ruihan Guo 6, Sheng Wang 2, Ge Liu 4, Jian Peng 6, Jianzhu Ma 1
1 Tsinghua University, 2 University of Washington, 3 Massachusetts Institute of Technology, 4 University of Illinois Urbana-Champaign, 5 ByteDance Inc., 6 Helixon Research
ced3ljhypc@gmail.com
majianzhu@tsinghua.edu.cn
ABSTRACT
Peptides, short chains of amino acids, interact with target proteins, making them a unique class of protein-based therapeutics for treating human diseases. Recently, deep generative models have shown great promise in peptide generation. However, several challenges remain in designing effective peptide binders. First, not all residues contribute equally to peptide-target interactions. Second, the generated peptides must adopt valid geometries due to the constraints of peptide bonds. Third, realistic tasks for peptide drug development are still lacking. To address these challenges, we introduce PepHAR, a hot-spot-driven autoregressive generative model for designing peptides targeting specific proteins. Building on the observation that certain hot spot residues have higher interaction potentials, we first use an energy-based density model to fit and sample these key residues. Next, to ensure proper peptide geometry, we autoregressively extend peptide fragments by estimating dihedral angles between residue frames. Finally, we apply an optimization process to iteratively refine fragment assembly, ensuring correct peptide structures. By combining hot spot sampling with fragment-based extension, our approach enables de novo peptide design tailored to a target protein and allows the incorporation of key hot spot residues into peptide scaffolds. Extensive experiments, including peptide design and peptide scaffold generation, demonstrate the strong potential of PepHAR in computational peptide binder design.
1 INTRODUCTION
Peptides, typically composed of 3 to 20 amino acid residues, are short single-chain proteins that interact with target proteins (Bodanszky, 1988). Peptides play essential roles in various biological processes, including cellular signaling and immune responses (Petsalaki & Russell, 2008), and are emerging as a promising class of therapeutic drugs for complex diseases such as diabetes, obesity, hepatitis, and cancer (Kaspar & Reichert, 2013). Currently, there are approximately 80 peptide drugs on the global market, 150 in clinical development, and 400–600 undergoing preclinical evaluation (Craik et al., 2013; Fosgerau & Hoffmann, 2015; Muttenthaler et al., 2021; Wang et al., 2022). Traditional peptide discovery methods rely on labor-intensive techniques like phage/yeast display for screening mutagenesis libraries (Boder & Wittrup, 1997; Wu et al., 2016), or energy-based computational tools to score candidate peptides (Raveh et al., 2011; Lee et al., 2018; Cao et al., 2022), both of which face limitations due to the immense combinatorial design space.
Recently, deep generative models, particularly diffusion and flow-based methods, have shown substantial promise in de novo protein design (Huang et al., 2016; Luo et al., 2022; Watson et al., 2022; Yim et al., 2023a; Bose et al., 2023). Given the close relationship between the structure and sequence of peptides and their target proteins (Grathwohl & Wüthrich, 1976; Vanhee et al., 2011), a few methods have successfully designed peptides conditioned on target information (Xie et al., 2023; Li et al., 2024a; Lin et al., 2024). These approaches typically represent peptide residues as rigid frames in the SE(3) manifold, angles in a torus manifold, and types in a statistical manifold (Cheng et al., 2024; Davis et al., 2024). Encoder-decoder architectures, particularly flow-matching methods (Lipman et al., 2022), are then used to generate all residues simultaneously.
Although these methods have achieved initial success in generating peptide binders with native-like structures and high affinities, several challenges remain. First, not all peptide residues contribute equally to binding. As shown in Figure 1, some residues establish key functional interactions with the target, possessing high stability and affinity. These are referred to as hot spot residues and are critical in drug discovery (Bogan & Thorn, 1998; Keskin et al., 2005; Moreira et al., 2007). Other residues, known as scaffolds, help position the hot spots in the binding region and stabilize the peptide (Matson & Stupp, 2012; Hosseinzadeh et al., 2021). Considering the different roles of these residues, generating all of them in one step may be inefficient. Second, the generated peptides must respect the constraints imposed by peptide bonds, which are non-rotatable and enforce fixed bond lengths and planar structures (Fisher, 2001). As illustrated in Figure 1, adjacent residues must maintain specific relative positions to form proper peptide bonds. A model that represents peptide backbone structures independently as local frames (Jumper et al., 2021) may neglect these geometric constraints. Third, in practical drug discovery, peptides are not always designed from scratch. Often, initial peptide candidates are optimized, or key hot spot residues are linked via scaffold residues (Zhang et al., 2009; Yu et al., 2023). Thus, more realistic in silico benchmarks are needed to simulate these scenarios.
To tackle these challenges, we propose PepHAR, a hot-spot-driven autoregressive generative model. By distinguishing between hot spot and scaffold residues, we break the generation process into three stages. First, we use an energy-based density model to capture the residue distribution around the target, and apply Langevin dynamics to sample statistically favorable and feasible hot spots. Next, instead of generating all residues simultaneously, we autoregressively extend fragments step by step, modeling dihedral angles parameterized by a von Mises distribution to maintain peptide bond geometry. Finally, since the generated fragments may not align perfectly, we apply a hybrid optimization function to assemble the fragments into a complete peptide. To simulate practical peptide drug discovery scenarios, we evaluate our method not only on de novo peptide design but also on scaffold generation, where the model scaffolds known hot spot residues into a functional peptide, akin to peptide design based on prior knowledge.
In summary, our key contributions are:
- We introduce PepHAR, an autoregressive generative model based on hot spot residues for peptide binder design;
- We address current challenges in peptide design by using an energy-based model for hot spot identification, autoregressive fragment extension for maintaining peptide geometry, and an optimization step for fragment assembly;
- We propose a new experimental setting, scaffold generation, to mimic practical scenarios, and demonstrate the competitive performance of our method in both peptide binder design and scaffold generation tasks.
2 RELATED WORK
Generative Models for Protein Design Generative models have shown significant promise in designing functional proteins (Yeh et al., 2023; Dauparas et al., 2023; Zhang et al., 2023c; Wang et al., 2021; Trippe et al., 2022; Yim et al., 2024). Some approaches focus on generating protein sequences using protein language models (Madani et al., 2020; Verkuil et al., 2022; Nijkamp et al., 2023) or through methods like directed evolution (Jain et al., 2022; Ren et al., 2022; Khan et al., 2022; Stanton et al., 2022). Others aim to design sequences based on backbone structures (Ingraham et al., 2019; Jing et al., 2020; Hsu et al., 2022; Li et al., 2022; Gao et al., 2022; Dauparas et al., 2022). For protein structures, which are crucial for determining protein function, diffusion-based (Luo et al., 2022; Watson et al., 2022; Yim et al., 2023b) and flow-based models (Yim et al., 2023a; Bose et al., 2023; Li et al., 2024a; Cheng et al., 2024) have been successfully applied to both unconditional (Campbell et al., 2024) and conditional protein design (Yim et al., 2024). However, these generative models typically treat all residues as equal, generating them simultaneously and overlooking the distinct roles of residues, such as those involved in catalytic sites (Giessel et al., 2022) or binding regions (Li et al., 2024a).
Computational Peptide Design The earliest methods for peptide design rely on protein or peptide templates (Bhardwaj et al., 2016; Hosseinzadeh et al., 2021; Swanson et al., 2022). These approaches use heuristic rules to search for similar sequences or structures in the PDB database as seeds for peptide design. A more prevalent class of models focuses on optimizing hand-crafted or statistical energy functions for peptide design (Cao et al., 2022; Bryant & Elofsson, 2023). While effective, these methods are computationally expensive and tend to get stuck in local minima (Raveh et al., 2011; Alford et al., 2017). Recently, deep generative models, such as GANs (Xie et al., 2023), diffusion models (Xie et al.; Wang et al., 2024), and flow models (Li et al., 2024a; Lin et al., 2024), have been applied to design peptide structures and sequences, conditioned on target protein information, offering more flexibility and efficiency in the design process.
3 PRELIMINARY
Protein Composition A protein or peptide is composed of multiple amino acid residues, each characterized by its type and backbone structure, which includes both position and orientation (Jumper et al., 2021). For the i-th residue, denoted as R_i = (c_i, x_i, O_i), its type c_i ∈ {1, ..., 20} refers to the class of its side-chain R group. The backbone position x_i ∈ R^3 represents the coordinates of the central Cα atom, while the backbone orientation O_i ∈ SO(3) is defined by the spatial configuration of the heavy backbone atoms (N-Cα-C). In this way, a protein can be represented as a sequence of N residues: [R_1, ..., R_N].
Problem Formulation The goal of this work is to generate a peptide D = [R_1, ..., R_N], consisting of N residues, based on a target protein T = [R_1, ..., R_M] of length M. We also define fragments, where the k-th fragment is denoted as F^(k, i_k, l_k) = [R_{i_k}, ..., R_{i_k + l_k − 1}], a contiguous subset of residues. Fragments are sequentially connected within the protein, where i_k indicates the index of the fragment's N-terminal residue in the original peptide, and l_k represents the fragment's length. Multiple fragments can be assembled into a complete protein based on their residue indices.
Directional Relations The sequential ordering from the N-terminal to the C-terminal residue, along with the covalent bonds between adjacent residues, is fundamental in our approach. As illustrated in Fig. 1, residues are linked via covalent peptide bonds (CO-NH), with each residue R_i connecting to its neighboring residues R_{i−1} and R_{i+1}. These peptide bonds have partial double-bond character, limiting their rotational freedom and resulting in a planar configuration for the atoms between adjacent residues (Cα, Cβ, O, and H atoms). The backbone structure of a protein can thus be described using dihedral angles, which define the spatial relations between these planes in 3D space. Each residue has three associated dihedrals: φ_i, ψ_i, and ω_i. The first two angles determine the geometric relationship between adjacent residues, while the third controls the position of the O atom. Given a protein's backbone structure, we can calculate the dihedral angles for each residue. Conversely, the backbone structure of neighboring residues can also be derived from the dihedral angles, which serve as the building blocks in our model. Specifically, given the backbone position x_i and orientation O_i, we can approximate the backbone structures of the neighboring residues R_{i−1} and R_{i+1} using coordinate transformations:

(x_{i−1}, O_{i−1}) = Left(x_i, O_i, ψ_{i−1}, φ_i),   (1)
(x_{i+1}, O_{i+1}) = Right(x_i, O_i, ψ_i, φ_{i+1}).   (2)

Details are included in Appendix B.
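To make the frame-dihedral relationship concrete, the sketch below computes a backbone dihedral angle from four consecutive atom positions, i.e., the inverse of the Left/Right reconstruction above. The NumPy helper and the coordinates are illustrative assumptions, not code from the paper.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four points,
    e.g. C(i-1), N(i), CA(i), C(i) for the phi angle of residue i."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1 = np.cross(b1, b2)                    # normal of the first plane
    n2 = np.cross(b2, b3)                    # normal of the second plane
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(np.dot(m1, n2), np.dot(n1, n2))

# Hypothetical coordinates of C(i-1), N(i), CA(i), C(i):
atoms = np.array([[0.00, 0.00, 0.0],
                  [1.33, 0.00, 0.0],
                  [1.80, 1.37, 0.0],
                  [3.30, 1.40, 0.3]])
phi = dihedral(*atoms)
print(np.degrees(phi))
```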
4 METHODS
To tackle the challenges in peptide design, we propose a three-stage approach to generate peptides D based on their target protein T. Our method involves generating hot-spot residues, extending fragments, and correcting peptide structures. As shown in Figure 2 and Algorithm 1, the first stage, the Founding Stage, independently generates a small number k of hot-spot residues R_{i_1}, ..., R_{i_k} from the learned residue distribution P(R | T). In the second stage, the Extension Stage, these hot-spot residues are used as starting points to progressively extend k fragments F_1, ..., F_k by adding new residues to the Left or Right in an autoregressive manner, until the total peptide reaches the desired length N. Finally, since each fragment is extended independently, the third stage, the Correction Stage, adjusts the sequences and structures of the fragments, refining them based on the gradients of the objective function to ensure valid geometries and meaningful peptide sequences.
4.1 FOUNDING STAGE
The founding stage first generates k hot-spot residues based on the target protein T. Under the residue distribution P(R | T), hot-spot residues are those with higher probabilities (i.e., lower energies) of appearing near the binding pocket than other backbone structures and residue types. The generation of hot-spot residues focuses on finding regions with high probabilities, where residues are more likely to interact directly with the target. Conversely, regions with low probabilities (i.e., high energies) will contain fewer peptide residues. For example, peptide residues should occur neither too far from nor too close to the pocket (Cao et al., 2022; Li et al., 2024a), as such regions may have near-zero probabilities (high energies) of residue occurrence. We parameterize P(R | T) using an energy-based model, which defines a conditional joint distribution over backbone position x, orientation O, and residue type c:
P_θ(c, x, O | T) = (1/Z) exp( g_{θ,c}(x, O | T) ).   (3)
Here, g_θ is a scoring function parameterized by an equivariant network, which quantifies the score of residue type c occurring at a given backbone structure with position x and orientation O. In other words, g_{θ,c} is the unnormalized log-probability of type c, and Z is the normalizing constant associated with sequence and structure information, which we do not explicitly estimate.
Algorithm 1: Peptide Sampling Outline
Data: Target protein T, peptide length N, hot-spot residue count k, and indices [i_1, ..., i_k]
Founding Stage
for j ← 1 to k do
    Sample a hot-spot residue R_{i_j} ∼ P_θ(c, x, O | T) based on Eq. 6, 7, and 8;
    Initialize fragment F^(j, i_j, l_j = 1) ← [R_{i_j}];
Extension Stage
while l_1 + ... + l_k < N do
    Randomly choose a fragment index j ∈ {1, ..., k} and a direction d ∈ {L, R};
    Set the starting residue as either the N-terminal R_{i_j} or the C-terminal R_{i_j + l_j − 1};
    Sample a new residue on the left (R_{i_j − 1}) or on the right (R_{i_j + l_j}) based on Eq. 15 and 16;
    Add the new residue to fragment F_j;
Merge fragments into the peptide D ← F_1 + ... + F_k;
Correction Stage
for t ← 1, ... do
    Calculate the objective J for the current peptide using Eq. 22;
    Update the peptide using gradients from Eq. 23 and 24;
return D = [R_1, ..., R_N]
Network Implementation The density model g_θ is parameterized by an Invariant Point Attention (IPA) backbone network (Jumper et al., 2021; Luo et al., 2022; Yim et al., 2023b), which is SE(3)-invariant. It takes positive residues (peptide residues) and negative residues (perturbed residues), along with the target protein, as input, encoding them into hidden representations. A shallow Multi-Layer Perceptron (MLP) is then used to classify residue types for likelihood evaluation.
Training We use Noise Contrastive Estimation (NCE) to train this parameterized energy-based model (Gutmann & Hyvärinen, 2010). NCE distinguishes between samples from the true data distribution (positive points) and samples from a noise distribution (negative points). The positive distribution corresponds to the ground truth residue distribution of the peptide over the target, (c, x, O) ∼ p(R | T), while the negative samples are drawn from the perturbed distribution (c_neg, x⁻, O⁻) ∼ p(R̃ | T) by adding large spatial noises to the ground truth positions and orientations, labeled with the type c_neg. As positive and negative data are sampled equally, the NCE objective for a single positive data point is:
ℓ(c, x, O | T) = log [ exp g_{θ,c}(x, O | T) / ( ∑_{c'} exp g_{θ,c'}(x, O | T) + p(c_neg, x, O | T) ) ].   (4)
As a common practice (Gutmann & Hyvärinen, 2012), we fix the negative probability p(c_neg, x, O | T) to a constant, simplifying the evaluation of log-likelihoods for negative samples. The final loss function is given by:
L_NCE = −E_+[ ℓ(c, x, O | T) ] − E_−[ ℓ(c_neg, x⁻, O⁻ | T) ].   (5)
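A minimal PyTorch sketch of this NCE objective, under the paper's convention of a fixed noise density; the tensor shapes, the scoring-network outputs, and the value of log_p_neg are assumptions for illustration.

```python
import torch

def nce_loss(pos_scores, pos_types, neg_scores, log_p_neg=0.0):
    """Sketch of Eq. 4-5.

    pos_scores: [B, 20] type scores g_{theta,c'} for real residues.
    pos_types:  [B] ground-truth type indices c.
    neg_scores: [B, 20] type scores for spatially perturbed residues.
    log_p_neg:  fixed log-density of the noise class (constant, per
                the paper; its value here is an assumption).
    """
    const = pos_scores.new_full((pos_scores.size(0), 1), log_p_neg)
    # log( sum_c' exp g_c' + p_neg ) for positive and negative batches.
    pos_denom = torch.logsumexp(torch.cat([pos_scores, const], -1), -1)
    neg_denom = torch.logsumexp(torch.cat([neg_scores, const], -1), -1)
    # Positives: numerator is exp g_c of the true type (Eq. 4).
    pos_ll = pos_scores.gather(1, pos_types[:, None]).squeeze(1) - pos_denom
    # Negatives: numerator is the constant noise density p_neg.
    neg_ll = log_p_neg - neg_denom
    return -(pos_ll.mean() + neg_ll.mean())
```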
Sampling In the founding stage, we sample k hot-spot residues from the learned energy-based distribution, where k is kept small relative to the peptide length (e.g., k = 1–3 in our experiments). Since hot spots are assumed to be sparsely distributed along the peptide (Bogan & Thorn, 1998), we sample them approximately independently. For each hot-spot residue, we employ the Langevin MCMC sampling algorithm (Welling & Teh, 2011), starting from an initial guessed position x^0 and orientation O^0, and iteratively updating them using the following gradients:
x^{t+1} ← x^t + (ε²/2) ∑_{c'} ∇_x g_{θ,c'}(x^t, O^t | T) + ε z_x^t,   z_x^t ∼ N(0, I_3),   (6)

O^{t+1} ← exp_{O^t}( (ε²/2) ∑_{c'} ∇_O g_{θ,c'}(x^t, O^t | T) + ε Z_O^t ),   Z_O^t ∼ TN_{O^t}(0, I_3),   (7)

c^{t+1} ∼ softmax g_θ(x^t, O^t | T).   (8)
Since orientation lies in the SO(3) space, we employ the exponential map and a Riemannian random walk on the tangent space for updates (De Bortoli et al., 2022). The summation over all possible
residue types ensures that we transition from regions of low occurrence probability to regions of higher probability. Finally, after each iteration, the residue type c is sampled conditioned on the updated position and orientation.
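As a rough autograd sketch, the Euclidean part of the Langevin update (Eq. 6) might look as follows; g_theta is a hypothetical callable returning the 20 type scores, and the SO(3) update of Eq. 7, which requires an exponential-map step on the rotation manifold, is omitted.

```python
import torch

def langevin_position_step(g_theta, x, O, target, eps=0.01):
    """One update of Eq. 6 for the backbone position x ([3] tensor).

    g_theta(x, O, target) -> [20] type scores; a hypothetical stand-in
    for the paper's density model.
    """
    x = x.detach().requires_grad_(True)
    energy = g_theta(x, O, target).sum()       # sum over all types c'
    grad_x, = torch.autograd.grad(energy, x)   # grad_x of sum_c' g_{theta,c'}
    noise = torch.randn_like(x)                # z ~ N(0, I_3)
    return (x + 0.5 * eps**2 * grad_x + eps * noise).detach()
```

After each step, the type can be resampled as in Eq. 8, e.g., via torch.distributions.Categorical(logits=g_theta(x, O, target)).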
4.2 EXTENSION STAGE
The extension stage expands fragments into longer sequences, starting from the sampled hot-spot residues. At each extension step, we add a new residue to either the left or right of a fragment F. Based on the relationship between adjacent residues (Eq. 1, 2), the backbone structure of the new residue is inferred from its dihedral angles and the structure of the adjacent residue, which is either the N-terminal or C-terminal residue of the fragment. Specifically, when connecting a new residue to residue R_{i_j} in the j-th fragment F_j, we model the dihedral angle distribution P(φ, ψ | d, R_{i_j}, E), where d ∈ {L, R} indicates the extension direction, and E represents the surrounding residues, including the target T and other residues in the currently generated fragments.
P(φ, ψ | d, R_{i_j}, E) = { P(ψ_{i_j−1}, φ_{i_j}), d = L;  P(ψ_{i_j}, φ_{i_j+1}), d = R }.   (9)
Since multiple angles are involved, the dihedral angle distribution is modeled as a product of parameterized von Mises distributions (Lennox et al., 2009), which use cosine distance instead of L2 distance to measure the difference between angles, behaving like circular normal distributions. For example, when d = L, we have:
P(ψ_{i_j−1}, φ_{i_j}) = f_VM(ψ_{i_j−1}; µ^ψ_{i_j−1}, κ^ψ_{i_j−1}) · f_VM(φ_{i_j}; µ^φ_{i_j}, κ^φ_{i_j}),   (10)

f_VM(ψ_{i_j−1}; µ^ψ_{i_j−1}, κ^ψ_{i_j−1}) = 1 / (2π I_0(κ^ψ_{i_j−1})) · exp( κ^ψ_{i_j−1} · cos(µ^ψ_{i_j−1} − ψ_{i_j−1}) ),   (11)

f_VM(φ_{i_j}; µ^φ_{i_j}, κ^φ_{i_j}) = 1 / (2π I_0(κ^φ_{i_j})) · exp( κ^φ_{i_j} · cos(µ^φ_{i_j} − φ_{i_j}) ).   (12)

Here, I_0(·) denotes the modified Bessel function of the first kind of order 0. The four distribution parameters are predicted by a neural network h_θ, referred to as the prediction network. Similarly, for d = R, the network predicts another set of four parameters:

h_θ(d, R_{i_j}, E) = { (µ^ψ_{i_j−1}, κ^ψ_{i_j−1}, µ^φ_{i_j}, κ^φ_{i_j}), d = L;  (µ^ψ_{i_j}, κ^ψ_{i_j}, µ^φ_{i_j+1}, κ^φ_{i_j+1}), d = R }.   (13)
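Since PyTorch ships a von Mises distribution, the density in Eqs. 10-12 and the negative log-likelihood used for training (Eq. 14) can be sketched directly; the (µ, κ) values below are hypothetical network outputs.

```python
import torch
from torch.distributions import VonMises

# Hypothetical network outputs for one left-extension step (Eq. 13):
mu_psi, kappa_psi = torch.tensor(-0.8), torch.tensor(4.0)
mu_phi, kappa_phi = torch.tensor(-1.2), torch.tensor(6.0)

psi_dist = VonMises(loc=mu_psi, concentration=kappa_psi)
phi_dist = VonMises(loc=mu_phi, concentration=kappa_phi)

# Sampling new dihedrals, as in Eq. 15-16.
psi, phi = psi_dist.sample(), phi_dist.sample()

# Joint log-density of Eq. 10; its negation is the per-example MLE
# loss of Eq. 14 for ground-truth angles (placeholders below).
psi_gt, phi_gt = torch.tensor(-0.7), torch.tensor(-1.1)
nll = -(psi_dist.log_prob(psi_gt) + phi_dist.log_prob(phi_gt))
print(float(nll))
```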
Network Implementation The prediction network h_θ uses the same IPA backbone to extract features. However, to avoid data leakage during training, since the neighboring backbone structures are known and dihedral angles can be derived analytically, we employ directional masks in the attention module. For instance, if the direction is Left, residues can only attend to their neighbors on the right during attention updates, and vice versa for Right.
Training We optimize the network parameters using Maximum Likelihood Estimation (MLE) over directions d ∼ {L, R} and peptides in the peptide-target complex dataset. The MLE objective is given by:

L_MLE = −E[ log P(φ, ψ | d, R_{i_j}, E) ].   (14)
Sampling During the extension stage, we generate k fragments corresponding to the k hot-spot residues from the founding stage. The extension process is iterative: fragments are autoregressively extended until the total peptide length (the sum of fragment lengths) reaches a predefined value (e.g., the length of the native peptide). Consider a one-step extension of fragment F in direction d. The starting residue R_{i_j} depends on the direction: d = L implies adding a residue to the left of the fragment, making R_{i_j} the N-terminal residue (first residue); d = R implies adding to the right, making R_{i_j} the C-terminal residue (last residue). The other residues in the fragment and the target form the environment E. We then sample the dihedral angles for the new residue in the chosen direction from the predicted distribution, using h_θ. For example, when d = L:
ψ_{i_j−1} ∼ f_VM(ψ_{i_j−1}; h_θ(d = L, R_{i_j}, E)),   (15)
φ_{i_j} ∼ f_VM(φ_{i_j}; h_θ(d = L, R_{i_j}, E)).   (16)
Next, the backbone structure of the newly added residue R_{i_j−1} is reconstructed using the transformations in Eq. 1. The residue type is then estimated by the density model g_θ used during the founding stage:

(x_{i_j−1}, O_{i_j−1}) = Left(x_{i_j}, O_{i_j}, ψ_{i_j−1}, φ_{i_j}),   (17)
c_{i_j−1} ∼ softmax g_θ(x_{i_j−1}, O_{i_j−1} | E).   (18)
Finally, the process is repeated for another randomly selected fragment and direction.
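Putting Eqs. 15-18 together, one left-extension step might be organized as in the sketch below; h_theta, g_theta, and left_transform are hypothetical stand-ins for the prediction network, the density model, and the Left transformation, and the residue data layout is an assumption.

```python
import torch
from torch.distributions import VonMises, Categorical

def extend_left(fragment, env, h_theta, g_theta, left_transform):
    """One autoregressive extension step in direction d = L.

    fragment: list of residues, each a dict with 'x' ([3] position),
    'O' ([3, 3] orientation), and 'c' (type index). All helper
    callables are assumptions, not the paper's actual API.
    """
    start = fragment[0]                       # current N-terminal residue
    mu_psi, k_psi, mu_phi, k_phi = h_theta('L', start, env)
    psi = VonMises(mu_psi, k_psi).sample()    # Eq. 15
    phi = VonMises(mu_phi, k_phi).sample()    # Eq. 16
    # Reconstruct the new backbone frame from the dihedrals (Eq. 17).
    x_new, O_new = left_transform(start['x'], start['O'], psi, phi)
    # Sample the residue type from the density model (Eq. 18).
    c_new = Categorical(logits=g_theta(x_new, O_new, env)).sample()
    return [{'x': x_new, 'O': O_new, 'c': c_new}] + fragment
```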
4.3 CORRECTION STAGE
Although we autoregressively extend each fragment, the resulting fragments may not form a valid peptide with accurate geometry. For example, some fragments may not maintain proper distances from each other, leading to broken peptide bonds, while others may have incorrect dihedrals or residue types in relation to the whole peptide and the target protein. Some fragments may also exhibit atom clashes with the target protein. Inspired by traditional methods using hand-crafted energy functions (Alford et al., 2017), we introduce a correction stage as a post-processing step to refine the generated peptides. Rather than relying on empirical functions, we use the learned, network-parameterized distributions from the first two stages to regularize the peptides.
For a generated peptide D = [(c_1, x_1, O_1), ..., (c_N, x_N, O_N)], the dihedrals of each residue are derived from the backbone structures of adjacent residues. To ensure self-consistency between dihedrals and backbone structures, we use the dihedrals to estimate new backbone structures and compare them with the original ones. The distance between these backbone structures reflects the validity of the generated peptide with respect to peptide bond properties and planarity. We define the distance between two residues' backbone structures separately for position and orientation, and derive the backbone objective considering both directions:
d((x_i, O_i), (x_j, O_j)) = ‖x_i − x_j‖² + ‖log(O_i) − log(O_j)‖²,   (19)

J_bb = −∑_{i=2}^{N} d( Left(x_i, O_i, ψ_{i−1}, φ_i), (x_{i−1}, O_{i−1}) ) − ∑_{i=1}^{N−1} d( Right(x_i, O_i, ψ_i, φ_{i+1}), (x_{i+1}, O_{i+1}) ).   (20)
Additionally, the dihedral angles must conform to the learned distribution P(φ, ψ) to ensure correct geometric relationships between neighboring residues. This leads to the dihedral objective, which is similar to Eq. 14. However, in Eq. 14, we optimize the network parameters to fit the angle distribution, whereas here, we keep the learned networks fixed and update the dihedrals instead:
J_ang = −∑_{i=2}^{N} log P(ψ_{i−1}, φ_i) − ∑_{i=1}^{N−1} log P(ψ_i, φ_{i+1}).   (21)
The final optimization objective is a weighted sum of the backbone and dihedral objectives. We iteratively update the peptide's backbone structures by taking gradients, similar to the founding stage, but here we optimize the entire peptide at each timestep. The residue types are predicted by the density model g_θ at the end of each update step. Unlike the founding stage, where we started from random structures, the correction stage begins with the complete peptide.
J_corr = λ_bb J_bb + λ_ang J_ang,   (22)
(x_i^{t+1}, O_i^{t+1}) ← update( x_i^t, O_i^t, ∇_{x_i} J, ∇_{O_i} J ),   (23)
c^{t+1} ∼ softmax g_θ(x^t, O^t | E).   (24)
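A minimal sketch of this correction loop, assuming differentiable implementations of J_bb and J_ang; it ascends J_corr by plain gradient steps on Euclidean parameters, whereas the paper also updates orientations on SO(3) and resamples residue types after each step.

```python
import torch

def correction_stage(x, angles, j_bb, j_ang,
                     lam_bb=1.0, lam_ang=1.0, lr=1e-2, steps=100):
    """Refine peptide positions x ([N, 3]) and dihedrals angles
    ([N, 2], phi/psi) by ascending J_corr (Eq. 22). j_bb and j_ang
    are hypothetical differentiable callables; the SO(3) orientation
    update of Eq. 23 is omitted for brevity.
    """
    x = x.clone().requires_grad_(True)
    angles = angles.clone().requires_grad_(True)
    opt = torch.optim.Adam([x, angles], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        j_corr = lam_bb * j_bb(x, angles) + lam_ang * j_ang(angles)
        (-j_corr).backward()   # gradient ascent on J_corr (Eq. 23)
        opt.step()
    return x.detach(), angles.detach()
```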
5 EXPERIMENTS
Overview We evaluate PepHAR and several baseline methods on two main tasks: (1) Peptide Binder Design and (2) Peptide Scaffold Generation. In Peptide Binder Design, we co-generate both the structure and sequence of peptides based on their binding pockets within the target protein. However, in real-world drug discovery, prior knowledge, such as key interaction residues at the binding interface or initial peptides for optimization, is often available. Therefore, we introduce Peptide Scaffold Generation to assess how well models can scaffold and link these key residues into complete peptides, reflecting more practical applications. Details are included in Appendix E.
Dataset Following Li et al. (2024a), we construct our training and test datasets. This moderate-length benchmark is derived from PepBDB (Wen et al., 2019) and Q-BioLip (Wei et al., 2024), with duplicates and low-quality entries removed. The binding pocket is defined as the residues in the target protein that have heavy atoms lying within a 10 Å radius of any heavy atom in the peptide. The dataset consists of 158 complexes across 10 clusters from mmseqs2 (Steinegger & Söding, 2017), with an additional 8,207 non-homologous examples used for training and validation.
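As a rough illustration of this pocket definition (not the paper's actual preprocessing code), pocket residues can be selected by heavy-atom distance; the array layout is an assumption.

```python
import numpy as np

def binding_pocket(target_atoms, target_res_ids, peptide_atoms, cutoff=10.0):
    """target_atoms: [M, 3] heavy-atom coordinates of the target;
    target_res_ids: [M] residue index of each target atom;
    peptide_atoms: [P, 3] heavy-atom coordinates of the peptide.
    Returns the set of target residue indices forming the pocket."""
    # Pairwise distances between every target atom and peptide atom.
    diff = target_atoms[:, None, :] - peptide_atoms[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)       # [M, P]
    near = (dists <= cutoff).any(axis=1)        # atoms near the peptide
    return set(target_res_ids[near].tolist())
```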
5.1 PEPTIDE BINDER DESIGN
In Peptide Binder Design, we co-generate peptide sequences and structures conditioned on the binding pockets of their target proteins. The models are provided with both the sequence and structure of the target protein pockets and tasked with generating bound-state peptides.
Metrics A robust generative model should produce diverse, valid peptides with favorable stability and affinity. The following metrics are employed: (1) Valid measures whether the distance between adjacent residues is consistent with peptide bond formation, considering Cα atoms within 3.8 Å as valid (Chelvanayagam et al., 1998; Zhang et al., 2012). (2) RMSD (Root-Mean-Square Deviation) compares the generated peptide structures to the native ones based on Cα distances after alignment. (3) SSR (Secondary Structure Ratio) evaluates the proportion of shared secondary structures between the generated and native peptides, labeled by DSSP (Kabsch & Sander, 1983). (4) BSR (Binding Site Rate) assesses the similarity of peptide-target interactions by measuring the overlap of binding sites. (5) Stability calculates the percentage of generated complexes that are more stable (lower total energy) than their native counterparts, based on Rosetta energy functions (Chaudhury et al., 2010; Alford et al., 2017). (6) Affinity measures the percentage of peptides with higher binding affinities (lower binding energies) than the native peptide. Beyond geometric and energetic factors, the model should also exhibit strong generalizability in discovering novel peptides. (7) Novelty is the ratio of novel peptides, defined by two criteria: (a) TM-score ≤ 0.5 (Zhang & Skolnick, 2005) and (b) sequence identity ≤ 0.5. (8) Diversity quantifies structural and sequence variability, calculated as the product of pairwise (1 − TM-score) and (1 − sequence identity) across all generated peptides for a given target. (9) Success rate evaluates the proportion of AF2-predicted complex structures with an ipTM value higher than 0.6.
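For instance, the Valid criterion can be checked as in this sketch, using the 3.8 Å Cα-Cα threshold from the text; the coordinate layout is an assumption.

```python
import numpy as np

def is_valid_peptide(ca_coords, max_dist=3.8):
    """ca_coords: [N, 3] C-alpha coordinates in residue order.
    A peptide counts as valid when every adjacent C-alpha pair is
    within max_dist angstroms, consistent with a peptide bond."""
    gaps = np.linalg.norm(np.diff(ca_coords, axis=0), axis=-1)
    return bool((gaps <= max_dist).all())
```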
Baselines We compare PepHAR against several state-of-the-art peptide design models. RFDiffusion (Watson et al., 2022) uses pre-trained weights from RoseTTAFold (Baek et al., 2021) and generates protein backbone structures through a denoising diffusion process; peptide sequences are then recovered using ProteinMPNN (Dauparas et al., 2022). ProteinGenerator (Lisanza et al., 2023) augments RFDiffusion with joint sequence-structure generation. PepFlow (Li et al., 2024a) models full-atom peptides and samples them using a flow-matching framework on a Riemannian manifold. PepGLAD (Kong et al., 2024) employs equivariant latent diffusion networks to generate full-atom peptide structures.
Results As shown in Table 1, PepHAR effectively generates peptides that exhibit valid geometries, native-like structures, and improved energies. While RFDiffusion produces valid peptides due to its pretrained protein folding weights, PepFlow, which is trained solely on peptide datasets, struggles with generating valid peptides, making it challenging for practical applications. In contrast, PepHAR's autoregressive generation based on dihedral angles not only ensures the production of valid peptides but also allows for precise placement at the binding site with accurate secondary structures. Similar to previous work (Li et al., 2024a), RFDiffusion excels at generating stable peptide-target structures, while PepHAR demonstrates competitive performance compared to PepFlow. Additionally, PepHAR shows impressive results in terms of novelty and diversity, highlighting its potential for exploring peptide distributions and designing a wide range of peptides for real-world applications. Figure 3 illustrates two examples of peptides generated by PepHAR, which closely resemble the structures and binding sites of native peptides while exhibiting low binding energies, indicating high affinities for the target.
5.2 PEPTIDE SCAFFOLD GENERATION
Compared to designing peptides from scratch, a more practical approach involves leveraging prior knowledge, such as key interaction residues. We introduce this as the task of scaffold generation, where certain hot spot residues in the peptide are fixed, and the model must generate a complete peptide by connecting these residues. In this context, the generated peptide should incorporate the hot spot residues in the correct positions, effectively scaffolding them. Hot spot residues are selected based on their higher potential for interacting with the target protein. To identify these, we first calculate the energy of each residue using an energy function (Alford et al., 2017), then manually select residues that are both energetically favorable and sparsely distributed along the peptide sequence. These selected residues are fixed as the condition for scaffold generation.
Baselines and Metrics We use the same baselines and metrics as in the Peptide Binder Design task. Specifically, for RFDiffusion and ProteinGenerator, the known hot spot residues are provided as an additional condition, along with the target. For PepFlow, we modify the ODE sampling process by initializing it with the ground truth hot spot residues and restrict the model to modifying only the remaining residues. In our method, we replace the sampled hot spot residues with the known ground truth residues.
Results As shown in Table 2, PepHAR demonstrates excellent performance in scaffolding hot spot residues into complete peptides. Given that the hot spot residues have functional binding capabilities, while the scaffold residues contribute primarily to structural integrity, the generated peptides are expected to possess valid, native structures with high stability. PepHAR successfully generates valid and native-like structures, ensuring that scaffold residues do not disrupt interactions between hot spot residues, achieving the best scores in SSR and BSR. Moreover, PepHAR achieves competitive stability results compared to RFDiffusion, which is trained on a larger PDB dataset. Additionally,
PepHAR produces novel and diverse scaffolds. Figure 5 presents two examples of scaffolded peptides alongside native peptides and given residues. The generated scaffolds exhibit similar structures to the native ones, while displaying variations in geometry and orientation at the midpoints and ends of the peptides, indicating flexibility in the scaffolding regions. Furthermore, the generated scaffolds often have lower total energy than the native peptides, suggesting enhanced stability of the complex and improved interaction potential.
5.3 ANALYSIS
Effect of Hot Spots Comparing Tables 1 and 2, we observe that introducing hot spots as prior knowledge significantly boosts PepHAR's performance, while providing little benefit to RFDiffusion and PepFlow. This highlights PepHAR's versatility across different design tasks. We also investigate the effect of varying the number of hot spots, denoted as K = 1, 2, 3. As shown in the tables and Figure 3, increasing the number of hot spots improves geometries and energies, regardless of whether the hot spots are estimated by density models or provided as ground truth; however, it negatively impacts novelty and diversity. This illustrates a trade-off between designing low-diversity but high-quality peptides (in comparison to the native) and high-diversity but lower-quality ones (Luo et al., 2022; Li et al., 2024a).
Ablation Study Table 3 presents our ablation study, which assesses the effectiveness of different components in PepHAR. 'PepHAR w/o Hot Spot' refers to the model where hot spots sampled from the density model are replaced with randomly positioned and typed residues. 'PepHAR w/o Von Mises' indicates the use of direct angle predictions instead of modeling angle distributions. We also remove the correction stage in 'PepHAR w/o Correction.' Our findings reveal that generated hot spots are crucial for Valid, RMSD, SSR, and BSR metrics, underscoring their importance for achieving valid geometries and interactions. Modeling angle distributions also contributes positively by accounting for the flexibility of dihedral angles. Lastly, the final correction stage plays a vital role in enhancing fragment assembly, leading to peptides with higher affinity and stability, which are essential for effective protein binding.
6 CONCLUSION
In this work, we presented PepHAR, a hot-spot-based autoregressive generative model designed for efficient and precise peptide design targeting specific proteins. By addressing key challenges in peptide design, such as the unequal contribution of residues, the geometric constraints imposed by peptide bonds, and the need for practical benchmarking scenarios, PepHAR provides a comprehensive approach for generating peptides from scratch or assembling peptides around key hot spot residues. Our method leverages energy-based hot spot sampling, autoregressive fragment extension through dihedral angles, and an optimization process to ensure valid peptide assembly. Through extensive experiments on both peptide generation and scaffold-based design, we demonstrated the effectiveness of PepHAR in computational peptide design, highlighting its potential for advancing drug discovery and therapeutic development.
REFERENCES
Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571 , 2022.
Rebecca F Alford, Andrew Leaver-Fay, Jeliazko R Jeliazkov, Matthew J O'Meara, Frank P DiMaio, Hahnbeom Park, Maxim V Shapovalov, P Douglas Renfrew, Vikram K Mulligan, Kalli Kappel, et al. The rosetta all-atom energy function for macromolecular modeling and design. Journal of chemical theory and computation , 13(6):3031-3048, 2017.
Namrata Anand and Tudor Achim. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv preprint arXiv:2205.15019 , 2022.
J Atchison and Sheng M Shen. Logistic-normal distributions: Some properties and uses. Biometrika, 67(2):261-272, 1980.
Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems , 34:17981-17993, 2021.
Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871-876, 2021.
Heli Ben-Hamu, Samuel Cohen, Joey Bose, Brandon Amos, Maximillian Nickel, Aditya Grover, Ricky TQ Chen, and Yaron Lipman. Matching normalizing flows and probability paths on manifolds. In International Conference on Machine Learning , pp. 1749-1763. PMLR, 2022.
Nathaniel R Bennett, Brian Coventry, Inna Goreshnik, Buwei Huang, Aza Allen, Dionne Vafeados, Ying Po Peng, Justas Dauparas, Minkyung Baek, Lance Stewart, et al. Improving de novo protein binder design with deep learning. Nature Communications , 14(1):2625, 2023.
Gaurav Bhardwaj, Vikram Khipple Mulligan, Christopher D Bahl, Jason M Gilmore, Peta J Harvey, Olivier Cheneval, Garry W Buchko, Surya VSRK Pulavarti, Quentin Kaas, Alexander Eletsky, et al. Accurate de novo design of hyperstable constrained peptides. Nature , 538(7625):329-335, 2016.
Suhaas Bhat, Kalyan Palepu, Vivian Yudistyra, Lauren Hong, Venkata Srikar Kavirayuni, Tianlai Chen, Lin Zhao, Tian Wang, Sophia Vincoff, and Pranam Chatterjee. De novo generation and prioritization of target-binding peptide motifs from sequence alone. bioRxiv , pp. 2023-06, 2023.
José Luis Blanco-Claraco. A tutorial on SE(3) transformation parameterizations and on-manifold optimization. arXiv preprint arXiv:2103.15980, 2021.
Miklos Bodanszky. Peptide Chemistry: A Practical Textbook. 1988.
Eric T Boder and K Dane Wittrup. Yeast surface display for screening combinatorial polypeptide libraries. Nature biotechnology , 15(6):553-557, 1997.