Introduction

As of May 2024, the PDB archive stores more than 80,000 macromolecular structures that contain at least one protein chemical modification (PCM) or post-translational modification (PTM). These modifications are highly diverse, ranging from small groups (e.g. methylation) to very large polymers (e.g. glycosylation). This diversity makes it challenging to represent them in a format that is consistent and accessible to the community.

In consultation with the PTM community, all entries containing PCMs/PTMs in the PDB archive are remediated to standardize and enrich the PCM/PTM data that they contain. This process involves:

Standardizing atom naming and adding annotation of protein backbone and terminal atoms within peptide residues
Adding new PDBx/mmCIF data categories that classify PCMs/PTMs by type and linking partner in the CCD definitions
Adding new PDBx/mmCIF data categories that provide enriched PCM/PTM annotation in the coordinate files
Labeling entries that contain PCMs/PTMs in the coordinate files

The standardization and enrichment of PCM information across all PDB entries allows PCM/PTM data to be more Findable, Accessible, Interoperable and Reusable (FAIR). Together, this supports the wider usage and analysis of protein modification data in the PDB archive.

Examples and Dictionary extensions

Examples of PDB entries and CCD files that contain the new PCM data categories can be accessed on the wwPDB GitHub.

Dictionary extensions for both the backbone remediation and the PCM/PTM remediation are also available in text or markdown on the wwPDB GitHub.

Lists of all remediated entries

The list of remediated and to be remediated PDB entries is provided here.

The list of remediated CCD IDs is provided here.

New PDBx/mmCIF Data Categories/Items

CCD Definitions

An extended data item, _chem_comp.pdbx_pcm, is introduced to state whether the CCD can be used to describe a PCM.

For example, the modification of lysine with pyridoxal phosphate could be described in two ways: as the single modified residue LLP, or as a standard lysine linked to the separate ligand PLP.

In this example, the value of _chem_comp.pdbx_pcm is set to 'Y' for LLP and 'N' for PLP. This ensures that PLP is not used to describe this PCM.
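A minimal sketch of how a script might check this flag. The helper names `read_items` and `describes_pcm` are hypothetical, and the two fragments are cut-down illustrations, not complete CCD files. The parser only handles simple single-line `_category.item value` pairs, so a full mmCIF parser would be needed for real files.

```python
def read_items(cif_text: str) -> dict[str, str]:
    """Collect single-line `_category.item value` pairs (no loops, no multi-line values)."""
    items = {}
    for line in cif_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].startswith("_"):
            items[parts[0]] = parts[1].strip().strip("'\"")
    return items


def describes_pcm(cif_text: str) -> bool:
    """True when the CCD definition sets _chem_comp.pdbx_pcm to 'Y'."""
    return read_items(cif_text).get("_chem_comp.pdbx_pcm") == "Y"


# Illustrative fragments only; real CCD files contain many more items.
LLP_FRAGMENT = """\
data_LLP
_chem_comp.id        LLP
_chem_comp.pdbx_pcm  Y
"""

PLP_FRAGMENT = """\
data_PLP
_chem_comp.id        PLP
_chem_comp.pdbx_pcm  N
"""

print(describes_pcm(LLP_FRAGMENT))  # True
print(describes_pcm(PLP_FRAGMENT))  # False
```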

A new category, _pdbx_chem_comp_pcm, is added to provide additional information about the modification including type and category classification, modified residue, linking atom pairs, linking position, and UniProt references, if available.

For example, the CCD PLM (PALMITIC ACID) is classified as palmitoylation, a type of lipidation. Due to the diversity of PCMs, not all modifications can be handled using the same approach. Therefore, PCMs have been classified into broad categories. Within each category, the protein modifications are handled consistently.

There are four approaches to handling PCMs:

Approach	Categories	Examples
Part of peptide residue	Chromophore/chromophore-like; Named protein modification; Non-standard residue	Phosphoserine (SEP); N-Trimethyl-lysine (M3L)
Linked to peptide residue	ADP-Ribose; Biotin; Carbohydrate; Covalent chemical modification; Crosslinker; Flavin; Heme/heme-like; Lipid/lipid-like; Nucleotide monophosphate	N-acetyl-D-glucosamine (NAG); Palmitic acid (PLM)
Direct covalent bond	Disulfide bridge; Isopeptide bond; Non-standard linkage	Cysteine-Cysteine covalent bond; Lysine-Asparagine covalent bond
Part of peptide sequence	Terminal acetylation; Terminal amidation	Acetyl group (ACE); Amino group (NH2)
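The table above can be expressed as a simple lookup, which may be convenient when grouping modifications by how they are handled. This is only a sketch: the names `CATEGORIES_BY_APPROACH`, `APPROACH_BY_CATEGORY` and `approach_for` are hypothetical, and the category strings are copied from the table rather than from the PDBx/mmCIF dictionary.

```python
# Approach -> categories, transcribed from the table above.
CATEGORIES_BY_APPROACH = {
    "Part of peptide residue": [
        "Chromophore/chromophore-like",
        "Named protein modification",
        "Non-standard residue",
    ],
    "Linked to peptide residue": [
        "ADP-Ribose",
        "Biotin",
        "Carbohydrate",
        "Covalent chemical modification",
        "Crosslinker",
        "Flavin",
        "Heme/heme-like",
        "Lipid/lipid-like",
        "Nucleotide monophosphate",
    ],
    "Direct covalent bond": [
        "Disulfide bridge",
        "Isopeptide bond",
        "Non-standard linkage",
    ],
    "Part of peptide sequence": [
        "Terminal acetylation",
        "Terminal amidation",
    ],
}

# Inverted view: category -> approach.
APPROACH_BY_CATEGORY = {
    category: approach
    for approach, categories in CATEGORIES_BY_APPROACH.items()
    for category in categories
}


def approach_for(category: str) -> str:
    """Return the handling approach for a PCM category; raises KeyError if unknown."""
    return APPROACH_BY_CATEGORY[category]


print(approach_for("Lipid/lipid-like"))  # Linked to peptide residue
print(approach_for("Disulfide bridge"))  # Direct covalent bond
```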

Modifications that are part of the peptide residue

Most small and well-known PCMs are in the category 'Named protein modification'. This covers all PTMs that are not explicitly covered by another category, including phosphorylation and methylation.

Small modifications are handled as part of the peptide residue to ensure that they are grouped into chemical units that are meaningful and findable by the community. For example, it is easier to find N-Trimethyl-lysine as a single chemical unit rather than a standard lysine linked to three separate methyl groups.

Chromophores and other non-standard residues are also handled using this approach. This is because they cannot be unambiguously split into a standard amino acid and the modification group.

Modifications that are linked to a peptide residue

Not all modifications can be defined as part of the peptide residue and instead need to have a separate CCD definition to describe the modification group.

In these cases, the CCD that describes the modification group contains the details of all the PCMs that it can form. For example, the CCD for palmitic acid (PLM) describes all the palmitoylations observed in the PDB archive.

Modifications that are direct covalent bonds

Some PCMs have no modification group and are instead direct covalent bonds between two peptide residues. The most common example is the disulfide bridge, but there are other such modifications, including isopeptide bonds, Selenocysteine-Selenocysteine bonds and peptide cyclisation bonds.

There are no CCDs that describe these modification groups because the modification contains no atoms. Therefore, these modifications are annotated in the mmCIF entry file that describes the macromolecular structure, but not in any CCD definition.

Modifications that are part of the peptide sequence

These are modifications that are both linked to the peptide residue they modify and part of the polypeptide sequence. Terminal acetylation and amidation are the only modifications that are handled in this way.

Coordinate Files

A new data item, _pdbx_entry_details.has_protein_modification, is created to indicate whether a PDB entry contains any PCMs.

A new category, _pdbx_modification_feature, is created to provide a list of all the PCMs that occur within a PDB entry and detailed information that enables the modification to be located within the structure.

Most of the information in _pdbx_modification_feature is inherited from the category _struct_conn, which contains the details of all the covalent bonds related to PCMs.

As part of the PCM remediation project, all entries containing protein modifications are re-released with these new data items/categories.

The entry 4ZPZ contains 2 phosphoserines and 1 disulfide bridge, so it provides an example of how modifications that are part of a peptide residue and modifications that are direct covalent bonds are handled in _pdbx_modification_feature.

The entry 2K4H provides an example of how linked PCMs are handled in _pdbx_modification_feature because it contains 1 glycine myristoylation.
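As a rough sketch of how the modifications listed in _pdbx_modification_feature might be tallied, the code below reads a looped category from mmCIF text and counts rows by category. Several things here are assumptions: the helper `read_loop` is hypothetical, the item names `ordinal`, `label_comp_id` and `category` are assumed and should be checked against the dictionary extension, and the fragment is a hand-written illustration with the same mix of modifications as 4ZPZ, not data copied from that entry. The reader handles only single-line loop rows.

```python
import shlex
from collections import Counter


def read_loop(cif_text: str, category: str) -> list[dict[str, str]]:
    """Read the rows of one looped mmCIF category (single-line rows only)."""
    prefix = category + "."
    tags: list[str] = []
    rows: list[dict[str, str]] = []
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith(prefix):
            # Column header such as `_pdbx_modification_feature.category`.
            tags.append(line[len(prefix):])
        elif tags:
            # The loop ends at a blank line, a `#` separator, or the next category.
            if not line or line.startswith(("#", "_", "loop_", "data_")):
                break
            # shlex keeps quoted values such as 'Disulfide bridge' together.
            rows.append(dict(zip(tags, shlex.split(line))))
    return rows


# Hypothetical fragment: item names and values are illustrative assumptions.
FRAGMENT = """\
loop_
_pdbx_modification_feature.ordinal
_pdbx_modification_feature.label_comp_id
_pdbx_modification_feature.category
1 SEP 'Named protein modification'
2 SEP 'Named protein modification'
3 CYS 'Disulfide bridge'
#
"""

rows = read_loop(FRAGMENT, "_pdbx_modification_feature")
counts = Counter(row["category"] for row in rows)
print(dict(counts))  # {'Named protein modification': 2, 'Disulfide bridge': 1}
```

For production use, an established mmCIF parser is a safer choice than this line-based reader, which ignores multi-line values and semicolon-delimited text fields.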

Standardizing Protein Modifications

The standardization of PCM handling ensures that there is a single correct approach to handling each PCM that occurs within the PDB archive. As a result of this remediation, some modified residues are split into standard residues and ligands and some standard residues are merged with ligands into modified residues.

Splitting CCDs

In some instances, a single peptide residue CCD is inconsistently used to describe a linked modification. For example, the CCD MYK describes N6-Myristoyl-Lysine. However, myristoylation is a lipid modification and must be handled as a linked modification. In all entries that contain MYK, the following splitting process must be performed:

Existing CCD	>	New Peptide CCD	New Linked CCD	Example PDB Entry ID
MYK	>	LYS	MYR	3U31

Merging CCDs

In some instances, a linked modification group is inconsistently used to describe a PCM that should be handled as a peptide residue within the polypeptide sequence. For example, phosphoserine can be inconsistently described using two CCDs, serine (SER) and phosphate (PO4), rather than the phosphoserine CCD (SEP). Phosphoserine is a 'Named protein modification' and so should be handled as a single CCD describing both the peptide residue and the modification group. The following merging process must be performed in all of these cases:

Existing Peptide CCD	Existing Linked CCD	>	New CCD	Example PDB Entry ID
SER	PO4	>	SEP	6FW5

Replacing CCDs

In some instances, a CCD ID is incorrectly used to describe a PCM where another existing CCD already describes the modification. One example of this is the use of CCDs to describe Heme (HEM) and Heme C (HEC). HEM should be used exclusively to describe non-covalently bound heme, whereas HEC should be used to describe covalently bound heme C. However, there are many entries in which these are inconsistently used. The following replacement process must be performed in all of these cases:

Existing CCD	>	New CCD	Example PDB Entry ID
HEM	>	HEC	19HC

Adding CCDs to Polypeptide Sequences

In some instances, a CCD is annotated as linked to the polypeptide when it should be part of the polypeptide sequence. This most often occurs at the terminal residues of the sequence and most commonly with the terminal caps ACE and NH2. The following addition process must be performed in all these cases:

CCD	Current Handling	>	New Handling	Example PDB Entry ID
ACE	Linked to peptide N-terminus	>	Part of peptide sequence	1BHQ
NH2	Linked to peptide C-terminus	>	Part of peptide sequence	4TKY

The opposite process can also occur, where a CCD should be removed from a polypeptide sequence. For example, terminal myristoylation (MYR) is often inconsistently annotated as being part of the polypeptide sequence. The following removal process must be performed in all these cases:

CCD	Current Handling	>	New Handling	Example PDB Entry ID
MYR	Part of peptide sequence	>	Linked to peptide N-terminus	2NA0

Renaming CCDs

As part of the PCM remediation process, many CCDs that describe protein modifications are being renamed to make them more findable and to remove ambiguities in their naming. For example, the CCD MLZ describes the methylation of the side chain nitrogen of lysine. It is currently named 'N-METHYL-LYSINE', but this name is misleading because it implies that the methylation occurs on the backbone nitrogen rather than the side chain nitrogen. It will be renamed in the following process:

CCD	Current Name	>	New Name	Example PDB Entry ID
MLZ	N-METHYL-LYSINE	>	N6-METHYL-LYSINE	1IV8

This naming must be reflected in all entries that contain the CCD MLZ.

Changing CCD Parents

As part of the PCM remediation process, many CCDs that describe protein modifications have their parent residues updated. For example, the CCD SOQ describes N-METHYL-ASPARTIC ACID, but it is currently annotated as not having a parent residue. It is a natural modification of aspartic acid, so the parent residue will be changed to ASP. The parent will be updated in the following process:

CCD	Current Parent	>	New Parent	Example PDB Entry ID
SOQ	None	>	ASP	7AZ6

This change in parent must be reflected in all entries that contain the CCD SOQ.
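The renaming and parent updates above amount to small patches on a CCD's _chem_comp record. The sketch below assumes each record is held as a plain dict keyed by _chem_comp item name, with `name` and `mon_nstd_parent_comp_id` as the affected items. `CCD_UPDATES` and `remediate_chem_comp` are hypothetical names, and only the two examples from the tables are included.

```python
# CCD ID -> item updates, taken from the renaming and parent tables above.
CCD_UPDATES = {
    "MLZ": {"name": "N6-METHYL-LYSINE"},
    "SOQ": {"mon_nstd_parent_comp_id": "ASP"},
}


def remediate_chem_comp(record: dict[str, str]) -> dict[str, str]:
    """Return a copy of a _chem_comp record with any known updates applied."""
    updated = dict(record)
    updated.update(CCD_UPDATES.get(record["id"], {}))
    return updated


# '?' is the mmCIF marker for an unknown value, used here for the missing parent.
before = {"id": "SOQ", "name": "N-METHYL-ASPARTIC ACID", "mon_nstd_parent_comp_id": "?"}
after = remediate_chem_comp(before)
print(after["mon_nstd_parent_comp_id"])  # ASP
print(before["mon_nstd_parent_comp_id"])  # ?
```

Returning a copy rather than editing in place keeps the original record available for comparison, which can help when checking that every entry containing the CCD reflects the change.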

Acknowledgements

The protein chemical modifications (PCMs) and post-translational modifications (PTMs) remediation project is a wwPDB collaborative project carried out principally by PDBe at EMBL-EBI, funded by BBSRC grant number BB/V018779/1.