Introduction
As of May 2024, the PDB archive stores more than 80,000 macromolecular structures that contain at least one protein chemical modification (PCM)
or post-translational modification (PTM). These modifications are highly diverse, ranging from small groups (e.g. methylation) to
very large polymeric groups (e.g. glycosylation), which makes it challenging to represent them in a format that is consistent
and accessible to the community.
In consultation with the PTM community, all entries containing PCMs/PTMs in the PDB archive are remediated to standardize and enrich
the PCM/PTM data that they contain. This process involves:
- Standardized atom naming and additional annotation of protein backbone and terminal atoms within peptide residues
- New PDBx/mmCIF data categories that classify the PCMs/PTMs with types and linking partners in the CCD definitions
- New PDBx/mmCIF data categories that provide enriched PCM/PTM annotation in the coordinate files
- Proper labeling of entries containing PCMs/PTMs in the coordinate files
The standardization and enrichment of PCM/PTM information across all PDB entries makes PCM/PTM data more Findable, Accessible,
Interoperable and Reusable (FAIR). This supports the wider usage and analysis of protein modification data
in the PDB archive.
Examples and Dictionary extensions
Examples of PDB entries and CCD files that contain the new PCM data categories can be accessed on the wwPDB GitHub.
Dictionary extensions for both the backbone remediation and PCM/PTM remediation are also available in text or markdown format
on the wwPDB GitHub.
Lists of all remediated entries
The list of remediated and to-be-remediated PDB entries is provided here.
The list of remediated CCD IDs is provided here.
New PDBx/mmCIF Data Categories/Items
CCD Definitions
An extended data item, _chem_comp.pdbx_pcm, is introduced to state whether the CCD can be used to describe a PCM.
For example, the modification of lysine with pyridoxal phosphate could be described in two ways: as a single modified
residue (LLP), or as a standard lysine (LYS) linked to pyridoxal phosphate (PLP).
In this example, the value of _chem_comp.pdbx_pcm is set to 'Y' for LLP and 'N' for PLP. This ensures that PLP is not used
to describe this PCM.
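As an illustration of how a consumer of the CCD might use this flag, the sketch below filters candidate CCD IDs by their _chem_comp.pdbx_pcm value. The mapping is a hand-written stand-in for values that would normally be read from the CCD files; only the LLP and PLP values come from the example above, and the function name is hypothetical.

```python
from typing import Dict

# Values of _chem_comp.pdbx_pcm for the pyridoxal phosphate example.
# Hand-written stand-in: a real pipeline would read these from the CCD files.
PDBX_PCM_FLAG: Dict[str, str] = {"LLP": "Y", "PLP": "N"}


def usable_for_pcm(ccd_id: str, flags: Dict[str, str] = PDBX_PCM_FLAG) -> bool:
    """Return True only if the CCD is flagged 'Y' for describing a PCM.

    CCD IDs that are missing from the mapping are treated as not usable.
    """
    return flags.get(ccd_id.upper()) == "Y"


if __name__ == "__main__":
    for ccd_id in ("LLP", "PLP"):
        print(ccd_id, usable_for_pcm(ccd_id))
```

Treating unknown IDs as not usable is a conservative choice made for this sketch, not a rule stated by the remediation project.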
A new category, _pdbx_chem_comp_pcm, is added to provide additional information about the modification, including
type and category classification, modified residue, linking atom pairs, linking position, and UniProt references, if available.
For example, palmitic acid (CCD ID PLM) is annotated as palmitoylation, a type of lipidation. Due to the diversity of PCMs,
not all modifications can be handled using the same approach. Therefore, PCMs have been classified into broad categories.
Within each category, the protein modifications are handled consistently.
There are four approaches to handling PCMs:

| Approach | Categories | Examples |
|---|---|---|
| Part of peptide residue | Chromophore/chromophore-like; Named protein modification; Non-standard residue | Phosphoserine (SEP); N-Trimethyl-lysine (M3L) |
| Linked to peptide residue | ADP-Ribose; Biotin; Carbohydrate; Covalent chemical modification; Crosslinker; Flavin; Heme/heme-like; Lipid/lipid-like; Nucleotide monophosphate | N-acetyl-D-glucosamine (NAG); Palmitic acid (PLM) |
| Direct covalent bond | Disulfide bridge; Isopeptide bond; Non-standard linkage | Cysteine-Cysteine covalent bond; Lysine-Asparagine covalent bond |
| Part of peptide sequence | Terminal acetylation; Terminal amidation | Acetyl group (ACE); Amino group (NH2) |
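
The table above can be expressed as a simple lookup from category to handling approach. The sketch below transcribes the table into a Python dictionary; the variable and function names are hypothetical and are not part of any wwPDB tool.

```python
# Approaches and their categories, transcribed from the table above.
CATEGORIES_BY_APPROACH = {
    "Part of peptide residue": [
        "Chromophore/chromophore-like",
        "Named protein modification",
        "Non-standard residue",
    ],
    "Linked to peptide residue": [
        "ADP-Ribose",
        "Biotin",
        "Carbohydrate",
        "Covalent chemical modification",
        "Crosslinker",
        "Flavin",
        "Heme/heme-like",
        "Lipid/lipid-like",
        "Nucleotide monophosphate",
    ],
    "Direct covalent bond": [
        "Disulfide bridge",
        "Isopeptide bond",
        "Non-standard linkage",
    ],
    "Part of peptide sequence": [
        "Terminal acetylation",
        "Terminal amidation",
    ],
}

# Inverted view: one entry per category, pointing at its approach.
APPROACH_BY_CATEGORY = {
    category: approach
    for approach, categories in CATEGORIES_BY_APPROACH.items()
    for category in categories
}


def approach_for(category: str) -> str:
    """Return the handling approach for a PCM category from the table."""
    try:
        return APPROACH_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"Unknown PCM category: {category!r}") from None
```

Raising an error for an unknown category, rather than returning a default, keeps a misspelled category from silently being handled with the wrong approach.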
Modifications that are part of the peptide residue
Most small and well-known PCMs are in the category 'Named protein modification'. This covers all PTMs that are not explicitly covered
by another category, including phosphorylation and methylation.
Small modifications are handled as part of the peptide residue to ensure that they are grouped into chemical units that are
meaningful and findable by the community. For example, it is easier to find N-Trimethyl-lysine as a single chemical unit rather
than a standard lysine linked to three separate methyl groups.
Chromophores and other non-standard residues are also handled using this approach. This is because they cannot be unambiguously
split into a standard amino acid and the modification group.
Modifications that are linked to a peptide residue
Not all modifications can be defined as part of the peptide residue and instead need to have a separate CCD definition
to describe the modification group.
In these cases, the CCD that describes the modification group contains the details of all the PCMs that it can form.
For example, the CCD for palmitic acid (PLM) describes all the palmitoylations observed in the PDB archive.
Modifications that are direct covalent bonds
Some PCMs have no modification group and are instead direct covalent bonds between two peptide residues.
The most common examples are disulfide bridges; however, there are other such modifications, including isopeptide bonds,
Selenocysteine-Selenocysteine bonds and peptide cyclisation bonds.
There are no CCDs that describe these modification groups because the modification contains no atoms. Therefore,
these modifications are annotated in the mmCIF entry file that describes the macromolecular structure,
but not in any CCD definition.
Modifications that are part of the peptide sequence
These are modifications that are both linked to the peptide residue they modify and part of the polypeptide sequence.
Terminal acetylation and amidation are the only modifications that are handled in this way.
Coordinate Files
A new data item, _pdbx_entry_details.has_protein_modification, is created to indicate whether a PDB entry contains any PCMs.
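A quick way to use this flag is to check it before doing any further PCM processing of an entry. The sketch below is a minimal, standard-library-only reader that handles the item only when it is written as a single key-value line. It assumes the value is 'Y' or 'N', which should be checked against the dictionary extension. The function name and the sample text in the usage note are hypothetical.

```python
from typing import Optional

TAG = "_pdbx_entry_details.has_protein_modification"


def has_protein_modification(mmcif_text: str) -> Optional[bool]:
    """Return True/False for a 'Y'/'N' flag, or None if absent or unknown.

    Only the simple "tag value" form on one line is handled; looped
    categories and multi-line values are outside this sketch.
    """
    for line in mmcif_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == TAG:
            value = parts[1].strip("'\"")
            if value in ("?", "."):  # mmCIF markers for unknown / not applicable
                return None
            return value.upper() == "Y"  # assumption: 'Y' means PCMs are present
    return None
```

For example, calling the function on the hand-written text "data_XXXX\n_pdbx_entry_details.has_protein_modification Y\n" returns True. For real files, a full mmCIF parser is the safer choice.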
A new category, _pdbx_modification_feature, is created to provide a list of all the PCMs that occur within a PDB entry and
detailed information that enables the modification to be located within the structure.
Most of the information in _pdbx_modification_feature is inherited from the category _struct_conn, which contains the details
of all the covalent bonds related to PCMs.
As part of the PCM remediation project, all entries containing protein modifications are re-released with these new data items/categories.
The entry 4ZPZ provides an example of how modifications that are part of a peptide residue and direct covalent bonds are handled in
_pdbx_modification_feature because it contains 2 phosphoserines and 1 disulfide bridge.
The entry 2K4H provides an example of how linked PCMs are handled in _pdbx_modification_feature because it contains
1 glycine myristoylation.
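To show how such a list of modifications might be summarised, the sketch below counts modification records per category for the two entries mentioned above. The rows are hand-written, simplified stand-ins rather than data copied from the entries. The dictionary keys are hypothetical, and only the counts and the categories follow the text.

```python
from collections import Counter

# Simplified stand-ins for _pdbx_modification_feature rows.
# Keys are hypothetical; counts per entry follow the description above.
FEATURES = {
    "4ZPZ": [
        {"comp_id": "SEP", "category": "Named protein modification"},
        {"comp_id": "SEP", "category": "Named protein modification"},
        {"comp_id": "CYS", "category": "Disulfide bridge"},
    ],
    "2K4H": [
        {"comp_id": "MYR", "category": "Lipid/lipid-like"},
    ],
}


def count_by_category(rows):
    """Count modification records per category for one entry."""
    return Counter(row["category"] for row in rows)


if __name__ == "__main__":
    for entry_id, rows in FEATURES.items():
        print(entry_id, dict(count_by_category(rows)))
```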
Standardizing Protein Modifications
The standardization of PCM handling ensures that there is a single correct approach to handling each PCM that occurs
within the PDB archive. As a result of this remediation, some modified residues are split into standard residues and
ligands and some standard residues are merged with ligands into modified residues.
Splitting CCDs
In some instances, a single peptide residue CCD is inconsistently used to describe a linked modification. For example,
the CCD MYK describes N6-Myristoyl-Lysine; however, myristoylation is a lipid modification and must be handled as a linked
modification. In all entries that contain MYK, the following splitting process must be performed:

| Existing CCD | → | New Peptide CCD | New Linked CCD | Example PDB Entry ID |
|---|---|---|---|---|
| MYK | → | LYS | MYR | 3U31 |
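
At the level of residue names, the splitting step can be sketched as follows. This is a minimal illustration that works on a list of CCD IDs only. Atom renaming, coordinates and the new covalent link record are deliberately left out, and the function and variable names are hypothetical.

```python
# Splitting rule from the table above: existing CCD -> (peptide CCD, linked CCD).
SPLIT_RULES = {"MYK": ("LYS", "MYR")}


def split_modified_residues(sequence):
    """Split combined residues into a standard residue plus a linked group.

    `sequence` is a list of CCD IDs in polymer order. Returns the new
    sequence and a list of (1-based position, linked CCD ID) pairs.
    """
    new_sequence = []
    linked_groups = []
    for position, comp_id in enumerate(sequence, start=1):
        peptide_ccd, linked_ccd = SPLIT_RULES.get(comp_id, (comp_id, None))
        new_sequence.append(peptide_ccd)
        if linked_ccd is not None:
            linked_groups.append((position, linked_ccd))
    return new_sequence, linked_groups


if __name__ == "__main__":
    # Hypothetical three-residue peptide containing MYK.
    print(split_modified_residues(["ALA", "MYK", "GLY"]))
```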
Merging CCDs
In some instances, a linked modification group is inconsistently used to describe a PCM that should be handled as a peptide
residue within the polypeptide sequence. For example, phosphoserine can be inconsistently described using two CCDs,
serine (SER) and phosphate (PO4), rather than the phosphoserine CCD (SEP). Phosphoserine is a 'Named protein modification'
and so should be handled as a single CCD describing both the peptide residue and the modification group. The following merging process
must be performed in all of these cases:

| Existing Peptide CCD | Existing Linked CCD | → | New CCD | Example PDB Entry ID |
|---|---|---|---|---|
| SER | PO4 | → | SEP | 6FW5 |
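
The merging step can be sketched in a similar way, again working only on CCD IDs. The input format (a sequence plus a list of linked groups with 1-based positions) is an assumption made for this illustration, and all names are hypothetical.

```python
# Merging rule from the table above: (peptide CCD, linked CCD) -> new CCD.
MERGE_RULES = {("SER", "PO4"): "SEP"}


def merge_linked_groups(sequence, linked_groups):
    """Merge linked groups into their residue where a rule exists.

    `linked_groups` is a list of (1-based position, linked CCD ID) pairs.
    Returns the new sequence and the linked groups that were not merged.
    """
    merged = list(sequence)  # copy so the caller's list is left untouched
    remaining = []
    for position, linked_ccd in linked_groups:
        key = (merged[position - 1], linked_ccd)
        if key in MERGE_RULES:
            merged[position - 1] = MERGE_RULES[key]
        else:
            remaining.append((position, linked_ccd))
    return merged, remaining


if __name__ == "__main__":
    # Hypothetical peptide with a phosphate linked to the serine at position 2.
    print(merge_linked_groups(["GLY", "SER", "ALA"], [(2, "PO4")]))
```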
Replacing CCDs
In some instances, a CCD ID is incorrectly used to describe a PCM where another existing CCD already describes the modification.
One example of this is the use of CCDs to describe Heme (HEM) and Heme C (HEC). HEM should be used exclusively to describe
non-covalently bound heme, whereas HEC should be used to describe covalently bound heme C. However, there are many entries
in which these are inconsistently used. The following replacement process must be performed in all of these cases:

| Existing CCD | → | New CCD | Example PDB Entry ID |
|---|---|---|---|
| HEM | → | HEC | 19HC |
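
The replacement rule can be sketched as a small function. The boolean argument stands in for whatever evidence of covalent attachment a pipeline has available (for example, a covalent link recorded in _struct_conn). How that evidence is obtained is outside this sketch, and the function name is hypothetical.

```python
def remediated_heme_ccd(current_ccd: str, covalently_bound: bool) -> str:
    """Return the CCD ID to use after applying the HEM -> HEC rule.

    Only the direction shown in the table is implemented: HEM that is
    covalently bound to the protein is replaced with HEC. Every other
    combination is returned unchanged.
    """
    if current_ccd == "HEM" and covalently_bound:
        return "HEC"
    return current_ccd


if __name__ == "__main__":
    print(remediated_heme_ccd("HEM", covalently_bound=True))
    print(remediated_heme_ccd("HEM", covalently_bound=False))
```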
Adding CCDs to Polypeptide Sequences
In some instances, a CCD is linked to the polypeptide rather than being part of the polypeptide sequence. This most often occurs
at the terminal residues of the sequence, most commonly with the terminal caps ACE and NH2. The following addition process must
be performed in all these cases:

| CCD | Current Handling | → | New Handling | Example PDB Entry ID |
|---|---|---|---|---|
| ACE | Linked to peptide N-terminus | → | Part of peptide sequence | 1BHQ |
| NH2 | Linked to peptide C-terminus | → | Part of peptide sequence | 4TKY |
The opposite process can also occur, where a CCD should be removed from a polypeptide sequence. For example, terminal
myristoylation (MYR) is often inconsistently annotated as being part of the polypeptide sequence. The following removal process
must be performed in all these cases:

| CCD | Current Handling | → | New Handling | Example PDB Entry ID |
|---|---|---|---|---|
| MYR | Part of peptide sequence | → | Linked to peptide N-terminus | 2NA0 |
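
The addition and removal rules for terminal groups can be sketched together. The sketch represents a chain as a list of CCD IDs plus at most one group linked to each terminus, which is a simplification made for this illustration. All names are hypothetical.

```python
def apply_terminal_rules(sequence, n_linked=None, c_linked=None):
    """Apply the terminal addition/removal rules to one polypeptide chain.

    `sequence` is a list of CCD IDs; `n_linked` / `c_linked` are the CCD IDs
    of groups linked to the N- and C-terminus (or None).
    Returns (new_sequence, new_n_linked, new_c_linked).
    """
    seq = list(sequence)  # copy so the caller's list is left untouched

    # Addition: terminal caps linked to a terminus become part of the sequence.
    if n_linked == "ACE":
        seq.insert(0, "ACE")
        n_linked = None
    if c_linked == "NH2":
        seq.append("NH2")
        c_linked = None

    # Removal: MYR at the start of the sequence becomes a linked group.
    if seq and seq[0] == "MYR" and n_linked is None:
        n_linked = seq.pop(0)

    return seq, n_linked, c_linked


if __name__ == "__main__":
    # Hypothetical two-residue peptides, not taken from the example entries.
    print(apply_terminal_rules(["GLY", "ALA"], n_linked="ACE", c_linked="NH2"))
    print(apply_terminal_rules(["MYR", "GLY", "ALA"]))
```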
Renaming CCDs
As part of the PCM remediation process, many CCDs that describe protein modifications are being renamed to make them
more findable and remove ambiguities in their naming. For example, the CCD MLZ describes the methylation of the side chain
nitrogen of lysine. It is currently named 'N-METHYL-LYSINE'; however, this name is misleading because it implies that the methylation
occurs on the backbone nitrogen rather than the side chain nitrogen. It will be renamed as follows:

| CCD | Current Name | → | New Name | Example PDB Entry ID |
|---|---|---|---|---|
| MLZ | N-METHYL-LYSINE | → | N6-METHYL-LYSINE | 1IV8 |
This naming must be reflected in all entries that contain the CCD MLZ.
Changing CCD Parents
As part of the PCM remediation process, many CCDs that describe protein modifications have their parent residues updated.
For example, the CCD SOQ describes N-METHYL-ASPARTIC ACID but is currently annotated as not having a
parent residue. It is a natural modification of aspartic acid, so the parent residue will be changed to ASP. The parent
will be updated as follows:

| CCD | Current Parent | → | New Parent | Example PDB Entry ID |
|---|---|---|---|---|
| SOQ | None | → | ASP | 7AZ6 |
This change in parent must be reflected in all entries that contain the CCD SOQ.
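Renaming and parent updates both amount to overwriting individual _chem_comp items for a given CCD ID. The sketch below applies the two example updates to a dictionary that stands in for a _chem_comp record, using the existing item names `name` and `mon_nstd_parent_comp_id`. The update table, the record layout and the function name are hypothetical.

```python
# Example updates taken from the renaming and parent tables above.
CCD_UPDATES = {
    "MLZ": {"name": "N6-METHYL-LYSINE"},
    "SOQ": {"mon_nstd_parent_comp_id": "ASP"},
}


def remediate_chem_comp(record: dict) -> dict:
    """Return a copy of a _chem_comp-like record with the updates applied.

    `record` must contain an "id" key holding the CCD ID. Records for CCDs
    without an update are returned as unchanged copies.
    """
    updated = dict(record)
    updated.update(CCD_UPDATES.get(record["id"], {}))
    return updated


if __name__ == "__main__":
    print(remediate_chem_comp({"id": "MLZ", "name": "N-METHYL-LYSINE"}))
    print(remediate_chem_comp({"id": "SOQ", "mon_nstd_parent_comp_id": "?"}))
```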
Acknowledgements
The protein chemical modifications (PCMs) and post-translational modifications (PTMs) remediation project
is a wwPDB collaborative project carried out principally by PDBe at
EMBL-EBI, funded by the BBSRC grant number BB/V018779/1.