💻PDB-CAT: A User-Friendly Tool to Classify and Analyze PDB Protein-Ligand Complexes

PDB-CAT is a program that classifies a group of protein-ligand structures into three categories: apo, covalently bonded, and non-covalently bonded. In addition to the classification, the program can check for mutations in the protein sequence by comparing it to a FASTA sequence.
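The three-way classification can be pictured with a minimal sketch. The `Ligand` class, the `classify_structure` function, and the rule that a single covalent ligand is enough to label the whole structure as covalent are assumptions made for illustration; they are not taken from the PDB-CAT source code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Ligand:
    """Hypothetical record for one ligand found in a structure."""
    name: str
    covalent: bool  # True if the ligand is covalently bonded to the protein


def classify_structure(ligands: Sequence[Ligand]) -> str:
    """Return 'apo', 'covalent', or 'non-covalent' for one structure.

    Assumption: one covalently bonded ligand is enough to label the
    whole structure as covalent.
    """
    if not ligands:
        return "apo"
    if any(ligand.covalent for ligand in ligands):
        return "covalent"
    return "non-covalent"
```

For example, a structure with no ligands at all would be reported as `apo` under these assumptions.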

Open in Colab

STEP 0. Execute the Environment setup by JUST clicking on the ▶️ button.

It takes a few seconds

STEP 1. Import libraries by JUST clicking on the ▶️ button

STEP 2: Verify Required Folders

The script checks if the required directories exist. If they do not, it creates them automatically.
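A check of this kind could look like the sketch below. The function name `ensure_folders` and the `base` parameter are hypothetical, and the notebook may implement the step differently.

```python
from pathlib import Path


def ensure_folders(names, base="."):
    """Create each missing folder under `base`; return the names that were created."""
    created = []
    for name in names:
        path = Path(base) / name
        if not path.is_dir():
            # parents=True also creates intermediate folders if needed
            path.mkdir(parents=True, exist_ok=True)
            created.append(name)
    return created
```

Calling `ensure_folders(["cif-test"])` a second time returns an empty list, because the folder already exists.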

This is where you upload your cif files: place them inside the cif-test folder!

Hover your mouse over the 'cif-test' folder and click the three-dot menu (or the options button) that appears next to it. From the dropdown menu, select the option to upload your files. Then, choose the files you want to upload from your computer and confirm the upload.
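To confirm that the upload worked, you could list the cif files found in the folder. This is an optional sketch; `list_cif_files` is a hypothetical helper and not part of the notebook.

```python
from pathlib import Path


def list_cif_files(folder):
    """Return the sorted names of the .cif files directly inside `folder`."""
    return sorted(p.name for p in Path(folder).glob("*.cif"))
```

An empty list from `list_cif_files("cif-test")` would suggest that the files were uploaded somewhere else.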

STEP 3: Define Parameters

```python
"""
=========
INITIAL INFORMATION. CHANGE THE CONTENT OF THESE VARIABLES IF NECESSARY
=========
"""

# Name of the folder with the cif files to process
folder_name = "cif-test"
# Choose a threshold for the number of amino acids, to discriminate between peptides and the subunits of the protein
res_threshold = 20
# Analyze mutations. True or False
mutation = False
# PDB code of the protein to analyze. If mutation is False, this variable is not used.
pdb = " "
```
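The role of `res_threshold` can be illustrated with a short sketch. The function `split_chains` and its input format are hypothetical, and treating a chain with exactly `res_threshold` residues as a subunit is an assumption; the program may handle the boundary differently.

```python
def split_chains(chain_lengths, res_threshold=20):
    """Split chains into protein subunits and peptides by residue count.

    `chain_lengths` maps a chain ID to its number of residues.
    Assumption: chains shorter than `res_threshold` count as peptides.
    """
    subunits, peptides = {}, {}
    for chain_id, n_residues in chain_lengths.items():
        target = peptides if n_residues < res_threshold else subunits
        target[chain_id] = n_residues
    return subunits, peptides
```

With the default threshold of 20, a 306-residue chain would be kept as a subunit and a 6-residue chain would be treated as a peptide.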

Make sure all cells have been executed by clicking the ▶️ button!

Download results in a zip

And that's it! If your computer downloads the zip file containing the results, the program has been correctly executed.
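If you want to bundle a results folder yourself, a minimal sketch with the standard library is shown below. The function name `zip_results` and its arguments are hypothetical; the notebook's own download cell may work differently.

```python
import shutil
from pathlib import Path


def zip_results(results_dir, archive_stem="results"):
    """Pack the contents of `results_dir` into `<archive_stem>.zip` and return its path."""
    # make_archive appends the ".zip" extension and returns the archive file name
    archive = shutil.make_archive(str(archive_stem), "zip", root_dir=str(results_dir))
    return Path(archive)
```

In Colab, the resulting file can then be sent to the browser with `google.colab.files.download`. Keep the archive outside the folder being zipped so that it does not end up inside itself.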

🚀 Learn how to use it with this example!

SARS-CoV-2 Main Protease

If you are familiar with GitHub, you can clone the following repository:

  git clone https://github.com/URV-cheminformatics/PDB-CAT.git

"""
=========
INITIAL INFORMATION. CHANGE THE CONTENT OF THESE VARIABLES IF NECESSARY
=========
"""

# Name of the folder with the cif files to process
folder_name = "Mpro" 
# Choose a threshold for the number of amino acids, to discriminate between peptides and the subunits of the protein
res_threshold = 20  
# Analyze mutations. True or False        
mutation = True      
# PDB code of the protein to analyze. If mutation is False, this variable is not used.                           
pdb = "SARS-CoV-2_FASTA"

🏁Output

📄 CSV

In the first CSV, one line is written for each PDB ID code. The line starts with details about the protein: the PDB ID, the title of the PDB file, the protein description, the number of subunits, the subunit IDs (referred to as chains), and the number of residues in each subunit. It then indicates whether the structure is a complex. Next comes information about discarded ligands (elements in the blacklist that are bonded to the protein) and branched molecules: their names, types, functions, and whether a covalent bond is present. Ligand information follows, including the name, type, function, and whether a covalent bond is present. The final segment covers mutation information: the number of mutations, their locations, the identity percentage, and the gaps.
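The mutation fields can be understood through a small sketch that summarizes two sequences that are already aligned. This is not PDB-CAT's implementation: the function `mutation_summary`, the use of `-` as the gap character, the `A4C`-style labels, and the definition of identity as matches divided by alignment length are all assumptions made for illustration.

```python
def mutation_summary(ref_aligned, query_aligned):
    """Summarize differences between two pre-aligned sequences ('-' marks a gap)."""
    if not ref_aligned or len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    mutations, gaps, matches = [], 0, 0
    ref_pos = 0  # 1-based position in the ungapped reference sequence
    for ref_res, query_res in zip(ref_aligned, query_aligned):
        if ref_res != "-":
            ref_pos += 1
        if ref_res == "-" or query_res == "-":
            gaps += 1
        elif ref_res == query_res:
            matches += 1
        else:
            # Label such as "A4C": reference residue, position, observed residue
            mutations.append(f"{ref_res}{ref_pos}{query_res}")
    return {
        "n_mutations": len(mutations),
        "mutations": mutations,
        "identity_percent": 100.0 * matches / len(ref_aligned),
        "gaps": gaps,
    }
```

A real comparison would first need a sequence alignment step, which this sketch leaves out.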

In the second CSV, one line is written for each entity bonded to a protein. Each line contains the ID of the protein, the bonded molecule, its name, type, and function, and whether it is covalently bonded and, if so, to which residue. If the entity is a glycosylation, that information is also included.
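Once downloaded, the second CSV can be filtered with a few lines of Python. The sketch below keeps only the rows flagged as covalently bonded. The column name `Covalent_Bond` and the accepted truthy values are assumptions, so check the header of your own file and adjust `flag_column` accordingly.

```python
import csv


def covalent_rows(csv_file, flag_column="Covalent_Bond",
                  true_values=("true", "yes", "1")):
    """Return the rows of an open CSV file whose flag column holds a truthy value."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        # A missing or empty cell counts as "not covalent"
        if (row.get(flag_column) or "").strip().lower() in true_values
    ]
```

A typical call would be `with open(path, newline="") as handle: rows = covalent_rows(handle)`, where `path` points to the ligand CSV.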

df-Example-1(Main-protease).csv (496KB)

df-ligand-Example-1(Main-protease).csv (205KB)
