Working with Structure Data (PDBx/mmCIF)

In this hands-on module, we will learn how to work with the mmCIF data format. mmCIF is the modern, machine-friendly format used by the RCSB Protein Data Bank (PDB) to store 3D macromolecular structures and their associated metadata.

mmCIF files are used to represent experimentally determined and computationally predicted structures, including proteins, nucleic acids, and molecular complexes.

After completing this module, students should be able to:

Identify and understand valid mmCIF files
Explain what types of information mmCIF stores (atoms, residues, chains, metadata)
Read mmCIF files into Python objects using Biopython
Navigate the structure hierarchy (Structure → Model → Chain → Residue → Atom)
Extract common data (coordinates, residue IDs, chain IDs, B-factors, occupancy)
Write or export structures back to mmCIF format

Macromolecular structure context

mmCIF files describe macromolecular structures, meaning the three-dimensional positions of atoms in a biological system. While many entries in the Protein Data Bank are proteins, mmCIF is not protein-specific. It can represent a wide range of structures, including:

Proteins and protein complexes
DNA and RNA
Protein–nucleic acid complexes
Small-molecule ligands, ions, cofactors, and waters

Biological function depends strongly on 3D shape. Binding sites, active sites, molecular recognition, and molecular interactions are determined by how atoms and residues are arranged in space. Structural data come from experiments such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy, or from computational prediction methods such as AlphaFold.

The Protein Data Bank archives these structures so they can be analyzed, visualized, and compared.

Figure: Structure summary of PDB Entry 1MBN.

*Fun fact: myoglobin was the first protein to have its 3D structure revealed by X-ray crystallography. John Kendrew shared the 1962 Nobel Prize in Chemistry with Max Perutz for this work on globular protein structures.*

What is mmCIF?

PDBx/mmCIF, or simply mmCIF, stands for “Macromolecular CIF”. The name is derived from the “Crystallographic Information File (CIF)” format, which was created to store data from small-molecule crystallographic experiments. mmCIF uses the same syntax and file extension as CIF (.cif), but relies on a different, much larger, and extensible dictionary of data names.

Note

PDBx/mmCIF is the standard archive format of the PDB, which makes it the essential format for modern structural biology. Standard CIF is mostly used in small-molecule chemical crystallography.

A typical mmCIF contains:

Atom-level coordinates (x, y, z) for each atom
Atom identity (element, atom name), residue identity (e.g., amino acid or nucleotide), chain identity
Optional values like occupancy and B-factor (temperature factor)
Experimental metadata (method, resolution, authors, citations)
Chemical component definitions (ligands, modified residues)

mmCIF Format Basics

While mmCIF files can be intimidating at first, they are built from just a few core concepts. At a high level, an mmCIF file consists of:

Data blocks: contain the information for one structure (some mmCIF files contain multiple structures)
Data items: single name-value pairs for metadata fields
Loops: tables whose rows supply one value for each listed column name (more on this below)

Data Blocks

Every mmCIF file begins with a data block, which is introduced by a line starting with data_:

data_1MBN

The text after data_ is the data block identifier, usually the PDB ID (e.g. “1MBN”). A data block defines a logical boundary for one structure entry.

Important rules:

A file may contain one or more data blocks
Each data block describes one structure
A data block ends when another data_ line is encountered or the file ends
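
To make these rules concrete, here is a minimal sketch that lists the data block identifiers found in a piece of mmCIF text. The function name data_block_ids and the second block (EXAMPLE2) are made up for illustration. The sketch also assumes that no multi-line text value happens to contain a line starting with data_, which a full parser would have to check.

```python
def data_block_ids(text):
    """Return the identifier of every data block, in file order."""
    ids = []
    for line in text.splitlines():
        # A data block header is a line that starts with "data_"
        if line.startswith("data_"):
            ids.append(line[len("data_"):].strip())
    return ids


# Hypothetical two-block text; real PDB entry files usually hold one block
example = """data_1MBN
_struct.title 'The stereochemistry of the protein myoglobin'
data_EXAMPLE2
_struct.title 'A second, made-up block'
"""

print(data_block_ids(example))
```

Each identifier marks where one block starts, so everything up to the next identifier (or the end of the text) belongs to that block.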

Data Items (Name-Value Pairs)

The simplest information in a mmCIF file is stored as data items, which are single name-value pairs:

_struct.title   'The stereochemistry of the protein myoglobin'

This line has three parts:

_struct.title: The data item name
'The stereochemistry of the protein myoglobin': The value
Whitespace separates name from values

All mmCIF data item names begin with a leading underscore and have the form _category.keyword. In the example above:

struct = the category (the text between the leading underscore and the period)
title = the keyword (which must be unique within the category)

Note

Data Items Syntax:

Data item names are not case-sensitive
Each data item must have exactly one value
Values may be numbers, short strings (quoted or unquoted), or special placeholders:
- ? = value is missing
- . = value is not applicable or intentionally omitted
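
As a rough illustration of these syntax rules, the sketch below splits a single-line data item into its category, keyword, and value. The function name parse_data_item is hypothetical, and using Python's shlex module to handle the quotes is a simplification: real CIF quoting rules differ from shell quoting in some edge cases, so treat this as a teaching aid rather than a parser.

```python
import shlex


def parse_data_item(line):
    """Split one '_category.keyword value' line into its three parts."""
    # shlex.split honours single and double quotes around the value
    name, value = shlex.split(line)
    # Drop the leading underscore, then split on the first period
    category, keyword = name[1:].split(".", 1)
    # Simplification: both placeholders ('?' missing, '.' not applicable)
    # are mapped to None, which loses the distinction between them
    if value in ("?", "."):
        value = None
    return category, keyword, value


print(parse_data_item("_struct.title   'The stereochemistry of the protein myoglobin'"))
print(parse_data_item("_entity_src_gen.gene_src_genus   Physeter"))
print(parse_data_item("_exptl.crystals_number   ?"))
```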

Text Values and Multi-line Strings

Short text values may be enclosed in single or double quotes, or can be unquoted:

_entity_src_gen.gene_src_common_name               'sperm whale'
_entity_src_gen.gene_src_genus                     Physeter
_entity_src_gen.pdbx_gene_src_scientific_name      'Physeter catodon'

Long text values that span multiple lines are enclosed by semicolon delimiters placed at the start of a line:

_entity_poly.pdbx_seq_one_letter_code
;VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKG
HHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
;

Everything between the semicolons is treated as a single text value.
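
The sketch below shows one way to collect such a semicolon-delimited value from a list of lines. The helper name read_text_field is an assumption made for this example, and the code assumes the closing semicolon is present (it does no error recovery for truncated files).

```python
def read_text_field(lines, start):
    """Return (value, next_index) for a semicolon-delimited text field.

    lines[start] must be the line that opens the field with ';'.
    """
    if not lines[start].startswith(";"):
        raise ValueError("a text field must open with ';' at the start of a line")
    parts = [lines[start][1:]]  # text that follows the opening semicolon
    index = start + 1
    # Keep collecting lines until we reach the closing ';' line
    while not lines[index].startswith(";"):
        parts.append(lines[index])
        index += 1
    return "\n".join(parts), index + 1


lines = [
    "_entity_poly.pdbx_seq_one_letter_code",
    ";VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKG",
    "HHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG",
    ";",
]

value, next_index = read_text_field(lines, 1)
# For a sequence we usually want one continuous string without line breaks
sequence = value.replace("\n", "")
print(len(sequence), sequence[:10])
```

The value itself keeps its line breaks. Removing them is a choice we make here because the field holds a sequence.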

Loops (Tables)

Many types of data occur multiple times (atoms, residues, authors, citations, etc.). These are stored using the loop_ directive, which defines a table. A loop has two parts:

A list of column names (data item names)
Rows of values corresponding to those columns

For example, let’s take a look at the simplified _atom_site loop. This is often the most crucial information contained in a mmCIF file:

loop_
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
N  VAL  -2.900  17.600  15.500
CA VAL  -3.600  16.400  15.300
C  VAL  -3.000  15.300  16.200
O  VAL  -3.700  14.700  17.000

The first line (loop_) starts the table. Then, the lines starting with _atom_site. define the columns of the table:

label_atom_id: The atom name within the residue (e.g., N, CA, C, O)

label_comp_id: The residue name (here, VAL = valine)

Cartn_x, Cartn_y, Cartn_z: The atom’s x, y, and z coordinates in 3D space (in Ångströms)

Each line below the headers provides the values for one atom, in the same order as the column headers. For example, the first row:

N  VAL  -2.900  17.600  15.500

means: Atom N (the backbone nitrogen) in residue VAL (valine) is located at coordinates (−2.900, 17.600, 15.500) Å
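
To see how the column names and rows fit together, here is a minimal sketch that turns a simple loop into a list of dictionaries, one per row. The function name parse_loop is hypothetical, and the sketch assumes one row per line and no quoted values containing spaces, which holds for the rows used here but not for every loop in a real file.

```python
def parse_loop(text):
    """Parse one simple loop_ block into a list of row dictionaries."""
    columns = []
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue  # skip blank lines and the loop_ directive itself
        if line.startswith("_"):
            columns.append(line)  # a column name
        else:
            values = line.split()  # one value per column, in order
            rows.append(dict(zip(columns, values)))
    return rows


# Three of the rows from the table above
loop_text = """loop_
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
N  VAL  -2.900  17.600  15.500
C  VAL  -3.000  15.300  16.200
O  VAL  -3.700  14.700  17.000
"""

rows = parse_loop(loop_text)
for row in rows:
    print(row["_atom_site.label_atom_id"], float(row["_atom_site.Cartn_x"]))
```

All values come back as strings, so numeric columns such as the coordinates need an explicit float() conversion.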

Working with mmCIF files

Let’s get some practice working with mmCIF files. We’ll use VS Code for these exercises. Open a VS Code Remote-SSH session and create a new terminal.

In the VS Code terminal on your class VM, navigate to your mbs-337/working-with-bio-data project directory.

We’re going to use a new command called wget to download a mmCIF file directly from RCSB PDB. This command allows us to retrieve files directly from the internet by providing a URL to the data we want to download.

In your VS Code terminal, use the command below to download the mmCIF file for PDB Entry 1MBN.

[mbs-337]$ wget https://files.rcsb.org/download/1MBN.cif.gz

--2026-02-02 21:44:40--  https://files.rcsb.org/download/1MBN.cif.gz
Resolving files.rcsb.org (files.rcsb.org)... 13.33.82.83, 13.33.82.18, 13.33.82.74, ...
Connecting to files.rcsb.org (files.rcsb.org)|13.33.82.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/gzip]
Saving to: ‘1MBN.cif.gz’

1MBN.cif.gz                               [ <=>     ]  36.54K  --.-KB/s    in 0.02s

2026-02-02 21:44:40 (2.15 MB/s) - ‘1MBN.cif.gz’ saved [37416]

You now have a file called 1MBN.cif.gz in your working-with-bio-data directory! If you try to open this file, you’ll see that it does not display in the text editor. That’s because we downloaded a compressed version of the mmCIF — the .gz file ending means that this file has been compressed using the GNU Zip (Gzip) algorithm to reduce the file size for storage and faster transfer.

The first thing we need to do is decompress this file to its original format:

[mbs-337]$ gunzip 1MBN.cif.gz

The .gz file ending should be gone now, and you should be able to view your mmCIF file in the text editor.
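
As an aside, Python can also read a gzip-compressed text file directly through the standard-library gzip module, without decompressing it on disk first. The sketch below is only an illustration: the helper name read_first_lines is made up, and the demo writes a tiny stand-in file to a temporary directory so that it runs without a download.

```python
import gzip
import os
import tempfile


def read_first_lines(path, n=5):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the compressed file in text mode
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for _, line in zip(range(n), handle)]


# Create a small made-up compressed file to demonstrate the helper
demo_path = os.path.join(tempfile.mkdtemp(), "demo.cif.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("data_DEMO\n_struct.title 'A made-up entry'\n")

print(read_first_lines(demo_path, n=1))
```

The same call should work on a compressed file downloaded from the PDB, as long as the path points to a file that still has its .gz ending.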

Helpful Linux Commands

You can inspect an mmCIF file from the command line:

# Print the first 20 lines
[mbs-337]$ head -n 20 1MBN.cif

# Find the title of the structure
[mbs-337]$ grep "_struct.title" 1MBN.cif

# Find lines that start a data block
[mbs-337]$ grep "^data_" 1MBN.cif

Read mmCIF from File

Biopython provides MMCIFParser in Bio.PDB for reading .cif files. The parser converts the text file into a Structure object that you can query and iterate over. Activate your Python virtual environment and create a file called mmcif_ex.py:

from Bio.PDB.MMCIFParser import MMCIFParser

# Create a parser using the MMCIFParser Class (blueprint)
parser = MMCIFParser()

with open('1MBN.cif', 'r') as f:
    # Use .get_structure method on our parser to create a structure object
    structure = parser.get_structure('myoglobin', f)
    print(structure)

The get_structure() method takes two inputs:

An ID: a short name you choose (e.g. 'myoglobin')
A file handle: the open mmCIF file (a path to the file, such as '1MBN.cif', is also accepted)

Tip

mmCIF files can be large. The parser loads the structure into memory so you can query it easily. For most single-protein structures this is fine; very large complexes may require more memory or streaming approaches.

Understanding the structure hierarchy

Biopython represents a structure using a hierarchy:

Structure: top-level, one per file
└──Model: Possible 3D arrangement of the atoms (usually just 1)
    └──Chain: Continuous polymers (e.g., hemoglobin is composed of four globin chains)
        └──Residue: Monomers (e.g., GLY, ALA, VAL, etc.)
            └──Atom: Individual atoms (e.g., N, C, O, etc.) with 3D coordinates

For example, imagine this structure:

Structure
└── Model 0
    └── Chain A
        └── Residue 1: LEU
            └── Atom: N
            └── Atom: CA
            └── Atom: C
            └── Atom: O
        └── Residue 2: VAL
            └── Atom: N
            └── Atom: CA
            └── Atom: C
            └── Atom: O
            └── Atom: CB
            └── Atom: CG1
            └── Atom: CG2

English translation: One structure file, with one model, containing one protein chain (A) made of amino acids, each built from atoms.

You typically iterate with nested loops over structure → model → chain → residue → atom. For example, to count models, chains, residues, and atoms:

num_models = 0
num_chains = 0
num_residues = 0
num_atoms = 0

for model in structure:
    num_models += 1
    for chain in model:
        num_chains += 1
        for residue in chain:
            num_residues += 1
            for atom in residue:
                num_atoms += 1

print(f"Models: {num_models}")
print(f"Chains: {num_chains}")
print(f"Residues: {num_residues}")
print(f"Atoms: {num_atoms}")

Models: 1
Chains: 1
Residues: 155
Atoms: 1210

Each object in the hierarchy has an ID, which you can retrieve with get_id() (e.g., model.get_id(), chain.get_id()):

for model in structure:
    print(f"Model: {model.get_id()}")
    for chain in model:
        print(f"Chain: {chain.get_id()}")

Model: 0
Chain: A

Most structures, like this one, have only one model. Some NMR or computationally predicted structures may have many models (an ensemble). If we had an ensemble of models, their IDs would be printed in numerical order, starting at 0.

We also see that our structure only has one chain (A). This is because myoglobin is made up of a single polypeptide chain. Some structures are made up of multiple polymers. For example, hemoglobin is a tetramer, consisting of four polypeptide chains. In this case, we would have Chains A, B, C, and D.

Figure: Structures of myoglobin and hemoglobin. Notice how hemoglobin is composed of four polypeptide chains: two alpha chains and two beta chains. Source: Eaton 2021

Let’s see what happens when we use residue.get_id(). We’ll also use residue.get_resname() to print the name of each residue:

for model in structure:
    for chain in model:
        for residue in chain:
            print(residue.get_resname(), residue.get_id())

VAL (' ', 1, ' ')
LEU (' ', 2, ' ')
SER (' ', 3, ' ')
...
GLY (' ', 153, ' ')
OH ('H_OH', 154, ' ')
HEM ('H_HEM', 155, ' ')

Now this is interesting! When we use residue.get_resname(), we see the actual names of each residue within Chain A (e.g., VAL, LEU, etc.). We know from earlier that our chain is made up of 155 residues, but not all of these are amino acids! The last two lines in our output show that residue 154 is OH (a hydroxide ion), and residue 155 is HEM (a heme group).

The output of residue.get_id() looks quite different from the IDs we’ve seen so far. Each residue ID is a tuple with three elements:

(hetfield, resseq, icode)

hetfield = The hetero-field identifies whether the residue is a standard amino/nucleic acid or something else:
- ' ' = standard amino acids and nucleic acids
- 'W' = water molecule
- 'H_' followed by the residue name = other hetero-residues (e.g., 'H_OH' for the hydroxide ion)
resseq = The sequence identifier is an integer describing the position of the residue in the chain
icode = The insertion code is a string that is sometimes used to preserve a certain desirable numbering scheme.
- For example, a Ser 80 insertion mutant (inserted between a Thr 80 and an Asn 81 residue) could look like this:
THR (' ', 80, 'A') SER (' ', 80, 'B') ASN (' ', 81, ' ')
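
The sketch below shows how the hetfield element can be used to sort residue IDs into categories. It works on plain tuples, so it runs without a structure file. The function name classify_residue_id is hypothetical, and the water ID in the example is made up, since the output above does not list any waters.

```python
def classify_residue_id(residue_id):
    """Classify a residue from its (hetfield, resseq, icode) tuple."""
    hetfield, resseq, icode = residue_id  # unpack the three elements
    if hetfield == " ":
        return "standard"  # standard amino acid or nucleic acid
    if hetfield == "W":
        return "water"
    if hetfield.startswith("H_"):
        return "hetero"  # e.g. ligands, ions, cofactors
    return "unknown"


example_ids = [
    (" ", 1, " "),        # VAL, from the output above
    ("H_OH", 154, " "),   # hydroxide ion, from the output above
    ("H_HEM", 155, " "),  # heme group, from the output above
    ("W", 200, " "),      # made-up water ID for illustration
]

for residue_id in example_ids:
    print(residue_id, classify_residue_id(residue_id))
```

With a parsed structure, the same function could be called on the tuple returned by residue.get_id().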

Finally, we can use atom.get_id() with atom.get_coord() to print each atom and its 3D coordinates (x, y, z) for the first residue:

for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(residue.get_resname(), atom.get_id(), atom.get_coord())
            break

VAL N [-2.9 17.6 15.5]
VAL CA [-3.6 16.4 15.3]
VAL C [-3.  15.3 16.2]
VAL O [-3.7 14.7 17. ]
VAL CB [-3.5 16.  13.8]
VAL CG1 [-2.1 15.7 13.3]
VAL CG2 [-4.6 14.9 13.4]

This code returned the residue name of the first residue in our chain (VAL), along with each atom (N, CA, C, etc.) and its 3D coordinates.
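
Once we have coordinates, simple geometry becomes possible. As a small worked example, the sketch below computes the distance between the N and CA atoms of the first residue, with the coordinates copied from the output above into plain tuples. It uses math.dist, which assumes Python 3.8 or newer.

```python
import math

# Coordinates (in Å) copied from the printed output for the first residue
n_coord = (-2.9, 17.6, 15.5)
ca_coord = (-3.6, 16.4, 15.3)

# Euclidean distance: sqrt(dx^2 + dy^2 + dz^2)
#   dx = 0.7, dy = 1.2, dz = 0.2  ->  sqrt(0.49 + 1.44 + 0.04) = sqrt(1.97)
distance = math.dist(n_coord, ca_coord)
print(f"N-CA distance: {distance:.2f} Å")  # N-CA distance: 1.40 Å
```

In Biopython, subtracting one Atom object from another (for example atom_a - atom_b) returns this same kind of distance, so you rarely need to do the arithmetic by hand.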

Summary of Structure Methods

The table below summarizes the most commonly used methods at each level and what they return.

Hierarchy Level	Object Type	Common Methods	What They Return / Do
Structure	`Structure`	`get_id()`	Structure identifier string (e.g. `"myoglobin"`)
Model	`Model`	`get_id()`	Model number (e.g. `0`)
Chain	`Chain`	`get_id()`	Chain identifier (e.g. `A`)
Residue	`Residue`	`get_resname()`	Residue name (e.g. `VAL`)
		`get_id()`	Tuple `(hetfield, resseq, icode)`
Atom	`Atom`	`get_id()`	Atom name (e.g. `'CA'`, `'N'`)
		`get_coord()`	NumPy array of 3D coordinates `[x, y, z]`
		`element` (attribute)	Chemical element symbol (e.g. `'C'`, `'N'`)

For more methods and cool things you can do with structural biology data in Biopython, see this documentation.

EXERCISE

Exercise 1: Print Chain ID: Num Residues

Using the same structure file (e.g. 1MBN.cif), write a short script that:

Parses the mmCIF file with MMCIFParser.
Loops over models and chains.
For each chain, prints the chain ID and the number of residues in that chain.

Example output: Chain A: 155 residues.

from Bio.PDB.MMCIFParser import MMCIFParser

parser = MMCIFParser()
with open('1MBN.cif', 'r') as f:
    structure = parser.get_structure('myoglobin', f)

for model in structure:
    for chain in model:
        chain_id = chain.get_id()
        num_residues = 0
        for residue in chain:
            num_residues += 1
        print(f"Chain {chain_id}: {num_residues} residues")

Exercise 2: List All Hetero-residues in a Chain

For this exercise, we want to find all the hetero-residues in Chain A and print their residue name and ID. Use the fact that residue.get_id() returns a tuple (hetfield, resseq, icode) that we can unpack into three variables, just like a, b, c = (10, 20, 30).

Hint: Standard amino acids and nucleic acids have hetfield == ' '.

Example output:

Chain A: hetero residues
OH id=('H_OH', 154, ' ')
HEM id=('H_HEM', 155, ' ')

from Bio.PDB.MMCIFParser import MMCIFParser

parser = MMCIFParser()
with open('1MBN.cif', 'r') as f:
    structure = parser.get_structure('myoglobin', f)

for model in structure:
    for chain in model:
        chain_id = chain.get_id()
        print(f"Chain {chain_id}: hetero residues")
        for residue in chain:
            hetfield, resseq, icode = residue.get_id()
            if hetfield != ' ':
                print(f"{residue.get_resname()} id={residue.get_id()}")