Data Insights

You must have completed the Remote Computing Certification before working on this.

In the previous certification, you built a virtual microscope. You successfully simulated 100 Argon atoms and generated a trajectory file (simulation.zarr). However, a trajectory file is just millions of coordinates that are meaningless to a human observer. To do science, we must reduce this high-dimensional data into interpretable physical observables.

This certification challenges you to bridge the gap between simulation and reality. You will process your raw simulation data to compute the radial distribution function (RDF), a fundamental metric in structural biology and statistical mechanics. You will then validate your simulation by comparing your results directly against experimental X-ray diffraction data.

The Radial Distribution Function

How “structured” is a fluid? In a solid crystal, atoms sit in perfect, predictable lattices. In an ideal gas, they are randomly distributed with no correlation. Real fluids lie somewhere in between.

The Radial Distribution Function, denoted as $g(r)$ , quantifies this structure. It answers a simple probabilistic question: Given an atom at the origin, what is the probability of finding another atom at a distance $r$ ?

Mathematically, it is defined as the observed local density at distance $r$ normalized by the average bulk density of the system:

g(r) = \frac{\rho(r)}{\rho_{bulk}}

$g(r) = 0$ : The atoms are too close. Strong repulsive forces (the Pauli exclusion principle) prevent overlap.
$g(r) > 1$ : The atoms are clustering. This usually happens at the “van der Waals radius,” forming a coordination shell.
$g(r) = 1$ : No correlation. The atoms effectively do not “feel” each other (long range).

The Assignment

You are required to write a single Python script named analyze-simulation.py. This script must perform the following three tasks.

Run the Production Simulation

Your previous simulation was a mere proof of concept. To extract meaningful physical properties, you need a system large enough to mimic bulk matter and a duration long enough to capture equilibrium statistics. You are moving from testing code to generating production data.

Update your simulation script parameters to reflect a physical liquid. Set the number of particles to 216 and the box size to 21.66 Angstroms. These values reproduce the density of liquid Argon at your target temperature of 85.0 Kelvin. Most importantly, increase the simulation length to 100,000 steps. This duration ensures the system relaxes into equilibrium and explores enough unique configurations to produce smooth, reliable data.

Keep all other force field parameters identical to your Remote Certification work. Do not execute this script locally or on the login node. This calculation requires significant computational power and will stall the shared environment for other users. Submit the job to the cluster. The resulting trajectory file is the raw material for your entire analysis; ensure it is generated correctly before proceeding.

Compute the RDF

You must now transform raw coordinate data into a statistical measure of probability. Mathematically, this process compares the observed distribution of particles in your simulation against the expected distribution of a structureless ideal gas.

Your first task is to compute the Euclidean distance, $r_{ij}$ , between every unique pair of particles $i$ and $j$ in the system. Because your simulation exists within a periodic box of length $L$ , you cannot simply subtract coordinates. You must apply the Minimum Image Convention to find the shortest path between particles, wrapping around boundaries where necessary. For a displacement vector $\mathbf{r}_{ij}$ , the periodic distance is corrected as:

\mathbf{r}_{ij}' = \mathbf{r}_{ij} - L \cdot \text{round}\left(\frac{\mathbf{r}_{ij}}{L}\right)

This ensures that no distance exceeds $L/2$ . Once you have these corrected distance vectors, you compute their magnitudes to populate a histogram $H(r)$ . This histogram counts how many particle pairs fall into discrete bins of width $\Delta r$ , effectively integrating the local density over small spherical shells.

However, a raw count of neighbors is physically meaningless on its own. In a completely random gas, the number of neighbors found at a distance $r$ increases naturally because the volume of the spherical shell grows larger the further you move from the center. To isolate the actual physical structure, you must normalize your histogram by this geometric growth.

You calculate the expected number of particles in an ideal gas, $N_{ideal}$ , for a given shell. This requires the exact volume of the spherical shell defined by the bin edges $r$ and $r + \Delta r$ . While simple calculus approximates this as $4\pi r^2 dr$ , the discrete nature of computational binning requires the precise difference between two spheres:

V_{shell}(r) = \frac{4}{3}\pi \left[ (r + \Delta r)^3 - r^3 \right]

Multiplying this volume by the system’s bulk number density, $\rho = N/L^3$ , gives you the normalization factor for that specific bin. The Radial Distribution Function $g(r)$ is the ratio of your observed counts to this ideal expectation, averaged over the total number of particles $N$ and the total number of frames $M$ :

g(r) = \frac{H(r)}{M \cdot N \cdot \rho \cdot V_{shell}(r)}

Comparison with Experiment

You will compare your simulation against historical X-ray diffraction data for Argon at 85 K. Download argon-85k-g(r).csv, which contains $r$ in Angstroms and experimental $g(r)$ values adapted from:

Yarnell, J. L., Katz, M. J., Wenzel, R. G., & Koenig, S. H. (1973) “Structure Factor and Radial Distribution Function for Liquid Argon at 85 °K.” Physical Review A, 7(6), 2130. DOI: 10.1103/PhysRevA.7.2130.

Below are the first few lines of the CSV file.

r,g(r)
2.7921,0.0
2.8602,0.0
2.9283,0.0
2.9964,0.0112
3.0645,0.0279
3.1326,0.0276

Your script must generate a publication-quality plot (rdf_plot.png) using matplotlib that overlays your computed $g(r)$ (as a dashed line) with the experimental data (as a solid line).

Tip

Below is a reference figure for what it should look like. Your figure should look comparable, but obviously do not submit this image for your assignment. Reference image

The Skeleton Code

Here is the skeleton code to get you started.

analyze-simulation.py

import os

try:
    import numpy as np
    import numpy.typing as npt
except ImportError as e:
    raise ImportError("Make sure you have numpy installed") from e

try:
    import polars as pl
except ImportError as e:
    raise ImportError("Make sure you have polars installed") from e

try:
    import zarr

    if int(zarr.__version__[0]) < 3:
        raise EnvironmentError(
            "zarr needs to be at least v3.0.0. You have {zarr.__version__}."
        )
except ImportError as e:
    raise ImportError("Make sure you have zarr installed") from e

try:
    import matplotlib.pyplot as plt
except ImportError as e:
    raise ImportError("Make sure you have matplotlib installed") from e

os.chdir(path=os.path.dirname(os.path.abspath(__file__)))

# -----------------------------------------------------------------------------
# Configuration Constants
# -----------------------------------------------------------------------------


N_PARTICLES: int = 216
BOX_SIZE: float = 21.66  # Angstroms
ZARR_PATH: str = "../remote/simulation.zarr"
EXPERIMENTAL_FILENAME: str = "./argon-85k-g(r).csv"
PLOT_FILENAME: str = "./rdf-plot.png"
DR: float = 0.1  # Bin width in Angstroms

# -----------------------------------------------------------------------------
# Core Functions
# -----------------------------------------------------------------------------


def compute_rdf(
    zarr_file: str,
    box_size: float,
    dr: float,
    equilibration_fraction: float = 0.2,
    batch_size: int = 50,
) -> tuple[npt.NDArray[np.float64], npt.NDArray[np.float64]]:
    r"""
    Computes the Radial Distribution Function (RDF), g(r), from molecular
    dynamics trajectory data.

    This function calculates the probability of finding a particle at a
    distance r from a reference particle, normalized by the density of an
    ideal gas. It utilizes vectorized NumPy operations and batch processing
    to handle large datasets efficiently.

    Args:
        zarr_file: Path to the Zarr file containing the simulation trajectory.
            Expected structure is a dataset named "positions" with shape
            (n_frames, n_particles, 3).
        box_size: The length of the cubic simulation box side in Angstroms ($L$).
        dr: The width of the bins for the histogram in Angstroms. Smaller widths give
            higher resolution but require more data to reduce noise.
        equilibration_fraction: The fraction of trajectory frames (0.0 to 1.0)
            to discard from the beginning of the simulation as "burn-in".
        batch_size: The number of frames to process simultaneously in vector
            operations. Larger batches increase speed but consume more RAM.

    Returns:
        Array of shape (n_bins,) containing the radial distance $r$ for each bin.

        Array of shape (n_bins,) containing the computed RDF values.

    Raises:
        FileNotFoundError: If the specified `zarr_file` does not exist.
        KeyError: If the Zarr file does not contain a "positions" dataset.
    """
    if not os.path.exists(zarr_file):
        raise FileNotFoundError(f"The trajectory file '{zarr_file}' was not found.")

    store = zarr.open(zarr_file, mode="r")

    if "positions" not in store:
        raise KeyError(
            f"The file '{zarr_file}' does not contain a 'positions' dataset."
        )

    positions = store["positions"][:]

    n_steps, n_particles, _ = positions.shape
    start_frame = int(n_steps * equilibration_fraction)
    positions = positions[start_frame:]
    n_frames = ...

    max_r = ...
    bins = ...
    hist_counts = np.zeros(len(bins) - 1, dtype=np.float64)

    for i in range(0, n_frames, batch_size):
        batch_pos = ...
        delta = ...
        dist_sq = ...

        identity_indices = ...
        dist_sq[:, identity_indices, identity_indices] = np.inf

        dist = ...

        batch_counts, _ = ...
        hist_counts += batch_counts

    bin_centers = ...
    g_r = ...

    return bin_centers, g_r


def save_rdf_to_zarr(zarr_path: str, r: npt.NDArray, g_r: npt.NDArray) -> None:
    """
    Saves the computed RDF analysis results back to the Zarr store.

    This functions creates (or updates) a group named 'analysis' within the Zarr
    hierarchy and saves the radial distances and g(r) values. We expect
    `"rdf_r"` and `"rdf_g"` as the keys.

    Args:
        zarr_path: The path to the Zarr store.
        r: The array of radial distances (bin centers).
        g_r: The array of computed g(r) values.
    """


def plot_rdf_from_zarr(zarr_path: str, exp_file: str, output_file: str) -> None:
    """
    Generates a comparison plot between the simulated RDF and experimental data.

    This function reads the previously computed RDF from the Zarr file and attempts
    to load experimental data from a CSV file. It produces a formatted plot
    saved to disk.

    Args:
        zarr_path: Path to the Zarr store containing the 'analysis' group.
        exp_file: Path to the CSV file containing experimental data.
            Expected columns: "r" and "g(r)".
        output_file: The filename to save the resulting plot.
    """


# -----------------------------------------------------------------------------
# Main Execution
# -----------------------------------------------------------------------------


def main() -> None:
    r, g_r = compute_rdf(ZARR_PATH, BOX_SIZE, DR)
    save_rdf_to_zarr(ZARR_PATH, r, g_r)
    plot_rdf_from_zarr(ZARR_PATH, EXPERIMENTAL_FILENAME, PLOT_FILENAME)


if __name__ == "__main__":
    main()

Deliverables

Submit the following to Gradescope.

analyze-simulation.py Your source code. It must be self-contained.
rdf-plot.png The visualization generated by your script.

Last updated on February 7, 2026