T02 - Science Stack

Due: Tuesday, Jan 20 by 11:59 pm

In biological research, “raw data” is rarely a clean spreadsheet. Usually, it is a massive, messy text file generated by a simulation engine or a sequencer. You cannot analyze this data by hand; you must build a pipeline to extract it.

Your objective is to complete a Python script that automates the workflow: downloading simulation results, programmatically cleaning them, and generating a publication-quality visualization.

The Motivation

Imagine you are running a Molecular Dynamics (MD) simulation of a protein. The software (in this case, AMBER) wrote an output file containing thousands of lines of text. Hidden inside those lines is the Total Energy (Etot) of the system. We need to know whether this energy is stable over time, but the value we want is buried in a line that looks like this:

Etot = -120027.8285 EKtot = 22663.8926 EPtot = -142691.7210

If you have 50,000 such lines, you cannot copy-paste. You must write a script that “finds” the Etot, “captures” the number, and “plots” the trend.
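
As a rough illustration of the "find" and "capture" steps, here is a minimal sketch that pulls the Etot value out of the single sample line shown above using Python's `re` module. The helper name `extract_etot` and the exact pattern are assumptions made for this example, not part of the starter script, and the sketch does not decide which Etot lines in the real file should be skipped.

```python
import re
from typing import Optional

# Sample line copied from the text above.
SAMPLE_LINE = "Etot = -120027.8285 EKtot = 22663.8926 EPtot = -142691.7210"

# "Etot", optional whitespace around "=", then a signed decimal number.
# The parentheses form the capture group that holds the number itself.
ETOT_PATTERN = re.compile(r"\bEtot\s*=\s*(-?\d+\.\d+)")


def extract_etot(line: str) -> Optional[float]:
    """Hypothetical helper: return the Etot value in `line`, or None."""
    match = ETOT_PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1))


etot_value = extract_etot(SAMPLE_LINE)
print(etot_value)
```

Because the pattern allows any amount of whitespace around the equals sign, it should also match the wider column layout used in the real output file.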

Your Task

You must complete the provided starter script below by filling in each TODO. Before you start, make sure the following Python packages are installed: NumPy, Matplotlib, and gdown.

import os
import re

try:
    import numpy as np
except ImportError as e:
    raise ImportError(
        "Make sure you install numpy by running `pip install numpy`"
    ) from e

try:
    import gdown
except ImportError as e:
    raise ImportError(
        "Make sure you install gdown by running `pip install gdown`"
    ) from e

try:
    import matplotlib.pyplot as plt
except ImportError as e:
    raise ImportError(
        "Make sure you have matplotlib installed with `pip install matplotlib`"
    ) from e


os.chdir(path=os.path.dirname(os.path.abspath(__file__)))

URL_GDRIVE = (
    "https://drive.google.com/file/d/1MGUOezjVK4SNWZS0IQi8E_HPzTwD5jYm/view?usp=sharing"
)
PATH_OUTPUT = "amber-example.out"


def download_from_gdrive(url: str, path_output: str) -> None:
    """
    Downloads a file from a Google Drive sharing URL if it does not exist.

    Args:
        url: The full Google Drive file URL
            (e.g., https://drive.google.com/file/d/ID/view).
        path_output: Path to store the file located at `url`.
    """

    if os.path.exists(path=path_output):
        return

    file_id_match: re.Match[str] | None = re.search(r"/d/([a-zA-Z0-9_-]+)", url)

    if not file_id_match:
        raise ValueError("Could not extract File ID from the provided URL.")

    file_id: str = file_id_match.group(1)

    gdown.download(id=file_id, output=path_output, quiet=True)


download_from_gdrive(url=URL_GDRIVE, path_output=PATH_OUTPUT)

if not os.path.exists(path=PATH_OUTPUT):
    raise RuntimeError("Output file does not exist! Contact Alex.")

# Load the file at PATH_OUTPUT
# There are several lines in the output file that look like this.
# Etot   =   -120027.8285  EKtot   =     22663.8926  EPtot      =   -142691.7210

# TODO: Parse the file using a for loop and create a NumPy array
# containing all of the Etot values.
# Hint: there are some lines with Etot that should not be included.
# The NumPy datatype of the array should be float32.

# TODO: Save the array as an uncompressed `npz` file titled `etot.npz`.

# TODO: Create a plot using matplotlib.pyplot where the x-axis is labeled "Number of steps"
# and the y-axis is labeled "Total Energy".

# TODO: Save your figure as a PNG file with the name `etot.png`.
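
The saving and plotting TODOs above rely on a handful of NumPy and Matplotlib calls. The sketch below shows one way those calls fit together, using a few made-up energy values rather than anything parsed from the real file. The `demo_` file names and the `etot` key inside the archive are assumptions chosen for this example (the assignment does not specify a key name), and the `Agg` backend is selected only so the sketch runs on a machine without a display.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in values; the real array comes from parsing the output file.
etot_demo = np.array(
    [-120027.8, -120031.2, -120025.5, -120029.9], dtype=np.float32
)

# np.savez writes an uncompressed .npz archive
# (np.savez_compressed is the compressed variant).
out_npz = Path("demo_etot.npz")
np.savez(out_npz, etot=etot_demo)

# Reload the archive to confirm the dtype survived the round trip.
with np.load(out_npz) as data:
    loaded = data["etot"]

# Plot energy against the sample index and label both axes.
out_png = Path("demo_etot.png")
fig, ax = plt.subplots()
ax.plot(np.arange(len(loaded)), loaded)
ax.set_xlabel("Number of steps")
ax.set_ylabel("Total Energy")
fig.savefig(out_png, dpi=300)
plt.close(fig)
```

Reloading the archive before plotting is a cheap way to catch a wrong dtype or a missing key before you submit.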

Submission

You will submit your work on Gradescope. Your submission must include:

  1. The completed t02.py script.
  2. The generated etot.npz data file.
  3. The final etot.png visualization.

The instructor will run your script on a fresh machine. If it does not produce the correct figure and data file automatically, it will not pass.
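
Before uploading, it can help to confirm that all three files listed above are present in your working folder. The following is a minimal sketch of such a check. The helper `missing_submission_files` is a hypothetical name introduced here, and the demo runs against a temporary folder rather than your real one.

```python
import tempfile
from pathlib import Path

# File names taken from the submission list above.
REQUIRED_FILES = ("t02.py", "etot.npz", "etot.png")


def missing_submission_files(directory: str) -> list[str]:
    """Hypothetical helper: list required files not found in `directory`."""
    folder = Path(directory)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


# Demo: a temporary folder that contains only a placeholder script.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "t02.py").write_text("# placeholder\n")
    demo_missing = missing_submission_files(tmp)

print(demo_missing)
```

To check your own folder, call `missing_submission_files(".")` from the directory that holds your script; an empty list means nothing is missing. Note that this only checks that the files exist, not that their contents are correct.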

Tip

Below is a reference figure showing what your plot should look like. Your figure should look comparable, but do not submit this image for your assignment.

[Reference image]
