Science Stack

Ticket: T02 due Jan 20 by 11:59 pm

Biology has left the test tube. Today, the most groundbreaking discoveries, from folding proteins with AI to mapping the evolutionary history of viruses, do not happen on a laboratory bench. They happen in the cloud, on supercomputers, and in the terminal. To operate in this new world, you need more than a hypothesis. You need a Science Stack.

In a traditional “wet lab,” you rely on a standardized set of physical tools: pipettes, centrifuges, and reagents. If you do not know how to calibrate a pipette, your experiment fails. The “dry lab” is no different. The Science Stack is our digital toolkit. It is the collection of programming languages, file formats, libraries, and visualization tools that we use to perform scientific computing.

This brief introduces the industry-standard tools that power modern computational biology. Our goal is to move you away from the “scripting” mindset and toward the Research Software Engineering mindset. We do not just run code; we build robust, reproducible pipelines that generate truth.

In software engineering, a “stack” refers to the layers of technologies used to build an application. In science, our stack has four distinct layers:

  1. The Data Flow: The lifecycle of information as it moves from raw simulation to polished insight.
  2. The Glue: The programming languages (like Python) that connect our tools.
  3. The Formats: The specific ways we store numbers and text on a disk (like Parquet or Zarr).
  4. The Lens: The visualization tools we use to see our results (like Matplotlib).

Mastering this stack distinguishes a novice from a professional. A novice struggles to open a large file in Excel; a professional streams it through a binary pipeline. A novice creates a graph by clicking buttons; a professional generates it with code that can be run a thousand times without a single error.

Let us begin by understanding how data moves through this system.

The Scientific Data Lifecycle

Imagine you have just discovered a massive, subterranean cavern filled with gold ore. You are rich, but you are not yet wealthy. You cannot walk into a store and buy groceries with a jagged, dirty rock. To unlock the value, you must mine the ore, refine it into pure metal, and finally craft it into a recognizable currency or jewelry.

Computational science operates on this exact principle. We do not find answers lying on the ground; we find data. Often, this data is raw, messy, and voluminous. Your success as a scientist hinges on your ability to build a reliable factory—a pipeline—that turns this raw resource into scientific insight. We call this the Data Flow. It is the backbone of the “Science Stack.”

Understanding this flow solves the most paralyzing problem new researchers face: the “blank page” syndrome. When you stare at a terminal window knowing you need to test a hypothesis, the task feels overwhelming. By breaking the process into three distinct stages, Generation, Processing, and Analysis, you transform an impossible mountain into a series of climbable hills.

Phase 1: Data Generation

The first step is acquisition. In the physical sciences, this might mean running an experiment on a bench. In computational biology, we create our own digital experiments. We use software to simulate the physical world.

Data Generation is the act of producing raw information using specialized computational tools. This stage is notoriously chaotic. We often call it the “Wild West” of computing because there is rarely a single standard. One researcher might write a simulation engine in C++ that outputs binary files; another might write a tool in Fortran that spits out massive text logs.

Your primary responsibility here is selection and configuration. You must consult the literature to find the right tool for your specific biological question. You must learn its quirks, its inputs, and its parameters. You run the software, and it creates a file.

However, do not expect this file to be friendly. Developers of scientific software often prioritize speed and complex physics over user experience. The output might be scattered across thousands of small files, or the one number you need might be buried inside millions of lines of debugging text. This is normal. The goal of this phase is quantity and accuracy of the raw signal, not neatness.

Key Idea: The Garbage-In, Garbage-Out Principle
The quality of your raw data strictly limits the quality of your final insight. No amount of clever analysis can fix a simulation run with incorrect parameters. Validate your generation tools before moving forward.

Phase 2: Data Processing

You now have a hard drive full of messy simulation logs. If you tried to graph this data immediately, you would fail. The formats are inconsistent, the files are too large to open in Excel, and the data you need is trapped between lines of gibberish.

We must refine the ore. This stage is called Data Processing, often referred to informally as data wrangling. This is the bridge between having data and understanding it. Your goal is to extract the relevant information from the raw output and reorganize it into a structured, standardized format.

This is often the most time-consuming part of the science stack, yet it is the most critical. You must write scripts (usually in languages like Python or Rust) that open the raw files, find the specific numbers you care about, and save them into a clean table or array. This process allows you to spot failures early. Did the simulation crash halfway through? Is the temperature value reading -9999? Processing is your quality control checkpoint.

Example

Let’s look at a concrete example to build intuition. Imagine you ran a simulation to calculate the energy of a protein, and the software gave you a text file called sim_output.log. The file looks like this:

STEP 001: System initializes...
STEP 002: Temp = 300K, Energy = -50.2 kcal/mol
STEP 003: WARNING: bond length deviation
STEP 004: Temp = 301K, Energy = -52.1 kcal/mol
...

You cannot graph text. You need a list of energy values, [-50.2, -52.1], so you write a Python script for this task (a minimal sketch follows the steps below).

  1. Read: Your script opens the file and looks at it line by line.
  2. Filter: The script ignores lines that don’t contain the word “Energy.” (It skips Step 001 and Step 003).
  3. Extract: When it finds “Energy,” it splits the sentence and grabs the number immediately following the equals sign.
  4. Store: It saves these numbers in a clean CSV (Comma-Separated Values) file that looks like a simple spreadsheet.
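
Here is one way such a script might look. The input name sim_output.log comes from the example above; the output name energies.csv and the exact parsing logic are illustrative assumptions, not a fixed recipe.

import csv

energies = []
with open("sim_output.log") as log:            # Read: walk the raw log line by line
    for line in log:
        if "Energy" not in line:               # Filter: skip lines without an energy value
            continue
        # Extract: grab the number immediately after "Energy ="
        value = line.split("Energy =")[1].split()[0]
        energies.append(float(value))

# Store: write the clean values to a simple CSV file
with open("energies.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["energy_kcal_per_mol"])
    writer.writerows([[e] for e in energies])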

By the end of this phase, you should have a clean dataset. You have stripped away the noise, leaving you with the signal.

Key Idea: Separate Processing from Analysis
Never try to “clean” your data inside your analysis scripts. Processing should happen once, creating a permanent, clean file. This ensures that when you tweak your graphs later, you don’t have to re-parse massive log files every time.

Phase 3: Data Analysis

Now you have the refined gold. It is time to craft the ring. Data Analysis is the stage where you interrogate your clean data to answer your scientific question.

Because you invested time in the Processing phase, this part becomes a joy rather than a chore. You load your clean, structured data and apply statistical methods. You might calculate averages, determine standard deviations, or train a machine learning model.

This is also where Visualization occurs. You create plots and figures that communicate your findings to the world. A common mistake novices make is assuming the graph is the analysis. The graph is merely the representation of the analysis. The core work is the mathematical testing of your hypothesis.

If you find an anomaly here (e.g., a trend that defies physics), you can easily trace your steps back. Was it a processing error? Go back to Phase 2. Was the simulation set up wrong? Go back to Phase 1. Because you built a modular pipeline, you can fix a single stage without destroying the entire structure.

The Glue Language

In computational science, we face a fundamental trade-off. We are constantly torn between two opposing needs: human efficiency and machine efficiency.

We want to write code quickly. We want the syntax to be readable, forgiving, and close to plain English. But we also need our programs to run fast. When simulating millions of molecules or training a massive neural network, a delay of milliseconds per calculation adds up to weeks of lost time.

Historically, no single programming language could give us both. This dilemma is known as the Three-Language Problem. It segments our world into three distinct tiers:

  1. The Interface (Glue): Languages like Python. They are easy to learn and write, but they are computationally slow.
  2. The Systems: Languages like C++ and Rust. They are incredibly fast and memory-efficient, but they are difficult to learn and cumbersome to write.
  3. The Accelerators: Languages like CUDA or OpenCL. They talk directly to hardware accelerators such as GPUs, offering raw, blazing speed, but they require deep expertise to manage.

Alex’s Soap Box

Mojo is a new programming language designed to solve the Three-Language Problem. It unifies the stack. You can write high-level scripts and low-level system code in the same file. Because of this potential, I am currently transitioning my entire science stack to Mojo. While it will take years for the broader scientific community to catch up, adopting this technology now provides a massive strategic advantage. In both research and the startup world, being able to build faster, more efficient tools than your competitors can help get that sweet, sweet venture capital.

If you walk into any computational lab today, you will find that Python is the undisputed winner. It is the standard “glue” language of modern science.

We call it a Glue Language because its primary job is not to do the heavy lifting itself, but to stick different powerful tools together. Python excels at this because it prioritizes you, the human. It handles the boring details of memory management and system calls so you can focus on the biology.

You might ask: If Python is slow, why do we use it for high-performance supercomputing?

The answer lies in how we use it. When you run a heavy calculation in Python, Python isn’t actually doing the math. It is merely the steering wheel. Under the hood, Python passes that command down to a highly optimized library written in a systems language like C, C++, or Rust.

These libraries are the engines. They are written by systems programmers who enjoy obsessing over memory layout and processor instructions so that you don’t have to. You write a simple line of Python, and it triggers a blazing-fast C++ routine deep inside the machine.
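
As a rough illustration of this division of labor, compare summing a million numbers with a plain Python loop against handing the same array to NumPy’s compiled routine. The exact timings will vary from machine to machine; the point is that the Python line merely dispatches the work.

import time
import numpy as np

values = list(range(1_000_000))
array = np.arange(1_000_000)

start = time.perf_counter()
total = 0
for v in values:               # pure Python: the interpreter handles every addition itself
    total += v
python_seconds = time.perf_counter() - start

start = time.perf_counter()
total = array.sum()            # NumPy: one call dispatches the whole sum to optimized C code
numpy_seconds = time.perf_counter() - start

print(f"Python loop: {python_seconds:.4f} s, NumPy: {numpy_seconds:.4f} s")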

Key Idea: The Steering Wheel vs. The Engine
Think of Python as the steering wheel and the underlying libraries (like NumPy or PyTorch) as the engine. The steering wheel doesn’t make the car move; it directs the power. You can drive a Formula 1 car with a comfortable leather steering wheel. Python allows you to “drive” complex, high-performance C++ code with a simple, comfortable syntax.

Note

You will inevitably encounter R. It is a language built specifically for statistics and data visualization. In certain niches of biology—particularly bioinformatics and ecology—R is deeply entrenched. Many researchers use it because they were trained on specific R-based tools (like ggplot2 or Bioconductor) that were standard in their labs.

There is nothing wrong with using R for these specific tasks. However, it is essential to recognize its limits. R is a specialized tool for a specialized trade. Outside of specific statistical niches, Python dominates the entire landscape. From running web servers and automating cloud infrastructure to training state-of-the-art AI models, Python is the universal language.

Data Types and Formats

In science, data is rarely a shapeless blob. It almost always has a structure. How you interact with that structure dictates which tools you must use. If you try to hammer a screw, you will ruin the wall. Similarly, if you try to process a massive grid of numbers using a tool designed for text, your computer will freeze, and your research will stall.

We generally categorize scientific data into two primary shapes: Numerical and Tabular.

Numerical Data

The physical world is often represented as a grid. A digital image is a 2D grid of pixels. A simulation of the atmosphere is a 3D grid of temperature, pressure, and wind velocity. In mathematics and computer science, we call these grids Arrays or Tensors.

NumPy

When working with these grids in Python, NumPy (Numerical Python) is the undisputed foundation. It is the bedrock upon which almost all other scientific Python tools are built.

NumPy provides the ndarray (N-dimensional array) object. While a standard Python list is flexible but slow, a NumPy array is rigid but incredibly fast. It forces the computer to store data in a contiguous block of memory, allowing the processor to crunch numbers in bulk.

Example

Imagine you want to multiply two lists of numbers: [1, 2, 3] and [4, 5, 6].

  1. Python List: You must write a loop. Python checks the first number, checks if it is an integer, multiplies it, then moves to the next. It is a slow, manual process.
  2. NumPy Array: You simply type a * b. NumPy sends a single instruction to the CPU to multiply the entire block of memory at once. It is instant.
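
In code, that contrast looks roughly like this:

import numpy as np

a_list, b_list = [1, 2, 3], [4, 5, 6]
slow = [x * y for x, y in zip(a_list, b_list)]   # Python loop: one element at a time

a, b = np.array(a_list), np.array(b_list)
fast = a * b                                     # NumPy: one operation on the whole memory block
print(slow, fast)                                # [4, 10, 18] [ 4 10 18]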

You will encounter other libraries, such as PyTorch and JAX. These are specialized cousins of NumPy designed to run on Accelerators like GPUs (Graphics Processing Units). While powerful, they are primarily used for machine learning applications that require massive parallel processing. For general scientific data analysis, NumPy remains the standard.

The Storage Problem

Calculating data is easy; saving it is hard. You cannot simply write a massive 3D array of floating-point numbers to a text file. It would be enormous and imprecise. We need binary formats.

  1. NumPy Arrays (.npy): The simplest method. It saves the array exactly as it sits in memory. It is fast but not designed for massive scale or sharing between different programming languages.
  2. HDF5 (.h5): For years, this was the gold standard. It acts like a file system inside a single file, allowing you to organize data hierarchically. However, HDF5 is fragile. If your script crashes while writing, the entire file can become corrupt and unreadable. It also struggles with the cloud.
  3. NetCDF: A long-time standard in climate science and oceanography. It is reliable but feels dated.
  4. Zarr: The modern contender. Zarr fixes the fragility of HDF5. It breaks data into small, separate “chunks.” It is also designed for the cloud, making it the superior choice for modern workflows.
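
A minimal sketch of the first and last options, assuming the zarr package is installed (call names may differ slightly between zarr versions):

import numpy as np
import zarr

data = np.random.rand(100, 100, 100)    # a 3D grid of floating-point numbers

np.save("data.npy", data)               # .npy: dump the array exactly as it sits in memory
restored = np.load("data.npy")

zarr.save("data.zarr", data)            # Zarr: chunked, cloud-friendly storage
restored = zarr.load("data.zarr")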

Key Idea: Volatility vs. Permanence
RAM is fast but volatile (it vanishes when the power goes out). Disk is slow but permanent. Your choice of file format (Zarr vs. HDF5) determines how efficiently you can move data from the slow disk to the fast RAM without corruption.

Tabular Data

If Numerical Data is a grid, Tabular Data is a table. Think of a spreadsheet with rows and columns. In biology, this is everywhere: a list of patients with their age, gene expression levels, and diagnosis.

The CSV Trap

Novices often rely on CSV (Comma-Separated Values) files. They are human-readable and easy to open in Excel. However, for datasets with millions of rows, CSVs are a disaster.

A CSV is just text. The computer doesn’t know that “75.5” is a number. It has to read the text, figure out it’s a number, and convert it. Doing this for a million rows is incredibly slow. Furthermore, storing the number 123456789 in binary takes 4 bytes. Storing it as text takes 9 bytes.
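
You can check that size difference directly; this snippet is purely illustrative:

import struct

as_binary = struct.pack("<i", 123456789)   # 4 bytes: a fixed-width 32-bit integer
as_text = str(123456789).encode()          # 9 bytes: one byte per digit character
print(len(as_binary), len(as_text))        # 4 9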

Parquet and Arrow

To handle large tables efficiently, we use a two-part system:

  1. Apache Parquet (Disk): This is how we save the data. Parquet is a binary format that stores data by column rather than by row. This allows for massive compression. If a column contains “Human” for 1,000 rows, Parquet compresses that redundancy to almost nothing.
  2. Apache Arrow (Memory): This is how we load the data. Arrow is a standardized memory format. It allows different programs to share data without copying it.
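
A minimal sketch of the pair in action, assuming the pyarrow package is installed (the column names and file name are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# Arrow: a column-oriented table living in memory
table = pa.table({"species": ["Human"] * 3, "expression": [0.8, 1.2, 0.5]})

# Parquet: the same columns written to disk with compression
pq.write_table(table, "expression.parquet")
reloaded = pq.read_table("expression.parquet")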

The Tooling: Pandas vs. Polars

For over a decade, pandas has been the mainstay tool for analyzing tabular data in Python. It is powerful and ubiquitous. You will see it in almost every tutorial online. However, pandas was built for a different era. It is memory-hungry and often slow because it cannot easily use all the cores of your CPU.

We strongly recommend Polars.

Polars is a modern DataFrame library written in Rust, designed as a high-performance replacement for pandas. Polars is multi-threaded: if you have an 8-core CPU, Polars uses all of them, while pandas typically uses only one. Polars also uses “Lazy Evaluation”: when you ask it to filter data, it examines your entire query, optimizes the plan, and only then executes it. This saves both memory and time.
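
Here is a small sketch of what a lazy Polars query looks like in recent versions of the library (the file and column names are assumptions for illustration):

import polars as pl

result = (
    pl.scan_parquet("expression.parquet")      # lazy: nothing is read yet
    .filter(pl.col("expression") > 0.7)        # operations only extend the query plan
    .group_by("species")
    .agg(pl.col("expression").mean())
    .collect()                                 # optimize the plan, then execute it once
)
print(result)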

Key Idea: The Text Bottleneck
Avoid text files (CSV, TSV, TXT) for data storage whenever possible. Text is for humans; binary (Parquet, Zarr) is for machines. Using binary formats removes the expensive step of “parsing” every time you load your data.

Visualization: The Lens

You generated data, cleaned it into a table, and analyzed the statistics. But you still haven’t seen the answer.

The final layer of the Science Stack is Visualization. This is where we translate abstract arrays of numbers into patterns that the human brain can understand. In computational biology, a graph is not just a pretty picture for a slide deck; it is a diagnostic tool. A histogram can reveal if your simulation crashed; a scatter plot can show if your model is performing well.

Matplotlib

Just as NumPy is the foundation for calculation, Matplotlib is the foundation for visualization in Python.

It is likely the oldest and most battle-hardened plotting library in the scientific ecosystem. It is not famous for being “easy” or “beautiful” by default. It is famous for being controllable.

Modern libraries (like Seaborn or Altair) are like digital cameras: they have an “Auto” mode that makes things look good instantly, but you can’t change much. Matplotlib is like a manual film camera. You have to manually set the focus, the aperture, and the shutter speed. It takes more code to draw a simple line, but you have atomic control over every single pixel.

For this course, you must use Matplotlib. We force you to learn the “hard way” first so that you understand exactly how a figure is constructed: layer by layer, axis by axis.

Key Idea: Figures are Objects
In Matplotlib, a plot isn’t a picture; it’s a container of code objects. A “Figure” contains “Axes” (the plot area), which contains “Lines” and “Labels.” You modify the graph by modifying these objects programmatically, not by clicking and dragging.
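
A minimal example of that object-oriented style (the plotted values are placeholders):

import numpy as np
import matplotlib.pyplot as plt

energies = np.array([-50.2, -52.1, -51.7, -53.0])   # placeholder values
steps = np.arange(1, len(energies) + 1)

fig, ax = plt.subplots()                  # the Figure container and one Axes object
ax.plot(steps, energies, marker="o")      # adds a Line object to the Axes
ax.set_xlabel("Simulation step")
ax.set_ylabel("Energy (kcal/mol)")
ax.set_title("Protein energy over time")
fig.savefig("energy.png", dpi=300)        # render the object tree to a file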
