Data Types & Formats
In science, data is rarely a shapeless blob; it almost always has a structure, and that structure dictates which tools you should use. If you drive a screw with a hammer, you will ruin both the screw and the wall. Similarly, if you try to process a massive grid of numbers with a tool designed for text, your computer will grind to a halt, and your research will stall.
We generally categorize scientific data into two primary shapes: Numerical and Tabular.
Numerical Data
The physical world is often represented as a grid. A digital image is a 2D grid of pixels. A simulation of the atmosphere is a 3D grid of temperature, pressure, and wind velocity. In mathematics and computer science, we call these grids Arrays or Tensors.
NumPy
When working with these grids in Python, NumPy (Numerical Python) is the undisputed foundation. It is the bedrock upon which almost all other scientific Python tools are built.
NumPy provides the ndarray (N-dimensional array) object.
While a standard Python list is flexible but slow, a NumPy array is rigid but incredibly fast.
It forces the computer to store data in a contiguous block of memory, allowing the processor to crunch numbers in bulk.
Example
Imagine you want to multiply two lists of numbers: [1, 2, 3] and [4, 5, 6].
- Python List: You must write a loop. For each element, Python checks its type, multiplies it, and moves on to the next. It is a slow, manual process.
- NumPy Array: You simply write `a * b`. NumPy hands the whole contiguous block of memory to optimized machine code that multiplies it in bulk. It is effectively instant.
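The contrast above can be sketched in a few lines (the variable names here are illustrative):

```python
import numpy as np

# Pure-Python approach: an explicit loop, one element at a time.
a_list = [1, 2, 3]
b_list = [4, 5, 6]
product_list = [x * y for x, y in zip(a_list, b_list)]
print(product_list)  # [4, 10, 18]

# NumPy approach: one vectorized expression over the whole block of memory.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
product = a * b
print(product)  # [ 4 10 18]
```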
You will encounter other libraries, such as PyTorch and JAX. These are specialized cousins of NumPy designed to run on accelerators such as GPUs (Graphics Processing Units). While powerful, they are primarily used for machine learning workloads that require massive parallel processing. For general scientific data analysis, NumPy remains the standard.
The Storage Problem
Calculating data is easy; saving it is hard. You cannot simply write a massive 3D array of floating-point numbers to a text file. It would be enormous and imprecise. We need binary formats.
- NumPy Arrays (`.npy`): The simplest method. It saves the array exactly as it sits in memory. It is fast but not designed for massive scale or for sharing between different programming languages.
- HDF5 (`.h5`): For years, this was the gold standard. It acts like a file system inside a single file, allowing you to organize data hierarchically. However, HDF5 is fragile: if your script crashes while writing, the entire file can become corrupt and unreadable. It was also designed for local file systems, not the cloud.
- NetCDF: A long-time standard in climate science and oceanography. It is reliable but feels dated.
- Zarr: The modern contender. Zarr fixes the fragility of HDF5 by breaking data into small, separate “chunks,” so a failed write damages only one chunk rather than the whole file. It is also designed for cloud object storage, making it the superior choice for modern workflows.
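As a minimal illustration of the simplest option, here is a round trip through the `.npy` format (the file path is just a temporary location for the demo):

```python
import os
import tempfile

import numpy as np

# A 100x100 grid of floats, standing in for real scientific data.
arr = np.random.default_rng(seed=0).random((100, 100))

# Save as .npy: the array is written in binary, close to its in-memory layout.
path = os.path.join(tempfile.mkdtemp(), "grid.npy")
np.save(path, arr)

# Load it back: the round trip is exact, with no text parsing involved.
loaded = np.load(path)
print(np.array_equal(arr, loaded))  # True
```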
Key Idea: Volatility vs. Permanence. RAM is fast but volatile (its contents vanish when the power cuts); disk is slow but permanent. Your choice of file format (Zarr vs. HDF5) determines how efficiently and safely you can move data between the slow disk and the fast RAM.
Tabular Data
If Numerical Data is a grid, Tabular Data is a table. Think of a spreadsheet with rows and columns. In biology, this is everywhere: a list of patients with their age, gene expression levels, and diagnosis.
The CSV Trap
Novices often rely on CSV (Comma-Separated Values) files. They are human-readable and easy to open in Excel. However, for datasets with millions of rows, CSVs are a disaster.
A CSV is just text. The computer doesn’t know that “75.5” is a number; it has to read the characters, recognize that they form a number, and convert them. Doing this for a million rows is incredibly slow. Text is also wasteful: the number 123456789 takes 4 bytes as a 32-bit binary integer, but 9 bytes as text.
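The byte counts above can be checked directly, packing the number as a 32-bit integer with the standard `struct` module:

```python
import struct

n = 123456789

binary = struct.pack("<i", n)   # little-endian 32-bit signed integer
text = str(n).encode("ascii")   # the same number written out as text

print(len(binary))  # 4 bytes
print(len(text))    # 9 bytes
```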
Parquet and Arrow
To handle large tables efficiently, we use a two-part system:
- Apache Parquet (Disk): This is how we save the data. Parquet is a binary format that stores data by column rather than by row. This allows for massive compression. If a column contains “Human” for 1,000 rows, Parquet compresses that redundancy to almost nothing.
- Apache Arrow (Memory): This is how we load the data. Arrow is a standardized memory format. It allows different programs to share data without copying it.
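To see why column-wise storage compresses so well, here is a toy run-length encoder; Parquet’s real codecs are far more sophisticated, but the idea is the same:

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

# A "species" column with 1,000 identical entries...
species_column = ["Human"] * 1000

# ...collapses to a single (value, count) pair.
print(run_length_encode(species_column))  # [('Human', 1000)]
```

In a row-oriented file, those 1,000 values would be interleaved with every other field, so the redundancy would be invisible to the compressor.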
The Tooling: Pandas vs. Polars
For over a decade, pandas has been the mainstay tool for analyzing tabular data in Python. It is powerful and ubiquitous. You will see it in almost every tutorial online. However, pandas was built for a different era. It is memory-hungry and often slow because it cannot easily use all the cores of your CPU.
We strongly recommend Polars.
Polars is a modern DataFrame library written in Rust, designed as a high-performance replacement for pandas. Polars is multi-threaded: if you have an 8-core CPU, Polars uses all eight cores, while pandas typically uses only one. Polars also uses “lazy evaluation”: when you ask it to filter data, it examines your entire query, optimizes the plan, and only then executes it. This saves both memory and time.
Key Idea: The Text Bottleneck Avoid text files (CSV, TSV, TXT) for data storage whenever possible. Text is for humans; binary (Parquet, Zarr) is for machines. Using binary formats removes the expensive step of “parsing” every time you load your data.