Version Control

Ticket: T04 due Feb 10 by 11:59 pm

You have likely experienced the panic of saving a file as final_paper.doc, then final_paper_v2.doc, and finally final_paper_fixed_typo_REAL.doc. This method of managing changes is fragile. It clutters your file system, obscures the history of what actually changed, and makes it nearly impossible to return to a specific point in time without manually opening a dozen files. In computational biology, this chaos is unacceptable. When you are writing code that analyzes genomic data or models protein structures, you need to know exactly which version of the script produced a specific result. If your code works today but breaks tomorrow, you need to know exactly what lines changed between those two moments. We solve this problem with version control.

The Time Machine

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. It is the lab notebook of the digital world. In a wet-lab notebook, you never erase an entry; you cross it out and write the new observation below it. Version control works the same way: it keeps a durable, auditable history of your project in which earlier states remain recoverable.

The industry-standard tool for this is Git. Sync services such as Google Drive or Dropbox show you only the latest copy of each file and keep, at most, a limited per-file version history. Git instead saves “snapshots” of your entire project at moments you choose. This allows you to travel back in time to any previous state. If you delete a critical function by accident, you can recover it. If your code stops working after an update, you can compare the current version to the last working one and quickly narrow down the change that introduced the error.

Git does not just save files; it saves context. Each snapshot records who made the change and when they made it, and by comparing two snapshots Git can show you exactly which lines of text were modified.
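
To make “exactly what lines were modified” concrete, here is a minimal Python sketch that compares two versions of a script using the standard-library difflib module. It is an illustration only, not how Git computes or stores differences internally. The gc_content function and the file labels are hypothetical, invented for this example.

```python
import difflib

# Two hypothetical versions of the same script, as lists of lines.
old_version = [
    "def gc_content(seq):\n",
    "    gc = seq.count('G') + seq.count('C')\n",
    "    return gc / len(seq)\n",
]
new_version = [
    "def gc_content(seq):\n",
    "    seq = seq.upper()\n",
    "    gc = seq.count('G') + seq.count('C')\n",
    "    return gc / len(seq)\n",
]

# Build a unified diff: the same +/- line format that version control tools display.
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="analysis.py (yesterday)",
        tofile="analysis.py (today)",
    )
)

# Separate real additions and removals from the "+++" and "---" header lines.
added = [l for l in diff_lines if l.startswith("+") and not l.startswith("+++")]
removed = [l for l in diff_lines if l.startswith("-") and not l.startswith("---")]

print("".join(diff_lines))
```

Lines prefixed with + were added and lines prefixed with - were removed. The output of git diff uses the same convention.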

The Mechanics

To understand how Git works, it helps to imagine you are a photographer. Your project folder is the scene you are trying to capture. It is messy and constantly changing as you edit files. Git does not automatically save every single keystroke, because that would create too much noise. Instead, you must deliberately choose when to take a “picture” of your code.

This process happens in three distinct stages. First, you have the working directory, where you are actively editing your files. When you are satisfied with a specific change (e.g., you have finished writing a function to parse a PDB file), you move that change to the staging area (also called the Index). This is like framing your shot in the camera’s viewfinder. You are selecting exactly which changes you want to include in the next snapshot. Finally, you perform a commit. This is the act of pressing the shutter button. It takes the files currently in the staging area and permanently stores them in the repository as a snapshot.
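
The three stages can be sketched as a toy model in Python. This is a deliberately simplified illustration under stated assumptions, not Git's actual storage format: plain dictionaries stand in for the working directory and staging area, a list stands in for the repository, and the file names, author, and messages are hypothetical.

```python
import hashlib
import json

working_directory = {}  # filename -> text you are actively editing
staging_area = {}       # filename -> text selected for the next snapshot
repository = []         # committed snapshots, oldest first


def stage(filename):
    """Frame the shot: copy one file's current content into the staging area."""
    staging_area[filename] = working_directory[filename]


def commit(message, author):
    """Press the shutter: store the staged files as a new snapshot."""
    parent_id = repository[-1]["id"] if repository else None
    record = {
        "parent": parent_id,
        "author": author,
        "message": message,
        "snapshot": dict(staging_area),  # copy, so later staging cannot alter it
    }
    # Toy identifier: a SHA-1 hash of the record's contents.
    record["id"] = hashlib.sha1(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()
    repository.append(record)
    return record["id"]


# Edit two files, but stage and commit only the parser.
working_directory["parse_pdb.py"] = "def parse_pdb(path):\n    return []\n"
working_directory["notes.txt"] = "scratch thoughts\n"
stage("parse_pdb.py")
first_id = commit("Add PDB parser stub", author="student")

# Keep editing; the new text is not recorded until it is staged and committed.
working_directory["parse_pdb.py"] = (
    "def parse_pdb(path):\n    return open(path).readlines()\n"
)
stage("parse_pdb.py")
second_id = commit("Read lines from the PDB file", author="student")

for record in repository:
    print(record["id"][:7], record["message"], sorted(record["snapshot"]))
```

Notice that notes.txt never appears in a snapshot because it was never staged, and that the first snapshot still holds the original parser after the second commit. In real Git, the two steps correspond to the git add and git commit commands.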

The Cloud

It is crucial to understand that Git and GitHub are not the same thing. Git is the software that runs locally on your computer (the camera). GitHub is a website where you upload your snapshots (the photo album).

While Git allows you to track history on your own machine, hosting services like GitHub (or alternatives like GitLab and Codeberg) allow you to share that history with others. This is essential for the team-based Sprints you will perform later in this course. By pushing your local commits to a remote server, you create a backup of your work that exists independently of your laptop. If your computer crashes, your code is safe. Furthermore, these platforms allow multiple people to work on the same codebase simultaneously, merging their unique “snapshots” into a single, cohesive history.
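
The backup role of a remote can be sketched in a few lines of Python. This is a toy model under loud assumptions: commits are plain dictionaries with made-up ids, and the push and clone functions here are hypothetical helpers that only copy records. Real Git performs additional checks when pushing, which this sketch ignores.

```python
import copy


def push(local_commits, remote_commits):
    """Send the remote every local commit it does not already have."""
    known_ids = {c["id"] for c in remote_commits}
    sent = 0
    for c in local_commits:
        if c["id"] not in known_ids:
            remote_commits.append(copy.deepcopy(c))
            sent += 1
    return sent


def clone(remote_commits):
    """Create an independent local copy of the remote history."""
    return copy.deepcopy(remote_commits)


# Hypothetical history on your laptop and an empty remote.
local = [
    {"id": "a1", "message": "Add PDB parser"},
    {"id": "b2", "message": "Handle lowercase sequences"},
]
remote = []

sent_first = push(local, remote)   # both commits are new to the remote

local.append({"id": "c3", "message": "Fix off-by-one in residue index"})
sent_second = push(local, remote)  # only the newest commit is sent

local = []                         # the laptop crashes
recovered = clone(remote)          # the history survives on the remote

print(sent_first, sent_second, [c["id"] for c in recovered])
```

The design point is that the remote holds its own full copy of the history, so losing the laptop loses only the work that was never pushed.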

The “Free” Tier and Data Privacy

In the software industry, there is a common adage: “If the product is free, you are the product.” When you push code to a third-party host, you must agree to their Terms of Service (ToS). It is critical to read these terms, as they distinguish sharply between public and private repositories.

  • Public Repositories: By making code public on platforms like GitHub, you generally grant the host the right to “parse” and “analyze” that data on their servers. Public code has been used to train Large Language Models (LLMs) such as OpenAI’s Codex, the model behind the original GitHub Copilot. The Terms of Service also include an “Access Reciprocity” clause, which implies that if you scrape GitHub data to train your own AI, you must allow GitHub to do the same to you. This has sparked significant ethical debate regarding “copyright laundering,” where AI models generate code based on open-source work without adhering to the original license.
  • Private Repositories: For private code, GitHub’s Terms of Service state that the content is “Confidential Information.” However, this privacy is not absolute. The Terms explicitly allow GitHub personnel to access your private code for security scanning, to respond to support requests, or to comply with legal obligations. Furthermore, while GitHub states that it does not use private repositories for AI training by default, users must trust that GitHub and its parent company, Microsoft, will maintain this policy indefinitely.

Platform Risk and Self-Hosting

Relying on a single commercial entity creates platform risk. Terms of service can change, accounts can be suspended, and features can be paywalled. To mitigate this, many researchers and organizations choose self-hosting. Because Git is distributed (every copy of a repository carries the full history and no central server is required), you can run your own “GitHub” on a private server using open-source software like GitLab (Self-Managed) or Gitea.

  • The Upside: You own the data. No third party can train AI on your code, and no one can revoke your access.
  • The Downside: You are the systems administrator. If the server crashes or the hard drive fails, you are responsible for fixing it.

For those who want privacy without the maintenance burden, non-profit, community-owned platforms like Codeberg (hosted in the EU) offer a middle ground, promising no tracking and no AI training on your data.

Resources

To master these tools, you must read the documentation. We recommend the following resources for their clarity and rigor.

  • Pro Git Book. This is the definitive guide to Git. Read chapters one to three for a deep dive into the basics.
  • GitHub Skills. An interactive, browser-based way to learn the workflow without installing anything initially.
  • Learn Git Branching. A visual, interactive game that challenges you to solve puzzles using Git commands. This is excellent for building mental models of how “commits” and “branches” relate to one another.