Git Large File Storage: How To
Information on how to use Git LFS on JLU GitLab, and on whether this is a good idea.
Please note: This is work in progress; any contributions are welcome!
Why bother with Git LFS?
The main purpose of Git LFS is to treat data files as conveniently as if they were inside a Git repository, while actually keeping them outside of the repository. There are technical reasons in Git's design that make this extra step necessary:
- Large files will bloat the Git repository for everyone who has a clone, and degrade the performance of Git operations. Git is optimized for text-based content, not for binary files.
- Git is designed to distribute the entire snapshot history to every clone. If data are part of the Git repository (at any point in history!), they get replicated in every clone, even if they are not needed (any more).
- Git is designed to make it hard to delete data from the repository's history. All you can do to "delete data" is to explicitly rewrite the snapshot history. Even if you work alone, this is a bit of a hassle; it gets much worse if other people have clones, because a history change requires everyone involved to apply the deletion to their own clone.
Optional, technical details:
With the Git LFS extension, you can version control (large) files "in association" with a Git repository.
Instead of storing a file within the Git repository as a blob, Git LFS only stores pointer files in the repository, but stores the actual file contents on a (separate) Git LFS server (and locally, in another folder).
A file tracked by Git LFS gets downloaded only if needed, e.g. when you check out a Git branch containing the tracked file (but it also gets cached locally, if you downloaded it before).
LFS uses Git filters and hooks to coordinate between the normal repository and the tracked files.
A smudge filter gets the file contents based on the pointer file.
A clean filter creates a new version of the pointer file if the file changes.
When you push a commit that contains a new/changed file tracked by LFS, a pre-push hook takes care of uploading the large file contents separately to the Git LFS server.
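To make the idea of a pointer file more concrete, here is a minimal sketch that builds the text of such a pointer by hand. It assumes the pointer layout described in the Git LFS specification (a version line, the SHA-256 of the contents as oid, and the size in bytes). The helper name make_pointer is hypothetical and not a Git LFS command, and sha256sum is assumed to be available (on macOS you may need shasum -a 256 instead).

```shell
#!/bin/sh
# Sketch: build the text of an LFS-style pointer for a local file.
# make_pointer is a hypothetical helper for illustration only.
make_pointer() {
    file=$1
    # The SHA-256 of the file contents identifies the object on the LFS server
    oid=$(sha256sum "$file" | cut -d ' ' -f 1)
    # Size of the file contents in bytes
    size=$(wc -c < "$file" | tr -d ' ')
    printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}

# A small stand-in for a large data file
tmpfile=$(mktemp)
printf 'a,b\n1,2\n' > "$tmpfile"

pointer=$(make_pointer "$tmpfile")
printf '%s\n' "$pointer"

rm -f "$tmpfile"
```

Only these three short lines end up in the Git repository, no matter how large the actual file is.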
Is it a good idea to use Git LFS for your project?
You must consider several aspects before uploading any research data to JLU GitLab. Please read this information on research data management. If in doubt about your specific situation, please consult the department for research data, forschungsdaten@uni-giessen.de.
With this in mind, using Git LFS may be a natural choice if:
- you're already using Git to organize your project,
- data are an integral part of the project,
- the data files are binary and/or larger than a few KiB,
- all project members who need to access these data files can work with Git LFS, and
- all machines that need to access these files have network access to the LFS server.
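To get a feeling for which files in a project might be candidates for LFS, you could list all files above a size threshold. The following is only a sketch: the demo folder, the file names, and the threshold of 100 KiB are arbitrary assumptions, not recommendations.

```shell
#!/bin/sh
# Sketch: list files above a size threshold as possible LFS candidates.
# The demo folder and the 100 KiB threshold are arbitrary assumptions.
demo=$(mktemp -d)
mkdir "$demo/.git"
printf 'x,y\n1,2\n' > "$demo/small.csv"            # a few bytes only
head -c 204800 /dev/zero > "$demo/big.bin"         # 200 KiB stand-in for real data
head -c 204800 /dev/zero > "$demo/.git/internal"   # Git's own files should be skipped

# -size +100k : larger than 100 KiB
# ! -path ... : leave out everything below a .git folder
candidates=$(find "$demo" -type f -size +100k ! -path '*/.git/*')
echo "$candidates"

rm -rf "$demo"
```

In a real project, you would replace "$demo" with the path of your working copy and skip the lines that create the demo files.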
Advanced option: External Git LFS server
As an alternative to storing LFS data on JLU GitLab, you could store LFS data on an external server, while still using JLU GitLab to host your project.
For example, your workgroup could run their own Git LFS server; you can choose from e.g. the list of LFS server implementations (see "Useful links" below).
The advantage of an external LFS server is independence from JLU GitLab; e.g. you could implement different policies, such as potentially more suitable security practices.
Practical tips on how to use Git LFS
The tips below make the following assumptions:
- You have Git installed on your machine and you know the basics (if you don't, here is a good starting point).
- You have a project on JLU GitLab that includes a Git repository, and you have a local clone of it on your machine.
- You can type Git commands into a command line interface (terminal).
  Below, example commands are shown with a leading dollar sign, $ like this; to reproduce them, type the command without the dollar sign.
- You have the Git LFS extension installed on your machine (you can find instructions here); you can check this e.g. by typing $ git lfs version.
Basic use
- Set up Git LFS; you typically have to do this once per user account on each machine (when run inside a repository, it also installs the LFS hooks for that repository): $ git lfs install
- In your local repository, choose what types of files to track by LFS.
  - For example, to track all CSV files, type: $ git lfs track "*.csv"
  - This will create/change the Git configuration file .gitattributes. You should track this configuration change in the repository itself, with the usual Git commands $ git add .gitattributes and $ git commit -m "start tracking CSV files with LFS".
  - Note: Because the file name .gitattributes leads with a dot, it may be hidden from view (to list it, use $ ls -a on Linux/MacOS, or FIXME: What to do on Windows?).
- Now you can interact with the LFS-tracked files in the usual way to control versions with Git.
  For example, to make a new snapshot with a file some_data.csv, use the usual commands add, commit, and push like for any other file in the repository:
  $ git add some_data.csv
  $ git commit -m "Add data to LFS"
  $ git push
Option: Prevent download of LFS files
You may want to work with a Git repository but prevent the download of all LFS files. For example, you may want to clone a repository on a machine where you simply don't need the large files. Or maybe you want to work with a clone on a machine without network access, where LFS would otherwise report errors.
To temporarily ignore the LFS content, you can set the environment variable GIT_LFS_SKIP_SMUDGE to the value 1.
(And to stop ignoring LFS files, just set the variable to 0.)
The syntax to set the variable depends on your command line interface:
- On Windows (Command Prompt), type $ set GIT_LFS_SKIP_SMUDGE=1.
- For Bash (e.g. Linux), type $ export GIT_LFS_SKIP_SMUDGE=1.
After that, you can e.g. clone the repository as usual, without any attempt to download the LFS files:
$ git clone <REMOTE-URL> <LOCAL-FOLDER>
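In POSIX shells such as Bash, you can also set the variable for a single command only, so that the rest of your session is not affected. The sketch below demonstrates this scoping; the call to sh -c is just a harmless stand-in for the Git command you would actually run, and the variable names inside and after are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: set GIT_LFS_SKIP_SMUDGE for a single command only (POSIX shells).
# "sh -c ..." is a harmless stand-in for the Git command you would run, e.g.
#   GIT_LFS_SKIP_SMUDGE=1 git clone <REMOTE-URL> <LOCAL-FOLDER>

unset GIT_LFS_SKIP_SMUDGE   # start from a clean state for the demonstration

# The variable is visible inside the prefixed command ...
inside=$(GIT_LFS_SKIP_SMUDGE=1 sh -c 'echo "${GIT_LFS_SKIP_SMUDGE:-unset}"')
# ... but not in the rest of the session
after=${GIT_LFS_SKIP_SMUDGE:-unset}

echo "inside the command: $inside"
echo "afterwards:         $after"
```

If you need the large files later after all, $ git lfs pull downloads the LFS content for the current checkout.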
TODO?: You can also ignore LFS files permanently, via Git configuration.
Option: Exclude particular files from being tracked by LFS
The easiest way to track files with LFS is to use a general file pattern, like all CSV files (*.csv).
However, you may want to keep a particular CSV file in the normal Git repository, and exclude it from the LFS pattern.
This can be done by editing the .gitattributes file directly, with any text editor you want.
Example
Let's say you're tracking all CSV files in Git LFS, i.e. your .gitattributes file contains a line like this:
*.csv filter=lfs diff=lfs merge=lfs -text
Let's assume you want to exclude some-directory/my-particular-file.csv from LFS, and put it into the normal Git repo instead.
To do that, you simply add the following line to the .gitattributes file, below the *.csv line (later lines override earlier ones):
some-directory/my-particular-file.csv !filter !diff !merge text
After your editing, .gitattributes should contain these 2 lines:
*.csv filter=lfs diff=lfs merge=lfs -text
some-directory/my-particular-file.csv !filter !diff !merge text
Note: The file .gitattributes is just a text file, so you can edit it with any text editor.
Because the file name leads with a dot, it may be hidden from view (to list it, use $ ls -a on Linux/MacOS, or FIXME: What to do on Windows?).
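If you are unsure whether an exclusion rule has the intended effect, you can ask Git which attributes apply to a given path. The following sketch uses only plain Git (git check-attr) in a throwaway repository; the file name other.csv is a hypothetical example, and the files do not need to exist for the check.

```shell
#!/bin/sh
# Sketch: check which .gitattributes rules apply to which file.
# Uses only plain Git (git check-attr); Git LFS itself is not needed here.
repo=$(mktemp -d)            # throwaway repository for the demonstration
git init -q "$repo"

# A general LFS pattern plus an exclusion rule for one particular file
cat > "$repo/.gitattributes" <<'EOF'
*.csv filter=lfs diff=lfs merge=lfs -text
some-directory/my-particular-file.csv !filter !diff !merge text
EOF

# "other.csv" is a hypothetical file name matching the general pattern
lfs_attr=$(git -C "$repo" check-attr filter -- other.csv)
plain_attr=$(git -C "$repo" check-attr filter -- some-directory/my-particular-file.csv)

echo "$lfs_attr"      # filter is "lfs": handled by Git LFS
echo "$plain_attr"    # filter is "unspecified": stored in the normal Git repo

rm -rf "$repo"
```

In your own project, you would run $ git check-attr filter -- <FILE> directly inside your working copy.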
Option: Lock files to avoid conflicts
FIXME: Clarify locking mechanism, mainly relevant for people working in teams, see https://github.com/git-lfs/git-lfs/wiki/File-Locking.
Example(s)
- Here you can find an example of an analysis script that relies on data stored in Git LFS.
- TODO: Maybe add more examples with different data (audio, video, images)?
Useful links
- Main website about Git LFS: https://git-lfs.github.com/
- Information on LFS on GitLab: https://docs.gitlab.com/ce/topics/git/lfs/
- A list of LFS server implementations: https://github.com/git-lfs/git-lfs/wiki/Implementations