Git Large File Storage
How to use Git LFS on JLU GitLab (and whether this is a good idea)
NOTE: This project is very much work in progress!
[TOC]
What problem is solved by Git LFS?
The main purpose of Git LFS is to treat data files with the same convenience as files in your Git repository, while technically keeping them out of the Git repository. The main reasons behind Git LFS are technical:
- Large files will bloat the Git repository for everyone who has a copy, and degrade the performance of Git operations. Git is optimized for text-based content, not for binary files.
- Git is designed to distribute the full history to everyone who has a clone. If data are part of the Git repository (at any point in time!), it means they get replicated on every clone.
- Git is designed to make it impossible to delete data from the repository's history. All you can do to "delete data" is to explicitly force Git to rewrite the snapshot history. Even if you work alone, this is a bit of a hassle. It gets worse if other people have a clone of the repository, because everyone involved then has to repeat the deletion manually in their own clone.
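To illustrate how the data stay out of the repository: Git LFS commits a small text pointer in place of each tracked file and stores the actual content separately. The sketch below rebuilds such a pointer by hand for a made-up file `data.csv`, following the layout of the Git LFS pointer format (version line, SHA-256 object ID, size in bytes). It assumes the `sha256sum` tool is available (on macOS, `shasum -a 256` is the usual substitute); with LFS installed, `git lfs pointer --file=data.csv` should print the same kind of output.

```shell
# Hypothetical example file in a temporary directory (not part of any real project).
tmp=$(mktemp -d)
printf 'id,value\n1,0.5\n' > "$tmp/data.csv"

# Object ID: SHA-256 hash of the file content.
oid=$(sha256sum "$tmp/data.csv" | cut -d ' ' -f 1)
# Size of the file content in bytes.
size=$(wc -c < "$tmp/data.csv" | tr -d ' ')

# Assemble the three-line pointer that Git would store instead of the data.
pointer=$(printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size")
echo "$pointer"
```

Because only these few lines end up in the Git history, the repository stays small no matter how large the data file is.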
Is it a good idea to use Git LFS for your project?
You should consider several aspects before uploading data to JLU GitLab. If in doubt about your specific situation, please ask the research data manager via email, at forschungsdaten@uni-giessen.de.
- You should never save data containing personally identifying information on JLU GitLab.
- In principle, Git LFS is suitable for data with "normal protection requirements" (in German, "normalem Schutzbedarf"), FIXME: EXPLAIN.
- Git LFS is most suitable if you want to keep data and their analysis code in the same place. If you just want a place to keep your data on their own, you should also consider JLUbox, network drives, or similar options.
- In contrast to JLUdata, Git LFS on JLU GitLab lets you store data privately among the members of your project. JLUdata is the preferred choice for publishing data (for example, you get a DOI).
Practical steps for using Git LFS (UNFINISHED)
Assumptions:
- You have Git installed on your machine and you know the basics of how to use it (TODO, specify: `add`, `commit`, `pull`, `push`). (If you don't, here is a good starting point).
- You have a project on JLU GitLab that includes a Git repository, and you have a local clone of it on your machine.
- You can type Git commands into a command line interface (terminal). Below, example terminal commands are indicated with a different font and with a leading dollar sign, `$ like this`.
- You have the Git LFS extension installed on your machine (you can find instructions here).
- In your local repository clone, configure which types of files you want LFS to track.
  - For example, to let LFS keep track of CSV files, type: `$ git lfs track "*.csv"`.
  - This will create or change the Git configuration file `.gitattributes`. You should track this configuration change in Git, e.g. with the usual Git commands `$ git add .gitattributes` and `$ git commit -m "start tracking CSV files with LFS"`.
    Note that because the file name `.gitattributes` starts with a dot, it may be hidden from view (on Linux and macOS, use `$ ls -a` to see it; FIXME: What to do on Windows?).
- FIXME: How to generally add files into LFS, and how to interact with them.
- FIXME: Clarify the locking mechanism.
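The tracking step above can be tried out in a throwaway repository. The sketch below writes by hand the line that `git lfs track "*.csv"` normally adds to `.gitattributes`, so that it also runs without the LFS extension, and then uses plain `git check-attr` to show which paths would be handled by LFS. The file names (`data/measurements.csv`, `analysis.py`) are made up for illustration.

```shell
# Throwaway repository in a temporary directory (hypothetical, not your real project).
demo=$(mktemp -d)
git init -q "$demo"

# This is the line that `git lfs track "*.csv"` appends to .gitattributes.
# It is written by hand here so the sketch also runs without Git LFS installed.
printf '%s\n' '*.csv filter=lfs diff=lfs merge=lfs -text' > "$demo/.gitattributes"

# Ask Git which filter applies to a path; the files do not need to exist yet.
csv_attr=$(git -C "$demo" check-attr filter -- data/measurements.csv)
py_attr=$(git -C "$demo" check-attr filter -- analysis.py)

echo "$csv_attr"   # data/measurements.csv: filter: lfs
echo "$py_attr"    # analysis.py: filter: unspecified
```

In a real project you would run `git lfs track "*.csv"` rather than editing the file by hand; running `git lfs track` without arguments lists the patterns that are currently tracked.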
Example(s)
- Here you can find an example of an analysis script that relies on data stored in Git LFS.
- Maybe add more examples with different data (audio, video, images)?