From 0caf09ca88b4842440a7ee023cac69cc8cfdd34a Mon Sep 17 00:00:00 2001
From: Johannes Keyser <johannes.keyser@sport.uni-giessen.de>
Date: Tue, 2 Feb 2021 18:42:34 +0100
Subject: [PATCH] Add more explanations.

---
 CONTRIBUTING.md   |  8 +++++++
 README.md         | 57 ++++++++++++++++++++++++++++++-----------------
 example/README.md | 43 +++++++++++++++++++++++++++++++++++
 3 files changed, 87 insertions(+), 21 deletions(-)
 create mode 100644 CONTRIBUTING.md
 create mode 100644 example/README.md

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..d43bb34
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,8 @@
+# Contributing
+
+Anyone is welcome to contribute.
+
+Please note that all data, code and text must be self-authored and [licensed as CC0](LICENSE.md) (or the material must be properly cited and licensed openly).
+
+Also note that this project is publicly available to anyone.
+
diff --git a/README.md b/README.md
index df1fd59..9e5ac7e 100644
--- a/README.md
+++ b/README.md
@@ -1,42 +1,57 @@
-# Git Large File Storage How-To
+# Git Large File Storage
 <img src="logo.svg" width="120px" />
 
 *How to use Git LFS on JLU GitLab (and if this is a good idea)*
 
-__NOTE: This project is work in progress!__
+__NOTE: This project is very much a work in progress!__
 
-## Initial ideas for this project
+[__TOC__]
 
-- Explain/discuss/elaborate if LFS on JLU GitLab is the best choice for your data/project.
-  - Contrast especially with [JLUdata](https://jlupub.ub.uni-giessen.de/handle/jlupub/1).
-  - Explain why data don't belong in a plain Git repository:
-    - Because it degrades performance and bloats the repository.
-    - Because it is useless, since data are not expected to come in "versions" that need to be controlled.
-    - Because it is impossible (or at least a huge hassle) to actually delete data from a Git repository, especially if others have a clone.
-- Provide simple, practical examples what LFS can offer.
-  - Tight integration of some data and analysis code.
-    - 1st idea: Python code to generate some random data, and plot them in a Jupyter notebook.
-    - Maybe add more examples in other languages, or different data (audio, video, images)?
-- Anyone is welcome to contribute, with these guidelines:
-  - All data, code and text must be self-authored and [licensed as CC0](LICENSE.md), or the material must be properly cited and licensed openly.
-  - This project is publicly available to anyone.
+## What problem is solved by Git LFS?
+The main purpose of Git LFS is to treat data files with the same convenience as files in your Git repository, while technically keeping them out of the Git repository.
+The main reasons behind Git LFS are technical:
+
+- Large files will bloat the Git repository for everyone who has a copy, and degrade the performance of Git operations.
+  Git is optimized for text-based content, not for binary files.
+- Git is designed to distribute the full history to everyone who has a clone.
+  If data are part of the Git repository (at any point in time!), they get replicated in every clone.
+- Git is designed to make it impossible to delete data from the repository's history.
+  All you can do to "delete data" is explicitly force Git to re-write the snapshot history.
+  Even if you work alone, this is a bit of a hassle.
+  This gets worse if other people have a clone of the repository, because the deletion requires everyone involved to manually repeat it in their own clone.
+
+## Is it a good idea to use Git LFS for your project?
+You should consider several aspects before uploading data to JLU GitLab.
+If in doubt about your specific situation, please ask the research data manager via email, at [forschungsdaten@uni-giessen.de](mailto:forschungsdaten@uni-giessen.de).
+
+- You should __never__ save data containing personally identifying information on JLU GitLab.
+- In principle, Git LFS is suitable for data with "normal protection requirements" (in German, "normalem Schutzbedarf"), FIXME: EXPLAIN.
+- Git LFS is most suitable if you want to integrate data and their analysis code in the same place.
+  If you just want a place to keep your data on their own, you should also consider [JLUbox](https://www.uni-giessen.de/fbz/svc/hrz/svc/daten/jlubox) or [network drives](https://www.uni-giessen.de/fbz/svc/hrz/svc/daten/san/index_html), etc.
+- In contrast to [JLUdata](https://jlupub.ub.uni-giessen.de/handle/jlupub/1), Git LFS on JLU GitLab lets you keep data private among the members of your project.
+  JLUdata is the preferred choice to *publish* data (for example, you get a DOI).
 
 ## Practical steps how to use Git LFS (UNFINISHED)
 Assumptions:
 
-- You have Git installed on your machine and you know the basics how to use it.
+- You have Git installed on your machine and you know the basics of how to use it (TODO, specify: `add`, `commit`, `pull`, `push`).
   (If you don't, [here is a good starting point](https://git-scm.com/)).
 - You have a project on JLU GitLab that includes a Git repository, and you have a local clone of it on your machine.
 - You can type Git commands into a command line interface (terminal).
   Below, example terminal commands are indicated with a different font and with a leading dollar sign, `$ like this`.
+- You have the Git LFS extension installed on your machine (you can find [instructions here](https://git-lfs.github.com/)).
 
-1. On your machine, install the Git LFS extension ([here](https://git-lfs.github.com/) are some instructions).
-2. In your local repository clone, configure which types of files you want to track by LFS.
-   - For example, to let LFS keep track of [HDF files](https://www.hdfgroup.org/), type: `$ git lfs track "*.hdf"`.
+1. In your local repository clone, configure which types of files you want to track with LFS.
+   - For example, to let LFS keep track of `CSV` files, type: `$ git lfs track "*.csv"`.
    - This will create/change the Git configuration file [`.gitattributes`](.gitattributes).
      You should track this configuration change in Git, e.g. by the usual Git commands `$ git add .gitattributes` and `$ git commit -m "start tracking HDF files with LFS"`.
      *Note that because the file name `.gitattributes` starts with a dot, it may be hidden from view (on Linux and MacOS, use `$ ls -a` to see it; FIXME: What to do on Windows?).*
-3. FIXME: How to add files into LFS, and how to interact with them.
+2. FIXME: How to generally add files into LFS, and how to interact with them.
+3. FIXME: Clarify the locking mechanism.
+
+## Example(s)
+- [Here](example) you can find an example of an analysis script that relies on data stored in Git LFS.
+- Maybe add more examples with different data (audio, video, images)?
 
 ## Useful links
 - https://git-lfs.github.com/
diff --git a/example/README.md b/example/README.md
new file mode 100644
index 0000000..ac9d6e7
--- /dev/null
+++ b/example/README.md
@@ -0,0 +1,43 @@
+# Example: Analysis that relies on data in Git LFS
+
+## Main illustration
+This example illustrates how Git LFS enables tight integration of analysis code and its data:
+Researchers who want to reproduce the results just need to clone this repository to get both.
+
+The example consists of
+- an analysis script ([example_analyis_script.ipynb](example_analyis_script.ipynb)),
+- which relies on binary data ([example_data.hdf](example_data.hdf)) stored by Git LFS,
+- and stores figures (as `.png` files) in the folder [plots](plots), also stored by Git LFS.
+
+The analysis script is stored as a normal Git snapshot, without LFS.
+Note the small "LFS" badge next to the files stored in LFS in GitLab's file overview.
+If you have cloned the repository on your machine (and you have LFS installed), you can type `$ git lfs ls-files` to get an overview of which files are stored in Git LFS.
+
+For example:
+```
+$ git lfs ls-files
+
+3cf63089c8 * example/example_data.hdf
+3d177cad03 * example/plots/plot_example_cumsum.png
+f4aa57ea83 * example/plots/plot_example_histogram.png
+82065f2929 * example/plots/plot_example_trace.png
+```
+
+To track these file types (PNG and HDF) with LFS, irrespective of their path in the repository, you can use these two commands:
+
+``` sh
+$ git lfs track "*.hdf"
+$ git lfs track "*.png"
+```
+
+This should result in a Git configuration file `.gitattributes` that contains these lines:
+
+```
+*.hdf filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
+```
+
+
+## Appendix information
+- For the sake of readability in the browser, the analysis script is a [Jupyter notebook](https://jupyter.org/).
+- The data are stored in the open data file format [HDF-5](https://www.hdfgroup.org/).
-- 
GitLab
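
The `.gitattributes` lines quoted in `example/README.md` above can also be checked programmatically. The following is a minimal sketch and not part of the patch: the function name `lfs_patterns` is hypothetical, and the parsing assumes the simple whitespace-separated form shown above (one pattern per line, no quoted patterns containing spaces).

```python
def lfs_patterns(gitattributes_text):
    """Return the path patterns whose attributes include filter=lfs.

    Assumption: each non-comment line is "<pattern> <attr> <attr> ...",
    as in the two example lines from example/README.md.
    """
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


example = """\
*.hdf filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
"""
print(lfs_patterns(example))  # prints ['*.hdf', '*.png']
```

To inspect a real clone, the same function could be given the contents of the repository's `.gitattributes` file, read with `open(".gitattributes").read()`.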