From b835eb01ad7e0353c5c0f912e25048720d8d81d7 Mon Sep 17 00:00:00 2001 From: Johannes Keyser <johannes.keyser@sport.uni-giessen.de> Date: Tue, 2 Feb 2021 19:54:14 +0100 Subject: [PATCH] A few reformulations and fixes. --- README.md | 30 ++++++++++++++++------------ example/README.md | 8 ++++---- example/example_analyis_script.ipynb | 24 ++++++++-------------- example/example_data.hdf | 2 +- 4 files changed, 30 insertions(+), 34 deletions(-) diff --git a/README.md b/README.md index 9e5ac7e..2eb4913 100644 --- a/README.md +++ b/README.md @@ -1,25 +1,26 @@ # Git Large File Storage -<img src="logo.svg" width="120px" /> +<img src="logo.svg" width="100px" /> *How to use Git LFS on JLU GitLab (and if this is a good idea)* -__NOTE: This project is very much work in progress!__ +__NOTE: This project is work in progress!__ -[__TOC__] +[[_TOC_]] -## What Problem is solved by Git LFS? -The main purpose of Git LFS is to treat data files with the same convenience as files in your Git repository, while technically keeping them out of the Git repository. -The main reasons behind Git LFS are technical: -- Large files will bloat the Git repository for everyone who has a copy, and degrade the performance of Git operations. +## What problem is solved by Git LFS? +The main purpose of Git LFS is to treat **data** files *as conveniently as if they were inside* a Git repository, while *actually keeping them outside* of the repository. +There are technical reasons in Git's design to make this extra step necessary: + +- Large files will bloat the Git repository for everyone who has a clone, and degrade the performance of Git operations. Git is optimized for text-based content, not for binary files. -- Git is designed to distribute the full history to everyone who has a clone. - If data are part of the Git repository (at any point in time!), it means they get replicated on every clone. +- Git is designed to distribute the entire snapshot history to every clone. 
+ If data are part of the Git repository (at any point in history!), it means they get replicated on every clone, even if they're not needed (any more). - Git is designed to make it impossible to delete data from the repository's history. All you can do to "delete data" is force Git to you explicitly re-write the snapshot history. - Even if you work alone, this is a bit of a hassle. - This gets worse if other people have a clone of the repository, because the deletion requires everyone involved to manually re-create the deletion with their clone. + Even if you work alone, this is a bit of a hassle; it gets much worse if other people have clones, because a history change requires everyone involved to confirm the deletion with their clone. + ## Is it a good idea to use Git LFS for your project? You should consider several aspects before uploading data to JLU GitLab. @@ -32,7 +33,8 @@ If in doubt about your specific situation, please ask the research data manager - Compared with [JLUdata](https://jlupub.ub.uni-giessen.de/handle/jlupub/1), you can store data privately among the members of your project. JLUdata is the preferred choice to *publish* data (for example, you get a DOI). -## Practical steps how to use Git LFS (UNFINISHED) + +## Practical steps how to use Git LFS Assumptions: - You have Git installed on your machine and you know the basics how to use it (TODO, specify: `add`, `commit`, `pull`, `push`). (If you don't, [here is a good starting point](https://git-scm.com/)). @@ -49,9 +51,11 @@ Assumptions: 2. FIXME: How to generally add files into LFS, and how to interact with them. 3. FIXME: Clarify the locking mechanism. + ## Example(s) - [Here](example) you can find an of an analysis script that relies on data stored in Git LFS. -- Maybe add more examples with different data (audio, video, images)? +- TODO: Maybe add more examples with different data (audio, video, images)? 
+ ## Useful links - https://git-lfs.github.com/ diff --git a/example/README.md b/example/README.md index ac9d6e7..c9d42a2 100644 --- a/example/README.md +++ b/example/README.md @@ -9,11 +9,11 @@ The example consists of - which relies on binary data ([example_data.hdf](example_data.hdf)) stored by Git LFS, - and stores figure data (as `.png` files) in the folder [plots](plots), stored by Git LFS. -The analysis script is stored as a normal Git snapshot, without LFS. +The analysis script is stored as a normal Git snapshot (without LFS). Note the small badge "LFS" at the files stored in LFS in GitLab's file overview. -If you cloned the repository on your machine (and you have LFS installed), you type `$ git lfs ls-files` to get an overview of which files are stored in Git LFS: -For example: +If you have cloned the repository on your machine (and you have LFS installed), you can type `$ git lfs ls-files` to get an overview of which files are stored in Git LFS. +In this example, you would see something like this: ``` $ git lfs ls-files @@ -25,7 +25,7 @@ f4aa57ea83 * example/plots/plot_example_histogram.png To achieve tracking with LFS of these files types (PNG and HDF), irrespective of their path in the repository, you can use these two commands: -``` sh +```sh $ git lfs track "*.hdf" $ git lfs track "*.png" ``` diff --git a/example/example_analyis_script.ipynb b/example/example_analyis_script.ipynb index 9a63b2e..ca421c9 100644 --- a/example/example_analyis_script.ipynb +++ b/example/example_analyis_script.ipynb @@ -5,7 +5,7 @@ "id": "failing-rebecca", "metadata": {}, "source": [ - "# Example \"analysis\" script\n", + "# Example analysis script\n", "\n", "This example illustrates the integration of analysis code and data:\n", "\n", @@ -20,7 +20,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "id": "composite-secretariat", "metadata": {}, "outputs": [], @@ -79,8 +79,8 @@ } ], "source": [ - "# Main illustration: Create figure based on existing data 
stored with Git LFS,\n", - "# and save figures as SVG files, which are also tracked by Git LFS.\n", + "# Main illustration: Create figure based on data stored with Git LFS,\n", + "# and save figures as PNG files, which are also tracked by Git LFS.\n", "\n", "# The data file contains samples of the standard normal distribution.\n", "with h5py.File(DATA_FILE, 'r') as file_handle:\n", @@ -113,24 +113,16 @@ }, { "cell_type": "code", - "execution_count": 93, + "execution_count": 3, "id": "quiet-castle", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Wrote data into file \"example_data.hdf\".\n" - ] - } - ], + "outputs": [], "source": [ "# Appendix: Generate data set (for the sake of completeness).\n", "\n", "# Draw samples from the standard normal distribution.\n", - "np.random.seed(42)\n", - "Num_Data_Points = 150*1000\n", + "numpy.random.seed(42)\n", + "Num_Data_Points = 150000\n", "rand_normal_samples = numpy.random.randn(Num_Data_Points, 1)\n", "\n", "# Save the samples in a HDF-5 file.\n", diff --git a/example/example_data.hdf b/example/example_data.hdf index f0c9311..7b1ce99 100644 --- a/example/example_data.hdf +++ b/example/example_data.hdf @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:3cf63089c81a765c9f1e9436fdc7edacdae1b9af1015d2b1c5f9440f748db103 +oid sha256:5908419e1f7a2f9806fe0b4bff078c5edbb2d5406bed95fa8d956e95332933c3 size 1202048 -- GitLab