Bundling Files for Data Den Research Storage

Because of how Data Den is designed, projects often need to be bundled into larger single-file archives. The most common tool for this is tar. Tar can optionally compress the data, though compression can take much longer. On the ARC HPC clusters we also recommend archivetar, which is also available as a container for Linux systems.

On the ARC HPC systems, archivetar does the needed tar bundling and uploads the result to Data Den. The following commands sort the data into tar files of 100GB each and upload them to the selected folder on Data Den using Globus. Run the --dryrun command first to preview the operation before starting the real one. You can also use tar, zip, or other commands manually.

cd <folder to archive>

module load archivetar

archivetar --prefix my-archive --dryrun

archivetar --prefix my-archive --destination-dir /<dataden-volume>/<folder>
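
As a rough pre-flight check before running archivetar, the sketch below estimates how many 100GB tar files a folder might produce. It is not part of archivetar: the function names bundles_needed and estimate_bundles are hypothetical, 100GB is assumed to mean 100 x 1024 x 1024 KB, and the size comes from du (disk usage), so the output of archivetar --dryrun remains the reference.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of archivetar.

# bundles_needed TOTAL_KB BUNDLE_KB
# Prints how many bundles of BUNDLE_KB are needed to hold TOTAL_KB (rounded up).
bundles_needed() {
    total_kb=$1
    bundle_kb=$2
    echo $(( (total_kb + bundle_kb - 1) / bundle_kb ))
}

# estimate_bundles DIR [BUNDLE_GB]
# Measures DIR with du and prints the estimated bundle count.
# BUNDLE_GB defaults to 100, matching the bundle size described above.
estimate_bundles() {
    dir=$1
    bundle_gb=${2:-100}
    # du -sk reports disk usage in KB; the first field is the number.
    total_kb=$(du -sk "$dir" | awk '{print $1}')
    bundles_needed "$total_kb" $(( bundle_gb * 1024 * 1024 ))
}
```

For example, estimate_bundles . prints the estimate for the current folder.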

The following tar command bundles all files in directory, compresses them with bzip2, and stores the result in the file bundle.tar.bz2. It also creates a small text file, bundle.index.txt, that can be kept alongside the bundle as a quick reference to which files it contains.

tar -cvjf bundle.tar.bz2 directory | tee bundle.index.txt

To extract the bundle:

tar -xvjf bundle.tar.bz2
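
Before relying on bundle.index.txt, it may be worth checking that it still matches the archive. The sketch below is one way to do that, assuming GNU tar (whose verbose output on create is one member name per line, and which detects the compression automatically when listing a file). The helper name verify_bundle is hypothetical.

```shell
#!/bin/sh
# verify_bundle ARCHIVE INDEX   (hypothetical helper name)
# Succeeds only if the member list of ARCHIVE matches INDEX line for line.
# Assumes GNU tar: -t auto-detects compression when reading from a file.
verify_bundle() {
    archive=$1
    index=$2
    # Fail early if the archive cannot be read at all.
    listing=$(tar -tf "$archive") || return 1
    # Compare the listing with the saved index; diff sets the exit status.
    printf '%s\n' "$listing" | diff - "$index" >/dev/null
}

# Example (commented out so sourcing this file has no side effects):
# verify_bundle bundle.tar.bz2 bundle.index.txt && echo "index matches archive"
```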

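Optionally omit -j to skip compression and save time (in that case, name the file bundle.tar), and omit -v to not print each file as it is processed. Without -v when creating the bundle, nothing is written to bundle.index.txt.

Compressing an archive can be accelerated on multi-core systems using pigz (parallel gzip) or lbzip2 (parallel bzip2). The following will work on all ARC systems: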

tar --use-compress-program=lbzip2 -cvf bundle.tar.bz2 directory | tee bundle.index.txt

To extract the bundle:

tar --use-compress-program=lbzip2 -xvf bundle.tar.bz2
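
On systems outside ARC, lbzip2 or pigz may not be installed. The sketch below picks the parallel tool when it is available and otherwise falls back to the standard bzip2 or gzip. The helper name pick_compressor is hypothetical, and the fallback order is an assumption rather than an ARC recommendation.

```shell
#!/bin/sh
# pick_compressor FAMILY   (hypothetical helper; FAMILY is "bzip2" or "gzip")
# Prints the first available program: the parallel tool if installed,
# otherwise the standard single-threaded one.
# Returns 1 if neither is found, 2 for an unknown family.
pick_compressor() {
    case $1 in
        bzip2) candidates="lbzip2 bzip2" ;;
        gzip)  candidates="pigz gzip" ;;
        *)     echo "unknown family: $1" >&2; return 2 ;;
    esac
    for prog in $candidates; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "$prog"
            return 0
        fi
    done
    return 1
}

# Examples (commented out so sourcing this file has no side effects):
# tar --use-compress-program="$(pick_compressor bzip2)" -cvf bundle.tar.bz2 directory | tee bundle.index.txt
# tar --use-compress-program="$(pick_compressor gzip)" -cvf bundle.tar.gz directory | tee bundle.index.txt
```

Archives written by lbzip2 and pigz are ordinary bzip2 and gzip files, so they can be extracted with either the parallel or the standard tool.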

 

Last Updated: Friday, August 16, 2024