Environment modules allow you to control which software (and which version of that software) is available in your environment. For instance, at the time of writing, the cluster has four different versions of standard R installed: 3.5.3, 3.6.3, 4.0.5 and 4.1.0. When you first log in and try to run R, the shell will respond with “command not found”. To activate R in your environment you would type:
module add R
That would then give you access to the most recent version of R available (4.1.0 in this case).
To use a specific version you would type something like:
module add R/3.6.3
To get a list of all available software you can type:
module avail
To get a full list of module commands:
module --help
There's a shorthand version of the module command: ml. To load a module you can use just:
ml fastp
to, for instance, load the fastp program.
You can also issue the other module commands using ml:
ml avail
ml list
ml purge
...
Just typing ml on its own is the same as ml list.
Whether using module or ml you can load multiple modules with a single command:
module add R/4.1.0 samtools bedtools
or
ml R/4.1.0 samtools bedtools
You have several options as to where to use “module load” commands.
When you issue a module load command, the modules program checks whether you already have a different version of the same program loaded as a module. If you do, it reports an error and does not load the requested version.
Suppose you have the latest version of R loaded in your .bashrc file, but also have a pipeline that has been installed and thoroughly tested with a previous version of R, and you have in the scripts for that pipeline something like:
module load R/3.6.3
This would generate an error because the latest version of R is already loaded.
So, in your script, you could unload R before loading the version the pipeline needs:
module unload R
module load R/3.6.3
You might also consider unloading all modules, and loading only those you need, before starting up your pipeline:
module purge
module load R/3.6.3
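Put together, the start of a pipeline script might look like the sketch below. It assumes the module command is already defined in the shell that runs the script, which may not hold in every non-interactive shell. The guard, the variable names and the warning text are illustrative and not part of the cluster's setup.

```shell
#!/usr/bin/env bash
# Sketch: start a pipeline from a clean, known module state.
# Assumption: `module` is defined in this shell. If it is not, the
# module calls are skipped and a warning is printed instead.
required_module="R/3.6.3"   # the version the pipeline was tested with

if command -v module >/dev/null 2>&1; then
    module purge                      # drop anything loaded earlier, e.g. in .bashrc
    module load "$required_module"    # load only what the pipeline needs
    module_status="loaded"
else
    echo "module command not found; cannot load $required_module" >&2
    module_status="unavailable"
fi
```

Starting with a purge means the script does not depend on whatever happens to be loaded in your login environment.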
In general, the environment modules just edit your PATH variable (they usually prepend a directory to your PATH). The convention is that all programs loaded by “module load” can be found in a sub-directory of /opt. The naming convention is:
/opt/SOFTWARE/VERSION
Where SOFTWARE would be replaced by the name of the software, e.g. R, and VERSION would be replaced by a version number for that software, e.g. 3.6.3 (for R).
Usually (but not always) the executable programs will be in /opt/SOFTWARE/VERSION/bin.
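To see why prepending a directory to PATH selects a particular version, here is a small self-contained sketch. It uses a throwaway directory and a made-up program name (demo_tool) as a stand-in for a real /opt/SOFTWARE/VERSION/bin directory, and it does not touch the real modules setup.

```shell
# Build a stand-in for /opt/SOFTWARE/VERSION/bin in a temporary directory.
# "demo_tool" and its versions are hypothetical, for illustration only.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/demo_tool/1.0/bin" "$demo_root/demo_tool/2.0/bin"

for v in 1.0 2.0; do
    printf '#!/bin/sh\necho "demo_tool %s"\n' "$v" > "$demo_root/demo_tool/$v/bin/demo_tool"
    chmod +x "$demo_root/demo_tool/$v/bin/demo_tool"
done

# Prepend one bin directory: roughly what loading a module does to PATH.
old_path="$PATH"
PATH="$demo_root/demo_tool/1.0/bin:$PATH"
first="$(demo_tool)"

# Prepend another version: it now comes earlier on PATH, so it is found first.
PATH="$demo_root/demo_tool/2.0/bin:$PATH"
second="$(demo_tool)"

echo "$first"
echo "$second"

# Restore PATH and clean up.
PATH="$old_path"
rm -rf "$demo_root"
```

The shell searches PATH from left to right, so whichever directory was prepended most recently wins.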
You can find specifically what “module load” does for a piece of software by looking at the modulefile for that piece of software. It can be found at:
/opt/modules/modulefiles/SOFTWARE/VERSION
(This is a file - not a directory.)
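As a minimal sketch of how to inspect a modulefile, the snippet below builds the path from the naming convention above and prints the file if it exists. The variable names are illustrative, and R 3.6.3 is used only because it appears earlier on this page.

```shell
# Sketch: print the modulefile for one piece of software, if it exists.
software="R"        # example values taken from the text above
version="3.6.3"
modulefile="/opt/modules/modulefiles/$software/$version"

if [ -f "$modulefile" ]; then
    cat "$modulefile"
else
    echo "no modulefile at $modulefile" >&2
fi
```

The module command also has a show subcommand (for example, module show R/3.6.3) that should report the same information without you needing to know the file's location.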
If you prefer not to use modules, you can just update your PATH to include the directories of the pieces of software you want to use (possibly updating your PATH in .bashrc).
In most cases, modules is just a nice easy way of updating your PATH. So it seems preferable to use the module command rather than updating your PATH explicitly.
There are a couple of special purpose modules.
Since various OS-level tools need python and perl, versions of these languages are installed system-wide and need no module. These are python3 (version 3.8.5) and perl (version 5.30.0). You are welcome to use these, but there are also modules with slightly different versions.
The system-wide python is accessible only as “python3”. When a python module is loaded just “python” will start up the relevant version of python.
Python and perl packages that users request will be installed into the module versions of these programs. You can install python and perl packages locally as you wish (using any of these versions).
More information about environment modules can be found here: https://modules.readthedocs.io/en/latest/
If you try to install an R package (as an ordinary user) and get a “permission denied” message like this:
* installing to library '/opt/R/4.1.0/lib/R/library'
Error: ERROR: no permission to install to directory '/opt/R/4.1.0/lib/R/library'
Then you might need to create the correct directory for R to use for package installation within your home directory. For R 4.1.0 this would be:
~/R/x86_64-pc-linux-gnu-library/4.1
Where “~” means “your home directory”. You should only specify the first two parts of the full version number (hence the 4.1 for version 4.1.0).
You can create the directory from the command line, like this:
cd
mkdir -p ~/R/x86_64-pc-linux-gnu-library/4.1
Or you can do it from within R, and then you won't need to know any details like the specific version number - the R program that you have started will fill them in for you:
dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE)
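For the command-line route, the two-part version can also be derived from the full version number rather than typed by hand. This is a minimal sketch, assuming the platform directory name x86_64-pc-linux-gnu shown above applies to your R build; the variable names are illustrative.

```shell
# Sketch: create the per-user R library directory for a given R version.
r_version="4.1.0"             # full version of the R you are using
r_short="${r_version%.*}"     # drop the last component: 4.1.0 becomes 4.1
lib_dir="$HOME/R/x86_64-pc-linux-gnu-library/$r_short"

mkdir -p "$lib_dir"           # -p creates missing parents; harmless if it already exists
echo "$lib_dir"
```

Changing r_version to another release, such as 3.6.3, would create the matching 3.6 directory.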
If you have used
#!/usr/bin/Rscript
as the first line of your R scripts so that you can run them just like programs on the old cluster, they will no longer work on the new cluster. This is because there is no interpreter at /usr/bin/Rscript on the new cluster.
On the new cluster you should load an R module (possibly from within your .bashrc file so that R is always available when you log in), and then use:
#!/usr/bin/env Rscript
at the top of your R scripts.
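With this form, /usr/bin/env looks Rscript up on your PATH, so the script runs with whichever R module is loaded at the time. The following sketch checks what env would find before you run a script; the variable name and messages are illustrative.

```shell
# Sketch: check that `env` will be able to find Rscript on PATH.
if command -v Rscript >/dev/null 2>&1; then
    rscript_path="$(command -v Rscript)"
    echo "Rscript will be run from: $rscript_path"
else
    rscript_path=""
    echo "Rscript is not on PATH; load an R module first" >&2
fi
```

If the second message appears, loading an R module and running the check again should show a path under /opt.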