Babel is a high-performance computing (HPC) cluster designed to provide advanced computing capabilities for research and scientific computing tasks. This page provides information about the architecture and specifications of the Babel cluster.
Upgrade History
Birth: 2023-05-04 18:02:55.854769612 -0400
SSH Login
To access the Babel cluster, you need your Andrew ID and password. Use the following command to log in:
ssh <username>@babel.lti.cs.cmu.edu
See Connecting to the Cluster.
Resource Scheduler
Slurm 20.11.9 is used for job scheduling.
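As a quick sketch of basic usage (the partition name, GPU count, memory, and time limit below are placeholders; check sinfo for the partitions and limits that actually apply on Babel):
srun --partition=debug --gres=gpu:1 --mem=16G --time=1:00:00 --pty bash   # interactive shell on a compute node with one GPU
sbatch --partition=general --gres=gpu:1 --time=8:00:00 job.sh             # submit job.sh as a batch job
squeue -u $USER                                                           # check the status of your jobs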
Software Management
Environment Modules
The cluster uses modules to allow for different versions of software packages. For example, both cuda 12.1 and cuda 12.2 are available on the cluster. You can check the available packages using:
module avail
To switch to a particular version, run module load <software_version>, for example:
module load cuda-12.1
will make cuda-12.1 available. Note that if you load a module on the head node, it will also be available on the compute nodes.
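A typical session might look like the following (the module names are examples; module avail shows what is actually installed):
module avail cuda          # list the cuda modules that are available
module load cuda-12.1      # load a specific version
module list                # confirm which modules are currently loaded
module unload cuda-12.1    # remove it again if needed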
See Environment_Modules.
GCC Versions
To change the GCC version on our systems, use the command:
scl enable gcc-toolset-X bash
Replace "X" with the version number (9, 10, or 11) to activate the desired GCC version. For more detailed information and usage examples, please see our detailed guide on the GCC Toolset Activation page.
Current Community Models and Datasets
Models
/data/models/huggingface/
/data/models/meta-ai/
Meta-AI
Llama
/data/models/meta-ai/llama/weights/
/data/models/meta-ai/llama2/weights/
HuggingFace
alpaca
/data/models/huggingface/chavinlo/models--chavinlo--alpaca-native
huggyllama
/data/models/huggingface/huggyllama/models--huggyllama--llama-30b
/data/models/huggingface/huggyllama/models--huggyllama--llama-65b
/data/models/huggingface/huggyllama/models--huggyllama--llama-7b
Vicuna
/data/models/huggingface/lmsys/longchat-7b-v1.5-32k
/data/models/huggingface/lmsys/vicuna-13b-v1.5
/data/models/huggingface/lmsys/vicuna-13b-v1.5-16k
/data/models/huggingface/lmsys/vicuna-7b-v1.5
/data/models/huggingface/lmsys/vicuna-7b-v1.5-16k
meta-llama2
/data/models/huggingface/meta-llama/CodeLlama-13b-Instruct-hf
/data/models/huggingface/meta-llama/CodeLlama-13b-Python-hf
/data/models/huggingface/meta-llama/CodeLlama-34b-Instruct-hf
/data/models/huggingface/meta-llama/CodeLlama-34b-Python-hf
/data/models/huggingface/meta-llama/CodeLlama-7b-Instruct-hf
/data/models/huggingface/meta-llama/CodeLlama-7b-Python-hf
/data/models/huggingface/meta-llama/Llama-2-13b-chat-hf
/data/models/huggingface/meta-llama/Llama-2-70b-chat-hf
/data/models/huggingface/meta-llama/Llama-2-70b-hf
/data/models/huggingface/meta-llama/Llama-2-7b-chat-hf
/data/models/huggingface/meta-llama/Llama-2-7b-hf
meta-llama-3.1
/data/models/huggingface/meta-llama/Llama-3.1-8B-Instruct
/data/models/huggingface/meta-llama/Llama-3.1-70B
/data/models/huggingface/meta-llama/Llama-3.1-70B-Instruct
tiiuae
/data/models/huggingface/tiiuae/falcon-180B
/data/models/huggingface/tiiuae/falcon-40b
/data/models/huggingface/tiiuae/falcon-7b
Datasets
/data/datasets/huggingface/
/data/datasets/GQA
/data/datasets/the_pile
/data/datasets/clueweb22 (requires licensing; see ClueWeb)
/data/datasets/shared/hypersim
/data/datasets/shared/scannet
/data/datasets/shared/tonicair (note: this is the TartanAir dataset; the directory name is a typo)
Cluster Architecture
The Babel cluster comprises various nodes optimized for different computational requirements. The cluster architecture includes the following key components:
- Operating System: Springdale 8 [Red Hat Enterprise Linux compliant]
- Kernel: x86_64 Linux 4.18.0-372.32.1.el8_6.x86_64
- Login Node: The login node is used for logging into the cluster, launching jobs, and connecting to compute nodes.
- Head Node: The control node serves as the primary management and coordination point for the cluster, and is responsible for several key functions:
  - Ansible Control Node: The control node is the primary Ansible node, responsible for managing and automating tasks across the entire system.
  - SLURM Controller: The control node manages the SLURM installation and configuration, and is responsible for scheduling and managing jobs on the compute nodes.
  - SLURM Database: The control node may also serve as the primary database node, storing and managing data related to the cluster's configuration, job scheduling, and system performance.
- Compute Nodes: The compute nodes provide CPU and GPU resources, local scratch space, and network-mounted storage for running compute-intensive tasks.
- NAS: The NAS provides network-attached storage for the cluster, allowing users to store and access data from anywhere on the network.
See HPC Terminology for more info
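To see how these pieces map onto the running cluster, the standard Slurm queries work from the login node (the node name below is only an example):
sinfo                              # partitions and the state of their compute nodes
sinfo -N -l                        # per-node view with CPU and memory details
scontrol show node babel-0-19      # full details for a single compute node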
Filesystem Layout
- Each user is provisioned with:
  - /home/<username>: 100GB
  - /data/user_data/<username>: 500GB (available only on the compute nodes)
  - /data/<sub_group>/user_data/<username>: 500GB (for members of groups with dedicated storage appliances; also available only on the compute nodes)
- NFS-mounted storage via autofs (i.e. it is not local disk on each compute node):
  - /home/: mounted on the login nodes and all compute nodes
  - /data/datasets: available only on the compute nodes
  - /data/models: available only on the compute nodes
  - /compute/<node_name>: available only on the compute nodes
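To get a rough idea of how much of your allocation you are using, run the following on a compute node (several of these paths are only mounted there); note that df reports the size of the underlying export, which may not exactly match your personal quota:
df -h /home/$USER /data/user_data/$USER   # mounted size and free space of the exports
du -sh /home/$USER                        # total space your own files occupy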
Data Directories
Community datasets are placed at /data/datasets/<thing>. If you have a dataset or model that would be useful to everyone, let the administrators know by sending e-mail to help@cs.cmu.edu with the details of the request, or have your advisor contact the department's HPC administrator.
If you or your group requires additional space, have your sponsor make the request by sending e-mail to help@cs.cmu.edu with the details of the request.
AutoFS
AutoFS directories are not always mounted, because AutoFS is an "on-demand" filesystem. You may need to stat the full path to the files you are looking for. For example, the output of ls /compute/ might seem empty; however, if you run ls /compute/babel-0-23, the contents of that node's /scratch dir will be revealed.
For example:
[dvosler@babel-1-27 ~]# ls -la /compute/
total 4
drwxr-xr-x   2 root root    0 Jun 23 15:02 .
dr-xr-xr-x. 21 root root 4096 Jun 23 15:02 ..

[root@babel-1-27 ~]# ls -la /compute/babel-0-19
total 28
drwxrwxrwt 4 root    root     4096 May 29 10:25 .
drwxr-xr-x 3 root    root        0 Jun 28 16:17 ..
drwx------ 2 root    root    16384 May  5 15:02 lost+found
drwxrwxr-x 3 dvosler dvosler  4096 May 11 18:41 things
-rw-rw-r-- 1 dvosler dvosler     1 May 25 10:41 version.txt
This also applies to other paths. If you think your data is missing or not mounting try to ls the full path.
Local Scratch Partition
When you are frequently accessing large files, you should first move them to the /scratch directory of the local machine your job is running on. Read/write is much faster locally than it is over the network.
The /scratch dir of each node is exported via NFS to the other nodes on the cluster; the local disk of node babel-X-X can be accessed at /compute/babel-X-X from other nodes. This allows faster access and reduces pressure on the NAS, and is highly recommended for large or frequently accessed files (see the sketch below).
- Compute nodes and scratch should only hold temporary files, in the sense that you should clean up after yourself.
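A minimal sketch of staging data onto the local scratch disk inside a job (the dataset, output, and script names are placeholders; adjust to your own data):
# inside your job script, on the allocated compute node
mkdir -p /scratch/$USER
rsync -a /data/user_data/$USER/my_dataset/ /scratch/$USER/my_dataset/    # stage input onto local disk
python train.py --data /scratch/$USER/my_dataset                         # read from local scratch during the job
rsync -a /scratch/$USER/outputs/ /data/user_data/$USER/outputs/          # copy results back before the job ends
rm -rf /scratch/$USER/my_dataset /scratch/$USER/outputs                  # clean up after yourself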
Storage
The storage for this cluster is served from babel-nas-1.lti.cs.cmu.edu running Springdale 8.
ZFS Pools
The following ZFS pools are available:
- babel-zp00_home - Size: 38.4T
- babel-zp01_data - Size: 160T
Backups
No tape backups are performed.
ZFS snapshots are handled by sanoid.
Snapshot configuration for babel-zp0?_???
:
- Hourly: 30
- Daily: 42
- Weekly: 18
- Monthly: 13
- Yearly: 0
Export Layer
The export layer for this cluster is nfsv4.
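To confirm how a given path is being served on a node (keeping in mind that autofs only mounts a path after it has been accessed), something like the following works:
findmnt /data/datasets   # source export and mount options for that path
findmnt -t nfs4          # all NFSv4 mounts currently active on the node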
Network Fabrics
Ethernet
- 1Gbps Capable:
- eno1
- eno2 (unused)
Infiniband
- EDR (Rate: 100Gbit/s):
- ib0: 172.16.1.1/24
- ib1 (unused)
Location
The cluster is located in Wean Hall machine room [MRA].
FAQ on Babel Resource Allocation and Best Practices
Q: How are jobs prioritized by the scheduler?
Generally, all users have equal priority. The debug, general, and long partitions do not have user- or group-specific priority, but the partitions themselves are ranked from high to low priority, and user fairshare is factored into scheduling. In some cases, users may have higher-priority access; for example, research groups who have donated nodes may request dedicated partitions for priority access to (a subset of) those nodes, or a dedicated node reserved exclusively for their research group.
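To see how these factors play out for your own jobs, Slurm's reporting commands can help (the exact priority weights configured on Babel may differ from the defaults):
sshare -u $USER           # your fairshare usage and effective share
sprio -u $USER            # per-factor priority breakdown for your pending jobs
squeue -u $USER --start   # the scheduler's current start-time estimates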
Q: How is disk space allocated?
By default, each research group has a 20 GB quota across all users. Some research groups may choose to purchase additional disk space. There are also some shared drives for shared data resources (see Current Community Models and Datasets).
Q: Do you have advice for long-running jobs?
- Make sure your code saves checkpoints frequently so that it can recover from being preempted (see the sketch below).
- Post on the #babel-babble Slack channel first to alert other users.
- Consider running on the `long` partition.
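A minimal sketch of a preemption-tolerant batch script (the script name and checkpoint path are placeholders; the checkpointing logic itself lives in your code):
#!/bin/bash
#SBATCH --partition=long
#SBATCH --time=2-00:00:00
#SBATCH --requeue                  # allow the job to be put back in the queue after preemption
#SBATCH --open-mode=append         # keep appending to the same log file across requeues
#SBATCH --output=slurm-%j.out

# train.py and the checkpoint directory are placeholders; the important part is that
# your code periodically saves state and resumes from the newest checkpoint on restart.
python train.py --checkpoint-dir /data/user_data/$USER/checkpoints --resume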
Q: What should I do if I notice another user's jobs/files are disrupting usage of the cluster for others?
Please message the babble-babel channel, tagging the user with the problematic job as well as @help-babel. Remember to communicate with respect; most errors are honest mistakes.
Q: I have other questions which aren't answered here.
Reach out on the babble-babel Slack channel, tagging @help-babel. If you discover an answer that may be useful to others, please contribute it to this FAQ.