
Babel is a high-performance computing (HPC) cluster that provides advanced computing capabilities for research and scientific workloads. This page describes the architecture and specifications of the Babel cluster.

Upgrade History

Birth: 2023-05-04 18:02:55.854769612 -0400

SSH Login

To access the Babel cluster, you need your Andrew ID and password. Use the following command to log in:

 ssh <username>@babel.lti.cs.cmu.edu

See Connecting to the Cluster.
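
If you connect often, an entry in your local ~/.ssh/config saves typing. A minimal sketch, assuming the alias "babel" (the alias is hypothetical; replace <username> with your Andrew ID):

    # ~/.ssh/config -- "babel" is a hypothetical alias of your choosing
    Host babel
        HostName babel.lti.cs.cmu.edu
        User <username>

With this in place, ssh babel is equivalent to the full command above.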

Resource Scheduler

Slurm 20.11.9 is used for job scheduling.
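
For illustration, jobs are submitted with the standard Slurm commands; the partition name and resource requests below are examples, not a statement of this cluster's defaults:

    # Request an interactive shell on a compute node (resources illustrative)
    srun --partition=debug --gpus=1 --mem=16G --time=1:00:00 --pty bash

    # Submit a batch script (my_job.sh is a placeholder) and check your queue
    sbatch my_job.sh
    squeue -u $USER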

Environment Modules

The cluster uses environment modules to provide multiple versions of software packages. For example, both CUDA 12.1 and 12.2 are available on the cluster. You can check available packages using:

module avail

To switch to a particular version, simply run module load <software_version>, for example:

module load cuda-12.1

will make cuda-12.1 available. Note that if you load a module on the head node, it will also be available on the compute nodes.
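
A few other module subcommands are useful day to day:

    module list               # show currently loaded modules
    module unload cuda-12.1   # unload a specific module
    module purge              # unload all loaded modules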

See Environment_Modules.

GCC Versions

To change the GCC version on our systems, use the command:

scl enable gcc-toolset-X bash

Replace "X" with the version number (9, 10, or 11) to activate the desired GCC version. For more information and usage examples, see the GCC Toolset Activation page.
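
For example, to work in a shell with GCC 11 and confirm the switch (scl enable starts a new shell, so exit returns you to the default toolchain):

    scl enable gcc-toolset-11 bash   # start a shell with GCC 11 active
    gcc --version                    # should now report the toolset's GCC
    exit                             # return to the default environment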

Models

  • /data/models/huggingface/
  • /data/models/meta-ai/
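
The models--<org>--<name> entries listed below follow the Hugging Face hub cache layout, so one plausible way to reuse them, an assumption rather than an official instruction, is to point the hub cache at the relevant per-organization directory instead of re-downloading weights into your own quota:

    # Assumption: each per-organization directory acts as a hub-style cache root
    export HF_HUB_CACHE=/data/models/huggingface/huggyllama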

Meta-AI

Llama

  • /data/models/meta-ai/llama/weights/
  • /data/models/meta-ai/llama2/weights/

HuggingFace

alpaca

  • /data/models/huggingface/chavinlo/models--chavinlo--alpaca-native

huggyllama

  • /data/models/huggingface/huggyllama/models--huggyllama--llama-30b

  • /data/models/huggingface/huggyllama/models--huggyllama--llama-65b
  • /data/models/huggingface/huggyllama/models--huggyllama--llama-7b

Vicuna

  • /data/models/huggingface/lmsys/longchat-7b-v1.5-32k
  • /data/models/huggingface/lmsys/vicuna-13b-v1.5
  • /data/models/huggingface/lmsys/vicuna-13b-v1.5-16k
  • /data/models/huggingface/lmsys/vicuna-7b-v1.5
  • /data/models/huggingface/lmsys/vicuna-7b-v1.5-16k

meta-llama2

  • /data/models/huggingface/meta-llama/CodeLlama-13b-Instruct-hf
  • /data/models/huggingface/meta-llama/CodeLlama-13b-Python-hf
  • /data/models/huggingface/meta-llama/CodeLlama-34b-Instruct-hf
  • /data/models/huggingface/meta-llama/CodeLlama-34b-Python-hf
  • /data/models/huggingface/meta-llama/CodeLlama-7b-Instruct-hf
  • /data/models/huggingface/meta-llama/CodeLlama-7b-Python-hf
  • /data/models/huggingface/meta-llama/Llama-2-13b-chat-hf
  • /data/models/huggingface/meta-llama/Llama-2-70b-chat-hf
  • /data/models/huggingface/meta-llama/Llama-2-70b-hf
  • /data/models/huggingface/meta-llama/Llama-2-7b-chat-hf
  • /data/models/huggingface/meta-llama/Llama-2-7b-hf

meta-llama-3.1

  • /data/models/huggingface/meta-llama/Llama-3.1-8B-Instruct
  • /data/models/huggingface/meta-llama/Llama-3.1-70B
  • /data/models/huggingface/meta-llama/Llama-3.1-70B-Instruct

tiiuae

  • /data/models/huggingface/tiiuae/falcon-180B
  • /data/models/huggingface/tiiuae/falcon-40b
  • /data/models/huggingface/tiiuae/falcon-7b

Datasets

  • /data/datasets/huggingface/
  • /data/datasets/GQA
  • /data/datasets/the_pile
  • /data/datasets/clueweb22 (requires licensing; see ClueWeb)
  • /data/datasets/shared/hypersim
  • /data/datasets/shared/scannet
  • /data/datasets/shared/tonicair (note: this is the TartanAir dataset; the directory was named incorrectly)

The Babel cluster comprises various nodes optimized for different computational requirements. The cluster architecture includes the following key components:

  • Operating System: Springdale 8 [Red Hat Enterprise Linux compliant]
  • Kernel: x86_64 Linux 4.18.0-372.32.1.el8_6.x86_64
  • Login Node: The login node is used for logging into the cluster, launching jobs, and connecting to compute nodes.
  • Head Node: The head node (also called the control node) serves as the primary management and coordination point for the cluster and is responsible for several key functions:
    • Ansible Control Node: The control node is the primary Ansible node, responsible for managing and automating tasks across the entire system.
    • SLURM Controller: The control node manages the SLURM installation and configuration, and is responsible for scheduling and managing jobs on the compute nodes.
    • SLURM Database: The control node may also serve as the primary database node, storing and managing data related to the cluster’s configuration, job scheduling, and system performance.
  • Compute Nodes: The compute nodes provide CPU and GPU resources, local scratch space, and network-mounted storage for running compute-intensive tasks.
  • NAS: The NAS provides network-attached storage for the cluster, allowing users to store and access data from anywhere on the network.


See HPC Terminology for more info.

  • Each user is provisioned with the following quotas (a usage-check example follows this list):
    • /home/<username>: 100GB
    • /data/user_data/<username>: 500GB (available only on the compute nodes)
    • /data/<sub_group>/user_data/<username>: 500GB (for members of groups with dedicated storage appliances; also available only on the compute nodes)
  • NFS-mounted storage via autofs (i.e., it is not local disk on each compute node):
    • /home/: mounted on the login nodes and all compute nodes
    • /data/datasets: available only on the compute nodes
    • /data/models: available only on the compute nodes
    • /compute/<node_name>: available only on the compute nodes
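
To see how much of an allocation a directory is using, du works on any of these mounts; for example:

    du -sh /home/$USER              # total size of your home directory
    du -sh /data/user_data/$USER    # total size of your user_data allocation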

Data Directories

Community datasets are placed at /data/datasets/<thing>. If you have a dataset or model that would be useful to everyone, let the administrators know by sending e-mail to help@cs.cmu.edu with the details of the request, or have your advisor contact the department's HPC administrator.

If you or your group requires additional space, have your sponsor make the request by sending e-mail to help@cs.cmu.edu with the details of the request.

AutoFS

AutoFS directories are not always mounted, as AutoFS is an "on-demand" filesystem. You may need to stat the full path to the files you are looking for. For example, the output of ls /compute/ might seem empty; however, if you ls /compute/babel-0-19, the contents of that node's /scratch directory will be revealed.

For example:

   [dvosler@babel-1-27 ~]$ ls -la /compute/
   total 4
   drwxr-xr-x   2 root root    0 Jun 23 15:02 .
   dr-xr-xr-x. 21 root root 4096 Jun 23 15:02 ..
   [dvosler@babel-1-27 ~]$ ls -la /compute/babel-0-19
   total 28
   drwxrwxrwt 4 root     root      4096 May 29 10:25 .
   drwxr-xr-x 3 root     root         0 Jun 28 16:17 ..
   drwx------ 2 root     root     16384 May  5 15:02 lost+found
   drwxrwxr-x 3 dvosler  dvosler   4096 May 11 18:41 things
   -rw-rw-r-- 1 dvosler  dvosler      1 May 25 10:41 version.txt

This also applies to other paths. If you think your data is missing or not mounting, try to ls the full path.

Local Scratch Partition

If you are frequently accessing large files, you should first move them to the /scratch directory of the local machine your job is running on. Reads and writes are much faster locally than over the network.

The /scratch directory of each node is exported via NFS to the other nodes on the cluster; the local disk of node babel-X-X can be accessed at /compute/babel-X-X from other nodes. This allows faster access and reduces pressure on the NAS, and is highly recommended for large or frequently accessed files (a staging sketch follows the note below).

  • Compute nodes and scratch should only hold temporary files, in the sense that you should clean up after yourself.
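
A minimal staging sketch for a job script, assuming a per-job directory layout; the filename and paths are illustrative, and $SLURM_JOB_ID is set by Slurm inside a job:

    # Stage input to node-local scratch, run against it, then clean up.
    SCRATCH=/scratch/$USER/$SLURM_JOB_ID                  # hypothetical per-job directory
    mkdir -p "$SCRATCH"
    cp /data/user_data/$USER/big_input.bin "$SCRATCH/"    # big_input.bin is illustrative
    # ... run your computation against $SCRATCH/big_input.bin ...
    rm -rf "$SCRATCH"                                     # clean up after yourself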

The storage for this cluster is served from babel-nas-1.lti.cs.cmu.edu running Springdale 8.

ZFS Pools

The following ZFS pools are available:

 - babel-zp00_home: 38.4T
 - babel-zp01_data: 160T

Backups

No tape backups are performed.

ZFS snapshots are handled by sanoid.

Snapshot configuration for babel-zp0?_???:

 - Hourly: 30
 - Daily: 42
 - Weekly: 18
 - Monthly: 13
 - Yearly: 0
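
If snapshot visibility is enabled on a filesystem (a ZFS configuration detail, so treat this as an assumption), earlier versions of files can typically be browsed under the hidden .zfs/snapshot directory of the mount:

    # Assumption: the .zfs snapdir is reachable over the NFS export
    ls /home/.zfs/snapshot/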

The export layer for this cluster is NFSv4.

Ethernet

- 1Gbps Capable:

 - eno1
 - eno2 (unused)

InfiniBand

- EDR (Rate: 100Gbit/s):

 - ib0: 172.16.1.1/24
 - ib1 (unused)

The cluster is located in Wean Hall machine room [MRA].

Q: How are jobs prioritized by the scheduler?

Generally, all users have equal priority. While the debug, general, and long partitions do not have user- or group-specific priority, the partitions themselves are ranked from high to low priority. User fairshare is also factored into scheduling. In some cases, users may have higher-priority access; for example, research groups that have donated nodes may request dedicated partitions for priority access to (a subset of) those nodes, or may request a dedicated node reserved exclusively for their research group.
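
Slurm's own tools let you inspect these factors:

    sshare -u $USER   # your account's fairshare standing
    sprio -u $USER    # priority factors for your pending jobs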

Q: How is disk space allocated?

By default, each research group has a 20 GB quota shared across all of its users. Some research groups may choose to purchase additional disk space. There are also shared drives for common data resources (see Current Community Models and Datasets).

Q: Do you have advice for long-running jobs?

  1. Make sure your code saves checkpoints frequently so that it can recover from being preempted (see the sketch after this list).
  2. Post on the #babel-babble Slack channel first to alert other users.
  3. Consider running on the `long` partition.
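
As a sketch of point 1 (the script name, checkpoint path, and flags of train.py are hypothetical), a batch script can ask Slurm to requeue it on preemption and resume from its latest checkpoint:

    #!/bin/bash
    #SBATCH --partition=long     # per point 3
    #SBATCH --time=48:00:00
    #SBATCH --requeue            # allow the job to be requeued if preempted

    CKPT=/data/user_data/$USER/checkpoints   # hypothetical checkpoint directory
    mkdir -p "$CKPT"
    # train.py and its flags are hypothetical; the point is to checkpoint often
    # and resume automatically after a requeue.
    python train.py --checkpoint-dir "$CKPT" --resume latest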

Q: What should I do if I notice another user's jobs/files are disrupting usage of the cluster for others?

Please message the #babel-babble channel, tagging the user with the problematic job as well as @help-babel. Remember to communicate with respect; most errors are honest mistakes.

Q: I have other questions which aren't answered here.

Reach out on the #babel-babble Slack channel, tagging @help-babel. If you discover an answer that may be useful to others, please contribute it to this FAQ.