Compute Canada Notes
slurm general
A clarification between srun, sbatch, and salloc:
- srun executes a single command.
- sbatch submits a batch script and lets Slurm take care of the standard output; it is suitable for long-running tasks.
- salloc allocates resources (CPU, GPU, etc.) for interactive work.
Official Machine Learning Courses
Talks
There is a series of talks and tutorials from SHARCNET.
A wiki page collects all past talks: Online Seminars – SHARCNETHelp
Deploy your virtual environment and submit a job
    # check available packages
    module avail $Package_Name
    # load the required modules first, so the virtual environment is built on them
    module load python
    module load scipy-stack
    # create virtual environment
    virtualenv --no-download ~/VENV
    # activate virtual environment
    source ~/VENV/bin/activate
    # update pip
    pip install --upgrade pip
    # check available wheels on the cluster to avoid repetitive downloading
    avail_wheels torch
    # use '--no-index' to install the packages offered by the server
    pip install --no-index torch
    # search on compute canada wheel warehouse
    # check installation
    pip list | grep torch
NB: loading a module does not mean the package is installed. In Jupyter, you still have to run pip3 install --no-index $package_name.
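To check from inside a notebook whether a package is really importable in the active environment (and not just offered as a module or wheel), a minimal sketch using only the standard library could look like this. The helper name is_installed is hypothetical and not part of any cluster tooling.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if `package_name` can be imported in the current environment."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # sys.prefix shows which environment (for example ~/VENV) the kernel runs in.
    print("environment:", sys.prefix)
    for name in ("torch", "numpy"):
        if is_installed(name):
            print(name, "is installed")
        else:
            print(name, "is missing, try: pip install --no-index", name)
```

If the reported environment is not the virtual environment you expected, the notebook kernel is probably not the one registered for that environment.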
Job submission requires a bash script file. An example script is shown below:
    #!/bin/bash
    #SBATCH --account=
    #SBATCH --ntasks=2
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:2
    #SBATCH --cpus-per-task=3
    #SBATCH --mem=10G
    #SBATCH --time=00:30:00
    #SBATCH -o outlog-%j.out
    #SBATCH --job-name=my-named-job
    # the next two lines are for email notification
    #SBATCH --mail-user=your.email@example.com
    #SBATCH --mail-type=ALL
    module load python
    module load scipy-stack
    source ~/VENV/bin/activate
    cd ~/MNIST
    srun python ~/MNIST/mnist.py
Then submit the script with:
    sbatch YOUR_SUBMIT_SCRIPT.sh
python package and environment
On Cedar, required Python packages may need to be installed with:
    python -m pip install $PACKAGE
Another point: Graham can be the most responsive cluster, since it is hosted at the University of Waterloo.
Note where the clusters are located: Cedar is in BC. Graham also supports Jupyter now.
Real-time python output to file
For code running on a server, writing output to a file in real time is recommended.
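As a minimal sketch of this idea (the helper name log_line and the file name train_log.txt are hypothetical), each message can be appended and flushed immediately, so the file can be followed with tail -f while the job is still running:

```python
import os
from datetime import datetime


def log_line(path, message):
    """Append one timestamped line to `path` and push it to disk right away."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(f"{stamp} {message}\n")
        log_file.flush()             # empty Python's own buffer
        os.fsync(log_file.fileno())  # ask the OS to write the data to disk


if __name__ == "__main__":
    for epoch in range(3):
        log_line("train_log.txt", f"epoch {epoch} done")
```

For output produced with print, print(..., flush=True) or running the script with python -u has a similar effect on standard output.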
    with open('somefile.txt', 'a') as your_file:
        your_file.write('Hello World\n')
file storage
Scratch has 20 TB, but files older than 60 days will be purged.
Project has 1 TB and is not purged. The best practice is to do intensive reads and writes on scratch and keep backups on project. (For search index purposes, leave expired, expiry, expiring here.)
An email will be sent to the user before a purge.
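For readers who prefer Python to shell tools, a rough sketch of finding purge candidates is given here. It assumes, as these notes do, that the last access time is what matters; the function name and the 59-day default are illustrative choices, and /scratch/YOUR_USER_NAME is a placeholder.

```python
import os
import time

SECONDS_PER_DAY = 86400


def files_not_accessed_since(root, days=59):
    """List files under `root` last accessed more than `days` whole days ago.

    The age is truncated to whole days before comparing, which mirrors
    the behaviour of `find root -atime +days`.
    """
    now = time.time()
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.lstat(path).st_atime
            except OSError:
                continue  # the file vanished or is unreadable; skip it
            if int((now - atime) // SECONDS_PER_DAY) > days:
                stale.append(path)
    return stale


if __name__ == "__main__":
    # Placeholder path: replace it with your own scratch directory.
    for path in files_not_accessed_since("/scratch/YOUR_USER_NAME"):
        print(path)
```

Unlike a pipeline that splits on whitespace, this version also copes with file names that contain spaces.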
To locate the files in the purge warning:
    # -atime tests the last access time: +59 matches files last accessed 60 or more days ago, 30 means exactly 30 days
    # > purge_warn_ls.txt redirects the output to a text file for the next filtering step
    find /scratch/YOUR_USER_NAME -atime +59 -ls > purge_warn_ls.txt
    # the first awk prints the last field ($NF, the path) and pipes it to the next stage
    # the second awk splits the path on '/' and prints fields 2, 3, 4 and 5
    # sort -u keeps only the unique items from the previous stage,
    # so we can see which folders the purge-warning files are in
    awk '{print $NF}' purge_warn_ls.txt | awk -F'/' '{print $2,$3,$4,$5}' | sort -u
    # batch update the files: just touch every file in the purge list
    awk '{print $NF}' purge_warn_ls.txt | xargs -n1 touch
    # to actively remove files instead (-v inverts the match, so files matching the pattern are kept)
    ls | grep -v "THE_PATTERN_OF_REMAIN_FILES" | xargs rm
file transfer
- To dropbox: https://riptutorial.com/dropbox-api
- Nextcloud by Compute Canada: https://docs.alliancecan.ca/wiki/Nextcloud
At the bottom of that page there are 2 command lines for this (the user name after -u and the file names after -T and -o still need to be filled in):
    # file upload
    curl -k -u -T https://nextcloud.computecanada.ca/remote.php/webdav/
    # file download
    curl -k -u https://nextcloud.computecanada.ca/remote.php/webdav/ -o
- Cloud Local: https://docs.alliancecan.ca/wiki/Transferring_data#From_the_World_Wide_Web
basically using sftp
matlab
matlab on the Compute Canada cluster requires:
    # load module
    module load matlab
    # test license availability; OK if it returns a serial number
    matlab -nodisplay -nojvm -batch "disp(license())"
    # run in command line mode
    matlab -nodisplay -nojvm
Because MATLAB is RAM-greedy, salloc is usually used to allocate resources before running it.
Check submitted Slurm requests with:
    sacct
It seems jobs can use a complete node (node mode) or part of a node (task mode). {Ref-Official Doc}
Job arrays can also be used for sequential jobs, and MPI for parallel jobs.
Here is a PDF introducing the job submission and scheduling regulations: {Link}
An example allocation request is below (fill in the values left blank after each option):
    salloc --time=DD-HH:MM --mem-per-cpu=G --ntasks= --account=
tips on Graham
Official doc
It includes many handy customization tips.
https://wiki.math.uwaterloo.ca/fluidswiki/index.php?title=Graham_Tips#Virtual_Desktop
request GPU
Official doc {Link} gives examples of:
- GPU on one node
- task-oriented multi-GPU
- MPI multi-threading
An example of a GPU request is shown below:
    salloc --account= --mem=8G --time=3:00:00 -J --nodes=1 --gpus-per-node=p100:1
I am using single-node requests for now. The task-oriented and multi-threaded requests are appealing, though, and I may try the task-oriented one soon.
screen
{Doc}
NB: a window or region is the display area; a screen or bash is the running shell.
- start with ctrl + a to enter command mode
- vertical split region: |
- horizontal split region: S
- cancel split region: Q
- switch region: Tab
- activate or switch bash: ctrl + a
- change bash title: A
- switch to a certain bash: num
- command mode: :
  - focus right: focus on the right part
  - resize: change size; ctrl + - and ctrl + + can also be used to quickly decrease and increase the size
  - save layout: layout dump .my_filename
  - reload layout: source .my_filename
  - set as default: echo source .my_filename >> ~/.screenrc
- An example workflow:
  - ctrl+a+| or S to split regions
  - create a new screen with ctrl a c, or activate one with a double ctrl a
  - change the title with ctrl a A
job scheduling
Ref: {official doc}
check remaining quota (user limits)
Use sshare -A def-<account>_<cpu|gpu> -l -U <user> to check user limits. Replace <account> and <user> with your account and user name; <cpu|gpu> means choose one of the two. This is tricky, as there are actually two separate accounts for CPU and GPU jobs.
The EffectvUsage column shows the proportion already used. A low EffectvUsage usually comes with a high LevelFS, indicating high priority.
The partition-stats command should return the load of each node; however, it does not work on Cedar.
to minimize wait time
The official doc {Link} suggests that allocation requests of less than 3 hours tend to start almost immediately.
In my experience on Graham, a request with --time=3:00:00 almost always starts right away.
The full list of run-time levels is:
- 3 hours or less
- 12 hours or less
- 24 hours (1 day) or less
- 72 hours (3 days) or less
- 7 days or less
- 28 days or less
The official instructions:
- Specify the job runtime only slightly (~10-20%) larger than the estimated value.
- Only ask for the memory your code will actually need (with a bit of a cushion).
- Minimize the number of node constraints.
- Do not package what is essentially a bunch of serial jobs into a parallel (MPI/threaded) job. It is much faster to schedule many independent serial jobs than a single parallel job using the same number of CPU cores.
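The first instruction and the run-time levels above can be combined in a small helper. This is only a sketch: the function names are hypothetical, and the 15% default margin is one arbitrary pick from the 10-20% range quoted above.

```python
import math

# Run-time levels from the list above, in hours.
RUNTIME_LEVELS_HOURS = [3, 12, 24, 72, 7 * 24, 28 * 24]


def request_time(estimate_hours, margin=0.15):
    """Format the estimate plus a safety margin as Slurm's D-HH:MM:SS."""
    # round() guards against floating-point noise before rounding up.
    total_minutes = math.ceil(round(estimate_hours * (1 + margin) * 60, 6))
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:00"


def runtime_level(request_hours):
    """Return the smallest run-time level (in hours) that fits the request."""
    for limit in RUNTIME_LEVELS_HOURS:
        if request_hours <= limit:
            return limit
    raise ValueError("request exceeds the 28-day maximum")


if __name__ == "__main__":
    estimate = 2.0  # estimated runtime in hours
    print("--time=" + request_time(estimate))
    print("falls into the level of", runtime_level(estimate * 1.15), "hours or less")
```

A job whose padded runtime lands just above a level boundary (for example slightly over 3 hours) may be worth trimming so that it fits the shorter level.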
Some handy command line combo
    # batch rename with a for loop (dry run: remove echo to actually rename)
    for i in *; do echo mv -i "$i" "${i::-5}"; done
    # sort by file size, then print file name and size
    ls -Shl FILE_NAME_PATTERN | awk 'NR>1 {print $NF, $5}'
    # match a pattern and print the next few lines
    awk '/(sensors =)/{x=NR+9}(NR<=x){print}/File name/{print}' info_log.txt
connect to allocated nodes
{Official Doc-Attach to a running job}
    srun --jobid JOB_ID --pty tmux
tmux is a screen-like program for working with multiple terminal windows.
The tmux cheat sheet: {Ref}
check job status
Check the progress with:
    squeue -u $USER
    # or the short form (sq is the cluster's shorthand for squeue -u $USER)
    sq
    # cancel a job
    scancel JOB_ID
Jobs can have 3 statuses:
- CG: job completing
- PD: pending, followed by a reason (Resources, Priority, ReqNodeNotAvail) {Ref}
- R: running
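When many jobs are queued, counting them per status can be easier than reading the full listing. Below is a hedged sketch: the function names are hypothetical, and it assumes squeue is asked for a compact "job id, state, reason" format via -h -o "%i %t %r".

```python
import subprocess
from collections import Counter

# Readable names for the three status codes listed above.
STATE_NAMES = {"PD": "pending", "R": "running", "CG": "completing"}


def summarize_queue(squeue_text):
    """Count jobs per state from lines of the form 'JOBID STATE REASON'."""
    counts = Counter()
    for line in squeue_text.splitlines():
        fields = line.split(maxsplit=2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        state = fields[1]
        counts[STATE_NAMES.get(state, state)] += 1
    return counts


def query_queue(user):
    """Run squeue for `user`; this only works where Slurm is installed."""
    result = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %t %r"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Made-up sample text, so the example also runs away from the cluster.
    sample = "101 R None\n102 PD Priority\n103 PD Resources\n"
    print(summarize_queue(sample))
```

On a login node, summarize_queue(query_queue("YOUR_USER_NAME")) would give the live counts.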
Ref: SHARCNET official course series {Dashboard_Link} {ML_Intro}, {Scheduler}
tensorflow deployment
The main challenge is numpy and tensorflow compatibility.
The following scheme works for now (2023/03/28): link
Official doc {Link}
tensorboard: interactive, visualized probing
Motivation: a quick visualization is needed for the growing workload in model profiling, especially when transfer learning is used.
Official wiki: {Link}
It is recommended to connect to the running node before using it. This operation is described in the "connect to allocated nodes" section above.
Start tensorboard with the command below. The default port is 6006; use a different port to avoid interference. --load_fast false is a TensorBoard flag that turns off its fast data-loading mode.
    tensorboard --logdir= --host 0.0.0.0 --load_fast false --port=6008
Then bind a local port to the remote port (6008 here, matching the command above) to view it:
    ssh -N -f -L localhost:localport:computenode:6008 userid@cluster.computecanada.ca
iPython
    # auto-completion
    {Tab}
    # see the doc of a function
    help(YOUR_FUNCTION)
    # load the venv in jupyter
    conda install ipykernel / pip install ipykernel
    # --user means install only for the current user
    # ENV_NAME should be consistent with the venv created by virtualenv
    # a typical display name can be 'Python(ENV_NAME)'
    python -m ipykernel install --user --name ENV_NAME --display-name DISPLAY_NAME
zip
{Ref}
    # zip one or more folders into a .zip file
    zip -r archivename.zip directory_name
    # zip several files into one zip file
    zip archivename.zip file1 file2 file...
    # set the compression level from 0 to 9; -0 is no compression, -9 is maximal compression
    zip -0 -r archivename.zip directory_name
    # encrypted zip
    zip -e -r archivename.zip directory_name
ML 101
Reinforcement Learning
From wiki
It doesn't need paired labels. The practitioner only needs to mark the results generated by the model.
scheduler run MNIST with GPU
Reading the man page of sbatch:
--constraint can specify CPU and GPU features.
- If you write bash scripts with Notepad++ on Windows, replace all '\r\n' with '\n' before running them on Linux.
- sbatch only accepts the account def-bingqli so far.
- Packages you need must be installed in your own environment ahead of time, such as torchvision, torchtext, torchaudio:
    pip install --no-index torchvision
- See this page for JupyterHub on clusters: https://docs.computecanada.ca/wiki/JupyterHub
- Check available wheels here: https://docs.computecanada.ca/wiki/Available_Python_wheels
- Check jobs in the queue:
    squeue -u $USER
    # or
    sq
parallel computing
- Only parallelize tasks that take longer than 1e-4 s
- Measure the workflow with a timer instead of guessing
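Both rules can be applied with a few lines of timing code. The sketch below uses time.perf_counter from the standard library; the function names and the choice of 20 repeats are illustrative assumptions.

```python
import statistics
import time

PARALLEL_THRESHOLD_S = 1e-4  # threshold quoted in the notes above


def median_runtime(func, repeats=20):
    """Return the median wall-clock time (seconds) of calling func()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def worth_parallelizing(func, threshold=PARALLEL_THRESHOLD_S):
    """True if a single call of `func` takes longer than `threshold` seconds."""
    return median_runtime(func) > threshold


if __name__ == "__main__":
    def small_task():
        return sum(range(10))

    def large_task():
        return sum(range(1_000_000))

    for task in (small_task, large_task):
        seconds = median_runtime(task)
        print(task.__name__, f"{seconds:.2e} s", "parallelize:", seconds > PARALLEL_THRESHOLD_S)
```

The median is used rather than the mean so that a single slow call (for example the first one) does not dominate the result.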
Windows Users
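See {Zhihu: a step-by-step beginner tutorial}
It covers:
- loading Python
- building a virtual environment
- batch-installing the required packages
There is also a second part, {Job Submission}:
It focuses on submitting multiple jobs automatically.
So far I have successfully connected to Graham.
According to the official documentation, there seems to be a recommended data structure.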
Storage&File Management
The official tutorial explains most file operations:
{FAQ}
Visual exploration of Data by SHARCNET: {Youtube}
Highlight:
- df.groupby(...).plot(kind='hist')
- seaborn PCA plotting, etc.
- interactive matplotlib (hide data)
- create your own Python web app with Bokeh
Come back when you need to visualize your data.
free WebDAV from NextCloud
NextCloud is a cloud storage service that is also hosted by Compute Canada. Each user/group has a 100 GB quota.
I personally use it for Zotero. To set up WebDAV, log into Nextcloud, click the settings at the bottom left, copy the WebDAV URL, and paste it into the URL field of the other software. The username and password are the same as your Nextcloud ones.
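The same WebDAV endpoint can also be used from a script. The sketch below only builds the upload request with the standard library and does not send it; the function name, file name, and credentials are placeholders, and the URL is the one quoted in the file transfer notes above.

```python
import base64
import urllib.parse
import urllib.request

# WebDAV endpoint quoted in the file transfer notes above.
WEBDAV_ROOT = "https://nextcloud.computecanada.ca/remote.php/webdav/"


def build_upload_request(remote_name, data, username, password):
    """Build (but do not send) an HTTP PUT request that uploads `data`."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    url = WEBDAV_ROOT + urllib.parse.quote(remote_name)
    request = urllib.request.Request(url, data=data, method="PUT")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Placeholder credentials: replace them with your own Nextcloud login.
    req = build_upload_request("notes.txt", b"hello", "YOUR_USER_NAME", "YOUR_PASSWORD")
    print(req.get_method(), req.full_url)
    # To actually upload, pass the request to urllib.request.urlopen(req).
```

Keeping the build and send steps separate makes it possible to inspect the request before any data leaves the machine.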