Skip to content

vsoch/watchme-mnist

Repository files navigation

Watchme Mnist

This is a watchme repository that shows how easy it is to monitor a task at some frequency using the watchme monitor pid task provided by the psutils set of tasks. Specifically, we are going to:

  1. Start with this sklearn mnist example
  2. Build it into a container, the Dockerfile here served at vanessa/watchme-mnist
  3. Run the container on an HPC cluster with varying amounts of memory, for a training task that takes approximately 20 minutes.

And compare results!

Included

This is a fairly simple analysis in that I could install watchme and then write a few quick scripts, run, and be done!

  • run_job.sh will submit job.sh to the cluster, specifying input parameters and outputs
  • job.sh is submit to different nodes with varying memory, each 5 times
  • data is where output data is written to, including json results files and images from the training.

Usage

1. Setup

Specifically, to install watchme:

$ pip install watchme[all]

You can also clone and install from the master branch directly:

$ git clone https://www.github.com/vsoch/watchme
cd watchme
pip install .[all] --user

And then I created a watcher folder (this repo).

$ watchme create watchme-mnist

We aren't going to be using .git as a temporal database, but it's still handy to use watchme to create the repo for us :)

2. Mnist on the Sherlock Cluster

This was the script job.sh submit via run_job.sh and we first export some variables to the environment to be added to our data:

# Add variables for host, cpu, etc.
export WATCHMEENV_HOSTNAME=$(hostname)
export WATCHMEENV_NPROC=$(nproc)
export WATCHMEENV_MAXMEMORY=${mem}

and the command to use watchme looks like this. We are going to run the model and record every 20 seconds. The output will be piped into a json file, and the script is given the name of a png file (in the same directory) to save a plot to. This should take 20-30 mins.

watchme monitor --name $name-$iter --seconds 20 singularity run docker://vanessa/watchme-mnist ${output}.png > ${output}.json

The above command is submit in a simple loop in run_job.sh, notice how we define iter, and mem based on the loops:

for iter in 1 2 3 4 5; do
    for mem in 4 6 8 12 16 18 24 32 64 128; do
        output="${outdir}/${name}-iter${iter}-${mem}gb"
        echo "sbatch --mem=${mem}GB job.sh ${mem} ${iter} ${name} ${output}"            
        sbatch --mem=${mem}GB job.sh "${mem}" "${iter}" "${name}" ${output}
    done
done

The results were each written directly to files in data (not using git as a temporal database).

About

An example using the watchme terminal monitor to record resources used during mnist training

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages