Setting up a Python Virtual Environment in /vol/bitbucket


1. Introduction


In the past, many students have caused themselves difficulties by hitting the QUOTA LIMITS on their DoC home directory. Your default quota has two separate limits (both raised this academic year):

DISK SPACE: 12GB [was 8GB until 1st Oct]
NUMBER OF FILES: 60,000 [was 40,000 until 1st Oct]
Please see our file storage quota guide for more information about disk quotas.

Of the two types of quota limit, the one that seems to cause the most difficulty is the number of files quota, because many modern software packages routinely create tens of thousands of files when you first run them (even though only a few of them have been locally modified).

BTW, many ask: why do we HAVE a limit on the number of files? The answer is that many filesystem operations (such as traversing a filesystem and reading the contents of every file, for backups for example) take time roughly proportional to the total number of files traversed, rather than the total amount of data read. (Of course the total amount of data read does matter, but the point is that a specific amount of data (1GB, say) in a single file can take several orders of magnitude less time to read than the same amount of data scattered across 10,000 files.)

We have already seen a handful of students hitting the 60,000 files limit, less than a fortnight after we raised it. Some are existing users who have presumably managed to create at least 20,000 files in a fortnight, and others are brand new users (first years or MScs) who have managed to create almost 60,000 files in a fortnight!

To check your disk quota, on any Linux machine in DoC (e.g. a shell server or a lab machine), use:

quota -v

To check where you are using the most disk space and the largest number of files, we have provided two simple scripts for you:

/vol/linux/bin/usage
/vol/linux/bin/nfiles

Both scripts should be run in a terminal window from your DoC home directory (or a subdirectory of it). Each analyses the "size" of all "immediate children" (i.e. files and sub-directories present in the current directory) and produces a list of those immediate children sorted by size. usage uses "disk space" as its definition of "size", whereas nfiles uses "number of files and directories inside" as its definition of "size".
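
To make the "number of files and directories inside" idea concrete, here is a rough sketch of that kind of count using only standard tools. It is not the real nfiles script: the function name count_children and its output format are assumptions made for illustration.

```shell
# count_children DIR (hypothetical helper, not the real nfiles):
# for each immediate child of DIR, print how many files and directories
# it contains (counting the child itself), largest first.
count_children() {
    dir="${1:-.}"
    # Include hidden entries such as .vscode-server, which are easy to overlook
    for child in "$dir"/* "$dir"/.[!.]*; do
        [ -e "$child" ] || continue   # unmatched glob or dangling symlink
        [ -L "$child" ] && continue   # do not count through symlinks
        n=$(find "$child" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "${child##*/}"
    done | sort -rn
}
```

Running count_children ~ would list the children of your home directory with the most entries first. File names containing newlines would be miscounted, which is acceptable for a quick estimate.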


2. What common packages tend to cause quota (number of files) problems in DoC?

The two most common packages that we have observed causing number of files quota problems in DoC are:

  • Microsoft's VSCode.
  • Locally installed Python packages.

3. What can you do to prevent quota problems?

Obviously, storing less material in your DoC home dir helps:-)

There is another shared filesystem in DoC that you can write data into: it's called /vol/bitbucket. It's much larger than your home directory, it has no quota limits for individuals, and it is not backed up.

So it's great for storing thousands of REPRODUCIBLE files, e.g. datasets that you download from the Internet, Python packages in virtual environments, etc. All we ask is that users don't fill it up overall: you can check how full it is via df -h /vol/bitbucket and we generate a daily report of the "biggest users" in the file /vol/bitbucket/USAGE.TXT.

Please note that /vol/bitbucket is not backed up. So please make sure that you keep copies of anything original to you (e.g. source code you write, the results you get) in your home directory.

OK, so how can /vol/bitbucket help us avoid quota problems in DoC?

We're not sure what to do about VSCode right now. Our advice is "don't use it", but no one seems to want to take that advice! We sometimes see people with both ~/.vscode and ~/.vscode-server directories. These are created by two different versions of VSCode that you may have tried, so please pick one and delete the other. It may be possible, after installing VSCode in your DoC home directory, to move it to /vol/bitbucket/$USER and then make a symlink back into your DoC home directory, but we haven't tested that well enough to be able to recommend it to you right now.
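
For anyone who wants to experiment with the move-and-symlink idea anyway, the mechanics might look like the sketch below. As stated above, this is untested with VSCode and is not a recommendation. The helper name relocate_dir is hypothetical, and the sketch assumes VSCode is not running while the directory is moved.

```shell
# relocate_dir SRC DEST_PARENT (hypothetical helper):
# move directory SRC into DEST_PARENT and leave a symlink at SRC pointing to it.
relocate_dir() {
    src="${1%/}"                          # drop any trailing slash
    dest="$2/$(basename "$src")"
    # Refuse if SRC is missing or already a symlink, or if the target exists
    [ -d "$src" ] && [ ! -L "$src" ] || { echo "skip: $src is not a real directory" >&2; return 1; }
    [ ! -e "$dest" ] || { echo "skip: $dest already exists" >&2; return 1; }
    mkdir -p "$2" && mv "$src" "$dest" && ln -s "$dest" "$src"
}
```

A possible invocation would be relocate_dir ~/.vscode-server "/vol/bitbucket/$USER". Remember that anything moved this way is no longer backed up.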

When it comes to Python packages, we advise that INSTEAD of installing Python packages directly into your DoC home directory, you create a Python virtual environment on /vol/bitbucket and install the packages into that. You can create several virtual environments if you like, potentially one for each Python-based project you work on that needs a different combination of Python packages.

4. Creating a Python Virtual Environment in /vol/bitbucket

Step 1.

Make a directory with your username on /vol/bitbucket, if it doesn't already exist: mkdir /vol/bitbucket/$USER

Why? We request that you only store material inside /vol/bitbucket in a subdirectory named after your College username: we reserve the right to SUMMARILY DELETE any random files and directories stored in the top level directory of /vol/bitbucket.

So here you're creating your own properly-named subdirectory in /vol/bitbucket, and from now on you should only create things inside that directory.

Step 2.

Create the virtual environment (or VE), setting a useful shell variable first. Here I've chosen to call the new Python venv "myenv"; obviously you can change this to another name if you like:

export PENV=/vol/bitbucket/${USER}/myenv
python3 -m virtualenv $PENV
ls -al $PENV
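
If the virtualenv module turns out not to be installed on the machine you are using, Python's standard-library venv module creates an equivalent environment. The sketch below tries virtualenv first, as in the step above, and falls back to venv. The helper name make_env is hypothetical, and whether either module is available on a given machine is an assumption.

```shell
# make_env DIR (hypothetical helper): create a virtual environment in DIR,
# preferring virtualenv and falling back to the standard-library venv module.
make_env() {
    if python3 -c 'import virtualenv' 2>/dev/null; then
        python3 -m virtualenv "$1"
    else
        python3 -m venv "$1"
    fi
}
```

With PENV set as above, make_env "$PENV" would take the place of the python3 -m virtualenv line.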

Step 3.
 

Activate your VE:
source $PENV/bin/activate
which pip
[should say: /vol/bitbucket/${USER}/myenv/bin/pip]
which python
[should say: /vol/bitbucket/${USER}/myenv/bin/python]
which python3
[should say: /vol/bitbucket/${USER}/myenv/bin/python3]
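
Another way to confirm activation, for example from inside a script, is to look at the VIRTUAL_ENV variable that the activate script sets. This is a minimal sketch: the helper name env_is_active is hypothetical, and it assumes the path stored in VIRTUAL_ENV is spelled exactly like the one you pass in (no symlink resolution is attempted).

```shell
# env_is_active DIR (hypothetical helper):
# succeed only if the virtual environment in DIR is the currently active one.
env_is_active() {
    [ -n "${VIRTUAL_ENV:-}" ] && [ "$VIRTUAL_ENV" = "$1" ]
}
```

After sourcing the activate script, env_is_active "$PENV" && echo "myenv is active" should print the message; after running deactivate it should print nothing.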

Step 4.


Having activated your VE, all Python packages that you install will be installed inside your VE. For example if you wanted to install Jupyter Lab:

pip install jupyterlab
which jupyter-lab
[should say: /vol/bitbucket/${USER}/myenv/bin/jupyter-lab]

Alternatively, you can prepare a package requirements file, traditionally called requirements.txt, which specifies a list of packages (optionally with specific version constraints). Having prepared that file, you can install all those required packages inside your VE via:

pip install -r requirements.txt
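
As an illustration of what such a file can contain, the sketch below writes a small requirements.txt into the current directory. The packages and version constraints are examples only, not a recommended set.

```shell
# Write an example requirements file; package names and pins are illustrative.
cat > requirements.txt <<'EOF'
jupyterlab
numpy>=1.24
requests==2.31.0
EOF
```

A bare name installs the latest available version, >= sets a minimum version, and == pins an exact version.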


Step 5.


Note that each time you log in to a specific lab machine (either locally onsite, or remotely via 2-hop ssh: first to a shell server and then to that lab machine), you'll need to redo the activate stage:

source /vol/bitbucket/${USER}/myenv/bin/activate


If activating it each time is a pain, you can automate this by adding that command to the end of your ~/.bash_profile in your DoC home dir, so that the environment is automatically activated whenever you log in or ssh in.
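
One way to add the command without creating duplicates on repeated runs is sketched below. The helper name add_activation is hypothetical. It writes a guarded line that only activates the environment if its activate script exists, and the guard is an addition of this sketch rather than part of the instructions above.

```shell
# add_activation PROFILE ENV_DIR (hypothetical helper):
# append a guarded activation line to PROFILE unless it is already there.
add_activation() {
    profile="$1"
    line="[ -f \"$2/bin/activate\" ] && . \"$2/bin/activate\""
    # -F: fixed string, -x: whole-line match, so reruns add nothing new
    grep -qxF "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
}
```

For example, add_activation ~/.bash_profile "/vol/bitbucket/$USER/myenv" appends a single line. The guard means that logging in still works cleanly on a machine where the environment cannot be found.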

If you ever want to deactivate the environment for the current session, just type 'deactivate'. If you later decide that automatically activating the environment was a bad idea, edit your ~/.bash_profile and comment out the activation line you added, then log out and log in again.