Managing GPUBox

For the command syntax and a description of the tables produced by the subcommands used in this section, see the Command Reference.

Difference between gpubox and agpubox commands

GPUBox Client consists of two commands: agpubox and gpubox. The main difference between them is that gpubox is for using the GPUBox infrastructure, while agpubox is for administering it.

A user with regular credentials (group 1) does not have access to the agpubox command. Every attempt to use it returns a message stating that the user is not authorized:

$ agpubox list
Bad Request
User is not authorized!

agpubox command on GPUBox Client for Windows

Every command described in this section also applies to GPUBox Client for Windows. To use the agpubox command on Windows, the administrator can launch the agpubox.bat script from the Client's installation directory or from the Start menu, and then issue the commands according to the instructions in this section.

Get basic information

Verifying OServer's status

The simplest way to check OServer's availability is to use the ping subcommand of the gpubox or agpubox commands.

If OServer is available under the indicated HTTP/HTTPS address, the OK message informs the user about it.

$ gpubox ping https://203.0.113.1:8082
OK

If there is no OServer under the given HTTP/HTTPS address or if there are problems with connectivity, the user will be notified about it with an appropriate message:

$ gpubox ping https://203.0.113.2:8082
***ERROR: Connection to OServer failed
OServer unavailable
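
The check above can be scripted. The following is a minimal sketch, assuming that a successful ping prints a line consisting only of OK, as in the sample session. The function names are hypothetical and not part of GPUBox.

```python
import subprocess


def oserver_available(ping_output: str) -> bool:
    """Return True when the ping output contains a line that is exactly "OK".

    Assumption: success is reported as in the sample session above.
    """
    return any(line.strip() == "OK" for line in ping_output.splitlines())


def ping(address: str, command: str = "gpubox") -> bool:
    """Run `<command> ping <address>` and interpret what it prints."""
    result = subprocess.run(
        [command, "ping", address],
        capture_output=True,
        text=True,
        check=False,  # a failed ping is an expected outcome, not an exception
    )
    return oserver_available(result.stdout)
```

Calling ping("https://203.0.113.1:8082") requires GPUBox Client to be installed; oserver_available can be tried on captured output without it.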

Information about current account (whoami)

Like a regular user, an administrator can check the information about the currently used account with the w|whoami subcommand of the agpubox command. Compared to gpubox w|whoami, agpubox w|whoami displays more detailed information about the account:
$ agpubox whoami
ID                 :3
UserID             :bob
Username           :Bob
Valid from         :2013-05-30 14:19:33
Valid to           :2999-12-13 00:59:59
Tries to login     :0
Last success login :2013-05-30 16:03:57
Last failed login  :0001-01-01 01:01:01
Enabled            :yes
Group ID           :0
Max GPU            :4
Connected to       :http://203.0.113.1:8081
Comment            :GPUBox administrator
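
Output in this "name :value" layout is easy to turn into a dictionary. The sketch below assumes that every line has the field name before the first colon, as in the sample. The function name is hypothetical, and treating Group ID 0 as the administrator group is an assumption based only on the sample account and on the mention of group 1 for regular users.

```python
def parse_key_value_output(text: str) -> dict[str, str]:
    """Split each line at its first colon into a field name and a value."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# A few lines copied from the sample whoami output above.
SAMPLE = """\
ID                 :3
UserID             :bob
Last success login :2013-05-30 16:03:57
Group ID           :0
Connected to       :http://203.0.113.1:8081
"""

account = parse_key_value_output(SAMPLE)
# Assumption: Group ID 0 marks an administrator account.
looks_like_admin = account["Group ID"] == "0"
```

Splitting at the first colon only keeps values such as timestamps and URLs intact.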

OServer's statistics

An administrator can get the most basic statistics regarding OServer and Monitor1 by sending the agpubox st|stats subcommand. The output for this command contains:
  • OServer's version
  • Local time at which OServer was started
  • Number of requests processed since OServer was started
  • OServer's start time as UTC epoch time in seconds
  • Local time at which the first Monitor1 data appeared
  • Number of Monitor1 records
Here is an example of what the output for the agpubox stats subcommand can look like:
$ agpubox stats
OServer
--------------------
Version            :0.8
Start time         :2013-05-24 13:24:35
Epoch time (secs)  :1369920275
Requests processed :472210

Monitor1
--------------------
Start time         :2013-05-30 13:24:40
Records            :30903
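
The Epoch time field can be converted to a readable date. This is a small sketch, assuming the value is a standard Unix timestamp (seconds since 1970-01-01 UTC). The converted value may differ from the Start time line, which is given in OServer's local time, and the sample values are illustrative.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> datetime:
    """Interpret a Unix timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Value copied from the "Epoch time (secs)" line of the sample output.
started = epoch_to_utc(1369920275)
print(started.strftime("%Y-%m-%d %H:%M:%S UTC"))  # prints 2013-05-30 13:24:35 UTC
```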

Managing GPU devices

List device types

The agpubox ld|listdevices subcommand lists the device types together with the statuses in which they are currently working. The listing gives an administrator a brief overview of how many devices of each type are available or in use in the GPUBox infrastructure.
$ agpubox listdevices
+----+----------------+---------------+-------+----------------+
|DID |GPU name        |Memory (GB)    |Status |Number of GPUs  |
+----+----------------+---------------+-------+----------------+
|1   |GeForce GTX 690 |2              |FREE   |4               |
|1   |GeForce GTX 690 |2              |SHARED |8               |
|2   |GeForce GTX 680 |2              |SHARED |12              |
|3   |GeForce GTX 780 |3              |SHARED |9               |
|3   |GeForce GTX 780 |3              |STOPPED|1               |
+----+----------------+---------------+-------+----------------+
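
To get totals per status from such a listing, the table can be parsed in a script. The sketch below assumes the ASCII layout shown above (rows starting with "|", borders starting with "+"). The helper names are hypothetical.

```python
from collections import defaultdict


def parse_table(text: str) -> list[dict[str, str]]:
    """Turn the '|'-delimited rows of an ASCII table into dicts keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skips the +----+ border lines
    ]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# The sample listdevices output from above.
SAMPLE = """\
+----+----------------+---------------+-------+----------------+
|DID |GPU name        |Memory (GB)    |Status |Number of GPUs  |
+----+----------------+---------------+-------+----------------+
|1   |GeForce GTX 690 |2              |FREE   |4               |
|1   |GeForce GTX 690 |2              |SHARED |8               |
|2   |GeForce GTX 680 |2              |SHARED |12              |
|3   |GeForce GTX 780 |3              |SHARED |9               |
|3   |GeForce GTX 780 |3              |STOPPED|1               |
+----+----------------+---------------+-------+----------------+
"""

totals = defaultdict(int)
for row in parse_table(SAMPLE):
    totals[row["Status"]] += int(row["Number of GPUs"])

print(dict(totals))  # prints {'FREE': 4, 'SHARED': 29, 'STOPPED': 1}
```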
            

List free GPUs

The agpubox f|free subcommand provides brief information about all of the GPUs that are available for allocation which means that they are either:
  • in the FREE status
  • in the SHARED status, with the number of users concurrently using the GPU not exceeding the value of the oserver_max_users_per_gpu parameter in the OServer configuration
$ agpubox free
+----+----------------+---------------+----------------+
|DID |GPU name        |Memory (GB)    |Available GPUs  |
+----+----------------+---------------+----------------+
|1   |GeForce GTX 690 |2.00           |12              |
|2   |GeForce GTX 680 |2.00           |10              |
|3   |GeForce GTX 780 |3.00           |9               |
+----+----------------+---------------+----------------+
            

Please note that the agpubox free and gpubox free commands cannot be used interchangeably. The first one displays all of the GPUs in the infrastructure that are open for new allocations, while the second one lists only the GPUs that can be used by the administrator issuing it. Both commands list the same GPUs only when the administrator does not have any GPUs allocated.

List all GPUs

An administrator can get a list of all of the GPUs available within the GPUBox infrastructure in two ways:
  • agpubox g|gpus
  • agpubox lg|listgpu
Both subcommands list all of the GPUs connected to OServer, but they provide different types of information. The agpubox g|gpus subcommand focuses on how each GPU is used in the GPUBox infrastructure (e.g. how many users are connected to it and what its status is). The GPUs are sorted by General ID (GID).
$ agpubox gpus
+---+-------------------+-------------+---------+-------------+-----+--------------------+       
|GID|GPU name           |PCI          |GPUServer|Status       |Users|Since               |
+---+-------------------+-------------+---------+-------------+-----+--------------------+
|1  |GeForce GTX 690    |710B:01:00.0 |GPU11    |EXCLUSIVE-PEN|3    |2013-12-22 18:34:41 |
|2  |GeForce GTX 690    |710B:02:00.0 |GPU11    |SHARED       |5    |2013-12-22 18:34:41 |
|3  |GeForce GTX 690    |710B:03:00.0 |GPU11    |SHARED       |4    |2013-12-22 18:34:41 |
|4  |GeForce GTX 690    |710B:04:00.0 |GPU11    |SHARED       |4    |2013-12-22 18:34:41 |
|5  |GeForce GTX 690    |710C:01:00.0 |GPU12    |SHARED       |2    |2013-12-22 18:34:48 |
|6  |GeForce GTX 690    |710C:02:00.0 |GPU12    |SHARED       |2    |2013-12-22 18:34:48 |
|7  |GeForce GTX 690    |710C:03:00.0 |GPU12    |FREE         |0    |2013-12-22 18:34:48 |
|8  |GeForce GTX 690    |710C:04:00.0 |GPU12    |EXCLUSIVE    |1    |2013-12-22 18:34:54 |
|...|...                |...          |...      |...          |...  |...                 |
            
The agpubox lg|listgpu subcommand is more of a monitoring tool providing such information as the GPU temperature, fan speed or available memory. The GPUs are sorted by the GPUServer from which they are provided.
$ agpubox listgpu
+---+---------+------+---------------+---------+-------------+-----------+----------------+------------+
|GID|GPUServer|Device|GPU name       |Temp. (C)|Fan speed (%)|Memory (MB)|Free memory (MB)|Kernel limit|
+---+---------+------+---------------+---------+-------------+-----------+----------------+------------+
|17 |GPU10    |0     |GeForce GTX 690|47       |34           |2047       |1968            |no          |
|6  |GPU10    |1     |GeForce GTX 690|57       |41           |2047       |1826            |no          |
|7  |GPU10    |2     |GeForce GTX 690|62       |44           |2047       |1828            |yes         |
|8  |GPU10    |3     |GeForce GTX 690|55       |39           |2047       |1830            |no          |
|9  |GPU11    |0     |GeForce GTX 680|41       |30           |2047       |1974            |no          |
|10 |GPU11    |1     |GeForce GTX 680|42       |30           |2047       |1975            |no          |
|...|...      |...   |...            |...      |...          |...        |...             |...         |
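
Because listgpu is described as a monitoring tool, its output lends itself to simple automated checks. The sketch below flags GPUs at or above a temperature limit. The limit of 60 degrees and all function names are hypothetical, and the parser assumes the table layout shown above.

```python
def parse_rows(table_text: str) -> list[dict[str, str]]:
    """Return the data rows of an ASCII table as dicts keyed by column header."""
    cells = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_text.splitlines()
        if line.strip().startswith("|")
    ]
    if not cells:
        return []
    header, *body = cells
    return [dict(zip(header, row)) for row in body]


TEMP_LIMIT_C = 60  # hypothetical threshold, not a GPUBox setting


def hot_gpus(table_text: str, limit: int = TEMP_LIMIT_C) -> list[int]:
    """List the GIDs whose reported temperature is at or above the limit."""
    return [
        int(row["GID"])
        for row in parse_rows(table_text)
        # rows such as the "..." continuation line are skipped here
        if row["Temp. (C)"].isdigit() and int(row["Temp. (C)"]) >= limit
    ]


# Three rows copied from the sample listgpu output above.
SAMPLE = """\
|GID|GPUServer|Device|GPU name       |Temp. (C)|Fan speed (%)|Memory (MB)|Free memory (MB)|Kernel limit|
|17 |GPU10    |0     |GeForce GTX 690|47       |34           |2047       |1968            |no          |
|7  |GPU10    |2     |GeForce GTX 690|62       |44           |2047       |1828            |yes         |
|...|...      |...   |...            |...      |...          |...        |...             |...         |
"""

print(hot_gpus(SAMPLE))  # prints [7]
```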
            

List GPU allocations

The agpubox l|list subcommand provides detailed information about the GPU allocations currently active in the managed GPUBox infrastructure. The records in the generated table are sorted by the General ID (GID) of the allocated GPU. The table enables an administrator to see:
  • which user performed the allocation
  • which GPU was allocated
  • the PCI address of the allocated GPU
  • which GPUServer is involved
  • the IP address of the Client used to allocate the GPU
  • the status of the allocated GPU
  • since when the allocation has been active
$ agpubox list
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|GID|LID|UserID |User name   |GPU name            |PCI          |IP            |GPUServer|Status   |Since               |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|6  |1  |bob    |Bob         |GeForce GTX 780     |710B:04:00.0 |203.0.113.150 |GPU11    |SHARED   |2013-12-22 11:44:52 |
|7  |2  |bob    |Bob         |GeForce GTX 680     |710B:07:00.0 |203.0.113.150 |GPU11    |SHARED   |2013-12-22 11:41:16 |
|8  |1  |mary   |Mary        |GeForce GTX 690     |7112:04:00.0 |203.0.113.152 |GPU18    |EXCLUSIVE|2013-12-22 11:40:56 |
|8  |1  |john   |John        |GeForce GTX 690     |7112:07:00.0 |203.0.113.155 |GPU18    |SHARED   |2013-12-22 11:41:05 |
|9  |2  |john   |John        |GeForce GTX 780     |7112:08:00.0 |203.0.113.155 |GPU18    |STOPPED  |2013-12-22 11:43:59 |
|...|...|...    |...         |...                 |...          |...           |...      |...      |...                 |
            
The list subcommand can be used with a parameter indicating a particular user. In this case only that user's allocations will be listed:
$ agpubox list bob
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|GID|LID|UserID |User name   |GPU name            |PCI          |IP            |GPUServer|Status   |Since               |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|6  |1  |bob    |Bob         |GeForce GTX 780     |710B:04:00.0 |203.0.113.150 |GPU11    |SHARED   |2013-12-22 11:44:52 |
|7  |2  |bob    |Bob         |GeForce GTX 680     |710B:07:00.0 |203.0.113.150 |GPU11    |SHARED   |2013-12-22 11:41:16 |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
            

Drop GPU on behalf of user

Although an administrator in the GPUBox infrastructure cannot allocate GPUs on behalf of users, the administrator can remove their allocations at any time with the agpubox u|udrop subcommand. The udrop subcommand has to be used with two parameters:
agpubox u|udrop <userid> <GID|all>
<userid> indicates on behalf of which user the GPU will be dropped.

agpubox u|udrop is the only subcommand that directly affects the user's allocations. There is no possibility for an administrator to allocate the GPUs on behalf of another user.

The udrop subcommand removes either a chosen allocation or all of a user's allocations. Accordingly, the second parameter <GID|all> is either the General ID of the GPU that the administrator wants to drop, or all if all of the user's allocations are to be dropped. For example, let's say that user john has two GPUs allocated:
$ agpubox list john
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|GID|LID|UserID |User name   |GPU name            |PCI          |IP            |GPUServer|Status   |Since               |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|8  |1  |john   |John        |GeForce GTX 690     |7112:07:00.0 |203.0.113.155 |GPU18    |SHARED   |2013-12-22 11:41:05 |
|9  |2  |john   |John        |GeForce GTX 780     |7112:08:00.0 |203.0.113.155 |GPU18    |STOPPED  |2013-12-22 11:43:59 |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
            
An administrator can drop the GPU with GID 9 by sending the command:
$ agpubox udrop john 9
GPUs dropped

$ agpubox list john
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|GID|LID|UserID |User name   |GPU name            |PCI          |IP            |GPUServer|Status   |Since               |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|8  |1  |john   |John        |GeForce GTX 690     |7112:07:00.0 |203.0.113.155 |GPU18    |SHARED   |2013-12-22 11:41:05 |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
            
Or, in order to remove all of John's allocations regardless of General IDs:
$ agpubox udrop john all
GPUs dropped
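
When udrop is called from a script, the two parameters can be validated before the command is run. This is a minimal sketch built around the syntax agpubox udrop <userid> <GID|all> described above. The function name is hypothetical, and the assumption that GIDs are positive integers is based only on the sample tables.

```python
def udrop_command(userid: str, target) -> list[str]:
    """Build the argument list for `agpubox udrop <userid> <GID|all>`."""
    if not userid:
        raise ValueError("userid must not be empty")
    if target == "all":
        gid_argument = "all"
    elif isinstance(target, int) and not isinstance(target, bool) and target > 0:
        # Assumption: GIDs are positive integers, as in the sample tables.
        gid_argument = str(target)
    else:
        raise ValueError("target must be a positive GID or the word 'all'")
    return ["agpubox", "udrop", userid, gid_argument]


print(udrop_command("john", 9))      # prints ['agpubox', 'udrop', 'john', '9']
print(udrop_command("john", "all"))  # prints ['agpubox', 'udrop', 'john', 'all']
```

The returned list can be passed to subprocess.run on a machine where GPUBox Client is installed and an administrator is logged in.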

Remove GPU from infrastructure

Every GPU that currently has no active users can be removed from the infrastructure with the agpubox d|drop subcommand followed by a parameter indicating the General IDs of the GPUs that are to be removed from the GPUBox infrastructure.

Let us consider a situation in which there are six GPUs in the infrastructure, three of which are currently allocated:

$ agpubox gpus
+---+-------------------+-------------+---------+-------------+-----+--------------------+       
|GID|GPU name           |PCI          |GPUServer|Status       |Users|Since               |
+---+-------------------+-------------+---------+-------------+-----+--------------------+
|11 |GeForce GTX 690    |710B:01:00.0 |GPU11    |EXCLUSIVE-PEN|3    |2013-12-22 18:34:41 |
|12 |GeForce GTX 690    |710B:02:00.0 |GPU11    |SHARED       |5    |2013-12-22 18:34:41 |
|13 |GeForce GTX 690    |710B:03:00.0 |GPU11    |SHARED       |4    |2013-12-22 18:34:41 |
|14 |GeForce GTX 690    |710B:04:00.0 |GPU11    |FREE         |0    |2013-12-22 18:34:41 |
|15 |GeForce GTX 690    |710C:01:00.0 |GPU12    |FREE         |0    |2013-12-22 18:34:48 |
|16 |GeForce GTX 690    |710C:02:00.0 |GPU12    |OFF          |0    |2013-12-22 18:34:48 |
+---+-------------------+-------------+---------+-------------+-----+--------------------+

An administrator cannot drop the GPUs with GIDs 11, 12 and 13 as long as they are allocated by users:

$ agpubox drop 11-13
The following GPUs were not dropped: 11 - GPU is in EXCLUSIVE or SHARED status
12 - GPU is in EXCLUSIVE or SHARED status
13 - GPU is in EXCLUSIVE or SHARED status

The GPUs with GIDs 14, 15 and 16, on the other hand, can be seamlessly removed from the infrastructure:

$ agpubox drop 14-16
The following GPUs were successfully dropped: 14 , 15 , 16

If some of the indicated GPUs can be dropped and some cannot, the agpubox drop subcommand reports the result for each group:

$ agpubox drop 11,12,13,14,15,16
The following GPUs were successfully dropped: 14 , 15 , 16
The following GPUs were not dropped: 11 - GPU is in EXCLUSIVE or SHARED status
12 - GPU is in EXCLUSIVE or SHARED status
13 - GPU is in EXCLUSIVE or SHARED status
            

Note that the agpubox d|drop operation cannot be undone from the command line! In order to restore the GPUs removed from the infrastructure, the GPUServer providing them has to be restarted.
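
A script that removes several GPUs at once may need to know which GIDs were actually dropped. The sketch below parses the two result messages exactly as they appear in the samples above. The function name is hypothetical, and the message wording is assumed to be stable.

```python
import re

DROPPED_PREFIX = "The following GPUs were successfully dropped:"
NOT_DROPPED_PREFIX = "The following GPUs were not dropped:"
REASON_LINE = re.compile(r"^\s*(\d+)\s*-\s*(.+?)\s*$")


def parse_drop_output(text: str) -> tuple[list[int], dict[int, str]]:
    """Return (dropped GIDs, {refused GID: reason}) from the drop output."""
    dropped: list[int] = []
    refused: dict[int, str] = {}
    for line in text.splitlines():
        if line.startswith(DROPPED_PREFIX):
            gids = line[len(DROPPED_PREFIX):].split(",")
            dropped.extend(int(gid) for gid in gids if gid.strip())
            continue
        if line.startswith(NOT_DROPPED_PREFIX):
            # The first refused GPU shares a line with the prefix.
            line = line[len(NOT_DROPPED_PREFIX):]
        match = REASON_LINE.match(line)
        if match:
            refused[int(match.group(1))] = match.group(2)
    return dropped, refused


# The mixed-result sample output from above.
SAMPLE = """\
The following GPUs were successfully dropped: 14 , 15 , 16
The following GPUs were not dropped: 11 - GPU is in EXCLUSIVE or SHARED status
12 - GPU is in EXCLUSIVE or SHARED status
13 - GPU is in EXCLUSIVE or SHARED status
"""

dropped, refused = parse_drop_output(SAMPLE)
print(dropped)          # prints [14, 15, 16]
print(sorted(refused))  # prints [11, 12, 13]
```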

Manage GPU statuses

In the GPUBox infrastructure, an administrator can see the GPUs in one of nine possible statuses which, for a better understanding, can be divided into three groups:
  • Basic statuses: depend on users' behavior or an administrator's decision
  • Pending statuses: transitional statuses that switch to one of the basic statuses as soon as the required condition is met
  • Circumstantial statuses: result from changes in the GPUBox infrastructure (starting/stopping GPUServers or connectivity issues)

    Basic statuses:
    FREE
      • The GPU is ready for allocations and is currently not allocated to any user.
      • The FREE GPUs are visible to the users sending the gpubox free command.
      • An administrator can switch a FREE GPU to the SHARED status so it cannot be allocated in the EXCLUSIVE mode.
    SHARED
      • The GPU can be used by multiple users simultaneously.
      • As long as the number of users does not exceed the value of the oserver_max_users_per_gpu parameter in the OServer configuration, the GPU is open for further allocations.
      • The SHARED GPU cannot be removed from the infrastructure with the agpubox drop subcommand.
    EXCLUSIVE
      • The GPU is used by a single user and cannot be allocated by the others.
      • An administrator can switch the GPU in the EXCLUSIVE status to SHARED anytime.
      • The GPU in the EXCLUSIVE status cannot be removed from the infrastructure with the agpubox drop subcommand.
    OFF
      • The GPU is not used, has been turned off by an administrator, and is closed for new allocations until it is switched back to FREE or SHARED.

    Pending statuses:
    EXCLUSIVE-PENDING If the GPU has been used in the SHARED mode by more than one user and an administrator has requested to switch the GPU status to EXCLUSIVE, the GPU will be visible in the transitional EXCLUSIVE-PENDING status until there is only one user connected to that particular GPU. As soon as this condition is satisfied, the GPU switches to the EXCLUSIVE status.
    OFF-PENDING If the GPU is in use (in SHARED or EXCLUSIVE mode) and an administrator has requested to switch the GPU status to OFF, the GPU will be visible in the transitional OFF-PENDING status until the last user is disconnected. As soon as this condition is satisfied, the GPU switches to the OFF status.

    Circumstantial statuses:
    STOPPED GPUServer providing the GPU has been stopped. The GPU cannot be used for computation but as soon as the GPUServer reconnects to OServer the GPU will be restored with its previous status.
    BROKEN Connection with GPUServer providing the GPU has been lost and it cannot be used until the GPUServer reconnects to OServer.
    INIT The GPU is being initialized and is not ready to be used yet.
    The basic command allowing an administrator to view and manage the GPU statuses is agpubox s|status. Used without any parameters, the subcommand produces the same output as agpubox g|gpus:
    $ agpubox status
    +---+-------------------+-------------+---------+-------------+-----+--------------+       
    |GID|GPU name           |PCI          |GPUServer|Status       |Users|Since         |
    +---+-------------------+-------------+---------+-------------+-----+--------------+
    |1  |GeForce GTX 690    |710B:01:00.0 |GPU11    |EXCLUSIVE-PEN|3    |2013-12-22 ...|
    |2  |GeForce GTX 690    |710B:02:00.0 |GPU11    |SHARED       |5    |2013-12-22 ...|
    |3  |GeForce GTX 690    |710B:03:00.0 |GPU11    |SHARED       |4    |2013-12-22 ...|
    |4  |GeForce GTX 690    |710B:04:00.0 |GPU11    |SHARED       |4    |2013-12-22 ...|
    |5  |GeForce GTX 690    |710C:01:00.0 |GPU12    |SHARED       |2    |2013-12-22 ...|
    |6  |GeForce GTX 690    |710C:02:00.0 |GPU12    |SHARED       |2    |2013-12-22 ...|
    |7  |GeForce GTX 690    |710C:03:00.0 |GPU12    |FREE         |0    |2013-12-22 ...|
    |8  |GeForce GTX 690    |710C:04:00.0 |GPU12    |EXCLUSIVE    |1    |2013-12-22 ...|
    |...|...                |...          |...      |...          |...  |...           |
    The agpubox s|status subcommand identifies GPUs by their General ID (GID). Using the GID of a chosen GPU as a parameter for the s|status subcommand displays only the line containing the information about the indicated GPU:
    $ agpubox status 7
    +---+-------------------+-------------+---------+-------------+-----+--------------+       
    |GID|GPU name           |PCI          |GPUServer|Status       |Users|Since         |
    +---+-------------------+-------------+---------+-------------+-----+--------------+
    |7  |GeForce GTX 690    |710C:03:00.0 |GPU12    |FREE         |0    |2013-12-22 ...|            
    +---+-------------------+-------------+---------+-------------+-----+--------------+
    An administrator can change the status of a chosen GPU to one of the four basic statuses by adding shared, exclusive, off or free after the GID:
    $ agpubox status 7 off
                
    $ agpubox status 7
    +---+-------------------+-------------+---------+-------------+-----+--------------+       
    |GID|GPU name           |PCI          |GPUServer|Status       |Users|Since         |
    +---+-------------------+-------------+---------+-------------+-----+--------------+
    |7  |GeForce GTX 690    |710C:03:00.0 |GPU12    |OFF          |0    |2013-12-22 ...|            
    +---+-------------------+-------------+---------+-------------+-----+--------------+
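
The relation between a requested status and the pending statuses described earlier can be summarised in a few lines. This is only a sketch of the rules stated for EXCLUSIVE-PENDING and OFF-PENDING. The names are hypothetical, and treating FREE and SHARED requests as taking effect immediately is an assumption, because their behaviour on a GPU that is in use is not described here.

```python
from enum import Enum


class GpuStatus(Enum):
    FREE = "FREE"
    SHARED = "SHARED"
    EXCLUSIVE = "EXCLUSIVE"
    OFF = "OFF"
    EXCLUSIVE_PENDING = "EXCLUSIVE-PENDING"
    OFF_PENDING = "OFF-PENDING"
    STOPPED = "STOPPED"
    BROKEN = "BROKEN"
    INIT = "INIT"


def status_after_request(requested: GpuStatus, users: int) -> GpuStatus:
    """Status shown right after an administrator requests a basic status."""
    if requested is GpuStatus.OFF:
        # OFF-PENDING lasts until the last user is disconnected.
        return GpuStatus.OFF_PENDING if users > 0 else GpuStatus.OFF
    if requested is GpuStatus.EXCLUSIVE:
        # EXCLUSIVE-PENDING lasts until only one user remains connected.
        return GpuStatus.EXCLUSIVE_PENDING if users > 1 else GpuStatus.EXCLUSIVE
    if requested in (GpuStatus.FREE, GpuStatus.SHARED):
        return requested  # assumption: no transitional status is involved
    raise ValueError("only the four basic statuses can be requested")


print(status_after_request(GpuStatus.OFF, users=3).value)        # prints OFF-PENDING
print(status_after_request(GpuStatus.EXCLUSIVE, users=1).value)  # prints EXCLUSIVE
```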

    List GPUServers

    An administrator can get an overview of the most important information regarding GPUServers in the GPUBox infrastructure by using the agpubox ls|listserver subcommand.
    $ agpubox listserver
    +---------+-------+------+------------+-----+--------------------+--------+
    |         |Number |      |            |     |                    |        |
    |GPUServer|of GPUs|Users |IP          |Port |Since               |Version |
    +---------+-------+------+------------+-----+--------------------+--------+
    |GPU11    |4      |4     |203.0.103.11|9393 |2013-12-23 07:06:11 |0.8     |
    |GPU12    |4      |4     |203.0.103.12|9393 |2013-12-23 07:06:08 |0.8     |
    |GPU13    |4      |4     |203.0.103.13|9393 |2013-12-23 07:06:10 |0.8     |
    +---------+-------+------+------------+-----+--------------------+--------+
    The ls|listserver subcommand can be followed by a hostname, in which case only the information about the indicated GPUServer is displayed:
    $ agpubox listserver GPU12
    +---------+-------+------+------------+-----+--------------------+--------+
    |         |Number |      |            |     |                    |        |
    |GPUServer|of GPUs|Users |IP          |Port |Since               |Version |
    +---------+-------+------+------------+-----+--------------------+--------+
    |GPU12    |4      |4     |203.0.103.12|9393 |2013-12-23 07:06:08 |0.8     |
    +---------+-------+------+------------+-----+--------------------+--------+

    Change/display configuration parameters

    The agpubox oc|osconfig and agpubox gc|gsconfig subcommands enable an administrator to conveniently display the parameters that make up the OServer and GPUServer configuration, and to change some of them without editing the configuration files and restarting OServer or GPUServer.

    Configure OServer with agpubox command

    Sending the agpubox oc|osconfig subcommand generates a listing of the OServer parameters and their values:
    $ agpubox osconfig
    Read-only parameters
    ------------------------------------------------------------
    oserver_accounting_config = ""
    oserver_accounting_plugin = "defaultAccounting"
    oserver_db_path = "/home/gpubox/gpubox-oserver/log/oserver-20130527-104720.db"
    oserver_installfiles_path = "/home/gpubox/gpubox-oserver/install"
    oserver_log_path = "/home/gpubox/gpubox-oserver/log/oserver-20130527-104720.log"
    oserver_rest_binds = "0.0.0.0:8081,8082s"
    oserver_security_config = "/home/gpubox/gpubox-oserver/log/security-20130527-104720.db"
    oserver_security_plugin = "security1"
    oserver_ssl_cert_path = "/home/gpubox/gpubox-oserver/oserver/etc/ssl_cert_example.pem"
    oserver_webfiles_path = "/home/gpubox/gpubox-oserver/webui"
    
    Read-write parameters
    ------------------------------------------------------------
    oserver_allocation_option = "free"
    oserver_gpu_count_info = "no"
    oserver_log_level = "TRACE"
    oserver_max_gpu_per_user = 200
    oserver_max_gpu_per_user_per_ip = 200
    oserver_max_users_per_gpu = 5
    As you can see, the parameters are grouped by writability. The Read-only parameters can be changed only by editing the configuration file and restarting OServer. The Read-write parameters can be changed from GPUBox Web Console or from the command line, by passing the parameter name and its new value after the agpubox oc|osconfig subcommand. Giving only the parameter name displays its current value:
    $ agpubox osconfig oserver_log_level
    Read-write parameters
    ------------------------------------------------------------
    oserver_log_level = "TRACE"
    
    $ agpubox osconfig oserver_log_level NOTICE
    
    $ agpubox osconfig oserver_log_level
    Read-write parameters
    ------------------------------------------------------------
    oserver_log_level = "NOTICE"
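
The listing format is regular enough to be read by a script, for example to check whether a parameter can be changed without a restart. The sketch below assumes the layout shown above: a heading ending in "parameters", a separator line, then name = value lines. The function name is hypothetical.

```python
def parse_config_listing(text: str) -> dict[str, dict[str, str]]:
    """Group `name = value` lines under the section heading that precedes them."""
    sections: dict[str, dict[str, str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.endswith("parameters"):
            current = sections.setdefault(line, {})
        elif " = " in line and current is not None:
            name, _, value = line.partition(" = ")
            current[name] = value.strip('"')  # values are kept as strings
    return sections


# A shortened copy of the sample osconfig output above.
SAMPLE = """\
Read-only parameters
------------------------------------------------------------
oserver_rest_binds = "0.0.0.0:8081,8082s"
oserver_security_plugin = "security1"

Read-write parameters
------------------------------------------------------------
oserver_log_level = "TRACE"
oserver_max_users_per_gpu = 5
"""

config = parse_config_listing(SAMPLE)
writable = "oserver_log_level" in config["Read-write parameters"]
print(writable)  # prints True
```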

    Configure GPUServer with agpubox command

    The agpubox gc|gsconfig subcommand followed by the GPUServer name works analogously to the oc|osconfig subcommand and generates the list of configuration parameters of the indicated GPUServer, grouped by writability:
    $ agpubox gsconfig GPU14
    Read-only parameters
    ------------------------------------------------------------
    gpuserver_bind_ip = "203.0.113.14"
    gpuserver_bind_port = 9393
    gpuserver_gpus = "all"
    gpuserver_log_path = "/usr/local/gpubox-gpuserver/log/gpuserver-20130527-104720.log"
    gpuserver_rest_bind = "203.0.113.14:8080"
    gpuserver_rest_oserver_address = "http://203.0.113.14:8081/"
    
    Read-write parameters
    ------------------------------------------------------------
    gpuserver_infiniband_enabled = "no"
    gpuserver_log_level = "TRACE"
    In order to change the Read-only parameters, the GPUServer configuration file has to be edited manually and GPUServer has to be restarted. The Read-write parameters can be changed from the command line by indicating the parameter and passing its new value after the agpubox gc|gsconfig subcommand. Omitting the new value displays only the current value of the indicated parameter:
    $ agpubox gsconfig GPU14 gpuserver_infiniband_enabled
    Read-write parameters
    ------------------------------------------------------------
    gpuserver_infiniband_enabled = "no"
    
    $ agpubox gsconfig GPU14 gpuserver_infiniband_enabled yes
    
    $ agpubox gsconfig GPU14 gpuserver_infiniband_enabled
    Read-write parameters
    ------------------------------------------------------------
    gpuserver_infiniband_enabled = "yes"

    Verifying versions of components

    In order to enable the GPUBox infrastructure to run properly, each of its components (OServer, GPUServers and Clients) should have exactly the same version.

    OServer version

    The OServer version can be verified in four ways:
    • by checking the OSVC-IC-000 message among the first messages in the OServer log after startup:
      [2013-04-27 10:48:32.250672] <INFO> OSVC-IC-000 Version: 0.8
    • by using the -v parameter of the OServer binary (regardless of whether it is currently running):
      $ ~/gpubox-oserver/bin/oserver -v
      Version: 0.8
    • with the agpubox st|stats subcommand:
      $ agpubox stats
      OServer
      --------------------
      Version            :0.8
      Start time         :2013-05-24 13:24:35
      Epoch time (secs)  :1369920275
      Requests processed :472210

      Monitor1
      --------------------
      Start time         :2013-05-30 13:24:40
      Records            :30903
    • by checking the OServer information on the left panel in GPUBox Console

    GPUServer version

    The GPUServer version can be checked by:
    • sending the agpubox listserver command and checking the Version column:
      $ agpubox listserver
      +---------+-------+------+------------+-----+--------------------+--------+
      |         |Number |      |            |     |                    |        |
      |GPUServer|of GPUs|Users |IP          |Port |Since               |Version |
      +---------+-------+------+------------+-----+--------------------+--------+
      |GPU11    |4      |4     |203.0.103.11|9393 |2013-12-23 07:06:11 |0.8     |
      |GPU12    |4      |4     |203.0.103.12|9393 |2013-12-23 07:06:08 |0.8     |
      |GPU13    |4      |4     |203.0.103.13|9393 |2013-12-23 07:06:10 |0.8     |
      +---------+-------+------+------------+-----+--------------------+--------+
    • finding the GBSC-GI-600 message in the log of each GPUServer:
      [2013-06-27 10:47:53.865208]  GBSC-GI-600 [P:5751] Version: 0.8
    • using the -v parameter on the binary of each GPUServer (regardless of whether it is currently running):
      $ ~/gpubox-gpuserver/bin/gpuserver -v
      Version: 0.8

    Client version

    It is important that both the gpubox and agpubox commands are of the same version. To check this, a user or an administrator can use the v|version subcommand:
    $ gpubox version
    Version: 0.8
    $ agpubox version
    Version: 0.8
    An administrator can also check the version of a connected Client by going to the GPUServers / Running processes section in GPUBox Console and opening the Process details (in the Action column). The version of the Client responsible for a particular process will be shown in Version: ....
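
Since all components should report exactly the same version, the values collected with the methods above can be compared automatically. This is a minimal sketch. The function names are hypothetical, and the version 0.7 in the example is made up to show a mismatch.

```python
def parse_version_line(output: str) -> str:
    """Extract the value from a 'Version: <value>' line, as printed by -v or version."""
    for line in output.splitlines():
        if line.strip().startswith("Version:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no 'Version:' line found")


def mismatched_versions(versions: dict[str, str], reference: str = "OServer") -> dict[str, str]:
    """Return the components whose version differs from the reference component."""
    expected = versions[reference]
    return {name: version for name, version in versions.items() if version != expected}


collected = {
    "OServer": parse_version_line("Version: 0.8\n"),
    "GPU11": "0.8",
    "Client": "0.7",  # hypothetical outdated Client
}
print(mismatched_versions(collected))  # prints {'Client': '0.7'}
```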

    Browse logs

    If the appropriate parameters (oserver_log_path on OServer and gpuserver_log_path on GPUServer) are correctly defined and the logs are stored in a file instead of being redirected to STDOUT (standard output), an administrator can retrieve selected parts of them with the ol|olog (OServer's log) and gl|glog (GPUServer's log) subcommands of the agpubox command.

    The agpubox gl|glog and agpubox ol|olog subcommands use the same parameters and work analogously. Apart from the need to indicate the name of the GPUServer when using agpubox gl|glog, both subcommands follow the same rules.

    In order to retrieve exactly the messages that you need, you can apply several parameters to the agpubox ol|olog or agpubox gl|glog subcommands:
    -s|--severity=<severity> Define the message severity. Unlike the logging level parameters in the OServer and GPUServer configurations, the -s|--severity parameter displays only the messages with the indicated severity. If not used, messages with any severity will be displayed.
    -l|--limit=<number_of_records> Define the number of records that will be displayed. A positive number displays the first records and a negative number displays the last ones. If not used, the last 100 messages (-100) will be displayed by default.
    -b|--begindate=<"YYYY-MM-DD hh:mm:ss"> Define the beginning of the timespan from which the messages will be displayed. If not used, the command will start from the beginning of the log. Cannot be used with a negative value of -l|--limit.
    -e|--enddate=<"YYYY-MM-DD hh:mm:ss"> Define the end of the timespan from which the messages will be displayed. If not used, the command will stop at the end of the log. Cannot be used with a positive value of -l|--limit.
    The parameters can be combined except for three limitations:
    • -b|--begindate cannot be used with a negative value of -l|--limit
    • -e|--enddate cannot be used with a positive value of -l|--limit
    • -l|--limit cannot be defined when both -b|--begindate and -e|--enddate parameters are used
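
The three limitations above can be checked before a command is sent. The following sketch encodes them literally. The function name is hypothetical, and the dates are passed as plain strings without being validated.

```python
def check_log_options(limit=None, begindate=None, enddate=None) -> list[str]:
    """Return a list of violated rules; an empty list means the combination is allowed."""
    problems = []
    if begindate is not None and limit is not None and limit < 0:
        problems.append("--begindate cannot be used with a negative --limit")
    if enddate is not None and limit is not None and limit > 0:
        problems.append("--enddate cannot be used with a positive --limit")
    if begindate is not None and enddate is not None and limit is not None:
        problems.append("--limit cannot be set when both --begindate and --enddate are used")
    return problems


print(check_log_options(limit=-100))  # prints []
print(check_log_options(limit=-10, begindate="2013-12-16 15:45:00"))
# prints ['--begindate cannot be used with a negative --limit']
```

A combination can break more than one rule at once, so the function reports every violated rule instead of stopping at the first one.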

    Using the ol|olog and gl|glog subcommands an administrator can quickly retrieve the desired parts of logs.

    An administrator can request the log entries that appeared within a specified timespan. For example, let us display all of the messages (regardless of severity) that appeared between "2013-12-16 15:45:00" and "2013-12-16 15:50:00":

    $ agpubox olog -b="2013-12-16 15:45:00" -e="2013-12-16 15:50:00"
    [2013-12-16 15:45:00.530693] <DEBUG   > ROSC-CW-35X [0000011] HTTP Status Code 201
    [2013-12-16 15:45:00.751914] <INFO    > ROSC-HH-500 [0000012] New session [139844997080832]: get:http://203.0.113.1:8081/gpu/allocation, IP: 203.0.113.151:59771
    [2013-12-16 15:45:00.754146] <INFO    > ROSC-AX-51A [0000012] Access granted to superuser gpubox, GPUBox administrator
    [2013-12-16 15:45:01.140152] <DEBUG   > ROSC-CW-35X [0000012] HTTP Status Code 200
    [2013-12-16 15:45:01.250337] <INFO    > ROSC-HH-500 [0000013] New session [139844976101120]: get:http://203.0.113.1:8081/auth/user, IP: 203.0.113.151:37300
    ...
    [2013-12-16 15:50:00.700638] <DEBUG   > ROSC-CW-35X [0000025] HTTP Status Code 200
    [2013-12-16 15:50:00.952343] <INFO    > ROSC-HH-500 [0000026] New session [139843902359296]: get:http://203.0.113.1:8081/auth/user, IP: 203.0.113.152:37304
    [2013-12-16 15:50:34.954404] <INFO    > ROSC-AX-51A [0000026] Access granted to superuser gpubox, GPUBox administrator
    [2013-12-16 15:50:34.954573] <DEBUG   > ROSC-CW-35X [0000026] HTTP Status Code 200
    [2013-12-16 15:50:34.956234] <INFO    > ROSC-HH-500 [0000027] New session [139843891869440]: get:http://203.0.113.1:8081/log/-2013-12-16+15:44:00/2013-12-16+15:45:00, IP: 203.0.113.152:37305
    [2013-12-16 15:50:34.958090] <INFO    > ROSC-AX-51A [0000027] Access granted to superuser gpubox, GPUBox administrator
        

    Applying the -s|--severity parameter can help retrieve the right messages quickly. For example, suppose one of the GPUServers, say GPU14, cannot connect to OServer. An administrator can display all of the CRITICAL messages from the problematic GPUServer's log to determine the possible cause:

    $ agpubox glog GPU14 -s=critical
    [2013-12-16 13:39:01.340265] <CRITICAL> GBSC-SC-91Z [10021] Fatal response from OServer, GPUServer is not authorized.
    [2013-12-16 13:42:55.340265] <CRITICAL> GBSC-SC-91Z [10126] Fatal response from OServer, GPUServer is not authorized.
    [2013-12-16 16:21:12.340265] <CRITICAL> GBSC-SC-91Z [10415] Fatal response from OServer, GPUServer is not authorized.
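
    When the retrieved messages need further processing, the line layout shown in the listings above can be split into fields. The following is a minimal sketch that assumes every line follows the pattern visible in the examples (timestamp, severity, message code, a bracketed number, message text). The names LogRecord and parse_log_line are hypothetical, and the field name ident is a placeholder because the meaning of the bracketed number is not stated in this section.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Pattern inferred from the sample listings only; not an official format definition.
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+"        # [2013-12-16 13:39:01.340265]
    r"<(?P<severity>[^>]*)>\s+"      # <CRITICAL> or padded forms such as <DEBUG   >
    r"(?P<code>\S+)\s+"              # GBSC-SC-91Z
    r"\[(?P<ident>[^\]]+)\]\s+"      # [10021]
    r"(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    timestamp: datetime
    severity: str
    code: str
    ident: str
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None for lines that do not match (for example '...')."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogRecord(
        timestamp=datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        severity=match.group("severity").strip().upper(),
        code=match.group("code"),
        ident=match.group("ident"),
        message=match.group("message"),
    )


SAMPLE = """\
[2013-12-16 13:39:01.340265] <CRITICAL> GBSC-SC-91Z [10021] Fatal response from OServer, GPUServer is not authorized.
[2013-12-16 13:42:55.340265] <CRITICAL> GBSC-SC-91Z [10126] Fatal response from OServer, GPUServer is not authorized.
[2013-12-16 16:21:12.340265] <CRITICAL> GBSC-SC-91Z [10415] Fatal response from OServer, GPUServer is not authorized.
"""

records = [r for r in map(parse_log_line, SAMPLE.splitlines()) if r is not None]

if __name__ == "__main__":
    for record in records:
        print(record.timestamp, record.severity, record.code, record.message)
```

    With the records in structured form, it is easy to see, for instance, when the first and the last occurrence of the same message code were logged.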

    ol|olog and gl|glog can operate only on the log files indicated by the oserver_log_path and gpuserver_log_path parameters of the currently running OServer and GPUServers. This also means that if an administrator sets the logging level too low (for example NOTICE), it will not be possible to retrieve messages of higher severity levels, because they were never written to the log.

    Monitor and manage processes

    The agpubox lp|listprocess subcommand is the main tool for monitoring the processes that use the GPUs in the GPUBox infrastructure. Running it generates a list of the processes active on particular GPUServers.
    $ agpubox listprocess
    +---------+------+------+----------+------+----------+-----------+----+----+-----+-------+--------+-----+-------+-------+---+
    |GPUServer|Type  |UserID|PID       |Active|Since     |Client's IP|Net |Rtn |Rtn/s|Rx     |Rx/s    |Tx   |Tx/s   |Mem(MB)|Dev|
    +---------+------+------+----------+------+----------+-----------+----+----+-----+-------+--------+-----+-------+-------+---+
    |GPU14    |main  |john  |10677     |yes   |2013-10...|203.0.10...|TCP |44  |0    |47.7MB |0B/s    |592B |0B/s   |487.9MB|0  |
    |GPU14    |thread|john  |10677     |yes   |2013-10...|203.0.10...|IB  |7.0K|721  |266.8KB|13.8KB/s|3.7MB|8.8KB/s|487.9MB|3  |
    |GPU11    |main  |mary  |30010     |yes   |2013-10...|203.0.10...|TCP |4.4K|617  |212.2KB|11.5KB/s|4.1MB|7.5KB/s|943.9MB|0  |
    |...      |...   |...   |...       |...   |...       |...        |... |... |...  |...    |...     |...  |...    |...    |...|
    As an option, an administrator can indicate by name a particular GPUServer from which the processes will be listed:
    $ agpubox listprocess GPU14
    +---------+------+------+----------+------+----------+-----------+----+----+-----+-------+--------+-----+-------+-------+---+
    |GPUServer|Type  |UserID|PID       |Active|Since     |Client's IP|Net |Rtn |Rtn/s|Rx     |Rx/s    |Tx   |Tx/s   |Mem(MB)|Dev|
    +---------+------+------+----------+------+----------+-----------+----+----+-----+-------+--------+-----+-------+-------+---+
    |GPU14    |main  |john  |10677     |yes   |2013-10...|203.0.10...|TCP |44  |0    |47.7MB |0B/s    |592B |0B/s   |487.9MB|0  |
    |GPU14    |thread|john  |10677     |yes   |2013-10...|203.0.10...|IB  |7.0K|721  |266.8KB|13.8KB/s|3.7MB|8.8KB/s|487.9MB|3  |
    +---------+------+------+----------+------+----------+-----------+----+----+-----+-------+--------+-----+-------+-------+---+
    There are two types of processes: main and thread.
    • the main process represents the Client's process on the GPUServer and is responsible for creating threads on that GPUServer. It does not participate in data transfer, so it always stays on the TCP protocol
    • each thread process active on the GPUs creates its own separate communication channel
    When the subserver responsible for a particular process is closed inappropriately, the process will be visible as inactive (no in the Active column). Inactive processes disappear from the list when the GPUServer is restarted, but an administrator can also remove an inactive process without a restart by using the agpubox rp|rprocess subcommand and indicating the GPUServer name and PID:
    $ agpubox listprocess
    +---------+------+------+----------+------+----------+--------------+----+-----+-------+--------+-----+-------+-------+---+
    |GPUServer|Type  |UserID|PID       |Active|Since     |Connected from|Rtn |Rtn/s|Rx     |Rx/s    |Tx   |Tx/s   |Mem    |Dev|
    +---------+------+------+----------+------+----------+--------------+----+-----+-------+--------+-----+-------+-------+---+
    |GPU14    |main  |john  |10677     |no    |2013-10...|203.0.103.155 |44  |0    |47.7MB |0B/s    |592B |0B/s   |487.9MB|3  |
    |GPU14    |thread|john  |10677     |no    |2013-10...|203.0.103.155 |7.0K|721  |266.8KB|13.8KB/s|3.7MB|8.8KB/s|487.9MB|3  |
    |GPU11    |main  |mary  |30010     |yes   |2013-10...|203.0.103.152 |4.4K|617  |212.2KB|11.5KB/s|4.1MB|7.5KB/s|943.9MB|1  |
    |...      |...   |...   |...       |...   |...       |...           |... |...  |...    |...     |...  |...    |...    |...|
    
    $ agpubox rprocess GPU14 10677
    Accepted
    $ agpubox listprocess
    +---------+------+------+----------+------+----------+--------------+----+-----+-------+--------+-----+-------+-------+---+
    |GPUServer|Type  |UserID|PID       |Active|Since     |Connected from|Rtn |Rtn/s|Rx     |Rx/s    |Tx   |Tx/s   |Mem    |Dev|
    +---------+------+------+----------+------+----------+--------------+----+-----+-------+--------+-----+-------+-------+---+
    |GPU11    |main  |mary  |30010     |yes   |2013-10...|203.0.103.152 |4.4K|617  |212.2KB|11.5KB/s|4.1MB|7.5KB/s|943.9MB|1  |
    |...      |...   |...   |...       |...   |...       |...           |... |...  |...    |...     |...  |...    |...    |...|
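
    The clean-up shown above can be prepared by a script that reads the listprocess table and proposes one rprocess call per inactive GPUServer and PID pair. The sketch below only builds the command strings and does not execute anything. The helper names parse_table and rprocess_commands are hypothetical, and the parser assumes the table layout shown in the listings of this section.

```python
from typing import Dict, List

SAMPLE = """\
+---------+------+------+----------+------+----------+--------------+----+-----+-------+--------+-----+-------+-------+---+
|GPUServer|Type  |UserID|PID       |Active|Since     |Connected from|Rtn |Rtn/s|Rx     |Rx/s    |Tx   |Tx/s   |Mem    |Dev|
+---------+------+------+----------+------+----------+--------------+----+-----+-------+--------+-----+-------+-------+---+
|GPU14    |main  |john  |10677     |no    |2013-10...|203.0.103.155 |44  |0    |47.7MB |0B/s    |592B |0B/s   |487.9MB|3  |
|GPU14    |thread|john  |10677     |no    |2013-10...|203.0.103.155 |7.0K|721  |266.8KB|13.8KB/s|3.7MB|8.8KB/s|487.9MB|3  |
|GPU11    |main  |mary  |30010     |yes   |2013-10...|203.0.103.152 |4.4K|617  |212.2KB|11.5KB/s|4.1MB|7.5KB/s|943.9MB|1  |
|...      |...   |...   |...       |...   |...       |...           |... |...  |...    |...     |...  |...    |...    |...|
"""


def parse_table(text: str) -> List[Dict[str, str]]:
    """Turn an ASCII table into a list of dicts keyed by the header cells."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip border lines, prompts and blank lines.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
        elif len(cells) == len(header) and cells[0] != "...":
            rows.append(dict(zip(header, cells)))
    return rows


def rprocess_commands(rows: List[Dict[str, str]]) -> List[str]:
    """Build one 'agpubox rprocess <GPUServer> <PID>' string per inactive process."""
    seen = []
    for row in rows:
        if row.get("Active") == "no":
            key = (row["GPUServer"], row["PID"])
            if key not in seen:  # main and thread entries share the same PID
                seen.append(key)
    return ["agpubox rprocess {} {}".format(server, pid) for server, pid in seen]


if __name__ == "__main__":
    for command in rprocess_commands(parse_table(SAMPLE)):
        print(command)
```

    The pairs are de-duplicated because, as the listing above shows, a single rprocess call for GPU14 and PID 10677 removed both the main and the thread entry.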

    Depending on the architecture of the application, some processes might not actually be running at the moment, but will be visible in the agpubox lp|listprocess table with 0B/s in the Tx/s column.

    Maintenance of orphan allocations

    When the GPUServer providing the GPUs allocated by a user is closed inappropriately, the GPUs change to the BROKEN status. The corrupted allocations are visible in the table generated by the agpubox l|list subcommand:
    $ agpubox list
    +---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
    |GID|LID|UserID |User name   |GPU name            |PCI          |IP            |GPUServer|Status   |Since               |
    +---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
    |6  |1  |bob    |Bob         |GeForce GTX 780     |710B:04:00.0 |203.0.113.150 |GPU11    |BROKEN   |2013-12-22 11:44:52 |
    |7  |2  |bob    |Bob         |GeForce GTX 680     |710B:07:00.0 |203.0.113.150 |GPU11    |BROKEN   |2013-12-22 11:41:16 |
    |8  |1  |mary   |Mary        |GeForce GTX 690     |7112:04:00.0 |203.0.113.152 |GPU18    |EXCLUSIVE|2013-12-22 11:40:56 |
    +---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
    The "orphan" allocations can be fixed and used by the user again when the GPUServer is restored and reconnects to OServer. Until then, however, the corrupted allocations count against the user's pool of allocations, which is limited by the Max GPU attribute. There are three ways of removing the orphan allocations from the GPUBox infrastructure:
    • the user can drop the BROKEN GPUs on his own with the regular gpubox d|drop subcommand
    • an administrator can drop the GPUs on behalf of the user with the agpubox udrop subcommand
    • an administrator can send the agpubox m|maint o|orphans command
    Using the agpubox m|maint o|orphans command is the quickest way to get rid of all of the orphan allocations that occurred in the GPUBox infrastructure. An administrator can use it to make sure that no corrupted allocation records remain, for example after a crash of the system on which a GPUServer is installed. For the example listing above, the agpubox maint orphans command removes the two corrupted allocations belonging to user bob:
    $ agpubox maint orphans
    *** Removed 2 orphans
    
    $ agpubox list
    +---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
    |GID|LID|UserID |User name   |GPU name            |PCI          |IP            |GPUServer|Status   |Since               |
    +---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
    |8  |1  |mary   |Mary        |GeForce GTX 690     |7112:04:00.0 |203.0.113.152 |GPU18    |EXCLUSIVE|2013-12-22 11:40:56 |
    +---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
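
    Before running the maintenance command, an administrator may want to know how many orphan allocations each user currently holds. A minimal sketch, assuming the output of agpubox list has been captured as text in the layout shown above; the names table_rows and broken_allocations are hypothetical. The Max GPU attribute does not appear in this listing, so comparing against it is left out.

```python
from collections import Counter

SAMPLE_LIST = """\
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|GID|LID|UserID |User name   |GPU name            |PCI          |IP            |GPUServer|Status   |Since               |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
|6  |1  |bob    |Bob         |GeForce GTX 780     |710B:04:00.0 |203.0.113.150 |GPU11    |BROKEN   |2013-12-22 11:44:52 |
|7  |2  |bob    |Bob         |GeForce GTX 680     |710B:07:00.0 |203.0.113.150 |GPU11    |BROKEN   |2013-12-22 11:41:16 |
|8  |1  |mary   |Mary        |GeForce GTX 690     |7112:04:00.0 |203.0.113.152 |GPU18    |EXCLUSIVE|2013-12-22 11:40:56 |
+---+---+-------+------------+--------------------+-------------+--------------+---------+---------+--------------------+
"""


def table_rows(text):
    """Yield one dict per data row, keyed by the header cells of the table."""
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # border or non-table line
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
        elif len(cells) == len(header):
            yield dict(zip(header, cells))


def broken_allocations(text):
    """Count allocations in the BROKEN status per UserID."""
    return Counter(row["UserID"] for row in table_rows(text)
                   if row.get("Status") == "BROKEN")


if __name__ == "__main__":
    for user, count in broken_allocations(SAMPLE_LIST).items():
        print(user, count)
```

    For the first listing of this subsection, the count for bob matches the two orphans reported as removed by agpubox maint orphans.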
    

    Removing the orphans generates accounting records but, since the connection to the GPU is broken, the generated records do not contain information about the GPU.