OServer

OServer is the central processing server of the GPUBox infrastructure. It coordinates the operations between the other two components: GPUServer and Client. Users can dynamically allocate and drop virtual GPUs from the centrally managed and monitored pool of shared devices created from the set of GPUServers.

OServer key functions are:

  • managing the GPUBox infrastructure
  • coordinating the allocations of the GPUs across the GPUBox infrastructure
  • managing and coordinating security (users, access, credentials, etc.)
  • monitoring, accounting, and logging users' activities

    Starting OServer on Linux

    Depending on the type of installation used, you can start OServer in two ways:

    • As a system service

      # service oserver start

      If you run the installation with root privileges, the daemon script and the designated user gpubox are installed automatically by default. See OServer Installation for more details.

    • As a user process

      If the OSERVER_CONF environment variable points to the configuration file, or the configuration file is in the default location <OSERVER_INSTALLATION_DIR>/etc/oserver.conf, you can start OServer with a simple command:
       $ oserver
      Otherwise, pass the configuration file with the -c command parameter:
       $ oserver -c /usr/local/gpubox-oserver/etc/oserver.conf
      To daemonize OServer, use the -d option:
       $ oserver -d

    Parameters

    oserver parameters
    oserver -c path_to_config_file -d -u user -v -h
    -c path_to_config_file   Relative or absolute path to the configuration file
    -d                       Daemonize OServer
    -u user                  Start OServer as the indicated system user (root privileges required)
    -v                       Show version
    -h                       Show help

    Return codes

    0  OK
    1  Error related to:
      • being unable to find the configuration file
      • parsing a configuration parameter
      • missing required parameters
      • OServer with the -d option being unable to switch to the daemon mode
       More detailed messages are also displayed in the log.
    2  Forced to quit, the SIGQUIT signal was received
    3  OServer caught an exception
    4  Exception generated by signals such as SIGSEGV or SIGFPE
    5  OServer is already running on the same system
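
    A supervising script can translate these return codes into readable messages. The following is a minimal sketch in Python; the dictionary and the describe_exit_code function are illustrative names and not part of GPUBox, and only the codes and their meanings come from the list above.

```python
# Return codes as documented above. The names below are hypothetical helpers,
# not part of the GPUBox distribution.
OSERVER_EXIT_CODES = {
    0: "OK",
    1: "Configuration or startup error (see the log for details)",
    2: "Forced to quit, the SIGQUIT signal was received",
    3: "OServer caught an exception",
    4: "Exception generated by signals such as SIGSEGV or SIGFPE",
    5: "OServer is already running on the same system",
}


def describe_exit_code(code: int) -> str:
    """Return a human-readable description for an oserver return code."""
    return OSERVER_EXIT_CODES.get(code, f"Unknown return code {code}")


if __name__ == "__main__":
    for code in (0, 2, 5, 42):
        print(code, "->", describe_exit_code(code))
```

    A wrapper could, for instance, pass the returncode attribute of a finished subprocess to describe_exit_code before deciding whether to restart OServer.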

    Environment variables

    OSERVER_CONF Path to the configuration file. The configuration file given as a command line parameter has precedence over the environment variable.
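
    The precedence between the -c parameter, the OSERVER_CONF variable, and the default location can be sketched as follows. This is only an illustration of the documented lookup order, not OServer's actual code; resolve_config_path and its arguments are hypothetical names.

```python
import os
from typing import Mapping, Optional


def resolve_config_path(cli_path: Optional[str],
                        env: Mapping[str, str],
                        install_dir: str) -> str:
    """Pick the configuration file: -c first, then OSERVER_CONF, then default."""
    if cli_path:
        # A file given on the command line has precedence over the variable.
        return cli_path
    env_path = env.get("OSERVER_CONF")
    if env_path:
        return env_path
    # Default location: <OSERVER_INSTALLATION_DIR>/etc/oserver.conf
    return os.path.join(install_dir, "etc", "oserver.conf")


if __name__ == "__main__":
    print(resolve_config_path(None, os.environ, "/usr/local/gpubox-oserver"))
```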

    Configuration

    All of the parameter values must be enclosed in quotation marks except for numbers. The following parameters have values expressed by numbers:

  • oserver_max_gpu_per_user
  • oserver_max_users_per_gpu
  • oserver_max_gpu_per_user_per_ip

    The location of the file containing the OServer settings is given by the OSERVER_CONF environment variable; if the variable is not set, OServer uses the default location <OSERVER_INSTALLATION_DIR>/etc/oserver.conf.

    An example configuration file has the following format:

    oserver_security_plugin = "sqliteSecurity"   
    oserver_security_config = "./data/security.db"                   
    oserver_accounting_plugin = "defaultAccounting"
    oserver_accounting_config = "./data/accounting.db"               
    
    oserver_rest_bind = "127.0.0.1:8081"         
    oserver_ssl_cert_path = "./etc/ssl_cert_example.pem"
    auth_token = "d6e459844365274877ca5949b144f076"
    oserver_db_path = "./data/oserver.db"            
    oserver_log_path = "./data/oserver.log" #filename
    oserver_log_level = "TRACE"   
    oserver_max_gpu_per_user = 5                         
    oserver_max_users_per_gpu = 5                        
    oserver_max_gpu_per_user_per_ip = 10                 
    oserver_gpu_count_info = "yes"                           
    oserver_allocation_option = "free"           
    oserver_webfiles_path = "./webui"
    oserver_installfiles_path = "./install"
    oserver_datafiles_path = "./data"
    
    oserver_license_uid="<LICENSE_UID>"
    oserver_license_key="<LICENSE_KEY>"
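
    To check a configuration file before starting OServer, the key = value layout shown above can be read with a few lines of Python. This is a sketch under assumptions: it accepts quoted strings and bare integers, and it treats a trailing # as a comment because the example uses one; OServer's real parser may accept more or less than this. The name parse_oserver_conf is hypothetical.

```python
import re

# key = "string"   or   key = 123, optionally followed by a # comment.
_CONF_LINE = re.compile(r'^\s*(\w+)\s*=\s*(?:"([^"]*)"|(\d+))\s*(?:#.*)?$')


def parse_oserver_conf(text: str) -> dict:
    """Parse configuration text into a dict; numbers become int, the rest str."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comment-only lines
        match = _CONF_LINE.match(line)
        if match is None:
            raise ValueError(f"Unrecognized configuration line: {line!r}")
        key, quoted, number = match.groups()
        settings[key] = quoted if quoted is not None else int(number)
    return settings


if __name__ == "__main__":
    sample = 'oserver_log_level = "TRACE"\noserver_max_gpu_per_user = 5\n'
    print(parse_oserver_conf(sample))
```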

    Specific parameters are described in the table below.

    Parameter ( - required) Default value Description
    oserver_security_plugin security1 Name of the security plugin. If not given, the default value is used.
    oserver_security_config empty Path to the security database. OServer does not run properly unless this parameter is defined.
    oserver_accounting_plugin defaultAccounting Name of the accounting plugin. If not given, the default value is used.
    oserver_accounting_config empty Path to the file for accounting information. If not given, accounting messages are displayed in the OServer log.
    oserver_rest_bind 127.0.0.1:8081 Binds the RESTful server to the given IP address and port.
    Format: IP:PORT or IP:PORT,PORT,PORTs where:
    IP - any IP address, or 0.0.0.0 to bind on all interfaces
    PORT - any TCP port
    PORTs - any TCP port followed by "s" (HTTPS interface; the oserver_ssl_cert_path parameter is required)

    Examples:
    203.0.113.1:8080
    203.0.113.1:8081s
    203.0.113.1:8080,8081s
    0.0.0.0:8080
    oserver_ssl_cert_path none Path to the SSL certificate file containing the public and private keys.
    auth_token d6e459844365274877ca5949b144f076 The authorization token used by GPUServers to log into OServer.
    oserver_db_path none Path to the devices database. OServer creates a new database at startup if the file does not exist, or recovers the existing data if the indicated file is recognized as a database. OServer does not run properly without this parameter.
    oserver_log_path STDOUT Path to the OServer log file. If no path is given, the parameter is set to STDOUT; messages are then displayed on standard output and cannot be checked in the GPUBox Web Console.
    oserver_log_level TRACE Logging level. One of the following: TRACE, DEBUG, INFO, NOTICE, WARNING, ERROR, or CRITICAL, where TRACE is the most detailed and CRITICAL gives only the most important entries.
    oserver_max_gpu_per_user 5 Maximum number of GPUs that can be allocated to one user. 0 means no limit.
    oserver_max_users_per_gpu 5 Maximum number of users that can share one GPU. 0 means no limit.
    oserver_max_gpu_per_user_per_ip 5 Maximum number of GPUs that can be allocated by a user from the same IP address. 0 means no limit.
    oserver_gpu_count_info no If "yes", users are able to see the number of free GPUs after using the gpubox free subcommand.
    oserver_allocation_option free One of the two available algorithms according to which a user allocates the GPUs in shared mode:

    “free” - allocate a free GPU; when the pool of free GPUs is exhausted, allocate a shared GPU.
    “shared” - allocate a GPU already working in shared mode; if its number of users has reached the value of the oserver_max_users_per_gpu parameter, allocate a free GPU.
    oserver_webfiles_path <OSERVER_INSTALLATION_DIR>/webui Path to the directory containing the GPUBox Web Console files.
    oserver_installfiles_path <OSERVER_INSTALLATION_DIR>/install Path to the directory with installation packages for GPUServer and Client.
    oserver_datafiles_path empty Path to the directory used for storing the accounting files. Must be specified in order to run OServer properly.
    oserver_license_uid none Your license UserID.
    oserver_license_key none Your license key.
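
    The oserver_rest_bind format described in the table (an IP address, a colon, and a comma-separated list of ports, where a trailing "s" marks an HTTPS port) can be validated with a short Python sketch. The names BindSpec and parse_rest_bind are hypothetical, and the sketch assumes IPv4 addresses only, since it splits on the first colon.

```python
from typing import List, NamedTuple


class BindSpec(NamedTuple):
    ip: str
    http_ports: List[int]
    https_ports: List[int]


def parse_rest_bind(value: str) -> BindSpec:
    """Split 'IP:PORT[,PORT...]' into the address and HTTP/HTTPS port lists."""
    ip, sep, port_list = value.partition(":")
    if not sep or not ip or not port_list:
        raise ValueError(f"Expected IP:PORT[,PORT...], got {value!r}")
    http_ports, https_ports = [], []
    for item in port_list.split(","):
        secure = item.endswith("s")  # trailing "s" marks an HTTPS port
        number = int(item[:-1] if secure else item)
        if not 0 < number < 65536:
            raise ValueError(f"Port out of range: {number}")
        (https_ports if secure else http_ports).append(number)
    return BindSpec(ip, http_ports, https_ports)


if __name__ == "__main__":
    print(parse_rest_bind("203.0.113.1:8080,8081s"))
```

    If https_ports is not empty, the oserver_ssl_cert_path parameter must also be set, as noted in the table.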

    Use variables with configuration parameters

    Any non-array string value can be replaced by an environment variable.

    variable = "$environment-variable"

    As an example, OServer's bind address can be replaced by a variable.
    Suppose the system uses the IP over InfiniBand protocol on the ib0 interface. Set the environment variable in the /etc/rc.local file:

    # echo "export OSERVER_BIND=\"`ip a s dev ib0 | grep -oP 'inet\s+\K[^/]+'`:8081,8082s\"" >> /etc/rc.local

    The address will be available in OServer's configuration file:

    oserver_rest_bind="$OSERVER_BIND"
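
    The substitution can be pictured with the following Python sketch. It is an assumption-laden illustration rather than OServer's implementation: it replaces a value only when the whole value is a single $NAME reference, and it raises an error for an unset variable, a case the documentation does not describe. The function name expand_value is hypothetical.

```python
import os
from typing import Mapping


def expand_value(value: str, env: Mapping[str, str]) -> str:
    """Replace a value of the form "$NAME" with env["NAME"]; keep others as-is."""
    if value.startswith("$") and len(value) > 1:
        name = value[1:]
        if name not in env:
            # Assumption: treat an unset variable as an error.
            raise KeyError(f"Environment variable {name} is not set")
        return env[name]
    return value


if __name__ == "__main__":
    print(expand_value("./data/oserver.db", os.environ))
```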

    OServer's behavior when configuration parameters are invalid

    The following table describes OServer's behavior when a parameter's syntax is correct but its value is semantically invalid.

    Parameter Description
    oserver_security_plugin OServer cannot find and load the security plugin and will display the following messages.
    [2013-06-26 17:18:59.515416] <ERROR   > OSVC-PS-80A Cannot load security plugin: security1, libsecurity1.so: cannot open shared object file: No such file or directory
    [2013-06-26 17:18:59.515428] <CRITICAL> OSVC-IC-970 Cannot load security plugin:security1
    [2013-06-26 17:18:59.515486] <CRITICAL> OSRV-OS-976B [21980] OServer received other unexpected exception
    OServer is terminated.
    oserver_security_config Security1 cannot find the security database.
    [2013-06-26 17:21:37.070424] <ERROR   > QS1D-RL-80B [FFFFFFF] unable to open database file
    [2013-06-26 17:21:37.070438] <WARNING > OSVC-PS-710 The security plugin security1 received invalid configuration
    [2013-06-26 17:21:37.070458] <ERROR   > QS1D-VT-81B [FFFFFFF] unable to open database file
    [2013-06-26 17:21:37.070467] <CRITICAL> OSVC-IC-99A Infrastructure token is invalid
    OServer is terminated.
    oserver_accounting_plugin OServer cannot find and load the accounting plugin and will display the following messages.
    [2013-06-26 17:19:33.460079] <CRITICAL> OSVC-PA-80A Cannot load accounting plugin: defaultAccounting, libdefaultAccounting.so: cannot open shared object file: No such file or directory
    [2013-06-26 17:19:33.460091] <CRITICAL> OSVC-IC-98A Cannot load accounting plugin:defaultAccounting
    [2013-06-26 17:19:33.460158] <CRITICAL> OSRV-OS-976B [22076] OServer received other unexpected exception
    oserver_accounting_config
    [2013-06-26 17:20:13.036721] <WARNING > DACC-WA-79A Cannot open output file for accounting: /home/bob/Develop/oserver_data5/accounting.log
    [2013-06-26 17:20:13.036729] <WARNING > OSVC-PA-710 Accounting plugin defaultAccounting received invalid configuration
    Accounting redirected to log
    OServer is terminated.
    oserver_rest_bind When the port is busy, the interface does not exist, or the entire string is incorrect, the OSVC-MN-84A message explains the details.
    [2013-07-21 13:30:23.919725] <ERROR   > OSVC-MN-84A  set_ports_option: cannot bind to 8081: Address already in use
    [2013-07-21 13:30:23.919824] <CRITICAL> OSVC-OR-960 Cannot start RESTful server
    OServer is terminated.
    oserver_ssl_cert_path
    [2013-07-21 13:30:23.919725] <CRITICAL> OSVC-OS-93I Cannot open SSL certificate
    OServer is terminated.
    oserver_db_path
    [2013-06-26 17:16:03.277141] <CRITICAL>  DBHC-CD-960 [140103551608608] Cannot connect to database,  OServer is being terminated: unable to open database file
    [2013-06-26 17:16:03.277189] <CRITICAL>  DBHC-CD-961 [140103551608608] Unexpected database exception occurred while connecting database, Cannot connect to database,  OServer is being terminated: unable to open database file
    [2013-06-26 17:16:03.277213] <CRITICAL>  OSVC-OD-970 Cannot connect to database: /usr/local/gpubox-oserver/data/oserver.db, OServer is being terminated
    [2013-06-26 17:16:03.277231] <CRITICAL>  OSVC-IC-960 Unexpected exception occurred, cannot start OServer
    [2013-06-26 17:16:03.277249] <CRITICAL>  OSRV-OS-976B [21330] OServer received other unexpected exception
    OServer is terminated.
    oserver_log_path

    If the path is invalid, the log is always redirected to standard output.

    [2013-06-26 17:23:58.698152] <WARNING > OSVC-SL-71A Cannot open output file /usr/local/gpubox-oserver/log/oserver.log, log redirect to STDOUT
    OServer continues processing.
    oserver_webfiles_path
    [2013-07-21 13:37:26.839668] <CRITICAL> OSVC-OS-96F Path defined in parameter 'oserver_webfiles_path' does not exist
    OServer is terminated.
    oserver_installfiles_path
    [2013-06-26 17:36:37.555977] <CRITICAL> OSVC-OS-96G Path defined in parameter 'oserver_installfiles_path' does not exist
    OServer is terminated.
    oserver_datafiles_path
    [2013-06-26 17:25:48.484115] <CRITICAL> OSVC-OS-96E Path defined in parameter 'oserver_datafiles_path' does not exist
    OServer is terminated.
    oserver_log_level

    OServer uses default value.

    [2013-02-05 07:08:06.616268] <NOTICE  > OSVC-SL-68A Invalid log level, 'TRACE' will be used instead
    [2013-02-05 07:08:06.616276] <INFO    > OSVC-SL-50A oserver_log_path  = STDOUT
    [2013-02-05 07:08:06.616297] <INFO    > OSVC-SL-50B oserver_log_level = TRACE
    [2013-02-05 07:08:06.616304] <INFO    > OSVC-SL-50C Logging started
    

    OServer continues processing.
    oserver_gpu_count_info

    OServer uses default value.

    [2013-02-05 07:09:49.244041] <INFO    > OSVC-IC-50I oserver_gpu_count_info = no
    
    OServer continues processing.
    oserver_allocation_option

    OServer uses default value.

    [2013-02-05 07:16:33.344573] <INFO    > OSVC-IC-50M oserver_allocation_option = shared
    
    OServer continues processing.
    auth_token

    OServer uses the value passed in the configuration file.

    OServer continues processing.
    oserver_max_gpu_per_user

    OServer uses the value passed in the configuration file.

    OServer continues processing.
    oserver_max_users_per_gpu

    OServer uses the value passed in the configuration file.

    OServer continues processing.
    oserver_max_gpu_per_user_per_ip

    OServer uses the value passed in the configuration file.

    OServer continues processing.

    Start sequence

    While starting, OServer:

  • reads the configuration file
  • initializes the device database on the first start
  • loads the security plugin
  • loads the accounting plugin
  • initializes the security database on the first start
  • starts Monitor1
  • waits for GPUServers to connect:
    • registers or recovers the GPU allocation when GPUServers connect
    • recovers GPUs' allocation when OServer has been restarted

    Messages at start

    At start, OServer displays the values of all the configuration parameters except auth_token, which is signaled only by the OSVC-IC-50H message, and license details. Besides the parameters preview, there are a few messages worth taking a look at:

  • OSVC-SL-50B: the logging level is TRACE; if a less detailed level is set, some important messages won't be displayed.
  • OSVC-IC-000: the version of OServer is 0.8; GPUServer and OServer should have the same version.
  • OSVC-IC-50C: the RESTful server is bound on all available interfaces; HTTP communication is on port 8081 and HTTPS on port 8082.

    [2013-05-03 12:41:21.390250] <INFO    > OSVC-SL-50A oserver_log_path  = STDOUT
    [2013-05-03 12:41:21.390369] <INFO    > OSVC-SL-50B oserver_log_level = TRACE
    [2013-05-03 12:41:21.390444] <INFO    > OSVC-SL-50C Logging started
    [2013-05-03 12:41:21.393367] <INFO    > OSVC-IC-000 Version: 0.8
    [2013-05-03 12:41:21.394010] <INFO    > OSVC-IC-50C oserver_rest_bind = 0.0.0.0:8081,8082s
    [2013-05-03 12:41:21.394087] <INFO    > OSVC-IC-50D oserver_security_plugin = security1
    [2013-05-03 12:41:21.394160] <INFO    > OSVC-IC-50K oserver_security_config = /usr/local/gpubox-oserver/data/security1.db
    [2013-05-03 12:41:21.394232] <INFO    > OSVC-IC-50L oserver_accounting_plugin = defaultAccounting
    [2013-05-03 12:41:21.394310] <INFO    > OSVC-IC-50Q oserver_accounting_config = /usr/local/gpubox-oserver/data/accounting.txt
    [2013-05-03 12:41:21.394401] <INFO    > OSVC-IC-50E oserver_max_gpu_per_user = 10
    [2013-05-03 12:41:21.394533] <INFO    > OSVC-IC-50F oserver_max_gpu_per_user_per_ip = 5
    [2013-05-03 12:41:21.394617] <INFO    > OSVC-IC-50G oserver_max_users_per_gpu = 5
    [2013-05-03 12:41:21.394698] <INFO    > OSVC-IC-50H Infrastructure token presented
    [2013-05-03 12:41:21.394773] <INFO    > OSVC-IC-50I oserver_gpu_count_info = yes
    [2013-05-03 12:41:21.394847] <INFO    > OSVC-IC-50K oserver_webfiles_path = /usr/local/gpubox-oserver/webui
    [2013-05-03 12:41:21.394920] <INFO    > OSVC-IC-50L oserver_installfiles_path = /usr/local/gpubox-oserver/install
    [2013-05-03 12:41:21.394993] <INFO    > OSVC-IC-50M oserver_allocation_option = shared
    [2013-05-03 12:41:21.395066] <INFO    > OSVC-IC-50R oserver_datafiles_path = /usr/local/gpubox-oserver/data/
    [2013-05-03 12:41:21.395158] <INFO    > GBSC-OS-52I oserver_ssl_cert_path = /usr/local/gpubox-oserver/etc/ssl_cert_example.pem
    [2013-05-03 12:41:21.397519] <INFO    > OSVC-OD-510 Connected to database /usr/local/gpubox-oserver/data/oserver.db
    [2013-05-03 12:41:21.797528] <TRACE   > DBHC-CK-030 Database checked successfully.
    [2013-05-03 12:41:21.797654] <INFO    > OSVC-OD-50Z Database configuration has been checked successfully.
        

    At the next stage of OServer's initialization, the security plugin is loaded. In the example listing below, the QS1C-RL-64A message indicates that the security1 plugin is loaded for the first time or that the security database was newly created. The security database is initialized with the gpubox user. For more information, refer to the Security Bootstrapping chapter.

    [2013-05-03 12:41:21.801909] <INFO    > OSVC-PS-500 Plugin 'security1' loaded.
    [2013-05-03 12:41:22.184736] <NOTICE  > QS1C-RL-64A [FFFFFFF] Security database initialized with user 'gpubox'
    [2013-05-03 12:41:23.185179] <INFO    > OSVC-PS-510 Security plugin received configuration.
    [2013-05-03 12:41:23.186316] <NOTICE  > OSVC-IC-69A Infrastructure token belongs to userid: gpubox, username: GPUBox administrator
        

    Afterwards, OServer loads the defaultAccounting plugin and starts Monitor1. Data from Monitor1 is saved in the /usr/local/gpubox-oserver/data/monitor1 directory. The OSVC-OS-600 message signifies that OServer is ready to receive requests.

    [2013-05-03 12:41:23.190961] <INFO    > OSVC-PA-500 Plugin 'defaultAccounting' loaded.
    [2013-05-03 12:41:23.191060] <WARNING > DACC-WA-59A Accounting output file opened successfully: /usr/local/gpubox-oserver/data/accounting.txt
    [2013-05-03 12:41:27.990962] <NOTICE  > OSVC-OS-600 OServer is ready to accept connections.
    [2013-05-03 12:41:27.991686] <INFO    > OSVC-M1-50B Directory /usr/local/gpubox-oserver/data/monitor1 for monitor1 created
    [2013-05-03 12:41:27.991779] <INFO    > OSVC-M1-50A Monitor1 started
        

    In the last part of the listing, GPUServer is connecting to OServer.

  • ROSC-HH-500: GPUServer from the 203.0.113.11 IP address has sent the GPU devices configuration to the http://203.0.113.1:8081/gpu interface.
  • ROSC-AX-51A: GPUServer was authorized as the gpubox user.
  • In the first phase of the GPU registration, all devices are set to the INIT status.
  • DBHC-IG-510: The new GPU device is registered, but it is still in the INIT status and is not ready to use.
  • DBHC-UG-50A: This device is registered for the first time, i.e. it has a name and memory size (expressed in gigabytes) not seen before.
  • DBHC-GS-05A: The device status is changed to FREE and the device is ready to be used.
  • ROSC-CW-35X: Registration of all the devices from GPUServer has been finished successfully.
  • The connection ID (here [0000001], the first connection) is the same in every message of a single request.

    [2013-05-03 12:41:31.154624] <INFO    > ROSC-HH-500 [0000001] New session [140303050716928]: put:http://203.0.113.1:8081/gpu, IP: 203.0.113.11:53715
    [2013-05-03 12:41:31.156987] <INFO    > ROSC-AX-51A [0000001] Access granted to superuser gpubox, GPUBox administrator
    [2013-05-03 12:41:32.546690] <INFO    > DBHC-IG-510 [0000001] Registered GPU (name:GeForce GTX 690 ,device:0, gdomain:C804) from IP:203.0.113.11 ,hostname:GPU4 ,version:0.8
    [2013-05-03 12:41:32.547277] <INFO    > DBHC-UG-50A [0000001] New device GeForce GTX 690 with 2.0GB was registered.
    [2013-05-03 12:41:32.548876] <TRACE   > DBHC-GS-05A [0000001] GPU devices 0 from IP 203.0.113.11 changed status to FREE
    [2013-05-03 12:41:32.549507] <INFO    > DBHC-IG-510 [0000001] Registered GPU (name:GeForce GTX 690 ,device:1, gdomain:C804) from IP:203.0.113.11 ,hostname:GPU4 ,version:0.8
    [2013-05-03 12:41:32.550592] <TRACE   > DBHC-GS-05A [0000001] GPU devices 1 from IP 203.0.113.11 changed status to FREE
    [2013-05-03 12:41:32.551290] <INFO    > DBHC-IG-510 [0000001] Registered GPU (name:GeForce GTX 690 ,device:2, gdomain:C804) from IP:203.0.113.11 ,hostname:GPU4 ,version:0.8
    [2013-05-03 12:41:32.552375] <TRACE   > DBHC-GS-05A [0000001]GPU devices 2 from IP 203.0.113.11 changed status to FREE
    [2013-05-03 12:41:32.553042] <INFO    > DBHC-IG-510 [0000001] Registered GPU (name:GeForce GTX 690 ,device:3, gdomain:C804) from IP:203.0.113.11 ,hostname:GPU4 ,version:0.8
    [2013-05-03 12:41:32.554131] <TRACE   > DBHC-GS-05A [0000001] GPU devices 3 from IP 203.0.113.11 changed status to FREE
    [2013-05-03 12:41:32.554651] <DEBUG   > ROSC-CW-35X [0000001] HTTP Status Code 201
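
    Log lines like the ones in these listings share one layout: a timestamp in square brackets, a level in angle brackets, a message code, an optional bracketed identifier, and the message text. The following Python sketch splits a line into those fields, for example to filter a log by message code. The layout is inferred from the listings only, and LogRecord and parse_log_line are hypothetical names; the bracketed identifier is a connection ID in some messages and a process or thread identifier in others, so the sketch calls it tag.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

_LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s*"                        # [2013-05-03 12:41:31.154624]
    r"<(?P<level>[A-Z]+)\s*>\s*"                     # <INFO    >
    r"(?P<code>[A-Z0-9]+-[A-Z0-9]+-[A-Z0-9]+)\s*"    # ROSC-HH-500
    r"(?:\[(?P<tag>[^\]]+)\]\s*)?"                   # optional [0000001]
    r"(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    timestamp: datetime
    level: str
    code: str
    tag: Optional[str]
    message: str


def parse_log_line(line: str) -> LogRecord:
    """Split one OServer log line into its fields."""
    match = _LOG_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"Not an OServer log line: {line!r}")
    return LogRecord(
        timestamp=datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f"),
        level=match["level"],
        code=match["code"],
        tag=match["tag"],
        message=match["message"],
    )


if __name__ == "__main__":
    sample = ("[2013-05-03 12:41:32.554651] <DEBUG   > ROSC-CW-35X "
              "[0000001] HTTP Status Code 201")
    print(parse_log_line(sample))
```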

    When GPUServer stops, the related allocations are preserved. When OServer stops, every allocation is preserved. During OServer's restart, the allocations are recovered.

    [2013-05-04 12:53:36.075857] <INFO    > ROSC-HH-500 [0000001] New session [139878933186304]: put:https://203.0.113.1:8082/gpu, IP: 203.0.113.11:34582
    [2013-05-04 12:53:36.077092] <INFO    > ROSC-AX-51A [0000001] Access granted to superuser gpubox, GPUBox administrator
    [2013-05-04 12:53:36.458680] <INFO    > OSVC-HB-50A Sent heartbeat to 1 GPUServers
    [2013-05-04 12:53:37.350527] <INFO    > DBHC-RG-50A [0000001] GPU 0 recovered  from GPU1
    [2013-05-04 12:53:37.350836] <TRACE   > DBHC-GS-05A GPU devices 0 from IP 203.0.113.11 changed status to SHARED
    [2013-05-04 12:53:37.351713] <INFO    > DBHC-RG-50A [0000001] GPU 1 recovered  from GPU1
    [2013-05-04 12:53:37.352004] <TRACE   > DBHC-GS-05A GPU devices 1 from IP 203.0.113.11 changed status to SHARED
    [2013-05-04 12:53:37.352390] <DEBUG   > ROSC-CW-35X [0000001] HTTP Status Code 201

    Stopping OServer on Linux

    OServer can be stopped in three ways:

  • Normal

    OServer receives the SIGINT(2) or SIGTERM(15) signal. All running requests are allowed to finish before OServer is closed.

  • Force

    OServer receives the SIGQUIT(3) signal. All running requests are terminated immediately and the entire shutdown sequence is skipped.

    In some cases, such as a locked database, normal mode of stopping OServer might not be enough and force mode has to be used instead.

  • Kill

    The administrator sends the SIGKILL(9) signal. OServer does not have any chance to receive any other signals and the process is terminated by the operating system.

    Commands to stop

    Depending on how OServer was started, there are a few methods of terminating it:

  • Started from the command line as a non-daemon (option -d was not applied): simply press Ctrl+C to send the SIGINT(2) signal and terminate OServer.

  • Started from the command line as a daemon (option -d was applied): OServer as a daemon is disconnected from input/output devices such as the terminal or keyboard, so the only way to stop it is to send a signal from the command line.

    • Normal stop
      $ pkill -2 oserver
      or
      $ pkill -15 oserver
    • Force stop
      $ pkill -3 oserver
    • Kill stop
      $ pkill -9 oserver
  • Started as a service.

    • Normal mode
      # service oserver stop
    • Force mode
      # service oserver force-stop
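
    When OServer is supervised from a script rather than stopped by hand, the same three stop modes can be mapped to signals programmatically. The Python sketch below assumes a Linux system and that the caller already knows the OServer process ID; STOP_SIGNALS and stop_process are hypothetical names, not part of GPUBox.

```python
import os
import signal

# Stop modes described above mapped to the signals OServer reacts to.
STOP_SIGNALS = {
    "normal": signal.SIGTERM,  # 15, running requests are allowed to finish
    "force": signal.SIGQUIT,   # 3, the shutdown sequence is skipped
    "kill": signal.SIGKILL,    # 9, terminated by the operating system
}


def stop_process(pid: int, mode: str = "normal") -> None:
    """Send the signal for the chosen stop mode to the given process ID."""
    os.kill(pid, STOP_SIGNALS[mode])


if __name__ == "__main__":
    for mode, sig in STOP_SIGNALS.items():
        print(f"{mode}: {sig.name} ({int(sig)})")
```

    Trying the normal mode first and escalating only if the process does not exit keeps the shutdown sequence intact whenever possible.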

    Stop sequence

    While stopping, OServer:

  • finishes all the running requests
  • terminates the heartbeat signal; GPUServer reports this in the GBSC-HB-82R message
  • preserves all the GPU allocations; when OServer restarts, all allocations are recovered
  • saves the Monitor1 data

    Messages at stop

  • OSRV-OH-76D: OServer received the SIGINT(2) signal
  • OSRV-OH-76C: Beginning of the OServer stopping sequence
  • OSVC-O1-51A: Data from Monitor1 is saved in the m1_20130503133231-20130504125019 file
  • OSVC-DE-98A: The heartbeat signals to all GPUServers are terminated

    [2013-05-04 12:50:21.788418] <WARNING > OSRV-OH-76D [4953] Received Interrupt(2)
    [2013-05-04 12:50:21.788587] <WARNING > OSRV-OH-77X [17316] OServer has 1 active requests, shutdown pending, please wait...
    [2013-05-04 12:50:21.788605] <WARNING > OSRV-OH-76C [17316] OServer received stop signal (SIGINT or SIGTERM)
    [2013-05-04 12:50:21.788827] <INFO    > OSVC-DE-989 Oserver is shutting down, please wait.
    [2013-05-04 12:50:21.860948] <INFO    > OSVC-O1-51A Monitor1 data offloaded to file /usr/local/gpubox-oserver/data/monitor1/m1_20130503133231-20130504125019
    [2013-05-04 12:50:22.097584] <INFO    > OSVC-DE-98A HeartBeat stopped.
    [2013-05-04 12:50:24.111992] <NOTICE  > OSRV-OH-56C [4953] OServer stopped.

    Accounting and monitoring

    Accounting

    Accounting in the GPUBox infrastructure is served by a plugin.

    Information about the GPUs is saved when:

  • user adds/drops one or more of the GPUs
  • GPUServer is terminated and related GPUs switch into the STOPPED status
  • GPUServer becomes unavailable and related GPUs switch into the BROKEN status
  • Administrator switches the GPU to the OFF status
  • Administrator drops the GPU with the agpubox drop subcommand

    defaultAccounting

    The defaultAccounting plugin provides the most basic form of accounting. Data is saved into the file specified by the oserver_accounting_config configuration parameter. Each GPU's accounting information is saved on a separate line.

    Each line contains the list of accounting labels and their values in the following format:
  • accounting_label (such as userid or username) is followed by a colon
  • accounting_value is followed by a semicolon

    Accounting label Example value Description
    userid gpubox user's account name
    username GPUBox administrator user's name
    user_ip 203.0.113.151 Client's IP address
    time_start 2013-05-05 08:55:54 timestamp indicating when the GPU was allocated
    time_end 2013-05-05 11:14:09 timestamp indicating when the GPU was dropped
    consumed_seconds 8295 difference in seconds between time_end and time_start
    gpu_name GTX 780 the name of the GPU allocated by user
    gpu_server_ip 203.0.113.11 GPUServer's IP address
    gpu_global_domain 710B PCI global domain ID in the hexadecimal format 0xAABB, where AA is the third octet and BB is the last octet of GPUServer's IP address
    gpu_pci_bus_id 1 physical GPUServer's PCI bus ID
    gpu_pci_dev_id 0 physical GPUServer's PCI device ID
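
    The gpu_global_domain value can be derived from the GPUServer's IP address as described in the table: the third and the last octet, each written as two hexadecimal digits. A minimal Python sketch follows; the function name gpu_global_domain is chosen for illustration and is not a GPUBox API.

```python
def gpu_global_domain(server_ip: str) -> str:
    """Return AABB, the hex of the third and last octet of an IPv4 address."""
    octets = [int(part) for part in server_ip.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"Not an IPv4 address: {server_ip!r}")
    return f"{octets[2]:02X}{octets[3]:02X}"


if __name__ == "__main__":
    # 113 is 0x71 and 11 is 0x0B, which gives 710B as in the table.
    print(gpu_global_domain("203.0.113.11"))
```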

    Information about saving the data is signaled by the message:
    [2013-05-05 19:07:53.581599] <INFO    > DACC-WA-50A Accounting saved for user: gpubox, GPUBox administrator    
    When the output file cannot be opened, the data is redirected to the log:
    [2013-05-05 19:19:33.947931] <WARNING > DACC-WA-71A Cannot open output file for accounting: /usr/local/gpubox-oserver/data/accounting.log, save accounting in log
    [2013-05-05 19:19:33.947965] <INFO    > DACC-WA-50B userid:gpubox;username:GPUBox administrator;user_ip:203.0.113.151;time_start:2013-05-05 19:18:45;time_end:2013-05-05 19:19:33;consumed_seconds:48;gpu_name:GeForce GTX 680;gpu_server_ip:203.0.113.12;gpu_global_domain:710C;gpu_pci_bus_id:1;gpu_pci_dev_id:0
    [2013-05-05 19:19:33.948024] <WARNING > DACC-WA-71A Cannot open output file for accounting: /usr/local/gpubox-oserver/data/accounting.log, save accounting in log
    [2013-05-05 19:19:33.948052] <INFO    > DACC-WA-50B userid:gpubox;username:GPUBox administrator;user_ip:203.0.113.151;time_start:2013-05-05 19:18:45;time_end:2013-05-05 19:19:33;consumed_seconds:48;gpu_name:GeForce GTX 680;gpu_server_ip:203.0.113.12;gpu_global_domain:710C;gpu_pci_bus_id:2;gpu_pci_dev_id:0
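
    An accounting record in the label:value; format can be turned into a dictionary for reporting. The Python sketch below splits each field on its first colon only, because timestamp values themselves contain colons; it assumes that values never contain a semicolon. The function names are hypothetical.

```python
from datetime import datetime

_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_accounting_line(line: str) -> dict:
    """Split 'label:value;label:value;...' into a dict of strings."""
    record = {}
    for field in line.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        # Split on the first colon only: timestamps contain colons too.
        label, _, value = field.partition(":")
        record[label.strip()] = value.strip()
    return record


def consumed_seconds(record: dict) -> int:
    """Recompute the difference between time_end and time_start in seconds."""
    start = datetime.strptime(record["time_start"], _TIME_FORMAT)
    end = datetime.strptime(record["time_end"], _TIME_FORMAT)
    return int((end - start).total_seconds())


if __name__ == "__main__":
    sample = ("userid:gpubox;time_start:2013-05-05 08:55:54;"
              "time_end:2013-05-05 11:14:09;consumed_seconds:8295")
    rec = parse_accounting_line(sample)
    print(rec["userid"], consumed_seconds(rec))
```

    Recomputing consumed_seconds from the two timestamps is a simple consistency check when importing accounting data into another system.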

    Monitor1

    Monitor1 is OServer's internal monitoring tool, which collects information about the status of the GPUs, users, and GPUServers in the GPUBox infrastructure. The data represents the number of:

    • GPUServers
    • GPUs in the free status
    • GPUs in the free pending status
    • GPUs in the shared usage mode
    • GPUs in the exclusive usage mode
    • GPUs in the exclusive pending status
    • GPUs in the broken status
    • GPUs in the stopped status
    • GPUs in the off status
    • GPUs in the off pending status
    • GPUs in any status, total number of the GPUs
    • GPUs in an unknown status
    • Users who allocated the GPUs in the exclusive usage mode
    • Users who allocated the GPUs in the shared usage mode
    • Users who allocated the GPUs in any usage mode

    Monitor1 starts in one of the last stages of OServer's initialization.
    When OServer starts for the first time, Monitor1 creates the monitor1 directory inside of the directory set by the oserver_datafiles_path parameter.

    [2013-05-05 19:15:13.842260] <INFO    > OSVC-M1-50B Directory /usr/local/gpubox-oserver/data/monitor1 for monitor1 created    
    [2013-05-05 19:15:13.842359] <INFO    > OSVC-M1-50A Monitor1 started

    Monitor1 is one of the first OServer components to be stopped:

    [2013-05-06 09:01:08.522218] <INFO    > OSVC-O1-51A Monitor1 data offloaded to file /usr/local/gpubox-oserver/data/monitor1/m1_20130505191529-20130506090105

    Access the data

    The GPUBox infrastructure statistics from Monitor1 are collected into the OServer's memory in 10-second intervals and can be retrieved in a few ways.

  • Data offload

    Data from the Monitor1's storage is offloaded into the file with the following name pattern: <OSERVER_INSTALLATION_DIR>/data/monitor1/m1_start-time_end-time where start-time and end-time define the timespan of the offloaded data. Timestamps have the YYYYMMDDhhmmss format.

    • From the command line:
      $ agpubox maint m1o --begindate=[start-time] --enddate=[end-time]
      [start-time] and [end-time] have the following format: "YYYY-MM-DD hh:mm:ss". When [start-time] and [end-time] are not given, all of the data is offloaded and the file name takes the date of the first record and the current timestamp.
    • On one of the charts from the Dashboard of the GPUBox Web Console, click More and then Offload. Data represented on the chart is then offloaded.
    • From the GPUBox Web Console: click OServer and then the Maintenance submenu. There are two ways to offload the data: offloading all data, or setting a time frame and offloading only the data in that timespan.
  • Download CSV file

    On one of the charts on the Dashboard of the GPUBox Web Console, click More and then Download CSV; the data presented in the chart is then downloaded into the m1_[start-time]-[end-time] file, where start-time and end-time define the timespan of the downloaded data. Timestamps have the "YYYYMMDDhhmmss" format.

  • Delete the data

    Monitor1 is not a storage-demanding service; an entire month of data fits in about 25MB of memory. However, because the OServer process can run for a long time and its memory usage grows, unnecessary Monitor1 data can be removed. We highly recommend offloading the data and downloading the CSV file first.

    There are two ways of deleting Monitor1's data:

  • issue the $ agpubox maint mon1delete --dateto="YYYY-MM-DD hh:mm:ss" command
  • from the Maintenance submenu in the GPUBox Web Console.
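
    The offload file name pattern m1_start-time_end-time uses the compact YYYYMMDDhhmmss timestamps, while the agpubox commands take dates in the "YYYY-MM-DD hh:mm:ss" form. The Python sketch below converts between the two, for example to locate the file that covers a given period; it assumes the command-line format shown for the mon1delete subcommand, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Tuple

CLI_FORMAT = "%Y-%m-%d %H:%M:%S"   # format used on the agpubox command line
FILE_FORMAT = "%Y%m%d%H%M%S"       # format used in offload file names


def offload_file_name(start: str, end: str) -> str:
    """Build the m1_<start>-<end> file name from command-line timestamps."""
    first = datetime.strptime(start, CLI_FORMAT)
    last = datetime.strptime(end, CLI_FORMAT)
    if last < first:
        raise ValueError("end-time is earlier than start-time")
    return f"m1_{first.strftime(FILE_FORMAT)}-{last.strftime(FILE_FORMAT)}"


def parse_offload_file_name(name: str) -> Tuple[datetime, datetime]:
    """Recover the covered timespan from an m1_<start>-<end> file name."""
    prefix, _, span = name.partition("_")
    start, _, end = span.partition("-")
    if prefix != "m1" or not start or not end:
        raise ValueError(f"Not a Monitor1 offload file name: {name!r}")
    return (datetime.strptime(start, FILE_FORMAT),
            datetime.strptime(end, FILE_FORMAT))


if __name__ == "__main__":
    print(offload_file_name("2013-05-03 13:32:31", "2013-05-04 12:50:19"))
```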

    Operations

    Archiving log

    The log can be archived while OServer is running with certain tools, such as logrotate. The only requirement is to send the SIGHUP signal afterwards. For example:

    # service oserver reload
    or
    $ pkill -HUP oserver

    As an example, the following logrotate configuration file, oserver-logrotate, rotates the log whenever it exceeds 10MB and keeps 12 rotated files:

    oserver.log {
      rotate 12
      size 10M
      compress
      create 0644 gpubox gpubox
      postrotate
        /usr/bin/killall -HUP oserver
      endscript  
    }

    The following command can be run by cron periodically:

    $ logrotate -v -s ./oserver_state ./oserver-logrotate

    Recovering GPU

    When GPUs were in use and OServer has failed or was restarted:

    Processes: Processes running on GPUServer are preserved, but users cannot start new processes.
    Allocations: All allocations are saved in the database and are recovered after OServer's restart.

    Because restarting OServer is quick, it is normally unnoticeable to users' running operations.