GPU Monitoring and Failure Notification

When mining cryptocurrency with GPUs, the GPU's built-in thermal control may not be adequate in some environments, which can cause the hash rate to drop or even the GPU to fault. This project provides a tool to:

  • Change the fan speed according to the detected temperature
  • Execute custom action scripts when the current temperature of a GPU is over a defined threshold
  • Execute custom action scripts when the current hash rate of a GPU is under a defined threshold

Currently, the example action scripts are based on the miner’s remote management API. For a different miner, they might need to be customized locally.

Environment and Installation

Intel CPU / 16 GB RAM / Ubuntu 18.04 / Python 3

  • Environment variables

    To configure Nvidia GPUs, the following environment variables must be set.

    export DISPLAY=:0
    export XAUTHORITY=/var/run/lightdm/root/:0
    export NO_AT_BRIDGE=1
  • Clone the source

    https://github.com/Ed-Yang/gpuctl

    Set up a Python virtual environment:

    cd gpuctl
    python3 -m venv venv
    source ./venv/bin/activate
    pip install -r requirements.txt

    Install gpuctl:

    pip install .

    After completing the above procedure, you only need to run the following before running gpuctl:

    source ./venv/bin/activate

Usage

Some parameters (such as interval and curve) apply to every GPU on the system. If a specific setting is needed for one GPU, run a separate gpuctl instance with the desired parameters.

```shell
usage: gpuctl [-h] [-l] [-s SLOTS] [-a] [-n] [--interval INTERVAL] [-f FAN]
              [-d DELTA] [--temp TEMP] [--temp-cdown TEMP_CDOWN] [--tas TAS]
              [--rms RMS] [--rate RATE] [--rate-cdown RATE_CDOWN] [--ras RAS]
              [--curve CURVE] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -l, --list            list all GPU cards
  -s SLOTS, --slots SLOTS
                        use PCI slot name to locate GPU (ie.
                        0000:01:00.0/0000:01:00.1)
  -a, --amd             only use AMD GPU
  -n, --nvidia          only use Nvidia GPU
  --interval INTERVAL   monitoring interval
  -f FAN, --fan FAN     if temperature is exceed than FAN once, activate fan
                        control (default:70)
  -d DELTA, --delta DELTA
                        set fan speed if temperature diff % is over DELTA
                        (defaut:2)
  --temp TEMP           over temperature action threshold
  --temp-cdown TEMP_CDOWN
                        over temperature count down
  --tas TAS             over temperature action script
  --rms RMS             rate monitoring script
  --rate RATE           under rate threshold (default: 1000 kh)
  --rate-cdown RATE_CDOWN
                        under rate count down
  --ras RAS             under rate action script
  --curve CURVE         set temp/fan-speed curve (ie. 0:0/10:10/80:100)
  -v, --verbose         show debug message
```
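To illustrate the `--curve` format shown in the usage text (`0:0/10:10/80:100`, pairs of temperature:fan-speed separated by `/`), here is a minimal Python sketch. The function names `parse_curve` and `fan_speed` are hypothetical, and linear interpolation between points is an assumption; gpuctl's actual implementation may compute the speed differently.

```python
def parse_curve(text):
    """Parse a 'temp:speed/temp:speed/...' string into sorted (temp, speed) pairs."""
    points = []
    for pair in text.split("/"):
        temp, speed = pair.split(":")
        points.append((int(temp), int(speed)))
    return sorted(points)


def fan_speed(points, temp):
    """Return a fan speed (%) for temp.

    Assumption: linear interpolation between neighbouring curve points,
    clamped to the first and last point outside the curve's range.
    """
    if temp <= points[0][0]:
        return points[0][1]
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if temp <= t1:
            return round(s0 + (s1 - s0) * (temp - t0) / (t1 - t0))
    return points[-1][1]


curve = parse_curve("0:0/10:10/80:100")
print(fan_speed(curve, 45))  # halfway between the 10c and 80c points
```

Under these assumptions, a temperature above the last point simply gets the last point's speed, so the curve `0:0/10:10/80:100` would run the fan at 100% from 80c upward.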
  • Slot Name

The slot name of each GPU card can be found with the “lspci -D” command. In the following output, the slot name of the AMD GPU card is “0000:01:00.0”.

```shell
lspci -D
```

```shell
0000:01:00.0 VGA compatible controller....
```
  • Action scripts

A few example action scripts are provided for reference. It is also feasible to write a script that sends a syslog, email, or Telegram message, etc.

  • scripts/rate.sh: get miner’s current hashrate
  • scripts/restart.sh: restart miner
  • scripts/reboot.sh: reboot rig

If a failure is detected (overheating or under rate), gpuctl invokes the given script with the slot name as its argument.

Take ‘ethminer’ as an example: if gpuctl should tell the ‘ethminer’ program to restart itself when an error is detected, fill in the correct mapping from each slot name to the TCP port that ethminer listens on.

```shell
if [[ $# -ne 0 ]] ; then
    case $1 in
        # 0000:01:00.0)
        #     PORT="3335"
        #     ;;
        *)
            PORT="3333"
            ;;
    esac
fi
```
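For readers who prefer to write action scripts in Python, the same slot-to-port mapping could be sketched as below. This is not part of the provided scripts: `SLOT_PORTS`, `port_for_slot`, and `restart_request` are hypothetical names, and the `miner_restart` method is assumed from ethminer's JSON-RPC API, so check your miner's documentation before relying on it.

```python
import json
import sys

DEFAULT_PORT = 3333

# Hypothetical per-slot overrides, mirroring the commented-out case
# in the shell snippet; uncomment and adjust for rigs with several miners.
SLOT_PORTS = {
    # "0000:01:00.0": 3335,
}


def port_for_slot(slot):
    """Return the miner API port for a PCI slot name, falling back to the default."""
    return SLOT_PORTS.get(slot, DEFAULT_PORT)


def restart_request(request_id=5):
    """Build a newline-terminated JSON-RPC request (method name is an assumption)."""
    body = {"id": request_id, "jsonrpc": "2.0", "method": "miner_restart"}
    return json.dumps(body) + "\n"


def main(argv):
    # The slot name arrives as the first command-line argument.
    slot = argv[1] if len(argv) > 1 else ""
    return f"slot={slot} port={port_for_slot(slot)}"


if __name__ == "__main__":
    print(main(sys.argv))
```

The sketch only resolves the port and builds the request; sending it to the miner over TCP is left out so the mapping logic can be tested without a running miner.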

Interaction with miners

To use the under-rate detection feature, the miner must provide a way to retrieve its current hash rate. Some miners implement a network management function, which can easily be enabled to make the hash rate accessible to other programs.

The following example sets up a network management port on TCP port 3333.

  • Ethminer/nsfminer, additional parameters:

    --api-bind 127.0.0.1:3333 or --api-port -3333

  • Phoenixminer, additional parameters (check section 3):

    -cdm 2 -cdmport 3333

    If the miner’s network management function is enabled, it can be tested with:

  • Sample Scripts

Get hash rate:

```shell
./scripts/rate.sh
```

Restart miner:

```shell
./scripts/restart.sh
```
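As an illustration of what a hash rate query over the management port might look like, here is a minimal Python sketch. It assumes the miner speaks ethminer's Claymore-compatible JSON-RPC protocol, where a `miner_getstat1` reply carries "total rate in kH/s;accepted;rejected" in the third element of `result`. The function names and the sample reply in the usage lines are hypothetical; other miners may use a different format.

```python
import json
import socket


def stat_request(request_id=1):
    """Build a newline-terminated miner_getstat1 request (assumed API method)."""
    body = {"id": request_id, "jsonrpc": "2.0", "method": "miner_getstat1"}
    return json.dumps(body) + "\n"


def total_hashrate_khs(response_line):
    """Extract the total hash rate from one JSON reply line.

    Assumption: result[2] has the form 'total_rate;accepted;rejected'.
    """
    reply = json.loads(response_line)
    return int(reply["result"][2].split(";")[0])


def query_hashrate(host="127.0.0.1", port=3333, timeout=3.0):
    """Send the request to the miner's management port and parse the reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(stat_request().encode())
        with conn.makefile("r") as reader:
            return total_hashrate_khs(reader.readline())


# Hypothetical reply, made up for illustration only.
sample = '{"id":1,"jsonrpc":"2.0","result":["x","12","28564;10;0","28564"]}'
print(total_hashrate_khs(sample))
```

`query_hashrate()` needs a running miner with its API port enabled; the parsing step is kept separate so it can be checked offline.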

Examples

  • Example 1) List the on-board GPU cards

    gpuctl --list
    ID Slot Name    Vendor   PCI-ID
    -- ------------ -------- -----------
     1 0000:01:00.0 AMD      [1002:67DF]
  • Example 2) For all GPUs, activate fan speed control if the temperature is over 30c.

    sudo gpuctl --fan 30
    ID Slot Name    Vendor   PCI-ID
    -- ------------ -------- -----------
     1 0000:01:00.0 AMD      [1002:67DF]

    gpuctl: started

    12:02:20 INFO     [0000:01:00.0/AMD] current temp. 57c set speed 52%
    12:03:39 INFO     [0000:01:00.0/AMD] current temp. 60c set speed 61%
  • Example 3) For every GPU, activate fan control if its temperature is over 50c, and call the restart script if its temperature stays over 55c for 5s

    sudo gpuctl --fan 50 --temp 55 --tas ./scripts/restart.sh --temp-cdown 5
    ID Slot Name    Vendor   PCI-ID
    -- ------------ -------- -----------
     1 0000:01:00.0 AMD      [1002:67DF]

    gpuctl: started

    03:50:36 INFO     [0000:01:00.0/AMD] current temp. 58c set speed 52%
    03:50:36 WARNING  [0000:01:00.0/AMD] temp: 58c/55c CD: 5
    03:50:37 WARNING  [0000:01:00.0/AMD] temp: 58c/55c CD: 4
    03:50:38 WARNING  [0000:01:00.0/AMD] temp: 59c/55c CD: 3
    03:50:39 WARNING  [0000:01:00.0/AMD] temp: 58c/55c CD: 2
    03:50:40 WARNING  [0000:01:00.0/AMD] temp: 59c/55c CD: 1
    03:50:41 INFO     [0000:01:00.0/AMD] over heat, exec script ./scripts/restart.sh
    03:50:41 INFO     [0000:01:00.0/AMD] result: send restart command to slot=0000:01:00.0, port=3333
    {"id":5,"jsonrpc":"2.0","result":true}

    03:50:42 WARNING  [0000:01:00.0/AMD] temp: 56c/55c CD: 5
    03:50:43 WARNING  [0000:01:00.0/AMD] temp: 57c/55c CD: 4
  • Example 4) For every GPU, call the restart script if its temperature is over 55c or its rate is under 30000 Kh/s

Using ethminer as an example:

```shell
ethminer -G --api-port 3333 -P ....
```

```shell
sudo gpuctl --temp 55 --tas ./scripts/restart.sh --rms ./scripts/rate.sh --rate 30000 --ras ./scripts/restart.sh
```

```shell
03:45:38 WARNING  [0000:01:00.0/AMD] rate: 28564/30000 CD: 2
03:45:39 WARNING  [0000:01:00.0/AMD] temp: 60c/55c CD: 1
03:45:39 INFO     [0000:01:00.0/AMD] over heat, exec script ./scripts/restart.sh
03:45:39 INFO     [0000:01:00.0/AMD] result: send restart command to slot=0000:01:00.0, port=3333
{"id":5,"jsonrpc":"2.0","result":true}

03:45:40 WARNING  [0000:01:00.0/AMD] temp: 57c/55c CD: 120
03:45:42 ERROR    0000:01:00.0/AMD] get hashrate, exec script ./scripts/rate.sh failed !!
```

While the miner is restarting, gpuctl might not be able to retrieve the hash rate for a few seconds.

  • Run Test Cases

    python3 -m unittest discover tests
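The count-down visible in the Example 3 log (CD: 5 down to CD: 1, then the action script runs and the count starts again at 5) can be sketched as a small state machine. The class below is an illustrative sketch only: `CountDown` is a hypothetical name, and resetting the counter when a sample falls back within the threshold is an assumption, not confirmed behavior of gpuctl.

```python
class CountDown:
    """Fire an action after `limit` consecutive breached samples have been counted."""

    def __init__(self, limit):
        self.limit = limit
        self.remaining = limit

    def update(self, breached):
        """Feed one sample; return True when the action script should run."""
        if not breached:
            # Assumption: a healthy sample resets the count-down.
            self.remaining = self.limit
            return False
        if self.remaining == 0:
            # Count-down exhausted: fire and start over.
            self.remaining = self.limit
            return True
        self.remaining -= 1
        return False


cd = CountDown(5)
temps = [58, 58, 59, 58, 59, 58]   # readings taken from the Example 3 log pattern
fired = [cd.update(t > 55) for t in temps]
print(fired)
```

With one sample per second, this fires on the sixth consecutive over-threshold reading, which matches the five warnings followed by the script execution in the log.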

Diagnostics

  • Monitor AMD GPU card

    sudo watch -c -n 2 amd-info
  • Monitor Nvidia GPU card

    sudo watch -c -n 2 nvidia-info

    or

    sudo watch -c -n 2 nvidia-smi

Q/A

  • nvidia Unable to init server: Could not connect: Connection refused

    In ~/.profile, add:

    export DISPLAY=:0
    export XAUTHORITY=/var/run/lightdm/root/:0
  • (nvidia-settings:15781): dbind-WARNING **: 04:46:56.622….

    In ~/.profile, add:

    export NO_AT_BRIDGE=1
