Deploying the Full DeepSeek R1 671B with Ray Cluster + vLLM

Author: Kevin
Published: February 14, 2025
Category: AI Model

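System environment

OS version	IP address	Hostname	vLLM version	Ray version	GPUs
Ubuntu 22.04.5 LTS	10.84.10.24	server1	0.7.2	2.42.1	8 x A100 80 GB
Ubuntu 22.04.5 LTS	10.84.10.22	server3	0.7.2	2.42.1	8 x A100 80 GB
Ubuntu 22.04.4 LTS	10.84.10.23	server4	0.7.2	2.42.1	8 x A100 80 GB

Prerequisites

The GPU driver is already installed (see https://shackles.cn/index.php/archives/306/).
The machines are connected over 10 Gigabit Ethernet.

System tuning (on all nodes)

Raise the maximum number of open file descriptors

If ulimit -n is too low, processes may be unable to create new sockets, and distributed training or inference fails with an error such as: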

Too many open files

Temporary change (current shell session only)

ulimit -n 65536

Permanent change

1. Edit /etc/security/limits.conf

echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf

2. Edit /etc/systemd/system.conf and /etc/systemd/user.conf

sudo sed -i '/^#DefaultLimitNOFILE/c\DefaultLimitNOFILE=65536' /etc/systemd/system.conf

sudo sed -i '/^#DefaultLimitNOFILE/c\DefaultLimitNOFILE=65536' /etc/systemd/user.conf

3. Reboot for the changes to take effect

sudo reboot
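
After the reboot, it can be worth confirming that the new limit is really in effect for the process that will launch Ray and vLLM. The following is a minimal sketch using Python's standard resource module (Unix only). The function name nofile_limit_ok is illustrative, and the default threshold of 65536 is simply the value configured above.

```python
import resource


def nofile_limit_ok(required=65536):
    """Return True if this process's soft open-file limit is at least `required`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which always satisfies the requirement.
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= required


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard} ok={nofile_limit_ok()}")
```

Run it from the same shell or service that will start Ray, since limits are inherited per process and a different login session may still carry the old value.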

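Tuning NCCL communication (on all nodes)

Set the following environment variables on each node in the shell that will start Ray. The interface name and IP address differ per node, so use the interface that carries that node's 10.84.10.x address.

Head node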

export NCCL_SOCKET_IFNAME=ens110f1
export GLOO_SOCKET_IFNAME=ens110f1
export NCCL_CROSS_NIC=1
export NCCL_NET_GDR_LEVEL=SYS
export NCCL_DEBUG=TRACE
export NCCL_DEBUG_SUBSYS=^ALLOC
export VLLM_HOST_IP=10.84.10.24
export VLLM_LOGGING_LEVEL=DEBUG

Worker-1 node

export NCCL_SOCKET_IFNAME=ens6f1
export GLOO_SOCKET_IFNAME=ens6f1
export NCCL_CROSS_NIC=1
export NCCL_NET_GDR_LEVEL=SYS
export NCCL_DEBUG=TRACE
export NCCL_DEBUG_SUBSYS=^ALLOC
export VLLM_HOST_IP=10.84.10.22
export VLLM_LOGGING_LEVEL=DEBUG

Worker-2 node

export NCCL_SOCKET_IFNAME=ens6f0
export GLOO_SOCKET_IFNAME=ens6f0
export NCCL_CROSS_NIC=1
export NCCL_NET_GDR_LEVEL=SYS
export NCCL_DEBUG=TRACE
export NCCL_DEBUG_SUBSYS=^ALLOC
export VLLM_HOST_IP=10.84.10.23
export VLLM_LOGGING_LEVEL=DEBUG

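Variable reference

GLOO_SOCKET_IFNAME: the network interface that Gloo (a PyTorch distributed backend) binds to.
NCCL_SOCKET_IFNAME: the network interface that NCCL (NVIDIA Collective Communications Library) binds to for socket communication between GPUs on different nodes.
NCCL_CROSS_NIC: whether NCCL may use different NICs on different nodes for the same ring or tree. 1 allows it, 0 forbids it.
VLLM_LOGGING_LEVEL: the default log level for vLLM.
NCCL_NET_GDR_LEVEL: the maximum topological distance between a GPU and a NIC at which GPU Direct RDMA (GDR) is used. Possible values include:

LOC: never use GDR.
PIX: use GDR when the GPU and NIC are on the same PCIe switch.
PHB: use GDR when the GPU and NIC are on the same NUMA node (traffic goes through the CPU).
SYS: use GDR even across the interconnect between NUMA nodes, that is, system-wide.
SYS is used here so that GDR is allowed across the whole system. Note that GDR only takes effect when NCCL runs over an RDMA-capable network such as InfiniBand or RoCE.

NCCL_DEBUG: the verbosity of NCCL's log output. Common values:

WARN: print warnings and errors only.
INFO: print basic informational messages.
TRACE: print detailed debugging information for every operation.

NCCL_DEBUG_SUBSYS: filters NCCL debug output by subsystem.

It selects which subsystems are logged or ignored.
A leading ^ excludes the listed subsystems, so ^ALLOC suppresses memory-allocation messages. This reduces log volume and makes communication or synchronization problems easier to spot.

VLLM_HOST_IP: the IP address of the current node in the distributed environment.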
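
Because only the interface name and IP address change from node to node, the three export blocks above can be generated from one small table. This is a sketch, not part of the original setup: the names NODES, COMMON, node_env and export_lines are illustrative, while the values are copied from the blocks above.

```python
# Per-node values copied from the three export blocks above.
NODES = {
    "head": {"ifname": "ens110f1", "ip": "10.84.10.24"},
    "worker-1": {"ifname": "ens6f1", "ip": "10.84.10.22"},
    "worker-2": {"ifname": "ens6f0", "ip": "10.84.10.23"},
}

# Settings that are identical on every node.
COMMON = {
    "NCCL_CROSS_NIC": "1",
    "NCCL_NET_GDR_LEVEL": "SYS",
    "NCCL_DEBUG": "TRACE",
    "NCCL_DEBUG_SUBSYS": "^ALLOC",
    "VLLM_LOGGING_LEVEL": "DEBUG",
}


def node_env(name):
    """Return the full environment-variable mapping for one node."""
    node = NODES[name]
    env = {
        "NCCL_SOCKET_IFNAME": node["ifname"],
        "GLOO_SOCKET_IFNAME": node["ifname"],
        "VLLM_HOST_IP": node["ip"],
    }
    env.update(COMMON)
    return env


def export_lines(name):
    """Render the mapping as shell `export` lines."""
    return [f"export {key}={value}" for key, value in node_env(name).items()]


for line in export_lines("worker-1"):
    print(line)
```

Keeping the shared settings in one place makes it harder for a node to end up with a stale or mismatched value.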

Deploying the Ray cluster

Install Ray (on all nodes)

pip install -U "ray[default]" -i https://repo.huaweicloud.com/repository/pypi/simple/

Start the Ray head node

(ray) root@server1:~# ray start --head --dashboard-host 0.0.0.0

Join the worker nodes to the Ray cluster

(ray) root@server3:~# ray start --address='10.84.10.24:6379'
(ray) root@server4:~# ray start --address='10.84.10.24:6379'

Check the cluster resources and verify that the nodes joined successfully

(ray) root@server1:~# ray status
======== Autoscaler status: 2025-02-14 09:05:15.452837 ========
Node status
---------------------------------------------------------------
Active:
 1 node_b0569f12ecacc9109c72036aabdc3e51731af88084bf1f5169f20e62
 1 node_8db61c9a75d08d79958a6aa4f0a9685e4cde76923f9c2904a4907f49
 1 node_db4b9d1a078369059b6163131f297bb8b2849c3fed28fe12a4463a6f
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/384.0 CPU
 0.0/24.0 GPU
 0B/1.57TiB memory
 0B/558.79GiB object_store_memory

Demands:
 (no resource demands)

Note: the Ray version must be the same on every node in the cluster.
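
The model launch later in this article uses --tensor-parallel-size 8 and --pipeline-parallel-size 3, and vLLM needs one GPU per worker, so the cluster has to expose 8 x 3 = 24 GPUs. A small sketch of that check follows. The function names are illustrative, and the sample dictionary is copied from the ray status output above. On a live cluster, ray.cluster_resources() returns a dictionary of the same shape after ray.init(address="auto").

```python
def gpus_required(tensor_parallel_size, pipeline_parallel_size):
    """Total GPUs needed: one worker per GPU, TP x PP workers in total."""
    return tensor_parallel_size * pipeline_parallel_size


def cluster_can_serve(resources, tensor_parallel_size=8, pipeline_parallel_size=3):
    """Return True if the cluster's GPU total covers the parallel layout."""
    needed = gpus_required(tensor_parallel_size, pipeline_parallel_size)
    return resources.get("GPU", 0) >= needed


# Totals copied from the `ray status` output above.
resources = {"CPU": 384.0, "GPU": 24.0}
print(gpus_required(8, 3), cluster_can_serve(resources))
```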

Installing vLLM (on all nodes)

pip install vllm -i https://repo.huaweicloud.com/repository/pypi/simple/

Starting the model

nohup vllm serve /models02/unsloth-DeepSeek-R1-BF16 \
  --trust-remote-code \
  --served-model-name DeepSeek-R1 \
  --max-model-len 16384 \
  --host 0.0.0.0 \
  --port 80 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 3 \
  --api-key CPa0McW7RVfjPBR4PXThNj5DFdEgL7^C \
  --gpu-memory-utilization 0.95 \
  > /var/log/vllm_serve.log 2>&1 &
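
A rough memory budget helps explain the choice of 24 GPUs. The sketch below is back-of-the-envelope arithmetic only. It assumes 671B parameters stored in BF16 (2 bytes each), treats each A100 80G as 80 GB, and assumes the weights are spread evenly across GPUs, which a real pipeline split may not do exactly. All names are illustrative.

```python
PARAMS = 671e9                  # parameter count, from the model name (671B)
BYTES_PER_PARAM = 2             # BF16 stores each parameter in 2 bytes
NUM_GPUS = 8 * 3                # --tensor-parallel-size x --pipeline-parallel-size
GPU_MEMORY_GB = 80              # A100 80G, treated here as 80 GB (assumption)
GPU_MEMORY_UTILIZATION = 0.95   # --gpu-memory-utilization


def memory_budget_gb():
    """Estimate weight size versus usable GPU memory, in GB."""
    weights = PARAMS * BYTES_PER_PARAM / 1e9
    usable = NUM_GPUS * GPU_MEMORY_GB * GPU_MEMORY_UTILIZATION
    return {
        "weights": weights,
        "usable": usable,
        "headroom": usable - weights,
        "weights_per_gpu": weights / NUM_GPUS,
    }


for key, value in memory_budget_gb().items():
    print(f"{key}: {value:.1f} GB")
```

Under these assumptions the weights take about 1342 GB of roughly 1824 GB usable, which leaves about 482 GB for the KV cache and activations. Two nodes (16 x 80 GB = 1280 GB) would not even hold the weights, which is consistent with using three nodes here.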

Parameter explanation: see https://shackles.cn/index.php/archives/325/

Note: without an InfiniBand or RoCE network, use 10 Gigabit Ethernet or faster where possible. The errors below were reported by Ray some time after startup on Gigabit Ethernet:

1. ray.exceptions.ActorUnavailableError: The actor c33caec0b10576a560220f8e02000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: recvmsg:Connection timed out; RPC Error details: . The task may or maynot have been executed on the actor.

2. (RayWorkerWrapper pid=22535) [rank1]:[E310 16:27:47.480626819 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerWrapper pid=22535) [rank1]:[E310 16:27:47.480650391 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.

3. [2025-03-12 15:48:57,381 E 76289 83792] gcs_rpc_client.h:193: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by ray stop or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. The program will terminate.

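Testing access

The request goes to port 80, where the service was started, and passes the key set with --api-key as a Bearer token:

curl http://127.0.0.1:80/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer CPa0McW7RVfjPBR4PXThNj5DFdEgL7^C" \
     -d '{
         "model": "DeepSeek-R1",
         "messages": [
             {"role": "user", "content": "Give a brief introduction to the Ray compute engine"}
         ],
         "max_tokens": 1024
     }'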

The same service can be called from Python with the openai client:

import openai

# Values matching the vllm serve command above
PROXY_API_KEY = "CPa0McW7RVfjPBR4PXThNj5DFdEgL7^C"  # --api-key
PROXY_SERVER_URL = "http://10.84.10.24/v1"           # head node, port 80
PROXYLLM_BACKEND = "DeepSeek-R1"                     # --served-model-name

# Pass the key directly instead of going through the OPENAI_API_KEY variable
client = openai.OpenAI(base_url=PROXY_SERVER_URL, api_key=PROXY_API_KEY)


def generate_text(prompt, model_name=PROXYLLM_BACKEND, max_tokens=1024):
    """Stream a chat completion, echo it as it arrives, and return the full text."""
    payload = {
        'stream': True,
        'model': model_name,
        'temperature': 0.6,
        'max_tokens': max_tokens
    }
    messages = [
        {'role': 'system', 'content': 'You are a helpful AI assistant.'},
        {'role': 'user', 'content': prompt}
    ]

    stream = client.chat.completions.create(
        messages=messages,
        **payload
    )

    text = ""
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those
        if chunk.choices:
            delta = chunk.choices[0].delta
            if delta.content:
                text += delta.content
                print(delta.content, end='', flush=True)
    return text


prompt = "3, 10, 15, 26: what is the next number?"
generated_text = generate_text(prompt)
print("\nGenerated Text:", generated_text)
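
Before sending chat requests, a quick readiness check can be made against the OpenAI-compatible /v1/models endpoint. The sketch below uses only the Python standard library. The function names models_request and served_models are illustrative, and the 5-second timeout is an arbitrary choice.

```python
import json
import urllib.error
import urllib.request


def models_request(base_url, api_key):
    """Build a GET request for the OpenAI-compatible /models endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def served_models(base_url, api_key, timeout=5):
    """Return the served model names, or None if the request fails."""
    try:
        request = models_request(base_url, api_key)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.load(response)
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and HTTP errors such as 401.
        return None
    return [model["id"] for model in body.get("data", [])]
```

For the deployment above, served_models("http://10.84.10.24/v1", PROXY_API_KEY) should list DeepSeek-R1 once the server is up, and return None while it is unreachable or if the key is rejected.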

Load testing

The load-testing script is part of the vLLM repository: https://github.com/vllm-project/vllm

Clone the repository

git clone https://github.com/vllm-project/vllm

Run the load test

cd vllm
python3.10 benchmarks/benchmark_serving.py --host 10.84.4.53 --port 8000 --max-concurrency 100 --dataset-name random --tokenizer /models/DeepSeek-R1-Distill-Llama-70B/ --trust-remote-code --model DeepSeek-R1 --random-input-len 256 --random-output-len 2048 --num-prompts 5

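Parameter explanation

Parameter	Description
`python3.10`	Runs the script with the Python 3.10 interpreter
`benchmarks/benchmark_serving.py`	Path of the benchmark script to run
`--host 10.84.4.53`	IP address of the inference server under test; use the address of your own service (10.84.10.24 for the deployment above)
`--port 8000`	Server port, used together with `host`; the deployment above listens on port 80
`--max-concurrency 100`	Maximum number of concurrent requests, for testing the server under high concurrency
`--dataset-name random`	Uses a randomly generated dataset instead of a real one
`--tokenizer /models/DeepSeek-R1-Distill-Llama-70B/`	Path of the tokenizer used to encode and decode text
`--trust-remote-code`	Allows loading remote code (such as custom code shipped with a HuggingFace model), typically needed for non-standard model implementations
`--model DeepSeek-R1`	Name of the model to test; it must match the name the server is serving
`--random-input-len 256`	Length of each randomly generated input prompt: 256 tokens
`--random-output-len 2048`	Length of each generated output response: 2048 tokens
`--num-prompts 5`	Number of random prompts to send: 5. With only 5 prompts, the effective concurrency cannot exceed 5, whatever `--max-concurrency` is set to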
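
The size of the workload this command generates follows directly from its flags. A small sketch of the arithmetic, using an illustrative helper named benchmark_workload:

```python
def benchmark_workload(num_prompts, max_concurrency, input_len, output_len):
    """Summarize the load implied by the benchmark flags."""
    return {
        # Requests in flight can never exceed the number of prompts sent.
        "effective_concurrency": min(num_prompts, max_concurrency),
        "total_input_tokens": num_prompts * input_len,
        "requested_output_tokens": num_prompts * output_len,
    }


workload = benchmark_workload(
    num_prompts=5, max_concurrency=100, input_len=256, output_len=2048
)
print(workload)
```

With the values from the command above this gives 5 requests in flight at most, 1280 input tokens, and 10240 requested output tokens, so raising --num-prompts is needed before --max-concurrency 100 has any effect.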

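Test results on Gigabit Ethernet

[Figure: test results on Gigabit Ethernet, https://oss.shackles.cn/VLLM/Test-Results.png]

Last modified: March 14, 2025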


