Questions tagged [hpc]

High Performance Computing encompasses using "supercomputers" with large numbers of CPUs, large parallel storage systems and advanced networks to perform time-consuming calculations. Parallel algorithms and parallelization of storage are essential to this field, as are issues with complex, fast networking fabrics such as InfiniBand.

0 votes
0 answers

Issues Connecting to HPC Head Node from Non-Domain-Joined Machines - Help Needed!

Hello fellow ServerFaultees, I'm encountering a challenge with my HPC cluster setup. My main hurdle is connecting to the HPC head node from computers that are not domain-joined, specifically when ...
Chris's user avatar
  • 75
1 vote
2 answers

Install software on Ubuntu using apt without root?

I’ve got accounts on many HPC clusters. The machines have a minimal install, and the admins won’t add much else. I need to install lots of typical software. Normally I’d do this with apt, but of ...
projectshave's user avatar
0 votes
0 answers

HPC node, Infiniband is DOWN

I have an HPC with 17 nodes running CentOS 7 and a dedicated Mellanox SX6036 Infiniband switch, each node has an Infiniband FDR interface. Recently one node started giving errors and a quick look ...
Chris Woelkers's user avatar
0 votes
1 answer

Why does the login node connect to external networks but allocated compute node fail in Slurm-GCP?

I've noticed that connecting to the internet from the allocated compute node via Slurm-GCP keeps failing. For example, using wget from the login node works successfully: [me@gcp-login0 ~]$ wget https:/...
Mathews24's user avatar
  • 101
0 votes
0 answers

Setting up a RHEL cluster for high performance and load balancing

I have 7 servers running RHEL and want to set up a cluster with them. We need the cluster for: 1: high performance 2: load balancing. I have a NAS for shared storage. I want to set up 1 server as visualization ...
biplab 's user avatar
0 votes
1 answer

Linux missing lvm

Hello, I have an Ubuntu HPC cluster and a problem with storage: whenever I try to access the storage from my compute nodes, I can't. I keep getting this error: mount:mounting
Dhamer Nader's user avatar
0 votes
0 answers

Singularity vs. Podman for HPC workloads?

Singularity is said to work well for HPC workloads. RedHat is making an effort to make Podman more usable for HPC, e.g. it (and I presume this is with their ubi8 image) is said to work with MPI. That'...
Cavalcade's user avatar
-1 votes
1 answer

Speeding up SAS data rate from 12Gbps

I'm curious about SAS data transfer speed. Maximum is 12Gbps in the whole bus (not per drive) as far as I understand, but I have a scenario where I would like to have a faster data rate (hopefully ...
zRISC's user avatar
  • 13
0 votes
0 answers

Not able to ssh into 2 Compute Nodes on HPE Cluster

I recently added two new compute nodes to an HPE cluster, but surprisingly, I am unable to ssh into the new compute nodes from the head node. [Unable to SSH to new Compute Nodes][1] (base) [root@hn001 ~...
Aditya Kaushal's user avatar
1 vote
1 answer

HPC master node no infiniband network influence on compute nodes - Slurm management

I'm writing because I'm facing an issue that I cannot solve while trying to configure a cluster with a master node (or frontend node) as a virtual machine managing nodes with an InfiniBand network. I use ...
SimoneM's user avatar
  • 121
0 votes
1 answer

Single SSH login for multiple machines?

I have a number of physical (desktop) machines running at the office as part of a new network to handle processing & serving Open Source data; some of these machines also house VMs. At the moment, ...
Michael Hillman's user avatar
2 votes
0 answers

Infiniband fabric with 3 nodes - newbie

I am trying to connect 3 HP z840 workstations using: Mellanox ConnectX-3 VPI 40 / 56GbE Dual-Port QSFP Adapter MCX354A-FCBT Mellanox SX6005 12-port Non-blocking Unmanaged 56Gb/s Description of ...
theenemy's user avatar
  • 121
2 votes
1 answer

How can I set up interactive-job-only or batch-job-only partition on a SLURM cluster?

I'm managing a PBS/torque HPC cluster, and now I'm setting up another cluster with SLURM. On the PBS cluster, I can set a queue to accept only interactive jobs by qmgr -c "set queue interactive_q ...
wdg's user avatar
  • 153
0 votes
0 answers

Setting up slurm on a cluster

My IT admin has setup a cluster with 3 nodes, which is administered via Windows server. VMs are hosted via Hyper-V, including an Ubuntu VM to which a substantial portion of the cluster's resources ...
ml_white_belt's user avatar
1 vote
1 answer

Why can't the GPUs communicate in a multi-GPU server?

This is a Dell PowerEdge r750xa server with 4 Nvidia A40 GPUs, intended for AI applications. While the GPUs work well individually, multi-GPU training jobs or indeed any multi-GPU computational ...
isarandi's user avatar
  • 341
1 vote
1 answer

Infiniband OpenSM N-to-N port routing configuration

I have 10 servers with two CPUs each and one Mellanox 100G Infiniband NIC per CPU. Each NIC is connected to a single Mellanox 36 port 100G IB switch. My RDMA application runs as one process per NUMA ...
Hugo Maxwell's user avatar
2 votes
0 answers

Lustre glitch: latency of minutes

Using an HPC Lustre filesystem, we occasionally experience glitchiness where even simply opening a terminal and typing "ls" can take minutes to return. That is, any process that involves the ...
benjimin's user avatar
  • 121
1 vote
0 answers

Why can a non-root installation work across the whole cluster?

I recently installed anaconda (which includes a new python3) locally in my account folder on a cluster with a dozen nodes (each node with several cores). I use it to install some package P that is ...
xiaohuamao's user avatar
0 votes
1 answer

Changing the subnet on which my beegfs cluster operates

I've added some fiber channels between the machines that constitute my BeeGFS cluster in an effort to increase throughput. However, I have to leave the old copper Ethernet in place with its addressing ...
Jarmund's user avatar
  • 535
1 vote
1 answer

Latency of memory accesses via interconnectors

I'm trying to compare latencies of different node interconnects for a cluster. The goal is to minimize the memory access latency. I have obtained some benchmarks regarding one of the hardware ...
Piotr M's user avatar
  • 33
2 votes
0 answers

Current single system image solutions

I'm designing a cluster for a small research institute. Since our computations require a large amount of memory, I'm looking for a solution that will allow our applications access to the whole memory ...
Piotr M's user avatar
  • 33
1 vote
1 answer

ssh port forwarding (tunneling in HPC)

I have an application server that runs on a compute node. The server opens a port (9000) and I then run a command for tunneling between my local machine and the server: ssh -N -f -L 9000:compute-node:...
moth's user avatar
  • 111
1 vote
1 answer

HPC cluster master node as virtual machine

For a given small HPC cluster (~16 nodes) a master node is used as a front-end for users to login and interact with SLURM, and not as a computing node. The master node is currently a bare-metal server....
Alejandro Arcila's user avatar
0 votes
1 answer

OpenLDAP implementation allows only root user to set passwords of accounts

I'm working on an application that requires the use of AWS ParallelCluster assets for some high performance processing. After the initial setup, we need to be able to add/remove user accounts and I am ...
ProlucidDavid's user avatar
2 votes
1 answer

Considerations using consumer class (high-end) GPU in server?

Motivation: First of all, even if I have some knowledge of computer science, software development and server Linux administration, I never looked into a server hardware and I am a total "newbie"...
Adrian Maire's user avatar
2 votes
2 answers

Infiniband drivers : OFED or distro included?

I'm setting up a Linux cluster with an InfiniBand network, and I'm quite a newbie in the InfiniBand world; any advice is more than welcome! We are currently using Mellanox OFED drivers, but our infiniband ...
nirnaeth's user avatar
2 votes
1 answer

SLURM with "partial" head node

I am trying to install SLURM with NFS on a small ubuntu 18.04 HPC cluster, in a typical fashion, e.g. configure controller (slurmctld) and clients (slurmd) and shared directory, etc. What I am curious ...
rage_man's user avatar
  • 123
3 votes
0 answers

Bad multicore performance on DL360 Gen 10 with 2xXeon 6154

I have an issue with the multi-core performance of some servers. The servers are HPE DL360 Gen10, with 2x Xeon Gold 6154 (18 cores). When I refer to performance, they are slower than some older ...
vimax87's user avatar
  • 31
3 votes
2 answers

What is the overhead of ZFS RAIDz1/2 in HPC SSD Environment?

Example hardware / host: Modern 64 Core CPU, 128GB Memory 8 x Micron Pro 15.36TB u.2 SSDs SSDs connected by dedicated Oculink per device (no backplane or PCIe sharing) Ubuntu 20.04 Use case: A ...
epea's user avatar
  • 406
1 vote
1 answer

Wrong LDAP user ID is mapped into Slurm account management service

I configured a Slurm head node as follows: sssd to contact openLDAP slurmctld/slurmdbd/slurmd/munged to act as the Slurm controller and compute node ...where ray.williams is an LDAP user. Its UID ...
Nicolas De Jay's user avatar
2 votes
1 answer

HTCondor high availability

I am currently trying to make the job queue and submission mechanism of a local, isolated HTCondor cluster highly available. The cluster consists of 2 master servers (previously 1) and several compute ...
Christian Hennen's user avatar
-1 votes
1 answer

Using HPC managers like Slurm on multiple servers in LAN [closed]

I have access to a group of servers connected with a 1Gb LAN, and each of them has 40+ cores and Ubuntu OS. They all have a common NAS. I installed SLURM on a few of them and configured it so that ...
Cindy Almighty's user avatar
0 votes
0 answers

What can be a reason for different clock speeds between sockets on 2 x Xeon Scalable 6148?

I have a server with dual Xeon Scalable 6148 CPUs running an HPC application. Base clock: 2.4 GHz. All-core Turbo: 3.1 GHz. Some processing threads are not scaling well and are sensitive to CPU clock. I ...
terion's user avatar
  • 1
1 vote
1 answer

Single-node SLURM server: restrict interactive CPU usage

I have SLURM set up on a single node, which is also a 'login node'. I would like to restrict interactive CPU usage, i.e. outside the scheduling system. I found the following article which suggests to ...
Compizfox's user avatar
  • 384
0 votes
1 answer

Ubuntu server vs Ubuntu desktop for Beowulf cluster

I want to create a Beowulf cluster using Ubuntu 18. Looking at some guides, they all seem to use Ubuntu Server for this, and my question is why? Is it not possible to use Ubuntu Desktop for the client ...
Rickard Johansson's user avatar
-1 votes
1 answer

How is it that Summit at Oak Ridge National Lab has 2,414,592 cores? [closed]

Top500 says that Summit has 2,414,592 cores: But they have 4608 nodes, 9216 chips (each node has 2 chips), and 22 cores per chip. This is 202,752 cores. Where ...
user1271772's user avatar
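
The CPU-core arithmetic quoted in the excerpt above can be checked with a minimal Python sketch. The constant and function names are illustrative assumptions; the figures are the ones stated in the question. The sketch only reproduces the asker's count and the size of the gap to the Top500 figure; it does not account for where the remaining cores come from.

```python
# Figures as quoted in the question excerpt.
NODES = 4608
CHIPS_PER_NODE = 2
CORES_PER_CHIP = 22
TOP500_REPORTED_CORES = 2_414_592


def cpu_core_count(nodes: int, chips_per_node: int, cores_per_chip: int) -> int:
    """Total CPU cores, counting only the CPU chips in each node."""
    return nodes * chips_per_node * cores_per_chip


chips = NODES * CHIPS_PER_NODE
cores = cpu_core_count(NODES, CHIPS_PER_NODE, CORES_PER_CHIP)
gap = TOP500_REPORTED_CORES - cores  # cores the CPU-only count leaves unexplained

print(f"chips: {chips}")
print(f"CPU cores: {cores}")
print(f"unexplained by CPU count: {gap}")
```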
0 votes
1 answer

What does "CPU Minutes" mean exactly?

I'm actually trying to report cluster utilization in Slurm but I don't understand the metric CPU Minutes. [root@XXXX]# sreport cluster Utilization Start=2018-12-01 End=2018-12-31 ---------------------...
m4hmud's user avatar
  • 3
0 votes
1 answer

Exascale Power Consumption

I have read a lot of articles about exascale and found that it may consume approximately 20 MW of power. Is that on a daily basis, a yearly basis, or every second? Please enlighten me. Here are ...
alyssaeliyah's user avatar
0 votes
1 answer

Configure Singularity to do headless rendering / use OpenGL / glxgears / glxinfo

I want to do headless rendering on a server where I do not have root permissions. Therefore, I created a Singularity container like this: Bootstrap: docker From: nvidia/cuda:9.0-runtime-ubuntu16.04 %...
thigi's user avatar
  • 101
3 votes
1 answer

Intel Xeon 6134 + One DIMM per channel or two DIMMs per channel for maximum memory bandwidth?

I'm unable to find this critical piece of information in spec sheets. Appreciate any insight. We're purchasing servers for HPC work with intel Xeon Gold 6134 (Skylake) cpus I want maximum memory ...
Aravindh Sathish's user avatar
3 votes
1 answer

Microsoft HPC Pack 2012 R2 does not run with Network Direct after joining new domain

I am working with a 13 computer cluster, running on Windows Server 2012 R2, using MS HPC Pack 2012 R2. The headnode is working properly. The servers are connected to the corporate network via IPv4 on ...
swaglord mcmuffin''s user avatar
2 votes
1 answer

Are processors more efficient at lower temperatures

I couldn't find a stackexchange site that's more suited for this question, I apologize. This doesn't necessarily have a lot to do with servers and stability.... I do not have problems with stability ...
xyious's user avatar
  • 343
0 votes
0 answers

How to handle MPI head node failure?

There is an app which is started with mpirun. If a compute node fails, then all processes crash, but if only the head node fails (for example, reboots), then the processes get stuck on the compute nodes. How to get rid of ...
Severgun's user avatar
  • 163
3 votes
3 answers

Randomize Slurm Node Allocation

Has anyone had luck randomizing Slurm node allocations? We have a small cluster of 12 nodes that could be used by anywhere from 1-8 people at a time with jobs of various size/length. When testing our ...
tnallen's user avatar
  • 31
3 votes
1 answer

MAAS for diskless computational hpc cluster

I'm considering using MAAS to deploy the OS for a computational cluster. All nodes are diskless. Only the head node and (probably) the MAAS rack controller will have hard drives. It seems MAAS has to finish a ...
rth's user avatar
  • 135
0 votes
0 answers

Ideal configuration for a head node? [duplicate]

Which hardware should I concentrate on, when assembling a head node for an HPC cluster? The main task for the head node is to relay instructions to the compute nodes which will be running artificial ...
Rushat Rai's user avatar
1 vote
1 answer

ifconfig apparently showing wrong RX/TX values for InfiniBand HCA

Recently, I executed a watch -n 1 ifconfig on one of our Linux cluster computing nodes while it was running a 48-process MPI run, distributed over several nodes. Oddly, while Ethernet packets seem to ...
andreee's user avatar
  • 133
1 vote
2 answers

Containers for HPC batch processing

We are facing the problem that a lot of people want to run different scientific software on our high performance computing cluster. Every user requires a different set of libraries and library ...
J. Doe's user avatar
  • 13
-3 votes
1 answer

Solutions for monetizing excess CPU cycles [closed]

My company has a (relatively) big computer farm, say, 100 physical servers (dual-CPU hexacore E5 Xeons with 160 GB RAM) leased from some hardware provider (say Leaseweb or OVM) on a monthly basis, means,...
rlib's user avatar
  • 195
2 votes
1 answer

How many infiniband adapters should be used in multi socket servers?

Should dual-socket motherboards have an InfiniBand adapter for each CPU? That is, should there be two InfiniBand adapters, one in each CPU's PCIe slot? Would this eliminate the signal going ...
Darthtrader's user avatar