51 Tflops at FP32 precision
17 Tflops at FP64 precision
454 GB RAM (max 256 GB RAM on a single motherboard)
35000+ math coprocessors (FPUs of the CPU and GPU subsystems)
100% allocated to the research of QMatterPhotonics
LBTS-εpsilon is our own small supercomputer, whose design combines "relatively affordable and small" (hence the name) with "massively parallel supercomputer". It is being deployed in 2021.
Smallness and affordability bring the key advantage of not having to share the computer with other research groups: its resources are allocated 100% to us. As a consequence, and despite being a modest machine compared with today's most powerful computers, LBTS-εpsilon gives us forefront numerical calculation capabilities.
The computer has a hybrid architecture composed of CPUs (x86_64 instruction set, 112 cores distributed across four nodes) plus GPGPU cards configured as mathematical coprocessors (nVidia CUDA Tesla, 34944 math cores in total across the four nodes, one FPU per math core). The system is best optimized for math operations at FP32 precision, at which it delivers a peak computing speed of 51 Tflops. (It is capable of 17 Tflops at FP64 precision; for comparison, a typical x86_64 core is capable of about 0.01 Tflops at FP64 and 0.02 Tflops at FP32.)
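As a rough illustration of the comparison above, the following Python sketch converts the quoted peak speeds into an equivalent number of typical x86_64 cores. It uses only the figures given in the text; the function and variable names are hypothetical, and the result should be read as an order-of-magnitude estimate rather than a benchmark.

```python
# Peak figures quoted in the text, in Tflops.
SYSTEM_PEAK = {"FP32": 51.0, "FP64": 17.0}
# Approximate peak of one typical x86_64 core, as quoted in the text.
TYPICAL_CORE_PEAK = {"FP32": 0.02, "FP64": 0.01}


def equivalent_cpu_cores(precision):
    """Return how many typical x86_64 cores would match the system's peak speed."""
    return round(SYSTEM_PEAK[precision] / TYPICAL_CORE_PEAK[precision])


if __name__ == "__main__":
    for precision in ("FP32", "FP64"):
        print(precision, equivalent_cpu_cores(precision))
```

With these figures the sketch gives 2550 core-equivalents at FP32 and 1700 at FP64, against the 112 CPU cores actually installed.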
LBTS-εpsilon has a mixed CPU+GPU architecture, distributed in four nodes.
Concerning the CPU subsystem, the "node BIG" is built upon an HP Proliant DL585G7 server with 48 AMD Opteron CPU cores and 256 GB of main RAM. The "node MID" is built upon an HP Proliant DL585G7 server with 32 AMD Opteron CPU cores and 64 GB of main RAM. The "node small 1" and "node small 2" are also based on Proliant servers, each with 16 AMD Opteron CPU cores and 64 GB of RAM.
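The per-node CPU figures above can be tallied with a short Python sketch. The dictionary layout and function names are assumptions made for illustration; the numbers are those listed in the text.

```python
# Per-node CPU figures as listed in the text (keys are the document's node labels).
NODES = {
    "BIG": {"cpu_cores": 48, "ram_gb": 256},
    "MID": {"cpu_cores": 32, "ram_gb": 64},
    "small 1": {"cpu_cores": 16, "ram_gb": 64},
    "small 2": {"cpu_cores": 16, "ram_gb": 64},
}


def total_cpu_cores(nodes):
    """Sum the CPU cores over all nodes."""
    return sum(node["cpu_cores"] for node in nodes.values())


def largest_node_ram(nodes):
    """Return (node name, RAM in GB) for the node with the most main RAM."""
    name = max(nodes, key=lambda key: nodes[key]["ram_gb"])
    return name, nodes[name]["ram_gb"]


if __name__ == "__main__":
    print("Total CPU cores:", total_cpu_cores(NODES))
    print("Largest single-node RAM:", largest_node_ram(NODES))
```

The tally reproduces the 112 CPU cores quoted for the whole system and the 256 GB maximum on a single motherboard (node BIG).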
Concerning the GPU subsystem, we performed custom modifications of the power supply subsystem of the host computers so that they can accommodate more CUDA cards than originally specified by the manufacturer: nodes BIG and MID accommodate six nVidia Tesla K20x cards each, and nodes small 1 and small 2 have three each. Each K20x card packs 2496 math cores with 5 GB of in-card RAM and is capable of about 3.5 Tflops at FP32 precision (1.2 Tflops at FP64).
The two small secondary nodes serve the crucial purpose of developing and optimizing the software routines for each research project. Optimizing for CUDA coprocessors usually involves calling the corresponding specialized versions of most scientific libraries, or compiling the code against CUDA-enabled math libraries. We usually employ the ArrayFire libraries.