Core
The Core contains the backbone for networking, cryptographic operations, and hardware acceleration techniques such as Bitslicing, GPU acceleration, and Vectorization.
The Core is divided into the following sub-components:
SUB-COMPONENT | DESCRIPTION |
---|---|
Arch | Contains the architecture-specific headers for Bitslicing and Vectorization |
Crypto | Contains implementations of cryptographic operations such as AES and SHA |
CUDA | Contains standalone CUDA implementations for matrix multiplications and convolutions |
Networking | Handles socket communication and SSL/TLS connections between parties |
Other | Contains microbenchmarks for cryptographic operations and basic operations, utilities for printing and debugging, and SSL certificates used by the nodes to establish secure connections |
Tip
We provide different configuration options to support various architectures. These options can be set in the Makefile or config.h file. It is worth experimenting with different configurations to find the best performance for a specific use case.
Arch
Bitslicing and Vectorization become more effective with larger register sizes. However, not all CPUs support wider registers and not all functions require high parallelism.
Arch contains headers for common X86 architectures such as SSE and AVX512.
The headers contain efficient conversions for Bitslicing and Vectorization and map architecture-specific operations to generic functions.
This way a user can write code that is architecture-agnostic, and compile the code with the appropriate flags to increase or decrease the level of parallelism.
The DATTYPE config option defines the register size. For instance, setting DATTYPE to 512 will use AVX512 registers for Bitslicing and Vectorization, while setting it to 32 will use uint32_t registers.
The following options are available.
Register Size | Requirements | Supported BITLENGTH | Config Option |
---|---|---|---|
512 | AVX512 | 16, 32, 64 | DATTYPE=512 |
256 | AVX2 | 16, 32, (64 with AVX512) | DATTYPE=256 |
128 | SSE | 16, 32, (64 with AVX512) | DATTYPE=128 |
64 | None | 64 | DATTYPE=64 |
32 | None | 32 | DATTYPE=32 |
16 | None | 16 | DATTYPE=16 |
8 | None | 8 (Does not support all arithmetic instructions) | DATTYPE=8 |
1 | None | 8,16,32,64 (Use only for boolean circuits) | DATTYPE=1 |
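As a rough illustration of how architecture-specific operations can be mapped to generic functions, the sketch below selects a register type from DATTYPE at compile time. The names reg_t, generic_and, generic_xor, and generic_mux are hypothetical and are not taken from the Arch headers. Only a portable 64-bit path and an AVX2 path are shown, and the default value of DATTYPE is an assumption made so that the snippet compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Assumption: DATTYPE normally comes from the build configuration.
// A portable 64-bit default is used here so the sketch is self-contained.
#ifndef DATTYPE
#define DATTYPE 64
#endif

#if DATTYPE == 256 && defined(__AVX2__)
// 256-bit AVX2 register: one bitwise instruction evaluates 256 boolean gates.
using reg_t = __m256i;
inline reg_t generic_and(reg_t a, reg_t b) { return _mm256_and_si256(a, b); }
inline reg_t generic_xor(reg_t a, reg_t b) { return _mm256_xor_si256(a, b); }
#elif DATTYPE == 64
// Plain 64-bit integer: one bitwise instruction evaluates 64 boolean gates.
using reg_t = std::uint64_t;
inline reg_t generic_and(reg_t a, reg_t b) { return a & b; }
inline reg_t generic_xor(reg_t a, reg_t b) { return a ^ b; }
#else
#error "This sketch only covers DATTYPE=64 and DATTYPE=256 (with AVX2)."
#endif

// Architecture-agnostic user code: a bitsliced multiplexer.
// For every bit position i: out[i] = sel[i] ? a[i] : b[i].
inline reg_t generic_mux(reg_t sel, reg_t a, reg_t b) {
    return generic_xor(b, generic_and(sel, generic_xor(a, b)));
}
```

The function generic_mux never mentions a concrete register type, so the same source evaluates 64 or 256 multiplexers per call depending only on the value of DATTYPE chosen at compile time.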
The vectorization factor for a DATTYPE can be calculated as DATTYPE / BITLENGTH. For instance, if DATTYPE is 256 and BITLENGTH is 32, each arithmetic instruction will be executed on 256/32=8 inputs in parallel, and each boolean operation will be executed 256 times in parallel. With DATTYPE=32 and BITLENGTH=32, arithmetic operations are not vectorized, and with DATTYPE=1 and BITLENGTH=32, boolean operations are not bitsliced.
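The arithmetic above can be written down as two small compile-time helpers. This is only a sketch of the formula from the text; the function names arithmetic_lanes and boolean_lanes are hypothetical and do not appear in the framework.

```cpp
// Number of arithmetic values processed by one instruction.
// Meaningful when dattype >= bitlength; DATTYPE=1 is intended for boolean circuits only.
constexpr int arithmetic_lanes(int dattype, int bitlength) {
    return dattype / bitlength;
}

// Number of independent boolean gates evaluated by one bitwise instruction:
// every bit of the register carries its own circuit instance.
constexpr int boolean_lanes(int dattype) {
    return dattype;
}

static_assert(arithmetic_lanes(256, 32) == 8, "256-bit register holds eight 32-bit values");
static_assert(boolean_lanes(256) == 256, "256 boolean gates per instruction");
static_assert(arithmetic_lanes(32, 32) == 1, "no arithmetic vectorization");
static_assert(boolean_lanes(1) == 1, "no bitslicing");
```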
Warning
Some BITLENGTHs may not be supported for all DATTYPEs. For instance, AVX2 does not support arithmetic operations on packed 64-bit integers. Check the table above for the BITLENGTHs supported by each DATTYPE. All nodes must use the same DATTYPE and BITLENGTH to ensure compatibility.
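One way to catch an unsupported combination early is a compile-time check that mirrors the table above. The helper below is a sketch: is_supported and its has_avx512 parameter are hypothetical names, and the function only encodes the BITLENGTH column of the table. It does not check whether AVX2 or SSE are actually available on the target CPU.

```cpp
// Encodes the "Supported BITLENGTH" column of the DATTYPE table.
// has_avx512 states whether the target CPU provides AVX512 (supplied by the caller).
constexpr bool is_supported(int dattype, int bitlength, bool has_avx512) {
    switch (dattype) {
        case 512:  // requires AVX512
            return has_avx512 && (bitlength == 16 || bitlength == 32 || bitlength == 64);
        case 256:
        case 128:  // 64-bit values only with AVX512
            return bitlength == 16 || bitlength == 32 || (bitlength == 64 && has_avx512);
        case 64: return bitlength == 64;
        case 32: return bitlength == 32;
        case 16: return bitlength == 16;
        case 8:  return bitlength == 8;  // not all arithmetic instructions are supported
        case 1:  // boolean circuits only
            return bitlength == 8 || bitlength == 16 || bitlength == 32 || bitlength == 64;
        default: return false;
    }
}

static_assert(is_supported(256, 32, false), "AVX2 with 32-bit values");
static_assert(!is_supported(256, 64, false), "64-bit values on 256-bit registers need AVX512");
```

A build could place a similar static_assert next to its DATTYPE and BITLENGTH definitions so that a mismatch fails at compile time rather than at run time.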
References
The architecture-specific headers for vectorization and Bitslicing are adapted from USUBA, MIT LICENSE.
Crypto
Crypto contains cryptographic implementations for AES and SHA. The AES implementation is used to generate shared random numbers between parties and the SHA implementation is used to compare the views of the parties to ensure consistent messages. We support three different AES implementations for different architectures and two different SHA implementations.
Implementation | Description | Config Option |
---|---|---|
AES (X86) | Uses the AES-NI or VAES instruction set for AES encryption and decryption. This option is usually the fastest if the CPU supports AES-NI or VAES. VAES uses wider registers to increase parallelism. | RANDOM_ALGORITHM=2 |
AES (Bitslicing) | A Bitsliced implementation of AES that does not require any special instruction set. | RANDOM_ALGORITHM=1 |
AES (OPENSSL) | Uses the OpenSSL library for AES encryption and decryption. | USE_SSL_AES=1 and RANDOM_ALGORITHM=2 |
SHA (X86) | Uses the SHA-NI instruction set for SHA hashing if available. This option is usually the fastest if the CPU supports SHA-NI. If SHA-NI is not available, this option falls back to a non-accelerated implementation of SHA. | USE_ARM=0 |
SHA (ARM) | Uses SHA instruction set for ARM CPUs. | USE_ARM=1 |
Tip
The VERIFY_BUFFER option sets how many messages are accumulated before a hash is computed. Setting VERIFY_BUFFER=0 will buffer all messages and compute a single hash at the end of the protocol.
Usually, a smaller buffer provides the best results. Note that to ensure correctness, the inequality \(VERIFY\_BUFFER \cdot DATTYPE \ge 512\) must hold, since the SHA function requires at least 512 bits of input to compute a hash. Setting VERIFY_BUFFER to 512/DATTYPE usually achieves the best performance.
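The constraint on VERIFY_BUFFER can be checked with a few lines of compile-time arithmetic. The sketch below uses hypothetical helper names (min_verify_buffer, verify_buffer_ok) and only covers positive buffer sizes; the special value VERIFY_BUFFER=0 is deliberately left out.

```cpp
// The SHA function consumes at least 512 bits of input.
constexpr int kShaInputBits = 512;

// Smallest positive VERIFY_BUFFER satisfying VERIFY_BUFFER * DATTYPE >= 512.
constexpr int min_verify_buffer(int dattype) {
    return (kShaInputBits + dattype - 1) / dattype;  // ceiling division
}

// Checks the inequality for a positive buffer size.
constexpr bool verify_buffer_ok(int verify_buffer, int dattype) {
    return static_cast<long long>(verify_buffer) * dattype >= kShaInputBits;
}

static_assert(min_verify_buffer(256) == 2, "two 256-bit messages fill 512 bits");
static_assert(verify_buffer_ok(2, 256), "2 * 256 >= 512");
static_assert(!verify_buffer_ok(1, 256), "1 * 256 < 512");
```

For example, DATTYPE=512 allows a buffer of a single message, while DATTYPE=1 needs at least 512 messages per hash.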
References
The AES-NI implementation is adapted from AES-Brute-Force, Apache 2.0 LICENSE
The bitsliced AES implementation is adapted from USUBA, MIT LICENSE.
The SHA-256 implementation is adapted from SHA-Intrinsics, No License.
CUDA
In the CUDA folder, we provide standalone implementations of matrix multiplications and convolutions. These need to be compiled separately and will then be linked automatically by setting the appropriate flags in the project's Makefile or config.h.
To compile, the node requires a CUDA-compatible GPU and the CUDA toolkit installed. The implementations also require the CUTLASS library that provides templated CUDA kernels.
The executables can be built as follows.
# Dependencies for GPU acceleration
git clone https://github.com/NVIDIA/cutlass.git
# Compile standalone executable for GPU acceleration
cd core/cuda
make -j arch=sm_89 CUDA_PATH=/usr/local/cuda CUTLASS_PATH=/home/user/cutlass # Replace with your architecture, nvcc path and CUTLASS path
Note that some target architectures may not support datatypes such as uint16_t on the GPU. Thus, by default, some of these are commented out in our source files. If you want to use a BITLENGTH of 16 with GPU acceleration, you can uncomment the appropriate lines in the .cu source files of the CUDA directory.
We provide four different options for accelerating matrix multiplications and convolutions on the GPU, in addition to a CPU-only baseline.
Approach | Description | Config Option |
---|---|---|
CPU Matrix Multiplication | The matrix multiplication is handled on the CPU but utilizes optimizations such as cache tiling and transposing. | USE_CUDA_GEMM=0 |
GPU Matrix Multiplication | Accelerates matrix multiplications on the GPU. Convolutions are split into an im2col layout change, which is handled by the CPU, and a matrix multiplication, which is the only part outsourced to the GPU. | USE_CUDA_GEMM=1 |
Optimized GPU Matrix Multiplication | Similar to USE_CUDA_GEMM=1 but provides improved data transfer and scheduling on the GPU that should be faster on most architectures. | USE_CUDA_GEMM=3 |
NCHW GPU Convolution | Handles the whole convolution operation on the GPU. This is usually faster than the previous approaches. Separate matrix multiplications are handled the same way as in USE_CUDA_GEMM=3. | USE_CUDA_GEMM=2 |
CHWN GPU Convolution | Similar to USE_CUDA_GEMM=2 but uses a different layout for the input and output tensors. This layout may be faster on some architectures. | USE_CUDA_GEMM=4 |
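To illustrate the im2col approach mentioned in the table, the following CPU-only sketch rewrites a convolution as a layout change followed by a single matrix multiplication. It is not the framework's implementation: the function names, the std::uint32_t element type, and the restriction to stride 1 without padding are all assumptions made to keep the example short. In a GPU-accelerated setup, only the gemm step would be the part handed to the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using elem_t = std::uint32_t;  // stand-in for a 32-bit value (assumption)

// Rearranges input[c][y][x] into a (C*K*K) x (outH*outW) row-major matrix so that
// each column holds one receptive field. Stride 1, no padding.
std::vector<elem_t> im2col(const std::vector<elem_t>& input, int C, int H, int W, int K) {
    const int outH = H - K + 1;
    const int outW = W - K + 1;
    std::vector<elem_t> cols(static_cast<std::size_t>(C) * K * K * outH * outW);
    for (int c = 0; c < C; ++c)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx) {
                const int row = (c * K + ky) * K + kx;
                for (int oy = 0; oy < outH; ++oy)
                    for (int ox = 0; ox < outW; ++ox)
                        cols[static_cast<std::size_t>((row * outH + oy) * outW + ox)] =
                            input[static_cast<std::size_t>((c * H + oy + ky) * W + ox + kx)];
            }
    return cols;
}

// Plain row-major matrix multiplication: out (M x N) = a (M x Kd) * b (Kd x N).
std::vector<elem_t> gemm(const std::vector<elem_t>& a, const std::vector<elem_t>& b,
                         int M, int Kd, int N) {
    std::vector<elem_t> out(static_cast<std::size_t>(M) * N, 0);
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < Kd; ++k)
            for (int j = 0; j < N; ++j)
                out[static_cast<std::size_t>(i * N + j)] +=
                    a[static_cast<std::size_t>(i * Kd + k)] * b[static_cast<std::size_t>(k * N + j)];
    return out;
}

// Convolution with F filters of shape C x K x K, expressed as im2col followed by GEMM.
// weights is an F x (C*K*K) row-major matrix; the result is F x (outH*outW).
std::vector<elem_t> conv2d_via_gemm(const std::vector<elem_t>& input,
                                    const std::vector<elem_t>& weights,
                                    int C, int H, int W, int K, int F) {
    const int outH = H - K + 1;
    const int outW = W - K + 1;
    return gemm(weights, im2col(input, C, H, W, K), F, C * K * K, outH * outW);
}
```

For a 3x3 input holding the values 1 to 9 and a single 2x2 filter of ones, conv2d_via_gemm returns the four window sums 12, 16, 24, and 28.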
References
CUDA GEMM and Convolution implementations are adapted from Cutlass, LICENSE and Piranha, MIT LICENSE.
Networking
In the networking folder, we provide implementations for socket communication and TCP/TLS connections between parties. Each party has a sending and receiving thread for each party it communicates with. The threads have the following responsibilities.
- Sending Threads: The sending threads wait until the main thread signals a condition variable by invoking the send() method. This happens either manually or when the SEND_BUFFER is full.
- Receiving Threads: The receiving threads continuously buffer incoming messages and signal the main thread whenever the RECV_BUFFER is full. Whenever the main thread requires a new chunk of messages, it calls the receive method, which consumes the data if it is already buffered or blocks until the receiving thread has buffered the required data.
This execution model ensures that all communication between parties runs in parallel with the computation and that the main thread is only blocked when it needs to wait for a specific message to arrive before proceeding.
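The receiving side of this model can be sketched with a mutex and a condition variable. The class below is a simplified, in-process illustration and not the framework's networking code: ReceiveQueue, on_element, flush, and run_round are hypothetical names, the "network" is simulated by a vector, and the sending side is omitted. A buffer size of 0 is treated as "publish only at the end of the round".

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

class ReceiveQueue {
public:
    explicit ReceiveQueue(std::size_t recv_buffer) : recv_buffer_(recv_buffer) {}

    // Receiving thread: called for every element that arrives.
    void on_element(std::uint64_t value) {
        pending_.push_back(value);
        if (recv_buffer_ != 0 && pending_.size() >= recv_buffer_) flush();
    }

    // Receiving thread: publishes buffered elements and signals the main thread.
    void flush() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ready_.insert(ready_.end(), pending_.begin(), pending_.end());
        }
        pending_.clear();
        ready_cv_.notify_one();
    }

    // Main thread: consumes `count` elements, blocking until they are buffered.
    std::vector<std::uint64_t> receive(std::size_t count) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_cv_.wait(lock, [&] { return ready_.size() >= count; });
        const auto end = ready_.begin() + static_cast<std::ptrdiff_t>(count);
        std::vector<std::uint64_t> out(ready_.begin(), end);
        ready_.erase(ready_.begin(), end);
        return out;
    }

private:
    std::size_t recv_buffer_;
    std::vector<std::uint64_t> pending_;  // touched only by the receiving thread
    std::deque<std::uint64_t> ready_;     // shared, guarded by mutex_
    std::mutex mutex_;
    std::condition_variable ready_cv_;
};

// Simulates one communication round: a receiving thread buffers `incoming`
// while the main thread blocks in receive() until everything has arrived.
std::vector<std::uint64_t> run_round(const std::vector<std::uint64_t>& incoming,
                                     std::size_t recv_buffer) {
    ReceiveQueue queue(recv_buffer);
    std::thread receiver([&] {
        for (std::uint64_t v : incoming) queue.on_element(v);
        queue.flush();  // publish any remainder at the end of the round
    });
    std::vector<std::uint64_t> got = queue.receive(incoming.size());
    receiver.join();
    return got;
}
```

The main thread only waits inside receive() when the requested elements have not been published yet, which is the blocking behavior described above. On some toolchains the example needs to be compiled with thread support enabled (for instance the -pthread flag with GCC or Clang).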
The following configuration options are available for networking.
Option | Description | Config Option |
---|---|---|
Encrypted Communication | Encrypts messages between the parties using TLS implemented by OpenSSL | USE_SSL=1 |
Send Buffer Size | Buffers x elements before sending a message. Setting x=0 waits until all elements for a communication round are available. Each element is of size DATTYPE bits. | SEND_BUFFER=x |
Receive Buffer Size | Buffers y elements before signaling the main thread. Setting y=0 only signals the main thread when all elements for a communication round are available. Each element is of size DATTYPE bits. | RECV_BUFFER=y |
Tip
Experimenting with different buffer sizes can improve performance significantly. The default option of SEND_BUFFER=10000 and RECV_BUFFER=10000 showed good performance in various experiments.
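Since each buffered element is DATTYPE bits wide, the payload accumulated before a send or a signal follows directly from the buffer size. The helper below is a small sketch of that arithmetic; buffer_payload_bytes is a hypothetical name, and the result ignores any protocol or TLS overhead.

```cpp
// Payload in bytes held by a buffer of `elements` entries of DATTYPE bits each.
// Rounded up to whole bytes; transport overhead is not included.
constexpr long long buffer_payload_bytes(long long elements, int dattype) {
    return (elements * dattype + 7) / 8;
}

static_assert(buffer_payload_bytes(10000, 256) == 320000, "10000 elements of 256 bits");
static_assert(buffer_payload_bytes(10000, 32) == 40000, "10000 elements of 32 bits");
```

With the default of 10000 elements, the same setting therefore corresponds to very different message sizes depending on DATTYPE, which is one reason to revisit the buffer sizes when changing the register width.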