NAME

Task::MemManager::Device - Device-specific memory management extensions for Task::MemManager

VERSION

version 0.02

SYNOPSIS

    use Task::MemManager::Device;    # Use default NVIDIA_GPU device

    my $buffer = Task::MemManager->new(1000, 4);

    # Map buffer to GPU
    $buffer->device_movement(
        action    => 'enter',
        direction => 'to',
        device    => 'NVIDIA_GPU',
        device_id => 0
    );

    # Perform GPU operations (using your C code)
    my_gpu_function($buffer->get_buffer, $buffer->get_buffer_size);

    # Update buffer from GPU back to CPU
    $buffer->device_movement(action => 'update', direction => 'from');

    # Exit and deallocate from GPU
    $buffer->device_movement(action => 'exit', direction => 'from');

DESCRIPTION

Task::MemManager::Device extends the "Task::MemManager" module by providing device-specific memory management capabilities, particularly for GPU computing using OpenMP target directives. It enables seamless data movement between CPU and GPU memory spaces, supporting various mapping strategies (to, from, tofrom, alloc) and update operations.

The module dynamically generates device-specific modules using Inline::C and OpenMP pragmas, allowing for flexible device support. By default, it provides NVIDIA GPU support with appropriate compilation flags, but it can be extended to support AMD GPUs and other devices.

Device modules are automatically loaded and compiled on first use, with the generated code cached by Inline::C for subsequent runs. Each device module implements a set of standard functions for entering data regions, exiting data regions, and updating data between host and device.
LOADING THE MODULE

The module can be loaded with or without specifying device modules:

    # Load with default NVIDIA_GPU device
    use Task::MemManager::Device;

    # Load with specific devices
    use Task::MemManager::Device qw(NVIDIA_GPU AMD_GPU);

    # Load via Task::MemManager with device specification
    use Task::MemManager Device => ['NVIDIA_GPU'];

    # Combine with allocator and view specifications
    use Task::MemManager
        Allocator => 'CMalloc',
        View      => 'PDL',
        Device    => 'NVIDIA_GPU';

METHODS

device_movement

    $buffer->device_movement(%options);

Manages data movement between CPU and device (GPU) memory spaces using OpenMP target directives. This is the primary method for controlling data placement and updates.

Parameters:

* "action" - The type of operation to perform. Required. One of:

    * 'enter' - Begin a data mapping region (allocate on device, optionally copy)
    * 'exit' - End a data mapping region (optionally copy back, deallocate)
    * 'update' - Update data between host and device without changing the mapping

* "direction" - The data transfer direction. Required. One of:

    * 'to' - Copy data from host to device
    * 'from' - Copy data from device to host
    * 'tofrom' - Copy data both ways (enter: to device, exit: from device)
    * 'alloc' - Allocate device memory without copying (enter only)
    * 'release' - Deallocate device memory without copying (exit only)
    * 'delete' - Deallocate device memory, discarding changes (exit only)

* "device" - Device module name. Optional. Default: 'NVIDIA_GPU'

* "device_id" - Device ID number for multi-device systems. Optional. Default: 0

* "start" - Starting byte offset in the buffer. Optional. Default: 0

* "end" - Ending byte position in the buffer. Optional.
Default: buffer size

Returns: Nothing (dies on error)

Throws:

* Dies if the action/direction combination is invalid
* Dies if attempting to manage the same device_id with different device modules
* Dies if attempting to enter-map the same buffer twice on the same device

Examples:

    # Map buffer to GPU, copying data
    $buffer->device_movement(action => 'enter', direction => 'to');

    # Allocate GPU memory without copying
    $buffer->device_movement(action => 'enter', direction => 'alloc');

    # Update partial buffer region from GPU
    $buffer->device_movement(
        action    => 'update',
        direction => 'from',
        start     => 0,
        end       => 1000
    );

    # Exit mapping, copying data back and deallocating
    $buffer->device_movement(action => 'exit', direction => 'from');

    # Exit mapping with release (keep mapping but allow reuse)
    $buffer->device_movement(action => 'exit', direction => 'release');

DEVICE FUNCTIONS

Each device module provides the following functions (where <device> is replaced with the device name, e.g., NVIDIA_GPU):

* "<device>_enter_to_gpu" - Map data to device (copy from host)
* "<device>_enter_tofrom_gpu" - Map data bidirectionally
* "<device>_enter_alloc_gpu" - Allocate on device without copying
* "<device>_exit_from_gpu" - Unmap data from device (copy to host)
* "<device>_exit_tofrom_gpu" - Unmap bidirectional data
* "<device>_exit_release_gpu" - Release mapping without copying
* "<device>_exit_delete_gpu" - Delete mapping and discard data
* "<device>_update_to_gpu" - Update data to device
* "<device>_update_from_gpu" - Update data from device

These functions are registered automatically and called by the "device_movement" method; they should not typically be called directly.
COMPILATION OPTIONS

The module supports device-specific compilation options for optimal performance:

NVIDIA_GPU (default)

    COMPILER_FLAGS: -fno-stack-protector -fcf-protection=none -fopenmp
                    -std=c11 -fPIC -Wall -Wextra
    CCEXFLAGS:      -foffload=nvptx-none
    LINKER_FLAGS:   -fopenmp (with system lddlflags)
    OPTIMIZE:       -O3 -march=native

AMD_GPU

    COMPILER_FLAGS: (same as NVIDIA_GPU)
    CCEXFLAGS:      (none - AMD offloading under development)
    LINKER_FLAGS:   -fopenmp (with system lddlflags)
    OPTIMIZE:       -O3 -march=native

DEFAULT (for other devices)

    COMPILER_FLAGS: (same as NVIDIA_GPU)
    CCEXFLAGS:      -fopenmp
    LINKER_FLAGS:   -fopenmp (with system lddlflags)
    OPTIMIZE:       -O3 -march=native

EXAMPLES

Example 1 is a complete working example demonstrating basic GPU memory mapping, computation, and retrieval of results. Example 2 shows how to allocate GPU memory without an initial data copy. Example 3 illustrates combining device management with PDL views for seamless integration with the Perl Data Language.

Example 1: Basic GPU Memory Mapping

This example demonstrates the fundamental pattern of mapping memory to the GPU, performing computations, and retrieving the results.

    use Task::MemManager::Device;
    use Inline (
        C         => Config =>
        ccflags   => "-fno-stack-protector -fcf-protection=none "
                   . " -fopenmp -Iinclude -std=c11 -fPIC "
                   . " -Wall -Wextra -Wno-unused-function -Wno-unused-variable" .
                     " -Wno-unused-but-set-variable ",
        lddlflags => join( q{ }, $Config::Config{lddlflags}, q{-fopenmp} ),
        ccflagsex => " -fopenmp ",
        libs      => q{ -lm -foffload=-lm },
        optimize  => "-O3 -march=native",
    );    # replace with your OpenMP device flags
    use Inline C => 'DATA';

    my $buffer_length = 250000;
    my $buffer = Task::MemManager->new($buffer_length, 4);

    # Map buffer to GPU
    $buffer->device_movement(action => 'enter', direction => 'to');

    # Perform GPU computation
    assign_as_float($buffer->get_buffer, $buffer->get_buffer_size);

    # Update results back to CPU
    $buffer->device_movement(action => 'update', direction => 'from');

    # Verify results by printing some values
    my @values = unpack("f*",
        $buffer->extract_buffer_region(0, $buffer->get_buffer_size - 1));
    print "First 10 values: ", join(", ", @values[0..9]), "\n";
    print "Last 10 values: ",  join(", ", @values[-10..-1]), "\n";

    # Exit GPU mapping
    $buffer->device_movement(action => 'exit', direction => 'from');

    __DATA__
    __C__
    #include "omp.h"

    void assign_as_float(unsigned long arr, size_t n) {
        float *array_addr = (float *)arr;
        size_t len = n / sizeof(float);
    #pragma omp target
        for (size_t i = 0; i < len; i++) {
            array_addr[i] = (float)i * 2.0f;
        }
    }

Example 2: GPU Memory Allocation Without an Initial Copy

When you want to allocate GPU memory but do not need to copy initial data (e.g., for output-only computations):

    # See Example 1 for the use statements and Inline C setup
    my $buffer = Task::MemManager->new(1000000, 4);

    # Allocate GPU memory without copying
    $buffer->device_movement(action => 'enter', direction => 'alloc');

    # Perform GPU computation that generates results
    alloc_as_float($buffer->get_buffer, $buffer->get_buffer_size);

    # Copy results back to CPU
    $buffer->device_movement(action => 'exit', direction => 'from');

    __DATA__
    __C__
    #include "omp.h"

    void alloc_as_float(unsigned long arr, size_t n) {
        float *array_addr = (float *)arr;
        size_t len = n / sizeof(float);
    #pragma omp target
        for (size_t i = 0; i < len; i++) {
            array_addr[i] =
                (float)i * 3.0f;
        }
    }

Example 3: Working with PDL Views

Combining device management with PDL views for seamless integration with the Perl Data Language:

    use Task::MemManager
        Allocator => 'CMalloc',
        View      => 'PDL',
        Device    => 'NVIDIA_GPU';
    use Inline (
        C         => Config =>
        ccflags   => "-fno-stack-protector -fcf-protection=none "
                   . " -fopenmp -Iinclude -std=c11 -fPIC "
                   . " -Wall -Wextra -Wno-unused-function -Wno-unused-variable"
                   . " -Wno-unused-but-set-variable ",
        lddlflags => join( q{ }, $Config::Config{lddlflags}, q{-fopenmp} ),
        ccflagsex => " -fopenmp ",
        libs      => q{ -lm -foffload=-lm },
        optimize  => "-O3 -march=native",
    );    # replace with your OpenMP device flags
    use Inline C => 'DATA';

    my $buffer_length = 1000;
    my $buffer = Task::MemManager->new($buffer_length, 4,
        { allocator => 'CMalloc' });

    # Create PDL view
    my $pdl_view = $buffer->create_view('PDL',
        { view_name => 'my_pdl_view', pdl_type => 'float' });

    # Initialize with random values in PDL
    $pdl_view->inplace->random;

    # Clone the view for comparison
    my $cloned_view = $buffer->clone_view('my_pdl_view');

    # Move to GPU and modify
    $buffer->device_movement(action => 'enter', direction => 'to');
    mod_as_float($buffer->get_buffer, $buffer->get_buffer_size);
    $buffer->device_movement(action => 'exit', direction => 'from');

    # PDL view automatically reflects changes
    my @values   = list $pdl_view;
    my @original = list $cloned_view;

    # Verify: values should be doubled
    for my $i (0 .. $#values) {
        die "Mismatch!"
            unless $values[$i] == $original[$i] * 2.0;
    }

    __DATA__
    __C__
    #include "omp.h"

    void mod_as_float(unsigned long arr, size_t n) {
        float *array_addr = (float *)arr;
        size_t len = n / sizeof(float);
    #pragma omp target
        for (size_t i = 0; i < len; i++) {
            array_addr[i] *= 2.0f;
        }
    }

Example 4: Multiple Device Management

Managing multiple buffers across different devices (code snippet):

    # Create multiple buffers
    my $buf1 = Task::MemManager->new(1000, 4);
    my $buf2 = Task::MemManager->new(2000, 4);

    # Map to different devices (if available)
    $buf1->device_movement(
        action    => 'enter',
        direction => 'to',
        device_id => 0
    );
    $buf2->device_movement(
        action    => 'enter',
        direction => 'to',
        device_id => 1    # Different device
    );

    # Perform operations on each device - fictional C-level functions
    process_on_device($buf1->get_buffer, $buf1->get_buffer_size);
    process_on_device($buf2->get_buffer, $buf2->get_buffer_size);

    # Retrieve results
    $buf1->device_movement(action => 'exit', direction => 'from', device_id => 0);
    $buf2->device_movement(action => 'exit', direction => 'from', device_id => 1);

Example 5: Partial Buffer Updates

Update only a portion of the buffer between host and device:

    my $buffer = Task::MemManager->new(10000, 4);
    $buffer->device_movement(action => 'enter', direction => 'to');

    # Update only the first 1000 bytes from the GPU
    $buffer->device_movement(
        action    => 'update',
        direction => 'from',
        start     => 0,
        end       => 1000
    );

    # Later, update another region to the GPU
    $buffer->device_movement(
        action    => 'update',
        direction => 'to',
        start     => 1000,
        end       => 2000
    );

    $buffer->device_movement(action => 'exit', direction => 'release');

AUTOMATIC CLEANUP

The module automatically handles cleanup of device mappings when buffer objects are destroyed.
The DESTROY method ensures that:

* All device mappings are properly released
* Device memory is deallocated
* No memory leaks occur on the device
* Reference counts are properly maintained

Cleanup uses the "exit_release_gpu" operation, which lets the runtime manage the actual deallocation timing while still ensuring proper cleanup.

DIAGNOSTICS

If you set the environment variable DEBUG to a non-zero value, the module prints detailed information when things go wrong.

DEPENDENCIES

The module depends on:

* "Task::MemManager" - Base memory management functionality
* "Inline::C" - For C code integration and compilation
* "Module::Find" - For automatic discovery of device modules
* "Module::Runtime" - For dynamic module loading
* An OpenMP-capable compiler (e.g., GCC 9+, Clang 10+) for GPU offloading

For NVIDIA GPU support, you need:

* GCC with nvptx offload support, or
* Clang with CUDA/NVPTX target support (not yet tested with the relevant version of perl)

LIMITATIONS AND CAVEATS

* Cannot map the same buffer to the same device_id multiple times
* Cannot manage the same device_id with different device modules
* Device module compilation happens at first use (and may take some time)
* Requires OpenMP 4.5+ for target directives
* GPU offloading support varies by compiler and installation
* AMD GPU support is experimental and may require additional setup

TODO

* Ensure that the clang and icx compilers work correctly
* Ensure AMD GPU offloading works correctly
* Add support for additional devices (e.g., Intel GPUs, FPGAs)
* Add support for asynchronous data transfers
* Implement device-to-device direct transfers
* Add support for unified memory management
* Provide device property queries (memory available, etc.)
* Add support for interfacing to other parallel programming models (e.g., CUDA, HIP) using OpenMP's interoperability features
* Implement automatic workload distribution across multiple devices

When the DEBUG environment variable is set to 1, the module reports on:

* Device module loading and registration
* Function registration for each device
* Buffer mapping operations (enter/exit/update)
* Device ID management
* Buffer lifecycle events

SEE ALSO

* Task::MemManager - Base memory management module
* Task::MemManager::View - Memory view management
* Inline::C - Inline C code in Perl
* OpenMP Specification - OpenMP target directives
* GCC Offloading - GCC offloading setup

AUTHOR

Christos Argyropoulos, ""

Initial documentation was created by Claude Sonnet 4.5, using the human-generated test files for the module and the documentation in the MemManager distribution as context.

COPYRIGHT AND LICENSE

This software is copyright (c) 2025 by Christos Argyropoulos.

This is free software; you can redistribute it and/or modify it under the MIT license. The full text of the license can be found in the LICENSE file; see that file for more information.