Efficient Parallel Fuzzy Dilation for Visual Reasoning on Edge: Leveraging ARM SIMD Extensions and Embedded GPUs Accelerators

38th ARCS 2025: Kiel, Germany, Apr 2025
Kiel, Germany
Laurent CABARET, Céline HUDELOT, Régis Pierrard, Jean-Philippe POLI

BibTeX

@inproceedings{key,
  title = "Efficient Parallel Fuzzy Dilation for Visual Reasoning on Edge: Leveraging ARM SIMD Extensions and Embedded GPUs Accelerators",
  author = "Laurent CABARET and Céline HUDELOT and Régis Pierrard and Jean-Philippe POLI",
  booktitle = "38th ARCS 2025: Kiel, Germany",
  year = "2025",
  url = "https://books.google.fr/books?hl=fr&lr=&id=zUuREQAAQBAJ&oi=fnd&pg=PA18&dq=Efficient+Parallel+Fuzzy+Dilation+for+Visual+Reasoning+on+Edge:+Leveraging+ARM+SIMD+Extensions+and+Embedded+GPUs+Accelerators.&ots=BCucTgxtLI&sig=-U6u6GHtSA50CsYz6B20jj8dNrU&redir_esc=y#v=onepage&q=Efficient%20Parallel%20Fuzzy%20Dilation%20for%20Visual%20Reasoning%20on%20Edge%3A%20Leveraging%20ARM%20SIMD%20Extensions%20and%20Embedded%20GPUs%20Accelerators.&f=false"
}

Fuzzy spatial relations are increasingly used in visual reasoning tasks such as semantic annotation and object recognition. However, these tasks often rely on computationally intensive fuzzy morphological operators, which introduce significant latency during relation evaluation. Addressing this challenge requires optimized implementations tailored to modern architectures. Previous work introduced the Reverse (R) and Parallel Reverse (PR) algorithms for Intel processors, using OpenMP and SIMD intrinsics. This work extends those contributions to embedded systems by targeting ARM-based processors and NVIDIA embedded GPUs. Specifically, we propose three architecture-specific implementations: PR64N, which uses 64-bit NEON SIMD instructions; PR128N, which uses 128-bit NEON instructions; and PRGPU, a GPU-optimized version based on CUDA. Our evaluation is conducted on the NVIDIA Jetson Orin Nano Super platform, an advanced ARM-based system-on-chip designed for low-power edge AI applications. The results indicate that the CPU implementations achieve near-peak performance by fully exploiting the platform's memory bandwidth. Meanwhile, the GPU implementation efficiently offloads fuzzy spatial relation evaluations, leaving the CPU free to handle additional workloads. These findings highlight the suitability of the proposed methods for enabling visual reasoning tasks on resource-constrained embedded systems and contribute to the broader discussion on addressing heterogeneous architectures with tailored algorithms.
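
For readers unfamiliar with the operator, the following is a minimal reference sketch of fuzzy dilation in its textbook min/max form, where each output pixel is the maximum, over all offsets in the structuring element, of the minimum of the image membership and the structuring-element membership. It is not the R or PR algorithm from the paper, and it uses no SIMD or GPU code. The function name `fuzzy_dilation`, the zero-padding at the borders, and the odd-sized, centered structuring element are assumptions made for illustration only.

```python
def fuzzy_dilation(mu, nu):
    """Naive fuzzy dilation with the min t-norm (illustrative sketch).

    mu: 2D list of membership degrees in [0, 1] (the fuzzy image).
    nu: 2D list of membership degrees in [0, 1] (the fuzzy structuring
        element), assumed odd-sized with its origin at the center.
    Pixels outside the image are treated as membership 0 (assumption).
    """
    height, width = len(mu), len(mu[0])
    ry, rx = len(nu) // 2, len(nu[0]) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best = 0.0
            # Scan every offset (dy, dx) covered by the structuring element.
            for dy in range(-ry, ry + 1):
                for dx in range(-rx, rx + 1):
                    sy, sx = y - dy, x - dx
                    if 0 <= sy < height and 0 <= sx < width:
                        # min acts as the fuzzy conjunction, max as the supremum.
                        candidate = min(mu[sy][sx], nu[dy + ry][dx + rx])
                        best = max(best, candidate)
            out[y][x] = best
    return out


# Usage: dilating a single fully-member pixel reproduces the structuring element.
image = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]
element = [[0.0, 0.5, 0.0],
           [0.5, 1.0, 0.5],
           [0.0, 0.5, 0.0]]
result = fuzzy_dilation(image, element)
```

The cost of this direct form grows with the product of the image size and the structuring-element size, which illustrates why the latency mentioned in the abstract motivates optimized implementations.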