Design of a Message Passing Engine Based on Processing Node Status for MPI Collective Communication


Abstract

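This paper proposes algorithms that optimize the broadcast, scatter, and gather functions of MPI collective communication, on the assumption that MPI collective communication functions are converted into a group of point-to-point communication functions at the transaction level. A dedicated MPI hardware engine that runs the proposed algorithms was also designed and named OCC-MPE (Optimized Collective Communication - Message Passing Engine). OCC-MPE performs point-to-point communication in standard send mode. For broadcast, gather, and scatter, the most frequently used collective operations, it determines the transmission order with the proposed algorithms before communicating, which shortens the total communication completion time. To measure the performance of the proposed algorithms, a SystemC-based BFM (Bus Functional Model) of OCC-MPE was built. After performance evaluation with the SystemC-based simulator, an MPSoC (Multi-Processor System on a Chip) including the proposed OCC-MPE was designed in Verilog HDL. Synthesis with the TSMC $0.18\,\mu\mathrm{m}$ process showed that, with four processing nodes, each OCC-MPE occupies an area of approximately 1978.95 gates. This is about 4.15% of the whole system, which confirms that the area is relatively small. Embedding the proposed OCC-MPE in an MPSoC therefore yields a large performance improvement for a relatively small amount of additional hardware.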

Abstract (English)

This paper proposes an algorithm that optimizes the broadcast, scatter, and gather functions of MPI collective communication, on the assumption that each MPI collective communication function is converted into a group of point-to-point communication functions at the transaction level. An MPI hardware engine that runs the proposed algorithm was designed and named the OCC-MPE (Optimized Collective Communication Message Passing Engine). The OCC-MPE performs point-to-point communication in standard send mode. For broadcast, scatter, and gather, the most frequently used collective communication functions, the proposed algorithm determines the transmission order before communication begins, which reduces the overall communication time. To measure the performance of the proposed algorithm, a SystemC-based Bus Functional Model (BFM) of the OCC-MPE was built. After the performance evaluation with the SystemC-based BFM, the proposed OCC-MPE was designed in Verilog HDL. Synthesized with the TSMC $0.18\,\mu\mathrm{m}$ process, each OCC-MPE has a gate count of approximately 1978.95 with four processing nodes. This is approximately 4.15% of the whole system, a relatively small share. Adding the OCC-MPE to an MPSoC (Multi-Processor System on a Chip) is therefore expected to improve performance at the cost of a relatively small increase in area.
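The abstract does not spell out the ordering rule, so the following is only an illustrative sketch of why transmission order can matter when a broadcast is decomposed into sequential point-to-point sends. It assumes a simplified model that is not taken from the paper: the root sends to one receiver at a time, a send can start only once the receiver is ready, and every transfer takes the same time. All names (`broadcast_completion_time`, `rank_order`, `ready_first_order`, `READY_AT`, `TRANSFER_TIME`) and the numbers are hypothetical.

```python
from typing import Dict, List


def broadcast_completion_time(
    order: List[int], ready_at: Dict[int, float], transfer_time: float
) -> float:
    """Return when the root finishes a broadcast done as sequential sends.

    Assumed model (not from the paper): the root serves one receiver at a
    time, and a send cannot start before that receiver is ready.
    """
    clock = 0.0
    for node in order:
        # Wait until both the root is free and the receiver is ready.
        start = max(clock, ready_at[node])
        clock = start + transfer_time
    return clock


def rank_order(ready_at: Dict[int, float]) -> List[int]:
    """Baseline: send in ascending node-rank order, ignoring node state."""
    return sorted(ready_at)


def ready_first_order(ready_at: Dict[int, float]) -> List[int]:
    """State-aware order: serve the receivers that become ready earliest."""
    return sorted(ready_at, key=lambda node: (ready_at[node], node))


# Hypothetical ready times for three receiver nodes (root is node 0).
READY_AT = {1: 5.0, 2: 0.0, 3: 1.0}
TRANSFER_TIME = 2.0

if __name__ == "__main__":
    for name, order_fn in (("rank", rank_order), ("ready-first", ready_first_order)):
        order = order_fn(READY_AT)
        total = broadcast_completion_time(order, READY_AT, TRANSFER_TIME)
        print(f"{name:12s} order={order} completion={total}")
```

In this toy model the rank-ordered broadcast stalls on node 1 and finishes at time 11.0, while the ready-first order finishes at 7.0. The OCC-MPE's actual algorithms and hardware behavior may differ from this sketch.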

Keywords

MPI, collective communication, broadcast, scatter, gather
