An Efficient Technique for Computing Data Cubes Using MapReduce

ㆍ Authors: Lee, Ki Yong; Park, Sojeong; Park, Eunju; Park, Jinkyung; Choi, Yeunjung
ㆍ Journal: KIPS Transactions on Software and Data Engineering (정보처리학회논문지. 소프트웨어 및 데이터 공학)
ㆍ Volume/Issue: 2014 | Vol. 3, No. 11 | pp. 479-486 (8 pages)
ㆍ Publisher: Korea Information Processing Society (한국정보처리학회)
ㆍ File information: Periodical | PDF text
ㆍ Subject area: Other

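The full text of this paper is provided free of charge through a paper-linking arrangement with the Korea Institute of Science and Technology Information (KISTI).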

Abstract

MapReduce is a programming model for processing large amounts of data in parallel across many machines. The data cube is an operator widely used to analyze large data sets; it computes group-bys for all possible combinations of a given set of dimension attributes. When the number of dimension attributes is $n$, the data cube computes $2^n$ group-bys in total. In this paper, we propose an efficient method for computing data cubes using MapReduce. The proposed method partitions the $2^n$ group-bys into $_nC_{\lceil n/2 \rceil}$ batches and computes those batches in stages using $\lceil n/2 \rceil$ MapReduce jobs. Compared with existing methods, it significantly reduces the amount of intermediate data generated by the mappers, which in turn lowers the cost of transferring and sorting that data. Consequently, the total processing time for computing a data cube is reduced. Experiments show that the proposed method computes data cubes faster than existing methods.
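
To make the counts in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It enumerates the $2^n$ group-bys of a data cube, computes each one naively on a small in-memory table, and evaluates the batch and job counts the abstract states for a given $n$. The table contents, the column names (`region`, `product`, `year`, `sales`), the SUM aggregate, and all function names are assumptions made only for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import ceil, comb


def all_group_bys(dimensions):
    """Return every subset of the dimension attributes (2^n group-bys)."""
    return [
        subset
        for size in range(len(dimensions) + 1)
        for subset in combinations(dimensions, size)
    ]


def naive_cube(rows, dimensions, measure):
    """Compute SUM(measure) for every group-by, scanning the rows once per group-by."""
    cube = {}
    for group_by in all_group_bys(dimensions):
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[d] for d in group_by)
            totals[key] += row[measure]
        cube[group_by] = dict(totals)
    return cube


def batch_and_job_counts(n):
    """Counts stated in the abstract: C(n, ceil(n/2)) batches and ceil(n/2) jobs."""
    half = ceil(n / 2)
    return comb(n, half), half


# Hypothetical toy table; names and values are illustrative only.
rows = [
    {"region": "east", "product": "pen", "year": 2013, "sales": 10},
    {"region": "east", "product": "ink", "year": 2014, "sales": 5},
    {"region": "west", "product": "pen", "year": 2014, "sales": 7},
]
dims = ("region", "product", "year")
cube = naive_cube(rows, dims, "sales")
```

With three dimension attributes, `all_group_bys(dims)` yields eight group-bys, from the empty grouping (the grand total) up to the full `(region, product, year)` grouping, and `batch_and_job_counts(3)` gives three batches over two jobs. The naive version rescans the input for every group-by; according to the abstract, the proposed method instead lowers cost by reducing the intermediate data that mappers emit.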

Keywords

Data Cube, MapReduce, Big Data, Query Processing, OLAP
