Fast Sequence Search Based on Enhanced Suffix Arrays for Protein Subcellular Localization Prediction


Abstract

Many methods for predicting the subcellular localization of proteins exploit information from proteins that have high sequence similarity to the query protein. This paper proposes a method for finding such highly similar proteins quickly. To this end, we adapt enhanced suffix arrays, which are used to locate query DNA sequences in genome databases, to protein database search. Using top-down traversal of the enhanced suffix array and reuse of previous search results, the method finds the database proteins that frequently contain exact matches to subsequences of the query, in time linear in the length of the query sequence and independent of the database size. The Smith-Waterman algorithm is then applied to these candidate proteins to find the most similar one. Compared with BLAST, the most widely used sequence search tool, the proposed method searched about 300 times faster, and it gave higher accuracy than a BLAST-based approach when applied to protein subcellular localization prediction.
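The two-stage idea described in the abstract (filter candidates by exact subsequence matches, then rescore them with Smith-Waterman) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It substitutes a plain suffix array with binary search for the enhanced suffix array and its top-down traversal, so each lookup costs O(log n) in the database size rather than being independent of it. The function names, the separator character, the k-mer length `k`, the candidate count `top`, and the match/mismatch/gap scores are all assumptions made for illustration; a practical scorer would more likely use a substitution matrix such as BLOSUM62 with affine gaps.

```python
from collections import Counter

SEP = "\x00"  # hypothetical separator, assumed absent from protein sequences


def build_index(proteins):
    """Concatenate all sequences and build a plain suffix array (naive sort)."""
    text = SEP.join(proteins) + SEP
    owner = []  # owner[i] = index of the protein that text position i belongs to
    for pid, seq in enumerate(proteins):
        owner.extend([pid] * (len(seq) + 1))
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, owner


def find_range(text, sa, pattern):
    """Return the half-open suffix-array interval whose suffixes start with pattern."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def candidate_proteins(proteins, query, k=3, top=2):
    """Rank database proteins by the number of exact k-mer hits from the query."""
    text, sa, owner = build_index(proteins)
    hits = Counter()
    for i in range(len(query) - k + 1):
        start, end = find_range(text, sa, query[i:i + k])
        for j in range(start, end):
            hits[owner[sa[j]]] += 1
    return [pid for pid, _ in hits.most_common(top)]


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (toy scoring scheme)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def most_similar(proteins, query, k=3, top=2):
    """Filter by exact k-mer hits, then pick the candidate with the best local score."""
    cands = candidate_proteins(proteins, query, k, top)
    return max(cands, key=lambda pid: smith_waterman(query, proteins[pid]), default=None)
```

In this sketch only the `top` candidates reach the quadratic-time alignment step, which is where the filtering stage saves work. `most_similar` returns the index of the best-scoring protein in the input list, or `None` when no k-mer of the query occurs in the database.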

Keywords

protein subcellular localization prediction; sequence search; enhanced suffix arrays; Smith-Waterman algorithm; BLAST
