[Algorithm Part 2] 8. Substring Search

[Algorithm Part 2] 8. Substring Search

2023. 6. 19. 23:39ㆍAlgorithm

이번 글에서는, substring search에 대해 살펴볼 것이다.

Substring search는 말뭉치(=corpus)에서 특정 문자열(=pattern)을 모두 찾는 문제다.

(N >> M, N = length of corpus, M = length of pattern)

0. Burte-force algorithm

corpus 맨 왼쪽 문자를 기준으로, pattern이 있는지 확인한다.

만약, pattern과 mismatch하면, 다음 문자(i=i+1)를 기준으로 pattern 여부를 확인한다. (by 재귀)
만약, pattern과 match하면, 문자 위치(i)를 반환한다.

위 알고리즘은 backup을 사용하기 때문에 매우 비효율적이다. (worst case $\sim MN$ char compare)

여기서, backup이란 "이전에 탐색했던 문자를 다시 탐색하는 것"을 의미한다.

때문에, 대부분 알고리즘 설계 시, backup을 피하고 싶어 한다.

지금부터, backup 문제를 해결하여 linear-time 성능을 보여주는 알고리즘에 대해 살펴볼 것이다.

1. Knuth-Morris-Pratt algorithm

이전 문자를 탐색해 얻은 지식을 활용하는 것

즉, "Deterministic finite state machine 설계 및 활용"이 알고리즘의 핵심이다.

위 state machine은 현재 corpus와 pattern의 matching 정도를 알려주는 모델이다.

예를 들어, 현재 corpus의 i번째 문자 위치를 탐색하고 있다고 가정히자.

이때, "state machine은 pattern[:j]와 corpus[i-j:i]가 matching되어 있음"을 알려주는 모델이다.

즉, 어디까지 match되었는지 알려주는 모델이다.

Simulation algorithm

corpus 맨 왼쪽 문자를 탐색하여 state machine에 전달해, 현재 어디까지 match되었는지 반환 받는다.

만약, pattern의 마지막 문자까지 match되었으면, 문자 위치(i)를 반환한다.
그렇지 않으면, 다음 문자(i=i+1)로 탐색해 pattern match 여부를 확인한다. (by 재귀)

State machine

state = 현재, pattern과 match되는 문자 개수 (e.g., 0, 1, 2, 3, ...)
state의 개수는 pattern의 문자 개수보다 한 개 더 많다.
event = 다음 문자(e.g., a, b, c, d ...)
action = matching index를 반환한다.

Difference from brute-force algorithm

Build state machine
No backup
Could use other input stream

Build state machine

If state == j and event: corpus[i] == pattern[j], then state = j + 1
dfa[pattern[j]][j] = j + 1
If state == j and event: c != pattern[j], then simluate pattern[1:j-1] and take transition pattern[j]
dfa[c][j] = dfa[c][X] (X = state of pattern[1:j-1])
(WHY? c != pattern[j]이면, pattern[0:j-1] + c까지의 state와 pattern[1:j-1] + c까지의 state가 동일하기 때문이다!!)
이때, 이것만 보면 각 step마다 j char compare가 소요될거 처럼 보인다.
하지만, pattern[1:j-1]의 state를 유지하면, 1 char compare가 소요된다!!
(by X = dfa[pattern[j]][X])

Performance

Simulate	Building state machine	KMP algorithm
N char compare	M char compare	N + M char compare

KMP 알고리즘보다 빠른 알고리즘에 대해 살펴보자.

2. Boyer-Moore algorithm

pattern의 맨 왼쪽 문자부터 match하는 것이 아닌, pattern의 맨 오른쪽 문자부터 match하는 것이 핵심이다.

이러한 방식은 다음과 같은 장점이 있다.

"만약 pattern과 mismatch시, 최대 M개 문자를 건너뛸 수 있다."

Algorithm

corpus 맨 왼쪽 문자를 기준으로, pattern이 있는지 확인한다. 이때, pattern의 맨 오른쪽 문자부터 match한다.

만약, pattern과 match하면, 문자 위치(i)를 반환한다.
만약, pattern과 mismatch하면,
- Case1: pattern에 없는 문자(=c)로 mismatch가 발생 시,
  corpus의 mismatch 위치의 다음 문자를 기준으로 pattern 여부를 확인한다. (by 재귀)
- Case 2: pattern에 있는 문자(=c)로 mismatch가 발생 시,
  corpus의 mismatch 위치와 pattern의 c 문자의 마지막 위치를 맞춰 기준을 정해, pattern 여부를 확인한다. (by 재귀)
  이때, 기준이 backup될 시, pattern의 차순위 c 문자의 위치와 맞춰 기준을 정한다.

사전에 pattern의 모든 문자의 마지막 위치(index)를 알고 있어야 한다.

3. Rabin-Karp algorithm

알고리즘의 핵심은 hash다.

pattern의 hash 값을 계산한다.
각 corpus index i마다, corpus[i:i+M]의 hash 값를 계산한다.
pattern hash 값과 일치하는 모든 i를 찾아 matching 여부를 확인한다.

Efficiently computing hash function (Modular hash function)

$x_i = t_i R^{M-1} + t_{i+1} R^M + \cdots + t_{i+M-1}R^0 \pmod Q \\ (t_i = \text{corpus[i]}, \quad M\text{-digit}, \quad \text{base-}R\text{ integer}, \quad \text{modulo }Q)$

Horner's mothod를 통해, linear-time에 hash 값을 계산할 수 있다..

즉, pattern가 아무리 길어도, 효율적으로 hash 값을 계산할 수 있다.

$x_{i+1} = (x_i - t_i R^{M-1}) R + t_{i+M} \\ (x_{i+1} = t_{i+1} R^{M-1} + t_{i+2} R^M + \cdots + t_{i+M}R^0, \quad x_i = t_i R^{M-1} + t_{i+1} R^M + \cdots + t_{i+M-1}R^0)$

위 수식을 통해, $x_i$ 가 주어지면 constant-time에 $x_{i+1}$ 를 계산할 수 있다.

즉, 모든 $x_i$ 를 linear-time에 계산할 수 있다.

Version

Monte Carlo version: hash 값이 일치하는 index를 반환해준다.
빠르지만(linear-time), 매우 낮은 확률로 오답을 제공할 수 있다.
Las Vegas version: 만약 hash 값이 일치하면, matching 여부를 확인 후, 반환해준다.
느리지만(worst: ~MN), 정답을 보장한다.

Analysis

Theory: 만약 Q가 매우 큰 소수( $MN^2$ )라고 가정할 시, 오답 확률은 $1/N$ 다.
Practice: 만약 Q에 매우 큰 소수로 할당할 시, 오답 확률은 $1/Q$ 다.

위 알고리즘은 다른 알고리즘보다 "유연함 및 확장성"이 좋다.

예를 들어, 여러 pattern을 찾는 문제로 바꿀 때 매우 편하다.

Performance

algorithm	KMP	Boyer-Moore	Rabin-Karp
time-complexity	N + M	N/M (worst: MN)	N

저작자표시 비영리 변경금지 (새창열림)

'Algorithm' 카테고리의 다른 글

[Algorithm Part 2] 10. Data Compression (0)	2023.06.27
[Algorithm Part 2] 9. Pattern Matching (0)	2023.06.22
[Algorithm Part 2] 7. Tries (0)	2023.06.17
[Algorithm Part 2] 6. String Sort (0)	2023.06.11
[Algorithm Part 2] 5. Mincut/Maxflow (0)	2023.05.23

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

강정노트

강정노트

최근글

0. Burte-force algorithm

1. Knuth-Morris-Pratt algorithm

Simulation algorithm

State machine

Difference from brute-force algorithm

Build state machine

Performance

2. Boyer-Moore algorithm

Algorithm

3. Rabin-Karp algorithm

Efficiently computing hash function (Modular hash function)

Version

Analysis

Performance

'Algorithm' 카테고리의 다른 글

관련글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역