LeetCode 187: Repeated DNA Sequences

link

Sliding window

Each 10-letter sequence is a window. We keep the window counts in a map.

Say len(s) = n and there are m distinct 10-letter repeated sequences.

Time: \mathcal{O}(n), space: \mathcal{O}(m).

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        sequence_count = {}
        begin = 0
        for end, c in enumerate(s):
            if end >= 9:
                seq = s[begin : end+1]
                if seq not in sequence_count:
                    sequence_count[seq] = 1
                else:
                    sequence_count[seq] += 1
                begin += 1
        
        return [seq for seq, count in sequence_count.items() if count > 1]

Since there are four different letters, the window can be considered as a base-4, 10-digit number. Say first window is: \text{window}_0 = d_0 \cdot 4^9 + d_1 \cdot 4^8 + \cdots + d_8 \cdot 4^1 + d_9 \cdot 4^0. The next window should be: \text{window}_1 = d_1 \cdot 4^9 + d_2 \cdot 4^8 + \cdots + d_9 \cdot 4^1 + d_{10} \cdot 4^0. In other words, \text{window}_1 = \left( \text{window}_0 - d_0 \cdot 4^9 \right) \cdot 4 + d_{10}.

In this way:

  1. We do not need to do string slice/copy per iteration, saving time.
  2. In the maps, we save numbers instead of len-10 strings, saving space.
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        if len(s) < 10:
            return []

        digit_of = {
            "A": 0,
            "C": 1,
            "G": 2,
            "T": 3,
        }

        base = 4

        window = 0
        for i in range(10):
            window = base * window + digit_of[s[i]]

        window_seq = {window: (0, 9)}
        window_count = {window: 1}

        msb = pow(base, 9)
        lsb = 1

        begin = 0
        for end in range(10, len(s)):
            begin_digit = digit_of[s[begin]]
            end_digit = digit_of[s[end]]
            window -= msb * begin_digit
            window *= base
            window += lsb * end_digit
            begin += 1

            if window in window_count:
                window_count[window] += 1
            else:
                window_count[window] = 1
                window_seq[window] = (begin, end)

        repeated_seq = []
        for window, count in window_count.items():
            if count > 1:
                begin, end = window_seq[window]
                repeated_seq.append(s[begin : end + 1])

        return repeated_seq

Leave a comment