3.6 Greedy Algorithms - 3.6.1 Huffman Coding

2024. 2. 3. 20:17ใ†CS - Roadmap.sh/3. Common Algorithms

CS - 3. Common Algorithms - 3.6 Greedy Algorithms - 3.6.1 Huffman Coding 

Huffman Coding

๐Ÿ’กHuffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters, lengths of the assigned codes are based on the frequencies of corresponding characters. The most frequent character gets the smallest code and the least frequent character gets the largest code.

ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ์€ ๋ฌด์†์‹ค ๋ฐ์ดํ„ฐ ์••์ถ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ์ž…๋ ฅ ๋ฌธ์ž์— ๊ฐ€๋ณ€ ๊ธธ์ด์˜ ์ฝ”๋“œ๋ฅผ ํ• ๋‹นํ•˜๊ณ , ํ• ๋‹น๋œ ์ฝ”๋“œ์˜ ๊ธธ์ด๋Š” ํ•ด๋‹น ๋ฌธ์ž์˜ ๋นˆ๋„์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ๊ฐ€์žฅ ๋นˆ๋„๊ฐ€ ๋†’์€ ๋ฌธ์ž๋Š” ๊ฐ€์žฅ ์ž‘์€ ์ฝ”๋“œ๋ฅผ, ๊ฐ€์žฅ ๋นˆ๋„๊ฐ€ ๋‚ฎ์€ ๋ฌธ์ž๋Š” ๊ฐ€์žฅ ํฐ ์ฝ”๋“œ๋ฅผ ํ• ๋‹น๋ฐ›์Šต๋‹ˆ๋‹ค.

 

ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ(Huffman Coding)์€ ๋ฐ์ดํ„ฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์••์ถ•ํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค.

์ด ๋ฐฉ๋ฒ•์€ ๋ฌธ์ž ๋˜๋Š” ์‹ฌ๋ณผ์ด ์–ผ๋งˆ๋‚˜ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š”์ง€์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ธธ์ด์˜ ๋น„ํŠธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ฐ€์žฅ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋ฌธ์ž๋Š” ๊ฐ€์žฅ ์งง์€ ๋น„ํŠธ๋ฅผ, ๊ฐ€์žฅ ์ ๊ฒŒ ๋“ฑ์žฅํ•˜๋Š” ๋ฌธ์ž๋Š” ๊ฐ€์žฅ ๊ธด ๋น„ํŠธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ ํ‘œํ˜„ํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ์ „์ฒด ๋น„ํŠธ ์ˆ˜๋ฅผ ์ตœ์†Œํ™” ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

import heapq
import collections
import sys

def huffman_encoding(data):
    # ๋ฌธ์ž์˜ ๋นˆ๋„์ˆ˜๋ฅผ ์„ธ์–ด์„œ ์‚ฌ์ „์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.
    freq_dict = collections.defaultdict(int)
    for char in data:
        freq_dict[char] += 1

    # ๋นˆ๋„์ˆ˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ตœ์†Œ ํž™์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค. 
    # ํž™์˜ ๊ฐ ์š”์†Œ๋Š” [๋ฌธ์ž ๋นˆ๋„์ˆ˜, [๋ฌธ์ž, ๋ฌธ์ž์˜ ํ—ˆํ”„๋งŒ ์ฝ”๋“œ]] ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค.
    heap = [[weight, [char, ""]] for char, weight in freq_dict.items()]
    heapq.heapify(heap)

    # ํž™์— ์š”์†Œ๊ฐ€ ํ•˜๋‚˜๋งŒ ๋‚จ์„ ๋•Œ๊นŒ์ง€ ์•„๋ž˜์˜ ๊ณผ์ •์„ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.
    while len(heap) > 1:
        # ํž™์—์„œ ๊ฐ€์žฅ ๋นˆ๋„์ˆ˜๊ฐ€ ์ ์€ ๋‘ ์š”์†Œ๋ฅผ ๊บผ๋ƒ…๋‹ˆ๋‹ค.
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)

        # ๋นˆ๋„์ˆ˜๊ฐ€ ๊ฐ€์žฅ ์ ์€ ์š”์†Œ์˜ ํ—ˆํ”„๋งŒ ์ฝ”๋“œ ์•ž์— '0'์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        # ๋‘ ๋ฒˆ์งธ๋กœ ๋นˆ๋„์ˆ˜๊ฐ€ ์ ์€ ์š”์†Œ์˜ ํ—ˆํ”„๋งŒ ์ฝ”๋“œ ์•ž์— '1'์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]

        # ๋‘ ์š”์†Œ๋ฅผ ํ•ฉ์นœ ์ƒˆ ์š”์†Œ๋ฅผ ํž™์— ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
        # ์ด ์š”์†Œ์˜ ๋นˆ๋„์ˆ˜๋Š” ๋‘ ์š”์†Œ์˜ ๋นˆ๋„์ˆ˜ ํ•ฉ์ด๋ฉฐ, ํ—ˆํ”„๋งŒ ์ฝ”๋“œ๋Š” ๋‘ ์š”์†Œ์˜ ํ—ˆํ”„๋งŒ ์ฝ”๋“œ๋ฅผ ํ•ฉ์นœ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])

    # ํž™์— ๋‚จ์€ ๋งˆ์ง€๋ง‰ ์š”์†Œ๊ฐ€ ์ „์ฒด ๋ฌธ์ž์˜ ํ—ˆํ”„๋งŒ ํŠธ๋ฆฌ์ž…๋‹ˆ๋‹ค. 
    # ์ด๋ฅผ ํ—ˆํ”„๋งŒ ์‚ฌ์ „์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.
    huff_dict = sorted(heapq.heappop(heap)[1:], key=lambda p: (len(p[-1]), p))
    print("Huffman Coding: ")
    for char, huff_code in huff_dict:
        print(f"{char} : {huff_code}")

 

์œ„ ์ฝ”๋“œ๋Š” ์‚ฌ์šฉ์ž๋กœ๋ถ€ํ„ฐ ๋ฌธ์ž์—ด์„ ์ž…๋ ฅ๋ฐ›์•„ ๊ฐ ๋ฌธ์ž์˜ ๋นˆ๋„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ ,

์ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ ์šฉํ•ด ๊ฐ ๋ฌธ์ž์— ๋Œ€์‘ํ•˜๋Š” ํ—ˆํ”„๋งŒ ์ฝ”๋“œ๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.

 

์ด๋Ÿฐ์‹์œผ๋กœ ๋ฌธ์ž์—ด์—์„œ ๊ฐ ๋ฌธ์ž์˜ ๋นˆ๋„์ˆ˜๋ฅผ ์นด์šดํŠธํ•˜๊ณ , ๋นˆ๋„์ˆ˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ—ˆํ”„๋งŒ ํŠธ๋ฆฌ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ๊ฐ ๋ฌธ์ž์— ๋Œ€์‘ํ•˜๋Š” ํ—ˆํ”„๋งŒ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๊ณ  ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ์ƒ์„ฑ๋œ ํ—ˆํ”„๋งŒ ์ฝ”๋“œ๋Š” ๊ทธ ๋ฌธ์ž์˜ ๋นˆ๋„์ˆ˜์— ๋ฐ˜๋น„๋ก€ํ•˜๋Š” ๊ธธ์ด๋ฅผ ๊ฐ€์ง€๋ฏ€๋กœ, ๋นˆ๋„์ˆ˜๊ฐ€ ๋†’์€ ๋ฌธ์ž๋Š” ์งง์€ ์ฝ”๋“œ๋ฅผ, ๋นˆ๋„์ˆ˜๊ฐ€ ๋‚ฎ์€ ๋ฌธ์ž๋Š” ๊ธด ์ฝ”๋“œ๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

ํ™œ์šฉ

์ •์˜์— ๋”ฐ๋ผ์„œ '์••์ถ•'์— ๊ฐ•์ ์„ ์ง€๋‹ˆ๊ธฐ์— ์••์ถ•๊ณผ ์ „์†ก์— ์žˆ์–ด ํฐ ํšจ์œจ์„ฑ์„ ์ง€๋‹™๋‹ˆ๋‹ค.

 

1. ํŒŒ์ผ ์••์ถ•

ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ์€ ํŒŒ์ผ ์••์ถ•์— ๋งŽ์ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

ZIPํŒŒ์ผ ์••์ถ•์€ ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ์„ ์ด์šฉํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์••์ถ•ํ•ฉ๋‹ˆ๋‹ค.

JPEG ์ด๋ฏธ์ง€ ์••์ถ•์—์„œ๋„ ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ์ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ด๋ฏธ์ง€ ์ƒ‰์ƒ ์ •๋ณด๋ฅผ ์••์ถ•ํ•˜๋Š” ๊ณผ์ •์—์„œ ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์••์ถ•ํ•ฉ๋‹ˆ๋‹ค.

 

2. ํ†ต์‹ 

ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ์€ ๋ฐ์ดํ„ฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ „์†กํ•˜๋Š” ๋ฐฉ๋ฒ•์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์ „์†ก์— ํ•„์š”ํ•œ ๋น„ํŠธ์ˆ˜๋ฅผ ์ค„์—ฌ ์ตœ์†Œํ™”ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

3. ์›น ๋กœ๋”ฉ

ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋„ ํ—ˆํ”„๋งŒ ์ฝ”๋”ฉ์„ ํ†ตํ•ด ์••์ถ•์„ ํ• ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์›น ํŽ˜์ด์ง€ ๋กœ๋”ฉ ์†๋„๋ฅผ ๋†’์ด๋Š”๋ฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค.

728x90