正規表現による単語リストの圧縮 (3)

できたー！
migemo の RegexCompiler#optimize を利用させてもらいました。結果はこんな感じ。

a(bs|pd_set_socket_session_trace|r(ray|sort)|s(cii2ebcdic|ort))
acosh?
add(cslashes|slashes)
aggregat(e(|_(info|methods(|_by_(list|regexp))|properties(|_by_(list|regexp))))|ion_info)

戦略

末尾一致や中間一致は考えないことにして、先頭一致部分に注目する。たとえば "oooo" で始まる単語が3個あったとする。

ooooxxx
ooooyyy
oooozzz

共通部分である "oooo" を括り出せば 4バイト * 2行 = 8バイト減らせる。

ooooxxx
    yyy
    zzz

ただし、これを正規表現で表すには1行目に2バイト、2行目以降に各1バイトのメタ文字が必要。

oooo(xxx|
     yyy|
     zzz)

結局、文字列 x を括り出すことによって減らせるデータ量 r(x) は

r(x) = (x のバイト数 - 1) * (x の出現度数 - 1) - 2バイト

となる。あとは単語リストに出現するすべての部分文字列 x に対して r(x) を計算し、 r(x) の大きいものから順に x を括り出していけば、必ずしも最適ではないけれども結構いい線まで行くと思われる。
ちなみに、 PHP 5.2.0 における r(x) の上位は

r(x)	x
1111	ncurses_
586	mysqli_
366	image
331	pdf_
282	cpdf_
278	fbsql_
274	dbplus_
258	imap_
238	ibase_
233	mysql_

でした。

実装

単語リストをファイルから読み込みつつ、部分文字列の出現頻度をカウントする。たとえば

abs
acos
acosh

というリストなら、部分文字列の出現頻度は

{"a"=>3, "ab"=>1, "abs"=>1, "ac"=>2, "aco"=>2, "acos"=>2, "acosh"=>1}

となる。この中で、減らせるデータ量 r(x) が最大になるのは x = "acos" である。 "acos" で始まる単語リスト ["acos", "acosh"] を RegexCompiler#optimize に食わせると "acos(h|)" が出てくる。最後に "(h|)" → "h?" みたいな局所最適化をして、ファイルに出力すればおしまい。

問題点

秀丸エディタの強調定義ファイルには次の制限がある。

1行の長さは（フラグ部分も含めて） 253 バイトまで
全体の容量は（内部形式に変換した状態で） 32KB まで

1行の長さ制限をクリアするために、正規表現が250バイトを超えそうになったところで分割する。そのため、たとえば同じ "printer_" で始まる2つの単語が

"printer_draw_elipse" → "printer_(d(raw_(elipse)))"
"printer_select_pen"  → "print(er_(s(e(lect_pen))))"

こんな風に泣き別れになることもある。ていうか、もはや元の単語がどこにあるのかわからない！

ソース

#!/usr/bin/ruby -I./migemo
# 
#       HilightCompressor
#       by IKKI  2007/03/21
# 

# 単語リストを正規表現で圧縮して強調定義ファイルを作る。
# migemo が必要。 → http://0xcc.net/migemo/
# [インストール]
#       hilight_compressor.rb
#       migemo/
#       ├migemo.rb     ← migemo.rb.in をリネーム
#       └migemo-regex.rb
# [使い方]
# 引数に単語リストファイルを与えて実行。
#       $ ./hilight_compressor.rb filename.dic
# 終わると filename.hilight ができる。

$KCODE="e"
require 'migemo'
require 'migemo-regex'
include MigemoRegex

hilight_prefix = '177,'     # 出力行の先頭
hilight_suffix = ''         # 出力行の末尾

if ARGV.length < 1
    puts "usage: #{$0} inputfile"
    exit
end
outputfile = ARGV[0].sub(/[^.]*$/, "hilight")

class RegexCompiler
    def my_optimize(level)
        @regex = optimize2(@regex) if level >= 2
        @regex = optimize3(@regex) if level >= 3
    end
end

class HilightCompressor
    def initialize(border = 1, limit = 250)
        @list = Array.new
        @freq = Hash.new(0)
        @border = border        # 共通部分の最小文字数
        @limit  = limit         # 強調定義の1行の長さの最大値
    end
    attr_accessor :border
    attr_accessor :limit
    
    def load_stdin
        while line = gets
            line.chomp!
            push(line[/^\w+/])
        end
    end
    
    def load_array
        arr.each do |item|
            push(item[/^\w+/])
        end
    end
    
    def push(str)
        @list.push(str)
        str.length.downto(@border) do |len|
            @freq[str[0, len]] += 1
        end
        self
    end
    
    def gather!(lead)   # 文字列 lead で始まる要素を削除して配列で返す
        arr = Array.new
        bytes = 0
        @list.delete_if do |str|
            if str[0, lead.length] == lead && bytes + str.length < @limit - arr.length
                str.length.downto(@border) do |len|
                    @freq[str[0, len]] -= 1
                    @freq.delete(str[0, len]) if @freq[str[0, len]] <= 0
                end
                arr.push(str)
                bytes += str.length
                true
            end
        end
        arr
    end
    
    def major_part  # 最大共通部分を文字列で返す
        # 文字列 x を括ることによって減るバイト数は、正規表現のメタ文字の長さも考えると
        # (x のバイト数 - 1バイト) * (x を含む行数 - 1行) - 2バイト
        return nil if @freq.empty?
        x = @freq.max do |a, b|
            (a[0].length - 1) * (a[1] - 1) <=> (b[0].length - 1) * (b[1] - 1)
        end
        reduce = (x[0].length - 1) * (x[1] - 1) - 2
        $stderr.puts "#{sprintf("%4i", reduce)} #{x[0]}"
        if reduce >= 0
            x[0]
        else
            nil
        end
    end
    
    def regex_for(words)    # 配列 words にヒットする正規表現を文字列で返す
        compiler = RegexCompiler.new
        words.each do |w| compiler.push(w) end
        compiler.uniq
        compiler.my_optimize(3)
        regex = compiler.regex
        renderer = RegexRendererFactory.new(regex, "egrep", "")
        string = renderer.render
        string.gsub(/\((.|\[.+?\])\|\)/, '\1?')     # "(a|)" → "a?"
    end
    
    def compressed_list()   # 最適化されたリストを配列で返す
        arr = Array.new
        while m = major_part        # 大きく減らせるやつから順に括る
            g = gather!(m)
            r = regex_for(g)
            raise "regex length = #{r.length} bytes" if r.length > @limit
            arr.push(r)
        end
        (?! .. ?~).each do |c|      # 余ったやつを頭文字ごとに括る
            g = gather!(c.chr)
            $stderr.puts "#{sprintf("%4i", g.length)} #{c.chr}"
            next if g.empty?
            r = regex_for(g)
            raise "regex length = #{r.length} bytes" if r.length > @limit
            arr.push(r)
        end
        arr.concat(@list)       # さらに余ったやつをそのままくっつける
        @list.clear
        @freq.clear
        arr.sort
    end
end

hc = HilightCompressor.new
hc.load_stdin
max = 0
hfile = open(outputfile, "w")
hc.compressed_list.each do |word|
    line = "#{hilight_prefix}#{word}#{hilight_suffix}\r\n"
    hfile.print(line)
    max = line.length if line.length > max
end
hfile.close
$stderr.puts "output file = #{outputfile}"
$stderr.puts "max length  = #{max} bytes"