博士論文口試 古宗翰 (2014/4/28)
清華大學資訊工程學系博士論文口試
學 生:古宗翰
指導教授:韓永楷 教授
口試委員:
校內:韓永楷 教授, 盧錦隆 教授, 唐傳義 教授
校外:王有禮 教授, 姚兆明 教授, 謝孫源 教授
時 間:103 年 4 月 28 日 (星期四) 13:00 - 15:00
地 點:台達館 613 會議室
題 目:Space-Efficient Indexes for some New Bioinformatics
Applications
論文摘要:
This thesis studies the dictionary matching problem and its related
variations {\it approximate} dictionary matching problem,
{\it circular} dictionary matching problem, and an application {\it text
indexing with wildcards problem}.
Given a set $\D$ of $d$ patterns of total length $n$, the dictionary
matching problem is to index $\D$ such that for any query text $T$, we can
locate the occurrences of any pattern within $T$ efficiently. This problem
can be solved in optimal $O(|T|+occ)$ time by the classical Aho-Corasick
automaton where $occ$ denotes the number of occurrences. The space
requirement is $O(n)$ words which is far from optimal (i.e. succinct space
or compressed space). When $\D$ contains a total of $n$ characters drawn
from an alphabet set $\Sigma$ of size $\sigma$, Hon et al. (2008) gave an
$nH_k(\D)+o(n\log\sigma)$-bit index which supports a query in
$O(|T|(\log^{\epsilon} n+ \log d)+ occ)$ time, where $\epsilon >0$ and
$nH_k(\D)$ denotes the $k$th-order entropy of $\D$. Recently,
Belazzougui (2010) has proposed an elegant scheme, which takes
$n\log\sigma +O(n)$ bits of index space and supports a query in optimal
time. In this thesis, we provide connections between Belazzougui's index
and XBW compression of Ferragina and Manzini (2005), and show that
Belazzougui's index can be slightly modified to be stored in
$nH_k(\D)+O(n)$ bits, while query time remains optimal; this improves the
compressed index by Hon et al. (2008) in both space and time.
For the {\it approximate} dictionary matching problem, we consider the
one error case instead of the $k$ errors case (i.e. general case), where
$k$ is an constant number larger than 0. In the one error case, we consider
a substring of $T[i..j]$ an occurrence of $P$ whenever the edit distance
between $T[i..j]$ and $P$ is at most one. For this problem, the best known
indexes are by Cole et al. (2004), which requires $O(n+ d\log{d})$ words
of space and reports all occurrences in $O(|T|\log{d}\log{\log{d}}+occ)$
time, and by Ferragina et al. (1999), which requires $O(n^{1+\epsilon})$
words of space and reports all occurrences in $O(|T|\log\log n + occ)$
time. Although there have been successes in compressing the dictionary
matching index while keeping the query time optimal (as described on the
above). However, a compressed index for approximate dictionary matching
problem is still open. In this thesis, we propose the first such index
which requires an optimal $nH_k+O(n)+o(n\log\sigma)$-bit index space. The
query time of our index is $O(\sigma |T|\log^3{n}\log{\log{n}}+occ)$.
Circular patterns are those patterns whose cyclic shifts are also valid
patterns. These patterns arise naturally in bioinformatics and
computational geometry. In this thesis, we consider succinct indexing
schemes for a set of $d$ circular patterns of total length $n$, with each
character drawn from an alphabet $\Sigma$ of size $\sigma$. Our succinct
index which needs $n\log\sigma(1+o(1))+O(n)=O(d\log n)$ bits is based
on the popular Burrows-Wheeler transform (BWT) on circular patterns, while
the dictionary matching problem or the pattern matching problem can be
solved efficiently.
Sometimes the text string $T$ could have wildcard characters inside.
Therefore, suppose $T=T_1\phi^{k_1}T_2\phi^{k_2}\cdots\phi^{k_d}T_{d+1}$
whose total length is $n$, where characters of each $T_i$ are chosen from
an alphabet $\Sigma$, and $\phi$ denotes a wildcard symbol. The text
indexing with wildcards problem is to index $T$ such that when we are
given a query pattern $P$, we can locate the occurrences of $P$ in $T$
efficiently. This problem has been applied in indexing genomic sequences
that contain single-nucleotide polymorphisms (SNP) because SNP can be
modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have
proposed succinct indexes for this problem. In this thesis, we will show
how to apply the index of dictionary matching problem to solve this problem,
and we present the first compressed index for this problem, which takes
only $nH_h +o(n\log \sigma)+O(d\log n)$ bits of space, where $H_h$ is the
$h$th-order empirical entropy ($h=o(\log_{\sigma} n)$) of $T$.
著作列表:
Journal Papers: (Authors are listed in alphabetical order)
1. Wing-Kai Hon, Tsung-Han Ku, Tak-Wah Lam, Rahul Shah, Siu-Lung Tam,
Sharma V. Thankachan, Jeffrey Scott Vitter: Compressing Dictionary Matching
Index via Sparsification Technique. Algorithmica, 2014.
2. Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan,
Jeffrey Scott Vitter: Faster compressed dictionary matching. Theoretical
Computer Science, 475: 113-119, 2013.
3. Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan,
Jeffrey Scott Vitter: Compressed text indexing with wildcards. Journal of
Discrete Algorithms, 19: 23-29, 2013.
Conference Papers: (Authors are listed in alphabetical order)
1. Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan:
Space-Efficient Construction Algorithm for the Circular Suffix Tree.
In Proceedings of Symposium on Combinatorial Pattern Matching, pages
142-152, 2013.
2. Travis Gagie, Wing-Kai Hon, Tsung-Han Ku: New Algorithms for Position
Heap. In Proceedings of Symposium on Combinatorial Pattern Matching,
pages 95-106, 2013.
3. Sudip Biswas, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan:
Position-Restricted Substring Searching over Small Alphabets. In
Proceedings of International Symposium on String Processing and
Information Retrieval, pages 29-36, 2013.
4. Wing-Kai Hon, Tsung-Han Ku, Chen-Hua Lu, Rahul Shah, Sharma V. Thankachan:
Efficient Algorithm for Circular Burrows-Wheeler Transform. In Proceedings
of Symposium on Combinatorial Pattern Matching, pages 257-268, 2012.
5. Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan,
Jeffrey Scott Vitter: Compressed Dictionary Matching with One Error.
In Proceedings of Data Compression Conference, pages 113-122, 2011.
--
[;32m※ Origin: 楓橋驛站<
bbs.cs.nthu.edu.tw>
[;32m◆ From: wiselyku @
111-253-141-163.dynamic.hinet.net [m