[RIKEN AIP Achievement Report Event] CompressedAI2025: Compressed Data Structures for Advanced Intelligent Data Analysis
📊 RIKEN AIP Achievement Report Event
CompressedAI2025: Compressed Data Structures for Advanced Intelligent Data Analysis
🧠 Workshop Overview
This workshop highlights recent advancements in compressed data structures and AI, with a focus on real-world applications in data analysis. Topics include:
- Advanced compressed data structures
- Efficient algorithms
- Cutting-edge ML models
- Graph-based methodologies
- Algorithmic techniques for real-world challenges
The event features talks by leading researchers with significant contributions in these fields.
🎯 Key Topics
- Advanced Algorithmic Techniques
- Compressed Data Structures
- Machine Learning
- Applications: Bioinformatics, Chemoinformatics, Trajectory Data Analyses
🏢 Organization
Succinct Information Processing Team, RIKEN-AIP
🗓️ Date
July 3rd and 4th
📍 Venue
- In-person: RIKEN-AIP Nihonbashi office (Access) (please register via the Google form)
- Virtual: Zoom
👥 Speakers
- Travis Gagie (Dalhousie University)
- Tomohiro I (Kyushu Institute of Technology)
- Ben Langmead (Johns Hopkins University)
- Takaaki Nishimoto (RIKEN-AIP)
- Giulio Ermanno Pibiri (Ca’ Foscari University)
- Simon J. Puglisi (University of Helsinki)
- Yasuo Tabei (RIKEN-AIP)
- Rossano Venturini (Pisa University)
- Yoshitaka Yamamoto (Shizuoka University)
📅 Program Schedule
Day 1 – July 3rd
Time | Speaker | Title |
---|---|---|
13:00–13:50 | Yasuo Tabei | Past Achievements and Future Directions in Succinct Information Processing |
13:50–14:40 | Takaaki Nishimoto | Optimal-Time Queries and Constructions of RLBWT in BWT-runs Bounded Space |
14:40–15:30 | Ben Langmead | The new move (infra)structure for pangenomics |
15:30–16:00 | — | ☕ Coffee Break |
16:00–16:50 | Yoshitaka Yamamoto | Application and Limitation of Itemset-Frequency Sketches |
16:50–17:40 | Rossano Venturini (Online) | Fast KNN Retrieval for Dense and Sparse Learned Embedding |
Day 2 – July 4th
Time | Speaker | Title |
---|---|---|
13:00–13:50 | Simon J. Puglisi | Fast k-mer lookups with the Spectral Burrows-Wheeler transform |
13:50–14:40 | Giulio Ermanno Pibiri | Repetition-aware Compression and Query of Colored de Bruijn Graphs |
14:40–15:10 | — | ☕ Coffee Break |
15:10–16:00 | Tomohiro I | Space-efficient SLP Encoding for O(log N)-time Random Access |
16:00–16:50 | Travis Gagie (Online) | Fun with MEMs |
📄 Abstracts
Fun with MEMs
A maximal exact match (MEM) of a pattern P with respect to a text T is a substring of P occurring in T but that can't be extended in either direction while still occurring in T. Finding MEMs is an important step in many bioinformatics tasks, such as DNA read alignment, and taxonomic and metagenomic classification. Although there are theoretically superior options, most practitioners find MEMs with Li's forward-backward algorithm, which is used in several popular tools. The number of steps forward-backward takes is roughly proportional to the total length of the MEMs, however, which tends to be dominated by short MEMs that are likely due to noise.
Last year we noticed that a strategy similar to Boyer-Moore pattern matching, which we call Boyer-Moore-Li (BML), can be used to speed up forward-backward when looking only for long MEMs. Li incorporated BML into his tool ropebwt3 (over a weekend!) and found it often gives a tenfold speedup in practice. Continuing the Boyer-Moore philosophy of ignoring as much of the input as possible, we then introduced a k-mer filtration step: if a k-mer P[i..i + k - 1] does not occur in T then no MEM of length L > k can contain P[i..i + k - 1]; therefore, given k and a lower bound L > k on the lengths of the MEMs we want, we can use a Bloom filter of the distinct k-mers in T to break P into maximal substrings containing only k-mers accepted by the filter, which we call pseudo-MEMs, and then search only the pseudo-MEMs of length L or more. In fact, our experiments show it is usually enough to search only a few of the longest pseudo-MEMs, and this is much faster than searching all of P even with BML.
We are now working on mixing BML with two-level indexing. We build an rsync-like parse of T and index the parse with a run-length compressed suffix array (RLCSA) and T itself with an FM-index. Given P, we parse it and try as much as we can to run BML on the parses of P with respect to the parse of T, dropping into P and T themselves as little as we can. Matching phrase by phrase is faster than matching character by character, especially since the parse of T is over a large alphabet (the dictionary of distinct phrases) and that makes the RLCSA cache-efficient.
This is joint work with many people, including Hideo Bannai, Christina Boucher, Nate Brown, Lore Depuydt, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Marinella Sciortino and Mohsen Zakeri.
The new move (infra)structure for pangenomics
Pangenome indexing is an important tool for avoiding the problem of reference bias in sequencing data analysis. Compressed indexes like r-index are practical because they use space proportional to the amount of non-redundant sequence in a pangenome. Since the r-index appeared, it became the basis of new methods for (a) computing maximal exact matches, (b) performing multi-class and taxonomic classification, and (c) read alignment.
Nishimoto & Tabei recently proposed the "move structure" as an alternative to the r-index. The move structure has advantageous time and space complexity, achieving the same important O(r) space bound as the r-index. It also radically rearranges how the compressed index is laid out. Instead of being fragmented over many wavelet trees, bitvectors and lookup tables, the move structure simply consists of one table. A move-structure match query involves a predictable and small number of movements through the table. This leads to excellent locality of reference, yielding the best query speed of any full-text compressed index.
I will discuss how the move structure has become the basis for a new suite of practical pangenome methods, starting with my group's work on Movi. Movi is a move-structure-based compressed index for pangenome indexing. I will discuss many practical improvements we have made in Movi, including (a) integrating "thresholds" into the structure so that it can perform MONI-like queries, (b) minimizing the number of bytes used to represent each row of the structure using compression, (c) different strategies and rationales for "splitting" rows of the structure, (d) specific ways in which the move structure (and its notion of "splitting") makes it applicable to classification. I will also discuss how the move structure can continue to develop into a core pangenome method alongside r-index. Much of this work is joint with Travis Gagie and Christina Boucher.