検索条件

キーワード
タグ
ツール
開催日
こだわり条件

タグ一覧

JavaScript
PHP
Java
Ruby
Python
Perl
Scala
Haskell
C言語
C言語系
Google言語
デスクトップアプリ
スマートフォンアプリ
プログラミング言語
U/UX
MySQL
RDB
NoSQL
全文検索エンジン
全文検索
Hadoop
Apache Spark
BigQuery
サーバ構成管理
開発サポートツール
テストツール
開発手法
BI
Deep Learning
自然言語処理
BaaS
PaaS
Iaas
Saas
クラウド
AI
Payment
クラウドソフトウェア
仮想化ソフトウェア
OS
サーバ監視
ネットワーク
WEBサーバ
開発ツール
テキストエディタ
CSS
HTML
WEB知識
CMS
WEBマーケティング
グラフィック
グラフィックツール
Drone
AR
マーケット知識
セキュリティ
Shell
IoT
テスト
Block chain
知識

[RIKEN AIP Achievement Report Event] CompressedAI2025: Compressed Data Structures for Advanced Intelligent Data Analysis

2025/07/03(木) 04:00  〜  2025/07/04(金) 08:00

主催:RIKEN AIP Public

📊 RIKEN AIP Achievement Report Event

CompressedAI2025: Compressed Data Structures for Advanced Intelligent Data Analysis


🧠 Workshop Overview

This workshop highlights recent advancements in compressed data structures and AI, with a focus on real-world applications in data analysis. Topics include:

  • Advanced compressed data structures
  • Efficient algorithms
  • Cutting-edge ML models
  • Graph-based methodologies
  • Algorithmic techniques for real-world challenges

The event features talks by leading researchers with significant contributions in these fields.


🎯 Key Topics

  • Advanced Algorithmic Techniques
  • Compressed Data Structures
  • Machine Learning
  • Applications: Bioinformatics, Chemoinformatics, Trajectory Data Analyses

🏢 Organization

Succinct Information Processing Team, RIKEN-AIP


🗓️ Date

July 3rd and 4th


📍 Venue

  • In-person: RIKEN-AIP Nihonbashi office (Access) (please register via the Google form)
  • Virtual: Zoom

👥 Speakers


📅 Program Schedule

Day 1 – July 3rd

Time Speaker Title
13:00–13:50 Yasuo Tabei Past Achievements and Future Directions in Succinct Information Processing
13:50–14:40 Takaaki Nishimoto Optimal-Time Queries and Constructions of RLBWT in BWT-runs Bounded Space
14:40–15:30 Ben Langmead The new move (infra)structure for pangenomics
15:30–16:00 Coffee Break
16:00–16:50 Yoshitaka Yamamoto Application and Limitation of Itemset-Frequency Sketches
16:50–17:40 Rossano Venturini (Online) Fast KNN Retrieval for Dense and Sparse Learned Embedding

Day 2 – July 4th

Time Speaker Title
13:00–13:50 Simon J. Puglisi Fast k-mer lookups with the Spectral Burrows-Wheeler transform
13:50–14:40 Giulio Ermanno Pibiri Repetition-aware Compression and Query of Colored de Bruijn Graphs
14:40–15:10 Coffee Break
15:10–16:00 Tomohiro I Space-efficient SLP Encoding for O(log N)-time Random Access
16:00–16:50 Travis Gagie (Online) Fun with MEMs

📄 Abstracts

Fun with MEMs

A maximal exact match (MEM) of a pattern P with respect to a text T is a substring of P occurring in T but that can't be extended in either direction while still occurring in T.  Finding MEMs is an important step in many bioinformatics tasks, such as DNA read alignment, and taxonomic and metagenomic classification.  Although there are theoretically superior options, most practitioners find MEMs with Li's forward-backward algorithm, which is used in several popular tools.  The number of steps forward-backward takes is roughly proportional to the total length of the MEMs, however, which tends to be dominated by short MEMs that are likely due to noise.

Last year we noticed that a strategy similar to Boyer-Moore pattern matching, which we call Boyer-Moore-Li (BML),  can be used to speed up forward-backward when looking only for long MEMs.  Li incorporated BML into his tool ropebwt3 (over a weekend!) and found it often gives a tenfold speedup in practice.  Continuing the Boyer-Moore philosophy of ignoring as much of the input as possible, we then introduced a k-mer filtration step: if a k-mer P[i..i + k - 1] does not occur in T then no MEM of length L > k can contain P[i..i + k - 1]; therefore, given k and a lower bound L > k on the lengths of the MEMs we want, we can use a Bloom filter of the distinct k-mers in T to break P into maximal substrings containing only k-mers accepted by the filter, which we call pseudo-MEMs, and then search only the pseudo-MEMs of length L or more.  In fact, our experiments show it is usually enough to search only a few of the longest pseudo-MEMs, and this is much faster than searching all of P even with BML.

We are now working on mixing BML with two-level indexing.  We build an rsync-like parse of T and index the parse with a run-length compressed suffix array (RLCSA) and T itself with an FM-index.  Given P, we parse it and try as much as we can to run BML on the parses of P with respect to the parse of T, dropping into P and T themselves as little as we can.  Matching phrase by phrase is faster than matching character by character, especially since the parse of T is over a large alphabet (the dictionary of distinct phrases) and that makes the RLCSA cache-efficient.

This is joint work with many people, including Hideo Bannai, Christina Boucher, Nate Brown, Lore Depuydt, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Marinella Sciortino and Mohsen Zakeri.

The new move (infra)structure for pangenomics

Pangenome indexing is an important tool for avoiding the problem of reference bias in sequencing data analysis. Compressed indexes like r-index are practical because they use space proportional to the amount of non-redundant sequence in a pangenome. Since the r-index appeared, it became the basis of new methods for (a) computing maximal exact matches, (b) performing multi-class and taxonomic classification, and (c) read alignment.

Nishimoto & Tabei recently proposed the "move structure" as an alternative to the r-index. The move structure has advantageous time and space complexity, achieving the same important O(r) space bound as the r-index. It also radically rearranges how the compressed index is laid out. Instead of being fragmented over many wavelet trees, bitvectors and lookup tables, the move structure simply consists of one table. A move-structure match query involves a predictable and small number of movements through the table. This leads to excellent locality of reference, yielding the best query speed of any full-text compressed index.

I will discuss how the move structure has become the basis for a new suite of practical pangenome methods, starting with my group's work on Movi. Movi is a move-structure-based compressed index for pangenome indexing. I will discuss many practical improvements we have made in Movi, including (a) integrating "thresholds" into the structure so that it can perform MONI-like queries, (b) minimizing the number of bytes used to represent each row of the structure using compression, (c) different strategies and rationales for "splitting" rows of the structure, (d) specific ways in which the move structure (and its notion of "splitting") makes it applicable to classification. I will also discuss how the move structure can continue to develop into a core pangenome method alongside r-index. Much of this work is joint with Travis Gagie and Christina Boucher.

Workship