Storage and retrieval system for complex analytics on big sequence collections

Data Series

Data series (a.k.a. sequences, or time series) are present in virtually every scientific and social domain: from health care, astronomy and biology, to finance and the internet-of-things.

Scaling to Big Data

In astronomy, there are applications with more than 70TB of spectroscopic sequence data, while by 2025 scientists are expected to collect around 2-40 ExaBytes of DNA sequence data.

Interactive Science

Our research aims to change a landscape in which database systems are used merely for storing and retrieving data, by enabling scientists to transparently use specialized query processing systems to access their sequence data.

Features
Summarization and Indexing

NESTOR uses specialized summarization techniques both to reduce the size of data series and to enable fast analytics. It additionally supports the construction of domain-specific indexes and decides when to use them by performing access path selection. Such indexes facilitate both analytical queries (such as similarity search) and aggregation queries.
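As an illustration of how a summary can shrink a data series while still supporting fast analytics, here is a minimal sketch of Piecewise Aggregate Approximation (PAA), a standard summarization from the data series literature. It is not NESTOR's actual implementation; the function names (`paa`, `euclidean`, `paa_lower_bound`) are hypothetical, and the sketch assumes equal-length series whose length is divisible by the number of segments.

```python
import math


def paa(series, segments):
    """Summarize a series by the mean of each equal-width segment."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be divisible by segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa_lower_bound(paa_a, paa_b, n):
    """Distance computed on summaries only.

    For two series of original length n, this value never exceeds their
    true Euclidean distance, so it can be used to discard candidates
    without reading the raw data.
    """
    width = n // len(paa_a)
    return math.sqrt(width) * euclidean(paa_a, paa_b)


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [2, 2, 2, 2, 9, 9, 1, 1]
    print("summary of a:", paa(a, 4))
    print("lower bound:", paa_lower_bound(paa(a, 4), paa(b, 4), len(a)))
    print("true distance:", euclidean(a, b))
```

The key property is that the summary-based distance is a lower bound on the true distance, which is what lets an index skip raw data safely.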

Parallelization

Data storage, indexing, and query processing can all scale to large clusters of computing nodes, allowing multi-TB datasets to be processed and large analytical jobs to be performed in seconds.

Adaptive Reorganization

NESTOR's storage layer continuously and adaptively reorganizes the underlying data layout in order to match the current workload, without incurring any additional overhead.

Modern Hardware Optimizations

We aim to utilize all modern hardware optimization techniques such as SIMD, NUMA-aware multi-processing, GPUs and SSD optimizations.

Team

Prof. Themis Palpanas (Paris Descartes University)
Prof. Stratos Idreos (Harvard University)
Dr. Kostas Zoumpatianos (Harvard University & Paris Descartes University)
PhD students: Anna Gogolou, Karima Echihabi, Michele Linardi, Botao Peng
Research Engineers: Paul Boniol, Federico Roncalo, Xucheng Tang



Research

We have developed the current state-of-the-art data series indexes: iSAX2+ (bulk loading), DPiSAX (distributed), ADS+ (adaptive), Coconut (update-friendly), and ULISSE (variable-length). We have also developed the first data series query workload benchmark, as well as DSStat, a toolset for data series preprocessing and visualization.

We have applied our techniques on streaming and uncertain data series, and have worked with data from diverse domains, such as home networks, road tunnels, and manufacturing.

  • Management

    Applications in diverse domains have an increasingly pressing need for techniques that can index and mine very large collections of sequences, or data series. Examples of such applications come from biology, astronomy, entomology, the web, and other domains.

  • Data Series Management: Fulfilling the Need for Big Sequence Analytics.

    Kostas Zoumpatianos, Themis Palpanas.

    ICDE 2018
  • The Parallel and Distributed Future of Data Series Mining. [Invited Paper]

    Themis Palpanas.

    HPCS 2017
  • Data Series Management: The Next Challenge

    Themis Palpanas.

    ICDE 2016
  • Big Sequence Management: A Glimpse on the Past, the Present, and
    the Future. [Invited Paper]

    Themis Palpanas.

    LNCS 2016
  • Data Series Management: The Road to Big Sequence Analytics.

    Themis Palpanas.

    SIGMOD Record 2015

  • Indexing

    For big data exploration, it is prohibitive to rely on full sequential scans for every single query; therefore, indexing is required. The goal of our indexing techniques is to make query processing efficient enough that analysts can repeatedly issue exploratory queries with quick response times and low initialization costs.

  • Coconut: A Scalable Bottom-Up Approach for Building Data Series Indexes

    Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas.

    PVLDB 2018
  • ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series

    Michele Linardi, Themis Palpanas.

    ICDE 2018
  • DPiSAX: Massively Distributed Partitioned iSAX

    Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas.

    ICDM 2017
  • ADS: The Adaptive Data Series Index

    Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

    VLDBJ 2016
  • Query Workloads for Data-Series Indexes

    Kostas Zoumpatianos, Yin Lou, Themis Palpanas, Johannes Gehrke.

    KDD 2015
  • Beyond One Billion Time Series: Indexing and Mining Very Large Time Series
    Collections with iSAX2+

    Alessandro Camerra, Jin Shieh, Themis Palpanas, Thanawin Rakthanmanon, Eamonn Keogh.

    KAIS 2014
  • Indexing for Interactive Exploration of Big Data Series

    Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

    SIGMOD 2014
  • iSAX 2.0: Indexing and Mining One Billion Time Series.

    Alessandro Camerra, Themis Palpanas, Jin Shieh, Eamonn Keogh.

    ICDM 2010
  • Indexing Large Human-Motion Databases

    Eamonn Keogh, Themis Palpanas, Victor B. Zordan, Dimitrios Gunopulos, Marc Cardle.

    VLDB 2004
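
To illustrate why summaries let a query avoid a full sequential scan, the following is a minimal sketch of exact nearest neighbor search with lower-bound pruning. It is not the algorithm of any specific index listed above: it assumes PAA summaries, a flat in-memory collection instead of a tree, and series lengths divisible by the number of segments. All function names are hypothetical.

```python
import math


def paa(series, segments):
    """Mean of each equal-width segment (Piecewise Aggregate Approximation)."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be divisible by segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, collection, segments=4):
    """Exact 1-NN search that reads raw series only when needed.

    Returns (index of best match, its distance, number of raw series read).
    """
    scale = math.sqrt(len(query) // segments)
    q_sum = paa(query, segments)
    # A real index would precompute and store these summaries.
    summaries = [paa(s, segments) for s in collection]
    lower_bounds = [scale * euclidean(q_sum, s) for s in summaries]

    best_idx, best_dist, raw_reads = None, math.inf, 0
    # Visit candidates from the most to the least promising lower bound.
    for i in sorted(range(len(collection)), key=lambda j: lower_bounds[j]):
        if lower_bounds[i] >= best_dist:
            break  # no remaining candidate can beat the current best
        raw_reads += 1
        d = euclidean(query, collection[i])
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, raw_reads


if __name__ == "__main__":
    data = [[0] * 8, [1] * 8, [10] * 8, [0, 1] * 4]
    print(nearest_neighbor([1, 1, 1, 1, 1, 1, 1, 2], data))
```

Because the summary distance never exceeds the true distance, stopping at the first candidate whose lower bound reaches the best distance found so far cannot miss the true nearest neighbor.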

  • Analytics

    Examples of analysis operations are queries by content (range and similarity queries, nearest neighbors), clustering, classification, outlier patterns, frequent sub-sequences, and others.

  • Matrix Profile X: VALMOD - Scalable Discovery of Variable-Length
    Motifs in Data Series.

    Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn Keogh.

    SIGMOD 2018
  • VALMOD: A Suite for Easy and Exact Detection of Variable Length
    Motifs in Data Series.

    Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn Keogh.

    SIGMOD 2018
  • Data Series Similarity Using Correlation-Aware Measures.

    Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas.

    SSDBM 2017
  • Correlation-Aware Distance Measures for Data Series

    Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas.

    EDBT 2017
  • Time Series Analysis for Near-Infrared Spectroscopy Data.

    Novri Suhermi, Judit Gervain, Themis Palpanas.

    fNIRS 2016
  • Characterizing Home Device Usage From Wireless Traffic Time Series.

    Katsiaryna Mirylenka, Vassilis Christophides, Themis Palpanas, Ioannis Pefkianakis, Martin May.

    EDBT 2016
  • Envelope-Based Anomaly Detection for High-Speed Manufacturing Processes.

    Katsiaryna Mirylenka, Alice Marascu, Themis Palpanas, Matthias Fehr, Stefan Jank, Gunter Welde, Daniel Groeber.

    APC|M 2013
  • Finding Interesting Correlations with Conditional Heavy Hitters.

    Katsiaryna Mirylenka, Themis Palpanas, Graham Cormode, Divesh Srivastava.

    ICDE 2013
  • Scalable Similarity Matching in Streaming Time Series.

    Alice Marascu, Suleiman Ali Khan, Themis Palpanas.

    PAKDD 2012
  • Real-Time Data Analytics in Sensor Networks.

    Themis Palpanas.

    Springer 2012

  • Exploration

    Using our techniques, users can explore large datasets and find patterns of interest, using nearest neighbor search. They can draw queries (data series) using a mouse, or touch screen, or they can select from their own datasets.

  • RINSE: Interactive Data Series Exploration with ADS+

    Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

    VLDB 2015

  • Summarization

    In order to support time- and space-efficient management and analytics, data series need to be summarized. Different summarization techniques are applicable to different applications and problem settings.

  • Practical Data Prediction for Real-World Wireless Sensor Networks.

    Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco.

    TKDE 2015
  • What Does Model-Driven Data Acquisition Really Achieve in Wireless Sensor Networks?

    Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco.

    Best Paper Award

    PerCom 2012
  • Real-Time Data Analytics in Sensor Networks.

    Themis Palpanas.

    Springer 2012
  • Streaming Time Series Summarization Using User-Defined Amnesic Functions.

    Themis Palpanas, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos.

    TKDE 2008
  • Online Amnesic Approximation of Streaming Time Series.

    Themis Palpanas, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos, Wagner Truppel.

    ICDE 2004

  • Uncertainty

    Modeling tuples with value and existential uncertainty has several advantages. From an engineering perspective, a programmer can feed uncertain data directly into the system, without explicitly preprocessing data and forcing data approximations. From an application requirements perspective, maintaining possible values allows the application to provide results with confidence intervals.

  • Sliding Windows over Uncertain Data Streams

    Michele Dallachiesa, Gabriela Jacques-Silva, Bugra Gedik, Kun-Lung Wu, Themis Palpanas.

    KAIS 2015
  • Top-k Nearest Neighbor Search In Uncertain Data Series

    Michele Dallachiesa, Themis Palpanas, Ihab F. Ilyas.

    VLDB 2015
  • Uncertain Time-Series Similarity: Return to the Basics

    Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas.

    VLDB 2012
  • Similarity Matching for Uncertain Time Series:
    Analytical and Experimental Comparison

    Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas.

    QUeST @ GIS 2011
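
As a rough illustration of keeping possible values instead of forcing a single approximation, the sketch below models each uncertain point as a closed interval and reports the smallest and largest Euclidean distance two uncertain series could have. This interval model is an assumption made for the example, not the model used in the papers above, and it yields hard bounds rather than statistical confidence intervals. The function name `distance_bounds` is hypothetical.

```python
import math


def distance_bounds(series_a, series_b):
    """Bounds on the Euclidean distance between two uncertain series.

    Each series is a list of (low, high) intervals of possible values.
    Returns (min_distance, max_distance) over all possible realizations.
    """
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    min_sq, max_sq = 0.0, 0.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(series_a, series_b):
        # Smallest gap is zero when the intervals overlap.
        gap_min = max(0.0, a_lo - b_hi, b_lo - a_hi)
        # Largest gap is between opposite endpoints.
        gap_max = max(a_hi - b_lo, b_hi - a_lo)
        min_sq += gap_min ** 2
        max_sq += gap_max ** 2
    return math.sqrt(min_sq), math.sqrt(max_sq)


if __name__ == "__main__":
    a = [(0.0, 1.0), (2.0, 3.0)]
    b = [(0.5, 0.5), (5.0, 6.0)]  # first point of b is known exactly
    print(distance_bounds(a, b))
```

An application can return the pair of bounds with each result, so the user sees how much the answer could vary given the uncertainty in the input.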

Contact

France
LIPADE - Paris Descartes University
45 rue des Saints Pères
Paris 75006, France
Email: Prof. Themis Palpanas
 
USA
DASLab - Harvard University - Maxwell Dworkin 136
33 Oxford Street
Cambridge, MA 02138, USA
Email: Dr. Kostas Zoumpatianos