Storage and retrieval system for complex analytics on big sequence collections

Data Series

Data series (a.k.a. sequences, or time series) are present in virtually every scientific and social domain: from health care, astronomy and biology, to finance and the internet-of-things.

Scaling to Big Data

In astronomy, some applications hold more than 70 TB of spectroscopic sequence data, and by 2025 scientists are expected to collect between 2 and 40 exabytes of DNA sequence data.

Interactive Science

Our research aims to change the current landscape, in which database systems are used merely for storing and retrieving data, by enabling scientists to transparently use specialized query processing systems to access their sequential data.

Features
Summarization and Indexing

NESTOR uses specialized summarization techniques both to reduce the size of data series and to enable fast analytics. It also supports the construction of domain-specific indexes and decides when to use them by performing access path selection. Such indexes facilitate both analytical queries (such as similarity search) and aggregation queries.
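
As an illustration of how a summary can both shrink a series and speed up similarity search, here is a minimal sketch that uses Piecewise Aggregate Approximation (PAA), a standard summarization for data series. It is not NESTOR's implementation: the function names, the choice of PAA, and the plain Euclidean distance are assumptions made for the example. The key property is that the distance between two summaries never exceeds the true distance, so a candidate whose lower bound is already worse than the best match found so far can be skipped without reading its raw values.

```python
import math


def paa(series, segments):
    """Summarize a series as the mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa_lower_bound(paa_a, paa_b, width):
    """Distance on summaries, scaled by the segment width.

    This value is never larger than the Euclidean distance
    between the two original series.
    """
    return math.sqrt(width * sum((x - y) ** 2 for x, y in zip(paa_a, paa_b)))


def nearest_neighbor(query, collection, segments):
    """Exact 1-NN search that skips candidates using the PAA lower bound.

    Returns (index, distance, number_of_pruned_candidates).
    """
    width = len(query) // segments
    q_paa = paa(query, segments)
    # A real system would precompute and store these summaries.
    summaries = [paa(s, segments) for s in collection]
    best_idx, best_dist, pruned = -1, math.inf, 0
    for idx, series in enumerate(collection):
        if paa_lower_bound(q_paa, summaries[idx], width) >= best_dist:
            pruned += 1  # cannot beat the current best, skip the raw data
            continue
        d = euclidean(query, series)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist, pruned


collection = [[0, 0, 0, 0], [1, 1, 1, 1], [10, 10, 10, 10]]
idx, dist, pruned = nearest_neighbor([1, 1, 2, 2], collection, segments=2)
print(idx, round(dist, 3), pruned)
```

In this toy run the third series is discarded from its two-value summary alone. The saving matters when the raw series are long and stored on disk.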

Parallelization

Data storage, indexing, and query processing can all scale to large clusters of computing nodes, which allows multi-TB datasets to be processed and large analytical jobs to be completed in seconds.
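
The general pattern behind this kind of scaling can be sketched as follows: split the collection into partitions, let each worker find its local best match, and merge the local answers. This is only a single-machine illustration under stated assumptions. The function names are hypothetical, threads stand in for cluster nodes (in CPython they do not give real CPU parallelism for this workload), and the partitioning is a simple round-robin rather than anything NESTOR is documented to use.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def partition_collection(collection, parts):
    """Round-robin split that keeps each series' global id."""
    chunks = [[] for _ in range(parts)]
    for gid, series in enumerate(collection):
        chunks[gid % parts].append((gid, series))
    return [chunk for chunk in chunks if chunk]  # drop empty partitions


def local_nearest(query, partition):
    """Best (distance, global_id) pair inside one partition."""
    return min((euclidean(query, series), gid) for gid, series in partition)


def parallel_nearest(query, collection, workers=4):
    """Scatter the query to all partitions, then merge the local answers."""
    if not collection:
        raise ValueError("collection must not be empty")
    partitions = partition_collection(collection, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_results = list(
            pool.map(lambda part: local_nearest(query, part), partitions))
    return min(local_results)  # global best across all partitions


collection = [[0, 0], [3, 4], [1, 1], [5, 5], [0.5, 0.5]]
distance, series_id = parallel_nearest([1, 1.2], collection, workers=2)
print(series_id, round(distance, 3))
```

The merge step only compares one small result per partition, so the communication cost stays low however large each partition is.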

Adaptive Reorganization

NESTOR's storage layer continuously and adaptively reorganizes the underlying data layout in order to match the current workload, without incurring any additional overhead.

Modern Hardware Optimizations

We aim to exploit modern hardware optimization techniques, such as SIMD, NUMA-aware multi-processing, GPUs, and SSD optimizations.

Team

Prof. Themis Palpanas (University of Paris)
Prof. Stratos Idreos (Harvard University)
Dr. Kostas Zoumpatianos (Harvard University & University of Paris)
PhD students: Paul Boniol, Karima Echihabi, Botao Peng, Qitong Wang, Luka Jakovljevic
Research Engineers: -
Collaborators: Anastasia Bezerianos, Niv Dayan, Panagiota Fatourou, Haridimos Kondylakis, Theophanis Tsandilas
Alumni: Anna Gogolou, Michele Linardi, Federico Roncallo, Xucheng Tang



Research

We have developed the current state-of-the-art data series indexes, iSAX2+ (bulk loading), DPiSAX (distributed), ADS+ (adaptive), Coconut (update friendly), and ULISSE (variable-length), the first data series query workload benchmark, as well as DSStat, a toolset for data series preprocessing and visualization.

We have applied our techniques on streaming and uncertain data series, and have worked with data from diverse domains, such as home networks, road tunnels, and manufacturing.

  • Management

    Applications in diverse domains have an increasingly pressing need for techniques that can index and mine very large collections of sequences (a.k.a. data series, or time series). Examples of such applications come from biology, astronomy, entomology, the web, and other domains.

  • Data Series Management (Dagstuhl Seminar 19282).

    Anthony Bagnall, Richard L. Cole, Themis Palpanas, Kostas Zoumpatianos.

    Dagstuhl Reports 2019.
  • Report on the First and Second Interdisciplinary Time Series [a.k.a. data series, or sequences] Analysis Workshop (ITISA).

    Themis Palpanas and Volker Beckmann.

    SIGMOD Record 2019
  • T-Store: Tunable Storage for Large Sequential Data [a.k.a. data series, or time series].

    Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

    NEDB 2019
  • Data Series Management: Fulfilling the Need for Big Sequence Analytics.

    Kostas Zoumpatianos, Themis Palpanas.

    ICDE 2018
  • The Parallel and Distributed Future of Data Series Mining. [Invited Paper]

    Themis Palpanas.

    HPCS 2017
  • Data Series Management: The Next Challenge

    Themis Palpanas.

    ICDE 2016
  • Big Sequence Management: A Glimpse on the Past, the Present, and
    the Future. [Invited Paper]

    Themis Palpanas.

    LNCS 2016
  • Data Series Management: The Road to Big Sequence Analytics.

    Themis Palpanas.

    SIGMOD Record 2015

  • Indexing

    For big data exploration, it is prohibitive to rely on full sequential scans for every single query, and therefore indexing is required. The goal of our indexing techniques is to make query processing efficient enough that analysts can repeatedly fire exploratory queries with quick response times and low initialization costs.

  • SING: Sequence Indexing Using GPUs.

    Botao Peng, Panagiota Fatourou, Themis Palpanas.

    ICDE 2021
  • Twin Subsequence Search in Time Series.

    Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Themis Palpanas, Spiros Athanasiou, Spiros Skiadopoulos.

    EDBT 2021
  • Fast Data Series Indexing for In-Memory Data.

    Botao Peng, Panagiota Fatourou, Themis Palpanas.

    VLDBJ 2021
  • BestNeighbor: Efficient Evaluation of kNN Queries on Large Time Series Databases.

    Oleksandra Levchenko, Boyan Kolev, Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas, Dennis Shasha, Patrick Valduriez.

    KAIS 2020
  • Scalable Data Series Subsequence Matching with ULISSE.

    Michele Linardi, Themis Palpanas.

    VLDBJ 2020
  • Evolution of a Data Series Index - The iSAX Family of Data Series Indexes.

    Themis Palpanas.

    CCIS 2020
  • ParIS+: Data Series Indexing on Multi-Core Architectures.

    Botao Peng, Panagiota Fatourou, Themis Palpanas.

    TKDE 2020
  • Massively Distributed Time Series Indexing and Querying.

    Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas

    TKDE 2020
  • Data Series Indexing Gone Parallel.

    Botao Peng (supervised by Panagiota Fatourou, Themis Palpanas)

    ICDE (PhD Workshop) 2020
  • Coconut Palm: Static and Streaming Data Series Exploration Now in your Palm.

    Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas.

    SIGMOD 2019
  • Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search.

    Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim.

    PVLDB 2020
  • MESSI: In-Memory Data Series Indexing.

    Botao Peng, Panagiota Fatourou, Themis Palpanas.

    ICDE 2020
  • Truly Scalable Data Series Similarity Search.

    Karima Echihabi (supervised by Themis Palpanas and Houda Benbrahim).

    VLDB PhD Workshop 2019
  • Effective and Efficient Variable-Length Data Series Analytics.

    Michele Linardi (supervised by Themis Palpanas).

    VLDB PhD Workshop 2019
  • Coconut: Sortable Summarizations for Scalable Indexes over Static and Streaming Data Series.

    Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas.

    VLDBJ 2019
  • Local Similarity Search on Geolocated Time Series Using Hybrid Indexing.

    Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Themis Palpanas, Spiros Athanasiou, Spiros Skiadopoulos.

    SIGSPATIAL 2019
  • Distributed Algorithms to Find Similar Time Series.

    Oleksandra Levchenko, Boyan Kolev, Djamel-Edine Yagoubi, Dennis Shasha, Themis Palpanas, Patrick Valduriez, Reza Akbarinia, Florent Masseglia.

    ECML/PKDD 2019
  • Local Pair and Bundle Discovery over Co-Evolving Time Series.

    Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Themis Palpanas, Spiros Athanasiou, Spiros Skiadopoulos.

    SSTD 2019
  • The Lernaean Hydra of Data Series Similarity Search: An Experimental Evaluation of the State of the Art.

    Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim.

    PVLDB 2019
  • Scalable, Variable-Length Similarity Search in Data Series: The ULISSE Approach.

    Michele Linardi, Themis Palpanas.

    PVLDB 2019
  • Generating Data Series Query Workloads.

    Kostas Zoumpatianos, Yin Lou, Ioana Ileana, Themis Palpanas, Johannes Gehrke.

    VLDBJ 2018
  • Massively Distributed Time Series Indexing and Querying.

    Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas.

    TKDE 2018
  • ParIS: The Next Destination for Fast Data Series Indexing and Query Answering.

    Botao Peng, Themis Palpanas, Panagiota Fatourou.

    IEEE BigData 2018
  • Coconut: A Scalable Bottom-Up Approach for Building Data Series Indexes

    Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas.

    PVLDB 2018
  • ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series

    Michele Linardi, Themis Palpanas.

    ICDE 2018
  • DPiSAX: Massively Distributed Partitioned iSAX

    Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas.

    ICDM 2017
  • ADS: The Adaptive Data Series Index

    Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

    VLDBJ 2016
  • Query Workloads for Data-Series Indexes

    Kostas Zoumpatianos, Yin Lou, Themis Palpanas, Johannes Gehrke.

    KDD 2015
  • Beyond One Billion Time Series: Indexing and Mining Very Large Time Series
    Collections with iSAX2+

    Alessandro Camerra, Jin Shieh, Themis Palpanas, Thanawin Rakthanmanon, Eamonn Keogh.

    KAIS 2014
  • Indexing for Interactive Exploration of Big Data Series

    Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

    SIGMOD 2014
  • iSAX 2.0: Indexing and Mining One Billion Time Series.

    Alessandro Camerra, Themis Palpanas, Jin Shieh, Eamonn Keogh.

    ICDM 2010
  • Indexing Large Human-Motion Databases

    Eamonn Keogh, Themis Palpanas, Victor B. Zordan, Dimitrios Gunopulos, Marc Cardle.

    VLDB 2004
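
To give a feel for how an index avoids a full sequential scan, the sketch below groups series by a short symbolic word (in the spirit of SAX, on which the iSAX family builds) and answers an approximate nearest neighbor query by looking only at the bucket that shares the query's word. It is a deliberately simplified, flat, fixed-cardinality scheme. The class and function names are hypothetical, and the actual iSAX-family indexes listed above use tree structures, variable cardinalities, and lower bounds for exact search, none of which are reproduced here.

```python
import math
from collections import defaultdict

# Approximate quartile breakpoints of the standard normal distribution,
# giving an alphabet of 4 symbols (0..3).
BREAKPOINTS = (-0.67, 0.0, 0.67)


def znorm(series):
    """Z-normalize a series; a constant series maps to all zeros."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]


def sax_word(series, segments):
    """Map each segment mean to the number of breakpoints it exceeds."""
    width = len(series) // segments
    word = []
    for i in range(segments):
        mean = sum(series[i * width:(i + 1) * width]) / width
        word.append(sum(mean > b for b in BREAKPOINTS))
    return tuple(word)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class BucketIndex:
    """Toy index: one bucket of (id, normalized series) per symbolic word."""

    def __init__(self, segments):
        self.segments = segments
        self.buckets = defaultdict(list)
        self.next_id = 0

    def insert(self, series):
        z = znorm(series)
        self.buckets[sax_word(z, self.segments)].append((self.next_id, z))
        self.next_id += 1

    def approximate_nn(self, query):
        """Return (id, distance) from the query's bucket, or None if empty."""
        z = znorm(query)
        candidates = self.buckets.get(sax_word(z, self.segments), [])
        if not candidates:
            return None
        return min(((sid, euclidean(z, s)) for sid, s in candidates),
                   key=lambda pair: pair[1])


index = BucketIndex(segments=2)
for s in ([1, 2, 3, 4], [4, 3, 2, 1], [2, 4, 6, 9]):
    index.insert(s)
print(index.approximate_nn([10, 20, 30, 41]))
```

The answer is approximate because the true nearest neighbor may sit in a neighboring bucket. Exact search needs an additional pruning step based on lower-bounding distances.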

  • Analytics

    Examples of analysis operations are queries by content (range and similarity queries, nearest neighbors), clustering, classification, outlier patterns, frequent sub-sequences, and others.

  • Electricity Demand Activation Extraction: From Known to Unknown Signatures, Using Similarity Search.

    Pauline Laviron, Zueqi Dai, Berenice Huquet, Themis Palpanas.

    e-Energy 2021
  • SAND: Streaming Subsequence Anomaly Detection.

    Paul Boniol, John Paparrizos, Themis Palpanas, Michael J. Franklin.

    PVLDB 2021
  • Unsupervised and Scalable Subsequence Anomaly Detection in Large Data Series.

    Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas, Mohammed Meftah, Emmanuel Remy.

    VLDBJ 2021
  • GraphAn: Graph-based Subsequence Anomaly Detection.

    Paul Boniol, Themis Palpanas, Mohammed Meftah, Emmanuel Remy.

    PVLDB 2020
  • Series2Graph: Graph-based Subsequence Anomaly Detection for Time Series.

    Paul Boniol, Themis Palpanas.

    PVLDB 2020
  • Unsupervised Subsequence Anomaly Detection in Large Sequences.

    Paul Boniol (supervised by Themis Palpanas, Mohammed Meftah, Emmanuel Remy).

    VLDB PhD Workshop 2020
  • Scalable Machine Learning on High-Dimensional Vectors: From Data Series to Deep Network Embeddings.

    Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas.

    WIMS 2020
  • Automated Anomaly Detection in Large Sequences.

    Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas.

    ICDE 2020
  • SAD: An Unsupervised System for Subsequence Anomaly Detection.

    Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas.

    ICDE 2020
  • Matrix Profile Goes MAD: Variable-Length Motif and Discord Discovery in Data Series.

    Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn Keogh.

    DAMI 2020
  • Matrix Profile X: VALMOD - Scalable Discovery of Variable-Length
    Motifs in Data Series.

    Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn Keogh.

    SIGMOD 2018
  • VALMOD: A Suite for Easy and Exact Detection of Variable Length
    Motifs in Data Series.

    Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn Keogh.

    SIGMOD 2018
  • Data Series Similarity Using Correlation-Aware Measures.

    Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas.

    SSDBM 2017
  • Correlation-Aware Distance Measures for Data Series

    Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas.

    EDBT 2017
  • Time Series Analysis for Near-Infrared Spectroscopy Data.

    Novri Suhermi, Judit Gervain, Themis Palpanas.

    fNIRS 2016
  • Characterizing Home Device Usage From Wireless Traffic Time Series.

    Katsiaryna Mirylenka, Vassilis Christophides, Themis Palpanas, Ioannis Pefkianakis, Martin May.

    EDBT 2016
  • Envelope-Based Anomaly Detection for High-Speed Manufacturing Processes.

    Katsiaryna Mirylenka, Alice Marascu, Themis Palpanas, Matthias Fehr, Stefan Jank, Gunter Welde, Daniel Groeber.

    APC|M 2013
  • Finding Interesting Correlations with Conditional Heavy Hitters.

    Katsiaryna Mirylenka, Themis Palpanas, Graham Cormode, Divesh Srivastava.

    ICDE 2013
  • Scalable Similarity Matching in Streaming Time Series.

    Alice Marascu, Suleiman Ali Khan, Themis Palpanas.

    PAKDD 2012
  • Real-Time Data Analytics in Sensor Networks.

    Themis Palpanas.

    Springer 2012
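
One of the analysis operations named at the start of this list, finding outlier patterns, can be illustrated with the classic brute-force definition of a discord: the subsequence whose nearest non-overlapping neighbor is farthest away. The sketch below is a quadratic-time baseline under simplifying assumptions (raw Euclidean distance, no z-normalization, hypothetical function names). It is not the method of any of the papers listed above, which target scalability.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_discord(series, m):
    """Return (start_index, distance) of the length-m discord.

    For every subsequence, compute the distance to its nearest neighbor
    among subsequences that do not overlap it, then keep the subsequence
    for which that distance is largest.
    """
    subs = [series[i:i + m] for i in range(len(series) - m + 1)]
    best_idx, best_dist = -1, -1.0
    for i, a in enumerate(subs):
        nearest = math.inf
        for j, b in enumerate(subs):
            if abs(i - j) < m:
                continue  # skip trivial matches (overlapping windows)
            nearest = min(nearest, euclidean(a, b))
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist


# A periodic signal with a single spike at position 7.
signal = [0, 1, 0, 1, 0, 1, 0, 5, 0, 1, 0, 1, 0, 1, 0, 1]
print(find_discord(signal, m=2))
```

Every regular window in this signal has an identical copy elsewhere, so only the windows that contain the spike end up with a non-zero nearest-neighbor distance.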

  • Exploration

    With our techniques, users can explore large datasets and find patterns of interest through nearest neighbor search. They can draw queries (data series) with a mouse or on a touch screen, or select them from their own datasets.

  • Data Series Progressive Similarity Search with Probabilistic Quality Guarantees.

    Anna Gogolou, Theophanis Tsandilas, Karima Echihabi, Anastasia Bezerianos, Themis Palpanas.

    SIGMOD 2020
  • Progressive Similarity Search on Time Series Data.

    Anna Gogolou, Theophanis Tsandilas, Themis Palpanas, Anastasia Bezerianos.

    BigVis@EDBT 2019
  • Comparing Similarity Perception in Time Series Visualizations.

    Anna Gogolou, Theophanis Tsandilas, Themis Palpanas, Anastasia Bezerianos.

    TVCG 2019
  • Comparing Similarity Perception in Time Series Visualizations

    Anna Gogolou, Theophanis Tsandilas, Themis Palpanas, Anastasia Bezerianos.

    IEEE VIS 2018
  • RINSE: Interactive Data Series Exploration with ADS+

    Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

    VLDB 2015

  • Summarization

    In order to support time- and space-efficient management and analytics, data series need to be summarized. Different summarization techniques are applicable to different applications and problem settings.

  • Practical Data Prediction for Real-World Wireless Sensor Networks.

    Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco.

    TKDE 2015
  • What Does Model-Driven Data Acquisition Really Achieve in Wireless Sensor Networks?

    Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco.

    Best Paper Award

    PerCom 2012
  • Real-Time Data Analytics in Sensor Networks.

    Themis Palpanas.

    Springer 2012
  • Streaming Time Series Summarization Using User-Defined Amnesic Functions.

    Themis Palpanas, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos.

    TKDE 2008
  • Online Amnesic Approximation of Streaming Time Series.

    Themis Palpanas, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos, Wagner Truppel.

    ICDE 2004

  • Uncertainty

    Modeling tuples with value and existential uncertainty has several advantages. From an engineering perspective, a programmer can feed uncertain data directly into the system, without explicitly preprocessing data and forcing data approximations. From an application requirements perspective, maintaining possible values allows the application to provide results with confidence intervals.

  • Sliding Windows over Uncertain Data Streams

    Michele Dallachiesa, Gabriela Jacques-Silva, Bugra Gedik, Kun-Lung Wu, Themis Palpanas.

    KAIS 2015
  • Top-k Nearest Neighbor Search In Uncertain Data Series

    Michele Dallachiesa, Themis Palpanas, Ihab F. Ilyas.

    VLDB 2015
  • Uncertain Time-Series Similarity: Return to the Basics

    Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas.

    VLDB 2012
  • Similarity Matching for Uncertain Time Series:
    Analytical and Experimental Comparison

    Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas.

    QUeST @ GIS 2011
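
The idea of keeping possible values so that results carry confidence intervals can be sketched for a single distance computation. The example assumes that each point of the uncertain series is an independent Gaussian with a known mean and standard deviation. That model, the function names, and the Monte Carlo interval are assumptions for illustration, not the formulation used in the papers above. Under this model the expected squared Euclidean distance to an exact query has a closed form: the squared distance to the means plus the sum of the variances.

```python
import math
import random


def expected_squared_distance(query, means, stds):
    """E[sum (X_i - q_i)^2] for independent X_i ~ N(mean_i, std_i^2)."""
    return sum((mu - q) ** 2 + sd ** 2
               for q, mu, sd in zip(query, means, stds))


def distance_interval(query, means, stds, samples=2000, level=0.95, seed=0):
    """Empirical interval for the Euclidean distance via sampling.

    Draws possible realizations of the uncertain series and returns the
    lower and upper empirical quantiles of the resulting distances.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    dists = []
    for _ in range(samples):
        realization = [rng.gauss(mu, sd) for mu, sd in zip(means, stds)]
        dists.append(math.sqrt(
            sum((x - q) ** 2 for x, q in zip(realization, query))))
    dists.sort()
    low = dists[int((1 - level) / 2 * samples)]
    high = dists[min(samples - 1, int((1 + level) / 2 * samples))]
    return low, high


query = [0.0, 0.0]
means, stds = [3.0, 4.0], [1.0, 2.0]
print(expected_squared_distance(query, means, stds))
print(distance_interval(query, means, stds))
```

An application can then report the distance together with its interval, instead of collapsing each uncertain point to a single value before querying.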

Contact

France
Email: Prof. Themis Palpanas
LIPADE - University of Paris
45 rue des Saints Pères
Paris 75006, France