Storage and retrieval system for complex analytics on big sequence collections

Data Series

Data series (a.k.a. sequences, or time series) are present in virtually every scientific and social domain: from health care, astronomy and biology, to finance and the internet-of-things.

Scaling to Big Data

In astronomy, there are applications with more than 70TB of spectroscopic sequence data, while by 2025 scientists are expected to collect around 2-40 ExaBytes of DNA sequence data.

Interactive Science

Our research aims to change a landscape in which database systems are used merely for storing and retrieving data, by enabling scientists to transparently use specialized query processing systems to access their sequential data.

Features
Summarization and Indexing

NESTOR uses specialized summarization techniques both to reduce the size of data series and to enable fast analytics. It also supports the construction of domain-specific indexes and decides when to use them by performing access path selection. Such indexes facilitate both analytical queries (such as similarity search) and aggregation queries.
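
As a rough illustration of how a summarization can both shrink a series and speed up similarity search, the sketch below uses Piecewise Aggregate Approximation (PAA), the segment-mean summary that the SAX/iSAX family builds on. It is not NESTOR's implementation: the function names (`paa`, `paa_lower_bound`, `nearest_neighbor`) and the flat in-memory scan are assumptions made for this example. The key property is that the scaled distance between two PAA summaries never exceeds the true Euclidean distance, so a series can be skipped safely without reading it in full.

```python
import math

def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def paa_lower_bound(paa_a, paa_b, width):
    """Summary distance, scaled so that it never exceeds the true distance."""
    return math.sqrt(width) * euclidean(paa_a, paa_b)

def nearest_neighbor(query, collection, segments):
    """Exact 1-NN search that skips series whose lower bound is already too large."""
    width = len(query) // segments
    query_paa = paa(query, segments)
    # In a real system the summaries would be computed once and stored.
    summaries = [paa(s, segments) for s in collection]
    best_idx, best_dist, full_computations = -1, math.inf, 0
    for idx, (series, summary) in enumerate(zip(collection, summaries)):
        if paa_lower_bound(query_paa, summary, width) >= best_dist:
            continue  # pruned: this series cannot beat the best match so far
        full_computations += 1
        dist = euclidean(query, series)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist, full_computations
```

The third return value counts how many full distance computations were needed; on collections where most series are far from the query, it is typically much smaller than the collection size. An index organizes the summaries so that even the scan over summaries is avoided for most of the data.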

Parallelization

Data storage, indexing, and query processing can all scale to large clusters of computing nodes, enabling both multi-TB data processing and large analytical jobs that complete in seconds.
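
The general scatter-gather pattern behind partitioned query processing can be sketched as follows. This is only an illustration under stated assumptions: threads in a single process stand in for cluster nodes, the names `local_best` and `partitioned_nearest_neighbor` are hypothetical, and each partition is scanned exhaustively rather than through an index. In CPython, threads do not speed up this CPU-bound loop; the point is the structure, in which each worker answers the query on its own partition and the partial answers are merged.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_best(query, partition, offset):
    """Scan one partition; return (distance, global index) of its closest series."""
    return min((euclidean(query, s), offset + i) for i, s in enumerate(partition))

def partitioned_nearest_neighbor(query, collection, workers=4):
    """Scatter the scan over up to `workers` partitions, then gather the minimum."""
    if not collection:
        raise ValueError("collection must not be empty")
    size = math.ceil(len(collection) / workers)
    chunks = [(collection[start:start + size], start)
              for start in range(0, len(collection), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(local_best, query, part, off) for part, off in chunks]
        results = [f.result() for f in futures]
    dist, idx = min(results)  # gather step: the best of the per-partition answers
    return idx, dist
```

The result does not depend on the number of partitions, because the global nearest neighbor is always the best of the local ones.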

Adaptive Reorganization

NESTOR's storage layer continuously and adaptively reorganizes the underlying data layout in order to match the current workload, without incurring any additional overhead.

Modern Hardware Optimizations

We utilize modern hardware optimization techniques, such as SIMD, NUMA-aware multi-processing, GPUs, and SSD optimizations.

Team

Prof. Themis Palpanas (University of Paris)
Prof. Stratos Idreos (Harvard University)
Dr. Kostas Zoumpatianos (Harvard University & University of Paris)
PhD students: Qitong Wang, Ilias Azizi
Research Engineers: -
Collaborators: Anastasia Bezerianos, Niv Dayan, Panagiota Fatourou, Haridimos Kondylakis, Theophanis Tsandilas
Alumni: Paul Boniol, Manos Chatzakis, Karima Echihabi, Botao Peng, Luka Jakovljevic, Anna Gogolou, Michele Linardi, Federico Roncallo, Xucheng Tang



Research

We have developed the current state-of-the-art data series indexes: iSAX2+ (bulk loading), ADS+ and Dumpy (adaptive), DPiSAX and Odyssey (distributed), ParIS+ and Hercules (multi-core), SING (GPU), MESSI and Elpis (in-memory), Coconut-LSM (streaming series), ULISSE (variable-length), and ProS (progressive query answering). We have also developed the first data series query workload benchmark, as well as DSStat, a toolset for data series preprocessing and visualization.

We have applied our techniques on streaming and uncertain data series, and have worked with data from diverse domains, such as home networks, road tunnels, seismology, neuroscience, astrophysics, manufacturing, as well as from deep learning embeddings.

Extensive experimental evaluations demonstrate that our techniques are the state-of-the-art for exact search and approximate search with quality guarantees, and the only viable solution for disk-resident datasets for both data series and general high-dimensional vector datasets.

Moreover, we have developed unsupervised methods for subsequence anomaly detection: NormA and Series2Graph (offline), and SAND (online). These methods exhibit state-of-the-art performance across a variety of dataset characteristics and anomaly types, without the need to learn from domain knowledge, labeled data, or datasets clean from anomalies.

  • Tutorials

    In our tutorials we describe the most prevalent similarity search methods developed in both the data series and the high-dimensional communities, and comment on their merits and drawbacks. We present recent results from extensive experimental comparison studies, which demonstrate the superiority of the state-of-the-art data series methods. We also present and discuss the state-of-the-art methods in data series analytics, and subsequence anomaly detection in particular.

  • New Trends in Time Series Anomaly Detection.

    Paul Boniol, John Paparrizos, Themis Palpanas.

    EDBT 2023
  • Scalable Analytics on Large Sequence Collections.

    Karima Echihabi, Themis Palpanas.

    MDM 2022
  • New Trends in High-D Vector Similarity Search: AI-driven, Progressive, and Distributed.

    Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas.

    VLDB 2021
  • High-Dimensional Similarity Search for Scalable Data Science.

    Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas.

    ICDE 2021
  • Big Sequence Management: Scaling Up and Out.

    Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas.

    EDBT 2021
  • Big Sequence Management: on Scalability.

    Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas.

    IEEE BigData 2020
  • Big Sequence Management.

    Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas.

    ISCC 2020

    • Management

      There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of sequences (a.k.a. data series, or time series). Examples of such applications come from biology, astronomy, entomology, the web, and other domains.

    • Data Series Management (Dagstuhl Seminar 19282).

      Anthony Bagnall, Richard L. Cole, Themis Palpanas, Kostas Zoumpatianos.

      Dagstuhl Reports 2019
    • Report on the First and Second Interdisciplinary Time Series Analysis Workshop (ITISA).

      Themis Palpanas and Volker Beckmann.

      SIGMOD Record 2019
    • T-Store: Tunable Storage for Large Sequential Data [a.k.a. data series, or time series].

      Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

      NEDB 2019
    • Data Series Management: Fulfilling the Need for Big Sequence Analytics.

      Kostas Zoumpatianos, Themis Palpanas.

      ICDE 2018
    • The Parallel and Distributed Future of Data Series Mining. [Invited Paper]

      Themis Palpanas.

      HPCS 2017
    • Data Series Management: The Next Challenge

      Themis Palpanas.

      ICDE 2016
    • Big Sequence Management: A Glimpse on the Past, the Present, and the Future. [Invited Paper]

      Themis Palpanas.

      LNCS 2016
    • Data Series Management: The Road to Big Sequence Analytics.

      Themis Palpanas.

      SIGMOD Record 2015

    • Indexing

      For big data exploration, it is prohibitive to rely on full sequential scans for every single query, and therefore indexing is required. The target of our indexing techniques is to make query processing efficient enough that analysts can repeatedly fire exploratory queries with quick response times and low initialization costs.

    • FreSh: A Lock-Free Data Series Index.

      Panagiota Fatourou, Eleftherios Kosmas, Themis Palpanas, George Paterakis.

      SRDS 2023
    • Dumpy: A Compact and Adaptive Index for Large Data Series Collections.

      Zeyu Wang, Qitong Wang, Peng Wang, Themis Palpanas, Wei Wang.

      SIGMOD 2023
    • Elpis: Graph-Based Similarity Search for Scalable Data Science.

      Ilias Azizi, Karima Echihabi, Themis Palpanas.

      PVLDB 2023
    • Odyssey: A Journey in the Land of Distributed Data Series Similarity Search.

      Manos Chatzakis, Panagiota Fatourou, Eleftherios Kosmas, Themis Palpanas, Botao Peng.

      PVLDB 2023
    • Hercules Against Data Series Similarity Search.

      Karima Echihabi, Panagiota Fatourou, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim.

      PVLDB 2022
    • Efficient Range and kNN Twin Subsequence Search in Time Series.

      Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Themis Palpanas, Spiros Athanasiou, Spiros Skiadopoulos.

      TKDE 2022
    • Data Series Similarity Search via Deep Learning.

      Qitong Wang (supervised by: Themis Palpanas).

      VLDB PhD Workshop 2022
    • SING: Sequence Indexing Using GPUs.

      Botao Peng, Panagiota Fatourou, Themis Palpanas.

      ICDE 2021
    • Twin Subsequence Search in Time Series.

      Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Themis Palpanas, Spiros Athanasiou, Spiros Skiadopoulos.

      EDBT 2021
    • Fast Data Series Indexing for In-Memory Data.

      Botao Peng, Panagiota Fatourou, Themis Palpanas.

      VLDBJ 2021
    • BestNeighbor: Efficient Evaluation of kNN Queries on Large Time Series Databases.

      Oleksandra Levchenko, Boyan Kolev, Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas, Dennis Shasha, Patrick Valduriez.

      KAIS 2020
    • Scalable Data Series Subsequence Matching with ULISSE.

      Michele Linardi, Themis Palpanas.

      VLDBJ 2020
    • Evolution of a Data Series Index - The iSAX Family of Data Series Indexes.

      Themis Palpanas.

      CCIS 2020
    • ParIS+: Data Series Indexing on Multi-Core Architectures.

      Botao Peng, Panagiota Fatourou, Themis Palpanas.

      TKDE 2020
    • Massively Distributed Time Series Indexing and Querying.

      Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas.

      TKDE 2020
    • Data Series Indexing Gone Parallel.

      Botao Peng (supervised by Panagiota Fatourou, Themis Palpanas).

      ICDE (PhD Workshop) 2020
    • Coconut Palm: Static and Streaming Data Series Exploration Now in your Palm.

      Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas.

      SIGMOD 2019
    • Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search.

      Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim.

      PVLDB 2020
    • MESSI: In-Memory Data Series Indexing.

      Botao Peng, Panagiota Fatourou, Themis Palpanas.

      ICDE 2020
    • Truly Scalable Data Series Similarity Search.

      Karima Echihabi (supervised by Themis Palpanas and Houda Benbrahim).

      VLDB PhD Workshop 2019
    • Effective and Efficient Variable-Length Data Series Analytics.

      Michele Linardi (supervised by Themis Palpanas).

      VLDB PhD Workshop 2019
    • Coconut: Sortable Summarizations for Scalable Indexes over Static and Streaming Data Series.

      Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas.

      VLDBJ 2019
    • Local Similarity Search on Geolocated Time Series Using Hybrid Indexing.

      Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Themis Palpanas, Spiros Athanasiou, Spiros Skiadopoulos.

      SIGSPATIAL 2019
    • Distributed Algorithms to Find Similar Time Series.

      Oleksandra Levchenko, Boyan Kolev, Djamel-Edine Yagoubi, Dennis Shasha, Themis Palpanas, Patrick Valduriez, Reza Akbarinia, Florent Masseglia.

      ECML/PKDD 2019
    • Local Pair and Bundle Discovery over Co-Evolving Time Series.

      Georgios Chatzigeorgakidis, Dimitrios Skoutas, Kostas Patroumpas, Themis Palpanas, Spiros Athanasiou, Spiros Skiadopoulos.

      SSTD 2019
    • The Lernaean Hydra of Data Series Similarity Search: An Experimental Evaluation of the State of the Art.

      Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, Houda Benbrahim.

      PVLDB 2019
    • Scalable, Variable-Length Similarity Search in Data Series: The ULISSE Approach.

      Michele Linardi, Themis Palpanas.

      PVLDB 2019
    • Generating Data Series Query Workloads.

      Kostas Zoumpatianos, Yin Lou, Ioana Ileana, Themis Palpanas, Johannes Gehrke.

      VLDBJ 2018
    • Massively Distributed Time Series Indexing and Querying.

      Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas.

      TKDE 2018
    • ParIS: The Next Destination for Fast Data Series Indexing and Query Answering.

      Botao Peng, Themis Palpanas, Panagiota Fatourou.

      IEEE BigData 2018
    • Coconut: A Scalable Bottom-Up Approach for Building Data Series Indexes

      Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas.

      PVLDB 2018
    • ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series

      Michele Linardi, Themis Palpanas.

      ICDE 2018
    • DPiSAX: Massively Distributed Partitioned iSAX

      Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas.

      ICDM 2017
    • ADS: The Adaptive Data Series Index

      Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

      VLDBJ 2016
    • Query Workloads for Data-Series Indexes

      Kostas Zoumpatianos, Yin Lou, Themis Palpanas, Johannes Gehrke.

      KDD 2015
    • Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+

      Alessandro Camerra, Jin Shieh, Themis Palpanas, Thanawin Rakthanmanon, Eamonn Keogh.

      KAIS 2014
    • Indexing for Interactive Exploration of Big Data Series

      Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

      SIGMOD 2014
    • iSAX 2.0: Indexing and Mining One Billion Time Series.

      Alessandro Camerra, Themis Palpanas, Jin Shieh, Eamonn Keogh.

      ICDM 2010
    • Indexing Large Human-Motion Databases

      Eamonn Keogh, Themis Palpanas, Victor B. Zordan, Dimitrios Gunopulos, Marc Cardle.

      VLDB 2004

    • Analytics

      Examples of analysis operations are queries by content (range and similarity queries, nearest neighbors), clustering, classification, outlier patterns, frequent sub-sequences, and others.

    • Choose Wisely: An Extensive Evaluation of Model Selection for Anomaly Detection in Time Series.

      Emmanouil Sylligardos, Paul Boniol, John Paparrizos, Panos Trahanias, Themis Palpanas.

      PVLDB 2023
    • Appliance Detection Using Very Low-Frequency Smart Meter Time Series.

      Adrien Petralia, Philippe Charpentier, Paul Boniol, Themis Palpanas.

      e-Energy 2023
    • dCAM: Dimension-wise Activation Map for Explaining Multivariate Data Series Classification.

      Paul Boniol, Mohammed Meftah, Emmanuel Remy, Themis Palpanas.

      SIGMOD 2022
    • iEDeaL: A Deep Learning Framework for Detecting Highly Imbalanced Interictal Epileptiform Discharges.

      Qitong Wang, Stephen Whitmarsh, Vincent Navarro, Themis Palpanas.

      PVLDB 2022
    • Predicting Dyslexia in Adolescents from Eye Movements During Free Painting Viewing.

      Alae Eddine El Hmimdi, Lindsey M Ward, Themis Palpanas, Vivien Sainte Fare Garnot, Zoi Kapoula.

      BrainSci 2022
    • Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection.

      John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S. Tsay, Aaron Elmore, Michael J. Franklin.

      PVLDB 2022
    • Theseus: Navigating the Labyrinth of Subsequence Anomaly Detection.

      Paul Boniol, John Paparrizos, Yuhao Kang, Themis Palpanas, Ruey Tsay, Aaron J. Elmore, Michael J. Franklin.

      PVLDB 2022
    • TSB-UAD: An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection.

      John Paparrizos, Yuhao Kang, Ruey Tsay, Paul Boniol, Themis Palpanas, Michael J. Franklin.

      PVLDB 2022
    • Predicting Dyslexia and Reading Speed in Adolescents from Eye Movements in Reading and Non-Reading Tasks: a Machine Learning Approach.

      Alae Eddine El Hmimdi, Lindsey M Ward, Themis Palpanas, Zoi Kapoula.

      BrainSci 2021
    • SAND: Streaming Subsequence Anomaly Detection.

      Paul Boniol, John Paparrizos, Themis Palpanas, Michael J. Franklin.

      PVLDB 2021
    • SAND in Action: Subsequence Anomaly Detection for Streams.

      Paul Boniol, John Paparrizos, Themis Palpanas, Michael J. Franklin.

      PVLDB 2021
    • Electricity Demand Activation Extraction: From Known to Unknown Signatures, Using Similarity Search.

      Pauline Laviron, Zueqi Dai, Berenice Huquet, Themis Palpanas.

      e-Energy 2021
    • Unsupervised and Scalable Subsequence Anomaly Detection in Large Data Series.

      Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas, Mohammed Meftah, Emmanuel Remy.

      VLDBJ 2021
    • GraphAn: Graph-based Subsequence Anomaly Detection.

      Paul Boniol, Themis Palpanas, Mohammed Meftah, Emmanuel Remy.

      PVLDB 2020
    • Series2Graph: Graph-based Subsequence Anomaly Detection for Time Series.

      Paul Boniol, Themis Palpanas.

      PVLDB 2020
    • Unsupervised Subsequence Anomaly Detection in Large Sequences.

      Paul Boniol (supervised by Themis Palpanas, Mohammed Meftah, Emmanuel Remy).

      VLDB PhD Workshop 2020
    • Scalable Machine Learning on High-Dimensional Vectors: From Data Series to Deep Network Embeddings.

      Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas.

      WIMS 2020
    • Automated Anomaly Detection in Large Sequences.

      Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas.

      ICDE 2020
    • SAD: An Unsupervised System for Subsequence Anomaly Detection.

      Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas.

      ICDE 2020
    • Matrix Profile Goes MAD: Variable-Length Motif and Discord Discovery in Data Series.

      Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn Keogh.

      DAMI 2020
    • Matrix Profile X: VALMOD - Scalable Discovery of Variable-Length Motifs in Data Series.

      Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn Keogh.

      SIGMOD 2018
    • VALMOD: A Suite for Easy and Exact Detection of Variable Length Motifs in Data Series.

      Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn Keogh.

      SIGMOD 2018
    • Data Series Similarity Using Correlation-Aware Measures.

      Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas.

      SSDBM 2017
    • Correlation-Aware Distance Measures for Data Series

      Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas.

      EDBT 2017
    • Time Series Analysis for Near-Infrared Spectroscopy Data.

      Novri Suhermi, Judit Gervain, Themis Palpanas.

      fNIRS 2016
    • Characterizing Home Device Usage From Wireless Traffic Time Series.

      Katsiaryna Mirylenka, Vassilis Christophides, Themis Palpanas, Ioannis Pefkianakis, Martin May.

      EDBT 2016
    • Envelope-Based Anomaly Detection for High-Speed Manufacturing Processes.

      Katsiaryna Mirylenka, Alice Marascu, Themis Palpanas, Matthias Fehr, Stefan Jank, Gunter Welde, Daniel Groeber.

      APC|M 2013
    • Finding Interesting Correlations with Conditional Heavy Hitters.

      Katsiaryna Mirylenka, Themis Palpanas, Graham Cormode, Divesh Srivastava.

      ICDE 2013
    • Scalable Similarity Matching in Streaming Time Series.

      Alice Marascu, Suleiman Ali Khan, Themis Palpanas.

      PAKDD 2012
    • Real-Time Data Analytics in Sensor Networks.

      Themis Palpanas.

      Springer 2012

    • Exploration

      Using our techniques, users can explore large datasets and find patterns of interest, using nearest neighbor search. They can draw queries (data series) using a mouse, or touch screen, or they can select from their own datasets.

    • ProS: Data Series Progressive k-NN Similarity Search and Classification with Probabilistic Quality Guarantees.

      Karima Echihabi, Theophanis Tsandilas, Anna Gogolou, Anastasia Bezerianos, Themis Palpanas.

      VLDBJ 2022
    • Data Series Progressive Similarity Search with Probabilistic Quality Guarantees.

      Anna Gogolou, Theophanis Tsandilas, Karima Echihabi, Anastasia Bezerianos, Themis Palpanas.

      SIGMOD 2020
    • Progressive Similarity Search on Time Series Data.

      Anna Gogolou, Theophanis Tsandilas, Themis Palpanas, Anastasia Bezerianos.

      BigVis@EDBT 2019
    • Comparing Similarity Perception in Time Series Visualizations.

      Anna Gogolou, Theophanis Tsandilas, Themis Palpanas, Anastasia Bezerianos.

      TVCG 2019
    • Comparing Similarity Perception in Time Series Visualizations

      Anna Gogolou, Theophanis Tsandilas, Themis Palpanas, Anastasia Bezerianos.

      IEEE VIS 2018
    • RINSE: Interactive Data Series Exploration with ADS+

      Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

      VLDB 2015

    • Summarization

      In order to support time- and space-efficient management and analytics, data series need to be summarized. Different summarization techniques are applicable to different applications and problem settings.

    • SEAnet: A Deep Learning Architecture for Data Series Similarity Search.

      Qitong Wang, Themis Palpanas.

      TKDE 2023
    • Deep Learning Embeddings for Data Series Similarity Search.

      Qitong Wang, Themis Palpanas.

      KDD 2021
    • Practical Data Prediction for Real-World Wireless Sensor Networks.

      Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco.

      TKDE 2015
    • What Does Model-Driven Data Acquisition Really Achieve in Wireless Sensor Networks?

      Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco.

      Best Paper Award

      PerCom 2012
    • Real-Time Data Analytics in Sensor Networks.

      Themis Palpanas.

      Springer 2012
    • Streaming Time Series Summarization Using User-Defined Amnesic Functions.

      Themis Palpanas, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos.

      TKDE 2008
    • Online Amnesic Approximation of Streaming Time Series.

      Themis Palpanas, Michail Vlachos, Eamonn Keogh, Dimitrios Gunopulos, Wagner Truppel.

      ICDE 2004

    • Uncertainty

      Modeling tuples with value and existential uncertainty has several advantages. From an engineering perspective, a programmer can feed uncertain data directly into the system, without explicitly preprocessing data and forcing data approximations. From an application requirements perspective, maintaining possible values allows the application to provide results with confidence intervals.

    • Sliding Windows over Uncertain Data Streams

      Michele Dallachiesa, Gabriela Jacques-Silva, Bugra Gedik, Kun-Lung Wu, Themis Palpanas.

      KAIS 2015
    • Top-k Nearest Neighbor Search In Uncertain Data Series

      Michele Dallachiesa, Themis Palpanas, Ihab F. Ilyas.

      VLDB 2015
    • Uncertain Time-Series Similarity: Return to the Basics

      Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas.

      VLDB 2012
    • Similarity Matching for Uncertain Time Series: Analytical and Experimental Comparison

      Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas.

      QUeST @ GIS 2011

Contact

France
Email: Prof. Themis Palpanas
LIPADE - University of Paris
45 rue des Saints Pères
Paris 75006, France