The conference program is available.
SESSION 1. Conceptual Modeling and Ontologies. Small conference hall. Chair: Nikolay Skvortsov
Vladimir Budzko and Viktor Medennikov. Formation of the Information Landscape of the Digital Economy: The Case of Agriculture.
Abstract
The paper considers an iconographic and formalized description of
the integration of information resources and of the algorithms for
their use. This description reflects a substantial part of the basic
principles of the digital economy and makes it possible to develop a
mathematical model for forming a digital management platform (DMP)
for the economy. The DMP is a complementary union of three
sub-platforms (digital economy standards): collection of primary
accounting information, accumulated for further active use in a
cloud database (DB) common to all production industries of Russia;
unified information DBs reflecting the technological specifics of a
particular industry; and unified knowledge bases reflecting
managerial decision-making, likewise for a particular industry. The
DMP is a digital instrument for an effective transition from
fragmented methods of designing and developing information systems
to a comprehensive, integrated approach in the digital economy. The
DMP includes unified industry-level and federal information models,
classifiers, and reference books. Using as an example the life cycle
of an optimized crop-rotation structure, which determines all
processes in agriculture, the information models of this subsystem
are considered. It is shown that the DMP with the three resulting
digital standards, combined with mature information modeling
methods, is an effective instrument for building an automated
management system (AMS) for an agricultural enterprise, ensuring
information compatibility of the AMSs of most such enterprises and,
as a consequence, effective solutions to most of the problems facing
producers. In addition, the information and algorithmic
compatibility of the AMSs will ensure transparency of industry
management at the regional and federal levels at all stages of
production.
Anna Shiyan, Anton Larin, Ildar Baimuratov and Nataly Zhukova. Automatic Information Retrieval and Extraction Methodology for the Ontology of Plant Diseases.
Abstract
Detecting plant diseases is a challenge for farmers; however,
solutions are offered by computer vision and image processing. A
major issue is the limitation of information obtained only from the
image. Using computer vision alone does not take into account
weather conditions and visual similarity of symptoms between
diseases. These challenges can be addressed by developing an expert
system that uses an ontology containing knowledge about plant
diseases, pathogens and symptoms. We developed an ontology of plant
diseases by integrating existing ontologies and adding
disease-causing factors. Our Plants and their Diseases Ontology
(PatDO) can improve diagnostic accuracy by incorporating detailed
symptom descriptions and linking them with specific pathogens.
Wikidata served as the primary source for taxonomic data, with
SPARQL queries extracting relationships among plants, pathogens,
and diseases. Using data from the American Phytopathological
Society and the European and Mediterranean Plant Protection
Organization (EPPO), we identified relationships previously
undocumented in Wikidata. In addition, large language models helped
extract different symptoms of the pathogens that cause plant
diseases from EPPO. The final ontology consists of 5002 classes and
8 properties that connect various entities, including plants,
pathogens and symptoms.
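To illustrate the kind of Wikidata extraction the abstract describes, the sketch below issues a SPARQL query through Python's SPARQLWrapper. The property and item identifiers (P2975 "host", P171 "parent taxon", Q756 "plant") and the query shape are our illustrative assumptions, not the authors' actual queries.

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
    endpoint.setQuery("""
    SELECT ?pathogen ?pathogenLabel ?host ?hostLabel WHERE {
      ?pathogen wdt:P2975 ?host .        # pathogen's host organism
      ?host wdt:P171* wd:Q756 .          # host lies within the plant taxon
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100
    """)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["pathogenLabel"]["value"], "->", row["hostLabel"]["value"])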
Nikolai Kalinin and Nikolay Skvortsov. An Approach to Ontology-Based Domain Analysis for Cross-Domain Research Infrastructure Development.
Abstract
Effective development of Research Infrastructures (RIs) depends on
a thorough understanding of domain-specific needs. Despite its
importance, domain analysis remains underrepresented in RI
development methodologies. Building on previous work utilizing
literature analysis, this article introduces an approach based on
the analysis of ontological sources. Ontologies, as structured
representations of domain knowledge, offer both a foundation for
metadata schema creation and a means to capture domain requirements
systematically. Moreover, ontologies play a key role in
implementing the FAIR principles by enabling data to be Findable,
Accessible, Interoperable, and Reusable through shared semantics
and standardized vocabularies. We propose the construction of an
ontology for research infrastructure development, based on the
DOLCE UltraLite (DUL) foundational ontology, and describe an
approach for mapping it to existing domain-specific ontologies.
Through this mapping process, we create an ontology network that
supports comprehensive domain analysis and enables semantic
interoperability across disciplines. This integrated ontology
network provides a robust basis for domain analysis and for the
construction of cross-domain Research Infrastructures aligned with
FAIR-compliant practices.
SESSION 2. Information security I. Lecture room №1. Chair: Maxim Kalinin
Maria Poltavtseva and Dmitry Zegzhda. Security assessment of heterogeneous big data processing and storage systems.
Abstract
The paper considers the problem of assessing the security of big
data processing and storage systems. Heterogeneous big data
processing and storing systems are increasingly used today in large
enterprises and organisations. They are characterised by two
important aspects. First, they use different data storage and
processing tools, SQL and NoSQL DBMSs, which are geographically and
organisationally distributed. Second, these tools process a
cohesive, unified data set with a complex lifecycle of each
information fragment. Each individual tool of such an ecosystem
built on different platforms can be attacked by an intruder and
cause data leakage. The high degree of trust and the volume of
processed data aggravate the security situation. The paper proposes
a new method for assessing the security of such systems. The
method uses input data on the processes of data processing and data
movement in the source system collected using blockchain
technology. The security assessment takes into account the
specificity of big data processing and storage systems and the need
to integrate the specific security assessment with the security
assessment of the information system as a whole. The calculation of
the specific assessment is based on the analysis of the access
control system, through the analysis of security policies, and the
analysis of trust in the nodes performing operations on data. As a
result, the paper proposes an integrated assessment that reflects
the specificity of big data processing and storage systems and can
be easily embedded in various more general assessment
methodologies. A security assessment framework based on a
previously presented target system modelling framework for security
policy analysis is also presented.
Maxim Kalinin, Artem Konoplev and Vasiliy Krundyshev. Storage Optimization of a Blockchain-like Oriented Acyclic Block Graph Used for Data Protection in Highly Loaded Systems.
Abstract
This paper examines existing methods for addressing the challenge
of unlimited blockchain growth and evaluates their applicability to
blockchain-like oriented acyclic block graphs, which are utilized
for data protection in highly loaded systems. The authors introduce
a novel approach to reduce the volume of such block graphs by
embedding the hash image of the system’s state directly into block
headers. This method improves scalability while maintaining data
integrity and security, making it particularly suitable for
resource-constrained environments. By fixing the system state’s
hash representation within the block structure, the proposed
solution minimizes storage requirements without compromising the
decentralized verification process. The technique is designed to
optimize the performance of oriented acyclic block graphs, enabling
their efficient deployment in high-throughput applications.
Potential use cases include smart cities, vehicular ad-hoc networks
(VANETs), industrial and medical IoT systems, and social digital
services, where high transaction volumes and low latency are
critical. The paper highlights the method’s ability to balance
scalability, security, and decentralization, ensuring robust data
protection in dynamic large-scale networks. The findings suggest
that this method offers a viable path for implementing lightweight,
scalable, and secure distributed ledger technologies in highly
loaded systems.
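As a rough illustration of the core idea, fixing a hash image of the system state in the block header so that older history can be pruned, here is a minimal Python sketch; the header fields and the JSON-based hashing are our assumptions, not the authors' block format.

    import hashlib, json, time

    def sha256_hex(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def make_header(parent_hashes, tx_root, state):
        """parent_hashes: hashes of parent blocks in the DAG (possibly several)."""
        state_hash = sha256_hex(json.dumps(state, sort_keys=True).encode())
        header = {
            "parents": sorted(parent_hashes),
            "tx_root": tx_root,
            "state_hash": state_hash,      # commitment to the state snapshot
            "timestamp": int(time.time()),
        }
        header["block_hash"] = sha256_hex(json.dumps(header, sort_keys=True).encode())
        return header

    # Verifying a pruned chain needs only headers plus one state snapshot:
    state = {"balances": {"a": 10, "b": 5}}
    h = make_header(["9f1c...", "c2ab..."], tx_root="ab12", state=state)
    assert h["state_hash"] == sha256_hex(json.dumps(state, sort_keys=True).encode())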
Vladimir Budzko and Viktor Belenkov. Cybersecurity of Data-Intensive Systems: The Discipline of Forming Integrity Synchronization Points at a DID-System Facility.
Abstract
The current stage of development of Russian society is
characterized by digital transformation of all its spheres,
including economy, science, healthcare, education, culture, etc.
One of the areas of such transformation is the widespread use of
systems operating in data-intensive domains (DID-systems). The
increasingly widespread use of these systems, their functioning as
part of global networks entails new risks of ensuring information
security (IS), which can negatively affect individuals, groups,
organizations, sectors of the economy and society as a whole.
Ensuring the IS of DID-systems requires organizing their operation
so that the DID-system and its databases (DB) can be restored after
cyber-attacks. The article considers the conditions that must be met
by the discipline of forming integrity synchronization points used
at a DID-system facility: it must minimize the total time spent both
on the special actions that make restoration of the system possible
and on its actual restoration after cyber-attacks leading to
failures.
SESSION 3. Machine learning methods. Conference Hall «Kapitsa». Chair: Vladimir Parkhomenko
Nikita Tsaplin, Alexander Petrov and Dmitry Kovalev. Greedy Feature Selection for Network Traffic Shallow Packet Inspection.
Abstract
Monitoring network traffic is essential for securing cloud
infrastructure, especially given the growing sophistication of
cyberattacks and information-based threats. Traditional deep packet
inspection (DPI) methods are often impractical due to high
computational costs, legal concerns, and incompatibility with
encrypted traffic. This paper presents RuStatExt, a high-performance
driver for Hyper-V that enables real-time network monitoring through
Shallow Packet Inspection (SPI) – analyzing only L2–L4 headers. We
propose novel algorithms for greedy feature selection and dynamic
parameter tuning via grid search, significantly improving the
performance of machine learning models used for anomaly detection.
Experimental evaluation on real-world network traffic from 21
production VMs shows that combining SPI metadata with optimized
model training increases the F1-score compared to baseline
approaches. The best results were achieved using Isolation Forest
with dynamic parameter scaling, significantly improving the average
F1-score from 0.28 to 0.73. The solution introduces negligible
overhead even under 1 Gbps traffic loads, making it suitable for
large-scale deployment in cloud environments.
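A minimal sketch of greedy forward feature selection driven by anomaly-detection F1, in the spirit of the abstract; the header-feature names, the Isolation Forest scoring, and the stopping rule are placeholder assumptions, not the RuStatExt algorithms themselves.

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import f1_score

    def greedy_select(X, y, names, max_features=4):
        selected, best_f1 = [], -1.0
        remaining = list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            scored = []
            for j in remaining:
                cols = selected + [j]
                model = IsolationForest(random_state=0).fit(X[:, cols])
                pred = (model.predict(X[:, cols]) == -1).astype(int)  # -1 = anomaly
                scored.append((f1_score(y, pred, zero_division=0), j))
            f1, j = max(scored)
            if f1 <= best_f1:              # stop once no candidate improves F1
                break
            best_f1, selected = f1, selected + [j]
            remaining.remove(j)
        return [names[j] for j in selected], best_f1

    rng = np.random.default_rng(0)
    X = rng.random((500, 6))
    y = (X[:, 0] > 0.95).astype(int)       # a rare "anomaly" label for the demo
    print(greedy_select(X, y, ["pkt_len", "ttl", "syn_rate",
                               "dst_port", "win_size", "proto"]))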
Dmitry I. Ignatov and Ruslan Molokanov. Benchmarking of Boolean Matrix Factorization Models for Collaborative Filtering: Classic and Neural Approaches.
Abstract
Boolean matrix factorization (BMF) is a widely used technique for
dimensionality reduction and information extraction from
high-dimensional binary data. It is commonly applied in areas where
the binary data format is naturally acquired. Examples include
database tiling, financial transaction analysis, pattern
recognition and recommendation systems based on processing
implicit feedback. This study aims to implement and compare
different BMF methods for collaborative filtering – an approach for
generating personalized recommendations by leveraging user feedback
from individuals with similar preferences. Classical methods rooted
in Formal Concept Analysis are examined alongside a new neural
network approach inspired by the Neural Collaborative Filtering
architecture. Algorithms are implemented and evaluated using
various datasets, including synthetic matrices and real-world data
such as user ratings of movies.
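For readers unfamiliar with the setting, the snippet below shows what the Boolean reconstruction behind BMF looks like on toy data: a binary user-item matrix is (approximately) the Boolean product of two binary factors. The random factors stand in for the output of the FCA-based or neural methods the paper benchmarks.

    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.random((30, 5)) < 0.3            # users x latent factors
    V = rng.random((5, 40)) < 0.3            # latent factors x items
    B = (U.astype(int) @ V.astype(int)) > 0  # Boolean product: OR over ANDs

    def reconstruction_error(B, U, V):
        B_hat = (U.astype(int) @ V.astype(int)) > 0
        return np.mean(B != B_hat)           # fraction of mismatched cells

    print(reconstruction_error(B, U, V))     # 0.0, since U and V are exact here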
Wei Yuan, Dmitry Ignatov and Zamira Ignatova. Statistical learning of polyhedra volumes from metric features.
Abstract
We explore statistical learning methods for computing polyhedra
volumes, motivated by Sabitov’s polynomial approach, which
expresses volume as a root of an edge-length-dependent polynomial.
It is known that searching for Sabitov polynomials presents
significant computational challenges for complex polyhedra. To
directly estimate the volume of a polyhedron from its edges’
lengths, we propose using ElasticNet regression to approximate
volumes from edge lengths, demonstrating high accuracy (R2 > 0.999)
for tetrahedra and octahedra (R2 ≥ 0.98 and > 0.999, depending on
the generation parameters), yielding their volume formulas from
triples of the distances of their edges and diagonals. We further
speculate on the capability of statistical learning to deal with
Steffen’s flexible polyhedron, where traditional methods struggle
to obtain the Sabitov polynomial. Our results bridge algebraic
geometry and machine learning, offering a scalable alternative for
volume computation.
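The regression setup can be sketched in a few lines. Since 288*V^2 of a tetrahedron is the Cayley-Menger determinant, a degree-3 polynomial in the six squared edge lengths, regressing V^2 on cubic features of the squared lengths with ElasticNet should fit almost exactly; the data-generation parameters below are our own, not the paper's.

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import ElasticNet
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    X, y = [], []
    for _ in range(2000):
        P = rng.random((4, 3))                            # a random tetrahedron
        vol = abs(np.linalg.det(P[1:] - P[0])) / 6.0      # ground-truth volume
        X.append([np.sum((P[i] - P[j]) ** 2)
                  for i, j in combinations(range(4), 2)]) # 6 squared edge lengths
        y.append(vol ** 2)                                # target: squared volume

    Phi = PolynomialFeatures(degree=3, include_bias=False).fit_transform(np.array(X))
    model = ElasticNet(alpha=1e-6, max_iter=100000).fit(Phi[:1500], y[:1500])
    print("held-out R2:", r2_score(y[1500:], model.predict(Phi[1500:])))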
SESSION 4. Data Models and Data Integration, Database Management. Small conference hall. Chair: Viktor Zakharov
Sergey Stupnikov. Verification of Data Integration in a Binary Star Database.
Abstract
The growing number of data sources in science and industry whose
data structures differ substantially, are defined using different
data models, and are implemented in different DBMSs leads to the
need to develop data integration systems. Specialized data
integration systems are created in various subject domains, for
example, in astronomy, land-use management, and materials science.
The complexity of the data integration programs implemented in such
systems makes formal verification of their correctness necessary.
The paper considers an approach to verifying the correctness of data
integration in the binary star database of the Institute of
Astronomy of the Russian Academy of Sciences. Data from stellar
catalogs are integrated by programs written in an imperative
programming language. A relational DBMS is used as the target
database. The verification approach is based on defining the
semantics of the data structures and of the data integration
programs in a formal specification language and subsequently proving
the correctness of the data integration using automated proof
tools.
Vladislav Sukhomlinov and Oleg Sabinin. Application of Stratified Sampling in Statistical Data Analysis to Improve Query Plan Cardinality Estimation.
Abstract
This paper proposes a new approach for generating the sample needed
to collect statistics and calculate the cardinality of execution
plans in relational database management systems. The study examines
the well-known simple random sampling method and identifies its
inherent limitations when applied to modern database systems. To
address these challenges, we advocate adopting stratified sampling
in contexts where data segmentation based on table attributes is
feasible. We propose a modified stratified sampling algorithm that
reduces the required sample size for statistical data collection
without compromising the accuracy of the results. Preliminary
experiments using PostgreSQL 17.4 DBMS and a 2.5 GB database
confirm the effectiveness of the proposed approach on
statistical-analysis samples of up to 35,000 rows, emphasizing its
potential for optimizing query performance.
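The intuition can be shown with a toy selectivity estimate on a skewed attribute (our own example, not the paper's algorithm): with the same budget, a proportionally allocated stratified sample tracks the rare stratum far more reliably than a simple random sample.

    import numpy as np

    rng = np.random.default_rng(1)
    strata = np.where(rng.random(1_000_000) < 0.95, 0, 1)   # 95% / 5% split
    match = np.where(strata == 0,
                     rng.random(strata.size) < 0.01,        # stratum 0 selectivity
                     rng.random(strata.size) < 0.60)        # stratum 1 selectivity
    true_sel = match.mean()

    n = 10_000
    srs_est = rng.choice(match, size=n, replace=False).mean()

    strat_est = 0.0
    for s, w in [(0, 0.95), (1, 0.05)]:                     # proportional allocation
        rows = match[strata == s]
        strat_est += w * rng.choice(rows, size=int(n * w), replace=False).mean()

    print(f"true={true_sel:.4f}  srs={srs_est:.4f}  stratified={strat_est:.4f}")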
[SHORT] Irina Karabulatova, Stepan Vinnikov, Anatoly Nardid and Yuriy Gapanyuk. The metagraph transformation algorithm based on incidence and nesting representation.
Abstract
The article is devoted to the work of the metagraph transformation
algorithm. Graph models, including complex graph models, are
currently being actively used to describe the structures of complex
systems. The relevance of the research lies in the fact that the
article considers an algorithm for converting structures of
large-sized complex systems represented in the form of metagraphs. The
metagraph model and its multipartite and matrix representations are
discussed. The proposed approach is based on representing a
metagraph data model as a combination of incidence and nesting
matrices. Operations on a metagraph can be represented in the form
of a combination of operations on incidence and nesting matrices.
The metagraph transformation algorithm is presented. An example of
applying the algorithm is explained in detail; interpretations of
the output matrices are given. The expected impact is that the
proposed algorithm is based on the matrix representation of the
metagraph on the one hand, and on the event sourcing architectural
pattern on the other hand, which allows the algorithm to be used for
data-intensive domains.
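A tiny sketch of the representation the abstract refers to: a metagraph encoded as an incidence matrix (vertices x edges) plus a nesting matrix (metavertices x vertices), on which operations become matrix algebra. The concrete metagraph is invented for illustration.

    import numpy as np

    # Vertices v1..v3, edges e1..e2, one metavertex mv1 containing v1 and v2.
    incidence = np.array([[1, 0],      # incidence[i, j] = 1 if vertex i is in edge j
                          [1, 1],
                          [0, 1]])
    nesting = np.array([[1, 1, 0]])    # nesting[k, i] = 1 if vertex i is inside mv k

    # Example operation: which edges touch any vertex nested in mv1?
    print((nesting @ incidence) > 0)   # [[ True  True]] - mv1 touches e1 and e2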
SESSION 5. Information security II. Lecture room №1. Chair: Evgeny Pavlenko
Nikolai Kalinin and Nikolay Skvortsov. Towards Unified Ontology of Cloud Assets for Multi-Cloud Security.
Abstract
Multi-cloud architectures have become ubiquitous as organizations
leverage services from multiple cloud providers, but this trend has
introduced new security challenges in consistency and oversight.
Cloud misconfigurations and sophisticated threats are rising in
tandem with accelerated cloud adoption, resulting in frequent
security incidents and data breaches. This paper presents an
OWL-based unified ontology of cloud assets, constructed by
analyzing Terraform provider schemas from AWS, Google Cloud, Azure,
Yandex Cloud, and Cloud.ru. The ontology provides a formal,
provider-agnostic framework to integrate and reason about cloud
infrastructure across heterogeneous environments. By unifying cloud
asset definitions, our approach enables the automated construction
of a comprehensive multi-cloud asset knowledge base and supports
universal Cloud Security Posture Management (CSPM) policies that
can detect and prevent misconfigurations consistently across
different clouds. Furthermore, security analysts can formulate and
analyze attack chains using the ontology’s relationships and
logical constraints, allowing reasoning engines to infer potential
threat paths and misconfiguration exploits. The proposed
ontology-driven framework aims to enhance cloud security monitoring
and incident analysis in hybrid and multi-cloud deployments,
ultimately helping organizations and Managed Security Service
Providers (MSSPs) improve their security posture through formal
knowledge representation and automated reasoning.
Pavel Yugai, Evgeny Zubkov, Dmitriy Moskvin and Denis Ivanov. Robustness of machine learning models in network threat defense systems in the context of adversarial attacks.
Abstract
Machine learning algorithms (ML) are utilized across various
domains of information technology. In network security, ML models
are employed for detecting information security incidents and
responding to diverse threats. Classical ML algorithms or their
modifications are implemented in various network security products,
depending on their specific implementations and purposes. There
exist adversarial attacks targeting ML models, designed to
manipulate input data in such a way that the output of the ML
model becomes erroneous. Different methods and metrics for
assessing the robustness of ML models against adversarial attacks
are applied in various domains. This paper examines adversarial
attacks that are employed against widely used ML models for
detecting network attacks. It discusses both classical and
specialized metrics for evaluating the resilience of ML models to
adversarial attacks. An analysis of the applicability of various
metrics for assessing the robustness of machine learning models
against the aforementioned adversarial attacks is conducted.
Nikolay N. Shenets, Elena B. Aleksandrova and Artem S. Konoplev. Secure Data Processing Approach for Ad-Hoc Networks Based on Blockchain and Secret Sharing.
Abstract
In this paper, we propose a new secure communication scheme for
Ad-Hoc networks such as MANET/FANET based on blockchain technology
and secret sharing. First, we review the usage of both secret
sharing and blockchain for different protection purposes. Next, we
describe our approach for securing Ad-Hoc networks and show its
advantages. Namely, we use secret sharing as a base for
authentication and key predistribution, and mutable permissioned
blockchain for the identification of nodes instead of Public Key
Infrastructure.
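For context, the primitive the scheme builds on looks like this textbook Shamir (t, n) sharing over a prime field; the paper's protocol (key predistribution and the mutable permissioned blockchain) is of course more involved, and the field modulus below is just a convenient choice.

    import random

    P = 2**127 - 1                       # a Mersenne prime as the field modulus

    def share(secret, t, n):
        coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
        return [(x, sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P)
                for x in range(1, n + 1)]

    def reconstruct(shares):             # Lagrange interpolation at x = 0
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num = den = 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * (-xj) % P
                    den = den * (xi - xj) % P
            secret = (secret + yi * num * pow(den, -1, P)) % P
        return secret

    shares = share(secret=123456789, t=3, n=5)
    assert reconstruct(shares[:3]) == 123456789   # any 3 of the 5 shares suffice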
SESSION 6. Mathematical Models. Conference Hall «Kapitsa». Chair: Egor Khitrov
Dmitry Ignatov. Computation of the Dedekind-MacNeille completion of the Bruhat order for the Weyl group of type B_n.
Abstract
This paper presents the computation of the Dedekind-MacNeille
completion for the Bruhat order of the Weyl group B_6 using concept
lattices (aka Galois lattices). We extend the associated sequence
A378072 in the OEIS by providing the value for n = 6, which equals
142343254. Our approach leverages Formal Concept Analysis and the
NextClosure algorithm to efficiently compute the completion. We
also present additional results for other Weyl groups (the infinite
families A_n and D_n, up to computationally reasonable n, and the
exceptional groups G_2, F_4, E_6) and analyze lattice properties including the
number of chains and (maximal) antichains, as well as lattice
height, width, and breadth.
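The construction is easy to demonstrate on a toy poset: the Dedekind-MacNeille completion is exactly the concept lattice of the context (P, P, <=), so all cuts can be enumerated with the Galois connection below. This brute-force sketch is only for intuition; the paper relies on the far more scalable NextClosure algorithm.

    from itertools import combinations

    elements = ["a", "b", "c", "d"]
    # A "bowtie" poset: a and b both lie below c and d (plus reflexivity).
    leq = {(x, x) for x in elements} | {("a", "c"), ("a", "d"),
                                        ("b", "c"), ("b", "d")}

    def upper(A):      # common upper bounds of A
        return frozenset(u for u in elements if all((x, u) in leq for x in A))

    def lower(B):      # common lower bounds of B
        return frozenset(l for l in elements if all((l, y) in leq for y in B))

    cuts = set()
    for r in range(len(elements) + 1):
        for A in combinations(elements, r):
            U = upper(A)
            cuts.add((lower(U), U))      # each cut is (lower(upper(A)), upper(A))

    # 7 elements: the four originals plus bottom, top, and one new midpoint.
    print(len(cuts), "elements in the completion")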
Aleksey Buzmakov, Sophia Kulikova and Vladimir Parkhomenko. On Computing the Robustness and Stability of Formal Concepts Based on the Delta-Measure.
Abstract
The paper considers the robustness of formal concepts, from which
stability and its fast approximation, the delta-measure, are
derived. A computationally efficient way to calculate robustness
based on subcontexts is presented as a proposition. The main result
of the work is a method for analytically estimating the number of
preimages of concepts. Based on this method, an optimization of the
algorithm for delta-stable concepts is proposed. The reasoning is
illustrated with an example.
Aleksandr Sidnev, Igor' Malyshev and Vladimir Tsygan. Queue network models for performance evaluation and optimization of Job shop systems.
Abstract
The paper presents a modern approach to using closed queue network
models to evaluate and optimize Job-shop systems. The approach
uses a two-moment method for performance measures at individual
nodes. This method is embedded in an iterative calculation
procedure for an open network equivalent to the original closed
network. The method results in algorithms for calculating and
optimizing general multi-class closed queue networks. Numerical
studies comparing the performance of the approach with simulations
suggest that the approach yields fairly accurate estimates of
performance measures.
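The flavor of a two-moment node approximation can be conveyed by Kingman's GI/G/1 formula, which estimates mean queueing delay from utilization and the squared coefficients of variation of interarrival and service times; this is a generic illustration, not the authors' iterative network procedure.

    def kingman_wait(rho, ca2, cs2, mean_service):
        """Approximate mean waiting time in queue for a GI/G/1 node:
        W_q ~ (rho / (1 - rho)) * ((ca2 + cs2) / 2) * E[S]."""
        return (rho / (1.0 - rho)) * ((ca2 + cs2) / 2.0) * mean_service

    # A job-shop node at 80% utilization with moderately variable traffic:
    print(kingman_wait(rho=0.8, ca2=1.2, cs2=0.9, mean_service=3.0))  # 12.6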
SESSION 7. Research support in data infrastructures. Small conference hall. Chair: Egor Khitrov
Anton Khritankov. Towards a Technology Platform for AI Applications with MLOps and Hybrid Computing.
Abstract
The rapid advancement of artificial intelligence (AI) technologies
has necessitated the development of robust frameworks that
facilitate the efficient creation and deployment of AI
applications. This paper proposes a specific variant of the MLOps
(Machine Learning Operations) process tailored for the full cycle
development of AI applications, addressing the unique challenges
that arise during implementation. We identify and discuss several
critical tasks associated with this process, including pipeline
implementation, machine learning application verification and
long-term modeling of continuous learning systems, which are
essential for ensuring the efficiency and effectiveness of MLOps
implementations. Additionally, we describe a hybrid cloud computing
platform designed to automate these MLOps processes, enhancing
scalability and flexibility in AI application development. This
platform integrates on-premises and cloud resources, facilitating
seamless collaboration and resource allocation. By providing a
structured approach to MLOps, this work contributes to the
advancement of methodologies in AI development and offers a
practical framework for organizations seeking to optimize their AI
initiatives and accelerate time-to-market for innovative solutions.
[SHORT] Victor Dudarev, Nadezhda Kiselyova and Alfred Ludwig. Information Support for Distributed Research on Thin-Film Materials: From Synthesis to Process Control.
Abstract
A research data management system (RDMS) called MatInf has been
developed to support research teams working with the large volumes
of data generated by high-throughput experiments in inorganic
materials science. The MatInf architecture fully supports
user-defined data types that are defined dynamically after the
system is deployed. This flexibility is achieved through a
late-binding mechanism that links types to external web services
implementing data validation, extraction, and visualization.
Examples are presented of using the RDMS to store and analyze
experimental data on thin-film materials. The key distinction of
MatInf is the absence of freely available alternatives that
simultaneously support a typed representation of materials science
data, an extensible system of user-defined types, integration with
arbitrary research document formats without modifying the system
core, and linking of objects by means of a directed multigraph.
[SHORT] Alexander Elizarov, Evgeny Lipachev and Olga Nevzorova. Towards a Research Infrastructure of Mathematical Knowledge.
Abstract
We propose an approach to creating a research infrastructure for
managing mathematical knowledge. The research infrastructure is
presented as a system of interconnected semantic mathematical
artifacts developed for different domains of mathematical
knowledge. The formation of mathematical artifacts is based on the
software tools of the OntoMath digital ecosystem that we have
already developed. When creating mathematical artifacts, we were
guided by the FAIR principles and the recommended practices for
their application. We highlight the main mathematical artifacts of the
research infrastructure such as an ontology of professional
mathematics, an ontology for mathematical theorems and statements
and an ontology of mathematical problems, an ontology of methods of
solving mathematical problems, an ontology of algorithms and
programs, a knowledge graph for mathematical formulas, a knowledge
graph for representing the organizational structure of mathematical
space, including descriptions of scientific groups, individuals,
research topics presented in mathematical journals, and an
ontological model for representing mathematical knowledge as a
system of interconnected specialized ontologies.
Vladimir Korenkov, Irina Filozova, Galina Shestakova, Andrey Kondratyev, Aleksey Bondyakov, Tatiana Zaikina, Irina Nekrasova and Yanina Popova. Automation of Scientific Publications Management in the JINR Digital Repository.
Abstract
In the context of growing volumes of scientific publications and the
increasing number of digital repositories, effective management of
research outcomes has become an increasingly complex challenge. This
paper presents a modular system for automating the management of
scientific publications, integrated into the Joint Institute for
Nuclear Research (JINR) digital repository based on the DSpace
software platform. The system enables automated harvesting of
publication metadata and full texts from external sources,
verification of authorship records, duplicate elimination, and data
normalization, significantly enhancing the accuracy and completeness
of repository information. The repository's functionality is
extended with data visualization: interactive histograms, among the
most common and intuitive visualization types for such systems, have
been implemented. This feature, developed using D3.js, enhances the
repository's analytical capabilities. The proposed architecture is
characterized by flexibility, scalability, and the ability to
integrate into existing infrastructures of research organizations,
opening prospects for its adoption in universities, research
centers, and national libraries. The development is carried out at
the Laboratory of Information Technologies (LIT) of the Joint
Institute for Nuclear Research (JINR).
Nikolay Skvortsov. Enabling Semantic Search for Resources Within the Problem-Solving Lifecycle.
Abstract
Reuse of heterogeneous scientific data and methods usually requires
considerable effort to ensure their integration and semantic
interoperability. The article proposes an approach to semantic
resource search based on domain ontologies. Resources, including
data sources and method implementations, are registered in research
infrastructures to enable their reuse when solving problems in
various subject domains. In this way, collections of scientific data
and toolsets are created for collaborative research and for the
continuity of scientific results. Formal domain specifications in an
ontological model are used to link semantically meaningful metadata
to data and methods. Using such metadata and logical inference over
it, data and methods are classified within the subject domain to
enable the search for resources relevant to the problems being
solved. At different stages of the research problem-solving
lifecycle, this provides discovery of data sources and methods,
their semantic integration, and their correct joint operation.
SESSION 8. Information security III. Lecture room №1. Chair: Maria Poltavtseva
Nikita Gribkov and Maxim Kalinin. Evaluating the security of big data infrastructure using intelligent analysis of its code base.
Abstract
The paper analyses methods of security assessment of typical
components of big data infrastructure. Based on the results of the
analysis, we propose an evaluation method based on a comparative
analysis of the code base of the investigated components with the
sets of known potentially dangerous code fragments. To increase the
universality of the method, the possibility of analyzing components
without source codes has been studied. The method is complex:
fragments are analyzed at several levels of abstraction: binary
code, assembly code and its graph representations, and recovered
code and its graph representations. The method allows
labelling potentially dangerous code fragments, including those
without syntactic samples, in components of big data infrastructure
and assessing its security level based on the collected statistics.
Evgeny Zubkov and Dmitry Zegzhda. Assessment of the sustainability of the cyber-physical system based on historical data.
Abstract
The study examines the sustainability issues of cyber-physical
systems from the perspective of information security
vulnerabilities in software and hardware components. An overview
is provided of methods for assessing and ensuring the stability of
such systems. A model for evaluating stability is proposed, based
on a continuous Markov process, using historical data on software
versioning and vulnerabilities.
[SHORT] Nikita Gololobov, Evgeniy Pavlenko and Darya Lavrova. Model of software functioning based on Bayesian networks.
Abstract
This article presents an innovative model of software functionality
based on Bayesian networks, developed to address critical
cybersecurity issues related to code reachability assessment and
vulnerability exploitation prediction. The proposed model overcomes
the limitations of traditional software analysis methods, which
generate an excessive number of false positives due to the lack of
context regarding the actual reachability of vulnerable components.
[SHORT] Evgenia Novikova and Igor Kotenko. Towards assessment of the trustworthiness of the AI models: application to cyber security tasks.
Abstract
Currently, there is an active discussion on the usage
of AI-based systems in many areas of the national economy such as
finance, industry, medicine, education, etc. The key issue of the
practical application of AI models consists in evaluation of their
trustworthiness level. The paper discusses the notion of
trustworthiness of the AI model and its main characteristics. The
authors demonstrate that currently there is no unified approach to
evaluating the robustness of the AI model as well as the set of
metrics used to assess it. To fill this gap, the authors propose a
formal description of the AI model evaluation process based on
ontological modelling. The proposed ontology describes key
components of the evaluation methodology, including requirements
defined for the given subject domain and analytical task, a set of
evaluation metrics, and their calculation algorithms. The
application of the developed ontology is demonstrated by evaluating
the trustworthiness of the AI models developed for cyber security
tasks.
SESSION 9. Machine Learning Applications. Conference Hall «Kapitsa». Chair: Dmitry Kovalev
Nadezhda Kiselyova, Victor Dudarev, Oleg Senko, Alexander Dokukin and Andrey Stolyarenko. Application of Machine Learning to Predict Crystal Lattice Parameters of ThCr2Si2 Type Crystal Structure Compounds.
Abstract
A comparison of the efficiency of using various machine learning
methods in predicting the qualitative and quantitative properties
of inorganic compounds was carried out. In predicting qualitative
properties, the most accurate programs, according to the results of
cross-validation, were those based on training a neural network
using the backpropagation method (average accuracy 91%), the
support vector machine method (91.6%), and the k-nearest neighbors
method (92.9%). The high average accuracy (97.6%) of the
examination assessment in the cross-validation mode indicates the
effectiveness of using ensembles of algorithms. Using the selected
programs, prediction of new compounds of the composition AD2X2 (A
and D are various elements here and below; X is B, Al, Si, P, Ga,
Ge, As, Sn or Sb) with crystal structures of the ThCr2Si2, FeMo2B2,
CaAl2Si2, CaBe2Ge2 and CoSc2Si2 types under ambient conditions was
carried out. For the predicted compounds with the structures
ThCr2Si2, FeMo2B2, CaAl2Si2 and CaBe2Ge2, the crystal lattice
parameters were estimated. When solving these problems, the most
accurate results according to the LOOCV (Leave-One-Out
Cross-Validation) were obtained using programs from the
scikit-learn package: svm.NuSVR, svm.SVR, Random Forest, Gradient
Boosting Regressor, ARD Regression, Extra Trees Regressor,
Orthogonal Matching Pursuit and Bayesian Ridge, as well as a
program specially developed for predicting quantitative properties
DivenBoost. MAE (Mean Absolute Error) was in the range of
0.014-0.155 Å. The value of the multiple determination coefficient
R2 was in the range of 0.836-0.990. To predict compounds not yet
obtained and to estimate unknown values of lattice parameters, only
the values of the properties of elements A, D and X were used.
Andrew Soroka and Alex Meshcheryakov. JAMPR+/L2D: scalable neural heuristic for constrained vehicle routing problems in dynamic environment.
Abstract
Vehicle routing problems with real-world constraints (we
consider vehicle capacity limits, time window constraints, and
pickup-and-delivery with multiple depots — CPDPTW) pose significant
computational challenges. While classical exact and heuristic
methods remain effective to solve problems of small/medium size (N
≲ 100), they often lack adaptability and scalability for larger
logistics tasks. In this work, we show how the JAMPR+/L2D deep
reinforcement learning model, previously proposed to solve large
CPDPTW problems, can be adapted to substantial changes of the graph
distance matrix. We test the performance of the JAMPR+/L2D model
for medium-sized
CVRP and VRPTW problems on CVRPLIB benchmarks: JAMPR+/L2D
outperforms the state-of-the-art heuristic HGS in over 85% of
instances, achieving improvement in objective gap. We show that the
JAMPR+/L2D model trained on CPDPTW problem, generalizes well for
tasks with simpler constraints (CVRP, VRPTW), for different problem
sizes, and for moderate changes in distance matrices. For more
substantial changes in distance matrices, we propose fast
fine-tuning of JAMPR+: on ORTEC data (for CPDPTW), the proposed
strategy markedly reduces the objective gap without full model
retraining, which gives both accuracy and rapid inference of the
model in practical routing scenarios with distance matrix
changes.
Aleksandr Osipov and Ramon Antonio Rodriges Zalipynis. Accelerated Wildfire Simulations via Caching Techniques.
Abstract
Thousands of wildfires occur daily, posing serious threats to the
global environment. Therefore, accurate wildfire simulations are
crucial to combat, mitigate, and prevent wildfires effectively.
However, simulating a large number of concurrent wildfires requires
significant computational resources. The novel idea presented in
this work is to accelerate cellular automata wildfire simulations
by sharing and/or reusing calculations, thereby reducing their
number. We develop this idea by proposing and implementing a set of
new techniques: precise and imprecise caching, as well as fuzzy
approximation. This work is pioneering in terms of designing and
exploring caching techniques for the aforementioned scenarios. We
use Simpson, Jaccard, and Sneath metrics for accuracy evaluations.
We assess computational efficiency by thorough theoretical
algorithm analysis and profiling. Importantly, all the approaches
significantly speed up the simulations without even modifying the
wildfire simulation model, keeping it intact. Precise caching
improved the computation speed by 20% without any accuracy
degradation. Imprecise caching yielded similar performance gains,
but with reduced accuracy (Jaccard: 0.86, Sneath: 0.76). Further,
fuzzy logic reduced the runtime by 44% but exhibited lower accuracy
(Jaccard: 0.86, Sneath: 0.65, depending on fire characteristics).
The presented approaches may enable simulating more wildfires in a
fraction of the time or require fewer computational resources.
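The "precise caching" idea can be sketched with a memoized cellular-automaton step: identical (cell state, burning-neighbor count, wind) configurations are evaluated once and reused across the grid and across time steps. The transition rule below is invented for illustration and is not the paper's fire model.

    from functools import lru_cache

    UNBURNT, BURNING, BURNT = 0, 1, 2

    @lru_cache(maxsize=None)               # precise cache over configurations
    def next_state(state, burning_neighbors, wind):
        if state == BURNING:
            return BURNT
        if state == UNBURNT and burning_neighbors * (1 + wind) >= 2:
            return BURNING
        return state

    def step(grid, wind):
        h, w = len(grid), len(grid[0])
        def burning_around(i, j):
            return sum(grid[a][b] == BURNING
                       for a in range(max(0, i - 1), min(h, i + 2))
                       for b in range(max(0, j - 1), min(w, j + 2))
                       if (a, b) != (i, j))
        return [[next_state(grid[i][j], burning_around(i, j), wind)
                 for j in range(w)] for i in range(h)]

    grid = [[UNBURNT, BURNING, UNBURNT],
            [UNBURNT, UNBURNT, UNBURNT]]
    print(step(grid, wind=1))
    print(next_state.cache_info())         # cache hits = shared computations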
[SHORT] Aleksei Samarin, Aleksei Toropov, Alexander Savelev, Egor Kotenko, Anastasia Mamaeva, Artem Nazarenko, Alexander Motyko, Elena Mikhailova and Valentin Malykh. AI-Driven Automatic Proctoring System for Secure Online Exams.
Abstract
This study presents a microservice-based AI-driven proctoring
system for secure and scalable online exam monitoring. The proposed
system integrates deep learning and computer vision techniques to
analyze multimodal data, including video streams, audio signals,
and metadata, to detect dishonest behaviors such as unauthorized
assistance, device usage, and unusual gaze patterns. The system
architecture ensures seamless integration with online learning
platforms, providing a modular and adaptive approach to remote exam
supervision. Key components include real-time facial recognition,
eye-tracking, head pose estimation, and audio anomaly detection.
Explainable AI techniques enhance transparency, allowing educators
to interpret decisions and minimize false positives. Experimental
evaluation on a controlled dataset demonstrated high detection
accuracy and efficiency, validating the system’s applicability for
automated proctoring. The microservice structure allows for
flexible deployment, making it suitable for large-scale educational
environments. This work is mainly devoted to the description of the
analytical core of the system. Future improvements will focus on
refining detection models, reducing bias, and addressing ethical
considerations concerning student privacy. This research
contributes to advancing AI-powered academic integrity solutions,
offering a practical and scalable alternative to traditional
proctoring methods.
SESSION 10. Experimental Data Analysis. Small conference hall. Chair: Nikita Voinov
Anna Provorova, Kristina Lykova, Sophia Kulikova, Daria Semenova and Julia Zaripova. “Dish I Wish”: an app for studying children's eating behavior.
Abstract
This study presents a specialized web application, Dish I Wish,
designed to collect data on dietary preferences among children
aged 4 to 14 through the use of gamification techniques. In
response to the limitations inherent in traditional paper-based
questionnaires, the application features an intuitive, interactive
interface that enables children to simulate meal planning by
selecting dishes and adjusting portion sizes. An experimental
design is employed to align individual children's preferences with
perceived family habits, thereby enhancing the depth and contextual
relevance of the data collected. Developed using the MERN
technology stack and leveraging MongoDB for flexible data storage,
the system emphasizes scalability and automation to reduce the
potential for human error in data collection.
Vladimir Parkhomenko, Anastasiya Ivanova, Ivan Eroshin, Pavel Drobintsev and Alexander Schukin. User Interface Evaluation Using Eye Tracking and Facial Expressions.
Abstract
User interface (UI) evaluation is useful for creating software that
is convenient for people. We describe two WebGazer-based systems
that serve as testbeds for conducting field studies and for visual
presentation of statistics on their results. The systems are open
source on GitHub. They were used in field experiments in May 2025
with more than 20 participants. The first experiment is devoted to
evaluating marketplaces using heat maps; indicators and a
methodology were developed. Ranking by different indicators showed
that the main and product pages of marketplaces similar to Ozon and
Wildberries are of roughly equal importance. The second experiment
is devoted to estimating the individual treatment effect (ITE) using
eye and facial expression tracking, together with a system of
developed methods and indicators, including a new distraction
(anxiety) indicator. The treatment considered is switching between
the Chrome and Firefox web browsers. Despite users' stated
preference for Chrome in a survey, gaze and emotion monitoring shows
no statistically significant difference between the browsers on
average. Experiments on synthetic data were conducted to assess the
sensitivity of meta-algorithms for measuring ITE; under strong data
heterogeneity, the X-learner model with Causal Forest demonstrates
the most stable and significant ITE estimates.
[SHORT] Georgiy Frolov and Vladimir Parkhomenko. Research and experimental analysis of Big Data tools: an application for Twitch streaming platform.
Abstract
The article presents a comparative analysis of popular Big Data
tools for batch (Spark, Hive on Tez, MapReduce) and streaming (Spark
Streaming, Apache Flink) processing. We evaluate the time efficiency
of the tools: execution time and latency, respectively. Apache Flink
and Spark Streaming demonstrate decent results, and both can be
considered relevant tools for stream processing. However, Apache
Flink demonstrates more stable and substantially lower absolute
latency than Spark Streaming. We also conclude that, in the batch
case, Spark outperforms the other compared tools in time efficiency.
Hive on Tez shows a slight lag in batch processing efficiency
compared to Spark and, thanks to the simplicity of the HiveQL
syntax, is recognized as an effective and relevant tool. Hadoop
MapReduce, by contrast, shows rather poor results and is not
recommended for use, given the faster and more convenient
alternatives listed above. All considerations are illustrated with
applications for the Twitch streaming platform; data generation for
stream processing is automated, and the scripts and benchmark
results are published in an open GitHub repository.
[SHORT] Dmitry Nikitenko and Artem Bulekov. An approach to intuitive visual analysis of ratings.
Abstract
This paper aims to illustrate how the nature of the object under
study can be used to select methods and ways of visual analysis.
The object under consideration is various ratings, i.e.
time-varying lists of some objects ranked by values of some fixed
parameters. The key idea is to outline the set of most important
parameters and visualize various metrics under study always
together with the mentioned set of parameters. As an example, we
consider the ratings of supercomputer systems: the Top500 rating
of the world's most productive computing systems and the Top50
rating of Russian supercomputers, within the framework of which the
proposed analysis tool was implemented. For this, we choose rating
editions, system position, and HPL performance as the key set of
parameters, which form a comprehensive view of all the entries in
the history of the rating, allowing one to intuitively grasp the
significance of the studied metric values.
[SHORT] Mikhail Lebedev, Vladimir Parkhomenko, Roman Zuev and Alexander Schukin. Container Route Aggregation Using Big Data.
Abstract
One of the main problems at a container terminal is route
aggregation, which consists in minimizing unjustified gaps, i.e.,
losses of route data caused by system errors or the human factor. We
solve this task using Big Data tools, data integration, and other
techniques. A ClickHouse-based corporate database is integrated with
a Postgres database of railway waybills. We then build routes from
the available data, removing unnecessary routes, and add initial and
final operations (stations). The results are stored in Yandex Cloud
and in a Parquet file. The user is provided with a table aggregated
by the selected parameters. The developed software module operates
at a container terminal in Saint Petersburg and covers container
movements across Russia. It reduces the number of unjustified gaps
by about a factor of three.
SESSION 11. Image Analysis I. Lecture room №1. Chair: Dmitry Ignatov
Aleksei Samarin, Alexander Savelev, Aleksei Toropov, Anastasia Mamaeva, Egor Kotenko, Aleksandra Dozortseva, Artem Nazarenko, Alexander Motyko, Elena Mikhailova and Valentin Malykh. Enhancing Microorganism Classification with Vision-Language Large Models Generated Synthetic Microscopy Images.
Abstract
The scarcity of annotated microscopy datasets remains a major
obstacle to training robust deep learning models for microorganism
classification. This study proposes a novel data augmentation
pipeline that leverages Vision-Language Large Models (VLLMs) to
generate synthetic microscopic images across six distinct bacterial
and non-bacterial classes. These synthetic samples were
progressively integrated into the training dataset in controlled
proportions to systematically assess their impact on model
performance. Quantitative evaluations reveal that incorporating
up to 11% synthetic data significantly enhances classification
accuracy, with the best-performing configuration achieving a
Precision of 0.91, a Recall of 0.89, and an F1-score of 0.90.
However, performance begins to decline beyond this saturation
threshold, suggesting that excessive synthetic augmentation may
introduce distributional noise or overfitting. Our findings
highlight the potential of VLLM-based synthetic data generation as
a scalable solution to address class imbalance and data scarcity in
microbial image analysis tasks.
Daria Imaykina and Konstantin Turalchuk. Development of a Model for Classifying MRI Images by Alzheimer's Disease Stage Using Interpretable Machine Learning Methods.
Abstract
This paper proposes an approach to the automated classification of
brain MRI images by Alzheimer's disease stage using interpretable
machine learning methods. The study includes a comparative analysis
of neural network architectures, which resulted in an efficient CNN
model achieving 95% accuracy on the test set. Particular attention
is paid to interpreting the model's decisions using the Grad-CAM and
LIME methods, which made it possible to identify the key brain
regions influencing classification. It is shown that for correct
classifications the model demonstrates clear localization of
significant features, whereas errors are associated with diffuse
activations. Testing the robustness of the interpretation methods
together with the model under image distortions confirmed their
reliability, except in the case of horizontal flips, owing to the
anatomical asymmetry of the brain. The practical significance of the
work lies in a prototype client-server system intended to automate
the diagnosis of neurodegenerative diseases. The results open
prospects for introducing the developed methods into clinical
practice to improve the accuracy and speed of diagnosis.
Aleksei Samarin, Alexander Savelev, Aleksei Toropov, Anastasia Mamaeva, Egor Kotenko, Artem Nazarenko, Alexander Motyko, Elena Mikhailova, Valentin Malykh and Svetlana Babina. Infrared Imaging-Enhanced Automated Wildlife Detection in Nature Reserves.
Abstract
This study presents a refined approach to automated wildlife
detection in natural environments by integrating a neural network
architecture with a dual-stream attention mechanism and a newly
introduced infrared-based pre-classification stage. The method
addresses a key challenge in ecological monitoring: the need for
scalable and accurate tools to assess wildlife populations and
support biodiversity conservation. A preliminary classification
module evaluates color and thermal patterns in the infrared
spectrum to enhance detection reliability before the core detection
process. This step filters background noise and highlights regions
likely to contain living organisms, enabling more focused and
accurate downstream analysis. The main detection system is based on
a dual-attention neural architecture that emphasizes semantically
important regions within camera trap images while minimizing
interference from visually complex natural scenes. This is
particularly important for field-acquired data, which often
contains partial occlusions, inconsistent lighting, and background
clutter. Performance evaluation was conducted using the Wildlife
Insights dataset, a large and diverse collection of camera trap
images from various ecological regions. Experimental results
demonstrate that the proposed approach offers higher accuracy and
robustness compared to traditional models, particularly in visually
challenging scenarios. By combining infrared color-based
classification with an attention-augmented detection pipeline, this
method significantly enhances the effectiveness of automated
wildlife monitoring. The findings highlight the system’s
applicability for ecological research, conservation planning, and
long-term wildlife population studies.
[SHORT] Pavel Arkhipov, Sergey Philippskih and Maxim Tsukanov. Development of a Specialized Method for Contextual Augmentation of Small Objects.
Abstract
The article analyzes existing data augmentation methods and
identifies their shortcomings related to limitations when working
with UAV imagery. A new specialized method of contextual
augmentation of small objects is proposed that maintains spatial
realism and can provide the required density of small objects at the
training stage of neural network models. Given the complexity of
implementing the proposed method, it was decided to split the
development process into several stages. For empirical verification
of the proposed method, the SSD MobileNet V2 FPNLite 320x320 model
from the TensorFlow 2 Detection Model Zoo repository was chosen as
the base detector. This neural network model was trained on the
VisDrone dataset both with standard data augmentation methods (SSD
MobileNet V2 FPNLite 320x320) and with the developed method (SSD
MobileNet V2 FPNLite 320x320 CSOA). The detection results of the
base SSD MobileNet V2 FPNLite 320x320 model were compared with those
of the SSD MobileNet V2 FPNLite 320x320 CSOA model according to the
COCO Evaluation metrics protocol. Experiments showed that
context-aware augmentation significantly improves the ability of
neural network models to detect small objects. This result is of
great practical importance for compact, energy-efficient systems
with limited computing power and memory.
SESSION 12. Large Language Models and Applications I. Conference Hall «Kapitsa». Chair: Elena Bolshakova
Daria Potapova and Natalia Loukachevitch. Methods for Testing Generative Models That Use Information Retrieval.
Abstract
Generative models with information retrieval, i.e., the Retrieval
Augmented Generation (RAG) approach, in which a retrieval module
extracts documents relevant to the user's query and a generative
model forms the answer by combining the retrieved context with its
internal knowledge, make it possible to overcome such limitations of
large language models as knowledge obsolescence and the inability to
work with dynamically changing data. The effectiveness of such
systems directly depends on retrieval quality, context structure,
and instruction wording; however, comprehensive evaluation of this
approach is currently difficult, not only because of the small
number of suitable datasets and differing system requirements, but
also because of the lack of standard metrics for measuring how well
an answer corresponds to the provided context. This work presents a
comparative study of six language models (DeepSeek, Llama, Mistral,
Qwen, RuAdapt, YandexGPT) on a question-answering task using the RAG
approach. The experiments were performed on four Russian-language
datasets (XQuAD, TyDi QA, RuBQ, and SberQuAD) converted to a unified
format suitable for the task. Various strategies for adding and
ordering fragments in the context supplied to the model were
considered, and evaluating the models with the Context Relevance,
Utilisation, Completeness, Adherence, and Exact Matching metrics
revealed the limitations of generative models in extracting
information from relevant context.
Grigory Kovalev and Mikhail Tikhomirov. Iterative Layer-wise Distillation for Efficient Compression of Large Language Models.
Abstract
This work investigates distillation methods for large language
models (LLMs) with the goal of developing compact models that
preserve high performance. Several existing approaches are
reviewed, with a discussion of their respective strengths and
limitations. An improved method based on the ShortGPT approach has
been developed, building upon the idea of incorporating iterative
evaluation of layer importance. At each step, importance is
assessed by measuring performance degradation when individual
layers are removed, using a set of representative datasets. This
process is combined with further training using a joint loss
function based on KL divergence and mean squared error. Experiments
on the Qwen2.5-3B model show that the number of layers can be
reduced from 36 to 28 (resulting in a 2.47 billion parameter model)
with only a 9.7% quality loss, and to 24 layers with an 18% loss.
The findings suggest that the middle transformer layers contribute
less to inference, underscoring the potential of the proposed
method for creating efficient models. The results demonstrate the
effectiveness of iterative distillation and fine-tuning, making the
approach suitable for deployment in resource-limited settings.
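A hedged PyTorch sketch of the joint objective the abstract mentions: KL divergence between teacher and student token distributions plus MSE between hidden states. The temperature, weighting, and shapes are placeholder choices rather than the paper's exact configuration.

    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits,
                     student_hidden, teacher_hidden,
                     temperature=2.0, alpha=0.5):
        t = temperature
        kl = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                      F.softmax(teacher_logits / t, dim=-1),
                      reduction="batchmean") * (t * t)
        mse = F.mse_loss(student_hidden, teacher_hidden)
        return alpha * kl + (1.0 - alpha) * mse

    # Shapes: (batch, seq, vocab) for logits and (batch, seq, dim) for states.
    s_log, t_log = torch.randn(2, 8, 100), torch.randn(2, 8, 100)
    s_h, t_h = torch.randn(2, 8, 64), torch.randn(2, 8, 64)
    print(distill_loss(s_log, t_log, s_h, t_h))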
Dmitry Peniasov and Konstantin Turalchuk. Few-Shot Prompting Strategies for Dialogue Summarization with LLMs.
Abstract
Dialogue summarization is a challenging subfield of natural language processing that requires handling fragmented, multi-speaker discourse, topic shifts, and long-range dependencies inherent in conversational data. Despite progress in text summarization, most state-of-the-art models are trained on monological corpora such as news articles, which differ significantly from dialogues in both structure and content. As a result, these models often exhibit poor transferability to dialogue summarization tasks. The scarcity of high-quality, annotated dialogue datasets further complicates effective model adaptation. In this work, we explore an alternative data-centric approach by leveraging large language models (LLMs) as synthetic annotators within a few-shot learning framework. We propose a retrieval-augmented method for in-context example selection that prioritizes semantic similarity to the input dialogue. Through a series of controlled experiments, we evaluate the impact of demonstration quality and selection strategy on summarization performance. Our findings suggest that carefully curated few-shot prompts can substantially enhance the reliability of LLM-generated dialogue summaries and reduce reliance on costly manual annotation.
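The retrieval-augmented example selection described above might look like the following sketch: embed the training dialogues, pick the k most cosine-similar to the input, and assemble the few-shot prompt. The embedding model name and prompt template are our assumptions.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

    def select_demonstrations(dialogue, pool, k=3):
        """pool: list of (dialogue, reference_summary) pairs."""
        emb = encoder.encode([dialogue] + [d for d, _ in pool])
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        top = np.argsort(-(emb[1:] @ emb[0]))[:k]       # cosine similarity
        return [pool[i] for i in top]

    def build_prompt(dialogue, demos):
        shots = "\n\n".join(f"Dialogue:\n{d}\nSummary:\n{s}" for d, s in demos)
        return f"{shots}\n\nDialogue:\n{dialogue}\nSummary:"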
[SHORT] Andrei Borisov and Konstantin Turalchuk. Contextual Support for Deep Reading of Russian Classical Literature Based on the Retrieval-Augmented Generation Method.
Abstract
In response to the growing interest in classical literature in
Russia, a Retrieval-Augmented Generation (RAG)–based system for
contextual reading support is proposed. The system generates hints
and answers to user questions at any place in the text, considering
both the immediate context and a curated knowledge base that
includes philological studies, cultural-historical commentaries,
biographical notes, etc. The Rebind.ai project offers a similar
solution for Western literature but does not support the Russian
classics. Moreover, Rebind.ai is costly due to the operational
complexity of its chosen process for producing new books. To
simplify the addition of books, proposed system’s architecture is
designed for rapid expansion of the underlying knowledge base.
SESSION 13. Time Series Analysis. Small conference hall. Chair: Mikhail Zymbler
Alexander Valiulin and Mikhail Zymbler. GPU-Accelerated Matrix Profile Computing for Streaming Time Series.
Abstract
Currently, the mining of streaming time series has become increasingly
important in a wide range of applications. At the moment, the
matrix profile (MP) is considered a simple yet powerful exploratory
time series mining tool, which provides the discovery of numerous
time series mining primitives without requiring prior knowledge.
The research community introduced a substantial number of MP
computing algorithms. However, to the best of our knowledge, no
existing developments are both GPU- and stream-oriented. In this article, we
introduce the StreaMP (Streaming Matrix Profile) algorithm for
GPUs, which overcomes the above limitation. Our algorithm processes
incoming data in a segment-wise manner, accumulating the time
series arriving from a sensor in RAM. StreaMP merges MPs of the
segment and the time series read so far through our proposed
multistep schema. To compute MPs, StreaMP is able to apply any
GPU-based algorithm, which encapsulates all details of parallel
processing. The MP built by StreaMP makes it possible to discover
the repeated and anomalous patterns in the streaming time series in
linear time. Experimental evaluation demonstrates that StreaMP
outperforms SCAMP, which is currently considered the fastest
GPU-based MP computation algorithm for batch mode.
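The sketch below illustrates the merge principle behind segment-wise streaming matrix profile maintenance: existing profile entries can only decrease as new subsequences arrive, so a running minimum over new-versus-existing distances suffices. This is a brute-force CPU illustration of the idea only, not the GPU-based StreaMP algorithm.

```python
# A brute-force CPU illustration of the merge step only (StreaMP itself
# computes segment distances on the GPU): min-merge every distance that
# involves at least one newly created subsequence of length m.
import numpy as np

def z_norm_dist(a, b):
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    return np.linalg.norm(a - b)

def update_profile(history, profile, segment, m):
    """Append `segment` to `history` and refresh the matrix profile:
    old entries can only shrink when closer new neighbors appear."""
    series = np.concatenate([history, segment])
    n_new = len(series) - m + 1
    new_profile = np.full(n_new, np.inf)
    new_profile[:len(profile)] = profile
    first_new = max(0, len(history) - m + 1)   # windows touching new data
    for j in range(first_new, n_new):
        for i in range(n_new):
            if abs(i - j) >= m // 2:           # trivial-match exclusion zone
                d = z_norm_dist(series[i:i + m], series[j:j + m])
                new_profile[i] = min(new_profile[i], d)
                new_profile[j] = min(new_profile[j], d)
    return series, new_profile
```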
Artem Fedorov and Archil Maysuradze. Scalable Bayesian Motif Detection for Multivariate Time Series.
Abstract
We develop a scalable, interpretable, and adaptive motif-discovery
framework for multivariate time series. The approach combines a
kernel-based, invariance-preserving embedding with variational
Bayesian inference under a Dirichlet-process mixture, so the number
of motifs grows naturally with data while remaining compact and
explainable. The same engine operates equally well on large offline
archives and real-time streams. Tests on synthetic signals, daily
stock prices, and clinical EEG show higher accuracy than
state-of-the-art baselines at a fraction of their computational
cost: the model reliably rediscovers expert-defined patterns,
maintains clarity of the learned representations, and remains
stable as data volume scales. These properties make the detector a
practical tool for financial, industrial, and biomedical
time-series analysis.
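As a rough stand-in for the Dirichlet-process component, the sketch below clusters sliding windows of a toy series with scikit-learn's variational BayesianGaussianMixture, whose DP prior leaves weight only on as many components as the data support. The plain sliding-window embedding here replaces the paper's kernel-based, invariance-preserving embedding.

```python
# A rough DP-mixture stand-in (not the paper's engine): with a
# Dirichlet-process prior, only as many components as the data support
# keep non-negligible weight, so the motif count grows with the data.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
series = np.concatenate([np.sin(np.linspace(0, 20, 500)),
                         rng.normal(0, 0.3, 500)])
windows = np.lib.stride_tricks.sliding_window_view(series, 25)

dpmm = BayesianGaussianMixture(
    n_components=20,  # upper bound; the DP prior prunes unused components
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500, random_state=0)
labels = dpmm.fit_predict(windows)
print("effective motif count:", int(np.sum(dpmm.weights_ > 0.01)))
```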
Dmitrii Popov and Archil Maysuradze. Optimizing Embedding Linearity for Robust Speaker Diarization in Overlapping Speech Scenarios.
Abstract
Speaker diarization, the task of segmenting and identifying
speakers in audio recordings, is critical for applications like
automatic speech recognition, transcription, and analysis of
multi-speaker recordings such as meetings and podcasts. Overlapping
speech, prevalent in datasets like AMI and CALLHOME, poses a
significant challenge, as it complicates accurate speaker
segmentation. This work addresses this issue by investigating the
linearity of biometric embeddings, a property enabling the
representation of overlapping speech as a linear combination of
individual speaker embeddings, which is essential for robust
diarization, particularly in cascaded schemes with Target-Speaker
Voice Activity Detection (TSVAD). We propose a novel fine-tuning
method for the ECAPA-TDNN model to enhance embedding linearity,
utilizing a synthetic dataset derived from VoxCeleb and a modified
loss function combining AAM-Softmax with a linearity term.
Integrated into a cascaded TSVAD-based diarization framework, our
approach supports both full-context and streaming modes.
Experiments on standard benchmarks (AMI, DIHARD, VoxConverse)
demonstrate reduced Diarization Error Rate (DER) compared to
state-of-the-art methods, highlighting improved handling of
overlapping speech. The proposed method bridges a gap in optimizing
embedding linearity, offering practical benefits for real-world
multi-speaker scenarios.
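A hedged sketch of what a linearity term of this kind can look like is shown below: the embedding of an overlapped mixture is pulled toward the sum of the clean speaker embeddings, and the term is added to the classification loss. The cosine formulation and the weight `lam` are assumptions of this summary, not the paper's exact loss.

```python
# A sketch of one possible linearity term (an assumption of this summary,
# not the paper's exact formulation): pull the mixture embedding toward
# the sum of the clean speaker embeddings, weighted against AAM-Softmax.
import torch.nn.functional as F

def linearity_loss(emb_mix, emb_a, emb_b):
    """Cosine distance between the mixture embedding and the sum of the
    individual speaker embeddings."""
    return 1.0 - F.cosine_similarity(emb_mix, emb_a + emb_b, dim=-1).mean()

def total_loss(aam_softmax_loss, emb_mix, emb_a, emb_b, lam=0.2):
    return aam_softmax_loss + lam * linearity_loss(emb_mix, emb_a, emb_b)
```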
Konstantin Izrailov, Igor Kotenko and Andrey Chechulin. Intelligent detection of voice fraud in authentication systems.
Abstract
In this paper, the pressing problem of countering voice falsification
(spoofing attacks) in voice authentication systems is considered.
An intelligent method for detecting fake speech is proposed which
is based on combining classical acoustic features and modern
machine learning models. As features, Mel-frequency cepstral
coefficients (MFCC), perceptual linear prediction (PLP) and other
characteristics are used, which describe individual properties of
voice signal. For classification, the deep neural network
architectures ECAPA-TDNN (Emphasized Channel Attention,
Propagation, and Aggregation in Time Delay Neural Network) and
statistical GMM-UBM (Gaussian Mixture Model-Universal Background
Model) are applied, providing high resistance to different types of
attacks, including synthesized and imitated recordings. The developed
system was tested on open corpora VCTK (Voice Cloning Toolkit) and
TEDLIUM (Technology, Entertainment, Design - Laboratoire
d’Informatique de l’Université du Mans), where it showed high
results: classification accuracy reaches 97–99%, equal error rate
(EER) decreased to 2–5%, and false acceptance rate (FAR) and false
rejection rate (FRR) are minimized. The scientific novelty of this
work is the integration of diverse methods of speech processing and
analysis, which increases the reliability of fake detection. The
practical significance is the possibility of using the developed
solution in real biometric systems, including banking services and
remote service platforms. The article also discusses limitations
and perspectives for further development of the proposed approach.
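For orientation, the sketch below shows the classical half of such a pipeline: MFCC features and two Gaussian mixtures (genuine vs. spoofed) scored by a log-likelihood ratio. The UBM adaptation step, file paths, and all hyperparameters are simplified placeholders, not the system's actual configuration.

```python
# A simplified sketch of the classical half of such a pipeline: MFCC
# features plus two Gaussian mixtures scored by a log-likelihood ratio.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000, n_mfcc=20):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # frames x coeffs

gmm_genuine = GaussianMixture(n_components=64, covariance_type="diag")
gmm_spoofed = GaussianMixture(n_components=64, covariance_type="diag")
# Placeholder training calls; `genuine_paths`/`spoofed_paths` are assumed lists:
# gmm_genuine.fit(np.vstack([mfcc_features(p) for p in genuine_paths]))
# gmm_spoofed.fit(np.vstack([mfcc_features(p) for p in spoofed_paths]))

def llr_score(path):
    """Positive values favor the genuine-speech hypothesis."""
    x = mfcc_features(path)
    return gmm_genuine.score(x) - gmm_spoofed.score(x)
```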
[SHORT] Leonid Sidorov and Archil Maysuradze. Temporal and Statistical Analysis of EEG Signals for Enhanced P300 Pattern Recognition.
Abstract
This paper provides a comprehensive analysis of EEG signal
processing, emphasizing pattern detection and interpretation
through a blend of machine learning and statistical methods. We
evaluate the efficacy of neural network architectures, particularly
those employing learned convolutions, for capturing temporal
dependencies in EEG data from a pattern recognition task. Our
approach uses the Wilcoxon test and Holm-Bonferroni correction to
identify significant variations across EEG channels and intervals,
enhancing the robustness of our findings. While the temporal
analysis revealed significant differences in P300 and non-P300
signals around 200 milliseconds post-stimulus, channel-wise
statistical methods showed limited effectiveness, emphasizing the
superiority of deep learning techniques in this aspect. These
results highlight the intricate dynamics of cognitive processing
and the importance of later signal segments, while acknowledging
the contributions of early activity. This work lays the groundwork
for advancing EEG-based brain-computer interface systems, offering
the framework for decoding neural phenomena and enhancing cognitive
signal analysis.
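The statistical procedure named above fits in a few lines; the sketch below runs a per-channel Wilcoxon signed-rank test on synthetic placeholder epochs and applies Holm-Bonferroni correction across channels.

```python
# A sketch of the named procedure on synthetic placeholder epochs:
# per-channel Wilcoxon signed-rank tests (P300 vs. non-P300 amplitudes)
# with Holm-Bonferroni correction across channels.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_trials, n_channels = 60, 8
p300 = rng.normal(1.0, 1.0, (n_trials, n_channels))      # target epochs
non_p300 = rng.normal(0.0, 1.0, (n_trials, n_channels))  # non-target epochs

pvals = [wilcoxon(p300[:, ch], non_p300[:, ch]).pvalue
         for ch in range(n_channels)]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print("significant channels:", np.where(reject)[0])
```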
SESSION 14. Image Analysis II. Lecture room №1. Chair: Aleksei Samarin
Dmitrii Vorobev, Artem Prosvetov and Karim Elhadji Daou. Real-time Localization of a Soccer Ball from a Single Camera.
Abstract
We propose a computationally efficient method for real-time
three-dimensional football trajectory reconstruction from a single
broadcast camera. In contrast to previous work, our approach
introduces a new set of discrete modes that are designed to
accelerate the optimization procedure in a multi-mode state model
while preserving centimeter-level accuracy – even in cases of
severe occlusion, motion blur, and complex backgrounds. The system
operates on standard CPUs and achieves low latency suitable for
live broadcast settings. Extensive evaluation on a proprietary
dataset of 6K-resolution Russian Premier League matches
demonstrates performance comparable to multi-camera systems,
without the need for specialized or costly infrastructure. This
work provides a practical method for accessible and accurate 3D
ball tracking in professional football environments.
Aleksandr Borisov and Sergey Makhortov. Facial Expression Generation from Neutral Geometry Using a Local Shape Model.
Abstract
This paper presents a method for generating three-dimensional
facial expressions from a given neutral geometry, based on a local
shape model and a small set of training data. The facial geometry
is pre-segmented into anatomically meaningful regions, and
deformation parameters are estimated for each region with respect
to data fidelity and boundary consistency. After approximating the
neutral shape, expressions are synthesized and transferred back to
the original geometry while preserving anatomical structure. The
proposed approach ensures a high degree of expressiveness and
consistency across expressions while maintaining subject-specific
facial features. Experimental results demonstrate its superiority
over existing methods under limited data conditions.
Daniil Timchenko and Dmitry Ignatov. TopoGAN Reproducibility Study: Enhancement and Analysis of an Emerging Paradigm.
Abstract
Recent advances in deep generative models, such as Generative
Adversarial Networks (GANs) and Diffusion Models, have led to
remarkable progress in high-fidelity image synthesis. However,
these models often produce artifacts and suffer from instability or
insufficient theoretical grounding. In parallel, the field of
Topological Data Analysis (TDA) has developed robust tools to
understand the intrinsic structure of high-dimensional data by
analysing its topological and geometric properties. This research
explores the intersection of TDA and generative image models,
investigating whether topological methods can enhance generative
quality, stability, and interpretability. We first introduce the
theoretical foundations of algebraic topology, metric geometry, and
probability theory, and review recent applications of TDA to GANs.
We then propose a novel approach to integrating TDA insights into
generative modeling pipelines, with a focus on quantifying and
minimizing topological artifacts. Experiments on benchmark datasets
evaluate the impact of topological constraints on image generation
fidelity and structure. Our results show that TDA-informed
modifications can yield improvements in sample coherence and offer
a promising direction for more robust generative models.
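As a hedged illustration of the TDA toolbox this line of work draws on (the library choice and thresholds are assumptions of this summary), the sketch below computes a persistence diagram of a noisy circle; comparing such diagrams between real and generated samples is one standard way to quantify topological artifacts.

```python
# A minimal TDA building block: persistent homology of a noisy circle via
# a Vietoris-Rips filtration with the gudhi library (ripser would work too).
import numpy as np
import gudhi

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
points = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))

rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
diagram = rips.create_simplex_tree(max_dimension=2).persistence()

# The circle's loop should appear as one long-lived 1-dimensional feature.
h1 = [(b, d) for dim, (b, d) in diagram if dim == 1]
print(sorted(h1, key=lambda bd: bd[1] - bd[0], reverse=True)[:3])
```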
SESSION 15. Large Language Models and Applications II. Conference Hall «Kapitsa». Chair: Konstantin Turalchuk
Archil Maysuradze and Nikita Breskanu. Evaluating Safety of Large Language Models with Cognitive Diagnosis.
Abstract
This research introduces a novel approach to LLM safety evaluation
by applying cognitive diagnosis, a tool traditionally used in
educational platforms. The core innovation lies in moving beyond
the limiting one-exercise-one-attribute paradigm used in almost all
the contemporary safety benchmarks like XSTest, ToxiGen, HarmBench,
TruthfulQA and others. We propose the application of cognitive
diagnosis in LLM safety validation, develop an automated attribute
extraction pipeline, and demonstrate superior predictive
performance of extracted attributes compared to original XSTest
labeling. We address a critical gap: prompts typically contain
multiple safety knowledge requirements rather than a single one. For
example, a prompt about historical violence may simultaneously
involve ethical reasoning, harmful content detection, cultural
sensitivity, and factual accuracy. Our methodology reduces LLM
safety evaluation to a cognitive diagnosis task by treating LLMs as
«students», prompts as «exercises», and binary scores as
«responses». This allows identifying latent LLM safety knowledge
levels in a multi-attribute setting. The proposed automated attribute
discovery process represents an advancement in creating
interpretable safety taxonomies. The resulting 11 attributes
provide a more nuanced characterization than the original 8
categories from XSTest, showing improvements in predictive metrics.
These results suggest the superiority of multi-attribute over
single-attribute question labeling in safety evaluation.
Anna Glazkova, Olga Mitrofanova and Dmitry Morozov. Temperature Effects on Prompt-Based Keyphrase Generation with Instruction-Based LLMs.
Abstract
Large language models (LLMs) demonstrate strong performance in
generating coherent and contextually appropriate texts based on
given instructions. However, the influence of model parameters on
the output in specific tasks remains underexplored. This study
examines the effect of the temperature parameter on the robustness
of instruction-based LLMs in keyphrase generation (KG), a core task
in information retrieval that facilitates text organization and
search. Using three Russian-language LLMs (T-lite, YandexGPT, and
Saiga), we compare keyphrases generated with identical prompts and
varying temperature values. The results show that higher
temperatures increase the diversity of generated keyphrases and
unigrams, as well as the proportion of keyphrases not present in
the source text. The extent of these effects varies across models.
Our findings underscore the importance of selecting both the model
and temperature setting in prompt-based KG. The experiments were
conducted on two text collections: scientific and news texts.
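A small sketch of how such temperature effects can be quantified follows: keyphrases are generated at several temperatures and compared by the share of distinct unigrams and the share of phrases absent from the source text. The `generate` callable is a placeholder for any instruction-tuned LLM wrapper, not an API from the paper.

```python
# A sketch of quantifying the reported effects: diversity (distinct
# unigram ratio) and the share of keyphrases absent from the source,
# computed per temperature. `generate` is an assumed placeholder.
def distinct_unigrams(phrases):
    tokens = [t for p in phrases for t in p.lower().split()]
    return len(set(tokens)) / max(len(tokens), 1)

def absent_share(phrases, source_text):
    src = source_text.lower()
    return sum(p.lower() not in src for p in phrases) / max(len(phrases), 1)

def compare_temperatures(generate, text, temps=(0.2, 0.6, 1.0)):
    for t in temps:
        phrases = generate(text, temperature=t)  # list of keyphrase strings
        print(f"T={t}: distinct={distinct_unigrams(phrases):.2f}, "
              f"absent={absent_share(phrases, text):.2f}")
```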
David Avagian and Alyona Ball. Efficiency Evaluation Framework for LLM-generated Code.
Abstract
The impressive capabilities of large language models (LLMs) in
natural language processing have led to their increasing use for
code generation. Comprehensive evaluation of LLM-generated code
remains, however, an open challenge. While correctness is well
studied, efficiency, another crucial aspect of code quality, remains
underexplored. This paper presents a new framework for
assessing the efficiency of generated code based on the Mercury and
EvalPerf benchmarks. We use additional test case generation and
sandbox improvements to refactor Mercury’s pipeline, integrating
three types of resource measurements (runtime, CPU instruction
count, and peak memory usage) and three code quality metrics
(Pass@1, Beyond, and DPS). Our evaluation of six LLMs (Phi-1,
Phi-2, Code Llama, QwQ-32B, Qwen3, and DeepSeek-V3) shows that
large LLMs vastly outperform smaller models and generated solutions
are usually better optimised for time than memory. Analysing
measurement and metric variance, we verify the stability of our
approach. Our framework is extensible, providing a foundation for
future code quality research.
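As a sketch of the kind of resource measurement involved, the snippet below times one candidate solution and records its peak memory with the Python standard library; CPU instruction counting requires OS-level tooling (e.g., perf) and is omitted. This illustrates the idea only, not the framework's actual sandbox.

```python
# Two of the three measurements for one generated solution: wall-clock
# runtime (time.perf_counter) and peak memory (tracemalloc).
import time
import tracemalloc

def measure(solution, test_input):
    """Run `solution` on one test case; return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = solution(test_input)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

_, secs, peak = measure(sorted, list(range(100_000, 0, -1)))
print(f"{secs:.3f} s, {peak / 1e6:.1f} MB peak")
```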
[SHORT] Konstantin Burlachko and Boris Dobrov. Usage of Large Language Models in Structured Data Analysis.
Abstract
The rapid advancement of Large Language Models (LLMs) has created
new opportunities for automating tasks involving tabular data,
including spreadsheet operations and data analysis. However,
applying LLMs to this domain faces significant challenges,
including context limitations, insufficient proficiency in
numerical calculations, and the lack of standardized evaluation
methods. This paper explores the current state of LLM applications
for tabular data, reviewing both agent-based and code-generation
approaches. We evaluate performance using standard metrics to
assess the robustness and correctness of the generated solutions.
Experimental results show that modern LLMs achieve a high degree of
accuracy in generating executable code for tabular data tasks. Our
findings highlight the potential of LLMs to transform tabular data
management by overcoming existing limitations with innovative
workflows and architectural solutions.
SESSION 16. Information extraction from text I. Small conference hall. Chair: Natalia Loukachevitch
Ildar Baimuratov, Denis Turygin and Dmitrii Pliukhin. Towards the Automated Annotation of Regulatory Text for OWL-Based Compliance Checking.
Abstract
Currently, normative regulations are typically presented in a
weakly structured format within human-readable regulatory
documents, which makes it possible to check information models for
compliance with the regulations only manually. The first step in
formalizing normative regulations can be the annotation of semantic
components and domain-specific terms, but such annotation requires a
significant amount of time and expertise in semantics from the
user. This research focuses on automating the annotation of
regulations to facilitate their translation into the OWL language,
enabling subsequent automated compliance checking. We propose
methods to automate annotation across three layers of the
annotation scheme: domain terms, semantic types, and semantic
roles. Our approach achieves promising results, with a Recall@10 of
64.8%, a micro-averaged F1-score of 79.7%, and an Adjusted Mutual
Information score of 81.88% on the three layers, respectively.
Anna Glazkova, Olga Zakharova, Olga Prituzhalova and Lyudmila Suvorova. Environmental Discourse in Russian Online Communities: Insights from Topic Modeling and Expert vs. LLM Topic Labeling.
Abstract
This paper studies the content of Russian online communities that
focus on environmental issues. Social networks are a major channel
for sharing information and mobilizing action, and they provide a
unique source for analyzing how ecological topics are represented
in public discussions. Our goal is to trace the main directions of
environmental discourse and to understand how digital communication
reflects the spread of ecological knowledge and practices. To
achieve this, we apply modern approaches from natural language
processing to a large collection of posts published in such
communities. We also compare how experts and large language models
assign labels to the discovered topics, examining the potential of
computational tools to support research in this field. The study
contributes both to social research on green practices in Russia
and to the development of text analysis methods for large online
collections. It highlights the value of combining expert knowledge
with automated approaches in order to study complex social and
environmental processes.
Alexander Sychev. Analysis of the Topic Structure of a Text Collection Based on the Top2vec Model.
Abstract
The talk describes an approach to diagnosing an existing topic model,
as represented in a labeled text collection, based on the Top2vec
model for jointly representing texts, words, and topics. The results
of a computational experiment are presented and discussed, studying
the applicability of the joint vector representation of topics,
documents, and words within Top2vec to the analysis of a real
collection of short text messages gathered from a regional news
portal, as well as of the term vocabulary built from them.
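For orientation, a toy sketch of the joint embedding this diagnostic builds on is shown below, using the top2vec package; the repeated toy documents only satisfy the package's corpus-size needs and carry no real structure, and the diagnostic procedure itself is the talk's contribution, not shown here.

```python
# A toy sketch of the joint document/word/topic embedding provided by
# the top2vec package; real use requires a sizable, varied corpus.
from top2vec import Top2Vec

documents = ["road repairs begin downtown",
             "city council approves the budget",
             "new park opens near the river"] * 200

model = Top2Vec(documents, speed="fast-learn", workers=2)
print("topics found:", model.get_num_topics())
topic_words, word_scores, topic_nums = model.get_topics()
print(topic_words[0][:5])  # most representative words of the first topic
```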
Sergey Znamensky. Selecting the longest common subsequence while avoiding unnecessary fragmentation as much as possible.
Abstract
A widely used LCS method, which consists of selecting a common
subsequence (CS) by maximizing its length, often results in an
excessively fragmented subsequence. Closely related to LCS, the
Levenshtein Metric does not always produce the expected results for
the same reason.
Attempts to avoid excessive fragmentation in CS extraction
have been carried out for decades using various approaches with
varying degrees of success, but unfortunately were not accompanied
by a clear understanding of how to measure fragmentation, let alone
how to minimize it.
This paper proposes to use the number of semantically
coherent common substrings to measure non-fragmentation. When
determining coherence is difficult or impossible, an empirical
distribution of the lengths of consistent substrings can be used
instead. The ROUGE-W algorithm with a weighting function calculated
based on this distribution is applicable for CS selection in
practice.
The paper presents theoretical estimates of this distribution
and numerical experiments with natural texts and program codes. The
experiments confirm the weights for the ROUGE-W metric used in
practice and highlight the fundamental difference between natural
texts and program code.
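For reference, the dynamic program below is a minimal sketch of the weighted LCS underlying ROUGE-W, with a polynomial weighting f(k) = k^alpha that rewards consecutive matches; the paper instead derives the weighting from an empirical distribution of coherent substring lengths, so the alpha value here is purely illustrative.

```python
# Weighted LCS (the ROUGE-W dynamic program): consecutive matches of
# length k contribute f(k) = k**alpha, so unfragmented matches score higher.
def wlcs(x, y, alpha=1.2):
    f = lambda k: k ** alpha
    n, m = len(x), len(y)
    c = [[0.0] * (m + 1) for _ in range(n + 1)]  # accumulated weighted score
    w = [[0] * (m + 1) for _ in range(n + 1)]    # current consecutive-run length
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], w[i][j] = c[i - 1][j], 0
            else:
                c[i][j], w[i][j] = c[i][j - 1], 0
    return c[n][m]

# A contiguous match outscores an equally long fragmented one:
print(wlcs("abcd", "abcd"), wlcs("abcd", "a-b-c-d"))
```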
SESSION 17. Data and image processing in astronomy. Online. Chair: Alexei Pozanenko
Sergey Belkin and Alexei Pozanenko. Clusterisation in the MV – log10(Eiso) Plane for GRB–SNe: Evidence for Distinct Subclasses?
Abstract
We present the most comprehensive sample available to date of
supernovae associated with gamma-ray bursts (SN–GRBs), for which
the peak time of the supernova brightness curve and the absolute
stellar magnitude at peak brightness have been identified. The
sample contains 44 supernovae. We performed a correlation search
between the parameters of the SN-GRB peak brightness in the
spectral filter band V (absolute stellar magnitude $M_V$ in the rest
frame) and the peak time $T_{max}$ in filter V in the rest frame, as
well as between these parameters and the intrinsic gamma-ray
emission parameters of the GRBs ($T_{90,i}$, $E_{iso}$, $E_{p,i}$).
No statistically significant correlations between any pair of
parameters were confirmed. The $M_V$–$\log_{10}(E_{iso})$ distribution
exhibits clustering, dividing the sample into two distinct groups.
These groups may reflect differences in the initial conditions of
the progenitor stars or variations in their final evolutionary
stages leading to gamma-ray bursts. We also discuss dataset
compilation methods and strategies for mitigating observational
biases and selection effects impacting the detection of SN–GRBs and
potential correlations.
Nicolai Pankov, Pavel Minaev, Alexei Pozanenko, Eugene Schekotikhin, Sergey Belkin, Elena Mazaeva and Alina Volnova. The Automatic Image Processing Software for Optical Transient Detection.
Abstract
The rapidly evolving era of multi-wavelength (gamma, X-ray,
UV/optical/IR, radio) and multichannel (EM and gravitational)
observations requires processing an extensive amount of images to
detect fast optical transients in near-real time. This is especially
important for optical counterparts of gamma-ray bursts, including
those associated with gravitational-wave events detected by LIGO,
Virgo and KAGRA. In this paper, we present STARFALL, a specially
designed micro-service application for automatic astronomical image
processing that mainly depends on APEX. We describe the main
capabilities and performance metrics of STARFALL. The obtained
scientific results are demonstrated with examples. Future plans for
STARFALL development are outlined.
Vladimir Samodurov. Processing of multi-year data and a search for gamma-ray bursts in the radio band at a frequency of 111 MHz.
Abstract
The paper presents the results of analyzing many years of data from
the multi-beam pattern of the BSA radio telescope. About 10 thousand
radio sources with flux densities of several Jansky were successfully
extracted. The statistics of flux errors (no worse than 10% for
bright sources before calibration and 5% after it) and of flux
dispersions (tens of percent, owing to scintillation on
inhomogeneities of the circumsolar plasma) are described. A technique
for extracting weak sources is demonstrated using the example of the
radio signal of Jupiter (about 5 Jy). It is shown that sources
brighter than 2 Jy can be extracted from daily data, while averaging
over several days lowers the detection limit to 0.3-0.5 Jy. With
these upper limits, processing results are reported for 12 GRBs that
fell within our observation window both in coordinates and in
observation epoch. The results are presented in tables and figures.
Pavel Kaygorodov, Ekaterina Malik, Dana Kovaleva, Oleg Malkov and Bernard Debray. A new engine to build Binary star DataBase (BDB).
Abstract
The Binary star DataBase BDB (https://bdb.inasan.ru) has a very long
history, and its internal design has changed twice during its
lifetime. The first version was written in the mid-90s as CGI shell
scripts and used text files for data storage. Later it was rewritten
in Stackless Python with the Nagare library. The next major update
was performed during the last year: the Nagare and other libraries
were developing more and more compatibility issues, so we decided to
rewrite the BDB code using a completely new approach. In this paper
we present a brief introduction to this new approach to the
distributed programming paradigm, which allows us to significantly
speed up development. Here we employ a switch from the traditional
Model-View-Controller approach to a distributed application, where
the server is a “primary node” that controls many web clients as
“subordinate nodes”, delegating all user-interface-related tasks to
them.
SESSION 18. Information extraction from text II: Datasets. Small conference hall. Chair: Sergey Znamensky
Rodion Sulzhenko and Boris Dobrov. A Dataset of Russian-Language Debates For Argument Mining.
Abstract
We present DebateRu, an annotated dataset of Russian-language student
debates designed for argument mining in culturally specific
contexts. Comprising 10 hours of spontaneous televised debates (429
arguments across 10 topics), the corpus captures authentic
rhetorical strategies and socio-political discourse patterns unique
to Russian youth culture. Unlike scripted debate datasets, DebateRu
preserves the emotional intensity and contextual nuances of
real-world argumentation, addressing a critical gap in non-English
resources. We evaluate the dataset through two tasks: stance
detection and argument generation, testing several state-of-the-art
Russian-adapted large language models. DebateRu provides a
benchmark for developing context-aware argumentation systems and
studying cross-cultural discourse patterns. We release the dataset
to support research in multilingual NLP, rhetorical education, and
computational social science. The collected dataset is publicly
available on GitHub.
Grigory Kovalev, Natalia Loukachevitch, Mikhail Tikhomirov, Olga Babina and Pavel Mamaev. Wikipedia-based Datasets in Russian Information Retrieval Benchmark RusBEIR.
Abstract
In this paper, we present a novel series of Russian information
retrieval datasets constructed from the “Did you know...”
section of Russian Wikipedia. Our datasets support a range of
retrieval tasks, including fact-checking, retrieval-augmented
generation, and full-document retrieval, by leveraging interesting
facts and their referenced Wikipedia articles annotated at the
sentence level with graded relevance. We describe the methodology
for dataset creation that enables the expansion of existing Russian
Information Retrieval (IR) resources. Through extensive
experiments, we extend the RusBEIR research by comparing
lexical retrieval models, such as BM25, with state-of-the-art
neural architectures fine-tuned for Russian, as well as
multilingual models. Results of our experiments show that lexical
methods tend to outperform neural models on full-document
retrieval, while neural approaches better capture lexical semantics
in shorter texts, such as in fact-checking or fine-grained
retrieval. Using our newly created datasets, we also analyze the
impact of document length on retrieval performance and demonstrate
that combining retrieval with neural reranking consistently
improves results. Our contribution expands the resources available
for Russian information retrieval research and highlights the
importance of accurate evaluation of retrieval models to achieve
optimal performance. All datasets are publicly available at
HuggingFace. To facilitate reproducibility and future research, we
also release the full implementation on GitHub.
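As a minimal illustration of the lexical baseline used in these comparisons, the sketch below scores two toy documents against a query with the rank_bm25 package; the documents, tokenization, and query are placeholders.

```python
# A toy illustration of the lexical baseline: BM25 scoring with the
# rank_bm25 package (whitespace tokenization, no lemmatization).
from rank_bm25 import BM25Okapi

docs = ["Москва является столицей России",
        "BM25 остаётся сильным базовым методом поиска"]
bm25 = BM25Okapi([d.lower().split() for d in docs])

query = "столица России".lower().split()
print(bm25.get_scores(query))  # higher score = more relevant document
```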
Maria Eliseeva, Natalia Efremova, Natalia Baeva and Jia Yi Yang. The Visual Genome Dataset: Translation into Russian and Statistics.
Abstract
At present, the Visual Genome dataset is one of the few datasets that
describe images as scene graphs, i.e., it contains information not
only about the objects in an image and their attributes, but also
about the relations between objects. This makes Visual Genome a
promising basis for new datasets, including ones in other languages.
We have begun adapting this dataset for the Russian language.
The article analyzes the Visual Genome dataset: it describes
how the dataset was created and examines the peculiarities of the
resulting annotation. The process of translating the dataset into
Russian is described, along with the difficulties that arose and the
ways they were resolved. The results of a statistical analysis of the
translated data are discussed separately, with the main attention
paid to the textual descriptions of images and to objects. The main
reasons are identified that led to identical translations of
different English phrases, which changed the ratio of data in the
original and translated datasets.
In conclusion, we present our reflections on the specifics of
Visual Genome and of the data obtained during translation. Examples
of possible uses of the translated data and directions for further
research are mentioned.
[SHORT] Elena Bolshakova and Anna Stepanova. Recognizing Cognates Based on Dataset with Morpheme Segmentation.
Abstract
The paper considers cognates as relative words in a particular
language, which have the same root (e.g., air, airy, airily,
airless) and thus preserve some semantic relatedness. Recognition
of the cognates is useful for such NLP tasks as deriving meaning of
new and rare words, paraphrase detection, creation of lexical
derivational resources. The paper describes methods for
recognizing cognates as the first stage of developing a
representative derivational resource for the Russian language, based on
a large dataset of words with segmented and classified morphemes
(prefixes, roots, suffixes, endings, postfixes). The methods
involve collecting words of the same root into disjoint groups
(derivational families), accounting for homonymous roots of
Russian words, as well as root allomorphs (variants of the same
roots). The allomorphs arise due to alternations of vowels and
consonants and may be common for several non-cognate words, which
is the main problem of their recognition. To identify semantic
relatedness of words with such roots, clustering methods (DBSCAN,
K-means, HDBSCAN) are experimentally studied, with vector
representations of words (embeddings) in Word2Vec and FastText
models. The experiments showed acceptable quality of the described
methods, sufficient to eliminate most of the manual work of
collecting groups of cognates into derivational families of words.
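A hedged sketch of the disambiguation step follows: words that share a root string are clustered by embedding similarity, so non-cognates with a homonymous root separate into different groups. The embedding function and the eps threshold are illustrative assumptions.

```python
# A sketch of splitting one homonymous-root group by embedding
# similarity (embedding source and eps are illustrative assumptions).
import numpy as np
from sklearn.cluster import DBSCAN

def split_root_group(words, embed, eps=0.35):
    """words: candidates sharing one root string; embed: word -> vector
    (e.g., from a Word2Vec or FastText model)."""
    X = np.vstack([embed(w) for w in words])
    labels = DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(X)
    groups = {}
    for w, lab in zip(words, labels):
        groups.setdefault(lab, []).append(w)  # label -1 = noise/unassigned
    return groups
```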
SESSION 19. Data analysis in astronomy. Online. Chair: Alexei Pozanenko
Maxim Pupkov, Art Prosvetov, Vasiliy Marchuk, Alexander Govorov, Olga Yushkova, Alexander Andreev and Vladimir Nazarov. Multimodal-Data-Driven Lunar Surface Reconstruction Using High-Resolution Imagery and Simulated Radargrams.
Abstract
Accurate and high-fidelity lunar surface modeling is vital for
effective mission planning and execution in contemporary lunar
exploration. In this work, we advance digital terrain
reconstruction by integrating novel machine learning techniques,
such as Neural Radiance Fields and Gaussian Splatting, with
traditional photogrammetry, leveraging high-resolution imagery
ranging from 5 down to 2 meters per pixel. A key focus of our study is
the utilization of radargram data, specifically modeling the type
of profiles that are expected to be obtained by instruments on
future lunar missions. These data provide targeted height estimates
within narrow swaths directly beneath an orbiter's trajectory,
effectively offering reliable depth measurements at subsatellite
trackpoints. We incorporate this subsample of high-confidence
altimetric information into the Neural Radiance Fields and Gaussian
Splatting models by adding ground-truth depth constraints for a
subset of the dataset, which enhances the learning process. Our
findings demonstrate that the inclusion of radargram-derived depth
information leads to a significant improvement in terrain
reconstruction quality. This is evidenced by enhanced accuracy
metrics when fusing data from multiple modalities. The proposed
approach highlights the benefits of combining optical and radar
sources for robust lunar surface modeling, thereby enabling the
development of mission-ready, high-precision digital terrain
products to support the next generation of lunar exploration
missions.
Alexander Rodin, Alexei Pozanenko and Viktoria Fedorova. Observation of the gravitational wave event GW190425 in the radio range at 111 MHz.
Abstract
The paper presents the results of the search and detection of radio
emission from the gravitational-wave event GW 190425 at a frequency
of 111 MHz using the BSA radio telescope of the Lebedev Physical
Institute. A new radio source was discovered approximately
$2^\circ$ from the center of the region of most probable
localization of GW 190425, at coordinates (J2000.0) $\alpha =
16^h\,30'\pm 15',\;\delta = 13^\circ \, 21' \pm 15'$. The light
curve was constructed; the flux at maximum is estimated at
$\approx 1.5$ Jy. The maximum of the light curve occurs on the
20th-30th day after the trigger. The false-alarm probability for
this kind of event was calculated to be $P_{fa}=5\cdot 10^{-7}$.
Ekaterina Malik, Pavel Kaygorodov, Dana Kovaleva, Oleg Malkov and Bernard Debray. On possible systematic errors when x-matching binary stars in Gaia.
Abstract
Using Gaia DR3 data, catalogs of binary stars have been created
containing information on more than 1.8 million pairs in total. This
enlarges by more than an order of magnitude the ensemble of binary
stars with known characteristics, which previously numbered 144,845
pairs. To carry out a statistical analysis of the full ensemble of
binary stars, including both previously known and newly discovered
pairs, a coordinate-based cross-identification was performed between
the ILB synthetic catalog of binary stars, the most complete catalog
prior to the publication of Gaia data, and the binary-star catalogs
based on Gaia DR3 results. An analysis of the outcomes of this
identification showed that its characteristics depend both on the
data of the source catalogs and on the coordinates. It is shown that
in dense stellar fields, in particular in the Galactic disk, an
increased fraction of false-positive identifications can be expected.
At the same time, for systems with large proper motions the
probability of a false-negative identification outcome is high.
Possible modifications of the identification method are proposed to
reduce the role of the described systematic errors and to increase
the reliability of its results.
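For orientation, the sketch below performs a coordinate cross-match of the kind described here with astropy's nearest-neighbor matching; the catalogs and the 2-arcsecond radius are placeholders. In dense fields a radius that is too generous inflates false positives, while large proper motions push true pairs outside it.

```python
# A toy coordinate cross-match with astropy (catalogs and the matching
# radius are placeholders, not the paper's actual data or settings).
import astropy.units as u
from astropy.coordinates import SkyCoord

ilb = SkyCoord(ra=[10.001, 45.300] * u.deg, dec=[-5.002, 20.100] * u.deg)
gaia = SkyCoord(ra=[10.0012, 60.000] * u.deg, dec=[-5.0021, 10.000] * u.deg)

# For each ILB entry, find the nearest Gaia entry and its separation.
idx, sep2d, _ = ilb.match_to_catalog_sky(gaia)
matched = sep2d < 2 * u.arcsec
print(matched, sep2d.to(u.arcsec))
```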
[SHORT] Eugene Shekotihin, Nicolay Pankov, Alexei Pozanenko, Pavel Minaev and Alina Volnova. Brownian Bridge Diffusion Model in the Problem of Conditional Inpainting of Astronomical Images.
Abstract
The paper considers the application of the Brownian Bridge Diffusion
Model (BBDM) to the problem of conditional inpainting of astronomical
images. The proposed algorithm uses a single pair of images and a
diffusion model that learns to transform the reference frame into the
target frame on the unmasked regions, in order to reconstruct the
region of interest in the target frame. Using real astronomical
survey images, the proposed method is shown to provide stable
conditional inpainting and reconstruction of galaxy images from the
SDSS survey based on images from the Pan-STARRS survey.
SESSION 20. Information extraction from text III. Small conference hall. Chair: Boris Dobrov
Elena Shamaeva and Natalia Loukachevitch. The Impact of Tokenization on the Quality Evaluation of Neural Syntactic Parsing.
Abstract
Syntactic parsers are used as an auxiliary tool in various areas of
automatic text processing. Important research directions therefore
include developing criteria for choosing a parser for a particular
applied task and a methodology for evaluating parser quality. Parser
quality evaluation is affected by the tokenization stage. There are
two ways to evaluate a parser: using its built-in tokenizer, or using
a tokenizer that returns the gold-standard segmentation. This article
compares these two ways of evaluating parsing quality. The study was
carried out on the Russian-language syntactically annotated corpora
SynTagRus, GSD, PUD, Taiga, and Poetry, and for the Russian-language
parsers UDPipe, Stanza, Natasha, DeepPavlov, and spaCy. It was found
that for a significant number of sentences the token segmentation
produced by the built-in tokenizer differs from the gold standard. It
was also established that the average values of the UAS and LAS
metrics are higher when a tokenizer returning the gold-standard
segmentation is used. The developed methodology for describing token
categories can be used to check parsing quality when a new tokenizer
is introduced. Within this study, a tokenizer returning the
gold-standard token set from the dataset was implemented for each of
the parsers considered. The implementation is available at:
https://github.com/Derinhelm/parser_stat/tree/tokenization_changing.
[SHORT] Anton Polevoi and Natalia Loukachevitch. Whisper Attacks and Defenses Investigation for Russian Speech.
Abstract
Automatic Speech Recognition (ASR) systems, such as Whisper,
are widely used in modern applications, but are vulnerable to
adversarial attacks like model-control attacks, where adversarial
audio segments manipulate model behavior without prompt access.
Based on prior research, this paper focuses on adversarial attacks
targeting Russian inputs and proposes defense strategies.
To improve attack imperceptibility while maintaining
effectiveness, we introduce a regularization technique that
incorporates speech similarity metrics, leveraging acoustic
embeddings to balance attack efficiency and naturalness. This allows
for adversarial perturbations that are both potent and perceptually
similar to natural speech.
Our findings show that long adversarial prefixes can
significantly degrade Whisper's performance for Russian inputs, while
shorter prefixes have a reduced impact. Additional preprocessing
methods like speech enhancement showed moderate success but were
less effective for real-time scenarios. This work advances
understanding of ASR vulnerabilities and defenses for Whisper
models in Russian audios.
[SHORT] Vadim Korobkovskii and Natalia Gorlushkina. Automation of retroconversion processes of bibliographic materials.
Abstract
Creating electronic catalogs, which greatly ease readers' access to
the information they need, is an important task for modern libraries.
However, the problem of automated conversion of existing paper
catalogs into digital form has remained topical since the beginning
of the 21st century, since no universal solution has yet been found.
The article describes the research methods and results for
introducing optimization and functional improvements. An analysis of
the previously implemented program code is presented, aimed at
finding ways to eliminate the identified shortcomings. As a result of
the described changes, new functions related to splitting records
into RUSMARC-format fields were added to the algorithm, the algorithm
itself was improved, and fixes were introduced that significantly
sped up the program and extended its ability to split text into
fields in accordance with the RUSMARC standard.
SESSION 21. Database Management. Conference Hall «Kapitsa». Chair: Maria Poltavtseva
Semyon Grigorev, Vladimir Kutuev, Olga Bachishche, Vadim Abzalov and Vlada Pogozhelskaya. GLL-based Context-Free Path Querying for Neo4j.
Abstract
We propose a GLL-based context-free path querying algorithm that
handles queries in Extended Backus-Naur Form (EBNF) using Recursive
State Machines (RSM). Utilization of EBNF allows one to combine
traditional regular expressions and mutually recursive patterns in
constraints natively. The proposed algorithm solves both the
reachability-only and the all-paths problems for the all-pairs and
the multiple sources cases. The evaluation on real-world graphs
demonstrates that the utilization of RSMs increases the performance
of query evaluation. Being implemented as a stored procedure for
Neo4j, our solution demonstrates better performance than a similar
solution for RedisGraph. The performance of our solution on regular
path queries is comparable to the performance of the native Neo4j
solution, and in some cases, our solution requires significantly
less memory.
Habibur Rahman Habib and Ramon Antonio Rodriges Zalipynis. Scalable Top-K Subarray Searches: Seamless and Distributed NetCDF API Interception.
Abstract
This paper presents an approach that enables distributed and
seamless processing of NetCDF-based geospatial arrays through API
interception. By intercepting standard NetCDF read operations and
routing requests through a Vert.x/Hazelcast cluster, our system
achieves distributed processing without requiring code
modifications. Evaluation using MODIS satellite data demonstrates
linear scaling to 16 nodes with 1.76× speedup over serial
processing, while maintaining full API compatibility. The
architecture’s event-driven design achieves 15ms average request
latency through chunk-based distribution and dynamic load
balancing. This work bridges conventional NetCDF workflows with
modern distributed computing in the cloud, enabling scalable
analysis through familiar interfaces.
Alexander Solovyev. CAP theorem and NewSQL DBMS.
Abstract
The article proposes the mathematical apparatus of queueing theory
for modeling NewSQL DBMS and distributed information systems. The
applicability of the proposed apparatus for NewSQL modeling is
demonstrated. The mathematical formulation of the CAP theorem is
proposed. Typical examples of NewSQL DBMS are reviewed in the
article along with the test data confirming strong consistency and
availability of distributed data. The task of modeling a
distributed DBMS is formulated. A set of models for calculating the
main parameters of a distributed system and a set of queueing
theory models applicable for modeling of a distributed system are
proposed. Distributed system parameters are matched to the CAP
theorem terms, which makes it possible to confirm or refute its
provisions during modeling. In further research it is planned to refine the
mathematical models proposed in this article, to confirm their
correctness and applicability to the modeling of distributed DBMS
and information systems built using NewSQL.
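As a hedged illustration of the queueing-theory apparatus (an assumption of this summary, not the article's own model), even a single M/M/1 node links a CAP-style trade-off to measurable quantities: synchronous replication for stronger consistency raises the effective service demand, which lowers the probability of responding within a latency bound.

```latex
% M/M/1 node with arrival rate \lambda and service rate \mu
% (stable when \rho = \lambda/\mu < 1): mean response time and the
% probability of answering within a deadline t.
W = \frac{1}{\mu - \lambda}, \qquad
\Pr\{T \le t\} = 1 - e^{-(\mu - \lambda)t}.
```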
| Milestone | Date |
| --- | --- |
| Submission deadline for papers | June 9, 2025 |
| Submission deadline for tutorials | June 2, 2025 |
| Notification for the first round | July 24, 2025 |
| Final notification of acceptance | September 8, 2025 |
| Deadline for camera-ready versions of the accepted papers | September 15, 2025 |
| Conference | October 29-31, 2025 |