Conference Program

The conference program is available.

WEDNESDAY, OCTOBER 29th

SESSION 1. Conceptual Modeling and Ontologies. Small conference hall. Chair: Nikolay Skvortsov

Vladimir Budzko and Viktor Medennikov. Forming the Information Landscape of the Digital Economy: The Case of Agriculture.

Abstract The paper considers an iconographic and formalized description of the integration of information resources and the algorithms for their use, reflecting a substantial part of the basic principles of the digital economy and enabling the development of a mathematical model for forming a digital control platform (DCP) for the economy. The DCP is a complementary union of three sub-platforms (digital economy standards): collection of primary accounting information, accumulated for further active use in a cloud database (DB) shared by all production industries of Russia; unified information DBs reflecting the technological specifics of a particular industry; and unified knowledge bases reflecting managerial decision-making, likewise for a particular industry. The DCP is a digital instrument for an effective transition from fragmentary methods of designing and developing information systems to a comprehensive, integrated approach in the digital economy. The DCP includes unified industry-level and federal information models, classifiers, and reference books. Using the life cycle of an optimized crop rotation structure, which determines all processes in agriculture, the information models of this subsystem are examined. It is shown that the DCP with the three obtained digital standards, combined with mature information modeling methods, is an effective instrument for building an automated management system (AMS) for an agrarian enterprise, ensuring information compatibility of the AMSs of most such enterprises and, as a consequence, effective solution of most problems facing producers. Moreover, the information and algorithmic compatibility of AMSs will ensure transparency of industry management at the regional and federal levels at all stages of production.

Anna Shiyan, Anton Larin, Ildar Baimuratov and Nataly Zhukova. Automatic Information Retrieval and Extraction Methodology for the Ontology of Plant Diseases.

Abstract Detecting plant diseases is a challenge for farmers; however, computer vision and image processing offer solutions. A major issue is the limitation of information obtained from the image alone: computer vision by itself does not take into account weather conditions or the visual similarity of symptoms between diseases. These challenges can be addressed by developing an expert system that uses an ontology containing knowledge about plant diseases, pathogens and symptoms. We developed an ontology of plant diseases by integrating existing ontologies and adding disease-causing factors. Our Plants and their Diseases Ontology (PatDO) can improve diagnostic accuracy by incorporating detailed symptom descriptions and linking them with specific pathogens. Wikidata served as the primary source for taxonomic data, with SPARQL queries extracting relationships among plants, pathogens, and diseases. Using data from the American Phytopathological Society and the European and Mediterranean Plant Protection Organization (EPPO), we identified relationships previously undocumented in Wikidata. In addition, large language models helped extract the symptoms of the pathogens that cause plant diseases from EPPO. The final ontology consists of 5002 classes and 8 properties that connect various entities, including plants, pathogens and symptoms.
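
As a rough illustration of the retrieval step described above (not the authors' pipeline), the following Python sketch pulls disease-host pairs from the Wikidata SPARQL endpoint using SPARQLWrapper; the choice of property P689 ("afflicts") as the disease-to-host relation is our assumption.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery("""
        SELECT ?disease ?diseaseLabel ?host ?hostLabel WHERE {
          ?disease wdt:P689 ?host .    # P689 "afflicts": condition-to-host taxon (assumed relevant)
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        } LIMIT 25
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["diseaseLabel"]["value"], "->", row["hostLabel"]["value"])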

Nikolai Kalinin and Nikolay Skvortsov. An Approach to Ontology-Based Domain Analysis for Cross-Domain Research Infrastructure Development.

Abstract Effective development of Research Infrastructures (RIs) depends on a thorough understanding of domain-specific needs. Despite its importance, domain analysis remains underrepresented in RI development methodologies. Building on previous work utilizing literature analysis, this article introduces an approach based on the analysis of ontological sources. Ontologies, as structured representations of domain knowledge, offer both a foundation for metadata schema creation and a means to capture domain requirements systematically. Moreover, ontologies play a key role in implementing the FAIR principles by enabling data to be Findable, Accessible, Interoperable, and Reusable through shared semantics and standardized vocabularies. We propose the construction of an ontology for research infrastructure development, based on the DOLCE UltraLite (DUL) foundational ontology, and describe an approach for mapping it to existing domain-specific ontologies. Through this mapping process, we create an ontology network that supports comprehensive domain analysis and enables semantic interoperability across disciplines. This integrated ontology network provides a robust basis for domain analysis and for the construction of cross-domain Research Infrastructures aligned with FAIR-compliant practices.

SESSION 2. Information security I. Lecture room №1. Chair: Maxim Kalinin

Maria Poltavtseva and Dmitry Zegzhda. Security assessment of heterogeneous big data processing and storage systems.

Abstract The paper considers the problem of assessing the security of big data processing and storage systems. Heterogeneous big data processing and storage systems are increasingly used today in large enterprises and organisations. They are characterised by two important aspects. First, they use different data storage and processing tools, SQL and NoSQL DBMSs, which are geographically and organisationally distributed. Second, these tools process a cohesive, unified data set with a complex lifecycle for each information fragment. Each individual tool of such an ecosystem, built on different platforms, can be attacked by an intruder and cause data leakage. The high degree of trust and the volume of processed data aggravate the security situation. The paper proposes a new method for assessing the security of such systems. The method uses input data on the processes of data processing and data movement in the source system, collected using blockchain technology. The security assessment takes into account the specificity of big data processing and storage systems and the need to integrate the specific security assessment with the security assessment of the information system as a whole. The calculation of the specific assessment is based on the analysis of the access control system, through the analysis of security policies, and the analysis of trust in the nodes performing operations on data. As a result, the paper proposes an integrated assessment that reflects the specificity of big data processing and storage systems and can be easily embedded in various more general assessment methodologies. A security assessment framework based on a previously presented target system modelling framework for security policy analysis is also presented.

Maxim Kalinin, Artem Konoplev and Vasiliy Krundyshev. Storage Optimization of a Blockchain-like Oriented Acyclic Block Graph Used for Data Protection in Highly Loaded Systems.

Abstract This paper examines existing methods for addressing the challenge of unlimited blockchain growth and evaluates their applicability to blockchain-like oriented acyclic block graphs, which are utilized for data protection in highly loaded systems. The authors introduce a novel approach to reduce the volume of such block graphs by embedding the hash image of the system’s state directly into block headers. This method improves scalability while maintaining data integrity and security, making it particularly suitable for resource-constrained environments. By fixing the system state’s hash representation within the block structure, the proposed solution minimizes storage requirements without compromising the decentralized verification process. The technique is designed to optimize the performance of oriented acyclic block graphs enabling their efficient deployment in high-throughput applications. Potential use cases include smart cities, vehicular ad-hoc networks (VANETs), industrial and medical IoT systems, and social digital services, where high transaction volumes and low latency are critical. The paper highlights the method’s ability to balance scalability, security, and decentralization, ensuring robust data protection in dynamic large-scale networks. The findings suggest that this method offers a viable path for implementing lightweight, scalable, and secure distributed ledger technologies in highly loaded systems.
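
A minimal Python sketch of the core idea, embedding a hash image of the current system state into a block header so that older history can be pruned; the field names and single-hash layout are illustrative assumptions, not the authors' format.

    import hashlib, json, time

    def block_header(parents, txs, state):
        """DAG block header embedding a hash image of the full system state
        (illustrative field layout; not the authors' exact format)."""
        sha = lambda obj: hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()
        header = {
            "parents": parents,        # several parent ids: an acyclic block graph, not a chain
            "tx_root": sha(txs),
            "state_hash": sha(state),  # fixes the state, so blocks before it can be pruned
            "timestamp": time.time(),
        }
        header["id"] = sha(header)
        return header

    genesis = block_header([], [], {"alice": 10, "bob": 5})
    b1 = block_header([genesis["id"]],
                      [{"from": "alice", "to": "bob", "amount": 2}],
                      {"alice": 8, "bob": 7})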

Vladimir Budzko and Viktor Belenkov. Cybersecurity of Data-Intensive Systems: The Discipline of Forming Integrity Synchronization Points at a DID-System Facility.

Abstract The current stage of development of Russian society is characterized by the digital transformation of all its spheres, including the economy, science, healthcare, education, and culture. One facet of this transformation is the widespread use of data-intensive systems (DID-systems). Their growing adoption and their operation as part of global networks entail new information security (IS) risks that can negatively affect individuals, groups, organizations, sectors of the economy, and society as a whole. Ensuring the IS of DID-systems requires organizing their operation so that the DID-system and its databases (DB) can be restored after cyber-attacks. The article considers the conditions that the discipline of forming integrity synchronization points at a DID-system facility must meet in order to minimize the total time spent both on the special actions that make restoration possible and on the actual restoration of the system after cyber-attacks leading to failures.

SESSION 3. Machine learning methods. Conference Hall «Kapitsa». Chair: Vladimir Parkhomenko

Nikita Tsaplin, Alexander Petrov and Dmitry Kovalev. Greedy Feature Selection for Network Traffic Shallow Packet Inspection.

Abstract Monitoring network traffic is essential for securing cloud infrastructure, especially given the growing sophistication of cyberattacks and information-based threats. Traditional deep packet inspection (DPI) methods are often impractical due to high computational costs, legal concerns, and incompatibility with encrypted traffic. This paper presents RuStatExt, a high-performance driver for Hyper-V that enables real-time network monitoring through Shallow Packet Inspection (SPI) – analyzing only L2–L4 headers. We propose novel algorithms for greedy feature selection and dynamic parameter tuning via grid search, significantly improving the performance of machine learning models used for anomaly detection. Experimental evaluation on real-world network traffic from 21 production VMs shows that combining SPI metadata with optimized model training increases the F1-score compared to baseline approaches. The best results were achieved using Isolation Forest with dynamic parameter scaling, significantly improving average F1-score from 0.28 to 0.73. The solution introduces negligible overhead even under 1 Gbps traffic loads, making it suitable for large-scale deployment in cloud environments.
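
The paper's exact algorithms are not reproduced here, but a generic greedy forward feature selection loop around scikit-learn's Isolation Forest, scored by F1 as in the abstract, might look as follows (toy data, assumed hyperparameters):

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import f1_score

    def greedy_select(X, y, max_features=5):
        """Forward selection: repeatedly add the feature that best improves anomaly F1."""
        selected, last_best = [], 0.0
        remaining = list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            scored = []
            for f in remaining:
                cols = selected + [f]
                pred = IsolationForest(random_state=0).fit_predict(X[:, cols])
                scored.append((f1_score(y, (pred == -1).astype(int)), f))  # -1 = anomaly
            best_score, best_f = max(scored)
            if best_score <= last_best:
                break                                  # no further improvement: stop greedily
            selected.append(best_f); remaining.remove(best_f); last_best = best_score
        return selected, last_best

    # toy data: feature 0 carries the anomaly signal, the rest are noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6)); y = np.zeros(500, dtype=int)
    X[:25, 0] += 6.0; y[:25] = 1
    print(greedy_select(X, y))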

Dmitry I. Ignatov and Ruslan Molokanov. Benchmarking of Boolean Matrix Factorization Models for Collaborative Filtering: Classic and Neural Approaches.

Abstract Boolean matrix factorization (BMF) is a widely used technique for dimensionality reduction and information extraction from high-dimensional binary data. It is commonly applied in areas where the binary data format arises naturally. Examples include database tiling, financial transaction analysis, pattern recognition, and recommendation systems based on processing implicit feedback. This study aims to implement and compare different BMF methods for collaborative filtering – an approach for generating personalized recommendations by leveraging feedback from users with similar preferences. Classical methods rooted in Formal Concept Analysis are examined alongside a new neural network approach inspired by the Neural Collaborative Filtering architecture. Algorithms are implemented and evaluated on various datasets, including synthetic matrices and real-world data such as user ratings of movies.
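
For readers unfamiliar with the setting, a small Python sketch of the Boolean matrix product that BMF methods optimize, together with the reconstruction-error count they typically minimize; the toy factors below are illustrative only:

    import numpy as np

    def boolean_product(A, B):
        """Boolean matrix product: (A o B)[i, j] = OR_k (A[i, k] AND B[k, j])."""
        return (A.astype(int) @ B.astype(int) > 0).astype(int)

    def reconstruction_error(I, A, B):
        """Number of cells where the Boolean factorization disagrees with the data."""
        return int(np.sum(I != boolean_product(A, B)))

    # toy user-item matrix covered exactly by two "concept"-like factors
    A = np.array([[1, 0], [1, 1], [0, 1]])        # users x factors
    B = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])    # factors x items
    I = boolean_product(A, B)
    print(reconstruction_error(I, A, B))          # 0: an exact 2-factor cover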

Wei Yuan, Dmitry Ignatov and Zamira Ignatova. Statistical learning of polyhedra volumes from metric features.

Abstract We explore statistical learning methods for computing polyhedra volumes, motivated by Sabitov's polynomial approach, which expresses volume as a root of an edge-length-dependent polynomial. Searching for Sabitov polynomials is known to present significant computational challenges for complex polyhedra. To estimate the volume of a polyhedron directly from its edge lengths, we propose using ElasticNet regression to approximate volumes from edge lengths, demonstrating high accuracy for tetrahedra (R2 > 0.999) and octahedra (R2 ≥ 0.98 up to > 0.999, depending on the generation parameters), and yielding their volume formulas from triples of the distances of their edges and diagonals. We further speculate on the capability of statistical learning to deal with Steffen's flexible polyhedron, where traditional methods struggle to obtain the Sabitov polynomial. Our results bridge algebraic geometry and machine learning, offering a scalable alternative for volume computation.
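
A self-contained Python sketch of the regression setup for tetrahedra (our reconstruction of the idea, with assumed hyperparameters): since the Cayley-Menger identity makes 288 V^2 a degree-3 polynomial in the six squared edge lengths, an ElasticNet over degree-3 monomials can recover the squared-volume formula almost exactly.

    import numpy as np
    from sklearn.linear_model import ElasticNet
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    P = rng.uniform(-1, 1, (5000, 4, 3))                         # vertices of random tetrahedra
    pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]  # the 6 edges
    X = np.stack([((P[:, i] - P[:, j]) ** 2).sum(axis=1) for i, j in pairs], axis=1)
    y = np.abs(np.linalg.det(P[:, 1:] - P[:, :1])) / 6.0         # true volumes from coordinates

    model = make_pipeline(PolynomialFeatures(3), ElasticNet(alpha=1e-6, max_iter=200_000))
    model.fit(X, y ** 2)                                         # learn the squared-volume polynomial
    print("R^2 on squared volume:", model.score(X, y ** 2))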

SESSION 4. Data Models and Data Integration, Database Management. Small conference hall. Chair: Viktor Zakharov

Sergey Stupnikov. Verification of Data Integration in a Binary Star Database.

Abstract The growth in the number of data sources in science and industry, whose data structures differ widely, are defined using different data models, and are implemented in different DBMSs, creates a need for data integration systems. Specialized data integration systems are built in various subject domains, for example, astronomy, land-use management, and materials science. The complexity of the data integration programs implemented in such systems makes formal verification of their correctness necessary. The paper considers an approach to verifying the correctness of data integration in the binary star database of the Institute of Astronomy of the Russian Academy of Sciences. Data from stellar catalogs are integrated by programs written in an imperative programming language; a relational DBMS serves as the target database. The verification approach is based on defining the semantics of the data structures and data integration programs in a formal specification language and then proving the correctness of the data integration using automated proof tools.

Vladislav Sukhomlinov and Oleg Sabinin. Application of Stratified Sampling in Statistical Data Analysis to Improve Query Plan Cardinality Estimation.

Abstract This paper proposes a new approach for generating the sample needed to collect statistics and calculate the cardinality of execution plans in relational database management systems. The study examines the well-known simple random sampling method and identifies its inherent limitations when applied to modern database systems. To address these challenges, we advocate adopting stratified sampling in contexts where data segmentation based on table attributes is feasible. We propose a modified stratified sampling algorithm that reduces the required sample size for statistical data collection without compromising the accuracy of the results. Preliminary experiments using the PostgreSQL 17.4 DBMS and a 2.5 GB database confirm the effectiveness of the proposed approach on statistical-analysis samples of up to 35,000 rows, emphasizing its potential for optimizing query performance.
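
A generic stratified-sampling estimator of predicate cardinality (not the paper's modified algorithm) can be sketched in a few lines of Python; the proportional per-stratum allocation below is an assumption:

    import numpy as np

    def stratified_cardinality(values, strata, predicate, frac=0.01, seed=0):
        """Estimate |{i : predicate(values[i])}| by sampling within each stratum."""
        rng = np.random.default_rng(seed)
        est = 0.0
        for s in np.unique(strata):
            idx = np.flatnonzero(strata == s)
            k = max(1, int(frac * idx.size))                     # proportional allocation
            sample = rng.choice(idx, size=k, replace=False)
            est += predicate(values[sample]).mean() * idx.size   # per-stratum selectivity
        return est

    rng = np.random.default_rng(42)
    n = 1_000_000
    strata = rng.integers(0, 10, n)              # e.g. a region column used for segmentation
    values = rng.normal(loc=strata, scale=1.0)   # value distribution differs per stratum
    print(stratified_cardinality(values, strata, lambda v: v > 8.0))
    print(np.sum(values > 8.0))                  # true cardinality, for comparison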

[SHORT] Irina Karabulatova, Stepan Vinnikov, Anatoly Nardid and Yuriy Gapanyuk. The metagraph transformation algorithm based on incidence and nesting representation.

Abstract The article is devoted to a metagraph transformation algorithm. Graph models, including complex graph models, are actively used to describe the structures of complex systems. The relevance of the research lies in the fact that the article considers an algorithm for transforming structures of large complex systems represented in the form of metagraphs. The metagraph model and its multipartite and matrix representations are discussed. The proposed approach is based on representing a metagraph data model as a combination of incidence and nesting matrices. Operations on a metagraph can be represented as combinations of operations on incidence and nesting matrices. The metagraph transformation algorithm is presented. An example of the algorithm's application is explained in detail, and the interpretations of the output matrices are given. The expected impact is that the proposed algorithm is based on the matrix representation of the metagraph on the one hand, and on the event sourcing architectural pattern on the other, which allows the algorithm to be used in data intensive domains.

SESSION 5. Information security II. Lecture room №1. Chair: Evgeny Pavlenko

Nikolai Kalinin and Nikolay Skvortsov. Towards Unified Ontology of Cloud Assets for Multi-Cloud Security.

Abstract Multi-cloud architectures have become ubiquitous as organizations leverage services from multiple cloud providers, but this trend has introduced new security challenges in consistency and oversight. Cloud misconfigurations and sophisticated threats are rising in tandem with accelerated cloud adoption, resulting in frequent security incidents and data breaches. This paper presents an OWL-based unified ontology of cloud assets, constructed by analyzing Terraform provider schemas from AWS, Google Cloud, Azure, Yandex Cloud, and Cloud.ru. The ontology provides a formal, provider-agnostic framework to integrate and reason about cloud infrastructure across heterogeneous environments. By unifying cloud asset definitions, our approach enables the automated construction of a comprehensive multi-cloud asset knowledge base and supports universal Cloud Security Posture Management (CSPM) policies that can detect and prevent misconfigurations consistently across different clouds. Furthermore, security analysts can formulate and analyze attack chains using the ontology’s relationships and logical constraints, allowing reasoning engines to infer potential threat paths and misconfiguration exploits. The proposed ontology-driven framework aims to enhance cloud security monitoring and incident analysis in hybrid and multi-cloud deployments, ultimately helping organizations and Managed Security Service Providers (MSSPs) improve their security posture through formal knowledge representation and automated reasoning.

Pavel Yugai, Evgeny Zubkov, Dmitriy Moskvin and Denis Ivanov. Robustness of machine learning models in network threat defense systems in the context of adversarial attacks.

Abstract Machine learning (ML) algorithms are utilized across various domains of information technology. In network security, ML models are employed for detecting information security incidents and responding to diverse threats. Classical ML algorithms or their modifications are implemented in various network security products, depending on their specific implementations and purposes. There exist adversarial attacks targeting ML models, designed to manipulate input data in such a way that the output of the ML model becomes erroneous. Different methods and metrics for assessing the robustness of ML models against adversarial attacks are applied in various domains. This paper examines adversarial attacks that are employed against widely used ML models for detecting network attacks. It discusses both classical and specialized metrics for evaluating the resilience of ML models to adversarial attacks. An analysis of the applicability of various metrics for assessing the robustness of machine learning models against the aforementioned adversarial attacks is conducted.

Nikolay N. Shenets, Elena B. Aleksandrova and Artem S. Konoplev. Secure Data Processing Approach for Ad-Hoc Networks Based on Blockchain and Secret Sharing.

Abstract In this paper, we propose a new secure communication scheme for Ad-Hoc networks such as MANET/FANET based on blockchain technology and secret sharing. First, we review the usage of both secret sharing and blockchain for different protection purposes. Next, we describe our approach for securing Ad-Hoc networks and show its advantages. Namely, we use secret sharing as a base for authentication and key predistribution, and mutable permissioned blockchain for the identification of nodes instead of Public Key Infrastructure.

SESSION 6. Mathematical Models. Conference Hall «Kapitsa». Chair: Egor Khitrov

Dmitry Ignatov. Computation of the Dedekind-MacNeille completion of the Bruhat order for the Weyl group of type B_n.

Abstract This paper presents the computation of the Dedekind-MacNeille completion for the Bruhat order of the Weyl group B6 using concept lattices (also known as Galois lattices). We extend the associated sequence A378072 in the OEIS by providing the value for n = 6, which equals 142343254. Our approach leverages Formal Concept Analysis and the NextClosure algorithm to efficiently compute the completion. We also present additional results for other Weyl groups (the infinite families An and Dn up to computationally reasonable n, and the exceptional groups G2, F4, E6) and analyze lattice properties including the number of chains and (maximal) antichains, as well as lattice height, width, and breadth.
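
The completion itself can be illustrated independently of Weyl groups: the Dedekind-MacNeille completion of a poset P is the concept lattice of the context (P, P, ≤), i.e. the family of cuts A = lower(upper(A)). A brute-force Python sketch for a tiny poset (nothing like the scale of B6, where NextClosure is needed):

    from itertools import combinations

    def dedekind_macneille(elements, leq):
        """All cuts A = lower(upper(A)); exponential brute force, tiny posets only."""
        upper = lambda A: {x for x in elements if all(leq(a, x) for a in A)}
        lower = lambda B: {x for x in elements if all(leq(x, b) for b in B)}
        return {frozenset(lower(upper(set(A))))
                for r in range(len(elements) + 1)
                for A in combinations(elements, r)}

    # the four-element "butterfly" poset: a, b below c, d
    below = {("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")}
    leq = lambda x, y: x == y or (x, y) in below
    cuts = dedekind_macneille("abcd", leq)
    print(len(cuts))  # 7: the 4 original elements plus a new bottom, middle, and top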

Aleksey Buzmakov, Sophia Kulikova and Vladimir Parkhomenko. On Computing Robustness and Stability of Formal Concepts Based on the Delta-Measure.

Abstract The paper considers the robustness of formal concepts, from which stability and its fast approximation in the form of the delta-measure follow. A computationally efficient way of computing robustness based on subcontexts is presented as a proposition. The main result of the work is a method for analytically estimating the number of preimages of concepts. Based on this method, an optimization of the algorithm for delta-stable concepts is proposed. The reasoning is illustrated with an example.

Aleksandr Sidnev, Igor' Malyshev and Vladimir Tsygan. Queueing network models for performance evaluation and optimization of job-shop systems.

Abstract The paper presents a modern approach to using closed queueing network models to evaluate and optimize job-shop systems. The approach uses a two-moment method for performance measures at individual nodes. This method is embedded in an iterative calculation procedure for an open network equivalent to the original closed network. The method results in algorithms for calculating and optimizing general multi-class closed queueing networks. Numerical studies comparing the performance of the approach with simulations suggest that the approach yields fairly accurate estimates of performance measures.
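
For orientation, the classic exact Mean Value Analysis recursion for a single-class closed product-form network is sketched below in Python; the paper's contribution concerns general multi-class networks via a two-moment approximation, which this toy sketch does not implement:

    def mva(service_times, visit_ratios, n_jobs):
        """Exact Mean Value Analysis for a single-class closed product-form network."""
        K = len(service_times)
        Q = [0.0] * K                                              # mean queue lengths
        X = 0.0
        for n in range(1, n_jobs + 1):
            R = [service_times[k] * (1 + Q[k]) for k in range(K)]  # residence time per visit
            X = n / sum(visit_ratios[k] * R[k] for k in range(K))  # system throughput
            Q = [X * visit_ratios[k] * R[k] for k in range(K)]     # Little's law per node
        return X, Q

    X, Q = mva(service_times=[0.02, 0.05, 0.10], visit_ratios=[1.0, 0.6, 0.4], n_jobs=20)
    print(f"throughput = {X:.2f}, mean queue lengths = {[round(q, 2) for q in Q]}")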

SESSION 7. Research support in data infrastructures. Small conference hall. Chair: Egor Khitrov

Anton Khritankov. Towards a Technology Platform for AI Applications with MLOps and Hybrid Computing.

Abstract The rapid advancement of artificial intelligence (AI) technologies has necessitated the development of robust frameworks that facilitate the efficient creation and deployment of AI applications. This paper proposes a specific variant of the MLOps (Machine Learning Operations) process tailored for the full-cycle development of AI applications, addressing the unique challenges that arise during implementation. We identify and discuss several critical tasks associated with this process, including pipeline implementation, machine learning application verification, and long-term modeling of continuous learning systems, which are essential for ensuring the efficiency and effectiveness of MLOps implementations. Additionally, we describe a hybrid cloud computing platform designed to automate these MLOps processes, enhancing scalability and flexibility in AI application development. This platform integrates on-premises and cloud resources, facilitating seamless collaboration and resource allocation. By providing a structured approach to MLOps, this work contributes to the advancement of methodologies in AI development and offers a practical framework for organizations seeking to optimize their AI initiatives and accelerate time-to-market for innovative solutions.

[SHORT] Victor Dudarev, Nadezhda Kiselyova and Alfred Ludwig. Information Support for Distributed Research on Thin-Film Materials: From Synthesis to Process Management.

Abstract The MatInf research data management system (RDMS) has been developed to support research teams working with the large data volumes generated by high-throughput experiments in inorganic materials science. The MatInf architecture fully supports user-defined data types specified dynamically after system deployment. This flexibility is achieved through late binding of types to external web services that implement data validation, extraction, and visualization. Examples of using the RDMS to store and analyze experimental data on thin-film materials are presented. The key distinction of MatInf is the absence of freely available alternatives that simultaneously support typed representation of materials science data, an extensible system of user-defined types, integration with arbitrary research document formats without modifying the system core, and object connectivity via a directed multigraph.

[SHORT] Alexander Elizarov, Evgeny Lipachev and Olga Nevzorova. Towards a Research Infrastructure of Mathematical Knowledge.

Abstract We propose an approach to creating a research infrastructure for managing mathematical knowledge. The research infrastructure is presented as a system of interconnected semantic mathematical artifacts developed for different domains of mathematical knowledge. The formation of mathematical artifacts is based on the software tools of the OntoMath digital ecosystem that we have already developed. When creating mathematical artifacts, we were guided by the FAIR principles and recommended practices for their application. We highlight the main mathematical artifacts of the research infrastructure: an ontology of professional mathematics; ontologies of mathematical theorems and statements, of mathematical problems, of methods for solving mathematical problems, and of algorithms and programs; a knowledge graph of mathematical formulas; a knowledge graph representing the organizational structure of the mathematical space, including descriptions of scientific groups, individuals, and research topics presented in mathematical journals; and an ontological model for representing mathematical knowledge as a system of interconnected specialized ontologies.

Vladimir Korenkov, Irina Filozova, Galina Shestakova, Andrey Kondratyev, Aleksey Bondyakov, Tatiana Zaikina, Irina Nekrasova and Yanina Popova. Automation of Scientific Publications Management in the JINR Digital Repository.

Abstract In the context of growing volumes of scientific publications and the increasing number of digital repositories, effective management of research outcomes has become an increasingly complex challenge. This paper presents a modular system for automating the management of scientific publications, integrated into the Joint Institute for Nuclear Research (JINR) digital repository based on the DSpace software platform. The system enables automated harvesting of publication metadata and full texts from external sources, verification of authorship records, duplicate elimination, and data normalization — significantly enhancing the accuracy and completeness of repository information. The repository's functionality is extended with data visualization: interactive histograms — among the most common and intuitive visualization types for such systems — have been implemented. This feature, developed using D3.js, enhances the repository's analytical capabilities.
The proposed architecture is characterized by flexibility, scalability, and the ability to integrate into existing infrastructures of research organizations, opening prospects for its adoption in universities, research centers, and national libraries.
The development is carried out at the Laboratory of Information Technologies (LIT) of the Joint Institute for Nuclear Research (JINR).

Nikolay Skvortsov. Supporting Semantic Search for Resources within the Problem-Solving Life Cycle.

Abstract Reuse of heterogeneous scientific data and methods usually requires significant effort to ensure their integration and semantic interoperability. The article proposes an approach to semantic resource search using domain ontologies. Resources, including data sources and method implementations, are registered in research infrastructures to enable their reuse for solving problems in various domains. In this way, scientific data collections and toolsets are created for collaborative research and for the continuity of scientific results. Formal domain specifications in an ontological model are used to link semantically meaningful metadata to data and methods. Using such metadata and logical inference over it, data and methods are classified within the domain to enable the search for resources relevant to the problems being solved. At different stages of the research problem-solving life cycle, this provides discovery of data sources and methods, their semantic integration, and their correct joint operation.

SESSION 8. Information security III. Lecture room №1. Chair: Maria Poltavtseva

Nikita Gribkov and Maxim Kalinin. Evaluating the security of big data infrastructure using intelligent analysis of its code base.

Abstract The paper analyses methods of security assessment for typical components of big data infrastructure. Based on the results of the analysis, we propose an evaluation method built on comparative analysis of the code base of the investigated components against sets of known potentially dangerous code fragments. To increase the universality of the method, the possibility of analyzing components without source code has been studied. The method is comprehensive: fragments are analyzed at several levels of abstraction, namely binary code, assembly code and its graph representations, and recovered code and its graph representations. The method allows labelling potentially dangerous code fragments, including those without syntactic samples, in components of big data infrastructure and assessing the infrastructure's security level based on the collected statistics.

Evgeny Zubkov and Dmitry Zegzhda. Assessment of the sustainability of the cyber-physical system based on historical data.

Abstract The study examines the sustainability issues of cyber-physical systems from the perspective of information security vulnerabilities in software and hardware components. An overview is provided of methods for assessing and ensuring the stability of such systems. A model for evaluating stability is proposed, based on a continuous-time Markov process, using historical data on software versioning and vulnerabilities.
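
A minimal sketch of the modelling step, assuming a toy three-state continuous-time Markov chain whose rates would in practice be fitted from the version and vulnerability history; the states and rates below are invented for illustration:

    import numpy as np

    # Toy 3-state CTMC: 0 = healthy, 1 = vulnerable version deployed, 2 = compromised.
    Q = np.array([
        [-0.10,  0.10,  0.00],   # a new release may introduce a vulnerability
        [ 0.05, -0.25,  0.20],   # patching vs. exploitation
        [ 0.50,  0.00, -0.50],   # recovery
    ])
    # stationary distribution: pi @ Q = 0 with sum(pi) = 1
    A = np.vstack([Q.T, np.ones(3)])
    b = np.array([0.0, 0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print("long-run share of time compromised:", pi[2])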

[SHORT] Nikita Gololobov, Evgeniy Pavlenko and Darya Lavrova. Model of software functioning based on Bayesian networks.

Abstract This article presents an innovative model of software functionality based on Bayesian networks, developed to address critical cybersecurity issues related to code reachability assessment and vulnerability exploitation prediction. The proposed model overcomes the limitations of traditional software analysis methods, which generate an excessive number of false positives due to the lack of context regarding the actual reachability of vulnerable components.

[SHORT] Evgenia Novikova and Igor Kotenko. Towards assessment of the trustworthiness of the AI models: application to cyber security tasks.

Abstract Currently, there is an active discussion on the usage of AI-based systems in many areas of the national economy such as finance, industry, medicine, education, etc. The key issue in the practical application of AI models is evaluating their trustworthiness. The paper discusses the notion of trustworthiness of an AI model and its main characteristics. The authors demonstrate that there is currently no unified approach to evaluating the robustness of an AI model, nor an agreed set of metrics to assess it. To fill this gap, the authors propose a formal description of the AI model evaluation process based on ontological modelling. The proposed ontology describes key components of the evaluation methodology, including requirements defined for the given subject domain and analytical task, a set of evaluation metrics, and their calculation algorithms. The application of the developed ontology is demonstrated by evaluating the trustworthiness of AI models developed for cyber security tasks.

SESSION 9. Machine Learning Applications. Conference Hall «Kapitsa». Chair: Dmitry Kovalev

Nadezhda Kiselyova, Victor Dudarev, Oleg Senko, Alexander Dokukin and Andrey Stolyarenko. Application of Machine Learning to Predict Crystal Lattice Parameters of ThCr2Si2 Type Crystal Structure Compounds.

Abstract A comparison of the efficiency of various machine learning methods in predicting the qualitative and quantitative properties of inorganic compounds was carried out. In predicting qualitative properties, the most accurate programs, according to the results of cross-validation, were those based on training a neural network with the backpropagation method (average accuracy 91%), the support vector machine method (91.6%), and the k-nearest neighbors method (92.9%). The high average accuracy (97.6%) of the examination assessment in the cross-validation mode indicates the effectiveness of using ensembles of algorithms. Using the selected programs, new compounds of the composition AD2X2 (A and D are various elements here and below; X is B, Al, Si, P, Ga, Ge, As, Sn or Sb) with crystal structures of the ThCr2Si2, FeMo2B2, CaAl2Si2, CaBe2Ge2 and CoSc2Si2 types under ambient conditions were predicted. For the predicted compounds with the structures ThCr2Si2, FeMo2B2, CaAl2Si2 and CaBe2Ge2, the crystal lattice parameters were estimated. In solving these problems, the most accurate results according to LOOCV (Leave-One-Out Cross-Validation) were obtained using programs from the scikit-learn package: svm.NuSVR, svm.SVR, Random Forest, Gradient Boosting Regressor, ARD Regression, Extra Trees Regressor, Orthogonal Matching Pursuit and Bayesian Ridge, as well as DivenBoost, a program specially developed for predicting quantitative properties. MAE (Mean Absolute Error) was in the range of 0.014-0.155 Å. The multiple determination coefficient R2 was in the range of 0.836-0.990. To predict compounds not yet obtained and to estimate unknown values of lattice parameters, only the values of the properties of elements A, D and X were used.
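
A schematic of the LOOCV protocol with one of the named scikit-learn regressors; the data below are a synthetic stand-in for the element-property descriptors used in the paper:

    import numpy as np
    from sklearn.metrics import mean_absolute_error, r2_score
    from sklearn.model_selection import LeaveOneOut, cross_val_predict
    from sklearn.svm import NuSVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 8))        # hypothetical element-property descriptors
    y = X @ rng.normal(size=8) + 4.0 + rng.normal(scale=0.05, size=120)  # lattice parameter, Å

    pred = cross_val_predict(NuSVR(C=10.0), X, y, cv=LeaveOneOut())
    print(f"LOOCV: MAE = {mean_absolute_error(y, pred):.3f} Å, R2 = {r2_score(y, pred):.3f}")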

Andrew Soroka and Alex Meshcheryakov. JAMPR+/L2D: scalable neural heuristic for constrained vehicle routing problems in dynamic environment.

Abstract Vehicle routing problems with real-world constraints (we consider vehicle capacity limits, time-window constraints, and pickup-and-delivery with multiple depots — CPDPTW) pose significant computational challenges. While classical exact and heuristic methods remain effective for problems of small or medium size (N ≲ 100), they often lack adaptability and scalability for larger logistics tasks. In this work, we show how the JAMPR+/L2D deep reinforcement learning model, previously proposed to solve large CPDPTW problems, can be adapted to substantial changes of the graph distance matrix. We test the performance of the JAMPR+/L2D model on medium-sized CVRP and VRPTW problems from the CVRPLIB benchmarks: JAMPR+/L2D outperforms the state-of-the-art heuristic HGS in over 85% of instances, achieving an improvement in the objective gap. We show that the JAMPR+/L2D model trained on the CPDPTW problem generalizes well to tasks with simpler constraints (CVRP, VRPTW), to different problem sizes, and to moderate changes in distance matrices. For more substantial changes in distance matrices, we propose fast fine-tuning of JAMPR+: on ORTEC data (for CPDPTW), the proposed strategy markedly reduces the objective gap without full model retraining, giving both accuracy and rapid inference in practical routing scenarios with distance matrix changes.

Aleksandr Osipov and Ramon Antonio Rodriges Zalipynis. Accelerated Wildfire Simulations via Caching Techniques.

Abstract Thousands of wildfires occur daily, posing serious threats to the global environment. Therefore, accurate wildfire simulations are crucial to combat, mitigate, and prevent wildfires effectively. However, simulating a large number of concurrent wildfires requires significant computational resources. The novel idea presented in this work is to accelerate cellular automata wildfire simulations by sharing and/or reusing calculations, thereby reducing their number. We develop this idea by proposing and implementing a set of new techniques: precise and imprecise caching, as well as fuzzy approximation. This work is pioneering in terms of designing and exploring caching techniques for the aforementioned scenarios. We use Simpson, Jaccard, and Sneath metrics for accuracy evaluations. We assess computational efficiency by thorough theoretical algorithm analysis and profiling. Importantly, all the approaches significantly speed up the simulations without even modifying the wildfire simulation model, keeping it intact. Precise caching improved the computation speed by 20% without any accuracy degradation. Imprecise caching yielded similar performance gains, but with reduced accuracy (Jaccard: 0.86, Sneath: 0.76). Further, fuzzy logic reduced the runtime by 44% but exhibited lower accuracy (Jaccard: 0.86, Sneath: 0.65, depending on fire characteristics). The presented approaches may enable simulating more wildfires in a fraction of the time or require fewer computational resources.
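
Precise caching can be illustrated with a toy cellular automaton in Python: because identical neighbourhood states always produce identical transitions, the transition function can simply be memoised. The fire model below is a deliberately crude stand-in, not the simulation model used in the paper:

    from functools import lru_cache

    @lru_cache(maxsize=None)                       # precise caching of the transition function
    def next_cell_state(neighborhood):             # a 3x3 tuple of (fuel, burning) cell states
        fuel, burning = neighborhood[4]            # centre cell
        if burning:
            return (max(fuel - 1, 0), fuel > 1)    # keep burning while fuel remains
        ignite = fuel > 0 and any(b for _, b in neighborhood)
        return (fuel, ignite)

    def step(grid):
        h, w = len(grid), len(grid[0])
        def nbhd(i, j):
            return tuple(grid[(i + di) % h][(j + dj) % w]
                         for di in (-1, 0, 1) for dj in (-1, 0, 1))
        return [[next_cell_state(nbhd(i, j)) for j in range(w)] for i in range(h)]

    grid = [[(3, False)] * 16 for _ in range(16)]
    grid[8][8] = (3, True)                         # ignition point
    for _ in range(10):
        grid = step(grid)
    print(next_cell_state.cache_info())            # cache hits = transitions shared/reused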

[SHORT] Aleksei Samarin, Aleksei Toropov, Alexander Savelev, Egor Kotenko, Anastasia Mamaeva, Artem Nazarenko, Alexander Motyko, Elena Mikhailova and Valentin Malykh. AI-Driven Automatic Proctoring System for Secure Online Exams.

Abstract This study presents a microservice-based AI-driven proctoring system for secure and scalable online exam monitoring. The proposed system integrates deep learning and computer vision techniques to analyze multimodal data, including video streams, audio signals, and metadata, to detect dishonest behaviors such as unauthorized assistance, device usage, and unusual gaze patterns. The system architecture ensures seamless integration with online learning platforms, providing a modular and adaptive approach to remote exam supervision. Key components include real-time facial recognition, eye-tracking, head pose estimation, and audio anomaly detection. Explainable AI techniques enhance transparency, allowing educators to interpret decisions and minimize false positives. Experimental evaluation on a controlled dataset demonstrated high detection accuracy and efficiency, validating the system’s applicability for automated proctoring. The microservice structure allows for flexible deployment, making it suitable for large-scale educational environments. This work is mainly devoted to the description of the analytical core of the system. Future improvements will focus on refining detection models, reducing bias, and addressing ethical considerations concerning student privacy. This research contributes to advancing AI-powered academic integrity solutions, offering a practical and scalable alternative to traditional proctoring methods.

THURSDAY, OCTOBER 30th

SESSION 10. Experimental Data Analysis. Small conference hall. Chair: Nikita Voinov

Anna Provorova, Kristina Lykova, Sophia Kulikova, Daria Semenova and Julia Zaripova. “Dish I Wish”: an app for studying children's eating behavior.

Abstract This study presents a specialized web application, Dish I Wish, designed to collect data on dietary preferences among children aged 4 to 14 through the use of gamification techniques. In response to the limitations inherent in traditional paper-based questionnaires, the application features an intuitive, interactive interface that enables children to simulate meal planning by selecting dishes and adjusting portion sizes. An experimental design is employed to align individual children's preferences with perceived family habits, thereby enhancing the depth and contextual relevance of the data collected. Developed using the MERN technology stack and leveraging MongoDB for flexible data storage, the system emphasizes scalability and automation to reduce the potential for human error in data collection.

Vladimir Parkhomenko, Anastasiya Ivanova, Ivan Eroshin, Pavel Drobintsev and Alexander Schukin. User Interface Evaluation Using Eye and Facial Expression Tracking.

Abstract User interface (UI) evaluation is useful for building software that is convenient for people. We describe two WebGazer-based systems that serve as testbeds for conducting field studies and for visual presentation of statistics on their results. The systems' source code is openly available on GitHub. They were used in field experiments in May 2025 with more than 20 participants. The first experiment is devoted to evaluating marketplaces using heat maps; indicators and a methodology were developed. Ranking by different indicators showed the relative equal importance of the main and product pages of marketplaces similar to Ozon and Wildberries. The second experiment is devoted to estimating the individual treatment effect (ITE) using eye and facial expression tracking, together with a developed methodology and set of indicators, including a new distraction (anxiety) indicator. Switching between the Chrome and Firefox web browsers is considered as the treatment. Despite users' preference for Chrome in the survey, there is no statistically significant difference between the browsers on average based on gaze and emotion monitoring. Experiments on synthetic data were conducted to assess the sensitivity of ITE estimation meta-algorithms; under strong data heterogeneity, the X-learner with Causal Forest demonstrates the most stable and significant ITE estimates.

[SHORT] Georgiy Frolov and Vladimir Parkhomenko. Research and experimental analysis of Big Data tools: an application for Twitch streaming platform.

Abstract The article presents a comparative analysis of popular Big Data tools in the batch (Spark, Hive on Tez, MapReduce) and streaming (Spark Streaming, Apache Flink) settings. We evaluate the time efficiency of the tools: execution time and latency, respectively. Apache Flink and Spark Streaming demonstrate decent results, and both can be considered relevant tools for stream processing. However, Apache Flink demonstrates more stable and substantially lower absolute latency than Spark Streaming. We also conclude that Spark outperforms the other compared tools in time efficiency in the batch case. Hive on Tez shows a small lag in batch processing efficiency compared to Spark and, owing to the simplicity of the HiveQL syntax, is recognized as an effective and relevant tool. Hadoop MapReduce, by contrast, demonstrates rather poor results and is not recommended for use, given the faster and more convenient alternatives listed above. All considerations are illustrated with applications on the Twitch streaming platform; data generation for stream processing is automated, and the scripts and benchmark results are published in an open GitHub repository.

[SHORT] Dmitry Nikitenko and Artem Bulekov. An approach to intuitive visual analysis of ratings.

Abstract This paper aims to illustrate how the nature of the object under study can be used to select methods and ways of visual analysis. The objects under consideration are various ratings, i.e. time-varying lists of objects ranked by the values of some fixed parameters. The key idea is to outline the set of most important parameters and to visualize the metrics under study always together with this set of parameters. As an example, we consider ratings of supercomputer systems: the Top500 rating of the world's most powerful computing systems and the Top50 rating of Russian supercomputers, within which the proposed analysis tool was implemented. For this, we choose rating editions, system position, and HPL performance as the key set of parameters, which together form a comprehensive view of all entries in the history of a rating, allowing one to intuitively grasp the significance of the studied metric values.

[SHORT] Mikhail Lebedev, Vladimir Parkhomenko, Roman Zuev and Alexander Schukin. Container Route Aggregation Using Big Data.

Abstract One of the main problems at a container terminal is route aggregation, which consists in minimizing unjustified gaps, i.e. the loss of route data due to system errors or human factors. We solve this problem using Big Data tools, data integration, and other techniques. A corporate ClickHouse-based database is integrated with a Postgres database of railway waybills. We then form routes from the available data, removing unnecessary routes and adding initial and final operations (stations). The results are saved to Yandex Cloud and a Parquet file. The user is provided with a table aggregated by selected parameters. The developed software module operates at a container terminal in Saint Petersburg and covers container movements across Russia. It reduces the number of unjustified gaps by approximately a factor of 3.

SESSION 11. Image Analysis I. Lecture room №1. Chair: Dmitry Ignatov

Aleksei Samarin, Alexander Savelev, Aleksei Toropov, Anastasia Mamaeva, Egor Kotenko, Aleksandra Dozortseva, Artem Nazarenko, Alexander Motyko, Elena Mikhailova and Valentin Malykh. Enhancing Microorganism Classification with Vision-Language Large Models Generated Synthetic Microscopy Images.

Abstract The scarcity of annotated microscopy datasets remains a major obstacle to training robust deep learning models for microorganism classification. This study proposes a novel data augmentation pipeline that leverages Vision-Language Large Models (VLLMs) to generate synthetic microscopic images across six distinct bacterial and non-bacterial classes. These synthetic samples were progressively integrated into the training dataset in controlled proportions to systematically assess their impact on model performance. Quantitative evaluations reveal that incorporating synthetic data up to 11% significantly enhances classification accuracy, with the best-performing configuration achieving a Precision of 0.91, a Recall of 0.89, and an F1-score of 0.90. However, performance begins to decline beyond this saturation threshold, suggesting that excessive synthetic augmentation may introduce distributional noise or overfitting. Our findings highlight the potential of VLLM-based synthetic data generation as a scalable solution to address class imbalance and data scarcity in microbial image analysis tasks.

Daria Imaykina and Konstantin Turalchuk. Developing a Model for Classifying MRI Images by Alzheimer's Disease Stage Using Interpretable Machine Learning Methods.

Abstract This paper proposes an approach to the automated classification of brain MRI images by Alzheimer's disease stage using interpretable machine learning methods. The study includes a comparative analysis of neural network architectures, resulting in an efficient CNN model that achieves 95% accuracy on the test set. Particular attention is paid to interpreting the model's decisions using Grad-CAM and LIME, which revealed the key brain regions influencing classification. It is shown that for correct classifications the model demonstrates clear localization of significant features, while errors are associated with diffuse activations. Testing the robustness of the interpretation methods together with the model against image distortions confirmed its reliability, except for horizontal flips, due to the anatomical asymmetry of the brain. The practical significance of the work lies in a prototype client-server system intended to automate the diagnosis of neurodegenerative diseases. The results open up prospects for introducing the developed methods into clinical practice to improve the accuracy and speed of diagnosis.

Aleksei Samarin, Alexander Savelev, Aleksei Toropov, Anastasia Mamaeva, Egor Kotenko, Artem Nazarenko, Alexander Motyko, Elena Mikhailova, Valentin Malykh and Svetlana Babina. Infrared Imaging-Enhanced Automated Wildlife Detection in Nature Reserves.

Abstract This study presents a refined approach to automated wildlife detection in natural environments by integrating a neural network architecture with a dual-stream attention mechanism and a newly introduced infrared-based pre-classification stage. The method addresses a key challenge in ecological monitoring: the need for scalable and accurate tools to assess wildlife populations and support biodiversity conservation. A preliminary classification module evaluates color and thermal patterns in the infrared spectrum to enhance detection reliability before the core detection process. This step filters background noise and highlights regions likely to contain living organisms, enabling more focused and accurate downstream analysis. The main detection system is based on a dual-attention neural architecture that emphasizes semantically important regions within camera trap images while minimizing interference from visually complex natural scenes. This is particularly important for field-acquired data, which often contains partial occlusions, inconsistent lighting, and background clutter. Performance evaluation was conducted using the Wildlife Insights dataset, a large and diverse collection of camera trap images from various ecological regions. Experimental results demonstrate that the proposed approach offers higher accuracy and robustness compared to traditional models, particularly in visually challenging scenarios. By combining infrared color-based classification with an attention-augmented detection pipeline, this method significantly enhances the effectiveness of automated wildlife monitoring. The findings highlight the system’s applicability for ecological research, conservation planning, and long-term wildlife population studies.

[SHORT] Pavel Arkhipov, Sergey Philippskih and Maxim Tsukanov. Development of a Specialized Method for Contextual Augmentation of Small Objects.

Abstract The article analyzes existing data augmentation methods and identifies their shortcomings when working with UAV imagery. A new specialized method for contextual augmentation of small objects is proposed that preserves spatial realism and can provide the required density of small objects during neural network training. Given the complexity of implementing the proposed method, the development process was divided into several stages. For empirical validation, the SSD MobileNet V2 FPNLite 320x320 model from the TensorFlow 2 Detection Model Zoo repository was chosen as the baseline detector. This model was trained on the VisDrone dataset using standard data augmentation methods (SSD MobileNet V2 FPNLite 320x320) and using the developed method (SSD MobileNet V2 FPNLite 320x320 CSOA). The detection results of the baseline SSD MobileNet V2 FPNLite 320x320 model were compared with those of the SSD MobileNet V2 FPNLite 320x320 CSOA model according to the COCO Evaluation metrics protocol. Experiments showed that context-aware augmentation significantly improves the ability of neural network models to detect small objects. This result is of great practical importance for compact, energy-efficient systems with limited performance and memory.

SESSION 12. Large Language Models and Applications I. Conference Hall «Kapitsa». Chair: Elena Bolshakova

Daria Potapova and Natalia Loukachevitch. Methods for Testing Generative Models That Use Information Retrieval.

Abstract Retrieval-Augmented Generation (RAG), in which a retrieval module extracts documents relevant to a user query and a generative model forms an answer by combining the retrieved context with its internal knowledge, helps overcome such limitations of large language models as outdated knowledge and the inability to work with dynamically changing data. The effectiveness of such systems depends directly on retrieval quality, context structure, and instruction wording; however, comprehensive evaluation of this approach is currently difficult, not only because of the small number of suitable datasets and differing system requirements, but also because of the lack of standard metrics for measuring how well an answer corresponds to the provided context. The paper presents a comparative study of six language models (DeepSeek, Llama, Mistral, Qwen, RuAdapt, YandexGPT) on a question answering task using the RAG approach. Experiments were performed on four Russian-language datasets (XQuAD, TyDi QA, RuBQ, and SberQuAD) converted to a unified format suitable for the task. Various strategies for adding and ordering fragments in the context supplied to the model were considered, and evaluation with the Context Relevance, Utilisation, Completeness, Adherence, and Exact Matching metrics revealed the limitations of generative models in extracting information from relevant context.

Grigory Kovalev and Mikhail Tikhomirov. Iterative Layer-wise Distillation for Efficient Compression of Large Language Models.

Abstract This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach has been developed, building upon the idea of incorporating iterative evaluation of layer importance. At each step, importance is assessed by measuring performance degradation when individual layers are removed, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle transformer layers contribute less to inference, underscoring the potential of the proposed method for creating efficient models. The results demonstrate the effectiveness of iterative distillation and fine-tuning, making the approach suitable for deployment in resource-limited settings.
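
A sketch of the joint loss described in the abstract, written in PyTorch; the temperature and weighting are illustrative assumptions:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden,
                          T=2.0, alpha=0.5):
        # KL divergence between temperature-softened token distributions
        kl = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # MSE between hidden states of the layers kept in the student
        mse = F.mse_loss(student_hidden, teacher_hidden)
        return alpha * kl + (1.0 - alpha) * mse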

Dmitry Peniasov and Konstantin Turalchuk. Few-Shot Prompting Strategies for Dialogue Summarization with LLMs.

Abstract Dialogue summarization is a challenging subfield of natural language processing that requires handling fragmented, multi-speaker discourse, topic shifts, and long-range dependencies inherent in conversational data. Despite progress in text summarization, most state-of-the-art models are trained on monological corpora such as news articles, which differ significantly from dialogues in both structure and content. As a result, these models often exhibit poor transferability to dialogue summarization tasks. The scarcity of high-quality, annotated dialogue datasets further complicates effective model adaptation. In this work, we explore an alternative data-centric approach by leveraging large language models (LLMs) as synthetic annotators within a few-shot learning framework. We propose a retrieval-augmented method for in-context example selection that prioritizes semantic similarity to the input dialogue. Through a series of controlled experiments, we evaluate the impact of demonstration quality and selection strategy on summarization performance. Our findings suggest that carefully curated few-shot prompts can substantially enhance the reliability of LLM-generated dialogue summaries and reduce reliance on costly manual annotation.
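
A compact Python sketch of the retrieval-augmented demonstration selection (our reading of the method; the off-the-shelf sentence encoder is an assumed choice):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder choice

    def select_demonstrations(dialogue, pool, k=3):
        """Pick the k (dialogue, summary) pairs most semantically similar to the input."""
        emb = encoder.encode([dialogue] + [d for d, _ in pool], normalize_embeddings=True)
        sims = emb[1:] @ emb[0]                          # cosine similarity to the input
        return [pool[i] for i in np.argsort(-sims)[:k]]

    def build_prompt(dialogue, demos):
        shots = "\n\n".join(f"Dialogue:\n{d}\nSummary:\n{s}" for d, s in demos)
        return f"{shots}\n\nDialogue:\n{dialogue}\nSummary:"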

[SHORT] Andrei Borisov and Konstantin Turalchuk. Contextual Support for Deep Reading of Russian Classical Literature Based on the Retrieval-Augmented Generation Method.

Abstract In response to the growing interest in classical literature in Russia, a Retrieval-Augmented Generation (RAG)-based system for contextual reading support is proposed. The system generates hints and answers to user questions at any place in the text, considering both the immediate context and a curated knowledge base that includes philological studies, cultural-historical commentaries, biographical notes, etc. The Rebind.ai project offers a similar solution for Western literature but does not support the Russian classics. Moreover, Rebind.ai is costly due to the operational complexity of its chosen process for producing new books. To simplify the addition of books, the proposed system's architecture is designed for rapid expansion of the underlying knowledge base.

SESSION 13. Time Series Analysis. Small conference hall. Chair: Mikhail Zymbler

Alexander Valiulin and Mikhail Zymbler. GPU-Accelerated Matrix Profile Computing for Streaming Time Series.

Abstract Currently, the mining of streaming time series has become increasingly important in a wide range of applications. The matrix profile (MP) is considered a simple yet powerful exploratory time series mining tool, which provides the discovery of numerous time series mining primitives without requiring prior knowledge. The research community has introduced a substantial number of MP computing algorithms. However, to the best of our knowledge, no existing development is both GPU- and stream-oriented. In this article, we introduce the StreaMP (Streaming Matrix Profile) algorithm for GPUs, which overcomes the above limitation. Our algorithm processes incoming data in a segment-wise manner, accumulating the time series arriving from a sensor in RAM. StreaMP merges the MPs of the current segment and of the time series read so far through our proposed multistep schema. To compute MPs, StreaMP is able to apply any GPU-based algorithm, which encapsulates all details of parallel processing. The MP built by StreaMP makes it possible to discover repeated and anomalous patterns in the streaming time series in linear time. Experimental evaluation demonstrates that StreaMP outperforms SCAMP, which is currently considered the fastest GPU-based MP computation algorithm for batch mode.
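
StreaMP itself is not public here, but the streaming setting it targets can be illustrated on the CPU with stumpy's incremental matrix profile; this sketch is not the authors' algorithm:

```python
# Incremental matrix profile over an arriving stream (CPU, stumpy.stumpi).
import numpy as np
import stumpy

m = 50                                    # subsequence (pattern) length
history = np.random.rand(1000)            # time series accumulated so far
stream = stumpy.stumpi(history, m=m)      # initial profile

for value in np.random.rand(200):         # new points arriving from a sensor
    stream.update(value)                  # maintain the profile incrementally

# Low profile values mark repeated patterns (motifs), high values anomalies.
print("motif at", np.argmin(stream.P_), "discord at", np.argmax(stream.P_))
```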

Artem Fedorov and Archil Maysuradze. Scalable Bayesian Motif Detection for Multivariate Time Series.

Abstract We develop a scalable, interpretable, and adaptive motif discovery framework for multivariate time series. The approach combines a kernel-based, invariance-preserving embedding with variational Bayesian inference under a Dirichlet-process mixture, so the number of motifs grows naturally with data while remaining compact and explainable. The same engine operates equally well on large offline archives and real-time streams. Tests on synthetic signals, daily stock prices, and clinical EEG show higher accuracy than state-of-the-art baselines at a fraction of their computational cost: the model reliably rediscovers expert-defined patterns, maintains clarity of the learned representations, and remains stable as data volume scales. These properties make the detector a practical tool for financial, industrial, and biomedical time-series analysis.
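
The Dirichlet-process mixture component can be approximated with scikit-learn's truncated variational implementation; a rough sketch on stand-in embeddings, not the paper's kernel-based pipeline:

```python
# Variational DP mixture over windowed embeddings: the number of "active"
# components (candidate motifs) is inferred rather than fixed in advance.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))        # stand-in for invariant window embeddings

bgm = BayesianGaussianMixture(
    n_components=20,                 # truncation level, not the motif count
    weight_concentration_prior_type="dirichlet_process",
).fit(X)
print("active motif components:", int(np.sum(bgm.weights_ > 1e-2)))
```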

Dmitrii Popov and Archil Maysuradze. Optimizing Embedding Linearity for Robust Speaker Diarization in Overlapping Speech Scenarios.

Abstract Speaker diarization, the task of segmenting and identifying speakers in audio recordings, is critical for applications like automatic speech recognition, transcription, and analysis of multi-speaker recordings such as meetings and podcasts. Overlapping speech, prevalent in datasets like AMI and CALLHOME, poses a significant challenge, as it complicates accurate speaker segmentation. This work addresses this issue by investigating the linearity of biometric embeddings, a property enabling the representation of overlapping speech as a linear combination of individual speaker embeddings, which is essential for robust diarization, particularly in cascaded schemes with Target-Speaker Voice Activity Detection (TSVAD). We propose a novel fine-tuning method for the ECAPA-TDNN model to enhance embedding linearity, utilizing a synthetic dataset derived from VoxCeleb and a modified loss function combining AAM-Softmax with a linearity term. Integrated into a cascaded TSVAD-based diarization framework, our approach supports both full-context and streaming modes. Experiments on standard benchmarks (AMI, DIHARD, VoxConverse) demonstrate reduced Diarization Error Rate (DER) compared to state-of-the-art methods, highlighting improved handling of overlapping speech. The proposed method bridges a gap in optimizing embedding linearity, offering practical benefits for real-world multi-speaker scenarios.
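
A minimal sketch of a linearity term of the kind described, assuming any waveform-level embedding network; the paper fine-tunes ECAPA-TDNN and combines this with AAM-Softmax, which is omitted here:

```python
# The embedding of mixed speech should match the same convex combination
# of the individual speakers' embeddings.
import torch
import torch.nn.functional as F

def linearity_loss(embed, wav_a, wav_b, alpha=0.5):
    mix = alpha * wav_a + (1 - alpha) * wav_b            # overlapped speech
    e_mix = embed(mix)
    e_lin = alpha * embed(wav_a) + (1 - alpha) * embed(wav_b)
    return 1 - F.cosine_similarity(e_mix, e_lin, dim=-1).mean()

# total = aam_softmax_term + lambda_lin * linearity_loss(embed, wav_a, wav_b)
```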

Konstantin Izrailov, Igor Kotenko and Andrey Chechulin. Intelligent detection of voice fraud in authentication systems.

Abstract This paper addresses the pressing problem of countering voice falsification (spoofing attacks) in voice authentication systems. An intelligent method for detecting fake speech is proposed, based on combining classical acoustic features with modern machine learning models. As features, Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and other characteristics describing the individual properties of the voice signal are used. For classification, the deep neural network architecture ECAPA-TDNN (Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network) and the statistical GMM-UBM (Gaussian Mixture Model-Universal Background Model) are applied, providing high resistance to different types of attacks, including synthesized and imitated recordings. The developed system was tested on the open corpora VCTK (Voice Cloning Toolkit) and TEDLIUM (Technology, Entertainment, Design - Laboratoire d'Informatique de l'Université du Mans), where it showed strong results: classification accuracy reaches 97-99%, the equal error rate (EER) is reduced to 2-5%, and the false acceptance rate (FAR) and false rejection rate (FRR) are minimized. The scientific novelty of this work lies in the integration of diverse speech processing and analysis methods, which increases the reliability of fake detection. Its practical significance is the possibility of using the developed solution in real biometric systems, including banking services and remote service platforms. The article also discusses limitations and prospects for further development of the proposed approach.
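
The classical branch of such a detector can be sketched with librosa MFCCs scored by two Gaussian mixtures; file names are placeholders, and the neural ECAPA-TDNN branch and the real corpora are beyond this sketch:

```python
# Classical branch only: MFCC features scored by two Gaussian mixtures,
# one fit on genuine speech and one on spoofed speech (toy illustration).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, n_mfcc=20):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # frames x coeffs

genuine = np.vstack([mfcc_features(p) for p in ["real1.wav", "real2.wav"]])
spoofed = np.vstack([mfcc_features(p) for p in ["fake1.wav", "fake2.wav"]])

gmm_real = GaussianMixture(n_components=16).fit(genuine)
gmm_fake = GaussianMixture(n_components=16).fit(spoofed)

test = mfcc_features("probe.wav")
llr = gmm_real.score(test) - gmm_fake.score(test)  # log-likelihood ratio
print("genuine" if llr > 0 else "spoofed")
```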

[SHORT] Leonid Sidorov and Archil Maysuradze. Temporal and Statistical Analysis of EEG Signals for Enhanced P300 Pattern Recognition.

Abstract This paper provides a comprehensive analysis of EEG signal processing, emphasizing pattern detection and interpretation through a blend of machine learning and statistical methods. We evaluate the efficacy of neural network architectures, particularly those employing learned convolutions, for capturing temporal dependencies in EEG data from a pattern recognition task. Our approach uses the Wilcoxon test and Holm-Bonferroni correction to identify significant variations across EEG channels and intervals, enhancing the robustness of our findings. While the temporal analysis revealed significant differences in P300 and non-P300 signals around 200 milliseconds post-stimulus, channel-wise statistical methods showed limited effectiveness, emphasizing the superiority of deep learning techniques in this aspect. These results highlight the intricate dynamics of cognitive processing and the importance of later signal segments, while acknowledging the contributions of early activity. This work lays the groundwork for advancing EEG-based brain-computer interface systems, offering the framework for decoding neural phenomena and enhancing cognitive signal analysis.
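
The channel-wise statistical procedure named above is straightforward to reproduce; a sketch on synthetic amplitudes rather than real EEG epochs:

```python
# Channel-wise comparison of P300 vs non-P300 epochs with the Wilcoxon
# signed-rank test and Holm-Bonferroni correction.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_trials, n_channels = 60, 8
p300 = rng.normal(0.5, 1.0, (n_trials, n_channels))      # paired amplitudes,
non_p300 = rng.normal(0.0, 1.0, (n_trials, n_channels))  # e.g. window means

p_values = [wilcoxon(p300[:, ch], non_p300[:, ch]).pvalue
            for ch in range(n_channels)]
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for ch, (r, p) in enumerate(zip(reject, p_adj)):
    print(f"channel {ch}: adjusted p = {p:.4f}, significant = {r}")
```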

SESSION 14. Image Analysis II. Lecture room №1. Chair: Aleksei Samarin

Dmitrii Vorobev, Artem Prosvetov and Karim Elhadji Daou. Real-time Localization of a Soccer Ball from a Single Camera.

Abstract We propose a computationally efficient method for real-time three-dimensional football trajectory reconstruction from a single broadcast camera. In contrast to previous work, our approach introduces a new set of discrete modes that are designed to accelerate the optimization procedure in a multi-mode state model while preserving centimeter-level accuracy, even in cases of severe occlusion, motion blur, and complex backgrounds. The system operates on standard CPUs and achieves low latency suitable for live broadcast settings. Extensive evaluation on a proprietary dataset of 6K-resolution Russian Premier League matches demonstrates performance comparable to multi-camera systems, without the need for specialized or costly infrastructure. This work provides a practical method for accessible and accurate 3D ball tracking in professional football environments.

Aleksandr Borisov and Sergey Makhortov. Facial Expression Generation from Neutral Geometry Using a Local Shape Model.

Abstract This paper presents a method for generating three-dimensional facial expressions from a given neutral geometry, based on a local shape model and a small set of training data. The facial geometry is pre-segmented into anatomically meaningful regions, and deformation parameters are estimated for each region with respect to data fidelity and boundary consistency. After approximating the neutral shape, expressions are synthesized and transferred back to the original geometry while preserving anatomical structure. The proposed approach ensures a high degree of expressiveness and consistency across expressions while maintaining subject-specific facial features. Experimental results demonstrate its superiority over existing methods under limited data conditions.

Daniil Timchenko and Dmitry Ignatov. TopoGAN Reproducibility Study: Enhancement and Analysis of an Emerging Paradigm.

Abstract Recent advances in deep generative models, such as Generative Adversarial Networks (GANs) and Diffusion Models, have led to remarkable progress in high-fidelity image synthesis. However, these models often produce artifacts and suffer from instability or insufficient theoretical grounding. In parallel, the field of Topological Data Analysis (TDA) has developed robust tools to understand the intrinsic structure of high-dimensional data by analysing its topological and geometric properties. This research explores the intersection of TDA and generative image models, investigating whether topological methods can enhance generative quality, stability, and interpretability. We first introduce the theoretical foundations of algebraic topology, metric geometry, and probability theory, and review recent applications of TDA to GANs. We then propose a novel approach to integrating TDA insights into generative modeling pipelines, with a focus on quantifying and minimizing topological artifacts. Experiments on benchmark datasets evaluate the impact of topological constraints on image generation fidelity and structure. Our results show that TDA-informed modifications can yield improvements in sample coherence and offer a promising direction for more robust generative models.
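
One way to quantify topological artifacts, as explored here, is persistent homology; a toy sketch with the ripser package, comparing a looped point cloud to an unstructured one:

```python
# Compare 1-dimensional persistent homology (loops) of "real" vs "generated"
# point clouds; a toy stand-in for TDA-informed generative diagnostics.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
real = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))
generated = rng.normal(0, 1, (200, 2))       # no loop structure

def h1_total_persistence(X):
    dgm = ripser(X)["dgms"][1]               # birth/death pairs of H1 features
    return float(np.sum(dgm[:, 1] - dgm[:, 0])) if len(dgm) else 0.0

print("real:", h1_total_persistence(real))
print("generated:", h1_total_persistence(generated))
```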

SESSION 15. Large Language Models and Applications II. Conference Hall «Kapitsa». Chair: Konstantin Turalchuk

Archil Maysuradze and Nikita Breskanu. Evaluating Safety of Large Language Models with Cognitive Diagnosis.

Abstract This research introduces a novel approach to LLM safety evaluation by applying cognitive diagnosis, a tool traditionally used in educational platforms. The core innovation lies in moving beyond the limiting one-exercise, one-attribute paradigm used in almost all contemporary safety benchmarks such as XSTest, ToxiGen, HarmBench, TruthfulQA, and others. We propose the application of cognitive diagnosis to LLM safety validation, develop an automated attribute extraction pipeline, and demonstrate superior predictive performance of the extracted attributes compared to the original XSTest labeling. We address a critical gap: prompts typically contain multiple safety knowledge requirements rather than a single one. For example, a prompt about historical violence may simultaneously involve ethical reasoning, harmful content detection, cultural sensitivity, and factual accuracy. Our methodology reduces LLM safety evaluation to a cognitive diagnosis task by treating LLMs as «students», prompts as «exercises», and binary scores as «responses». This allows identifying latent LLM safety knowledge levels in a multi-attribute setting. The proposed automated attribute discovery process represents an advancement in creating interpretable safety taxonomies. The resulting 11 attributes provide a more nuanced characterization than the original 8 categories from XSTest, showing improvements in predictive metrics. These results suggest the superiority of multi-attribute over single-attribute question labeling in safety evaluation.

Anna Glazkova, Olga Mitrofanova and Dmitry Morozov. Temperature Effects on Prompt-Based Keyphrase Generation with Instruction-Based LLMs.

Abstract Large language models (LLMs) demonstrate strong performance in generating coherent and contextually appropriate texts based on given instructions. However, the influence of model parameters on the output in specific tasks remains underexplored. This study examines the effect of the temperature parameter on the robustness of instruction-based LLMs in keyphrase generation (KG), a core task in information retrieval that facilitates text organization and search. Using three Russian-language LLMs (T-lite, YandexGPT, and Saiga), we compare keyphrases generated with identical prompts and varying temperature values. The results show that higher temperatures increase the diversity of generated keyphrases and unigrams, as well as the proportion of keyphrases not present in the source text. The extent of these effects varies across models. Our findings underscore the importance of selecting both the model and temperature setting in prompt-based KG. The experiments were conducted on two text collections: scientific and news texts.
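
A temperature sweep of this kind reduces to a few lines with Hugging Face transformers; the model name below is a placeholder, not one of the three models studied in the paper:

```python
# Generate keyphrases with the same prompt at several temperatures.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-org/your-russian-instruct-model"   # placeholder model id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

prompt = "Выдели пять ключевых фраз из текста:\n{text}\nКлючевые фразы:"
inputs = tok(prompt.format(text="..."), return_tensors="pt")

for t in (0.2, 0.6, 1.0, 1.4):
    out = model.generate(**inputs, do_sample=True, temperature=t,
                         max_new_tokens=64, pad_token_id=tok.eos_token_id)
    print(f"T={t}:", tok.decode(out[0], skip_special_tokens=True))
```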

David Avagian and Alyona Ball. Efficiency Evaluation Framework for LLM-generated Code.

Abstract The impressive capabilities of large language models (LLMs) in natural language processing have led to their increasing use for code generation. Comprehensive evaluation of LLM-generated code remains, however, an open challenge. While correctness is well-studied, another crucial aspect of code quality, efficiency, is still to be explored. This paper presents a new framework for assessing the efficiency of generated code based on the Mercury and EvalPerf benchmarks. We use additional test case generation and sandbox improvements to refactor Mercury’s pipeline, integrating three types of resource measurements (runtime, CPU instruction count, and peak memory usage) and three code quality metrics (Pass@1, Beyond, and DPS). Our evaluation of six LLMs (Phi-1, Phi-2, Code Llama, QwQ-32B, Qwen3, and DeepSeek-V3) shows that large LLMs vastly outperform smaller models and generated solutions are usually better optimised for time than memory. Analysing measurement and metric variance, we verify the stability of our approach. Our framework is extensible, providing a foundation for future code quality research.
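
Two of the three resource measurements named above can be sketched in pure Python (CPU instruction counts need hardware counters, e.g. Linux perf); a toy illustration, not the paper's sandboxed pipeline:

```python
# Measure wall-clock runtime and peak Python-level memory of a candidate.
import time
import tracemalloc

def measure(fn, *args):
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    runtime = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, runtime, peak

def candidate_solution(n):            # stand-in for LLM-generated code
    return sorted(range(n, 0, -1))

_, runtime, peak = measure(candidate_solution, 100_000)
print(f"runtime = {runtime:.4f} s, peak memory = {peak / 1024:.1f} KiB")
```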

[SHORT] Konstantin Burlachko and Boris Dobrov. Usage of Large Language Models in Structured Data Analysis.

Abstract The rapid advancement of Large Language Models (LLMs) has created new opportunities for automating tasks involving tabular data, including spreadsheet operations and data analysis. However, applying LLMs to this domain faces significant challenges, including context limitations, insufficient proficiency in numerical calculations, and the lack of standardized evaluation methods. This paper explores the current state of LLM applications for tabular data, reviewing both agent-based and code-generation approaches. We evaluate performance using standard metrics to assess the robustness and correctness of the generated solutions. Experimental results show that modern LLMs achieve a high degree of accuracy in generating executable code for tabular data tasks. Our findings highlight the potential of LLMs to transform tabular data management by overcoming existing limitations with innovative workflows and architectural solutions.

FRIDAY, OCTOBER 31st

SESSION 16. Information extraction from text I. Small conference hall. Chair: Natalia Loukachevitch

Ildar Baimuratov, Denis Turygin and Dmitrii Pliukhin. Towards the Automated Annotation of Regulatory Text for OWL-Based Compliance Checking.

Abstract Currently, normative regulations are typically presented in a weakly structured format within human-readable regulatory documents, which makes it possible to check information models for compliance with the regulations only manually. The first step in formalizing normative regulations can be the annotation of semantic components and domain-specific terms, but such annotation requires a significant amount of time and expertise in semantics from the user. This research focuses on automating the annotation of regulations to facilitate their translation into the OWL language, enabling subsequent automated compliance checking. We propose methods to automate annotation across the three layers of the annotation scheme: domain terms, semantic types, and semantic roles. Our approach achieves promising results, with a Recall@10 of 64.8%, a micro-averaged F1-score of 79.7%, and an Adjusted Mutual Information score of 81.88% for the three layers, respectively.

Anna Glazkova, Olga Zakharova, Olga Prituzhalova and Lyudmila Suvorova. Environmental Discourse in Russian Online Communities: Insights from Topic Modeling and Expert vs. LLM Topic Labeling.

Abstract This paper studies the content of Russian online communities that focus on environmental issues. Social networks are a major channel for sharing information and mobilizing action, and they provide a unique source for analyzing how ecological topics are represented in public discussions. Our goal is to trace the main directions of environmental discourse and to understand how digital communication reflects the spread of ecological knowledge and practices. To achieve this, we apply modern approaches from natural language processing to a large collection of posts published in such communities. We also compare how experts and large language models assign labels to the discovered topics, examining the potential of computational tools to support research in this field. The study contributes both to social research on green practices in Russia and to the development of text analysis methods for large online collections. It highlights the value of combining expert knowledge with automated approaches in order to study complex social and environmental processes.

Alexander Sychev. Analysis of the Topic Structure of a Text Collection Based on the Top2vec Model.

Abstract The report describes an approach to diagnosing the existing topic model represented in a labeled text collection, based on the Top2vec model for representing texts, words, and topics. The results of a computational experiment are presented and discussed, which studied the applicability of the joint vector representation of topics, documents, and words within Top2vec to the analysis of a real collection of short text messages gathered from a regional news portal, as well as of a term dictionary built from them.

Sergey Znamensky. Selecting longest common subsequence while avoiding non-necessary fragmentation as much as possible.

Abstract A widely used LCS method, which consists of selecting a common subsequence (CS) by maximizing its length, often results in an excessively fragmented subsequence. Closely related to LCS, the Levenshtein Metric does not always produce the expected results for the same reason.
Attempts to avoid excessive fragmentation in CS extraction have been carried out for decades using various approaches with varying degrees of success, but unfortunately were not accompanied by a clear understanding of how to measure fragmentation, let alone how to minimize it.
This paper proposes to use the number of semantically coherent common substrings to measure non-fragmentation. When determining coherence is difficult or impossible, an empirical distribution of the lengths of consistent substrings can be used instead. The ROUGE-W algorithm with a weighting function calculated based on this distribution is applicable for CS selection in practice.
The paper presents theoretical estimates of this distribution and numerical experiments with natural texts and program code. The experiments confirm the weights used in practice for the ROUGE-W metric and highlight the fundamental difference between the two kinds of data.
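
The weighted-LCS idea is concrete enough to sketch: ROUGE-W-style dynamic programming in which a weighting function f rewards consecutive matches. Here f(k) = k**1.2 is an illustrative choice; the paper derives the weights from an empirical length distribution instead:

```python
# Weighted LCS: consecutive matches are rewarded through f, which
# discourages fragmented common subsequences.
def wlcs(x, y, f=lambda k: k ** 1.2):
    n, m = len(x), len(y)
    c = [[0.0] * (m + 1) for _ in range(n + 1)]  # accumulated weighted score
    w = [[0] * (m + 1) for _ in range(n + 1)]    # current run length
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
                w[i][j] = 0
    return c[n][m]

# A contiguous match outscores an equally long but fragmented one:
print(wlcs("abcde", "abcde"), ">", wlcs("abcde", "a_b_c_d_e"))
```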

SESSION 17. Data and image processing in astronomy. Online. Chair: Alexei Pozanenko

Sergey Belkin and Alexei Pozanenko. Clusterisation in the MV – log10(Eiso) Plane for GRB–SNe: Evidence for Distinct Subclasses?

Abstract We present the most comprehensive sample available to date of supernovae associated with gamma-ray bursts (GRB–SNe), for which the peak time of the supernova light curve and the absolute magnitude at peak brightness have been identified. The sample contains 44 supernovae. We performed a correlation search between the parameters of the GRB–SNe peak brightness in the spectral filter band V (absolute magnitude MV in the rest frame) and the peak time Tmax in filter V in the rest frame, as well as between these parameters and the intrinsic gamma-ray emission parameters of the GRBs (T90,i, Eiso, Ep,i). No statistically significant correlations were confirmed between any pair of parameters. The MV–log10(Eiso) distribution, however, exhibits clustering, dividing the sample into two distinct groups. These groups may reflect differences in the initial conditions of the progenitor stars or variations in their final evolutionary stages leading to gamma-ray bursts. We also discuss dataset compilation methods and strategies for mitigating observational biases and selection effects impacting the detection of GRB–SNe and potential correlations.
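
The clustering analysis itself can be reproduced with a two-component Gaussian mixture in the (MV, log10 Eiso) plane; the points below are synthetic placeholders, not the 44-object sample:

```python
# Fit a 2-component GMM in the (M_V, log10 E_iso) plane and inspect
# the putative subclasses.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
group1 = rng.normal([-19.5, 52.5], [0.4, 0.5], (25, 2))   # (M_V, log10 Eiso)
group2 = rng.normal([-18.2, 50.8], [0.4, 0.5], (19, 2))
X = np.vstack([group1, group2])

gmm = GaussianMixture(n_components=2, n_init=10).fit(X)
labels = gmm.predict(X)
print("cluster sizes:", np.bincount(labels))
print("cluster means:\n", gmm.means_)
```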

Nicolai Pankov, Pavel Minaev, Alexei Pozanenko, Eugene Schekotikhin, Sergey Belkin, Elena Mazaeva and Alina Volnova. The Automatic Image Processing Software for Optical Transient Detection.

Abstract The rapidly evolving era of multi-wavelength (gamma, X-ray, UV/optical/IR, radio) and multichannel (EM and gravitational) observations requires processing an extensive number of images to detect fast optical transients in near-real time. This is especially important for the optical counterparts of gamma-ray bursts, including those associated with gravitational-wave events detected by LIGO, Virgo, and KAGRA. In this paper, we present STARFALL, a specially designed micro-service application for automatic astronomical image processing that mainly depends on APEX. We describe the main capabilities and performance metrics of STARFALL, demonstrate the scientific results obtained on examples, and propose future plans for the STARFALL development.

Vladimir Samodurov. Processing of multi-year data and search for gamma-ray bursts in the radio band at a frequency of 111 MHz.

Abstract The paper presents the results of an analysis of multi-year data from the multi-beam pattern of the BSA radio telescope. About 10 thousand radio sources with flux densities of a few Jy were successfully extracted. The statistics of flux errors (no worse than 10% before calibration and 5% after it for bright sources) and of flux dispersions (tens of percent, owing to scintillation on inhomogeneities of the near-solar plasma) are described. A technique for extracting weak sources is demonstrated on the example of the radio signal of Jupiter (about 5 Jy). It is shown that sources brighter than 2 Jy can be extracted from daily data, while averaging over several days lowers the detection limit to 0.3-0.5 Jy. With exactly these upper limits, the processing results are described for 12 GRBs that fell within our observations by coordinates and observation epoch. The results are presented in tables and figures.

Pavel Kaygorodov, Ekaterina Malik, Dana Kovaleva, Oleg Malkov and Bernard Debray. A new engine to build Binary star DataBase (BDB).

Abstract The Binary star DataBase BDB (https://bdb.inasan.ru) has a very long history, and its internal design has changed twice during its lifetime. The first version was written in the mid-90s as CGI shell scripts and used text files for data storage. Later it was rewritten in Stackless Python with the Nagare library. The next major update was performed during the last year: Nagare and other libraries were developing more and more compatibility issues, so we decided to rewrite the BDB code using a completely new approach. In this paper we give a brief introduction to this new approach to the distributed programming paradigm, which significantly speeds up development. We switch from the traditional Model-View-Controller approach to a distributed application in which the server is a “primary node” that controls many web clients as “subordinate nodes”, delegating all user-interface-related tasks to them.

SESSION 18. Information extraction from text II: Datasets. Small conference hall. Chair: Sergey Znamensky

Rodion Sulzhenko and Boris Dobrov. A Dataset of Russian-Language Debates For Argument Mining.

Abstract We present DebateRu, an annotated dataset of Russian-language student debates designed for argument mining in culturally specific contexts. Comprising 10 hours of spontaneous televised debates (429 arguments across 10 topics), the corpus captures authentic rhetorical strategies and socio-political discourse patterns unique to Russian youth culture. Unlike scripted debate datasets, DebateRu preserves the emotional intensity and contextual nuances of real-world argumentation, addressing a critical gap in non-English resources. We evaluate the dataset through two tasks, stance detection and argument generation, testing several state-of-the-art Russian-adapted large language models. DebateRu provides a benchmark for developing context-aware argumentation systems and studying cross-cultural discourse patterns. We release the dataset to support research in multilingual NLP, rhetorical education, and computational social science. The collected dataset is publicly available on GitHub.

Grigory Kovalev, Natalia Loukachevitch, Mikhail Tikhomirov, Olga Babina and Pavel Mamaev. Wikipedia-based Datasets in Russian Information Retrieval Benchmark RusBEIR.

Abstract In this paper, we present a novel series of Russian information retrieval datasets constructed from the “Did you know...” section of Russian Wikipedia. Our datasets support a range of retrieval tasks, including fact-checking, retrieval-augmented generation, and full-document retrieval, by leveraging interesting facts and their referenced Wikipedia articles annotated at the sentence level with graded relevance. We describe the methodology for dataset creation that enables the expansion of existing Russian Information Retrieval (IR) resources. Through extensive experiments, we extend the RusBEIR research by comparing lexical retrieval models, such as BM25, with state-of-the-art neural architectures fine-tuned for Russian, as well as multilingual models. Results of our experiments show that lexical methods tend to outperform neural models on full-document retrieval, while neural approaches better capture lexical semantics in shorter texts, such as in fact-checking or fine-grained retrieval. Using our newly created datasets, we also analyze the impact of document length on retrieval performance and demonstrate that combining retrieval with neural reranking consistently improves results. Our contribution expands the resources available for Russian information retrieval research and highlights the importance of accurate evaluation of retrieval models to achieve optimal performance. All datasets are publicly available at HuggingFace. To facilitate reproducibility and future research, we also release the full implementation on GitHub.
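
The lexical baseline referenced above takes only a few lines with the rank_bm25 package; a toy corpus, with the fine-tuned neural models out of scope here:

```python
# BM25 retrieval over a tiny tokenized Russian corpus.
from rank_bm25 import BM25Okapi

docs = [
    "Москва — столица России.",
    "Волга впадает в Каспийское море.",
    "Байкал — самое глубокое озеро в мире.",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])

query = "самое глубокое озеро".lower().split()
scores = bm25.get_scores(query)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best], list(scores))
```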

Maria Eliseeva, Natalia Efremova, Natalia Baeva and Jia Yi Yang. The Visual Genome Dataset: Translation into Russian and Statistics.

Abstract At present, the Visual Genome dataset is one of the few datasets that describe images as scene graphs, i.e., it contains information not only about image objects and their attributes but also about the relations between objects. This makes Visual Genome a promising basis for new datasets, including datasets in other languages. We have begun adapting this dataset for the Russian language.
The article analyzes the Visual Genome dataset: it describes how the dataset was created and examines the specifics of the resulting annotation. The process of translating the dataset into Russian is described, along with the difficulties encountered and the ways they were resolved. The results of a statistical analysis of the translated data are discussed separately, with the main focus on textual image descriptions and objects. The main reasons are identified that led to identical translations of different English phrases, which changed the ratio of data in the original and translated datasets.
In conclusion, we present our observations on the specifics of Visual Genome and of the data obtained in the course of translation. Examples of possible uses of the translated data and directions for further research are outlined.

[SHORT] Elena Bolshakova and Anna Stepanova. Recognizing Cognates Based on Dataset with Morpheme Segmentation.

Abstract The paper considers cognates as related words in a particular language that share the same root (e.g., air, airy, airily, airless) and thus preserve some semantic relatedness. Recognition of cognates is useful for such NLP tasks as deriving the meaning of new and rare words, paraphrase detection, and the creation of lexical derivational resources. The paper describes methods for recognizing cognates as the first stage of developing a representative derivational resource for the Russian language, based on a large dataset of words with segmented and classified morphemes (prefixes, roots, suffixes, endings, postfixes). The methods involve collecting words with the same root into disjoint groups (derivational families), accounting for homonymous roots of Russian words as well as root allomorphs (variants of the same root). Allomorphs arise due to alternations of vowels and consonants and may be common to several non-cognate words, which is the main problem in their recognition. To identify the semantic relatedness of words with such roots, clustering methods (DBSCAN, K-means, HDBSCAN) are experimentally studied with vector representations of words (embeddings) from Word2Vec and FastText models. The experiments showed acceptable quality of the described methods, sufficient to eliminate most of the manual work of collecting groups of cognates into derivational families of words.
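
A minimal sketch of the clustering stage under stated assumptions (toy corpus and an illustrative eps; the paper works with a large morphologically segmented dataset):

```python
# Group candidate same-root words by semantic relatedness: FastText
# vectors clustered with DBSCAN over cosine distance.
from gensim.models import FastText
from sklearn.cluster import DBSCAN

corpus = [["воздух", "воздушный", "душный", "душа", "душевный"],
          ["воздушный", "шар", "поднялся", "в", "воздух"],
          ["душевный", "разговор", "согрел", "душу"]]
model = FastText(corpus, vector_size=50, min_count=1, epochs=50)

candidates = ["воздух", "воздушный", "душный", "душа", "душевный"]
vectors = [model.wv[w] for w in candidates]
labels = DBSCAN(eps=0.6, min_samples=1, metric="cosine").fit_predict(vectors)
for w, l in zip(candidates, labels):
    print(w, "->", l)   # words sharing a label form one candidate family
```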

SESSION 19. Data analysis in astronomy. Online. Chair: Alexei Pozanenko

Maxim Pupkov, Art Prosvetov, Vasiliy Marchuk, Alexander Govorov, Olga Yushkova, Alexander Andreev and Vladimir Nazarov. Multimodal-Data-Driven Lunar Surface Reconstruction Using High-Resolution Imagery and Simulated Radargrams.

Abstract Accurate and high-fidelity lunar surface modeling is vital for effective mission planning and execution in contemporary lunar exploration. In this work, we advance digital terrain reconstruction by integrating novel machine learning techniques, such as Neural Radiance Fields and Gaussian Splatting, with traditional photogrammetry, leveraging high-resolution imagery in the range of 2 to 5 meters per pixel. A key focus of our study is the utilization of radargram data, specifically modeling the type of profiles that are expected to be obtained by instruments on future lunar missions. These data provide targeted height estimates within narrow swaths directly beneath an orbiter's trajectory, effectively offering reliable depth measurements at subsatellite track points. We incorporate this subsample of high-confidence altimetric information into the Neural Radiance Fields and Gaussian Splatting models by adding ground-truth depth constraints for a subset of the dataset, which enhances the learning process. Our findings demonstrate that the inclusion of radargram-derived depth information leads to a significant improvement in terrain reconstruction quality. This is evidenced by enhanced accuracy metrics when fusing data from multiple modalities. The proposed approach highlights the benefits of combining optical and radar sources for robust lunar surface modeling, thereby enabling the development of mission-ready, high-precision digital terrain products to support the next generation of lunar exploration missions.
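
The depth supervision described above amounts to an extra masked term in the training objective; a hedged sketch with illustrative names, independent of any particular NeRF or Gaussian Splatting implementation:

```python
# Photometric loss plus radargram-derived depth supervision at the
# subsatellite track points marked by `mask` (names are illustrative;
# assumes `mask` selects at least one pixel).
import torch

def terrain_loss(rgb_pred, rgb_gt, depth_pred, depth_radar, mask, w=0.1):
    photometric = torch.mean((rgb_pred - rgb_gt) ** 2)
    depth = torch.sum(mask * (depth_pred - depth_radar) ** 2) / mask.sum()
    return photometric + w * depth
```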

Alexander Rodin, Alexei Pozanenko and Viktoria Fedorova. Observation of the gravitational wave event GW190425 in the radio range at 111 MHz.

Abstract The paper presents the results of the search for and detection of radio emission from the gravitational-wave event GW190425 at a frequency of 111 MHz using the BSA radio telescope of the Lebedev Physical Institute. A new radio source was discovered approximately $2^\circ$ from the center of the region of most probable localization of GW190425, at coordinates (J2000.0) $\alpha = 16^h\,30'\pm 15',\;\delta = 13^\circ \, 21' \pm 15'$. The light curve was constructed, and the flux at maximum was estimated at $\approx 1.5$ Jy. The maximum of the light curve occurs on the 20th-30th day after the trigger. The false alarm probability for this kind of event was calculated to be $P_{fa}=5\cdot 10^{-7}$.

Ekaterina Malik, Pavel Kaygorodov, Dana Kovaleva, Oleg Malkov and Bernard Debray. On possible systematic errors when x-matching binary stars in Gaia.

Abstract Using Gaia DR3 data, catalogs of binary stars have been created that together contain information on more than 1.8 million pairs. This enlarges by more than an order of magnitude the ensemble of binary stars with known characteristics, which previously numbered 144,845 pairs. To enable a statistical analysis of the full ensemble of binaries, including both previously known and newly discovered pairs, a coordinate-based cross-identification was performed between the ILB synthetic catalog of binary stars, the most complete one before the publication of the Gaia data, and the binary-star catalogs based on Gaia DR3. An analysis of the results of this cross-identification showed that its characteristics depend both on the data of the source catalogs and on the coordinates. It is shown that in dense stellar fields, in particular in the Galactic disk, an increased fraction of false-positive identifications can be expected, whereas for systems with large proper motion the probability of a false-negative outcome is high. Possible modifications of the identification method are proposed to reduce the role of the described systematic errors and to increase the reliability of its results.

[SHORT] Eugene Shekotihin, Nicolay Pankov, Alexei Pozanenko, Pavel Minaev and Alina Volnova. Brownian Bridge Diffusion Model in the Problem of Conditional Inpainting of Astronomical Images.

Abstract The paper considers the application of the Brownian Bridge Diffusion Model (BBDM) to the problem of conditional inpainting of astronomical images. The proposed algorithm uses a single pair of images and a diffusion model trained to transform the reference frame into the target frame on the unmasked regions, in order to reconstruct the region of interest in the target frame. On examples of real astronomical survey images, the proposed method is shown to achieve stable conditional inpainting, reconstructing galaxy images from the SDSS survey based on images from the Pan-STARRS survey.

SESSION 20. Information extraction from text III. Small conference hall. Chair: Boris Dobrov

Elena Shamaeva and Natalia Loukachevitch. The Impact of Tokenization on the Quality Evaluation of Neural Syntactic Parsing.

Abstract Syntactic parsers are used as an auxiliary tool in various areas of automatic text processing. Important research directions are therefore the development of criteria for choosing a syntactic parser for a specific applied task and a methodology for evaluating parser quality. The evaluation of a parser is affected by the tokenization stage. There are two ways to evaluate a syntactic parser: using its built-in tokenizer, or using a tokenizer that returns the gold-standard tokenization. This article compares these two ways of evaluating parsing quality. The study was carried out on the Russian-language syntactically annotated corpora SynTagRus, GSD, PUD, Taiga, and Poetry, and for the Russian-capable parsers UDPipe, Stanza, Natasha, DeepPavlov, and spaCy. It was found that for a significant number of sentences the tokenization produced by the built-in tokenizer differs from the gold standard. It was also established that the average UAS and LAS values are higher when a tokenizer returning the gold-standard tokenization is used. The developed methodology for describing token categories can be used to check parsing quality when a new tokenizer is introduced. Within this study, a tokenizer returning the gold token set from the dataset was implemented for each of the parsers considered. The implementation is available at: https://github.com/Derinhelm/parser_stat/tree/tokenization_changing.
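
Once the two tokenizations coincide, UAS and LAS reduce to simple counting; a minimal sketch over (head, label) pairs, which also shows why mismatched built-in tokenization makes scoring non-trivial:

```python
# UAS: fraction of tokens with the correct head; LAS: correct head and label.
def uas_las(gold, pred):
    assert len(gold) == len(pred)      # requires identical tokenization
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj")]
print(uas_las(gold, pred))             # (1.0, 0.666...)
```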

[SHORT] Anton Polevoi and Natalia Loukachevitch. Whisper Attacks and Defenses Investigation for Russian Speech.

Abstract Automatic Speech Recognition (ASR) systems, such as Whisper, are widely used in modern applications, but are vulnerable to adversarial attacks like model-control attacks, where adversarial audio segments manipulate model behavior without prompt access. Based on prior research, this paper focuses on adversarial attacks targeting Russian inputs and proposes defense strategies.
To improve attack imperceptibility while maintaining effectiveness, we introduce a regularization technique that incorporates speech similarity metrics, leveraging acoustic embeddings to balance attack efficiency and naturalness. This allows for adversarial perturbations that are both potent and perceptually similar to natural speech.
Our findings show that long adversarial prefixes can significantly degrade Whisper's performance for Russian inputs, while shorter prefixes have a reduced impact. Additional preprocessing methods like speech enhancement showed moderate success but were less effective for real-time scenarios. This work advances understanding of ASR vulnerabilities and defenses for Whisper models in Russian audio.

[SHORT] Vadim Korobkovskii and Natalia Gorlushkina. Automation of retroconversion processes of bibliographic materials.

Abstract Creating electronic catalogs, which greatly ease readers' access to the information they need, is an important task for modern libraries. However, the problem of automatically converting existing paper catalogs into digital form has remained open since the beginning of the 21st century, as no universal solution has yet been found. The article describes the research methods and the results of introducing optimization and functional improvements. An analysis of the previously implemented program code is presented, aimed at finding ways to eliminate the identified shortcomings. As a result of the described changes, new functions were added to the algorithm for splitting records into fields of the RUSMARC format, the algorithm itself was improved, and fixes were introduced that significantly sped up the program and extended its ability to split text into fields in accordance with the RUSMARC standard.

SESSION 21. Database Management. Conference Hall «Kapitsa». Chair: Maria Poltavtseva

Semyon Grigorev, Vladimir Kutuev, Olga Bachishche, Vadim Abzalov and Vlada Pogozhelskaya. GLL-based Context-Free Path Querying for Neo4j.

Abstract We propose a GLL-based context-free path querying algorithm that handles queries in Extended Backus-Naur Form (EBNF) using Recursive State Machines (RSM). Utilization of EBNF allows one to combine traditional regular expressions and mutually recursive patterns in constraints natively. The proposed algorithm solves both the reachability-only and the all-paths problems for the all-pairs and the multiple sources cases. The evaluation on real-world graphs demonstrates that the utilization of RSMs increases the performance of query evaluation. Being implemented as a stored procedure for Neo4j, our solution demonstrates better performance than a similar solution for RedisGraph. The performance of our solution on regular path queries is comparable to the performance of the native Neo4j solution, and in some cases, our solution requires significantly less memory.
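
For intuition, the reachability-only semantics of such queries can be shown with the textbook closure algorithm over a grammar in normal form; the paper's GLL/RSM algorithm is substantially more capable than this sketch:

```python
# Naive CFL-reachability closure: term_rules {(A, a)} for A -> a and
# bin_rules {(A, B, C)} for A -> B C; returns all derivable (A, u, v).
def cfpq(edges, term_rules, bin_rules):
    facts = {(A, u, v) for (u, a, v) in edges for (A, t) in term_rules if t == a}
    changed = True
    while changed:
        changed = False
        new = {
            (A, u, w)
            for (B, u, v) in facts
            for (C, v2, w) in facts if v2 == v
            for (A, B2, C2) in bin_rules if (B2, C2) == (B, C)
        } - facts
        if new:
            facts |= new
            changed = True
    return facts

# S -> A S1 | A B, S1 -> S B, A -> "a", B -> "b"  (balanced a^n b^n paths)
edges = {(0, "a", 1), (1, "a", 2), (2, "a", 0), (0, "b", 3), (3, "b", 0)}
result = cfpq(edges,
              term_rules={("A", "a"), ("B", "b")},
              bin_rules={("S", "A", "S1"), ("S", "A", "B"), ("S1", "S", "B")})
print(sorted((u, v) for (N, u, v) in result if N == "S"))
```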

Habibur Rahman Habib and Ramon Antonio Rodriges Zalipynis. Scalable Top-K Subarray Searches: Seamless and Distributed NetCDF API Interception.

Abstract This paper presents an approach that enables distributed and seamless processing of NetCDF-based geospatial arrays through API interception. By intercepting standard NetCDF read operations and routing requests through a Vert.x/Hazelcast cluster, our system achieves distributed processing without requiring code modifications. Evaluation using MODIS satellite data demonstrates linear scaling to 16 nodes with a 1.76× speedup over serial processing, while maintaining full API compatibility. The architecture's event-driven design achieves a 15 ms average request latency through chunk-based distribution and dynamic load balancing. This work bridges conventional NetCDF workflows with modern distributed computing in the Cloud, enabling scalable analysis through familiar interfaces.

Alexander Solovyev. CAP theorem and NewSQL DBMS.

Abstract The article proposes the mathematical apparatus of queueing theory for modeling NewSQL DBMSs and distributed information systems. The applicability of the proposed apparatus to NewSQL modeling is demonstrated. A mathematical formulation of the CAP theorem is proposed. Typical examples of NewSQL DBMSs are reviewed in the article, along with test data confirming strong consistency and availability of distributed data. The task of modeling a distributed DBMS is formulated. A set of models for calculating the main parameters of a distributed system and a set of queueing theory models applicable to the modeling of a distributed system are proposed. Distributed system parameters are matched to the terms of the CAP theorem, which makes it possible to confirm or refute its provisions during modeling. In further research, it is planned to refine the mathematical models proposed in this article and to confirm their correctness and applicability to the modeling of distributed DBMSs and information systems built using NewSQL.
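
Where the article maps system parameters to queueing models, the simplest instance is the M/M/1 node; a back-of-the-envelope sketch with toy rates, not figures from the article:

```python
# M/M/1 node: arrival rate lam and service rate mu in requests/second.
def mm1(lam, mu):
    assert lam < mu, "queue is unstable"
    rho = lam / mu                 # utilization
    l_q = rho ** 2 / (1 - rho)     # mean number of requests waiting
    w = 1 / (mu - lam)             # mean time in system (queue + service)
    return rho, l_q, w

rho, l_q, w = mm1(lam=800, mu=1000)
print(f"utilization={rho:.0%}, mean queue={l_q:.1f}, latency={w*1000:.1f} ms")
```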


Important dates

Conference
Submission deadline for papers June 9, 2025
Submission deadline for tutorials June 2, 2025
Notification for the first round July 24, 2025
Final notification of acceptance September 8, 2025
Deadline for camera-ready versions of the accepted papers September 15, 2025
Conference October 29-31, 2025