Multimodal-to-Text Prompt Engineering in Large Language Models Using Feature Embeddings for GNSS Interference Characterization

Document Details


Harshith Manjunath, Lucas Heublein, Tobias Feigl, Felix Ott

Tags

GNSS interference, large language models, prompt engineering, signal processing

Summary

This paper explores the application of large language models (LLMs) for characterizing GNSS (Global Navigation Satellite System) interference. The authors introduce a multimodal-to-text approach using feature embeddings and a language model, demonstrating improved performance in interference classification over traditional machine-learning methods.

Full Transcript

Multimodal-to-Text Prompt Engineering in Large Language Models Using Feature Embeddings for GNSS Interference Characterization

Harshith Manjunath, Lucas Heublein, Tobias Feigl, Felix Ott
Fraunhofer Institute for Integrated Circuits IIS, Nürnberg, Germany
{harshith.manjunath, lucas.heublein, tobias.feigl, felix.ott}@iis.fraunhofer.de
arXiv:2501.05079v1 [cs.AI] 9 Jan 2025

Abstract—Large language models (LLMs) are advanced AI systems applied across various domains, including NLP, information retrieval, and recommendation systems. Despite their adaptability and efficiency, LLMs have not been extensively explored for signal processing tasks, particularly in the domain of global navigation satellite system (GNSS) interference monitoring. GNSS interference monitoring is essential to ensure the reliability of vehicle localization on roads, a critical requirement for numerous applications. However, GNSS-based positioning is vulnerable to interference from jamming devices, which can compromise its accuracy. The primary objective is to identify, classify, and mitigate these interferences. Interpreting GNSS snapshots and the associated interferences presents significant challenges due to the inherent complexity, including multipath effects, diverse interference types, varying sensor characteristics, and satellite constellations. In this paper, we extract features from a large GNSS dataset and employ LLaVA to retrieve relevant information from an extensive knowledge base. We employ prompt engineering to interpret the interferences and environmental factors, and utilize t-SNE to analyze the feature embeddings. Our findings demonstrate that the proposed method is capable of visual and logical reasoning within the GNSS context. Furthermore, our pipeline outperforms state-of-the-art machine learning models in interference classification tasks. Github: https://gitlab.cc-asp.fraunhofer.de/darcy_gnss

Index Terms—Large Language Models, LLaVA, Multimodal-to-Text, Prompt Engineering, In-context Learning, Global Navigation Satellite System, Interference Characterization
I. INTRODUCTION

Humans interact with the world through various channels, such as vision and language, with each channel offering distinct advantages for representing and communicating specific concepts. The aim is to develop a versatile assistant capable of effectively following multimodal vision-and-language instructions, aligning with human intent to execute a wide range of real-world tasks in dynamic environments. Language-augmented foundational vision models have demonstrated strong performance in open-world visual understanding tasks, including classification, object detection, and semantic segmentation, as well as visual generation and editing.

LLMs serve as a universal interface for a general-purpose assistant, enabling the explicit representation of various task instructions in language. The recent success of ChatGPT has exemplified the power of aligned LLMs in adhering to human instructions. Pre-trained language models are task-agnostic, extending to the learned hidden embedding space, where models such as recurrent neural networks or transformers are pre-trained on web-scale unlabeled text corpora for general tasks and subsequently fine-tuned for specific tasks. LLMs, characterized by their larger model size and enhanced language comprehension, are capable of in-context learning, where they acquire new tasks from a small set of examples provided in the prompt during inference time. Through advanced application and augmentation techniques, LLMs can be deployed as AI agents: artificial entities that perceive their environment, make decisions, and take actions. These agents often need to augment LLMs to access updated information from external knowledge bases and to verify whether system actions yield the desired results.

Previous research has primarily focused on developing agents tailored to specific tasks and domains. However, in the context of GNSS interference monitoring, no studies have yet explored the use of language models for analyzing GNSS signals, which have potential applications in crowdsourcing, aerospace systems, and toll collection management for highway trucking. The application of LLMs in these domains remains largely unexplored and is still at an early stage of research. The accuracy of GNSS receivers is significantly compromised by interference from jamming devices, a problem that arises from the increasing availability of affordable and accessible jammers. Therefore, mitigating these interference signals is crucial, requiring the detection, classification, and localization of the interference source. However, analyzing GNSS signals and characterizing the interferences present substantial challenges due to the wide variability in interference bandwidths, signal-to-noise ratios, antenna characteristics, and environmental factors such as multipath effects.

Fig. 1: Based on feature embeddings extracted from GNSS snapshots, which include associated interference characteristics (class: chirp, noise, tone, etc.; bandwidth: [2, ..., 60]; signal-to-noise ratio: [6, 8, 10]), a language model (LLM) generates a description in response to a contextual query provided by a user, e.g., "What are the features and characteristics of this GNSS snapshot?" Example output: "Based on the provided embedding, I can observe several features and anomalies in the GNSS signal snapshot. The embedding contains various signal data features such as carrier phase, code phase, and signal strength. The carrier phase has a value of 0.037, which is within the normal range for carrier phase measurements. The low signal strength value may indicate that the signal is weak or that there is interference affecting the signal."

This work has been carried out within the DARCII project, funding code 50NA2401, sponsored by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and supported by the German Aerospace Center (DLR), the Bundesnetzagentur (BNetzA), and the Federal Agency for Cartography and Geodesy (BKG).
The challenge lies in the multimodal vision-and-text approach to interference characterization, the transferability of techniques from image analysis to GNSS tasks, and the analysis of snapshots through feature extraction. LLMs hold great promise for GNSS interference monitoring, thanks to their capability to process and interpret complex, multivariate data in real time. They can accurately detect interference events while delivering explainable insights that strengthen system resilience and adaptability. Figure 1 illustrates our objective: to retrieve information from GNSS snapshots, characterize features using LLMs, and present a descriptive output to the end user, who may be a decision-maker, such as a non-expert interpreting the model's output in real-world operations. We evaluate the LLM output by examining prompt engineering in the context of GNSS interference monitoring systems.

Contributions. The primary objective of this work is to provide a detailed characterization of GNSS interferences using language models and prompt engineering. The novelty lies in adapting LLM capabilities to the specialized domain of GNSS interference monitoring by addressing key challenges; this includes adapting LLMs to process GNSS signal data, which fundamentally differs from text data. The key contributions of this research are as follows: (1) We introduce a method that leverages feature extraction and LLaVA to generate descriptive outputs in response to user queries. (2) We present a comprehensive description of a GNSS dataset, focusing on its characteristics, including interference class, bandwidth, signal-to-noise ratio, and multipath effects. (3) We explore prompt engineering by manually evaluating hundreds of query-output pairs with varying levels of detail. (4) We assess feature embeddings using t-SNE analysis (a minimal sketch follows below). (5) We demonstrate that our proposed method outperforms traditional machine learning (ML) approaches in the task of interference classification.
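As an illustration of contribution (4), the sketch below shows one way to run such a t-SNE analysis over snapshot feature embeddings. This is not the authors' code: the embeddings and labels are random placeholders standing in for real CLIP features, and the perplexity and plotting choices are assumptions; only the 512-dimensional embedding size and the seven classes (none plus six interference types) come from the paper.

```python
# Minimal sketch (not the authors' code): assess GNSS snapshot embeddings
# with t-SNE by projecting them to 2-D for visual cluster inspection.
# Random placeholders stand in for real CLIP features; seven classes cover
# "none" plus the six interference types shown in Fig. 6.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 512)).astype("float32")  # placeholder features
labels = rng.integers(0, 7, size=500)                       # placeholder classes

# t-SNE preserves local neighbourhood structure, so interference classes that
# are well separated in embedding space should form distinct 2-D clusters.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of GNSS snapshot feature embeddings")
plt.show()
```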
II. RELATED WORK

First, we provide a summary of related work that has evaluated prompt engineering for language models (see Section II-A). Following this, we introduce the methods used for GNSS interference classification (see Section II-B).

A. Prompt Engineering with Language Models

In the context of signal processing, Verma & Pilanci establish a comparison between classical Fourier transforms and learnable time-frequency representations for each intermediate activation signal within an LLM. Nguyen et al. examine the vulnerabilities of LLMs in the context of 6G technology, particularly focusing on their susceptibility to malicious exploitation by analyzing known security weaknesses. Lin et al. highlight the potential of deploying LLMs at the 6G edge, emphasizing their ability to reduce long response times, high bandwidth costs, and data privacy violations. Yu et al. investigate the application of LLMs in diagnosing cardiac diseases and sleep apnea using ECG signals by incorporating expert knowledge to guide the models beyond their inherent capabilities. Go et al. provide a comprehensive survey of prompt engineering, a technique that involves augmenting a large pre-trained model with task-specific prompts to adapt the model to new tasks. This approach allows for predictions based solely on the prompt, without requiring updates to the model parameters, which aligns with our proposed method: we extract features from a large database without fine-tuning the language model. However, this method requires complex reasoning to identify model errors, hypothesize about gaps or misleading aspects in the current prompt, and communicate the task with clarity. Its potential is limited due to the lack of sufficient guidance for complex reasoning, as well as the need for detailed descriptions, context specification, and a structured reasoning framework. Nevertheless, prompt engineering and language models have yet to be applied to GNSS interference monitoring, which we address in the subsequent section.

B. GNSS Interference Classification

Several recent methods have focused on the classification of GNSS interferences. For instance, Swinney et al. explored the use of jamming signal power spectral density, spectrogram, raw constellation, and histogram signal representations as images to apply transfer learning from the imagery domain. Ferre et al. employed support vector machines and convolutional neural networks to classify jammer types in GNSS signals, whereas Li et al. and Xu et al. adopted a twin SVM-based approach. Ding et al. leveraged ML models in a single static (line-of-sight) propagation environment; in contrast, we extend this work by considering multipath environments. Gross et al. utilized a maximum likelihood method to ascertain whether a synthetic signal is compromised by multipath or jamming. Heublein et al. highlighted the challenges associated with data discrepancies in the GNSS context. Few-shot learning has been applied in the GNSS context to integrate new classes into a support set, aiming for a more continuous representation between positive and negative interference pairs. Raichur et al. adapted to novel interference classes through continual learning. Brieger et al. incorporated both spatial and temporal relationships between samples by utilizing a joint loss function and a late fusion technique. Furthermore, Raichur et al. introduced a crowdsourcing approach that leverages smartphone-based features to localize the source of detected interference. For a comprehensive overview of recent adaptive GNSS applications, refer to Ott et al. Gaikwad et al. proposed a federated learning approach using few-shot learning and aggregation of the model weights on a global server, introducing a dynamic early stopping method to balance out-of-distribution classes based on representation learning, specifically utilizing the maximum mean discrepancy of feature embeddings between local and global models. Heublein et al. proposed an ML approach that achieves high generalization in classifying interference through orchestrated monitoring stations deployed along highways. The presented semi-supervised approach is coupled with an uncertainty-based voting mechanism combining Monte Carlo dropout and Deep Ensembles, which effectively minimizes the requirement for labeled training samples to less than 5% of the dataset while improving adaptability across varying environments.

III. METHODOLOGY

First, we provide an overview of the method. Following that, we introduce the embedding and LLM models. Lastly, we present the specifics of our prompt engineering approach.

Method Overview. Figure 2 presents a detailed overview of the proposed pipeline. The input image has dimensions of 1,024 × 34, and the dataset comprises a total of 42,592 GNSS snapshots. Initially, the images are processed using the CLIP (contrastive language-image pre-training) visual encoder ViT-L/14. The features extracted by the CLIP model are stored as embeddings in a vector store. The process continues when the user submits a query, which includes an image (a GNSS snapshot) and a related question. Using a context query prompt, we instruct the LLM on its task. The LLM employed is based on LLaVA, and the output is constrained to a maximum of 500 tokens.

Fig. 2: Method overview. GNSS snapshots are processed (1), and the resulting image embeddings are stored as vectors (2). When the user submits a context query prompt (3), the LLM (4) provides the corresponding output (5).

Vision Encoder. We employ the CLIP¹ ViT-L/14 vision encoder (refer to Figure 3), as proposed by Radford et al. CLIP is a neural network trained on a diverse set of (image, text) pairs. It is designed to predict the most relevant text snippet for a given image based on natural language instructions, without direct task-specific optimization. This approach facilitates zero-shot transfer of the model to downstream tasks.

Fig. 3: Overview of CLIP for an image-text pair input: a text encoder and a vision encoder produce embeddings T_1, ..., T_N and I_1, ..., I_N, and all pairwise similarities I_i · T_j are computed.

¹ CLIP encoder: https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14
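The vision-encoder step (1) can be sketched with the standard Hugging Face CLIP API. The checkpoint is the one cited in footnote 1; the snapshot file name and the RGB conversion are illustrative assumptions, and the exact output dimension depends on the checkpoint's projection head:

```python
# Minimal sketch of steps (1)-(2) in Fig. 2: embed a GNSS snapshot with the
# CLIP ViT-L/14 vision encoder. Checkpoint from footnote 1; the snapshot
# file name and RGB conversion are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("gnss_snapshot.png").convert("RGB")  # hypothetical file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # Image features after CLIP's projection head; the dimension depends on
    # the checkpoint (768 for stock ViT-L/14), while the paper reports stored
    # embeddings of size (1, 512).
    features = model.get_image_features(**inputs)

features = torch.nn.functional.normalize(features, dim=-1)  # cosine-ready
print(features.shape)
```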
Vector Store. A prevalent method for storing and searching unstructured data involves embedding the data, storing the resulting embedding vectors, and then embedding the unstructured query at query time to retrieve the vectors most similar to the embedded query. A vector store manages the storage of embedded data and executes vector searches (refer to Figure 4). In this work, we utilize the Facebook AI similarity search (FAISS) vector database to store embeddings of size (1, 512).

Fig. 4: Overview of the vector store: source images are loaded, transformed, and embedded (CLIP model), the embeddings y_1, y_2, ..., y_n ∈ R^d are stored, and for a query embedding x ∈ R^d the most similar context is retrieved as k-argmin_{i=1,...,n} ||x − y_i||².
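A minimal sketch of this store follows, assuming a flat exact-search index since the paper names FAISS but not a specific index type; the embedding size of 512 and the count of 42,592 snapshots come from the text:

```python
# Minimal sketch of the vector store in Fig. 4, assuming a flat (exact) L2
# index. Embedding size 512 and the 42,592-snapshot count come from the text;
# the random vectors are placeholders for real CLIP embeddings.
import numpy as np
import faiss

d = 512
index = faiss.IndexFlatL2(d)  # exact nearest-neighbour search, no training step

rng = np.random.default_rng(0)
y = rng.normal(size=(42_592, d)).astype("float32")  # one vector per snapshot
index.add(y)                                        # step (2): store embeddings

x = rng.normal(size=(1, d)).astype("float32")       # query embedding
# Step (3): retrieve the k most similar snapshots, k-argmin_i ||x - y_i||^2.
distances, ids = index.search(x, 4)
print(ids[0], distances[0])
```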
Language Model. The objective is to effectively harness the capabilities of both the pre-trained LLM and the visual model. We employ the LLaVA² model, which is based on Vicuna, as the language model f_ϕ(·), parameterized by ϕ. Although LLaVA-NeXT is too large for our current hardware configuration, it remains a potential architecture for future work. Figure 5 illustrates the network architecture, where a GNSS snapshot is used as input and user queries serve as language instructions. Given an input image snapshot S_v, the visual feature Z_v = g(S_v) is extracted using the pre-trained CLIP visual encoder ViT-L/14. A linear layer maps the image features into the word embedding space; specifically, the visual features Z_v are projected via the matrix W into language embedding tokens H_v by H_v = W · Z_v. During the training phase, multi-turn conversation data (S_q^1, S_a^1, ..., S_q^T, S_a^T) is generated for each image S_v, where T represents the total number of turns. Following the approach used in LLaVA, we conduct instruction-tuning of the LLM on the prediction tokens using an auto-regressive training objective: for a sequence of length L, the probability of the target answer S_a is computed by

p(S_a | S_v, S_instruct) = ∏_{i=1}^{L} p_θ(s_i | S_v, S_instruct,<i, S_a,<i),

where θ denotes the trainable parameters and S_instruct,<i and S_a,<i are the instruction and answer tokens preceding the current prediction token s_i.

² LLaVA model: https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf

Fig. 5: Network architecture of LLaVA. The vision encoder extracts Z_v from the image S_v; the projection W yields language embedding tokens H_v, which the language model f_ϕ processes together with the embedded language instruction S_q (tokens H_q) to produce the language response S_a.

Fig. 6: Exemplary snapshot samples (concatenation of 10 samples) of the non-interference class (a) and all six interference types (b to g: chirp, frequency-hopper, modulated, multitone, pulsed, noise), a signal with chirp interference with different bandwidths (BW 2, 20, 60) (h to j) and signal-to-noise ratios (SNR −10, 4, 10) (k to m), and a chirp interference from scenario 1 (open environment), scenario 7, and scenario 8 (n to p). Figures from Heublein et al.

Fig. 7: Overview of different multipath scenarios (a: scenario 2; b: scenario 5; c: scenario 7; d: scenario 8), where large black absorber walls are placed between and around the signal generator and the antenna (from Heublein et al.).

Prompt Engineering. The prompt modifies the model input as x_input = H(S_v, t), where H represents the task instruction function, taking the image S_v and text t as inputs to produce a modified input representation x_input. In-context learning is a method where the model is presented with a sequence of related …
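The query steps (3) to (5) of Fig. 2 can be sketched end-to-end with the LLaVA checkpoint cited in footnote 2. This is a minimal sketch rather than the authors' implementation: the prompt template follows the public model card for llava-v1.6-mistral-7b-hf, the user question is the one from Fig. 1, the 500-token output cap is stated in the text, and the image file name is a placeholder.

```python
# Minimal sketch of query steps (3)-(5) in Fig. 2: ask LLaVA a question about
# a GNSS snapshot image. Checkpoint from footnote 2; prompt format per its
# model card; "gnss_snapshot.png" is a placeholder file name.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("gnss_snapshot.png")
prompt = ("[INST] <image>\nWhat are the features and characteristics "
          "of this GNSS snapshot? [/INST]")  # user question from Fig. 1

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
# The paper constrains the output to a maximum of 500 tokens.
output_ids = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```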
