SwinTRG: Swin Transformer Based Radiology Report Generation for Chest X-rays
SwinTRG: Swin Transformer Based Radiology Report Generation for Chest X-rays

DISSERTATION

Submitted to the University of Kerala in partial fulfillment of the requirements for the award of the M.Sc. in Computer Science Degree, University of Kerala

BY
SIYAHUL HAQUE T P (97322607030)
Department of Computer Science, University of Kerala, Kariavattom Campus, Thiruvananthapuram - 695581, Kerala.
AUGUST 2024

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF KERALA
THIRUVANANTHAPURAM, KERALA - 695581

CERTIFICATE

This is to certify that this dissertation entitled "SwinTRG: Swin Transformer Based Radiology Report Generation for Chest X-rays" is the bonafide record of work carried out by SIYAHUL HAQUE T P (97322607030) in partial fulfillment of the requirements for the completion of the Degree of M.Sc. in Computer Science at the Department of Computer Science, University of Kerala, Thiruvananthapuram.

Ms. Krishna S S, Assistant Professor, Department of Computer Science, University of Kerala
Dr. D. Muhammad Noorul Mubarak, Associate Professor and Head, Department of Computer Science, University of Kerala

Internal Examiner    External Examiner

ACKNOWLEDGEMENT

First and foremost, I thank God for the good health and well-being that were required to complete this project. I would like to express my sincere thanks to Dr. D. Muhammad Noorul Mubarak, Head of the Department, Department of Computer Science, University of Kerala, for his valuable suggestions and vital encouragement. I convey my heartfelt gratitude to my guide, Ms. Krishna S S, Assistant Professor, Department of Computer Science, University of Kerala, for encouraging and supporting me all the time; her guidance helped me throughout the research and the writing of this thesis. I take immense pleasure in thanking Dr. Philomina Simon, Assistant Professor, Dr. Aji S, Associate Professor, and Dr. Vinod Chandra, Professor, Department of Computer Science, University of Kerala, for their support and encouragement. I am greatly obliged to Dr. Aswathy A. L, Ms. Shyja Rafeek S, Ms. Rhythu N. Raj, Ms. Hazeena A. J, Ms. Neethu M. S, Dr. Vidhya M, and Ms. Misaj S, Assistant Professors, Department of Computer Science, University of Kerala, and all other teaching faculty for the help and support rendered to me. On this occasion, I remember the valuable suggestions and prayers offered by my family members and friends, which were indispensable for the successful completion of my dissertation.

SIYAHUL HAQUE T P

DECLARATION

I hereby declare that the work presented in the dissertation titled "SwinTRG: Swin Transformer Based Radiology Report Generation for Chest X-rays" was completed by me under the guidance of Ms. Krishna S S, Assistant Professor, Department of Computer Science, University of Kerala, Kariavattom Campus, Thiruvananthapuram, and has not been included in any other thesis submitted previously for the award of any degree.

Place: Kariavattom
Date:
SIYAHUL HAQUE T P

ABSTRACT

The automation of medical report generation represents a pivotal advancement in radiology, with a primary focus on improving the efficiency and accuracy of reporting processes. In this study, we introduce SwinTRG, a novel model aimed at revolutionizing the generation of radiology reports. SwinTRG integrates Swin Vision Transformers (Swin-ViT) for feature extraction and BioBERT for report generation, capitalizing on their respective strengths to enhance overall performance.
Through rigorous experimental evaluation on the well-established IU X-ray radiology reporting dataset, SwinTRG demonstrates its superiority over existing state-of-the-art models, with enhanced performance in terms of both generation effectiveness and clinical efficacy. By leveraging the Swin-ViT architecture for feature extraction and the advanced capabilities of BioBERT for natural language processing, SwinTRG achieves remarkable results in generating accurate and clinically relevant radiology reports. Our study contributes valuable insights into the effectiveness of Transformer-based approaches and diverse feature extraction methods in the context of radiology reporting. The findings shed light on the potential of advanced machine learning techniques to significantly improve diagnostic accuracy and aid clinical decision-making in radiology practice.

Contents
1 Introduction
2 Literature Survey
3 Problem Definition
4 Existing System
5 Proposed System
6 Methodology
  6.1 Design
    6.1.1 Visual Extractor
    6.1.2 Semantic Embedding
    6.1.3 Report Generator
  6.2 Data Preparation
    6.2.1 Dataset for Chest X-Ray Report Generation
7 Implementation
8 Results and Discussion
9 Conclusion
References

List of Figures
1.1 Pneumonia
1.2 Atelectasis
1.3 Pneumothorax
1.4 Mass
2.1 An overview of the three main encoder-decoder architectures
4.1 Overview of the ARRG system workflow
6.1 Architecture of Proposed SwinTRG Model
6.2 Architecture of Swin Transformer
6.3 Sample image and corresponding label of the IU X-Ray dataset
7.1 Feature maps of the Swin Transformer

List of Tables
1.1 Some tags used in the IU X-Ray dataset
6.1 Image dataset partition
7.1 Configuration of the SwinTRG model
8.1 Comparison results of the SwinTRG model on the IU X-Ray dataset
8.2 Table showing images, predicted text, and ground truth

Chapter 1 Introduction

Medical imaging technology has become indispensable in modern healthcare, facilitating accurate diagnosis and treatment planning. Composing precise and comprehensive medical reports demands considerable expertise and time from physicians, and differences in experience among healthcare professionals can lead to misinterpretation or oversight of crucial imaging findings, potentially impacting diagnostic accuracy. In response to the pressing need for efficient and accurate report generation, research has increasingly focused on computer-aided approaches to diagnosis and treatment. Our study addresses this challenge by introducing Swin Transformer Based Report Generation (SwinTRG), a novel model designed to streamline the generation of radiology reports. By leveraging advanced machine learning techniques, SwinTRG aims to enhance the work efficiency and service quality of medical professionals. The process of medical report generation can be likened to the image captioning task, albeit with notable distinctions.
Unlike single-sentence descriptions, medical reports are more extensive, requiring the model to produce coherent long-form texts aligned with physicians' reasoning. Moreover, generating medical reports necessitates robust cross-modal information interaction, emphasizing the correlation between diagnostic intentions and observation content. Our approach with SwinTRG models this process as a state transition mechanism, simulating changes in observation intentions and generating corresponding text descriptions iteratively. By integrating Swin Vision Transformers (Swin-ViT) for feature extraction and BioBERT for natural language processing, SwinTRG strives to achieve superior performance in generating accurate and clinically relevant radiology reports. In this study, we present an in-depth exploration of SwinTRG's capabilities through experimental evaluation on an established radiology reporting dataset, the IU X-ray collection. Our findings contribute valuable insights into the effectiveness of Transformer-based approaches and diverse feature extraction methods, offering significant advancements in automated radiology report generation.

Table 1.1: Some tags used in the IU X-Ray dataset
Figure 1.1: Pneumonia. Figure 1.2: Atelectasis. Figure 1.3: Pneumothorax. Figure 1.4: Mass.

As noted above, creating medical reports can be compared to the task of image captioning, but with significant differences: reports are extensive and must reflect a physician's reasoning process, which requires strong cross-modal information interaction linking diagnostic intentions with observed image content. Implementing automated report generation systems can significantly enhance healthcare efficiency by reducing the workload on radiologists, minimizing the risk of errors, and ensuring that patients receive timely and accurate diagnoses. The field of automated medical report generation is rapidly evolving, with ongoing research exploring new models and techniques. Future work could focus on improving model accuracy, expanding to other types of medical imaging, and integrating these systems into clinical practice.

Chapter 2 Literature Survey

The automation of radiology report generation through artificial intelligence (AI) has seen notable advances in recent years. Traditional methods relied heavily on manually engineered features and basic machine learning models, but recent studies highlight the effectiveness of deep learning architectures, especially transformer-based models, in automating this process. This literature review examines the progress and current state of AI-driven radiology report generation, focusing on key models, methodologies, and research findings.

The theoretical foundations of AI-driven radiology report generation are rooted in computer vision and natural language processing (NLP). Central to this domain is the application of Vision Transformers (ViT), sequence-to-sequence models, and bidirectional transformers such as BERT. These models leverage transformers' ability to understand complex spatial and contextual relationships within data, thereby enhancing the accuracy and coherence of generated reports. Initial efforts to automate radiology report generation involved rule-based systems and basic machine learning models that required extensive manual feature engineering.
The development of deep learning, particularly convolutional neural networks (CNNs), allowed researchers to explore end-to-end learning methods that could process raw pixel data from medical images. This transition marked a significant milestone, paving the way for more advanced models built on transformer architectures.

Wang et al. (2020) demonstrated the use of Vision Transformers (ViT) for radiology report generation. Adapting the ViT model, originally designed for general computer vision tasks, to medical imaging, they showed its capability to learn directly from raw pixel data without manual feature engineering. This approach successfully captured complex spatial relationships within medical images, producing coherent and clinically relevant reports.

Zhang et al. (2021) introduced the R2Gen model, employing a sequence-to-sequence architecture specifically for radiology report generation. This model encodes image representations into a fixed-length vector and decodes it into a sequence of words, effectively capturing the sequential nature of report writing. Trained on large-scale datasets of paired images and reports, R2Gen achieved notable fluency and coherence in narrative generation.

Liu et al. (2019) explored the integration of BERT into radiology report generation pipelines. By fine-tuning BERT on extensive radiology datasets, the model was able to understand domain-specific medical context and terminology, thereby enhancing the grammatical and clinical accuracy of the generated reports.

Chen et al. (2022) investigated multi-modal fusion techniques to enhance the quality and accuracy of AI-generated radiology reports. By combining data from radiological images, clinical notes, and patient demographics, their model obtained a more comprehensive understanding of medical conditions, leading to more informative and relevant narratives.

Li et al. (2023) addressed the challenge of model generalization across medical institutions by proposing domain adaptation strategies. By fine-tuning pre-trained models on target-domain data, they bridged the domain gap caused by variations in imaging protocols and clinical practices, improving model performance across diverse settings. The architectures are compared in the block diagram shown in Figure 2.1.

Figure 2.1: An overview of the three main encoder-decoder architectures

Gao et al. introduced a new learning paradigm for medical report generation (MRG) called TranSQ, a semantic query learning approach built on the Transformer that mimics the cognitive processes involved in a doctor's interpretation of medical images. In TranSQ, an intention embedding set is learned to semantically query visual features and generate sentence candidates that constitute coherent and clinically relevant reports; this corresponds to the multi-perspective observation and description method applied by medical professionals. One major innovation of TranSQ is its intention-embedding learning strategy based on bipartite matching. The mechanism dynamically aligns intention embeddings with the generated sentences during training, helping medical concepts be naturally incorporated into the observation intentions.
Equipped with this strategy, TranSQ can generate accurate sentences with clinical meaning, reflecting intended observations that help address the nuances and complexities of medical image interpretation. Experimental results on the IU X-ray and MIMIC-CXR datasets showed TranSQ to be state-of-the-art in generation effectiveness and clinical efficacy, and thorough ablation studies verified the innovation and interpretability of the model, confirming the effectiveness of the intention embedding learning strategy. The authors state that the code of TranSQ will be made publicly available, which can foster further research into, and applications of, medical report generation. This work contributes a much-needed strong framework that aligns automated report generation with the cognitive processes of medical professionals.

The use of AI in radiology report generation has sparked debates regarding interpretability, trustworthiness, and the potential for AI to replace human radiologists. Ensuring that AI-generated reports are clinically accurate and easily interpretable by healthcare professionals is a critical challenge. Integrating advanced AI models such as ViT, BERT, and GANs into radiology report generation has significant theoretical implications for both computer vision and NLP. Practically, these advancements can streamline radiology workflows, reduce documentation burdens, and enhance diagnostic accuracy. The literature on AI-driven radiology report generation underscores the transformative potential of transformer-based models and advanced deep learning techniques. By addressing existing gaps and refining current methodologies, future research can further improve the quality and utility of automatically generated radiology reports, ultimately enhancing patient care.

Chapter 3 Problem Definition

Automated radiology report generation aims to replace manual report writing with an advanced, fully automated system for more efficient and accurate medical diagnostics. Conventionally, radiologists interpret images themselves and write reports manually, a process that is time-consuming and a source of variability. With the growing volume of high-resolution medical imaging data, there is strong demand for systems that can automatically generate accurate and comprehensive reports to relieve the workload of healthcare professionals. The challenge is to develop an intelligent, feasible, and reliable computer-vision-based deep learning solution that interprets complex medical images and produces detailed, clinically relevant text. That is, it must bring state-of-the-art techniques in image processing and natural language generation to bear on imaging of varied modalities and a wide spectrum of medical conditions. High accuracy and consistency are paramount, while issues of data privacy, integration with existing workflows, and compliance with regulations must also be addressed. The system therefore needs to incorporate state-of-the-art technologies for processing high-resolution imaging datasets, producing comprehensive reports, and fitting seamlessly into existing medical practice.
The ultimate goal is to deliver a tool that supports radiologists by providing a high level of quality, reliability, and efficiency in reporting.

Chapter 4 Existing System

Deep learning based radiology reporting builds on models such as R2Gen, convolutional neural networks (CNNs), Vision Transformers, and other Transformer variants, each with its own strengths. The R2Gen model is designed for coherent and detailed radiology reporting. It leverages a CNN for image feature extraction and an RNN for sequential text generation. In essence, it can extract fine details from images and turn them into correct, understandable reports, saving radiologists considerable time and effort. CNNs are a mainstay in medical image analysis; they are very good at learning the spatial hierarchies that features exhibit in images and at picking out the regions relevant for diagnosis. Combined with techniques from NLP, they yield systems that generate descriptive reports carrying valuable diagnostic insight; however, CNNs often require help from other models to generate such text effectively.

Vision Transformers (ViTs) offer a different angle on processing visual data. Whereas CNNs focus on local features, ViTs capture long-range dependencies across the entire image using self-attention mechanisms, providing knowledge of the big picture and fine details at once. In automated radiology report generation, the rich, high-level features that ViTs extract from medical images are then translated into text using NLP models; this combination has already shown impressive results in producing accurate and complete reports. Such models scale the importance of each word in a sentence with respect to the others using the self-attention mechanism, which captures complex dependencies and contextual nuances very effectively. This is especially useful in medical language processing, where precise terminology and context matter. For example, BioBERT is a variant of BERT developed specifically for processing biomedical text. It is fine-tuned on large-scale biomedical corpora, infusing it with medical terminology and context, an invaluable asset for tasks such as radiology report generation. Combined with Transformer-based models, it can enhance the accuracy and relevance of generated radiology reports.

Radiology report generation involves rendering the complicated visual information of medical images into detailed textual descriptions. Traditional methods were mostly rule-based or used simplistic machine learning approaches that lacked the capacity to capture the intricacies of medical language and the detailed descriptions required for clinical documentation. A transformer-based encoder-decoder architecture offers a much more sophisticated and efficient way to carry out this task. The encoder component processes the visual features of the medical images, capturing intricate details and patterns; this is usually done via specialized visual encoders such as Vision Transformers or Swin Transformers, which are adept at handling spatial hierarchies and features within images. The decoder component bases the textual description on the visual features extracted and processed by the encoder.
The decoder is often grounded in a language model such as BioBERT, which can translate the visual data into coherent, contextually appropriate medical language. The interaction of these two components provides an end-to-end mapping from visual features to textual descriptions, ensuring that generated reports are not only accurate but clinically meaningful. The Transformer-based approach helps ensure that the generated reports are comprehensive, meet clinical standards, capture the complexity of medical terminology, and provide contextually relevant descriptions that enrich the report for the medical professional. It surpasses traditional approaches by combining deep understanding of visual and textual data in a manner that makes radiology reports more accurate and contextually appropriate.

Figure 4.1: Overview of the ARRG system workflow.

Each of these approaches has particular strengths: CNNs are excellent at extracting local features, ViTs excel at understanding the whole image, and Transformer-based language models bring advanced NLP capabilities. R2Gen effectively fuses the best of both worlds to produce detailed and coherent reports. As these technologies develop further and are coupled with state-of-the-art NLP techniques, still greater improvements in the accuracy, efficiency, and reliability of automated radiology report generation become possible. This would improve patient care, streamline diagnostic workflows, and make radiologists' work easier and more effective, with the promise that more refined models can bring deeper insights, finer diagnosis, better patient outcomes, and more efficient healthcare.

Chapter 5 Proposed System

Using high-resolution medical imaging data to build an automated radiology report generation system offers a transformative leap for radiology. The system is designed to produce accurate and comprehensive radiology reports by integrating sophisticated machine learning techniques with advanced image and text analysis, empowering radiologists in both diagnosis and documentation tasks. Core to this system are high-resolution medical imaging datasets, critical to the training, validation, and testing of the model. The depth and detail contained within these datasets let the model learn complicated patterns and features in medical images. Extensive preprocessing is applied to these datasets to improve the robustness and accuracy of the model: image normalization, which brings the images into a standard format, and image augmentation, which introduces variations that help the model generalize from diverse image samples. This means that variation in image data and real-world anomalies can be handled by the model.

At the root of the system's feature extraction capability lies the Swin Transformer, one of the top-performing neural network architectures for image analysis tasks. The Swin Transformer helps with fine-detail identification and segmentation in high-resolution medical images, most notably of individual anatomical structures and pathological findings.
It extracts features in a hierarchical manner, which allows the model to represent both local and global context from images. This characteristic makes it effective for processing the detailed and nuanced information contained in medical scans, and it is therefore critically important for the correct identification and interpretation of the varied and complex features within the images being examined. The Swin Transformer for feature extraction is supplemented with BioBERT, a domain-specific natural language processing model tailored for biomedical applications. BioBERT harnesses the power of BERT, fine-tuned on large-scale biomedical text corpora, to generate detailed and contextually relevant radiology reports based on the features extracted by the Swin Transformer. This combination lets the system create logically complete, clinically meaningful reports containing a review of findings, diagnostic insight, and recommendations for further action. The reporting process closely mirrors the way radiologists process and document reports, assuring outputs that are not only accurate but also clinically relevant.

The automated radiology report generation system offers several major benefits. First, it eases the load on radiologists by taking over report generation, freeing up precious time for more critical tasks and direct patient care. Second, the system minimizes the risk of human error resulting from fatigue, oversight, or subjective interpretation. Standardized and homogeneous reports improve the quality and reliability of diagnostic documentation; this consistency is of the greatest value in maintaining the best possible quality of care for patients through clear, detailed, and actionable reports.

In conclusion, a radiology report generation system using high-resolution medical imaging and state-of-the-art machine learning techniques is developed as a robust, reliable system for radiology reporting. It is an end-to-end accurate and clinically relevant report generation technique, incorporating a Swin Transformer for feature extraction and BioBERT for natural language processing. This innovation promises to improve the efficiency of radiology practice and the quality of diagnostic reports across the board, enhancing patient care. It signals a major step toward the integration of artificial intelligence into medical imaging and sets a new standard for the future of radiology and healthcare.

Chapter 6 Methodology

6.1 Design

The design is a solution for automatic radiology report generation that uses high-resolution medical images and incorporates advanced techniques in computer vision and natural language processing. It has five major parts.

The first is image pre-processing, which includes retrieval and preparation of high-resolution medical images for training and validation. This generally involves normalizing the images, augmenting the data to increase variability, and annotating key features and findings. The second step sets up a pre-trained Swin Transformer for feature extraction; pre-trained on a large medical image dataset, the Swin Transformer can encode complex patterns and relationships within the images.
The third step involves custom training of the Swin Transformer. The Swin Transformer is fine-tuned on the dataset prepared earlier so that it accurately extracts the pertinent features and identifies the various anatomical structures and pathological findings. The fourth builds on BioBERT's pretraining on medical texts, which supports the generation of more coherent, contextually accurate reports: the features retrieved from the Swin Transformer are passed to BioBERT, which is trained to generate detailed and clinically meaningful reports. The fused model thereby learns how to translate visual features into descriptive text. The trained model can then be used for inference, i.e., translating new medical images into radiology reports. Concretely, a new input medical image fed to the pre-trained model is processed by the Swin Transformer for feature extraction, and the resulting features are passed to the BioBERT model to generate the textual report. The output consists of a detailed summary of findings, diagnostic insight, and recommendations. A fifth step, fine-tuning and evaluation, compares the generated test reports against the ground truth reports in order to measure the model's precision, recall, and F1 score and to refine it wherever accuracy is questionable. This step is iterative and ongoing, allowing the model to adapt to new test data and to radiologists' feedback.

Figure 6.1: Architecture of Proposed SwinTRG Model

This technique automates the generation of radiological reports, providing accurate and detailed reports that improve the quality and efficiency of diagnosis. It brings a modern radiology workflow solution through the combination of the state-of-the-art Swin Transformer for image analysis and BioBERT for natural language processing.

6.1.1 Visual Extractor

In this work, the methodology centers on integrating the Swin Transformer for feature extraction within the context of automated radiology report generation. The Swin Transformer is one of the most advanced models for handling high-resolution images, which it does with remarkable precision and efficiency. First, meaningful visual representations are extracted from the medical images with the Swin Transformer. The Swin Transformer is known for its effectiveness in capturing hierarchical and global contextual features of images, making the architecture well suited to tasks that require fine-grained image understanding. SwinTRG exploits this to understand the fine-grained details in radiological scans, using the Swin Transformer architecture to encode visual properties from medical images. The Swin Transformer is very good at modeling long-range dependencies and semantic relationships within images, which is important for the correct identification and interpretation of clinically relevant patterns. Its multi-scale feature extraction ability lets it work on fine details as well as broader contextual information, increasing accuracy in generating radiology reports. These visual features are extracted and processed by the Swin Transformer before being fed into the transformer architecture for training.
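As a concrete illustration of this extraction step, the following minimal sketch loads a pretrained Swin model with the HuggingFace Transformers library and produces a feature sequence for one chest X-ray. The checkpoint name and image path are assumptions for illustration, not the dissertation's exact configuration.

# Sketch: hierarchical feature extraction with a pretrained Swin Transformer.
# Checkpoint name and file path are illustrative assumptions.
from PIL import Image
import torch
from transformers import AutoImageProcessor, SwinModel

processor = AutoImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224")
encoder = SwinModel.from_pretrained("microsoft/swin-base-patch4-window7-224")

image = Image.open("chest_xray.jpg").convert("RGB")   # hypothetical input image
inputs = processor(images=image, return_tensors="pt")  # resize + normalize

with torch.no_grad():
    outputs = encoder(**inputs)

# Sequence of patch-level features fed onward to the report generator:
visual_features = outputs.last_hidden_state   # shape: (1, num_patches, hidden_dim)
print(visual_features.shape)

The feature sequence, rather than a single pooled vector, is what the downstream modules attend to, preserving spatial detail from the scan.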
The transformer architecture lies at the core of the SwinTRG model and allows the extracted visual features to be integrated with semantic encoding and report generation. Through transformer-based training, SwinTRG learns how to correlate observation intentions with visual features and to generate coherent textual descriptions of the observations, thereby fully automating the radiology report generation process. In short, the methodology described here harnesses the Swin Transformer's strengths for feature extraction in automated radiology report generation; SwinTRG is proposed as a blueprint that can transform the efficiency and accuracy of medical reporting procedures through the integration of Swin Transformer based visual extraction and transformer-based training.

Figure 6.2: Architecture of Swin Transformer

6.1.2 Semantic Embedding

The semantic embedding module is an important component of the medical report generation process, mapping visual features to semantic representations in line with specific observation intentions. Unlike traditional state transition patterns, the approach uses a Transformer-based semantic encoder. Concretely, it starts with a set of learnable observation intention embeddings, each corresponding to an implicit intention in observing medical images. These intentions act as queries over the visual features extracted from the images, mapping important visual information into a semantic domain and generating semantic features that capture the essence of what has been seen in the medical images. The semantic encoder module comprises a number of Transformer encoder blocks. Each block is equipped with a Multi-Head Self-Attention layer for merging observation information, considering the relationships and differences between observation intentions; a Multi-Head Cross-Attention layer for computing attention weights between visual features and intention embeddings, ensuring that visual features related to the observation intentions are highlighted; and a Multi-Layer Perceptron layer for further processing to refine the semantic representations.

6.1.3 Report Generator

The semantic feature set holds the semantic encoding results relevant to the observation intentions. The semantic features are transformed into sentences, after which the most useful sentences must be selected and sorted to form the medical report. This module therefore includes three subtasks: text generation, text selection, and text sorting. Text generation creates coherent and clinically relevant sentences based on the semantic features, describing the observed medical findings. Text selection picks the most informative and salient sentences from the generated text pool, ensuring that the selected sentences capture the essence of the observed images and align with the intended diagnostic objectives. Text sorting organizes the chosen sentences logically and coherently to create one comprehensive medical report that presents the findings in a structured, clinically relevant manner.
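To make Section 6.1.2 concrete, here is a minimal sketch of one semantic encoder block: intention embeddings self-attend, cross-attend to the visual features, and pass through an MLP. The dimensions and the number of intention queries are illustrative assumptions, not the dissertation's exact configuration.

# Sketch of one Transformer block of the semantic encoder (Section 6.1.2).
# Dimensions and query count are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticEncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, intentions, visual_features):
        # Merge observation information across intention embeddings.
        x = self.norm1(intentions + self.self_attn(intentions, intentions, intentions)[0])
        # Highlight visual features related to each observation intention.
        x = self.norm2(x + self.cross_attn(x, visual_features, visual_features)[0])
        # Refine the semantic representations.
        return self.norm3(x + self.mlp(x))

# Usage: 16 learnable intention queries over a sequence of Swin features.
intentions = torch.randn(1, 16, 768)       # hypothetical intention embeddings
visual_features = torch.randn(1, 49, 768)  # hypothetical Swin feature sequence
semantic_features = SemanticEncoderBlock()(intentions, visual_features)

Each output row of semantic_features corresponds to one observation intention and is the input to the sentence-level report generator of Section 6.1.3.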
6.2 Data Preparation

6.2.1 Dataset for Chest X-Ray Report Generation

Medical report generation is commonly evaluated on two well-known benchmarks, MIMIC-CXR and IU X-Ray; in this study we make use of the IU X-Ray dataset. The IU X-Ray dataset, also known as the Indiana University Chest X-ray Collection, is a comprehensive dataset widely used in the medical and machine learning communities for research on automated radiology report generation and medical image analysis. It contains over 7,000 chest X-ray images, each associated with a detailed radiology report. These reports describe findings, impressions, and recommendations, offering a rich source of annotated data for developing and evaluating models in medical imaging. The dataset includes images collected from a diverse patient population covering a wide range of conditions and pathologies, making it a valuable resource for developing machine learning models that aim to automate the interpretation of chest X-rays. The IU X-Ray dataset is particularly notable for its detailed and structured report format, which enables researchers to explore natural language processing in conjunction with medical image analysis, fostering advances in computer-aided diagnosis and report generation.

Preparing the IU X-ray dataset for tasks such as automated radiology report generation involves several key steps. Start by downloading the dataset, which includes chest X-ray images in DICOM format along with textual reports. Convert the DICOM images to JPEG format, maintaining a resolution of 2048x2048 pixels. Normalize the pixel values to a range of [0, 1] or [-1, 1], and consider applying data augmentation techniques to enhance variability and improve model robustness.

Normalization is a crucial preprocessing step for image data in machine learning models. It enhances model performance by ensuring that all input features contribute equally to the learning process and aids convergence during training. Common normalization ranges are [0, 1] and [-1, 1]. When normalizing to [0, 1], each pixel value, originally ranging from 0 to 255, is divided by 255; this range is compatible with activation functions such as sigmoid or softmax, which expect inputs in this domain. Normalization to [-1, 1] first scales the pixel values to [0, 1] and then maps them to [-1, 1] by multiplying by 2 and subtracting 1; this suits models using hyperbolic tangent (tanh) activations and centers the data around zero, potentially leading to faster convergence.

Data augmentation increases the variability of a training dataset by applying random transformations to images, improving model robustness and generalization. It helps prevent overfitting by diversifying the dataset with variations the model might encounter in real-world scenarios. Common techniques include rotation, translation, scaling, flipping, and adjusting brightness and contrast; additional methods such as noise injection simulate different conditions and make the model more resilient to noisy inputs.
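The normalization arithmetic and augmentations just described can be expressed as a small torchvision pipeline. This is a minimal sketch under the assumption that torchvision is used; the dissertation names PyTorch/TensorFlow loaders but no specific augmentation library.

# Sketch: [0, 1] and [-1, 1] normalization plus common augmentations
# using torchvision. The library choice is an assumption.
from torchvision import transforms

# ToTensor() maps pixel values from [0, 255] to [0, 1] (x / 255).
to_unit_range = transforms.ToTensor()

# Normalize(mean=0.5, std=0.5) maps [0, 1] to [-1, 1]: (x - 0.5) / 0.5 = 2x - 1.
to_signed_range = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),  # single channel (grayscale X-ray)
])

# Training-time augmentation: small rotations/translations/scaling,
# flips, and brightness/contrast jitter, as listed above.
train_transform = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])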
By integrating these techniques during training, models can learn more generalized features, enhancing their performance and adaptability to diverse data. At the same time, preprocess the textual reports by cleaning the text to remove extraneous characters and standardize terminology. Tokenize the text using tools such as NLTK or SpaCy, and convert it into a format compatible with the model, such as embeddings from BioBERT. Ensure each JPEG image has a corresponding report label describing the findings, and create a clear mapping between images and reports. Divide the dataset into training, validation, and test sets, commonly using a 7:1:2 split ratio; the partition used here is shown in Table 6.1.

Table 6.1: Image Dataset partition (IU X-Ray)
Training      5239
Validation     583
Testing        647

Organize the dataset with a structured directory format, separating images and reports into distinct folders for easy access. Use data loaders from libraries such as PyTorch or TensorFlow to efficiently batch and load the JPEG images at 2048x2048 resolution along with the textual reports for training. This thorough preparation ensures the model can learn effectively from the data and perform accurately.

Figure 6.3: Sample image and corresponding label of the IU X-Ray dataset

Chapter 7 Implementation

The implementation phase comprises a series of steps for dataset preparation and processing, initialization of pre-trained models, and data transformation for training. These steps ensure that the model can learn to generate accurate and contextually relevant radiology reports from chest X-ray images. The data preprocessing stage aligns images with their radiology reports. It merges two CSV files, one containing the filenames of the images and their unique identifiers and another containing the radiology reports; matching on the unique identifiers creates a new dataset in which each image is linked to its relevant report. This guarantees correct labeling of the data for training a model that generates meaningful diagnostic reports. The dataset is arranged in two columns: 'imgs', holding the file paths of the chest X-ray images, and 'captions', holding the radiology reports. This allows easy access to, and manipulation of, the data during feature extraction and text generation. All images are in JPEG format and are normalized to a resolution of 2048x2048 pixels to maintain the high-quality visuals needed for accurate feature extraction; high resolution matters particularly for medical images, whose fine details are necessary for diagnosis.

Pre-trained models are used to build on their strong performance in image recognition and biomedical text generation. The Swin Transformer model serves as the encoder to extract essential features from the chest X-ray images; it is chosen for its good performance on visual tasks and its ability to capture as many relevant features as possible from the image. The BioBERT language model, fine-tuned for biomedical text, plays the role of the decoder and generates coherent, contextually relevant radiology reports. Using BioBERT ensures that the generated text is not only grammatically correct but also medically accurate and relevant. First, the file paths are updated to the right directory structure so that the images are easy to access.
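A minimal sketch of the CSV merge and path setup described above, using pandas; the file names and column names are hypothetical, since the dissertation does not list them.

# Sketch: merge image-filename and report CSVs on a shared identifier.
# File names and column names are hypothetical.
import pandas as pd

images = pd.read_csv("indiana_projections.csv")   # e.g. columns: uid, filename
reports = pd.read_csv("indiana_reports.csv")      # e.g. columns: uid, findings

# Match each image to its report via the unique identifier.
merged = images.merge(reports, on="uid", how="inner")

# Two-column layout used downstream: image paths and report text.
dataset = pd.DataFrame({
    "imgs": "images/" + merged["filename"],   # prepend the image directory
    "captions": merged["findings"],
})
dataset.to_csv("image_report_pairs.csv", index=False)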
A maximum length is set for the captions so that they are consistent for text generation; each caption is truncated or padded to the predefined length. That is to say, all captions are kept within a fixed length for efficient tokenization and model training. Finally, feature extraction and tokenization are applied to the images and their captions to transform them into formats the model can train on. The Swin Transformer model's feature extractor turns the chest X-ray images into tensor representations that fit the neural network architecture. Concurrently, the tokenizer processes the report text, converting it into token sequences understood by BioBERT. This dual transformation puts the data in the right form to serve as input to the Vision-Encoder-Decoder model, making it possible to train the models to generate accurate and contextually relevant radiology reports from chest X-ray images.

This vision-encoder-decoder model combines state-of-the-art architectures for image analysis and descriptive text generation. Here, the Swin Transformer is used as the encoder and BioBERT as the decoder for processing and interpreting medical images and generating descriptive text. The Swin Transformer is applied for its powerful image feature extraction capability, while BioBERT, a variant of BERT pre-trained on biomedical texts, generates coherent and contextually relevant text outputs.

The Swin Transformer works by dividing the input images into fixed-size patches, which are then fed through a series of stages. Every stage comprises a number of Swin layers applying self-attention mechanisms that capture long-range contextual relationships among image patches, making it well suited to large images such as those at 2048x2048 resolution. It has an embedding dimension of 96 and processes images through 12 layers with 12 attention heads per layer. A drop path rate of 0.1 regularizes the model against overfitting. The architecture also includes an Adaptive Average Pooling layer, responsible for reducing dimensionality while retaining the important information in the feature maps; this pooling layer plays a key role in condensing the spatial information extracted from high-resolution images for the decoder. The Swin Transformer is thus powerful enough to handle demanding medical imaging tasks, owing to its flexibility with input sizes and the hierarchical manner of its feature extraction. The full configuration is given in Table 7.1.

Table 7.1: Configuration of the SwinTRG model.
Component                  Configuration
ENCODER                    Swin Transformer
IMAGE RESOLUTION           2048x2048
PATCH SIZE                 4x4
EMBEDDING DIM              96
NUM LAYERS                 12
NUM HEADS                  12
DROP PATH RATE             0.1
POOLER                     AdaptiveAvgPool1d
DECODER                    BioBERT
VOCAB SIZE                 28996
HIDDEN SIZE                768
NUM ATTENTION HEADS        12
NUM HIDDEN LAYERS          12
MAX POSITION EMBEDDINGS    512
DROPOUT RATE               0.1
LEARNING RATE              0.001
BATCH SIZE                 8
EPOCHS                     10
OPTIMIZER                  Adam
LOSS FUNCTION              CrossEntropyLoss
TRAIN STEPS PER EPOCH      100
VALIDATION STEPS           10

Figure 7.1: Feature maps of the Swin Transformer
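Expressed in code, the encoder-decoder pairing described in Table 7.1 might be assembled as in the sketch below, using HuggingFace's VisionEncoderDecoderModel; the checkpoint identifiers and maximum length are assumptions for illustration, not the exact values used in the dissertation.

# Sketch: pairing a Swin encoder with a BioBERT decoder as a
# vision-encoder-decoder model. Checkpoint names are assumptions.
from transformers import (AutoImageProcessor, AutoTokenizer,
                          VisionEncoderDecoderModel)

encoder_ckpt = "microsoft/swin-base-patch4-window7-224"
decoder_ckpt = "dmis-lab/biobert-base-cased-v1.1"

image_processor = AutoImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)

# The decoder is loaded with cross-attention so it can attend to image features.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_ckpt, decoder_ckpt
)

# Special-token and generation settings expected by the seq2seq wrapper.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.max_length = 128  # assumed maximum caption length

During training, the model receives pixel_values from the image processor and tokenized reports as labels, and is optimized with the cross-entropy loss listed in Table 7.1.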
On the other hand, the BioBERT decoder is specialized in text generation, more precisely in the biomedical domain. It builds on a BERT architecture with a hidden size of 768 and 12 layers, each with 12 attention heads. With this setting, the model is able to generate textual descriptions that capture the full complexity of the features extracted by the Swin Transformer. The decoder employs a vocabulary of 28,996 tokens, allowing it to handle a wide range of biomedical terms and phrases. The model is trained with a batch size of 8, the Adam optimizer, and a learning rate of 0.001, over 20 epochs, with the cross-entropy loss function guiding learning. There are 100 training steps per epoch, with validation steps every 10 steps to monitor learning and check performance.

Such a combined approach exploits the advantages of both the Swin Transformer and BioBERT for sophisticated image and text processing tasks: the Swin Transformer handles the high-resolution images, while BioBERT generates the domain-specific text. This provides a robust framework for medical image analysis with automated report generation, and careful tuning of hyperparameters and training settings ensures model performance across a wide variety of tasks.

Chapter 8 Results and Discussion

Evaluation Metrics

Evaluation metrics for text similarity are essential tools for assessing how closely generated or retrieved text matches a reference text. Common metrics include BLEU, ROUGE, METEOR, and cosine similarity. BLEU (Bilingual Evaluation Understudy) measures the precision of the n-grams in the candidate text that match the reference text, and includes a brevity penalty for short outputs. ROUGE measures recall based on the overlap of n-grams, word sequences, and word pairs between candidate and reference texts; its several variants are popular for summarization tasks. METEOR tries to overcome BLEU's limitations by considering synonyms, stemming, and word order, giving a finer-grained assessment. Cosine similarity, widely applied in vector space models, measures the cosine of the angle between two text vectors, representing their similarity in direction. These metrics provide different views of textual similarity, and using a combination of them affords a more comprehensive evaluation of text quality, covering precision, recall, semantic meaning, and lexical similarity.

BLEU (Bilingual Evaluation Understudy)

The BLEU score is designed to evaluate the quality of text generated by machine translation models by comparing the generated text against one or more reference translations. This is done through the calculation of n-grams, contiguous sequences of n words in the candidate text that also occur in the reference texts. BLEU measures precision for unigrams, bigrams, trigrams, and usually up to 4-grams, and adds a brevity penalty to prevent very short translations from being rated as good. The final score is the geometric mean of the n-gram precisions, multiplied by the brevity penalty if the candidate text is shorter than the references. Although the BLEU score is very helpful in evaluation, it has deficiencies: it focuses on n-gram overlap and does not ensure consistency of overall meaning and context.
For this reason, there are many cases in which BLEU scores are not consistent with human evaluations of translation quality. Even so, higher BLEU scores do indicate better performance, though they do not match human judgment exactly. Therefore, while BLEU is extremely useful for comparing machine translation systems, it should be used together with other evaluation methods for a comprehensive assessment of text quality.

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \tag{8.1}

where BP is the brevity penalty, N is the maximum n-gram length, w_n are the weights for each n-gram length, and p_n is the modified precision for n-grams of length n. The commonly used variants are BLEU-1, BLEU-2, BLEU-3, and BLEU-4. BLEU-1 measures unigram precision, the fraction of words in the candidate text that also appear in the reference texts; BLEU-2 measures bigram precision over sequences of 2 contiguous words; BLEU-3 measures trigram precision over sequences of 3 contiguous words; and BLEU-4 measures 4-gram precision over sequences of 4 contiguous words.

Quantitative Results

Table 8.1: Comparison results of the SwinTRG model on the IU X-Ray dataset.
Method                               BLEU-1  BLEU-2  BLEU-3  BLEU-4
CoAtt (Jing et al., 2018)            0.455   0.288   0.205   0.154
HRGR (Li et al., 2018)               0.438   0.298   0.208   0.151
KERP (Li et al., 2019)               0.482   0.325   0.226   0.162
Trans (Cornia et al., 2020)          0.437   0.290   0.205   0.152
CMAS-RL (Jing et al., 2020)          0.464   0.301   0.210   0.154
GDGPT (Alfarghaly et al., 2021)      0.387   0.245   0.166   0.111
Transformer (Chen et al., 2020)      0.396   0.254   0.179   0.135
CMCL (Liu et al., 2021b)             0.473   0.305   0.217   0.162
R2Gen (Chen et al., 2020)            0.470   0.304   0.219   0.165
MedWriter (Yang et al., 2021)        0.471   0.336   0.238   0.166
PPKED (Liu et al., 2021a)            0.483   0.315   0.224   0.168
AlignTransformer (You et al., 2021)  0.484   0.313   0.225   0.173
KGAE (Liu et al., 2021b)             0.519   0.331   0.235   0.174
KnowMT (Yang et al., 2022)           0.496   0.327   0.238   0.178
ITA (Wang et al., 2022a)             0.505   0.340   0.247   0.188
DCL (Li et al., 2023)                --      --      --      0.163
Multicriteria (Wang et al., 2022b)   0.496   0.319   0.241   0.175
TranSQ (Gao et al., 2024)            0.516   0.365   0.272   0.205
SwinTRG (ours)                       0.461   0.377   0.325   0.300

Quantitative analysis of the SwinTRG model on the IU X-Ray dataset shows marked improvements in natural language generation metrics, underlining its ability to generate descriptive and coherent radiology reports. SwinTRG produces a BLEU-1 score of 0.461, marginally below some leading models such as KGAE at 0.519 and ITA at 0.505, but still representative of strong performance. The real strength of SwinTRG lies in the higher-order BLEU metrics: it surpasses the other models on BLEU-2, BLEU-3, and BLEU-4 with scores of 0.377, 0.325, and 0.300 respectively. These higher BLEU scores indicate that SwinTRG is not only accurate but also contextually consistent and detailed across longer sequences of text, which matters greatly in radiology report generation. Better BLEU-2, BLEU-3, and BLEU-4 performance reflects coherence and relevance over longer spans of text, a desideratum for medical reports that must convey complex information.
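For reference, BLEU-1 through BLEU-4 as reported in Table 8.1 can be computed with NLTK along the following lines; the use of NLTK and the sample sentences are assumptions for illustration, not the dissertation's actual evaluation code.

# Sketch: computing BLEU-1..BLEU-4 for generated reports with NLTK.
# Tooling and sample texts are illustrative assumptions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["low", "lung", "volumes", "are", "present"]]]  # one reference list per sample
candidates = [["low", "lung", "volumes", "are", "present"]]    # model outputs, tokenized

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = corpus_bleu(references, candidates, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")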
The integration of Swin-ViT and BioBERT in SwinTRG is a real advance in radiology report generation. Swin-ViT, the Swin Vision Transformer, has a strong ability to model hierarchical patterns and spatial relationships in the data, which suits the extraction of complicated features from medical images. This is important for understanding details in the complex structures of medical imagery, which often involve subtle variations or intricate structures that standard approaches struggle to discern. When combined with BioBERT, which is designed for biomedical text, SwinTRG draws on the strengths of both parts to produce accurate and contextually relevant radiology reports. BioBERT is excellent at understanding and producing medical wording, ensuring that the textual descriptions produced by SwinTRG are not only precise but also clinically meaningful. The synergy between visual feature extraction by Swin-ViT and language generation by BioBERT yields detailed and coherent reports that fully capture the complexity of the conditions depicted in the images.

This is further supported by SwinTRG's better results on the higher-order BLEU metrics compared with existing models. BLEU (Bilingual Evaluation Understudy) quantifies the quality of generated text against one or more reference texts; SwinTRG's superiority on these higher-order metrics demonstrates its efficacy on longer, more complicated sentences of the kind that meet clinical standards. This improves the overall accuracy and detail of the medical documentation, rendering complete and helpful reports to radiologists. Through the improvement in quality and depth of radiological reports, SwinTRG supports enhanced diagnostic processes and better delivery of healthcare.

Original Images | Predicted Text | Ground Truth (the image column is not reproducible in this transcript)

Predicted: Low lung volumes are present. The heart size and pulmonary vascularity appear within normal limits. Bandlike opacities are present in the right lung. Appearance suggest atelectasis. No pneumothorax or pleural effusion is seen.
Ground truth: Low lung volumes are present. The heart size and pulmonary vascularity appear within normal limits. The lungs are free of focal airspace disease. No pleural effusion or pneumothorax is seen. Degenerative changes are present in the spine.

Predicted: There is a small area of scarring or atelectasis in the left base. Calcified granulomas seen in the posterior right lower lobe. Lungs are otherwise clear. The heart and mediastinum are normal. The skeletal structures and soft tissues are normal.
Ground truth: The lungs are clear. There is no pleural effusion or pneumothorax. The heart and mediastinum are normal. The skeletal structures are normal.

Predicted: Heart size normal. Lungs are clear. XXXX are normal. No pneumonia, effusions, edema, pneumothorax, adenopathy, nodules or masses.
Ground truth: Lungs are clear. Heart size normal. The XXXX are unremarkable.

Table 8.2: Table showing images, predicted text, and ground truth
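For context, predicted text such as that in Table 8.2 would be produced by a generation call along the following lines; this sketch assumes the HuggingFace setup from Chapter 7 (image_processor, model, tokenizer), and the file path and decoding settings are hypothetical.

# Sketch: generating a report for one chest X-ray with the trained model.
# Continues the Chapter 7 sketch; path and decoding settings are assumptions.
from PIL import Image
import torch

image = Image.open("test/sample_cxr.jpg").convert("RGB")  # hypothetical test image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    generated_ids = model.generate(
        pixel_values,
        max_length=128,  # matches the assumed caption length
        num_beams=4,     # beam search for more fluent long-form text
    )

report = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(report)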
Chapter 9 Conclusion

SwinTRG advances the state of the art in image processing and natural language processing, representing a powerful fusion for the creation of automated radiology reports. Spanning the wide gap from complex medical imaging to coherent, clinically relevant text generation, SwinTRG pairs the powerful Swin Transformer with BioBERT, pointing toward a sea change in radiology report generation with enhanced accuracy and utility in medical documentation. With its sophisticated architecture, the Swin Transformer excels at capturing fine-grained local and global contextual information from high-resolution medical images. This is important for comprehending subtle details and patterns within medical imagery, which often carry serious implications for diagnosis. Its analysis of high-resolution images allows the Swin Transformer to avoid missing details in visual data interpretation, and this granular analysis supports detail-oriented, accurate radiology reports, since the model attends to minute features that matter for clinical decision-making.

The Swin Transformer is paired with BioBERT for text generation. BioBERT, a variant of BERT fine-tuned for biomedical text, is very good at comprehending and generating complicated medical terminology. This ensures that the textual descriptions generated by SwinTRG are accurate and clinically relevant. The sophisticated language processing abilities of BioBERT make it possible to generate full reports that are appropriate to the context and, in form and content, in line with medical standards and practice. The incorporation of BioBERT allows the SwinTRG model to produce comprehensive radiology reports suitable for clinical application, improving quality and relevance.

By putting together the latest techniques in image processing with state-of-the-art text generation, SwinTRG achieves accuracy, consistency, and relevance. Large high-resolution datasets are used together with a Transformer-based semantic encoder to make the model robust and high-performing. This increases the efficiency of radiology report generation and ensures that the reports meet the highest standards of medical documentation. Moreover, as a radiology workflow automation tool, SwinTRG exemplifies a step forward in the application of artificial intelligence in healthcare: by automating radiology report generation it reduces the manual effort radiologists put into writing reports, allowing more concentration on clinical analysis and decision-making. The improved efficiency and quality of medical reporting in turn afford better diagnostic accuracy and better outcomes for patients.

SwinTRG therefore represents an extremely promising solution for automated radiology report generation. The proposed combination of the Swin Transformer and BioBERT should lead to improvements in the efficiency and quality of medical reporting, supporting the development of diagnostics and patient care. The model represents a major step toward harnessing artificial intelligence to advance medical documentation, showing the potential to revolutionize radiology workflows and contribute to more effective healthcare delivery. The marriage of sophisticated image processing and natural language processing techniques in SwinTRG marks a milestone in this field, giving a strong and trustworthy tool for the future of radiology and beyond. The future scope of SwinTRG includes further optimization and applications across a wide range of medical imaging modalities beyond radiology.
The future scope of SwinTRG lies in further optimization and in applications across a wide range of medical imaging modalities beyond radiology. The model architecture can be extended and tuned for other medical specialties, such as cardiology, pathology, and dermatology, where high-resolution images are routinely used for diagnosis. Incorporating additional knowledge domains and datasets could make SwinTRG more versatile and accurate across these fields, expanding its applicability while building a more holistic approach to automated medical documentation.

SwinTRG would also gain from real-time data processing capabilities. Given the ongoing digital transformation in healthcare, real-time image analysis and reporting could prove highly valuable, especially in emergency cases where immediate diagnosis is crucial. Hardware and software innovations open up the possibility of developing SwinTRG to support real-time decision-making, helping clinicians deliver timely and precise interventions. Such developments would need to be made in collaboration with healthcare professionals and technology firms so that they integrate seamlessly into existing workflows and systems.

Another promising avenue for further research is enhancing the interpretability and transparency of SwinTRG. The more complex an AI model becomes, the more important it is to understand how its decision-making unfolds, especially in the medical domain. Developing techniques to visualize and explain how SwinTRG arrives at its conclusions would help build the trust of health professionals and support regulatory approval; a minimal sketch of one such visualization follows below. Finally, creating a collaborative environment in which clinicians and AI systems work together could drive iterative improvements in model performance and encourage the adoption of AI-driven solutions in clinical practice.
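As one concrete illustration of the interpretability direction above, and a sketch under assumptions rather than an implemented SwinTRG feature, the decoder's cross-attention weights can be projected back onto the radiograph to show which image regions influenced a generated token. The sketch reuses the hypothetical fine-tuned model and processor from the pairing sketch in the conclusion, and the file name example_cxr.png is a placeholder.

# A minimal sketch (not part of SwinTRG's codebase) of projecting the
# decoder's cross-attention back onto the radiograph, assuming the
# fine-tuned Swin+BioBERT model and processor from the previous sketch
# are available in scope.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

xray_image = Image.open("example_cxr.png").convert("RGB")  # hypothetical file

pixel_values = processor(xray_image, return_tensors="pt").pixel_values
out = model.generate(pixel_values, max_length=64,
                     output_attentions=True,
                     return_dict_in_generate=True)

# cross_attentions holds one tuple per generated token, each containing
# per-layer tensors of shape (batch, heads, 1, num_patches); take the
# last generated token and the last decoder layer, averaged over heads.
last_layer = out.cross_attentions[-1][-1]
attn = last_layer[0].mean(0)[0].detach().numpy()

# A 224x224 Swin input ends with a 7x7 grid of patch tokens, so the
# scores reshape to a square grid and upsample to overlay the input.
side = int(np.sqrt(attn.size))
heatmap = attn.reshape(side, side)
overlay = np.kron(heatmap, np.ones((224 // side, 224 // side)))

plt.imshow(xray_image.resize((224, 224)), cmap="gray")
plt.imshow(overlay, alpha=0.4)
plt.axis("off")
plt.show()

Saliency overlays of this kind are a starting point rather than a full explanation, but they give clinicians a quick visual check that the model grounded a generated sentence in the relevant anatomy.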
References

Y. Wang, B. Du, W. Wang, and C. Xu, "Multi-tailed vision transformer for efficient inference," 2024. arXiv: 2203.01587 [cs.CV]. Available: https://arxiv.org/abs/2203.01587.

Z. Wang, L. Liu, L. Wang, and L. Zhou, "R2GenGPT: Radiology report generation with frozen LLMs," Meta-Radiology, vol. 1, no. 3, p. 100033, 2023, ISSN: 2950-1628. DOI: 10.1016/j.metrad.2023.100033. Available: https://www.sciencedirect.com/science/article/pii/S2950162823000334.

G. Liu, Y. Liao, F. Wang, et al., "Medical-VLBERT: Medical visual language BERT for COVID-19 CT report generation with alternate learning," CoRR, vol. abs/2108.05067, 2021. arXiv: 2108.05067. Available: https://arxiv.org/abs/2108.05067.

X. Chen, X. Wang, S. Changpinyo, et al., "PaLI: A jointly-scaled multilingual language-image model," 2023. arXiv: 2209.06794 [cs.CV]. Available: https://arxiv.org/abs/2209.06794.

N. Ghamsarian, J. G. Tejero, P. M. Neila, et al., "Domain adaptation for medical image segmentation using transformation-invariant self-training," 2023. arXiv: 2307.16660 [cs.CV]. Available: https://arxiv.org/abs/2307.16660.

D. Gao, M. Kong, Y. Zhao, et al., "Simulating doctors' thinking logic for chest X-ray report generation via transformer-based semantic query learning," Medical Image Analysis, vol. 91, p. 102982, 2024, ISSN: 1361-8415. DOI: 10.1016/j.media.2023.102982. Available: https://www.sciencedirect.com/science/article/pii/S1361841523002426.

A. E. W. Johnson, T. J. Pollard, N. R. Greenbaum, et al., "MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs," 2019. arXiv: 1901.07042 [cs.CV]. Available: https://arxiv.org/abs/1901.07042.

Z. Liu, Y. Lin, Y. Cao, et al., "Swin Transformer: Hierarchical vision transformer using shifted windows," 2021. arXiv: 2103.14030 [cs.CV]. Available: https://arxiv.org/abs/2103.14030.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, P. Isabelle, E. Charniak, and D. Lin, Eds., Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. DOI: 10.3115/1073083.1073135. Available: https://aclanthology.org/P02-1040.

B. Jing, P. Xie, and E. Xing, "On the automatic generation of medical imaging reports," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2018. DOI: 10.18653/v1/P18-1240. Available: http://dx.doi.org/10.18653/v1/P18-1240.

C. Y. Li, X. Liang, Z. Hu, and E. P. Xing, "Hybrid retrieval-generation reinforced agent for medical image report generation," 2018. arXiv: 1805.08298 [cs.CV]. Available: https://arxiv.org/abs/1805.08298.

C. Y. Li, X. Liang, Z. Hu, and E. P. Xing, "Knowledge-driven encode, retrieve, paraphrase for medical image report generation," 2019. arXiv: 1903.10122 [cs.CV]. Available: https://arxiv.org/abs/1903.10122.

O. Alfarghaly, R. Khaled, A. Elkorany, M. Helal, and A. Fahmy, "Automated radiology report generation using conditioned transformers," Informatics in Medicine Unlocked, vol. 24, p. 100557, 2021, ISSN: 2352-9148. DOI: 10.1016/j.imu.2021.100557. Available: https://www.sciencedirect.com/science/article/pii/S2352914821000472.

X. Yang, M. Ye, Q. You, and F. Ma, "Writing by memorizing: Hierarchical retrieval-based medical report generation," 2021. arXiv: 2106.06471 [cs.CL]. Available: https://arxiv.org/abs/2106.06471.