
their societal impact, including the deskilling of professionals like doctors [41]. Therefore, while the application of machine learning techniques in healthcare is inevitable, establishing standardized criteria for interpretable ML in this field is urgently needed to enhance transparency, fairness, and safety.
3. Preliminaries on Vision Transformer
Vision Transformer (ViT) [17] is a deep learning model that has gained significant attention in computer vision. In contrast to traditional convolutional neural networks (CNNs), which have been the dominant architecture for image recognition tasks, ViT adopts a transformer-based architecture inspired by its success in natural language processing (NLP) [42]. ViT breaks an image down into fixed-size patches, which are then linearly embedded and processed by a transformer encoder to capture global contextual information. This approach allows ViT to handle local and global image features effectively, leading to remarkable performance in various computer vision tasks, including image classification and object detection.

Generally, a Vision Transformer consists of a patch embedding layer and several consecutively connected encoders, as depicted in Figure 1. The self-attention layer is the key component that enables ViT to achieve many state-of-the-art vision recognition performances. The self-attention layer first transforms the input image into three different vectors: the query vector, the key vector, and the value vector. The attention layer then computes the score between each pair of vectors, which determines how much attention each token pays to the other tokens.












             Figure 1. The basic framework of Vision Transformer (ViT) [17] and its encoder architecture.
Formally, given an image x ∈ ℝ^(H×W×3), the patch embedding layer first splits and flattens the sample x into sequential patches x_p ∈ ℝ^(N×(P²·d)), where (H, W) represents the height and width of the input image, (P, P) is the resolution of each image patch, d denotes the output channel, and N = HW/P² is the number of image tokens. The list of patch tokens is further fed into Transformer encoders for attention calculation.
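
To make the patch embedding concrete, here is a minimal NumPy sketch (an illustration of the usual ViT convention, not the implementation from [17]; the function name and the random projection matrix are assumptions standing in for learned parameters). It carves an H×W×3 image into P×P patches, flattens each patch, and linearly projects it to d output channels:

    import numpy as np

    def patch_embed(x, P=16, d=768, seed=0):
        """Split an (H, W, 3) image into N = HW/P^2 flattened P x P patches
        and project each one to d channels. W_e is random here but would be
        learned in a real ViT."""
        H, W, C = x.shape
        assert H % P == 0 and W % P == 0, "image must divide evenly into patches"
        # Rearrange into an (H/P, W/P) grid of P x P x C patches, then flatten.
        patches = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
        patches = patches.reshape(-1, P * P * C)   # x_p: one row per patch
        W_e = np.random.default_rng(seed).standard_normal((P * P * C, d)) * 0.02
        return patches @ W_e                       # (N, d) patch tokens

    tokens = patch_embed(np.zeros((224, 224, 3)))
    print(tokens.shape)   # (196, 768), since N = 224*224 / 16^2 = 196

For a 224×224 input with P = 16, the layer therefore produces N = 196 tokens, matching N = HW/P² above.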
Each Transformer encoder mainly consists of two types of sub-layers: a multi-head self-attention layer (MSA) and an MLP layer. In MSA, the tokens are linearly projected and further re-formulated into three vectors, namely Q, K and V. The self-attention calculation is performed on Q, K and V by

    x′_ℓ = Attention(Q, K, V) = Softmax(Q · K^⊤ / √d) · V,    (1)
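
As a sanity check on Equation (1), the following NumPy sketch implements a single head of this scaled dot-product self-attention (the function name is an assumption, and the projection matrices W_q, W_k, W_v are random stand-ins for learned weights). A multi-head layer (MSA) runs several such heads in parallel on lower-dimensional projections and concatenates their outputs:

    import numpy as np

    def self_attention(x, d=64, seed=0):
        """Single-head scaled dot-product attention over N input tokens,
        following Equation (1). Projections are random stand-ins for the
        learned weight matrices of a trained model."""
        rng = np.random.default_rng(seed)
        W_q, W_k, W_v = (rng.standard_normal((x.shape[1], d)) * 0.02
                         for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = Q @ K.T / np.sqrt(d)             # (N, N) pairwise scores
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)  # row-wise Softmax
        return attn @ V                           # (N, d) attended tokens

    x = np.random.default_rng(1).standard_normal((196, 768))  # e.g. ViT patch tokens
    print(self_attention(x).shape)   # (196, 64)

Each row of the Softmax output sums to one, so every output token is a weighted average of all value vectors, weighted by query-key similarity.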
THE TRANSFORMER ARCHITECTURE
Ferdinando SPAGNOLO

With the introduction of the Transformer architecture (models built to make sense of sequences of data, whether text or time series), LLMs have evolved rapidly, as shown by groundbreaking models such as GPT and BERT which, with billions of parameters, possess remarkable text comprehension, translation, and generation capabilities despite limited training.

In parallel, LVMs have transformed computer vision, pushing past performance limits in a variety of visual recognition tasks using Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs).

[Figure: schematic of a convolutional neural network]

LMMs represent the frontier in the integration of visual and linguistic data, fostering multimodal understanding and generation that opens up new possibilities in areas such as generating images from text. By leveraging the Transformer architecture, these models offer unified representations applicable to a wide range of tasks.

In today's landscape of biotechnology and digital health, large-scale artificial-intelligence models are powerful tools for tackling and solving complex challenges across critical sectors, from bioinformatics to medical robotics, making activities that only a few years ago were the preserve of a select few professionals straightforward, simple, and widespread.

