Transformers for Machine Learning: A Deep Dive, 1st Edition, by Dr. Uday Kamath, Dr. Kenneth Graham, and Dr. Wael Emara

Foreword by Dr. Yogesh Malhotra, Amazon Web Services (AWS) Partner and New York State Venture Capitalist, MIT Computer Science & AI Lab-MIT Sloan School of Management Executive Education Programs Management & Leadership Artificial Intelligence Faculty-SME, Princeton Quant Finance and Trading and FinTech-Crypto SME, New York State Faculty-SME AI-ML-DL-Quant-Cyber-Crypto-Quantum-Risk Computing, Who's Who in America®, Who's Who in the World®, Who's Who in Finance & Industry®, Who's Who in Science & Engineering® Since 1999.

Foreword by Dr. Yogesh Malhotra, Amazon Web Services Partner, AI-Quant-Cyber-Crypto-Quantum-Risk-Computing Pioneer:
New York State: "Join Dr. Yogi Malhotra to get up to speed on Cloud Technology."


Renowned AI pioneer and Nobel laureate Herbert Simon underscored ‘attention’ as the most valuable resource of the information economy, because attention must be allocated efficiently among an overabundance of information resources. Having written the foundational paper on meaning-aware AI and recently having served as MIT-Princeton-USASF-AFRL AI Faculty-SME, I had the privilege of publishing by invitation in the same journal’s special issue of ASQ, the Malcolm Baldrige National Quality Award administrator, as well as guiding the Malcolm Baldrige Award-winning corporate CxOs for the Conference Board US Quality Council at the invitation of the Florida Power & Light Company Chief Quality Officer. In addition, I have been ranked along with Simon in the same global academic citation impact studies.

Given the above background, I am thrilled to share with you the most thorough and up-to-date compendium of research, practices, case studies, and applications available today that can provide the best ROI on the latest AI technological advances on transformers inspired by the paper, “Attention is All You Need.” Since Google introduced the transformer architecture in 2017, transformers have provided exponential improvements in context-focused realization toward meaning-aware AI as deep (neural network) learning models based upon attention mechanisms such as dot-product attention and multi-head attention. The resulting advances in parallel processing of sequential data have made efficient, context-sensitive, and hence more “meaningful” processing of ever-larger datasets much more feasible than before.

Covering the latest advances in neural network architectures related to transformers, spanning applications such as Natural Language Processing (NLP), speech recognition, time series analysis, and computer vision, as well as domain-specific models spanning science, medicine, and finance, the book aims to meet the theoretical, research, application, and practical needs across academia and industry for multiple audiences, including postgraduate students and researchers, undergraduate students, industry practitioners, and professionals. The book rounds off its theory-driven applied and practical coverage with hands-on case studies with a focus on AI explainability, an increasingly important theme in practice imposed by greater focus on issues such as ethical AI and trustable AI.

- Dr. Yogesh Malhotra, Founding Chairman and CEO, New York State Global Venture Capital and Private Equity Firm, Global Risk Management Network LLC, R&D Impact Ranked Among Artificial Intelligence and Quantitative Finance Nobel Laureates such as Herbert Simon and Black-Scholes – AACSB-ASIS&T, Editorial-Review Boards of the National Science Foundation and Top-50 Global Scientific-STEM & Computer Science Research Journals & Publishers, www.YogeshMalhotra.com.

Congratulations to Dr. Uday Kamath and his co-authors Dr. Wael Emara and Dr. Kenneth Graham on the release of their brand new Artificial Intelligence-Machine Learning-Data Science book, Transformers for Machine Learning: A Deep Dive, published by Chapman and Hall/CRC in the Machine Learning & Pattern Recognition series. I have had the pleasure and privilege of being invited to write the Foreword of this latest and most outstanding AI-ML-Data Science industry and research reference on the state-of-the-art Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP) transformer architectures.


Listed below is the Book Description followed by the Table of Contents of this latest AI industry reference by this seasoned and experienced team of Data Scientists and Machine Learning Engineers. Provided hyperlinks will lead you to the publisher’s book page and the Amazon and Google book pages with additional content. Uday, currently Chief Analytics Officer of Smarsh (named a Leader for the 7th year in the 2022 Gartner Magic Quadrant for Enterprise Information Archiving), earlier served as Chief Data Scientist for BAE Systems Applied Intelligence, and Wael serves as Principal Data Scientist with Uday. The three collaborators on this book got this project off the ground while leading Digital Reasoning, the AI company that understands the nuances of human intention and behavior, where Uday served as the Chief Analytics Officer, Kenneth as Principal Research Engineer, and Wael as Senior Research Engineer.


Transformers for Machine Learning: A Deep Dive
By Uday Kamath, Kenneth L. Graham, Wael Emara
Copyright Year 2022.


Book Description


Transformers are becoming a core part of many neural network architectures, employed in a wide range of applications such as NLP, Speech Recognition, Time Series, and Computer Vision. Transformers have gone through many adaptations and alterations, resulting in newer techniques and methods. Transformers for Machine Learning: A Deep Dive is the first comprehensive book on transformers.


Key Features:


A comprehensive reference book with detailed explanations of every algorithm and technique related to transformers.
60+ transformer architectures covered in a comprehensive manner.
A book for understanding how to apply the transformer techniques in speech, text, time series, and computer vision.
Practical tips and tricks for each architecture and how to use it in the real world.
Hands-on case studies and code snippets for theory and practical real-world analysis using the tools and libraries, all ready to run in Google Colab.

The theoretical explanations of the state-of-the-art transformer architectures will appeal to postgraduate students and researchers (academic and industry) as it will provide a single entry point with deep discussions of a quickly moving field. The practical hands-on case studies and code will appeal to undergraduate students, practitioners, and professionals as it allows for quick experimentation and lowers the barrier to entry into the field.

Table of Contents
List of Figures
List of Tables
Author Bios
Foreword
Preface
Contributors

Deep Learning and Transformers: An Introduction
1.1 DEEP LEARNING: A HISTORIC PERSPECTIVE
1.2 TRANSFORMERS AND TAXONOMY
1.2.1 Modified Transformer Architecture
1.2.1.1 Transformer block changes
1.2.1.2 Transformer sublayer changes
1.2.2 Pretraining Methods and Applications
1.3 RESOURCES
1.3.1 Libraries and Implementations
1.3.2 Books
1.3.3 Courses, Tutorials, and Lectures
1.3.4 Case Studies and Details

Transformers: Basics and Introduction
2.1 ENCODER-DECODER ARCHITECTURE
2.2 SEQUENCE TO SEQUENCE
2.2.1 Encoder
2.2.2 Decoder
2.2.3 Training
2.2.4 Issues with RNN-based Encoder Decoder
2.3 ATTENTION MECHANISM
2.3.1 Background
2.3.2 Types of Score-Based Attention
2.3.2.1 Dot Product (multiplicative)
2.3.2.2 Scaled Dot Product or multiplicative
2.3.2.3 Linear, MLP, or additive
2.3.3 Attention-based Sequence to Sequence
2.4 TRANSFORMER
2.4.1 Source and Target Representation
2.4.1.1 Word Embedding
2.4.1.2 Positional Encoding
2.4.2 Attention Layers
2.4.2.1 Self-Attention
2.4.2.2 Multi-Head Attention
2.4.2.3 Masked Multi-Head Attention
2.4.2.4 Encoder-Decoder Multi-Head Attention
2.4.3 Residuals and Layer Normalization
2.4.4 Position-wise Feed-Forward Networks
2.4.5 Encoder
2.4.6 Decoder
2.5 CASE STUDY: MACHINE TRANSLATION
2.5.1 Goal
2.5.2 Data, Tools and Libraries
2.5.3 Experiments, Results and Analysis
2.5.3.1 Exploratory Data Analysis
2.5.3.2 Attention
2.5.3.3 Transformer
2.5.3.4 Results and Analysis
2.5.3.5 Explainability

Bidirectional Encoder Representations from Transformers (BERT)
3.1 BERT
3.1.1 Architecture
3.1.2 Pre-training
3.1.3 Fine-tuning
3.2 BERT VARIANTS
3.2.1 RoBERTa
3.3 APPLICATIONS
3.3.1 TaBERT
3.3.2 BERTopic
3.4 BERT INSIGHTS
3.4.1 BERT Sentence Representation
3.4.2 BERTology
3.5 CASE STUDY: TOPIC MODELING WITH TRANSFORMERS
3.5.1 Goal
3.5.2 Data, Tools, and Libraries
3.5.2.1 Data
3.5.2.2 Compute embeddings
3.5.3 Experiments, Results, and Analysis
3.5.3.1 Building Topics
3.5.3.2 Topic size distribution
3.5.3.3 Visualization of topics
3.5.3.4 Content of topics
3.6 CASE STUDY: FINE-TUNING BERT
3.6.1 Goal
3.6.2 Data, Tools and Libraries
3.6.3 Experiments, Results and Analysis

Multilingual Transformer Architectures
4.1 MULTILINGUAL TRANSFORMER ARCHITECTURES
4.1.1 Basic Multilingual Transformer
4.1.2 Single-Encoder Multilingual NLU
4.1.2.1 mBERT
4.1.2.2 XLM
4.1.2.3 XLM-RoBERTa
4.1.2.4 ALM
4.1.2.5 Unicoder
4.1.2.6 INFOXLM
4.1.2.7 AMBER
4.1.2.8 ERNIE-M
4.1.2.9 HICTL
4.1.3 Dual-Encoder Multilingual NLU
4.1.3.1 LaBSE
4.1.3.2 mUSE
4.1.4 Multilingual NLG
4.2 MULTILINGUAL DATA
4.2.1 Pre-training Data
4.2.2 Multilingual Benchmarks
4.2.2.1 Classification
4.2.2.2 Structure Prediction
4.2.2.3 Question Answering
4.2.2.4 Semantic Retrieval
4.3 MULTILINGUAL TRANSFER LEARNING INSIGHTS
4.3.1 Zero-shot Cross-lingual Learning
4.3.1.1 Data Factors
4.3.1.2 Model Architecture Factors
4.3.1.3 Model Tasks Factors
4.3.2 Language-agnostic Cross-lingual Representations
4.4 CASE STUDY
4.4.1 Goal
4.4.2 Data, Tools, and Libraries
4.4.3 Experiments, Results, and Analysis
4.4.3.1 Data Preprocessing
4.4.3.2 Experiments

Transformer Modifications
5.1 TRANSFORMER BLOCK MODIFICATIONS
5.1.1 Lightweight Transformers
5.1.1.1 Funnel-Transformer
5.1.1.2 DeLighT
5.1.2 Connections between Transformer Blocks
5.1.2.1 RealFormer
5.1.3 Adaptive Computation Time
5.1.3.1 Universal Transformers (UT)
5.1.4 Recurrence Relations between Transformer Blocks
5.1.4.1 Transformer-XL
5.1.5 Hierarchical Transformers
5.2 TRANSFORMERS WITH MODIFIED MULTI-HEAD SELF-ATTENTION
5.2.1 Structure of Multi-head Self-Attention
5.2.1.1 Multi-head self-attention
5.2.1.2 Space and time complexity
5.2.2 Reducing Complexity of Self-attention
5.2.2.1 Longformer
5.2.2.2 Reformer
5.2.2.3 Performer
5.2.2.4 Big Bird
5.2.3 Improving Multi-head-attention
5.2.3.1 Talking-Heads Attention
5.2.4 Biasing Attention with Priors
5.2.5 Prototype Queries
5.2.5.1 Clustered Attention
5.2.6 Compressed Key-Value Memory
5.2.6.1 Luna: Linear Unified Nested Attention
5.2.7 Low-rank Approximations
5.2.7.1 Linformer
5.3 MODIFICATIONS FOR TRAINING TASK EFFICIENCY
5.3.1 ELECTRA
5.3.1.1 Replaced token detection
5.3.2 T5
5.4 TRANSFORMER SUBMODULE CHANGES
5.4.1 Switch Transformer
5.5 CASE STUDY: SENTIMENT ANALYSIS
5.5.1 Goal
5.5.2 Data, Tools, and Libraries
5.5.3 Experiments, Results, and Analysis
5.5.3.1 Visualizing attention head weights
5.5.3.2 Analysis

Pretrained and Application-Specific Transformers
6.1 TEXT PROCESSING
6.1.1 Domain-Specific Transformers
6.1.1.1 BioBERT
6.1.1.2 SciBERT
6.1.1.3 FinBERT
6.1.2 Text-to-text Transformers
6.1.2.1 ByT5
6.1.3 Text generation
6.1.3.1 GPT: Generative Pre-training
6.1.3.2 GPT-2
6.1.3.3 GPT-3
6.2 COMPUTER VISION
6.2.1 Vision Transformer
6.3 AUTOMATIC SPEECH RECOGNITION
6.3.1 Wav2vec 2.0
6.3.2 Speech2Text2
6.3.3 HuBERT: Hidden Units BERT
6.4 MULTIMODAL AND MULTITASKING TRANSFORMER
6.4.1 Vision-and-Language BERT (ViLBERT)
6.4.2 Unified Transformer (UniT)
6.5 VIDEO PROCESSING WITH TIMESFORMER
6.5.1 Patch embeddings
6.5.2 Self-attention
6.5.2.1 Spatiotemporal self-attention
6.5.2.2 Spatiotemporal attention blocks
6.6 GRAPH TRANSFORMERS
6.6.1 Positional encodings in a graph
6.6.1.1 Laplacian positional encodings
6.6.2 Graph transformer input
6.6.2.1 Graphs without edge attributes
6.6.2.2 Graphs with edge attributes
6.7 REINFORCEMENT LEARNING
6.7.1 Decision Transformer
6.8 CASE STUDY: AUTOMATIC SPEECH RECOGNITION
6.8.1 Goal
6.8.2 Data, Tools, and Libraries
6.8.3 Experiments, Results, and Analysis
6.8.3.1 Preprocessing speech data
6.8.3.2 Evaluation

Interpretability and Explainability Techniques for Transformers
7.1 TRAITS OF EXPLAINABLE SYSTEMS
7.2 RELATED AREAS THAT IMPACT EXPLAINABILITY
7.3 EXPLAINABLE METHODS TAXONOMY
7.3.1 Visualization Methods
7.3.1.1 Backpropagation-based
7.3.1.2 Perturbation-based
7.3.2 Model Distillation
7.3.2.1 Local Approximation
7.3.2.2 Model Translation
7.3.3 Intrinsic Methods
7.3.3.1 Probing Mechanism
7.3.3.2 Joint Training
7.4 ATTENTION AND EXPLANATION
7.4.1 Attention is not Explanation
7.4.1.1 Attention Weights and Feature Importance
7.4.1.2 Counterfactual Experiments
7.4.2 Attention is not not Explanation
7.4.2.1 Is attention necessary for all tasks?
7.4.2.2 Searching for Adversarial Models
7.4.2.3 Attention Probing
7.5 QUANTIFYING ATTENTION FLOW
7.5.1 Information flow as DAG
7.5.2 Attention Rollout
7.5.3 Attention Flow
7.6 CASE STUDY: TEXT CLASSIFICATION WITH EXPLAINABILITY
7.6.1 Goal
7.6.2 Data, Tools, and Libraries
7.6.3 Experiments, Results and Analysis
7.6.3.1 Exploratory Data Analysis
7.6.3.2 Experiments
7.6.3.3 Error Analysis and Explainability

Bibliography
Alphabetical Index