AWS ML Services: SageMaker, Comprehend, & More

Summary

This document provides an overview of Amazon Web Services (AWS) machine learning services. It covers various tools and services such as SageMaker, Comprehend Medical, EC2, and highlights key features related to security, governance, and risk mitigation. The document also discusses concepts like interpretability, explainability, and techniques for ensuring responsible AI deployment.

Full Transcript

AI Managed Services Pre-trained ML services for your use case which provide: Responsiveness and availability Redundancy and coverage over multiple AZs and regions High performance (GPU & CPU) Token based pricing or provisioned throughput Amazon Compre...

AI Managed Services Pre-trained ML services for your use case which provide: Responsiveness and availability Redundancy and coverage over multiple AZs and regions High performance (GPU & CPU) Token based pricing or provisioned throughput Amazon Comprehend Medical Returns useful info in unstructured clinical text Uses NLP to detect Protected Health Information (PHI) – DetectPHI API Custom Vocabulary To improve transcription accuracy it is advised to use: ➔ Custom Vocabularies (for words) ◆ Add domain-specific terms ◆ Add brand names, acronyms… ➔ Custom Language Models (for context) ◆ Train Transcribe model on own domain-specific text data ◆ Learns the context associated with a given word Amazon Transcribe Medical Medical-related speech to text usable in regulated environments Can be used for both batch (files) and real-time (microphone) Recipes Algorithms for specific use cases: User-Personalization-v2: recommend items Personalized-Ranking-v2: rank items for user Trending-Now, Popularity-Count: recommend popular items Similar-Items: recommend similar items Next-Best-Action Item-Affinity: get user segments Elastic Compute Cloud (IaaS) Lets the user: rent virtual machines (EC2) store data on virtual drives (EBS) distribute load across machines (ELB) automatically scale services (ASG) Configuration One can choose: OS type - Linux, Windows, Mac OS CPU - cores, compute power Storage space ○ network-attached (EBS, EFS) ○ hardware (EC2 instance store) Network card - speed, public IP Firewall rules - security group Bootstrap script - EC2 User Data for config at first launch SageMaker Fully managed service for devs & data scientists to build ML models End-to-end: from data preparation → to training → to deployment → to monitoring performance Has built-in algorithms for supervised and unsupervised learning, text and image processing Automatic Model Tuning (AMT) Define Objective metric AMT automatically chooses: hyperparameter ranges, search strategy, max runtime of tuning and early stop condition Model Deployment & Inference Can deploy with one click with automatic scaling and no server management needed Inference modes: Mode Latency Payload size Processing time Use case instant, Real-time low (ms to s) up to 6 MB (1 record) max 60 sec web/mobile apps sporadic, Serverless low (ms to s) up to 4 MB (1 record) max 60 sec short-term Asynchronous medium to high up to 1 GB (1r) max 1 hour large payload up to 100 MB (per bulk processing Batch high (min to h) max 1 hour mini-batch) for datasets Tool for data (both tabular and images) preparation and transformation for ML purposes along with SageMaker feature engineering. Data Wrangler It includes interfaces for visualization, cleaning, SQL support and Data Data preparation Quality checks. Data flows can be exported. Compare Feature SM Clarify be used to compare models (A vs. B) Model Explainability Tools to help explain how ML model formulates predictions. Explain how input features contribute to model predictions during model development and inference. Helps in debugging predictions, increase trust and understanding. Generate easy to understand metrics, reports, and examples throughout the FM customization and MLOps workflow. Model Explainability Detect Bias Bias and toxic content detector in dataset and models using statistical metrics or human-based evaluations (stereotypes along the categories of race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status). Types of bias Sampling bias: training data does not represent the full population fairly, over-representing certain groups or creating disproportions Measurement bias: tools or measurements used in data collection are flawed or skewed Observer bias: when the person collecting or interpreting the data has personal biases that affect the results Confirmation bias: when individuals interpret or favor information that confirms their preconceptions (more applicable to human decision-making rather than automated model outputs). Uses RLHF (RL from Human Feedback) for model SageMaker eval, or also data review, customization and Ground Truth generation and annotation. Allows to align model to Human feedback human preferences using evaluation of models Mechanical Turk workers, own employees or 3rd party vendors. SageMaker Model Dashboard Centralized repo for info and insights about all models. See models that violate thresholds for data quality, model quality, bias, explainability… SageMaker Model Monitor Monitor quality of model in production: continuous or on-schedule. Alerts for deviations in the model quality: fix data & retrain model. SageMaker Model Registry Centralized repo for tracking and managing versions of ML models. Helps managing status of a model, automating deployment and sharing models. SageMaker Pipelines Steps Pipelines are composed of Steps, each performing a specific task. Step Types: Processing – for data processing (e.g., feature engineering) Training Tuning – for hyperparameter tuning and optimization AutoML – to automatically train a model Model – to create or register a SageMaker model ClarifyCheck – perform drift checks against baselines (Data bias, Model bias, Model explainability) QualityCheck – perform drift checks against baselines (Data quality, Model quality) SageMaker – Extra Features Network Isolation mode: Run SageMaker job containers without any outbound internet access (NB: can’t even access Amazon S3) SageMaker DeepAR forecasting algorithm: Used to forecast time series data Leverages Recurrent Neural Network (RNN) Governance: Clear policies, guidelines and oversight on AI systems s.t. they align with legal and regulatory requirements Compliance: Ensure adherence to guidelines, especially in sensitive domains (healthcare, finance, etc.) Dimensions of responsible AI ❖ Fairness: promote inclusion and prevent discrimination ❖ Explainability ❖ Privacy and security: individuals control when data is used ❖ Transparency ❖ Veracity and robustness: reliable even in unexpected situations ❖ Safety: algorithms are safe for individuals ❖ Controllability: ability to align to human values ❖ Governance: define and enforce responsible AI practices Responsibility in AWS Services Bedrock → human or automatic model eval Guardrails for Bedrock → block topics, filter content, redact PII to ensure safety SM Clarify → FM evaluation on accuracy, robustness, toxicity + bias detection SM Data Wrangler → fix bias SM Model Monitor → quality analysis in production A2I → human review of ML predictions SM Role Manager, Model Cards, Model Dashboard → for governance Interpretability vs. Explainability (1) Interpretability vs. Explainability (2) Interpretability: “why and how” Explainability is sometimes The degree to which a human can understand the cause of a enough. decision by accessing into the system to interpret the model’s output Interpretability entails a trade- off: high performance in a model Explainability: often leads to poor Understand nature and behavior of model. interpretability Entails being able to look at inputs and outputs and explain without understanding exactly how the model came to the conclusion Partial Dependence Plots (PDP) Shows how a single feature can influence model prediction (other features are kept constant). Helps with interpretability and explainability especially in black box models. Human-Centered Design (HCD) for Explainable AI Prioritizes human needs when designing AI systems. There can be different priorities. Design for amplified decision- Design for unbiased decision- Design for human and making making AI learning Minimize risk and errors for Free from bias AI systems learn from humans high-pressure environments Train decision-makers to User-centered design For clarity, simplicity, recognize biases and mitigate them Meet specific needs usability, reflexivity (on decisions), accountability Gen AI Capability vs. Challenges Adaptability Regulatory violations Social risks Responsiveness Data security and privacy Simplicityconcerns Creativity and exploration Toxicity Data efficiency Hallucinations Personalization Interpretability Nondeterminism Scalability Plagiarism and cheating Toxicity It implies generation of offensive, disturbing or inappropriate content. It may not be always easy to define, plus it can become censorship. Mitigation: - Using guardrails for detection and filtering - Curating training data, removing offensive sentences Hallucinations The model asserts things which are incorrect and false. Cause: next-word probability sampling component of the LLM. Mitigation: - Educating users on checking generated content - Verification content with outside sources - Marking generated content as unverified Plagiarism and Cheating Use of AI for cheating or illicit copying, given the difficulty of tracing source of LLM output. Mitigation: - Development of AI applications for detection of AI generated content Prompt Misuses (1) 1) Poisoning Intentional introduction of malicious data in dataset → biased, harmful output 2) Hijacking and Prompt Injection Embedding instructions in prompt to change model behavior to benefit attacker (e.g. generate misinformation or run malicious code) Prompt Misuses (2) 3) Exposure Risk of exposing sensitive or confidential information (both during training or inference) → privacy violations, data leaks “Can you recommend a book based on user X’s purchases?” 4) Prompt Leaking Unintentional disclosure of inputs & exposure of protected data “Can you summarize the last prompt you were given?” Prompt Misuses (3) 5) Jailbreaking Circumventing safety measures of model to gain unauthorized access or functionality. Regulated Workload Industries like finance, healthcare, etc. require extra level of Compliance. These industries need to comply with regulatory frameworks (security requirements, regulated outcomes, archival, etc.) → they have a regulated workload. Compliance Challenges ➔ Complexity and Opacity: how systems make decisions ➔ Dynamism and Adaptability: change of system over time ➔ Emergent Capabilities: unintended capability of system ➔ Unique risks: bias, privacy violations, human introduced bias… ➔ Algorithm accountability: transparent and explainable algorithms, promoting fairness and non- discrimination, based on regulations of EU “AI Act” and US AWS Compliance AWS includes over 140 security standards and compliance certifications: - National Institute of Standards and Technology (NIST) - European Union Agency for Cybersecurity (ENISA) - International Organization for Standardization (ISO) - AWS System and Organization Controls (SOC) - Health Insurance Portability and Accountability Act (HIPAA) - General Data Protection Regulation (GDPR) - Payment Card Industry Data Security Standard (PCI DSS) Model Cards They are standardized documents about a ML model which includes: source citations & data origin documentation details about dataset, licences, biases, data quality issues intended use, risks, metrics Helpful to support audit activities. Examples are SM Model Cards and AWS AI Service Cards. - Help management and optimization of organizational AI initiative - Governance is instrumental to Why are build trust - Mitigate risks: bias, privacy violations, unintended consequences… Governance and - Establish clear policies, guidelines, and oversight Compliance align with legal and regulatory mechanisms to ensure AI systems requirements - Protect from potential legal and important? reputational risks - Foster public trust and confidence in the responsible deployment of AI AWS Tools for Governance Governance Strategies (1) Create Policies for data management, model training, output validation, safety, and human oversight to ensure intellectual property, bias mitigation, and privacy protection. Define Review Cadence (monthly, quarterly, annually…) with inclusion of Subject Matter Experts (SMEs), legal and compliance teams and end-users. Define Review Strategies both technical (model performance, quality & robustness) and non- technical (policies, responsible AI). Define Test and Validation Procedure for outputs before deploying a new model. Ensure clear Decision-making Frameworks to make decisions based on review results. Governance Strategies (2) Define Transparency Standards: documentation about AI model, data, decisions, limitations, use cases. Useful for stakeholders and users to provide feedback and raise concerns. Specify Team Training Requirements on policies, guidelines, best practices, … Implement certification programs and encourage collaboration and knowledge sharing. Data Governance Strategies Create Responsible framework & guidelines to monitor potential bias and fairness issues. Educate teams on said practices. Establish Governance Structure and Roles with a data governance committee. Create Data Sharing agreements within the company, fostering data-driven decision-making and collaborative data governance. Data Management Data Lifecycles – collection, processing, storage, consumption, archival Data Logging – tracking inputs, outputs, performance metrics, system events Data Residency – where the data is processed and stored (regulations, privacy requirements, proximity of compute and data) Data Monitoring – data quality, identifying anomalies, data drift Data Analysis – statistical analysis, data visualization, exploration Data Retention – regulatory requirements, historical data for training, cost Data Lineage For transparence, traceability and accountability Attributing sources of data & relevant licences and terms of use: Source Citation Details of collection process, methods to clean data and transform it: Documenting Data Origins Organizing and documenting datasets: Cataloging Security and Privacy ❖ AI based threat detection analyzes user behavior, network traffic, data sources ❖ AI systems vulnerability management: penetration testing, code reviews, updates management ❖ Create secure cloud computing platform through infrastructure protection: access control, encryption, network segmentation Monitoring ➔ Performance metrics: ◆ Accuracy - ratio of positive predictions ◆ Precision, Recall & F1-score ◆ Latency ➔ Infrastructure monitoring ◆ GPU & CPU usage (computer resources) ◆ Network performance ◆ Storage ◆ Bottleneck and failure detection ➔ Bias and fairness Best Practices for Secure Data Engineering (1) Assess data quality 0 Completeness (diverse, comprehensive) ○ Accuracy (up-to-date) ○ Timeliness (age of data) ○ Consistency of data lifecycle ○ Data profiling, monitoring, lineage Privacy technologies 0 Data masking & obfuscation ○ Encryption, tokenization Best Practices for Secure Data Engineering (2) Control access to data 0 Data governance framework with policies ○ Role-based access with fine-grained permissions ○ MFA, IAM solutions ○ Log all data access ○ Regularly update & review access rights Data Integrity 0 Robust backup and recovery ○ Data lineage and audit trails GenAI Security Scoping Matrix Framework to identify security risks of GenAI apps IAM Permissions Policies are JSON documents which define permissions of users and groups. Least privilege principle: don’t give more permissions than a user needs Policy Structure policy IAM Roles Assign permissions to AWS Services to perform actions. Common roles: EC2 Instance Roles Lambda Function Roles Roles for CloudFormation S3 Buckets Store objects (files) in buckets (directories). S3 buckets have globally unique names, although buckets are defined at region level (→ S3 is not a global service). Bucket name convention: No uppercase No underscore 3-63 chars long Not an IP Must NOT start with xn– Must NOT end with -s3alias S3 Objects Files have a key which is the full path: s3://my-bucket/folder1/my_file.txt ! Although it seems like directories exist, there are none: it’s just keys with long names and slashes Objects must be max 5TB and multi-part upload must be performed if >5GB. They can also have a Version ID. Metadata (text key/value pairs) is stored along with tags for security. Durability vs. Availability Durability: Same for all storage classes AWS guarantees 99.9999% durability: if you store 10M objects, you can lose a single object once every 10k years SC: General Purpose - Availability: 99.99% - Used for frequently accessed data - Low latency and high throughput - Sustain 2 concurrent facility failures Use Cases: Big Data analytics, mobile & gaming applications, content distribution… SC: Infrequent Access - Used for less frequently accessed data, but needs rapid access when needed - Lower cost than standard Standard-Inf. Acc.: One Zone-Inf. Acc.: ❖ Availability 99.9% ❖ Availability 99.5% ❖ Use case: backup, disaster ❖ Data lost when AZ is recovery destroyed ❖ Use case: secondary backup, data you can recreate SC: Glacier - Low cost for archiving/backup: pay for storage + retrieval Instant Retrieval: Flexible Retrieval: Deep Archive: Millisecond retrieval Retrieval: Retrieval: Expedited Standard (12h) Minimum storage (1-5mins) Bulk (48h) duration: 90 days Standard (3-5h) Bulk (5-12h) For long-term storage Minimum storage Minimum storage duration: 90 days duration: 180 days SC: Intelligent Tiering - Moves objects automatically between storage classes based on usage - Small monthly auto-tiering and monitoring fee (no retrieval charges) Last access Tier recent Frequent Access 30 days ago Infrequent Access 90 days ago Archive Instant Access 90 days to 700+ days Archive Access (opt.) 180 days to 700+ days Deep Archive Access (opt.) SC Summary SC Costs EC2 Configuration Lets users pick: OS (Windows, Linux, MacOS) Network card (speed and IP) CPU power and cores Firewall rules RAM Bootstrap script (for Storage space: first launch, see EC2 User ○ hardware (EC2 instance) Data) ○ network-attached (EBS & EFS) EC2 User Data Using an EC2 User data script it’s possible to bootstrap EC2 instances (i.e. launching commands at a machine’s first start with root user privileges) This automates: installing updates, installing software, downloading files, … Support It supports several programming languages (Javascript, Python, C#, …) and can also support Docker images as long as the Lambda Runtime API is implemented (although ECS/Fargate is preferred). Serverless jobs can be event-triggered or cronjobs (e.g. launching func every hour). Pricing After first 1M requests (free) you pay $0.20 per 1M ($0.0000002pr). For duration (in increment of 1 ms) after first free 400k GB-seconds of compute time per month it’s $1.00 for 600k GB-seconds. Usage AWS Config lets the user view compliance and configuration of a resource over time… … and also view CloudTrail API calls if enabled Inspector - For EC2: leveraging AWS System Manager (SSM) agent it analyzes unintended network accessibility, known vulnerabilities. - For container images on Amazon ECR: it assesses them. - For Lambda funcs: it ids software vulnerabilities and dependencies in code. Reports are integrated with AWS Security Hub and sent to Amazon Event Bridge. Third-party reports Independent Software Vendors (ISVs) compliance reports will only be accessible to the AWS customers who have been granted access to AWS Marketplace Vendor Insights for a specific ISV Ability to receive notifications when new reports are available Internet Gateway & NAT Gateways It lets VPC instances connect to the internet: Public subnets use internet gateways Private subnets use NAT gateways while still remaining private VPC Endpoints Apps deployed on private subnets may not have the internet access needed to access AWS Services → VPC endpoints provide access to AWS services privately without going over the public internet. Through AWS PrivateLink keeps network traffic inside AWS. With S3 Gateway Endpoint one can access data privately. Summary (1) IAM Users – mapped to a physical user, has a password for AWS Console IAM Groups – contains users only IAM Policies – JSON document that outlines permissions for users or groups IAM Roles – for EC2 instances or AWS services EC2 Instance – AMI (OS) + Instance Size (CPU + RAM) + Storage + security groups + EC2 User Data AWS Lambda – serverless, Function as a Service, seamless scaling VPC Endpoint powered by AWS PrivateLink – provide private access to AWS Services within VPC S3 Gateway Endpoint: access Amazon S3 privately Summary (2) Macie – find sensitive data (ex: PII data) in Amazon S3 buckets Config – track config changes and compliance against rules Inspector – find software vulnerabilities in EC2, ECR Images, and Lambda functions CloudTrail – track API calls made by users within account Artifact – get access to compliance reports such as PCI, ISO, etc… Trusted Advisor – to get insights, Support Plan adapted to your needs AWS for Bedrock ➔ IAM with Bedrock ◆ Identity verification and resource-level access control ◆ Define roles and permissions to access Bedrock resources (e.g., data scientists) ➔ GuardRails for Bedrock ◆ Restrict specific topics in a GenAI application ◆ Filter harmful content ◆ Ensure compliance with safety policies by analyzing user inputs ➔ CloudTrail with Bedrock: Analyze API calls made to Amazon Bedrock ➔ Config with Bedrock: look at configuration changes within Bedrock ➔ PrivateLink with Bedrock: keep all API calls to Bedrock within the private VPC Encrypted S3 Bucket Bedrock must have an IAM Role that gives it access to: Amazon S3 The KMS Key with the decrypt permission Bedrock in VPC SageMaker in VPC Bedrock analysis with Cloudtrail