LLMs Under BAA: Which AI Models a US Healthcare Founder Can Actually Use in 2026

LLMs Under BAA: Which AI Models a US Healthcare Founder Can Actually Use in 2026

In January 2023, a Philadelphia-based telehealth startup, Seed stage, $3.2M raised, added an AI feature to their platform. The feature was a clinical summary generator: a provider could click a button and get a two-paragraph summary of a patient’s recent visit notes, medications, and lab results to share with a specialist via secure message. The engineering team built it in one sprint using the OpenAI API. The feature shipped. Providers loved it.

Six months later, the startup’s new VP of Engineering was reviewing the codebase. She found the clinical summary feature. She traced the API calls. Every time a provider requested a summary, the prompt included the patient’s name, date of birth, diagnosis codes, medication list, and recent lab values, pulled directly from the EHR. Fully identified ePHI. Sent to OpenAI’s standard API.

They did not have an enterprise agreement with OpenAI. They did not have a HIPAA BAA with OpenAI. They did not have a BAA with OpenAI for any tier of service. The standard OpenAI API terms, the ones the engineering team had agreed to when they signed up for the API key, explicitly stated that OpenAI might use API inputs to improve its models.

Fully identified ePHI. Sent to a third-party service. Without a BAA. Potentially used for model training. For six months.

The VP of Engineering escalated immediately. The feature was disabled the same day. Their healthcare attorney spent four weeks analyzing the exposure. Because OpenAI’s model training opt-out had been available and the data volume was below a certain threshold, the remediation was manageable, but not free.

Legal fees, a HIPAA risk analysis update, new vendor management policies, and the delayed launch of the next three features while the engineering team rebuilt the AI infrastructure under a compliant architecture cost the company approximately $140,000 and four months.

The feature they had built in one sprint cost them $140,000 and four months to fix.

I have been on 2,000+ calls with US healthcare founders since 2014. The LLM BAA gap is the fastest-growing compliance error I see in 2026. The capability has outrun the compliance awareness. This guide exists to close that gap before the six months happen.

01 the 140k mistake

Eight Things Healthcare Founders Get Wrong About LLMs and BAAs

  • Wrong #1: “The standard OpenAI API is HIPAA-compliant if we use it carefully.”

The standard OpenAI API, the pay-as-you-go tier that most developers start with, does not include a Business Associate Agreement. Sending ePHI through an API without a BAA with the API provider is a HIPAA violation regardless of how carefully the data is handled on your side. HIPAA compliance is not a function of how carefully you use the tool. It is a function of whether the tool’s provider has contractually committed to handle ePHI in accordance with HIPAA requirements.

  • Wrong #2: “We have a BAA with AWS, that covers everything we do on AWS including Bedrock.”

The AWS BAA covers AWS services that are on the HIPAA-eligible services list. AWS Bedrock is on that list, so yes, using Bedrock for LLM inference is covered by the AWS BAA. But the AWS BAA does not cover third-party services accessed through AWS (for example, if you make API calls from an AWS Lambda function to an external service, the external service is not covered by the AWS BAA). And the AWS BAA does not cover the application-layer decisions your engineering team makes about how data flows through your system.

  • Wrong #3: “A BAA means the LLM provider won’t use our data for training.”

A BAA is a contract that commits the provider to handle ePHI in accordance with HIPAA requirements. HIPAA does not prohibit using ePHI for all purposes, it restricts unauthorized disclosures and limits permissible uses. Whether a BAA-covered LLM provider uses your data for model training depends on the specific terms of the BAA and the service agreement, not on the existence of a BAA alone. Read the data use terms, not just the BAA existence confirmation.

  • Wrong #4: “We anonymize the data before sending it to the LLM, so we don’t need a BAA.”

If the data is properly de-identified under HIPAA’s Safe Harbor method (§164.514(b)), removing all 18 HIPAA identifiers, it is no longer ePHI and does not require a BAA. The key word is “properly.” Removing a patient’s name but retaining their date of birth, ZIP code, and diagnosis in a prompt is not de-identification. It may still be re-identifiable and may still be ePHI. Confirm de-identification with a healthcare attorney before concluding that a BAA is not required.

  • Wrong #5: “The LLM provider said they’re HIPAA-compliant, that means they have a BAA.”

“HIPAA-compliant” is not a certification. It is a claim. An LLM provider that says they are HIPAA-compliant may mean they have security controls that align with HIPAA requirements. It does not necessarily mean they will sign a BAA with you or that they have been audited for HIPAA compliance. Confirm BAA availability, obtain the executed BAA document, and have your healthcare attorney review the BAA scope before clinical data flows through the service.

  • Wrong #6: “The BAA covers the LLM output, whatever the model generates is protected.”

The BAA covers the handling of ePHI by the Business Associate, how they store, process, and protect the ePHI you send them. It does not protect the content the LLM generates. If the LLM generates a clinical hallucination, a fabricated medication, an incorrect diagnosis, the BAA does not make that hallucination safe. The BAA is a compliance control. It is not a clinical safety control. Clinical safety requires guardrails, not BAAs.

  • Wrong #7: “We’ll add the BAA when we go to production, development is fine on standard API.”

Development environments that use real patient data, even for testing, require the same BAA coverage as production. If your development environment uses real EHR data to test your AI feature, your healthcare development environment is handling ePHI and requires a BAA with every service it uses. Use synthetic or de-identified data in development. Use real ePHI only in environments with full BAA coverage.

  • Wrong #8: “Self-hosting the LLM on our own servers means we don’t need a BAA.”

If you self-host the LLM on your own HIPAA-compliant cloud infrastructure (AWS with a BAA covering the EC2 instances, S3 storage, and other services you use), you do not need a separate BAA with an LLM provider, you are the operator of the infrastructure. But you still need BAA coverage for the cloud infrastructure itself (the AWS BAA), and you still need to ensure that no other service in your inference pipeline touches ePHI without a BAA.

Why the LLM BAA Question Is More Complex Than It Looks

Most healthcare founders encounter the LLM BAA question in one of two ways: a developer on the team says “I want to use GPT-4 for this feature,” or a healthcare attorney says “you need a BAA for any service that touches ePHI.” The conversation that follows is often frustratingly vague, because the answer is genuinely nuanced.

Here is why:

  • The BAA is not a binary yes/no:

A BAA between your company and an LLM provider is a contract. Like all contracts, the terms matter. Two BAAs from two different LLM providers can both be called “HIPAA BAAs” while covering very different scopes, very different services, very different data use limitations, and very different liability allocations. Reading the BAA is not optional. Having your healthcare attorney review the BAA is not optional.

Specifically, the terms that matter most for an LLM BAA:

Which services are covered: Does the BAA cover the specific API endpoint you are using? Does it cover fine-tuning? Does it cover model training on your data? Does it cover embeddings? Each service may be covered or excluded separately.

Data retention terms: How long does the provider retain the data sent in prompts? Is prompt data retained at all? Is it used to improve models? Under what conditions?

Sub-processor disclosure: Who are the subprocessors that handle the ePHI? Are they disclosed? Do they have their own BAA obligations? A provider that uses multiple sub-processors to run inference, GPU cloud providers, caching layers, logging services, may have HIPAA sub-processor obligations that are not obvious from the top-level BAA.

Breach notification terms: What is the provider’s commitment for notifying you of a breach? HIPAA requires Business Associates to notify Covered Entities of breaches, but the specific timeline and process vary by BAA.

  • The pipeline is more than the LLM:

The LLM call is one step in a clinical AI pipeline that typically includes: data ingestion from the EHR (FHIR F4 API or HL7 v2), data storage (database), prompt construction (application code), LLM API call, response handling, output storage, and delivery to the provider or patient. Every step that touches ePHI is a potential BAA gap.

The services most commonly missed in clinical AI pipeline BAA analysis:

Transcription service: For ambient documentation products, audio transcription precedes the LLM call. The transcription service processes audio that contains patient ePHI. Amazon Transcribe Medical (under AWS BAA) is the most common compliant option. Consumer transcription services, Google Speech-to-Text on standard terms, Whisper via OpenAI API on standard terms, are not covered by a healthcare BAA without a specific enterprise agreement.

Vector database: For RAG-based clinical AI, a vector database stores embeddings of clinical documents. If those documents contain ePHI, the vector database is handling ePHI and requires a BAA. Pinecone, Weaviate, and other vector databases have enterprise BAA options.

LLM observability platform: Langsmith, Helicone, Braintrust, and similar LLM monitoring tools may capture prompt content, including ePHI, in logs. These platforms require BAA coverage if they are logging prompts that contain ePHI.

Prompt cache: Some LLM infrastructure uses prompt caching to reduce latency and cost. If the cached prompts contain ePHI, the caching layer requires BAA coverage.

Error monitoring: Sentry, Datadog, and similar error monitoring tools may capture request/response payloads when an error occurs. If those payloads contain ePHI, the monitoring tool requires BAA coverage.

  • The intended use changes the BAA requirement:

Using an LLM to process ePHI for treatment purposes (generating a clinical summary for a treating provider) is a permissible use of ePHI under HIPAA. Using an LLM to process ePHI for research purposes (training a model on patient data to improve future clinical recommendations) requires separate authorization or de-identification.

The BAA does not distinguish between these uses, the data use policy in the service agreement governs whether specific uses are permitted.

Understand not just the BAA but the service agreement terms for data use before committing to any LLM provider for clinical AI.

07 compliance dashboard

The 2026 LLM-Under-BAA Landscape, Every Option, Honestly Evaluated

Here is the complete 2026 picture. I am going to be more specific than most guides, because “check with your legal team” is not useful advice when you are deciding which LLM to build on at 11 PM IST before a sprint kickoff.

The current BAA landscape as of 2026:

Provider BAA Available Plan Required Models Covered Data Training Use Our Assessment
AWS Bedrock Yes Standard AWS (any paid) Claude (Anthropic), Llama, Mistral, Titan, Stable Diffusion No training on customer data per AWS terms Default recommendation
Azure OpenAI Yes Enterprise with HIPAA config GPT-4o, GPT-4 Turbo, GPT-4, GPT-3.5, embeddings No training on customer data per Microsoft terms Strong for Azure ecosystem
OpenAI API Yes Enterprise agreement only GPT-4o, GPT-4 Turbo, o1, o3, embeddings No training on customer data under Enterprise Valid with legal review
Anthropic API Yes Enterprise negotiation required Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku Confirm specific terms in BAA Direct option with negotiation
Google Vertex AI Yes (with config) Enterprise with HIPAA setup Gemini 1.5 Pro/Flash, Med-PaLM 2 (select partners) No training per Google Cloud terms Good for GCP ecosystem
Self-hosted (any) N/A, you own infra HIPAA-compliant cloud Llama 3, Mistral, Mixtral, clinical fine-tunes You control completely Maximum control, highest overhead
Standard OpenAI API No May train on inputs Never use with ePHI
Standard Anthropic API No standard BAA Confirm current terms Do not use with ePHI without enterprise BAA
Hugging Face Inference API No healthcare BAA Do not use with ePHI
Replicate No healthcare BAA Do not use with ePHI
Groq (standard) No healthcare BAA Do not use with ePHI

Important disclaimer: BAA availability, service coverage, and data use terms change. Verify current status directly with each provider before clinical data flows through their service. This table reflects the landscape as of mid-2026, confirm with your healthcare attorney before relying on it for compliance decisions.

The 14-Question LLM Compliance Readiness Audit

Before your engineering team makes the first LLM API call with real patient data:

  • Does the ePHI touch the LLM at all?

Map the exact data that enters the prompt. Does it include patient name, date of birth, MRN, diagnosis codes, medication names, lab values, clinical notes, or any other element of the 18 HIPAA identifiers linked to health information? If yes, a BAA is required with the LLM provider.

  • Have you obtained and executed a BAA with the LLM provider?

Not “they said they have a BAA available.” The BAA is executed, signed by both parties, in your document management system, reviewed by your healthcare attorney.

  • Does the BAA cover the specific API endpoint you are using?

The BAA may cover the chat completions endpoint but not fine-tuning. It may cover text generation but not image generation. Confirm the specific endpoint coverage in the BAA language.

  • Have you reviewed the service agreement data use terms?

Does the provider retain prompt data? For how long? Is it used for model training? Under what conditions can the provider access your prompt data? The BAA governs ePHI handling requirements. The service agreement governs what the provider does with your data beyond HIPAA minimum requirements.

  • Who are the LLM provider’s sub-processors that may touch ePHI?

Cloud infrastructure providers, GPU rental providers, caching layers, any sub-processor in the inference chain that may touch ePHI must be disclosed in the BAA or service agreement, and must themselves have HIPAA-appropriate controls.

  • Have you mapped the full ePHI pipeline, not just the LLM call?

Audio transcription, EHR data ingestion, prompt construction, LLM API call, response handling, output storage, delivery layer, error monitoring, observability platform, vector database, every service in this pipeline that touches ePHI requires BAA coverage.

  • Is your transcription service (if applicable) HIPAA-BAA-covered?

Amazon Transcribe Medical (under AWS BAA), Deepgram for Healthcare (enterprise BAA), AssemblyAI (healthcare BAA available). Consumer transcription via standard OpenAI Whisper API does not have a standard healthcare BAA.

  • Is your vector database (if applicable) HIPAA-BAA-covered?

Pinecone (enterprise BAA), Weaviate (enterprise BAA), self-hosted Chroma (you own the infrastructure). Confirm current BAA availability for your specific vector database.

  • Is your LLM observability platform (if applicable) HIPAA-BAA-covered?

If you are logging prompts for debugging or monitoring, check whether your observability platform has a healthcare BAA. Most LLM observability platforms are not purpose-built for healthcare and do not have standard BAAs.

  • Is your development environment using real ePHI for testing?

Development and staging environments that use real patient data require the same BAA coverage as production. Use de-identified or synthetic data in development.

  • What is your prompt data retention policy?

How long do you retain prompts that contain ePHI? What is your deletion schedule? Prompt retention creates ongoing HIPAA obligations for the retained data.

  • What is your incident response plan for an LLM-related data breach?

If ePHI is improperly transmitted to an LLM provider without a BAA, or if an LLM provider’s systems are breached, what is your response? Your incident response plan must specifically address LLM-related incidents.

  • How do you validate that ePHI is not leaking into non-BAA-covered services?

Technical controls that prevent ePHI from being logged, cached, or transmitted to services without BAA coverage. These should be automated controls, not policy-only controls that depend on engineers remembering the rule.

  • How will you handle LLM provider BAA changes?

LLM providers update their terms of service and BAA coverage as their services evolve. Who monitors for changes? What is your process for evaluating the impact on your compliance posture when a provider updates its terms?

AWS Bedrock, The Default and Why

AWS Bedrock is our default recommendation for most clinical AI features that process ePHI. Here is the complete picture, including the limitations that most guides skip.

What AWS Bedrock is:

AWS Bedrock is a fully managed service that provides access to foundation models from multiple AI providers, Anthropic (Claude), Meta (Llama), Mistral, Amazon (Titan), Stability AI (Stable Diffusion), and others, through AWS’s infrastructure and APIs.

You do not interact directly with Anthropic, Meta, or Mistral when using their models through Bedrock, you interact with AWS Bedrock, and AWS manages the model serving infrastructure.

Why Bedrock is the default for ePHI processing:

BAA coverage is the AWS BAAWS Bedrock is on the HIPAA-eligible services list. If you have the standard AWS BAA (which any AWS account with healthcare data should have), Bedrock is covered. No separate enterprise negotiation. No minimum spend threshold. No waiting for legal review of a new BAA. Your existing AWS BAA covers Bedrock.

No data training on customer datAWS’s terms for Bedrock explicitly state that AWS does not use customer data to train the foundation models. Your prompts and completions are not used to improve Claude, Llama, or any other Bedrock model. This is a critical data governance commitment that matters beyond the BAA.

Model selection breadth: Bedrock provides access to Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku, Llama 3 (70B and 8B), Mistral 7B, Mixtral 8x7B, and Titan models through the same AWS interface. You can evaluate multiple models for your specific clinical AI use case without managing separate provider relationships.

AWS ecosystem integration: If your clinical AI product is already on AWS, which most digital health products are, Bedrock integrates cleanly with your existing architecture: VPC configuration, IAM access controls, CloudTrail logging (the Bedrock API calls are logged in CloudTrail, providing the audit trail you need for HIPAA), and encryption with AWS KMS.

CloudTrail audit logging: Every Bedrock API call is logged in AWS CloudTrail, who made the call, when, from which IAM identity, to which model, with which parameters. This CloudTrail log is your HIPAA audit trail for LLM access. No additional audit infrastructure is required.

The Bedrock limitations, what the marketing doesn’t say:

Not every frontier model is on Bedrock: OpenAI’s GPT-4o and o1/o3 series are not available on Bedrock, OpenAI is a competitor to AWS. The most recent Anthropic models typically arrive on Bedrock within weeks of direct API availability, but Bedrock’s model catalog lags behind direct provider APIs for some providers.

Bedrock availability regions matter for HIPANot all AWS regions support Bedrock. Not all Bedrock models are available in all regions. For HIPAA-eligible service use, you must confirm that the Bedrock model you are using is available in a HIPAA-eligible AWS region (us-east-1, us-west-2, and select others). Check the current Bedrock regional availability before architecting.

Model performance is provider-dependent: The Claude 3.5 Sonnet you access through Bedrock is the same model as the Claude 3.5 Sonnet you access through Anthropic’s direct API, the model weights are identical. But model performance can be affected by inference configuration, latency, and throughput, which Bedrock manages and which differ from Anthropic’s direct API in some configurations.

Bedrock fine-tuning has specific constraints: If you want to fine-tune a model on Bedrock, the fine-tuning process has specific data format requirements, supported model constraints, and data storage requirements (your fine-tuning data must be in S3, covered by the AWS BAA). The fine-tuning capability is available for some models but not all.

The Bedrock implementation for clinical AI:

import boto3

import json

 

# Initialize the Bedrock client, IAM role must have bedrock:InvokeModel permission

# Using the us-east-1 region, confirm HIPAA-eligible status for your region

bedrock_client = boto3.client(

    service_name=’bedrock-runtime’,

    region_name=’us-east-1′

)

 

def generate_clinical_summary(patient_context: dict, clinical_notes: str) -> str:

    “””

    Generate a clinical summary using Claude via AWS Bedrock.

    ePHI is processed under the AWS HIPAA BAA.

    Prompt is logged in CloudTrail for HIPAA audit trail.

    “””

    

    # System prompt with clinical safety constraints

    system_prompt = “””You are a clinical documentation assistant.

    

    CRITICAL RULES:

    – Summarize ONLY what is documented in the provided clinical notes

    – Do NOT add clinical interpretation not present in the notes

    – Do NOT include medications not explicitly mentioned

    – Use uncertainty language for ambiguous findings: ‘documented as’, ‘reported’, ‘noted’

    – If information is unclear, state ‘Documentation unclear, provider to review’

    – This summary is for clinical reference only and requires provider review before use”””

    

    # Construct prompt with ePHI

    # This prompt is processed under the AWS BAA

    user_message = f”””

    Patient context: {json.dumps(patient_context)}

    

    Clinical notes to summarize:

    {clinical_notes}

    

    Generate a structured clinical summary for specialist referral.

    “””

    

    response = bedrock_client.invoke_model(

        modelId=’anthropic.claude-3-5-sonnet-20241022-v2:0′,

        body=json.dumps({

            “anthropic_version”: “bedrock-2023-05-31”,

            “max_tokens”: 1024,

            “system”: system_prompt,

            “messages”: [

                {“role”: “user”, “content”: user_message}

            ]

        })

    )

    

    response_body = json.loads(response[‘body’].read())

    return response_body[‘content’][0][‘text’]

This call is logged in CloudTrail automatically, no additional logging infrastructure required for HIPAA audit compliance for the LLM call itself.

Bedrock Guardrails:

AWS Bedrock Guardrails is a native feature that allows you to define content filters, topic denials, and PII detection for Bedrock-based applications. For clinical AI, Bedrock Guardrails can:

  • Block responses that contain certain content categories (violence, hate speech, etc.)
  • Detect and mask PII in inputs and outputs (names, SSNs, email addresses, though clinical ePHI detection is less complete than general PII detection)
  • Deny specific topics that are out of scope for your application

Bedrock Guardrails is not a substitute for clinical hallucination guardrails, it is a content safety layer, not a clinical accuracy layer. Use both.

04 why bedrock default

Azure OpenAI with HIPAA Mode, The Microsoft Ecosystem Play

Azure OpenAI is the right choice when you are already in the Microsoft ecosystem, Azure infrastructure, Microsoft 365, Teams-integrated clinical workflows, or when you specifically need GPT-4o class performance and your enterprise agreement with Microsoft is already in place.

  • What HIPAA mode means for Azure OpenAI:

Azure OpenAI does not have a separate “HIPAA mode” toggle that you turn on. HIPAA compliance for Azure OpenAI means: you are using Azure OpenAI within an Azure environment where the Azure BAA is in place, the specific Azure OpenAI service is on Microsoft’s list of HIPAA-covered services, and your Azure configuration meets Microsoft’s guidance for HIPAA-compliant Azure deployments.

The Azure BAA covers Azure OpenAI when properly configured. “Properly configured” means: confirming the specific Azure OpenAI API version and deployment you are using is on the HIPAA-covered services list, not using features that are explicitly excluded from HIPAA coverage (some preview features are excluded), and ensuring your Azure environment has the security configuration that the BAA assumes (encryption, access controls, audit logging through Azure Monitor).

  • Models available through Azure OpenAI:

GPT-4o, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo, text embedding models (text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002). Azure OpenAI does not provide access to Anthropic, Meta, or Mistral models, it is an OpenAI-specific service. For multi-model access, use AWS Bedrock.

  • Azure OpenAI data processing terms:

Microsoft’s data processing terms for Azure OpenAI under an enterprise agreement state that customer data is not used to train the foundation models. This is the same commitment as AWS Bedrock, your prompts and completions are not used to improve GPT-4.

  • The Azure OpenAI deployment model:

Unlike the direct OpenAI API (which is a shared multi-tenant service), Azure OpenAI deployments are provisioned within your Azure subscription. Your model deployment is isolated, your requests do not share compute with other customers’ requests. This provides a higher level of data isolation than the direct OpenAI API.

For healthcare applications with strict data isolation requirements, some federal health programs, some large health system enterprise contracts, Azure OpenAI’s deployment isolation model may be preferred over shared API infrastructure.

  • Azure OpenAI implementation considerations:

Azure OpenAI API is compatible with the OpenAI SDK, you can switch from direct OpenAI API to Azure OpenAI API by changing the endpoint URL and adding your Azure-specific credentials. The prompt format, response format, and model behavior are identical.

The authentication model is different: Azure OpenAI uses Azure Active Directory (Entra ID) for authentication, not OpenAI’s API keys. For applications that are already integrated with Azure AD, this is seamless. For applications that are not in the Azure ecosystem, it adds authentication complexity.

OpenAI Enterprise BAA, The GPT-4 Direct Route

OpenAI’s Enterprise tier includes a Business Associate Agreement and is the route for healthcare organizations that specifically need GPT-4o, o1, or o3 model performance and want to access these models directly through OpenAI rather than through Azure.

  • What the OpenAI Enterprise BAA covers:

The OpenAI Enterprise BAA covers the API services included in the Enterprise agreement. As of 2026, this includes: the Chat Completions API (GPT-4o, GPT-4 Turbo, o1, o3), the Embeddings API, the fine-tuning API (with specific data handling terms), and the Assistants API with specific configuration requirements.

The BAA does not automatically cover every OpenAI service or every model. Confirm with OpenAI’s enterprise team which specific API endpoints and models are covered by the BAA before committing to them for ePHI processing.

  • Data training under OpenAI Enterprise:

OpenAI’s Enterprise terms state that customer data submitted through the API under an Enterprise agreement is not used to train OpenAI’s models by default. The default data retention period under Enterprise is 30 days for API inputs and outputs, retained for safety monitoring purposes before deletion. Confirm the current data retention terms in your specific Enterprise agreement.

  • OpenAI Enterprise pricing:

OpenAI Enterprise is not a self-service signup. It requires a sales engagement with OpenAI’s enterprise team. Pricing is negotiated based on usage volume. Minimum commitments vary. Expect a 4–8 week sales process before an Enterprise agreement is executed.

The per-token pricing under Enterprise is higher than the standard API pricing but includes the BAA, zero data retention options (available for some Enterprise agreements), dedicated support, and higher rate limits.

  • When to choose OpenAI Enterprise over Azure OpenAI:

Choose OpenAI Enterprise over Azure OpenAI when: you need the latest OpenAI models at the earliest possible release (OpenAI releases new models on its direct API before Azure OpenAI), you need OpenAI-specific features (Assistants API with specific configuration, OpenAI’s specific fine-tuning pipeline), you do not have an existing Azure enterprise relationship, or you prefer a direct relationship with OpenAI rather than Microsoft.

Choose Azure OpenAI over OpenAI Enterprise when: you are already in the Azure ecosystem, you need deployment isolation rather than shared multi-tenant infrastructure, or your enterprise relationship with Microsoft makes Azure OpenAI more commercially favorable.

  • The honest limitation:

OpenAI Enterprise BAA is a valid compliance path. It is also a slower, more expensive, and more complex path than AWS Bedrock for most healthcare startups. Unless you have a specific reason to prefer direct OpenAI API access, the Bedrock path to Claude (which gives you the same Anthropic models) is simpler and available on your existing AWS BAA without enterprise negotiation.

Anthropic Enterprise BAA, The Claude Direct Route

Anthropic’s Claude models are available through two routes: AWS Bedrock (under the AWS BAA) and Anthropic’s direct API (under an Anthropic Enterprise BAA). For most clinical AI use cases, AWS Bedrock provides the easier compliance path. Here is when the direct Anthropic route makes sense.

  • What the Anthropic Enterprise BAA covers:

Anthropic offers a Business Associate Agreement to enterprise customers through direct negotiation with Anthropic’s enterprise sales team. The BAA covers the Anthropic API for the services specified in the enterprise agreement, Claude chat completions, embeddings, and potentially other services depending on the specific agreement.

As with every BAA, the specific scope, data retention terms, and sub-processor disclosures in the Anthropic Enterprise BAA must be reviewed by your healthcare attorney before clinical data flows through the API.

  • Why you might choose Anthropic Direct over Bedrock:

Latest model availability: Anthropic releases new Claude models on the direct API before AWS Bedrock. If your clinical AI use case is sensitive to the specific capabilities of the latest Claude version, direct API access ensures you have the newest model immediately on release.

Anthropic-specific features: Some Anthropic API features, Claude’s extended context window usage patterns, specific prompt caching configurations, Anthropic’s Constitutional AI features, may be available on the direct API before or in different configurations than on Bedrock.

Regulatory preference: Some healthcare legal teams prefer a direct relationship with the model provider (Anthropic) rather than an indirect relationship through a cloud provider (AWS). The legal clarity of a direct BAA with the model creator is appealing to some compliance officers.

  • The honest limitation:

The Anthropic Enterprise BAA requires direct negotiation with Anthropic’s enterprise team, a 4–8 week process with minimum commitment thresholds. For most healthcare startups below $10M ARR, AWS Bedrock provides Claude access under the existing AWS BAA without enterprise negotiation. The Bedrock route is operationally simpler and commercially accessible at earlier stages.

The direct Anthropic Enterprise BAA becomes the right choice when: your product is at a scale where the enterprise negotiation is justified by usage volume, you need direct Anthropic API features not yet available on Bedrock, or your compliance posture specifically requires a direct BAA with the model provider rather than a BAA with the cloud infrastructure provider.

Google Cloud Vertex AI, The Gemini and Med-PaLM Option

Google Cloud Vertex AI is the Google Cloud Platform’s managed AI service, providing access to Google’s foundation models, the Gemini family, and to Med-PaLM 2, Google’s healthcare-domain-specific foundation model available to select clinical AI partners.

  • HIPAA coverage for Vertex AI:

Google Cloud’s HIPAA BAA covers Google Cloud services that are on Google’s list of HIPAA-covered products. Vertex AI is on this list, but the specific models and API endpoints must be confirmed against the current HIPAA-covered products list, as Google periodically updates which specific Vertex AI services are covered.

The correct process: confirm current Vertex AI HIPAA coverage with your Google Cloud account team before using Vertex AI for ePHI processing. Do not rely on a list that is more than 90 days old, Google’s covered services list changes.

  • Med-PaLM 2, the clinical domain model:

Med-PaLM 2 is Google’s foundation model specifically trained on medical data, medical literature, clinical guidelines, and curated clinical question-answering datasets. It is notable for achieving performance on the USMLE benchmark comparable to expert physician performance.

Med-PaLM 2 access is not a standard API, it requires a partner engagement with Google Health AI. For digital health companies building clinical decision support, clinical question answering, or medical knowledge applications, Med-PaLM 2 is worth pursuing as a partner.

The clinical performance advantage of a medically-trained model like Med-PaLM 2 over a general-purpose model like GPT-4 or Claude for clinical knowledge tasks is real, but the performance differential varies significantly by specific clinical domain and task type. Evaluate on your specific use case, not on benchmark performance alone.

  • When to choose Vertex AI:

Your product is already on Google Cloud Platform and you have an existing enterprise relationship with Google.

You are building a clinical knowledge application where Med-PaLM 2’s medical training data represents a meaningful performance advantage. You need Google-specific AI features, Google’s multimodal capabilities, Google’s specific embedding models, for your clinical AI architecture.

When not to choose Vertex AI over Bedrock: when you are on AWS, when multi-model access across providers is important, or when you do not yet have a Google Cloud enterprise relationship that includes HIPAA BAA coverage.

Self-Hosted Open-Source Models, The Maximum Control Option

Self-hosting an open-source LLM on your own HIPAA-compliant cloud infrastructure is the highest-control, highest-complexity option. It eliminates the dependency on any external LLM provider’s BAA, because you are running the model yourself, within your own infrastructure that is covered by your existing cloud provider BAA.

  • When self-hosting is the right choice:

Maximum data sovereignty: For clinical AI products where the ePHI sensitivity is highest, psychiatric records, HIV/AIDS status, substance use disorder records (42 CFR Part 2), or clinical data from high-profile patients, some organizations require that data never leave their own controlled infrastructure. Self-hosting is the only option that provides this guarantee.

Enterprise customer requirements: Some large health systems and federal healthcare programs require that AI inference occur on-premises within the health system’s own infrastructure. For these customers, self-hosting within the health system’s environment (or a dedicated tenant on the health system’s cloud) is required.

Fine-tuning with maximum control: When you need to fine-tune a model on clinical ePHI without any external provider touching the training data, self-hosting allows the fine-tuning process to occur entirely within your HIPAA-compliant infrastructure.

Volume economics at scale: At very high inference volumes, hundreds of thousands of LLM calls per day, the per-token cost of commercial APIs can exceed the cost of self-hosted GPU infrastructure. At scale, self-hosting may be commercially preferable.

  • The open-source clinical models available in 2026:

Llama 3 (Meta): 70B parameter and 8B parameter versions. The 70B model produces output quality competitive with GPT-3.5 Turbo for many tasks. For clinical tasks requiring nuanced reasoning, the 70B model is generally preferred. Meta releases Llama under a research and commercial license, confirm the license terms for your specific use case.

Mistral 7B and Mixtral 8x7B: Mistral 7B is a highly efficient model that punches above its parameter count. Mixtral 8x7B (a mixture-of-experts architecture) provides GPT-3.5 Turbo-level quality in a self-hosted configuration. Both are available under the Apache 2.0 license.

BioMistral: A fine-tuned version of Mistral 7B trained on medical literature from PubMed Central. Shows improved performance on clinical knowledge tasks compared to base Mistral 7B. Available under research license.

ClinicalCamel: A fine-tuned Llama 2 model trained on clinical reasoning data. Designed for clinical decision support tasks. Available for research use.

Meditron: A Llama 2-based model fine-tuned on medical guidelines, clinical notes, and medical literature. Developed by EPFL. Competitive performance on clinical benchmarks.

  • The honest limitations of self-hosting:

Infrastructure cost: A 70B parameter model requires significant GPU memory for inference, a minimum of 2 × A100 80GB GPUs for full-precision inference, or 1 × A100 80GB with 4-bit quantization. On AWS, that is approximately $12–$20/hour for the compute. At 16 hours/day of active inference, that is $70,000–$117,000/year in compute costs before storage, networking, and operations overhead.

Inference optimization complexity: Running a large LLM in production at acceptable latency (under 3 seconds for most clinical AI use cases) requires significant inference optimization: model quantization, KV cache optimization, batching, and potentially tensor parallelism across multiple GPUs. This is specialized infrastructure engineering.

Model management overhead: You are responsible for model updates, security patching of the inference infrastructure, monitoring, scaling, and availability. This is ongoing engineering work that a managed API service handles for you.

Performance ceiling: The best open-source models in 2026 are competitive with GPT-3.5 Turbo for many tasks and approaching GPT-4 quality for some tasks. For clinical AI requiring the highest level of reasoning quality, complex clinical decision support, nuanced clinical note generation, frontier commercial models (Claude 3.5 Sonnet, GPT-4o) still have a performance advantage for most tasks.

  • The self-hosting decision framework:

Self-host when: data sovereignty requirements make any external API unacceptable, enterprise customer requirements mandate on-premises inference, your fine-tuning data is identified ePHI that cannot leave your infrastructure, or inference volume makes self-hosting commercially preferable.

Use a managed BAA-covered API when: any of the above conditions do not apply, time-to-market matters, your team does not have specialized LLM inference infrastructure expertise, or the performance advantage of frontier commercial models is clinically meaningful for your use case.

The Full Pipeline BAA Map, Every Service That Touches ePHI

This is the map that most clinical AI products are missing. The LLM API call is one step in a pipeline. Every step that touches ePHI needs a BAA.

Here is the complete pipeline for an ambient clinical documentation product, the most complex clinical AI pipeline in terms of ePHI handling:

  • Step 1: Audio Capture Service: Device microphone (provider’s mobile app or desktop app) ePHI involved: Session audio containing patient voice and clinical discussion BAA required: No, this is your own application running on the provider’s device. You are the operator of the audio capture. But the data must be encrypted in transit from the moment of capture.
  • Step 2: Audio Transmission Service: Your application → your cloud infrastructure ePHI involved: Audio stream containing ePHI BAA required: Your cloud infrastructure BAA (AWS BAA) covers the transmission and storage infrastructure.
  • Step 3: Audio Transcription Service: Amazon Transcribe Medical (most common HIPAA-compliant option) ePHI involved: The audio being transcribed BAA required: Yes, covered under the AWS BAA (Amazon Transcribe Medical is on the HIPAA-eligible services list) Alternative: Deepgram for Healthcare (enterprise BAA available), AssemblyAI (healthcare BAA available) Not acceptable: Standard Whisper API (OpenAI), Google Speech-to-Text without enterprise HIPAA configuration, standard Deepgram without healthcare BAA
  • Step 4: Transcript Storage Service: Your database (AWS RDS, DynamoDB) ePHI involved: The transcribed text BAA required: Covered under AWS BAA (both RDS and DynamoDB are HIPAA-eligible services) Must: Encryption at rest enabled, audit logging for access
  • Step 5: EHR Data Retrieval Service: EHR FHIR API (Epic, Cerner, Athena) ePHI involved: Patient demographics, medication list, allergy list, problem list BAA required: Yes, your BAA with the health system covers the data exchange. The EHR is a Covered Entity; you are a Business Associate under the BAA.
  • Step 6: Prompt Construction Service: Your application code (Lambda, EC2, ECS) ePHI involved: Transcript + EHR data combined into the prompt BAA required: Covered under AWS BAA for the compute infrastructure
  • Step 7: LLM API Call Service: AWS Bedrock (recommended) ePHI involved: The full prompt including audio transcript and EHR data BAA required: Yes, covered under the AWS BAA (Bedrock is on the HIPAA-eligible services list) CloudTrail logging: The Bedrock API call is logged in CloudTrail automatically, this is your audit trail for the LLM access
  • Step 8: Response Handling and Storage Service: Your application + your database ePHI involved: The LLM-generated clinical note (which may contain ePHI derived from the transcript and EHR data) BAA required: Covered under AWS BAA
  • Step 9: Provider Review Interface Service: Your web or mobile application ePHI involved: The generated note displayed to the provider for review BAA required: Covered under AWS BAA for the backend infrastructure. The frontend delivery is your application.
  • Step 10: EHR Note Delivery Service: EHR FHIR write API or HL7 v2 ePHI involved: The signed clinical note being written back to the EHR BAA required: Covered under your health system BAA
  • Step 11: LLM Observability (if enabled) Service: Langsmith, Helicone, Braintrust, or similar ePHI involved: If prompt logging is enabled, the full prompt including ePHI is logged BAA required: YES, and most LLM observability platforms do not have standard healthcare BAAs Action: Disable prompt logging in production, OR confirm your observability platform has a healthcare BAA, OR use AWS CloudWatch (covered under AWS BAA) for logging instead of a third-party LLM observability platform
  • Step 12: Error Monitoring Service: Sentry, Datadog, or similar ePHI involved: If error monitoring captures request/response payloads, ePHI may be captured in error traces BAA required: Datadog has a healthcare BAA (enterprise plan). Sentry does not have a standard healthcare BAA. Action: Configure error monitoring to NOT capture request body contents for LLM API calls, OR confirm your error monitoring platform has a healthcare BAA, OR use AWS CloudWatch for error logging instead
  • Step 13: Audio Deletion Service: AWS S3 with automated deletion ePHI involved: Stored audio files BAA required: Covered under AWS BAA Action: Automated deletion after 72 hours (recommended retention period for ambient documentation audio)

The BAA gap pattern:

Looking at this pipeline, the BAA gaps that most clinical AI products have are at Steps 11 and 12, LLM observability and error monitoring. These are the tools that engineering teams add naturally for debugging and monitoring purposes. They are the tools that most commonly touch ePHI without BAA coverage.

The fix: configure observability and error monitoring to scrub or exclude LLM prompt content, OR confirm BAA coverage for each tool, OR use AWS-native logging (CloudWatch, CloudTrail) which is covered under the AWS BAA.

03 full ephi pipeline

Clinical Use Case to LLM Architecture Decision Matrix

Different clinical AI use cases have different requirements that influence the LLM architecture decision. Here is the decision matrix across the most common clinical AI use cases.

  • Ambient Clinical Documentation (AI Scribe)

Primary requirement: Low latency (under 3 seconds for note generation), high-quality long-form text generation, clinical constraint following

Recommended LLM: Claude 3.5 Sonnet via AWS Bedrock Reason: Claude excels at following complex instruction sets, critical for the clinical constraint prompts that prevent hallucination. Bedrock provides low latency at clinical scale. Claude’s long context window handles long transcripts without truncation.

BAA path: AWS Bedrock (standard AWS BAA) Key pipeline BAA gaps to address: transcription service (Amazon Transcribe Medical), observability platform

  • Clinical Decision Support (Risk Stratification, Care Gap Detection)

Primary requirement: Structured output (JSON or specific formats), reasoning quality for multi-factor clinical assessment, reliability at scale

Recommended LLM: Claude 3.5 Sonnet via AWS Bedrock for highest reasoning quality; Llama 3 70B (self-hosted or via Bedrock) for cost optimization at high volume

BAA path: AWS Bedrock (standard AWS BAA) or self-hosted on HIPAA-compliant infrastructure Key consideration: Bias analysis of model outputs across demographic subgroups required before clinical deployment

  • Patient-Facing Health Navigation

Primary requirement: Clear, plain-language responses appropriate for health literacy Level 6, safety guardrails for crisis detection, multilingual capability

Recommended LLM: GPT-4o via Azure OpenAI (for health literacy and multilingual quality) or Claude 3 Haiku via AWS Bedrock (for cost optimization at high patient interaction volume)

BAA path: Azure OpenAI with HIPAA configuration or AWS Bedrock Key consideration: System prompt for health literacy (sixth-grade reading level), crisis escalation keyword detection, explicit “I don’t know” response design

  • Prior Authorization Automation

Primary requirement: Document parsing, structured data extraction, criteria matching, audit trail for decisions

Recommended LLM: Claude 3.5 Sonnet via AWS Bedrock (for complex document reasoning) or GPT-4o via Azure OpenAI (for structured output reliability)

BAA path: AWS Bedrock or Azure OpenAI Key consideration: Every prior auth decision that influences coverage determination requires documented rationale, structured output with explicit criteria citations

  • Clinical Coding (ICD-10/CPT Suggestion)

Primary requirement: Medical knowledge, code specificity, reasoning transparency

Recommended LLM: Med-PaLM 2 via Vertex AI (for partners with access) for highest medical knowledge performance; Claude 3.5 Sonnet via AWS Bedrock as the general alternative

BAA path: Google Vertex AI (with HIPAA configuration) or AWS Bedrock Key consideration: Coding suggestions must be presented as suggestions for human coder review, not as automatic code assignments

  • Clinical Research Summarization (de-identified data)

Primary requirement: High-quality summarization, literature synthesis, structured output

Recommended LLM: Any capable model, since the data is de-identified, HIPAA BAA is not required Options: Direct Claude API, direct OpenAI API, Gemini via Google AI Studio, any capable model Key consideration: Verify de-identification rigor before concluding BAA is not required

  • Fine-Tuning on Clinical ePHI

Primary requirement: Maximum data control, fine-tuning capability, compliance documentation

Recommended approach: Self-hosted on HIPAA-compliant AWS infrastructure (EC2 with GPU, S3 for training data, SageMaker for managed training)

BAA path: AWS BAA covers the entire fine-tuning infrastructure Key consideration: Fine-tuning data governance: de-identification status, IRB approval if research use, data retention and deletion policy for training data

The Real Cost Stack for LLM Infrastructure in 2026

Understanding the economics before committing to an architecture is essential. Here are the real numbers.

1. AWS Bedrock pricing (2026, confirm current rates at aws.amazon.com/bedrock/pricing):

Claude 3.5 Sonnet (Anthropic via Bedrock):

  • Input: $3.00 per million tokens
  • Output: $15.00 per million tokens

Claude 3 Haiku (Anthropic via Bedrock):

  • Input: $0.25 per million tokens
  • Output: $1.25 per million tokens

Llama 3 70B (Meta via Bedrock):

  • Input: $2.65 per million tokens
  • Output: $3.50 per million tokens

Titan Text (Amazon via Bedrock):

  • Input: $0.80 per million tokens
  • Output: $1.60 per million tokens

2. Clinical AI cost modeling:

Ambient documentation at 500 sessions/day: Average session: 8,000 input tokens (transcript + EHR context) + 1,500 output tokens (SOAP note) Daily cost (Claude 3.5 Sonnet): (500 × 8,000 × $3.00/1M) + (500 × 1,500 × $15.00/1M) = $12.00 + $11.25 = $23.25/day = $8,486/year At $89/provider/month subscription: 500 sessions implies approximately 25 active providers. Revenue: 25 × $89 × 12 = $26,700/year. LLM cost: $8,486/year. LLM as percentage of revenue: 32%, needs optimization.

Switch to Claude 3 Haiku for lower-stakes portions of the pipeline (transcript initial processing before the final SOAP generation with Sonnet): cost reduction of 60–70% for the processing steps where Haiku performs adequately. Hybrid model architecture: Haiku for initial entity extraction, Sonnet for final SOAP generation. Approximate hybrid cost: $3,500–$4,500/year at 500 sessions/day.

Patient-facing health navigation at 10,000 interactions/day: Average interaction: 1,500 input tokens + 300 output tokens Daily cost (Claude 3 Haiku): (10,000 × 1,500 × $0.25/1M) + (10,000 × 300 × $1.25/1M) = $3.75 + $3.75 = $7.50/day = $2,738/year Patient-facing AI at volume is commercially favorable with Haiku-class models.

  • Transcription costs:

Amazon Transcribe Medical: $0.0086/second of audio At 500 sessions/day × 15 minutes average: 500 × 900 seconds = 450,000 seconds/day × $0.0086 = $3,870/day = $1,412,550/year

This is the dominant cost in an ambient documentation product at scale. The transcription cost at 500 sessions/day is 166× the LLM inference cost. At 50 sessions/day (early stage): $38,700/year in transcription, still significant.

Transcription cost optimization: real-time transcription (required for live session display) costs more per minute than batch transcription (acceptable if the note is generated after the session ends). If real-time display is not required for clinical workflow, batch transcription reduces cost by 20–30%.

3. Azure OpenAI pricing:

Azure OpenAI Enterprise pricing is negotiated and not publicly listed. Expect pricing similar to or slightly higher than AWS Bedrock for equivalent models, with volume discounts available at enterprise scale.

OpenAI Enterprise pricing:

Negotiated. Typically comparable to Azure OpenAI for equivalent models with enterprise commitments.

Self-hosted infrastructure costs:

AWS EC2 p4d.24xlarge (8 × A100 40GB): $32.77/hour on-demand, $19.92/hour reserved (1-year) At 16 hours/day active inference (8 hours overnight minimal load), annual cost: $116,000 on-demand or $70,600 reserved for one instance.

A 70B parameter model on one p4d.24xlarge at 4-bit quantization: achievable. Performance at scale (handling 500 concurrent clinical sessions): requires multiple instances or inference optimization.

Total self-hosted infrastructure cost for production ambient documentation at 500 sessions/day: $150,000–$300,000/year including compute, storage, networking, and operations overhead. Commercial for a product with meaningful revenue; premature for early-stage.

The cost optimization framework:

Use the smallest model that meets clinical quality requirements. Claude 3 Haiku at $0.25/$1.25 per million tokens is 12× cheaper than Claude 3.5 Sonnet at $3.00/$15.00. For clinical tasks where Haiku produces acceptable output quality, use Haiku.

Build a hybrid pipeline: use a smaller, cheaper model for initial processing (entity extraction, classification, intent detection) and a larger, more capable model only for the final output generation step that requires the highest quality. This reduces cost by 50–70% for many clinical AI pipelines without meaningfully degrading output quality.

Cache repeated context: for clinical AI that repeatedly processes the same reference materials (medical guidelines, formulary data, institutional protocols), implement prompt caching to reduce the input token count for repeated context.

Use embeddings for retrieval rather than full documents in context: a RAG architecture that retrieves the three most relevant chunks of clinical guidelines, rather than including all clinical guidelines in every prompt, reduces input tokens significantly.

Prompt Engineering for Clinical Safety Under BAA

Having a BAA does not make your clinical AI safe. The BAA manages the compliance risk. Prompt engineering manages the clinical risk. Both are required.

  • The clinical safety system prompt architecture:

Every clinical AI system prompt must include explicit clinical safety constraints. Here is the framework:

Section 1, Role and scope definition: Define exactly what the AI is and is not. “You are a clinical documentation assistant. You summarize clinical encounter information. You are not a clinician. You do not diagnose, treat, or recommend treatment.” This section limits the scope of the AI’s activity and reduces the risk of out-of-scope clinical outputs.

Section 2, Source fidelity requirements: “Include ONLY information that is explicitly present in the provided clinical notes or transcript. Do NOT add clinical information not documented in the source material. Do NOT infer diagnoses or clinical findings that are not explicitly documented.”

This is the hallucination prevention section. It is the most important section of the clinical system prompt. It is also the section most commonly absent or weak in clinical AI products that hallucinate.

Section 3, Uncertainty handling: “When clinical information is ambiguous, incomplete, or unclear in the source material: Use ‘Documentation unclear, provider to review’ rather than inferring a specific value. Use ‘Documented as’ or ‘Reported’ before clinical findings to attribute information to the source rather than stating it as objective fact.”

Section 4, Out-of-scope query handling: “If asked a question outside the scope of clinical documentation summarization, including requests for clinical advice, diagnosis, or treatment recommendations, respond: ‘This is outside my function. Please consult a clinician.’ Do not attempt to answer clinical questions outside the defined scope.”

Section 5, Safety escalation language (for patient-facing AI): “If the patient’s message indicates an emergency, suicidal ideation, or immediate safety concern, respond immediately with [crisis resource language] and do not attempt to address the clinical question before the safety escalation.”

  • The clinical output schema:

Structured outputs, JSON schema output from the LLM, reduce hallucination risk for clinical AI by constraining the model to produce content in a predefined format with predefined fields. A SOAP note structured output schema:

{

  “subjective”: {

    “chief_complaint”: “string, patient’s stated reason for visit, quoted from transcript”,

    “history_of_present_illness”: “string, documented history only”,

    “review_of_systems”: “array of documented symptoms only”

  },

  “objective”: {

    “vital_signs”: “object, include only values explicitly in source”,

    “physical_examination”: “string, documented findings only”

  },

  “assessment”: {

    “diagnoses”: “array, include only diagnoses explicitly discussed in encounter”,

    “confidence_note”: “string, any diagnostic uncertainty or items requiring provider clarification”

  },

  “plan”: {

    “medications”: “array, include ONLY medications explicitly mentioned in encounter”,

    “orders”: “array, include ONLY orders explicitly placed during encounter”,

    “follow_up”: “string, as documented in encounter”,

    “provider_action_required”: “array, items requiring provider decision or clarification not resolved in encounter”

  }

}

The provider_action_required field is the most important innovation in clinical SOAP note structured output, it gives the AI a specific place to flag anything it is uncertain about rather than fabricating a resolution.

The minimum necessary data principle in prompts:

Include only the patient data elements the prompt actually needs. A prompt generating a specialist referral summary does not need the patient’s full 10-year medication history, it needs the medications relevant to the current presenting condition. A prompt generating discharge instructions does not need the patient’s HIV status or psychiatric history.

Implement minimum necessary data selection at the EHR data retrieval step, before the data enters the prompt. The fewer data elements in the prompt, the smaller the ePHI exposure, and the more focused the model’s attention on the relevant clinical context.

Fine-Tuning Under BAA, The Data Governance Rules

Fine-tuning is the process of training a pre-trained foundation model on a specific dataset to improve its performance on a target task or domain. For clinical AI, fine-tuning can improve performance on clinical language understanding, clinical note generation quality, and domain-specific reasoning.

  • When fine-tuning is appropriate for clinical AI:

Fine-tuning makes sense when: your clinical AI use case requires specific language patterns or formatting that cannot be reliably achieved through prompt engineering alone (certain EHR-specific note formats, specialty-specific clinical terminology), your production volume is high enough that fine-tuning a smaller model to match a larger model’s performance on your specific task produces meaningful cost savings, or you have a high-quality labeled clinical dataset that represents the specific clinical context of your deployment.

Fine-tuning is overkill when: prompt engineering can achieve acceptable performance (for most clinical AI use cases with well-designed prompts, it can), your dataset is too small for fine-tuning to provide meaningful improvement (typically under 1,000 high-quality examples), or the compliance overhead of fine-tuning on ePHI is not justified by the performance improvement.

  • The data governance requirements for fine-tuning under HIPAA:

Option 1, Fine-tune on de-identified data (preferred): De-identify the training dataset under HIPAA Safe Harbor (§164.514(b)), removing all 18 HIPAA identifiers. De-identified data is not ePHI and can be used for fine-tuning without HIPAA restrictions. The fine-tuning process does not require a BAA with the fine-tuning service provider (since the data is not ePHI).

The de-identification challenge: clinical notes are difficult to de-identify completely. Narrative text frequently contains identifiers that are not in the 18-item Safe Harbor list, implicit references to rare diagnoses that could identify a patient in a small population, specific dates combined with clinical details, or descriptive information that is re-identifiable in context. 

Expert Determination de-identification (§164.514(b)(1)), certification by a statistical or scientific expert that the re-identification risk is very small, is more robust than Safe Harbor for clinical narrative text.

Option 2, Fine-tune on ePHI within HIPAA-compliant infrastructure (when de-identification is not possible): When the clinical data cannot be de-identified without destroying the linguistic patterns needed for fine-tuning, fine-tuning on ePHI is possible under HIPAA if:

The fine-tuning occurs within HIPAA-compliant cloud infrastructure under an existing BAA (AWS SageMaker under the AWS BAA, for example).

The training data is stored in HIPAA-compliant encrypted storage (AWS S3 with encryption under the AWS BAA).

The fine-tuned model weights do not leak ePHI, the model must be evaluated to confirm that the ePHI in the training data is not extractable from the model weights through adversarial prompting. This is a research-active area; work with a machine learning security expert to evaluate extraction risk for your fine-tuned model.

The data use for fine-tuning falls within a permissible HIPAA use, healthcare operations (improving the organization’s clinical AI tools for its own clinical operations) or with patient authorization. Research uses of ePHI for fine-tuning may require IRB approval.

Option 3, Fine-tune via AWS Bedrock fine-tuning (managed): AWS Bedrock supports fine-tuning for select models (Titan, some Llama models) with training data stored in S3 (under the AWS BAA). The fine-tuning process occurs within AWS’s infrastructure, covered under the AWS BAA. This is the cleanest compliance path for fine-tuning with ePHI when Bedrock’s supported models are appropriate for the use case.

  • Fine-tuning data documentation requirements:

Regardless of which option is chosen, document the fine-tuning dataset provenance:

  • What clinical data was used (source, date range, clinical domain)
  • De-identification method (Safe Harbor or Expert Determination, with the Expert Determination statistician’s certification)
  • Any consents or authorizations obtained
  • Retention period for the fine-tuning dataset
  • Deletion schedule for the fine-tuning dataset after the model is trained

This documentation is part of your HIPAA risk management program and your model governance program. It will be requested by enterprise health system customers and by SOC 2 auditors.

LLM Observability and Monitoring Under HIPAA

LLM observability, monitoring the quality, performance, and behavior of LLM-based clinical AI in production, is essential for catching hallucinations, monitoring model drift, and ensuring clinical safety. It is also a significant HIPAA compliance challenge.

  • The HIPAA problem with standard LLM observability:

Standard LLM observability tools, Langsmith, Helicone, Braintrust, Weights & Biases, work by logging prompt inputs and model outputs to their platforms. This enables you to review past conversations, analyze model performance, debug issues, and monitor for quality problems.

When the prompt contains ePHI, which it does in almost every clinical AI application, logging the prompt to any of these platforms means sending ePHI to a third-party service. That service must have a HIPAA BAA covering this use.

Most LLM observability platforms are not designed specifically for healthcare and do not have standard healthcare BAAs. Some have enterprise BAA options, but this is the exception, not the rule.

  • The four approaches to LLM observability under HIPAA:

Approach 1, Disable prompt logging, log metadata only: Log what you can without logging ePHI: request timestamp, model ID, input token count, output token count, response latency, completion reason (finished, max tokens, content filter), and any error codes. This metadata is enough to monitor performance and detect anomalies, without logging the ePHI content.

The limitation: you cannot review specific prompts and responses when debugging a reported hallucination or quality issue. When a provider reports that “the AI generated an incorrect medication in the note for patient X,” you cannot retrieve the original prompt and response to investigate.

Approach 2, Log to AWS CloudWatch (under AWS BAA): AWS CloudWatch is on the HIPAA-eligible services list. Logging LLM inputs and outputs to CloudWatch means the logs are stored in a HIPAA-covered service under your existing AWS BAA. CloudWatch log groups can be encrypted with KMS. Access is controlled through IAM.

The tradeoff: CloudWatch provides log storage and basic querying (CloudWatch Logs Insights) but does not provide the structured LLM-specific analysis features of purpose-built observability platforms. Reviewing individual conversations, computing aggregate statistics by clinical use case, or comparing model versions requires more custom tooling than LangSmith or Helicone provides out of the box.

Approach 3, De-identify before logging: Apply a PHI detection and redaction layer to prompts and responses before logging them to any observability platform. Replace patient names with [PATIENT_NAME], dates with [DATE], and other identifiers with labeled placeholders. The redacted prompt/response pair can be logged to any observability platform without triggering HIPAA BAA requirements (assuming the redaction is complete and correct).

The challenge: clinical AI prompts contain implicit ePHI that is difficult to detect and redact reliably, specific medication combinations, rare diagnoses, or clinical details that are re-identifiable. Imperfect redaction is worse than no redaction because it creates false confidence in compliance. If you use this approach, validate the redaction quality rigorously before relying on it for compliance.

Approach 4, Self-hosted observability: Run an open-source LLM observability tool (Langfuse, Helicone open-source, Phoenix from Arize) on your own HIPAA-compliant cloud infrastructure. This gives you the full observability features, prompt logging, conversation review, performance analytics, without sending data to a third-party service. The logs stay within your AWS infrastructure under your existing AWS BAA.

The overhead: running and maintaining an observability service is an additional operational responsibility. For mature clinical AI products at scale, this is the most complete solution. For early-stage products, AWS CloudWatch logging with custom dashboards is a reasonable starting point.

  • The hallucination monitoring approach under HIPAA:

For monitoring hallucination rates in production clinical AI, the key signal is provider edit rate, how often providers edit specific fields in AI-generated clinical outputs. This signal is available without logging ePHI: log the field ID, whether it was edited (boolean), and optionally the field type, without logging the original AI content or the provider’s correction.

At aggregate scale, elevated edit rates on specific field types (medication field edited in >30% of notes) indicate systematic hallucination in that field type, triggering engineering investigation without requiring access to specific ePHI-containing logs.

When the BAA Is Not Enough, Additional Safeguards

The BAA is a compliance control. Clinical AI safety requires additional safeguards that operate independently of the BAA.

  • The safeguards that the BAA does not provide:

Hallucination prevention: The BAA ensures the LLM provider handles your data according to HIPAA requirements. It does not ensure the LLM generates clinically accurate content. Hallucination prevention is an engineering problem, not a compliance problem. See the guardrail architecture in Guide 10.

Clinical appropriateness: The BAA does not ensure the AI’s clinical outputs are clinically appropriate for the patient population, the clinical context, or the clinical use case. Clinical appropriateness requires clinical oversight, a licensed clinician who reviews the AI’s outputs and confirms they are clinically sound before the AI is deployed.

Demographic equity: The BAA does not ensure the AI development equitably across demographic subgroups. Bias analysis, performance evaluation by age, sex, race/ethnicity, and other relevant subgroups, is required for clinical AI regardless of which LLM provider is used.

Adverse event reporting: The BAA does not require the LLM provider to report clinical adverse events caused by hallucinated outputs. Your organization is responsible for medical device reporting (if your product is FDA-regulated SaMD) and for complaint handling in response to clinical quality issues.

  • The technical safeguards that operate alongside the BAA:

Prompt injection defense: Clinical AI that processes external data sources, patient-generated text, EHR notes from multiple providers, clinical documentation from referring providers, is vulnerable to prompt injection attacks where adversarial content in the input overrides the system prompt. Implement input sanitization and structured input formatting to reduce injection risk.

Output validation: For clinical AI that generates structured outputs (JSON, structured notes), validate the output against the expected schema before displaying it to the provider. Invalid outputs, missing required fields, fields exceeding expected value ranges, should trigger regeneration or flagging, not be displayed as-is.

Clinical entity validation: Validate medications, diagnoses, and lab values in AI outputs against authoritative clinical reference databases. A medication that does not exist in RxNorm, a diagnosis code that does not exist in ICD-10, or a lab value that is outside physiologically plausible ranges indicates a hallucination. These should be flagged before the output reaches the provider.

Data retention minimization: Retain ePHI in your clinical AI infrastructure for the minimum period necessary for the clinical function. Audio: 72 hours after note generation. Prompts containing ePHI: consider ephemeral processing (process in memory, do not persist to storage) where possible. The ePHI that does not exist in storage cannot be breached.

The LLM-Under-BAA Scorecard™

Score each row 0 (absent), 1 (partial), or 2 (fully present). Maximum score: 70.

# Criterion Weight Your Score
1 BAA executed with LLM provider before any ePHI enters the pipeline /4
2 BAA scope reviewed by healthcare attorney, specific services confirmed covered /4
3 LLM provider data training terms reviewed, ePHI not used for model training confirmed /4
4 Full ePHI pipeline mapped, every service that touches ePHI identified /4
5 BAA confirmed for transcription service (if audio processing in scope) /4
6 BAA confirmed for vector database (if RAG architecture in scope) /2
7 LLM observability platform either disabled from logging ePHI or has healthcare BAA /4
8 Error monitoring configured to not capture ePHI in request/response payloads /4
9 CloudTrail (or equivalent) audit logging enabled for LLM API calls /4
10 Development environment uses de-identified or synthetic data (not real ePHI) /4
11 Clinical safety system prompt with explicit hallucination prevention constraints /4
12 Grounding validation layer implemented for clinical outputs /4
13 Clinical entity validation (RxNorm, ICD-10, allergy check) for AI outputs /4
14 Human review gate before AI-generated content enters clinical record /4
15 Audio retention policy (72-hour default) with automated deletion /2
16 Prompt retention minimization, ephemeral processing where possible /2
17 Fine-tuning data governance documented (de-identification method or ePHI authorization basis) /2
18 LLM provider sub-processors disclosed and HIPAA handling confirmed /2
19 Model version registry, every production model version logged with deployment dates /2
20 Clinical AI output attributable to specific model version /2
21 Bias analysis by demographic subgroup before clinical deployment /2
22 Prompt injection defenses implemented for external data sources /2
23 LLM BAA change monitoring, process for detecting provider BAA term changes /2
24 Incident response plan specifically addresses LLM-related ePHI exposure /2
25 BAA registry maintained with all LLM pipeline services /2

Score interpretation:

  • 55–70: Strong LLM compliance and clinical safety posture, ready for enterprise clinical deployment
  • 40–54: Proceed with identified gaps remediated, ePHI pipeline and clinical safety 2× items are non-negotiable
  • Under 40: Significant HIPAA and clinical safety exposure, do not process real ePHI through AI pipeline until gaps are closed

Conclusion

The prompt that should never have reached OpenAI reached OpenAI for one reason: the engineering team that built the feature knew about LLMs and did not know about BAAs. The compliance team that would have caught it had not yet been hired. The healthcare attorney who eventually reviewed the exposure was not in the loop when the sprint started.

That gap, between the speed at which clinical AI features get built and the compliance infrastructure that needs to govern them, is the gap this guide exists to close.

The BAA landscape in 2026 is actually workable. AWS Bedrock gives you Claude, Llama, and Mistral under the AWS BAA you already have. Azure OpenAI and OpenAI Enterprise are available for teams with the enterprise relationships to support them. Self-hosting is viable for teams with the infrastructure maturity to run it. The options exist.

The failure mode is not that compliant options are unavailable. The failure mode is that the sprint starts before anyone asks the question.

Ask the question in Week 1 of discovery. Map the full ePHI pipeline before the first API call. Execute the BAA before clinical data flows. Review the data training terms, not just the BAA existence. These are not complex steps. They are sequencing steps. Get the sequence right.

If you want 30 minutes to map your clinical AI pipeline and identify the BAA gaps before they cost you $140,000 and four months, book a call with me or Aditi. No slides. No pitch. Just the pipeline conversation.

FAQ

Can I use the standard OpenAI API (not enterprise) for a healthcare application? Not for processing ePHI. The standard OpenAI API, the pay-as-you-go tier, does not include a Business Associate Agreement. Sending ePHI through any API without a BAA with the API provider is a HIPAA violation. For development and testing with de-identified or synthetic data, the standard API is acceptable. For any processing of real patient data, you need either an OpenAI Enterprise BAA or an alternative with BAA coverage (AWS Bedrock, Azure OpenAI with HIPAA configuration).

Does the AWS BAA cover all AWS Bedrock models? AWS Bedrock is on the HIPAA-eligible services list, and the AWS BAA covers HIPAA-eligible services. However, not all models available through Bedrock are available in all AWS regions, and HIPAA-eligible service coverage requires using Bedrock in a HIPAA-eligible AWS region (us-east-1, us-west-2, and select others). Confirm current regional availability and HIPAA eligibility for the specific Bedrock model you are using before processing clinical ePHI.

Is fine-tuning on clinical data HIPAA-compliant? Fine-tuning on de-identified clinical data (de-identified under HIPAA Safe Harbor or Expert Determination) is HIPAA-compliant, de-identified data is not ePHI. Fine-tuning on identified ePHI requires that the fine-tuning process occur within HIPAA-compliant infrastructure under an existing BAA, that the data use falls within a permissible HIPAA purpose (healthcare operations or patient authorization), and that the fine-tuned model is evaluated for ePHI extraction risk. AWS Bedrock fine-tuning with training data in S3 (under the AWS BAA) is the cleanest compliance path for fine-tuning within a managed service.

What is the difference between a BAA with the LLM provider and data processing terms? A BAA is a HIPAA-required contract that commits the provider to handle ePHI in accordance with HIPAA requirements, including safeguards, breach notification, and permitted use limitations. Data processing terms (in the service agreement) govern what the provider does with your data beyond HIPAA minimum requirements, specifically, whether they use your data for model training, how long they retain it, and who has access to it. Both must be reviewed. A BAA without reviewing data processing terms may leave you with compliant ePHI handling but with model training use of your clinical data that you did not expect.

Does logging LLM prompts for debugging create a HIPAA problem? Yes, if the prompts contain ePHI and the logging service does not have a HIPAA BAA. Standard LLM observability platforms (Langsmith, Helicone, Braintrust) do not have standard healthcare BAAs. Options: disable ePHI-containing prompt logging and log metadata only; log to AWS CloudWatch (HIPAA-eligible under AWS BAA); de-identify prompts before logging (with validated redaction); or run open-source observability (Langfuse) on your own HIPAA-compliant infrastructure.

Can I use Anthropic’s Claude without going through AWS Bedrock? Yes, through Anthropic’s Enterprise BAA. Anthropic offers a BAA to enterprise customers through direct negotiation with their enterprise sales team. This requires an enterprise agreement (minimum commitment thresholds apply, 4–8 week sales process). For most early-stage to mid-stage healthcare startups, AWS Bedrock provides access to Claude under the existing AWS BAA without enterprise negotiation, making it the operationally simpler path to Claude for ePHI processing.

How do I monitor LLM clinical AI performance without logging ePHI? Log metadata without ePHI: request timestamp, model version, input token count, output token count, response latency, completion reason, and error codes. Log provider edit behavior without ePHI: which field was edited (boolean, field identifier), field type (medication, diagnosis, assessment), and session identifier, without the ePHI content. At aggregate scale, elevated edit rates on specific field types indicate systematic quality issues requiring investigation. For cases where specific prompt/response review is required (debugging a reported hallucination), use CloudWatch logging under the AWS BAA, which preserves full prompt/response content within HIPAA-compliant infrastructure.

Is Med-PaLM 2 available for clinical AI startups? Med-PaLM 2 access is not a standard self-service API. It requires a partner engagement with Google Health AI. If you are building clinical knowledge applications, medical question answering, clinical literature synthesis, clinical knowledge base, Med-PaLM 2’s medical training data may provide a meaningful performance advantage over general-purpose models. Reach out to Google Cloud’s healthcare team to explore the partner pathway.

What happens if an LLM provider changes their BAA terms? LLM providers update their terms of service, privacy policies, and BAA terms as their services evolve. Changes to BAA terms that expand the provider’s data use rights, reduce their security commitments, or add new sub-processors can affect your HIPAA compliance posture. Implement a process to monitor for BAA and terms of service updates from every LLM provider in your clinical AI pipeline. Subscribe to provider change notifications, review the HIPAA-covered services list annually for each cloud provider, and have your healthcare attorney evaluate material changes to BAA terms.

Can a self-hosted LLM process ePHI without a BAA? If you self-host the LLM on your own HIPAA-compliant cloud infrastructure, AWS EC2 with encryption, under your existing AWS BAA, you do not need a BAA with a separate LLM provider because there is no separate LLM provider. You are the operator of the inference infrastructure. The AWS BAA covers the cloud infrastructure (EC2, S3, networking). You are responsible for ensuring the infrastructure meets HIPAA Security Rule requirements (encryption at rest and in transit, access controls, audit logging, vulnerability management). Self-hosting eliminates the LLM provider BAA dependency while creating the infrastructure management responsibility.

How should I handle a situation where ePHI was sent to an LLM API without a BAA? Treat it as a potential HIPAA breach. Immediately disable the feature that is sending ePHI to the unprotected API. Engage your healthcare attorney for a breach risk assessment, the HIPAA Breach Risk Assessment evaluates the probability that ePHI was compromised, considering the nature of the data, the nature of the unauthorized recipient, and available safeguards. If the risk assessment concludes there is more than a low probability that ePHI was compromised, HIPAA Breach Notification Rule obligations apply, notification to affected individuals within 60 days. Document the incident, the risk assessment, and the remediation in your HIPAA records. Rebuild the AI feature under a compliant architecture before re-enabling.