SR 11-7 for ML models: conceptual soundness in practice

SR 11-7 — the Federal Reserve's 2011 Supervisory Guidance on Model Risk Management, issued jointly with the OCC as Bulletin 2011-12 — remains the foundational US standard for governing models in banking. It was written for an era of regression-based credit scorecards and value-at-risk engines, not transformers. But supervisors have been explicit that it is technology-neutral: a model is a model, and a machine-learning or large-language model that informs a business decision sits within its scope. The challenge is that ML models strain several of SR 11-7's assumptions, and meeting the guidance for them requires translating its principles into the vocabulary of modern modeling.

The definition is broad on purpose

SR 11-7 defines a model as a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates. That definition comfortably covers a gradient-boosted credit model, a neural-network fraud detector, and — increasingly the live question — an LLM used to summarize, classify, or draft within a decision workflow. The guidance also frames model risk as having two sources: a model may have fundamental errors, or it may be used incorrectly or inappropriately. Both sources apply with full force to ML, and the second — misuse — is acute for general-purpose models repurposed beyond their validated intent.

Conceptual soundness when the model is a black box

The first pillar of SR 11-7's validation triad is an evaluation of conceptual soundness — assessing the quality of the model's design and construction, including the evidence supporting the methods used and the variables selected. For a linear scorecard this is tractable: you can read the coefficients. For a high-dimensional ML model, conceptual soundness cannot mean inspecting every parameter, so it shifts to a different set of questions that a validator must be able to answer and document.

Is the modeling approach appropriate for the problem, and is the choice justified against alternatives rather than assumed?
Is the training data representative of the population the model will serve, and have its limitations and biases been examined?
Are the input variables sensible and defensible — including, critically, the absence of features that would constitute proxies for protected characteristics in a lending context?
Is the model's behavior interpretable to the degree the use case demands, using post-hoc explainability such as SHAP or LIME feature-attribution where the model is not intrinsically transparent?
Are the limitations understood and documented, with a clear statement of the conditions under which the model should not be relied upon?
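
One of the questions above, whether an input acts as a proxy for a protected characteristic, lends itself to a first-pass screen. The sketch below is a minimal illustration, not a fair-lending test: it flags features whose linear correlation with a protected attribute exceeds a cutoff. The feature names, the toy data, and the 0.5 cutoff are all assumptions made for the example and do not come from SR 11-7. A linear screen also misses non-linear and combined proxies, so a clean result is not evidence of absence.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)

def screen_for_proxies(features, protected, threshold=0.5):
    """Return (feature, correlation) pairs whose absolute correlation with the
    protected attribute meets the threshold, strongest first.
    The threshold is an illustrative assumption, not a regulatory value."""
    flagged = []
    for name, values in features.items():
        r = pearson(values, protected)
        if abs(r) >= threshold:
            flagged.append((name, round(r, 3)))
    return sorted(flagged, key=lambda item: -abs(item[1]))

# Hypothetical toy data: 1 = member of the protected group, 0 = not.
protected = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "zip_code_cluster": [1, 1, 1, 0, 0, 0, 1, 0],  # tracks the group exactly
    "debt_to_income": [0.31, 0.42, 0.28, 0.35, 0.40, 0.30, 0.38, 0.33],
}
flagged = screen_for_proxies(features, protected)
print(flagged)
```

A flagged feature is a prompt for the validator to ask why the variable is in the model and whether its business justification holds, which is the documentation the conceptual-soundness review needs either way.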

Effective challenge is the cultural test, not a document

SR 11-7 elevates "effective challenge" to a first-class principle, defining it as critical analysis by objective, informed parties who can identify model limitations and assumptions and produce appropriate changes. For ML this means validation should not rest with the team that built the model: the guidance accepts that developers may carry out some validation work, but only subject to critical review by an independent party, and the challenger must have the standing and the competence to push back. The guidance ties the strength of effective challenge to incentives, competence, and influence. A validation function that cannot, in practice, block a model from production is not delivering effective challenge, however thorough its report reads.

Ongoing monitoring — where ML models actually fail

The second validation pillar is ongoing monitoring, and it is where machine-learning models most diverge from the static models SR 11-7 imagined. The guidance requires monitoring to confirm that a model is implemented appropriately and continues to perform as intended — and ML models degrade in ways that are quiet and continuous. The world drifts away from the training distribution; the relationship the model learned weakens; performance erodes before anyone files a complaint. SR 11-7 names benchmarking and the analysis of overrides as monitoring tools, but the operative requirement for ML is statistical drift and performance surveillance.

1. Input drift: has the distribution of incoming features moved away from the training baseline? The Population Stability Index is the standard measure, with a commonly cited threshold of PSI above 0.25 signaling a material shift; the Kolmogorov-Smirnov statistic is a distribution-free complement.
2. Performance degradation: are accuracy, precision, and recall against realized outcomes holding to the levels established at validation, or sliding?
3. Stability across subpopulations: is degradation uniform, or concentrated in a demographic segment in a way that also raises fair-lending exposure?
4. Override analysis: how often, and why, are humans overriding the model? SR 11-7 specifically flags overrides as a monitoring signal about model fitness.

Outcomes analysis and the audit trail

The third pillar, outcomes analysis, compares model outputs to actual outcomes — back-testing realized performance against what the model predicted. For this to be possible at all, the model's predictions, the inputs that produced them, and the eventual outcomes must have been recorded contemporaneously and durably. SR 11-7's emphasis on documentation — it states that model development should be documented so that activities can be tracked and a third party could understand and assess the work — combined with banking record-retention expectations, makes a tamper-evident decision log a practical necessity. A retention horizon on the order of seven years is the conventional planning assumption for model-decision records, though the governing schedule depends on the institution's records policy and applicable rule, and should be set with compliance rather than from a rule of thumb.
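
To make "tamper-evident" concrete, here is a minimal sketch of a hash-chained decision log, assuming SHA-256 over a canonical JSON encoding and an in-memory list as the store. The record fields (`decision_id`, `score`, and so on) are hypothetical. The sketch shows detection only: a real deployment would also need durable write-once storage and an externally anchored head hash, since anyone able to rewrite the whole chain could otherwise recompute it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash for the first entry

def _digest(body):
    # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_entry(log, record):
    """Append a record, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"seq": len(log), "prev_hash": prev_hash, "record": record}
    log.append({**body, "hash": _digest(body)})
    return log[-1]

def verify_chain(log):
    """Return the index of the first entry that fails verification,
    or None if the whole chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        body = {
            "seq": entry["seq"],
            "prev_hash": entry["prev_hash"],
            "record": entry["record"],
        }
        if (entry["seq"] != i
                or entry["prev_hash"] != prev_hash
                or entry["hash"] != _digest(body)):
            return i
        prev_hash = entry["hash"]
    return None

log = []
# The prediction is logged at decision time, with the inputs that produced it.
append_entry(log, {"type": "prediction", "decision_id": "d-001",
                   "model_version": "3.2.0",
                   "inputs": {"income": 52000, "dti": 0.31}, "score": 0.18})
# The realized outcome is logged later and linked by decision_id.
append_entry(log, {"type": "outcome", "decision_id": "d-001", "defaulted": False})

intact = verify_chain(log)       # None: nothing has been altered
log[0]["record"]["score"] = 0.90  # simulate an after-the-fact edit
broken_at = verify_chain(log)    # 0: the first entry no longer matches its hash
print(intact, broken_at)
```

Because outcomes are appended as separate entries and never written back into the prediction entry, back-testing can join the two by `decision_id` without ever modifying what was recorded at decision time.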

The model inventory and risk tiering SR 11-7 expects

SR 11-7 is explicit that model risk management starts with knowing what models you have. It calls for a comprehensive inventory of models in use, under development, or recently retired, and it ties the rigor of validation to a model's materiality and complexity — high-materiality models warrant the deepest validation and the most frequent review, while lower-stakes models can be governed proportionately. For an ML portfolio this inventory has to capture more than a name and an owner. It needs the model's purpose and the decisions it informs, its risk tier, its training-data lineage, the version currently in production, and a pointer to its validation and monitoring records. The reason is practical: when a regulator asks about a specific decision, the inventory is the index that lets you find the model, its tier, and its evidence quickly — and an institution that cannot produce a current, complete inventory has failed the first test before validation quality is even examined.

The tiering question becomes sharper as models proliferate. A scattershot of notebooks and ad-hoc scripts that influence decisions but were never inventoried is precisely the shadow-model risk SR 11-7's inventory mandate exists to surface. Bringing every decision-influencing model — including the ones that crept in through a business unit's experimentation — into a single tiered inventory is what makes proportionate governance possible: you cannot apply the right level of scrutiny to a model you do not know exists.
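
The inventory fields listed above translate naturally into a record type. The following is a sketch under assumptions: the field names, the three-level tier scale, and the example models are all hypothetical, and SR 11-7 does not prescribe a schema. What it illustrates is the practical test in the text: whether every inventoried model carries the evidence pointers an examiner would ask for.

```python
from dataclasses import dataclass
from enum import Enum

class RiskTier(Enum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3

@dataclass
class ModelRecord:
    model_id: str
    owner: str
    purpose: str
    decisions_informed: list
    risk_tier: RiskTier
    training_data_lineage: str = ""
    production_version: str = ""
    validation_record: str = ""   # pointer to validation evidence
    monitoring_record: str = ""   # pointer to monitoring evidence
    lifecycle_state: str = "in_use"  # in_use, in_development, or retired

REQUIRED = ("purpose", "decisions_informed", "training_data_lineage",
            "production_version", "validation_record", "monitoring_record")

def inventory_gaps(records):
    """Map model_id to the required fields that are still empty."""
    gaps = {}
    for rec in records:
        missing = [name for name in REQUIRED if not getattr(rec, name)]
        if missing:
            gaps[rec.model_id] = missing
    return gaps

# Hypothetical portfolio: one governed model, one notebook that crept in.
credit = ModelRecord(
    model_id="credit-gbm", owner="retail-credit-risk",
    purpose="Score consumer loan applications",
    decisions_informed=["loan approval"], risk_tier=RiskTier.HIGH,
    training_data_lineage="applications snapshot v7",
    production_version="3.2.0",
    validation_record="val/credit-gbm", monitoring_record="mon/credit-gbm",
)
notebook = ModelRecord(
    model_id="churn-notebook", owner="marketing-analytics",
    purpose="Rank customers for retention offers",
    decisions_informed=["offer targeting"], risk_tier=RiskTier.LOW,
)

gaps = inventory_gaps([notebook, credit])
by_tier = sorted([notebook, credit], key=lambda r: r.risk_tier.value)
print(gaps)
print([r.model_id for r in by_tier])
```

Sorting by tier gives the validation function its work queue, and the gap report shows at a glance which models have been inventoried in name only.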

LLMs strain the framework in specific ways

Large-language models sit uneasily inside a framework built for models with stable inputs and quantitative outputs, and the friction points are worth naming because each maps to an SR 11-7 principle.

Non-determinism: the same prompt can yield different outputs, which complicates the reproducibility outcomes analysis assumes. Capturing the exact input, model version, and key generation parameters at decision time is what restores a reproducible record.
Use beyond validated intent: a general-purpose model validated for summarization that quietly starts informing eligibility decisions is the "used incorrectly or inappropriately" risk SR 11-7 names — and it is invisible without an inventory that records each model's approved use.
Opaque training data: when the base model's training corpus is not disclosed, the data-representativeness analysis that conceptual soundness calls for is harder, and the assessment has to lean more heavily on behavioral testing and documented limitations.
New failure modes: hallucination and prompt-injection are model risks in SR 11-7's sense — sources of erroneous output — that did not exist for scorecards, and a monitoring program has to account for them.
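
The first two friction points suggest a simple control at the call site: refuse uses that are not on the model's approved list, and record the exact input, model version, and generation parameters for every call that is allowed. The sketch below assumes a hypothetical approved-use register and hypothetical parameter names (`temperature`, `top_p`, `max_tokens`, `seed`); which parameters a given model exposes will vary. Recording them preserves what happened at decision time, but it does not by itself guarantee that a replay produces the same output.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical approved-use register; in practice this would be read
# from the model inventory rather than hard-coded.
APPROVED_USES = {"doc-summarizer": {"summarization"}}

@dataclass(frozen=True)
class LLMCallRecord:
    model_id: str
    model_version: str
    use_case: str
    prompt: str
    prompt_sha256: str
    temperature: float
    top_p: float
    max_tokens: int
    seed: Optional[int]
    output: str
    recorded_at: str

def record_llm_call(model_id, model_version, use_case, prompt, output, *,
                    temperature, top_p, max_tokens, seed=None):
    """Build an immutable record of one LLM call, refusing unapproved uses."""
    if use_case not in APPROVED_USES.get(model_id, set()):
        raise PermissionError(
            f"{model_id} is not approved for use case '{use_case}'")
    return LLMCallRecord(
        model_id=model_id,
        model_version=model_version,
        use_case=use_case,
        prompt=prompt,
        prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        seed=seed,
        output=output,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )

# Approved use: the call is recorded with everything needed to audit it.
record = record_llm_call(
    "doc-summarizer", "2024-06-01", "summarization",
    "Summarize the attached loan file.", "The applicant requests ...",
    temperature=0.0, top_p=1.0, max_tokens=256,
)

# Unapproved use: the same model quietly repurposed for eligibility.
try:
    record_llm_call(
        "doc-summarizer", "2024-06-01", "eligibility_decision",
        "Is this applicant eligible?", "Yes",
        temperature=0.0, top_p=1.0, max_tokens=16,
    )
    blocked = ""
except PermissionError as exc:
    blocked = str(exc)
print(record.prompt_sha256[:12], blocked)
```

The resulting record is the kind of entry a decision log would store, so the use-beyond-intent check and the reproducibility record come from the same code path.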

Independence is structural, not procedural

SR 11-7's insistence that validation be independent of development is easy to satisfy on paper and hard to satisfy in substance, especially for fast-moving ML teams where the people who understand the model are the people who built it. The guidance's answer is that independence is about authority and incentives: the validation function must be organizationally positioned to deliver a finding that the business does not want to hear, and to have that finding stick. Where a model's outputs can also be checked by an independent secondary process rather than only by self-attestation, effective challenge gains operational support and does not rest on documentation alone.

How Pratvi helps

Pratvi AI maps onto SR 11-7's validation triad. The AI Model Inventory holds the comprehensive, risk-tiered inventory the guidance demands — design rationale, data lineage, variable justification, approved use, version, owner, and lifecycle state — at the depth a validator and an examiner expect, so conceptual-soundness evidence and the model index live in one system of record. The Drift & Performance Monitor delivers the ongoing-monitoring pillar: Population Stability Index against a rolling baseline, Kolmogorov-Smirnov testing, performance-degradation tracking against the validation benchmark, and demographic-differential drift, with automatic retraining flags. The Confidence & Verification module supports effective challenge operationally — independent secondary-model verification and disagreement surfacing — and tracks the override signal SR 11-7 calls out. And the Immutable Audit Trail provides the durable, hash-chained prediction-and-outcome record that makes outcomes analysis and a credible third-party review possible, with retention configurable to your model-risk records policy.

This article is educational and does not constitute legal advice. Regulatory requirements change and apply differently by jurisdiction and facts — confirm specifics with qualified counsel. References to Pratvi AI modules describe platform capability and do not imply certification.

The modules that map to these obligations

Each module below is implemented in the platform today. Inclusion of a regulation indicates capability, never certification.

AI Model Inventory

One source of truth for every AI system in your organization.

Catalog every AI / ML system, foundation model integration, and AI-assisted decision flow with risk classification, ownership, lifecycle state, and a full dependency graph. Required for OMB M-24-10 federal inventories, NAIC AI bulletin governance, and EU AI Act Annex IV technical documentation.

Drift & Performance Monitor

Statistical detection when your models stop working as designed.

Continuous distribution-shift and performance-degradation monitoring. Population Stability Index against rolling baselines, Kolmogorov-Smirnov test, accuracy / precision / recall degradation tracking, and demographic differential drift. Flags automatic retraining triggers per model.

Confidence & Verification

Calibrated confidence, multi-model verification, automatic escalation.

Every AI decision carries a calibrated confidence score with automatic routing — auto-approve, flag-for-review, or human-required. Multi-model verification runs the same input through a primary plus secondary model and surfaces disagreements; configurable to auto-accept, flag, or block based on delta thresholds.

Immutable Audit Trail

SHA-256 hash-chained logs of every AI decision.

Cryptographically tamper-evident audit log of every AI inference, decision, and human override. Each entry hash-chained to its predecessor — any tampering breaks the chain and is detected on verification. Exportable as FHIR R4 AuditEvent for healthcare and as evidence for regulatory examinations.
