Published: May 2026
Evaluation Target: yorkeccak/bio
Evaluation Engine: STEM-BIO-AI v1.7.5
Execution Mode: LOCAL_ANALYSIS
Audit Expires: 2026-06-29 (45-day freshness window)
| Field | Value |
|---|---|
| Final Score | 48 / 100 |
| Formal Tier | T1 Quarantine |
| Use Scope | Exploratory review only. No patient-adjacent use. |
| Replication Tier | R1 (Stage 4: 30/100) |
| AIRI Coverage | 8 / 32 |
The T1 Quarantine result is justified. The repository presents a strong bio-facing README and a recognizable data-source story, but it still lacks the engineering, governance, and deployment-boundary evidence needed for patient-adjacent or regulated use. The most important shift in the v7 output is not the final score itself. It is that the report now surfaces external-service dependency and unsupported legal/compliance claims more explicitly, and no longer presents Code Integrity or AIRI in a way that looks artificially clean.
STEM-BIO-AI is a deterministic local CLI scanner. It does not execute model inference, make network requests, or run the target application. It evaluates repository evidence surfaces: what signals are present in README, docs, package metadata, dependency manifests, CI/test infrastructure, code structure, and replication artifacts.
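The evidence-surface model above can be pictured as pure filesystem checks. The following is a hypothetical Python sketch, not STEM-BIO-AI internals: the surface names mirror those listed in this section, but the mapping and function are assumptions.

```python
import os

# Hypothetical sketch of evidence-surface enumeration for a deterministic
# local scanner: filesystem checks only, no network and no code execution.
EVIDENCE_SURFACES = {
    "readme": ["README.md", "README.rst", "README.txt"],
    "package_metadata": ["package.json", "pyproject.toml", "setup.py"],
    "dependency_manifests": ["package-lock.json", "pnpm-lock.yaml", "requirements.txt"],
    "ci_tests": [".github/workflows", ".gitlab-ci.yml", "tests", "__tests__"],
}

def present_surfaces(repo_root):
    """Return, per surface, which candidate files or directories exist on disk."""
    return {
        surface: [c for c in candidates
                  if os.path.exists(os.path.join(repo_root, c))]
        for surface, candidates in EVIDENCE_SURFACES.items()
    }
```

Everything downstream (scores, warnings, traceability) is derived from presence or absence of such surfaces, which is why the scanner can be deterministic and offline.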
The evaluation is organized into four stages.
| Stage | Weight | Description |
|---|---|---|
| Stage 1 README Evidence Signal | 0.40 | What the README and package metadata claim |
| Stage 2R Repo-Local Consistency | 0.20 | Whether local code/doc surfaces support those claims |
| Stage 3 Code/Bio Responsibility | 0.40 | Engineering governance: tests, CI/CD, provenance, bias, release hygiene |
| Stage 4 Replication Evidence | reported separately | Reproducibility infrastructure |
Stage 4 is reported separately and does not alter the final score. Stage 3 remains the heaviest weighted stage and is still the clearest reason this repository remains in quarantine.
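Assuming each stage total is normalized to a 0-100 scale before weighting, the headline number reduces to a simple weighted sum. This is a sketch, not the scanner's actual code, but the stage totals implied by the tables in this report do reproduce the published 48/100:

```python
def final_score(stage1, stage2r, stage3):
    """Weighted combination of stage scores, each on a 0-100 scale.
    Stage 4 is reported separately and deliberately excluded."""
    return round(0.40 * stage1 + 0.20 * stage2r + 0.40 * stage3)

# Stage totals implied by the tables in this report:
#   Stage 1:  60 + 10 + 5 + 5 - 5            = 75
#   Stage 2R: 60 + 15 - 20 - 15              = 40
#   Stage 3:  20 / 80 raw, normalized to 100 = 25
score = final_score(75, 40, 25)  # -> 48, matching the published 48/100
```

The decomposition (30 + 8 + 10) also makes the narrative concrete: Stage 3 contributes only 10 of its possible 40 weighted points, which is why it is the decisive stage.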
| Signal | Points |
|---|---|
| Non-nascent README baseline | +60 |
| Bio/medical domain vocabulary in README | +10 |
| Bio/medical domain vocabulary in package metadata | +5 |
| Self-asserted privacy/compliance language without stronger grounding | +5 |
| CA-INDIRECT surface lacks explicit non-clinical boundary | -5 |
The README is the strongest part of the repository. It names the right biomedical sources, uses domain vocabulary competently, and presents a coherent user-facing story. But the same README also contains compliance-adjacent language that the reviewed repository sources do not substantiate.
The current scanner no longer treats "HIPAA-compliant architecture" as a strong regulatory-framework signal. It is now surfaced as a weaker self-assertion signal plus a separate unsupported legal/compliance claim warning. That is the right direction. The claim remains visible, but it is no longer credited as if the repository had demonstrated formal governance support.
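The downgrade described above can be sketched as a pattern detector that awards only a small self-assertion credit while emitting a paired warning. The patterns and point values below are assumptions for illustration, not STEM-BIO-AI internals:

```python
import re

# Compliance-adjacent wording earns a weak self-assertion credit (+5) plus a
# paired warning; it is never credited as a regulatory-framework signal.
COMPLIANCE_PATTERNS = [r"\bHIPAA[- ]compliant\b", r"\bGDPR[- ]compliant\b"]

def score_compliance_language(readme_text):
    findings = []
    for pattern in COMPLIANCE_PATTERNS:
        if re.search(pattern, readme_text, re.IGNORECASE):
            findings.append({
                "signal": "self_asserted_compliance_language",
                "points": 5,
                "warning": "unsupported_legal_compliance_claim",
            })
    return findings
```

The design point is that the claim stays visible in the output while the scoring credit stays small, so a polished README cannot buy governance points with wording alone.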
| Signal | Points |
|---|---|
| Non-nascent local repository baseline | +60 |
| README domain overlap with package metadata | +15 |
| Clinical-adjacent surfaces without non-diagnostic boundary | -20 |
| README claims runnable workflow without matching local support | -15 |
This stage captures the central contradiction. The repository reads like a polished product surface, but the local support evidence is shallow. The repository describes a working biomedical workflow, yet there is no matching support structure for tests, CI, or locally verifiable runnable scaffolding.
The -20 clinical-boundary deduction is especially important. Querying PubMed, ClinicalTrials.gov, FDA DailyMed, and ChEMBL without a clear non-diagnostic or non-clinical intended-use boundary is not a minor documentation omission. It is a governance gap.
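A minimal sketch of such a boundary check, with illustrative source and phrase lists (the scanner's real detector is not public, so both lists are assumptions):

```python
# Clinical-adjacent sources named without a non-diagnostic disclaimer
# trigger the deduction described above.
CLINICAL_ADJACENT_SOURCES = ["PubMed", "ClinicalTrials.gov", "DailyMed", "ChEMBL"]
BOUNDARY_PHRASES = ["not for diagnosis", "non-clinical", "research use only",
                    "not intended for patient care"]

def clinical_boundary_deduction(doc_text):
    text = doc_text.lower()
    touches_clinical = any(s.lower() in text for s in CLINICAL_ADJACENT_SOURCES)
    states_boundary = any(p in text for p in BOUNDARY_PHRASES)
    return -20 if touches_clinical and not states_boundary else 0
```

Note that the deduction clears as soon as any explicit boundary phrase appears, which is why this is one of the cheapest fixes available to the repository.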
| Rubric Item | Score | Maximum |
|---|---|---|
| T1 CI/CD pipeline | 0 | 15 |
| T2 Domain-specific tests | 0 | 15 |
| T3 Changelog and release hygiene | 0 | 15 |
| B1 Data provenance controls | 15 | 15 |
| B2 Bias and limitations documentation | 0 | 15 |
| B3 COI, funding, acknowledgements | 5 | 5 |
| Raw total | 20 | 80 |
This is still the decisive stage.
Three engineering-governance items remain at zero:

- T1 CI/CD pipeline (0/15)
- T2 Domain-specific tests (0/15)
- T3 Changelog and release hygiene (0/15)

That means the repository still lacks the minimum operational evidence expected of a tool that presents itself as a serious biomedical interface.
B1 now scores 15/15, which is materially different from earlier runs. The scanner now recognizes JavaScript dependency and lockfile surfaces correctly, so package.json, package-lock.json, and pnpm-lock.yaml count as provenance or dependency evidence where appropriate. That is a real improvement in the audit logic, not a generosity bump.
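The recognition change can be pictured as a simple surface check. The file list comes from this report; the 15-point award mirrors the B1 rubric row, and the function itself is an illustrative assumption:

```python
import os

# JavaScript dependency and lockfile surfaces now count as provenance
# evidence, which is what moved B1 from earlier runs to 15/15 here.
PROVENANCE_SURFACES = ["package.json", "package-lock.json",
                       "pnpm-lock.yaml", "yarn.lock"]

def score_b1_provenance(repo_root):
    hits = [f for f in PROVENANCE_SURFACES
            if os.path.exists(os.path.join(repo_root, f))]
    return (15 if hits else 0), hits
```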
B3 also scores correctly because the repository acknowledges its relationship to Valyu. That disclosure matters. But provenance and acknowledgment do not compensate for missing tests, missing automation, and absent release hygiene.
| Evidence Item | Score | Maximum |
|---|---|---|
| Container or compose file | 10 | 10 |
| Environment lock manifest | 10 | 10 |
| Exact dependency pins or hashes | 10 | 10 |
| Everything else | 0 | 70 |
Stage 4 is now stronger than it appeared in earlier runs because JavaScript lockfiles are correctly recognized. The repository gets real credit for containerization and dependency pinning.
That said, the replication lane is still thin: beyond the container file, the environment lock manifest, and the dependency pins, every other Stage 4 evidence item scores zero.

So 30/100 is an improvement, but not a strong reproducibility posture.
The current v7 report surfaces six top risks. Four are structural and two are mirrored from code-integrity warning lanes.
Risk 1: Clinical-adjacent surfaces exist without an explicit non-diagnostic or non-clinical boundary.
This remains the most important governance issue. The tool works close to biomedical and clinical information flows, but the repository does not state a firm intended-use boundary that rules out diagnosis, treatment, or patient-care use.
Risk 2: Self-asserted compliance or privacy-governance claim requires independent verification.
The repository uses compliance-adjacent language in the README. The scanner now treats this as a weak self-assertion that must be reviewed independently rather than as evidence of formal governance maturity.
Risk 3: Legal, privacy, or compliance claim appears without supporting governance or security-grounding evidence in reviewed repository sources.
This is broader than the HIPAA phrase itself. The scanner is now explicitly surfacing unsupported legal/compliance claims in the report layer and regulatory traceability layer.
Risk 4: Core workflow appears materially dependent on named external service providers; local or self-host claims may overstate operational independence.
This is the clearest new structural warning in the v7 artifact. The repository depends on named external services such as Valyu and Daytona. That dependency is now visible as a first-class risk rather than being buried in config files or documentation assumptions.
Risk 5: `C2_dependency_pinning: WARN`.
In this report, C2 no longer reads as “all clear.” It now doubles as a code-integrity lane for external operational dependency risk when a named provider becomes a material single point of reliance.
Risk 6: `C4_exception_handling_clinical_adjacent_paths: WARN`.
The current C4 warning is not about a literal `except: pass` pattern. It is being used here as a boundary-integrity lane for unsupported legal/compliance claims and clinical-adjacent governance concerns.
The README includes the phrase "HIPAA-compliant architecture (when self-hosted)."
The important point is not to overread the scanner. STEM-BIO-AI does not determine legal compliance. It determines whether reviewed repository sources surface evidence that would help ground a claim like this.
At the repository-evidence layer, the answer is weak: the reviewed sources surface no governance documentation, security-grounding evidence, or deployment-boundary definition that would help ground the phrase.
This is why the current report now surfaces both the self-asserted compliance/privacy-governance warning (Risk 2) and the broader unsupported legal/compliance claim warning (Risk 3).
That is a more defensible interpretation than the earlier behavior, which risked treating compliance-adjacent wording as a stronger positive signal than it deserved.
For context, a HIPAA-oriented operational claim normally invites review of at least the following: technical and administrative safeguards, business associate agreements and other contracts, deployment-boundary definitions, and incident-handling procedures.
The current report does not say those controls are absent in deployment. It says the reviewed repository sources do not surface enough evidence to treat the README phrase as grounded governance proof.
That distinction matters because the downside of over-reading a README compliance claim is not abstract. Under the 2025 HHS inflation-adjusted civil monetary penalty schedule, HIPAA violations can scale as follows:
| Tier | Per Violation | Annual Maximum |
|---|---|---|
| No knowledge | $137 | $27,481 |
| Reasonable cause | $1,379 | $137,897 |
| Willful neglect, corrected | $13,785 | $68,928 |
| Willful neglect, not corrected | $27,570 | $2,190,294 |
The narrow takeaway is not “this repository is legally noncompliant.” The narrower and more defensible takeaway is that a compliance-adjacent README claim should not be operationalized without independent verification of controls, contracts, deployment boundaries, and incident-handling procedures.
The traceability block is useful here because it shows where the repository is generating governance-relevant signals without turning those signals into compliance claims.
| Framework | Article / Signal | Stage | Strength |
|---|---|---|---|
| EU AI Act | Article 13 — Transparency | Stage 1 | Weak |
| Compliance claim grounding signal | Traceability-only | Stage 1 | Weak to moderate |
| IMDRF SaMD/GMLP | Clinical context boundary | Stage 2R | Moderate |
| EU AI Act | Article 10 — Data governance | Stage 3 | Weak |
| EU AI Act | Article 12 — Record keeping | Stage 4 | Moderate |
This is the right use of traceability. It helps reviewers orient the evidence. It does not certify compliance, safety, or readiness.
Here, AIRI refers to the AI Risk Repository from the MIT AI Risk Initiative: https://airisk.mit.edu/
The AI Risk Repository has three core parts: a living database of AI risks extracted from published frameworks and taxonomies, a causal taxonomy that classifies each risk by entity, intent, and timing, and a domain taxonomy that groups risks into domains and subdomains.
In STEM-BIO-AI, AIRI is used as a bounded risk-vocabulary layer that connects local detector findings to broader named AI-risk categories when detector-to-risk mappings exist. It is not a separate scoring engine and it is not a truth layer.
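That bounded-vocabulary design can be sketched as a static mapping from detector findings to AIRI risk IDs, with coverage reported as mapped IDs over the tracked total. The mapping entries below are illustrative pairings (the IDs themselves appear in this report), not the scanner's actual table:

```python
# Static detector-to-risk mapping; coverage is |mapped IDs| / tracked total.
# No scoring happens here: this layer only attaches named risk vocabulary.
DETECTOR_TO_AIRI = {
    "unsupported_compliance_claim": ["33.01.05", "69.01.00"],
    "external_dependency_concentration": ["72.04.02", "60.02.01"],
    "missing_clinical_boundary": ["24.04.01", "70.01.02"],
}

def airi_coverage(findings, tracked_total=32):
    mapped = sorted({risk_id for f in findings
                     for risk_id in DETECTOR_TO_AIRI.get(f, [])})
    return mapped, len(mapped) / tracked_total
```

Because coverage counts only findings that have an existing mapping, a low ratio means "few detector-to-risk links fired," not "few risks exist."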
The most important difference from older outputs is that AIRI is no longer 0/31.
Current result:
- AIRI coverage: 8 / 32 mapped risk IDs (coverage ratio 0.250)

Examples now surfaced:

- 24.01.03 Safe exploration problem with widely deployed AI assistants
- 24.04.01 Physical and Psychological Harms
- 33.01.05 Privacy and security
- 39.25.00 Verifiability
- 60.02.01 Reliability issues
- 69.01.00 False information
- 70.01.02 Accidental harm
- 72.04.02 Market Concentration and Infrastructure Dependencies

This matters because the earlier 0/31 output could be misread as "no relevant AIRI risk signal." The current report is better aligned with the actual findings. Unsupported compliance claims, dependency concentration, and boundary weaknesses now connect to a broader AIRI-backed risk vocabulary.
Important boundary: AIRI coverage counts only detector findings that have an existing detector-to-risk mapping. It does not assert that unmapped risk categories are absent, and it does not add to or change the score.
Known gaps still called out explicitly:
- 65.03.03 Reidentification
- 70.02.02 Misinformation: hallucination of clinical knowledge

Those gap notes are still important because they point to plausible failure territory that is not yet fully covered by the current static detector set.
The biggest interpretive improvement in the v7 report is that Code Integrity no longer looks falsely clean.
| Check | Result |
|---|---|
| C1 Hardcoded credentials | PASS |
| C2 Dependency pinning / external operational dependency lane | WARN |
| C3 Deprecated patient-adjacent paths | PASS |
| C4 Boundary-integrity / unsupported compliance claim lane | WARN |
This is much closer to what a reviewer would intuitively expect after reading the rest of the report.
Two cautions still matter:
- PASS does not mean strong overall engineering quality. It means the specific detector lane did not surface a mapped problem.
- The C2 and C4 lane names should not be read literally. Each is overloaded: C2 doubles as the external operational dependency lane, and C4 doubles as the boundary-integrity lane for unsupported compliance claims.

For this repository to move out of T1 Quarantine, the changes are not subtle.
Immediate

- State an explicit non-diagnostic, non-clinical intended-use boundary for the clinical-adjacent surfaces.
- Remove or substantiate the compliance-adjacent README language; as written it surfaces as an unsupported legal/compliance claim.

Short-term

- Add a CI/CD pipeline and domain-specific tests (the T1 and T2 zeros).
- Introduce a changelog and basic release hygiene (the T3 zero).

Medium-term

- Document bias and limitations (the B2 zero).
- Extend replication evidence beyond containerization and dependency pinning.
- Document the operational dependence on external providers such as Valyu and Daytona, and what self-hosting actually requires.
This remains a deterministic local CLI scan. No LLM inference, runtime execution, or network activity was used to produce the result. The report is therefore strongest when describing repository evidence surfaces and weaker when inferring live deployment behavior.
That boundary matters. This report is a pre-screen and review aid, not clinical certification, regulatory clearance, or legal advice.
The full machine-generated STEM-BIO-AI report for this run is attached below as an interactive HTML artifact and also exists in Markdown, JSON, explain-trace, and PDF form.
STEM-BIO-AI | Weekly Repository Evaluation #1 | Every Friday
Next: BioClaw — AI-powered bioinformatics research assistant for WhatsApp group chats. 374 stars. A bioRxiv paper. Where does the genomic data go after the message is sent?