VULNERABILITY INVENTORY

What We Know We Don't Know — and What Can't Be Designed Away

Why This Document Exists

steamHouse has a well-designed credentialing system. But design is not validation. Every element described in the Design Logic statement has been carefully thought through — and none of it has been formally tested.

This document presents the hardest questions a well-informed partner would ask, distinguishes between problems that better design can solve and problems that are inherent to the enterprise, and identifies the research needed to close each gap.

We publish this because serious partners deserve the full picture, and because the questions we're unable to answer are precisely the questions that make research partnership valuable.

Design Vulnerabilities

These are problems where the architecture is sound but the implementation details are unproven. They can be addressed through testing, iteration, and research.

1. Inter-Rater Reliability

The question: Can different mentors, assessing the same participant, reach the same conclusions? If Mentor A rates a participant as "Integrating" on Active Listening and Mentor B rates them as "Applying," which one is right — and how often will they disagree?

What we know from other fields: Competency-based medical education has spent decades wrestling with this. Research on the Entrustable Professional Activities (EPA) framework — the closest analog to steamHouse's approach — has found that inter-rater reliability improves significantly with three conditions: clear behavioral anchors at each level, structured assessment occasions (not just general impressions), and rater training that includes calibration exercises with shared examples. Even with all three, some variability persists — and medicine handles this through multi-source feedback and committee-based decisions rather than relying on any single rater.

What steamHouse has: Detailed behavioral descriptions at each progression level for all 58 markers. A multi-rater architecture (self, mentor, peer, family) designed to triangulate across observers. Clear evidence-type guidance distinguishing strong from weak evidence for each marker type.

What steamHouse needs: Psychometric validation. A formal study where multiple trained mentors independently assess the same participants and the results are analyzed for agreement. This is the single most important research gap in the project and the most natural opportunity for academic partnership.
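As a rough illustration of what the agreement analysis in such a study involves (not steamHouse's actual protocol), the standard statistic is Cohen's kappa: observed agreement between two raters, corrected for the agreement their rating distributions would produce by chance. The sketch below uses only the two progression-level names that appear in this document; the participant data are invented for demonstration.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same participants."""
    n = len(ratings_a)
    assert n == len(ratings_b) and n > 0
    # Observed agreement: fraction of participants rated identically.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement: what raters with these marginal distributions
    # would agree on by chance alone.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[lvl] * freq_b[lvl] for lvl in freq_a) / n**2
    if expected == 1:  # degenerate case: both raters used a single identical level
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical study data: two mentors independently place six participants
# at a progression level on the Active Listening marker.
mentor_a = ["Applying", "Applying", "Integrating", "Applying", "Integrating", "Integrating"]
mentor_b = ["Applying", "Integrating", "Integrating", "Applying", "Applying", "Integrating"]
print(round(cohens_kappa(mentor_a, mentor_b), 3))  # → 0.333
```

Here the mentors agree on four of six participants (67% raw agreement), but kappa drops to 0.33 once chance agreement is removed — exactly the kind of gap a validation study would surface and rater calibration would aim to close.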

Can this be solved by design? Largely yes. The medical education literature shows that inter-rater reliability is achievable for behavioral assessment — it just requires systematic calibration, clear rubrics, and multiple assessors. steamHouse's architecture supports all three. The implementation needs testing.

2. Portfolio Legibility

The question: Can someone unfamiliar with steamHouse — an employer, a college admissions officer, a program director — read a steamHouse credential portfolio and understand it within five minutes? Or does interpreting the portfolio require steamHouse-specific literacy that limits its value outside the ecosystem?

What we know from other fields: Learning and Employment Records (LERs) are an active area of development across multiple states, and the research consistently finds that legibility is a major barrier. States like Colorado, Virginia, and Alabama are building infrastructure to identify "credentials of value," but the challenge of making non-traditional credentials readable to employers who didn't issue them remains largely unsolved.

What steamHouse has: A three-domain structure (Stars, Lenses, Keys) that maps to natural categories (character, thinking, capability). A two-dimensional display (progression × verification) that is conceptually clear. Accessible names for all 58 markers designed to be immediately intelligible without jargon.

What steamHouse needs: User testing with people outside the steamHouse ecosystem. Can a hiring manager parse a portfolio in under five minutes? What do they notice? What confuses them? What would make them trust it? This is testable with current resources and doesn't require external research partners — though external validation would strengthen the findings.

Can this be solved by design? Yes. This is a UX problem, and UX problems respond to iterative testing and refinement.

3. Assessment of Lenses vs. Stars vs. Keys

The question: Can all three marker types be assessed with equal reliability? Keys (behavioral skills like Active Listening or Project Planning) are the most naturally observable. Lenses (thinking frameworks like Scout Mindset or Pre-Mortem) are observable when applied but harder to verify as habitual. Stars (character qualities like Growth Mindset or Heart at Peace) are the most challenging — they describe dispositional orientations that may manifest differently across contexts.

What we know from other fields: Medical education distinguishes between competencies that are easily observed in workplace settings and those that are harder to pin down. Technical procedures are reliably assessed. Professionalism — which is closer to what steamHouse calls Stars — is notoriously difficult to rate consistently, because the observable behavior may not reflect the internal disposition.

What steamHouse has: Different evidence-type guidance for each marker category. Stars require testimony from others and cross-context consistency. Lenses require application in novel situations, not just recitation. Keys require observable behavioral performance. These distinctions are documented but not validated.

What steamHouse needs: Category-specific reliability studies. It is plausible that Keys will achieve acceptable inter-rater agreement first, Lenses second, and Stars last — which has implications for phased implementation of the credentialing system.

Can this be solved by design? Partially. Better rubrics and calibration exercises can improve reliability for Stars, but inter-rater agreement on dispositional assessment may have a floor that no rubric can raise. Understanding where that floor is — and being transparent about it — is more important than pretending it doesn't exist.

Structural Vulnerabilities

These are challenges inherent to the enterprise that cannot be fully resolved through better design. They require ongoing management, cultural practices, and honest communication.

4. Goodhart's Law

The question: "When a measure becomes a target, it ceases to be a good measure." If steamHouse markers become valuable enough to game, people will try to game them. How is this different from teaching to the test?

The honest answer: Goodhart's Law applies to steamHouse exactly as it applies to every other measurement system. The mitigation is real but partial.

For Keys, gaming is structurally difficult because the observable behavior is the capability. A person who performs all the indicators of Active Listening convincingly enough to fool a trained observer has, functionally, become a good listener. Unlike a test score, where gaming the measure bypasses the underlying skill, gaming a behavioral capability marker requires developing the capability.

For Stars, the picture is more complex. Someone who performs all observable indicators of Heart at Peace in front of a trained mentor may not have internalized that orientation — they may simply be skilled at contextual performance. Medical education faces exactly this problem with professionalism assessments. The literature suggests that the primary protection is not a design fix but a cultural one: building communities where the intrinsic value of the capabilities is experienced directly, so that gaming them feels pointless because the actual capabilities are more rewarding than the credential.

This is what Club does. The credentialing system works because it sits on top of a community that makes the capabilities genuinely valuable to possess. The credential without the community is vulnerable to Goodhart. The credential with the community is substantially more robust — not immune, but robust.

We do not claim to have solved Goodhart's Law. No one has. We claim to have a design that makes gaming harder than average and a community context that makes gaming less attractive than average.

5. Institutional Capture

The question: You criticize Boy Scout badges for being locked inside a single organization. But your 58 markers are defined by steamHouse, assessed through steamHouse's framework, and verified by steamHouse-trained mentors. How is this different?

The honest answer: It's a real risk. Open-source design mitigates it — anyone can adopt the markers and assessment framework without steamHouse's permission. But "open-source" doesn't automatically mean "widely adopted." If the markers remain primarily used within steamHouse communities, then in practice the credentials are as institution-bound as merit badges.

The path out of institutional capture is adoption by other organizations — which requires the markers to be genuinely useful and the framework to be genuinely adaptable across contexts. The Bootstrap Guides (integration templates for FIRST LEGO League, theater, soccer, 4-H, and more) are the beachhead: they demonstrate that the framework can wrap around existing activities without replacing them. But whether other organizations actually adopt them at scale is unproven.

This vulnerability cannot be fully resolved by design. It can only be resolved by adoption — which takes time, demonstrated value, and a credentialing system that other organizations find more useful than burdensome.

6. Demand Creation

The question: An alternative scoreboard only matters if someone looks at it. What makes employers, colleges, or other gatekeepers adopt a new measurement system?

The honest answer: This is the single hardest question in the project.

The Harvard Business School / Burning Glass Institute research shows that even when companies announce skills-based hiring, 45% of them change nothing about their actual hiring behavior. The structural barriers to adopting new credentialing systems are formidable: hiring managers default to familiar filters, institutional processes resist change, and the transaction costs of learning to read a new credential format are real.

steamHouse does not currently have employer or institutional adoption of its credential system — because the credential platform doesn't exist yet. The theory is that verified behavioral capabilities have intrinsic value to employers already frustrated with traditional credentialing, and the evidence of that frustration is substantial. But frustration with the status quo does not automatically translate into adoption of an alternative.

The demand creation strategy is presented separately. The honest framing: we have strong design logic for why this should work, supporting evidence that employers want something like this, and no evidence yet that they will adopt this specific implementation. That is what a pilot is for.

What This Means for Partners

The vulnerabilities above define steamHouse's research agenda with precision. For potential research partners, these are not embarrassments — they are opportunities. Each gap is specific, testable, and aligned with active research programs in competency-based education, assessment design, credentialing systems, and workforce development.

For potential funders, the vulnerabilities demonstrate intellectual honesty about the difference between design validity (strong) and outcome validity (not yet established). The ask is for resources to close the gap — through the pilot studies, psychometric validation, user testing, and longitudinal tracking that the project's current stage requires.

[See the Research Questions →] · [Read the Demand Creation Strategy →] · [Return to the Landscape Brief →]