Knowledge Formation Optimization: A Preregistered Falsification Protocol

Author decisions, preregistered. Four parameters carry the most weight and are the most open to challenge. They are set and owned here as preregistered values, not as empirical findings: the smallest effect size of interest at fifteen percentage points, the six engine set, disclosed research entities in real competitive markets with the suppression pretest as a hard gate, and the publish on initiation mandate. Each is a declared judgment, justified below and in the locked appendices, and open to challenge before deposit.

Status and Precommitment

This is a study protocol. It states, in advance and in full, the experiment whose result would support or defeat the operationalized claim of Knowledge Formation Optimization. It is published before any data has been collected. On deposit to an external registry it becomes a locked, timestamped, read only record, so that the hypotheses, the outcome measure, the decision threshold, and the conditions that would falsify the claim are all fixed before the outcome is known and cannot be adjusted after the fact.

This protocol tests KFO as operationalized here, across the tested entities, engines, prompts, and time window. A result counts for or against this operationalization. It does not confirm or disprove every possible version of the framework. That boundary is deliberate and is restated in the scope section.

This protocol is open. Americas Great Resorts may execute it. Any independent party may also execute it against this specification. The locked appendices listed at the end carry the runnable package, so that an outside party can run the test without involvement from AGR.

Locking does not mean the test has been run. No result is claimed here. This document is a precommitment and an invitation to test, not evidence.

Plain Language Summary

Knowledge Formation Optimization (KFO) claims that deliberately structuring an entity’s source environment changes how AI systems describe, attribute, and route to that entity, across unbranded questions and over time, and that this effect is distinct both from ordinary content and SEO and from mere structural polish. This protocol specifies a four group field experiment that would test that claim. One group of matched entities receives a full KFO build. A second receives an equal volume of ordinary content and SEO. A third receives an equal volume of structurally dense content with the KFO conceptual mapping and attribution deliberately removed. A fourth receives nothing. The same fixed set of unbranded category questions is then put to several AI systems, by an automated script, on a fixed schedule over a fixed window, and the answers are scored, blind, for how often each entity is named, how accurately it is attributed, and how consistently it is described. The KFO group must separate from all three other groups by a preset margin. If it does not beat the do nothing group, the effect is not real. If it does not beat the ordinary content group, KFO is ordinary optimization under a new name. If it does not beat the structure only group, the effect is attributable to structural density rather than to knowledge formation. Any one of those failures counts against KFO.

Background and Claim Under Test

The mechanisms KFO rests on, and the full argument for why the claim is plausible but unproven, are set out in the companion article and the academic framework paper and are not repeated here. This protocol tests one claim from that argument, stated below as a falsifiable hypothesis.

The locked definition of the discipline under test:

Knowledge Formation Optimization (KFO) is the discipline of structuring, sequencing, and distributing intellectual frameworks and entity definitions so that AI systems develop stable, accurate, and bounded conceptual representations from the information environment they draw upon, attributing frameworks to their originating authorities and routing relevant queries to canonical sources rather than to approximate, competing, or intermediary-inflected alternatives.

Canonical source: https://www.americasgreatresorts.net/kfo-knowledge-formation-optimization/

Hypotheses

The confirmatory hypotheses are directional and fixed in advance. The primary outcome is the unbranded category mention rate defined below.

H1, the effect hypothesis. Over the measurement window, the KFO arm exceeds the do nothing arm on the primary outcome by at least the smallest effect size of interest.

H2, the distinct from optimization hypothesis. Over the measurement window, the KFO arm exceeds the equal volume content and SEO arm on the primary outcome by at least the smallest effect size of interest.

H3, the distinct from structure hypothesis. Over the measurement window, the KFO arm exceeds the structure only arm on the primary outcome by at least the smallest effect size of interest.

All three must hold for the result to support KFO. Failure of H1 falsifies the effect. Failure of H2 indicates the effect is not distinct from ordinary content and SEO. Failure of H3 indicates the effect is attributable to structural density rather than to knowledge formation specifically.

Design

A four arm, preregistered, between entity field experiment with repeated automated measurement over a fixed window, run in two sequential phases: a feasibility pilot and a confirmatory study.

  • Arm A, KFO treatment. A deliberately formed source environment built according to the five KFO operating principles.
  • Arm B, content and SEO control. An equal volume of ordinary content and standard SEO, matched to Arm A on page count and word count, with the KFO method absent.
  • Arm C, structure only control. An equal volume of structurally dense content with schema and internal structure matched to Arm A, but with the KFO conceptual mapping, originating authority attribution, and canonical routing removed. Where Arm A carries canonical attribution and routing, Arm C carries generic, non canonical placeholder content per the locked Arm C template, so that structure is present and knowledge formation is not. There is no discretionary randomization at run time: the Arm C build is fully specified in advance.
  • Arm D, do nothing. No intervention. Measured only.

Assignment of matched entities to arms is fixed before the baseline window and is not changed thereafter.

Entities and Footprint Requirements

Entities are matched across arms on category, market, baseline visibility, and existing authority, so that the arms differ at the start only by chance and thereafter only by treatment.

Entity basis. Disclosed research entities placed in real competitive markets. Each entity is a research property in a real, contested luxury market that already contains established incumbents, so the test runs against real competition rather than in an empty category. Each carries an identical, visible, machine readable disclosure that it is a research entity and not a bookable business, applied identically across all four arms, so the disclosure is held constant across the between arm comparison. No entity has a booking path or payment capture.

Fictional entities in invented markets are excluded outright: an entity that is the only candidate in an empty category is returned by default, which measures nothing. Real consented properties would be statistically strongest but require recruitment that is not feasible for a single operator.

Footprint and eligibility requirements. To reduce the risk that a research entity is suppressed by spam or low trust heuristics rather than by the absence of KFO, each entity, in every arm, must carry a genuine and equal baseline footprint before treatment, and must clear a measurable eligibility gate before the measurement window opens:

  • A live landing presence with valid DNS and mail authentication records.
  • Presence in at least two independent verifiable directories.
  • Indexation of the built page set at the parity tolerance defined below.
  • Appearance in at least one independent source that AGR does not own or control, where that source is itself indexed and meets a minimal credibility bar fixed in the eligibility appendix, so the requirement is a real signal and not a checkbox.

An entity that does not clear the eligibility gate is not enrolled, because an entity that fails for ineligibility rather than for absence of KFO contaminates the comparison.

Construct validity limit. Even with the footprint requirements and the suppression pretest, these entities are research properties, not real bookable hospitality businesses. The experiment therefore tests how AI systems form representations of disclosed research entities in a real market, which is a close proxy for, but not identical to, how they form representations of real hospitality businesses. This limit is stated here and again in the scope section. It is the design’s most significant unresolved exposure.

The Suppression Pretest

The disclosure assumption, that an identical research entity disclosure cancels out of the between arm comparison, is an assumption, not a demonstrated fact. AI systems may systematically suppress or deprioritize disclosed research entities, which would bias the study toward a null result for reasons unrelated to KFO. The suppression pretest is run before the confirmatory study and is a hard gate, with a numeric threshold fixed in advance so it cannot be used as a discretionary escape from running the confirmatory phase.

The pretest measures two things across the frozen prompt set and the full engine set. First, an absolute floor: pooled across pretest entities, prompts, and engines, disclosed research entities must reach an aggregate unbranded mention rate of at least five percentage points, with the lower bound of a 95 percent Wilson confidence interval for a proportion above zero. The interval method is fixed as Wilson in advance, because at low mention rates the choice of interval can change the pass or fail outcome and must not be a post hoc decision. Second, a disclosure effect probe: matched probe entities are built with and without the research disclosure, and the disclosed set must be surfaced at a rate no less than half the rate of the matched non disclosed set.
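The two pretest conditions can be sketched as a short check. This is an illustrative sketch, not the locked pretest script: the function names and the count based inputs are assumptions, and the thresholds are the ones stated above (a five percentage point floor, a 95 percent Wilson interval, and a one half ratio against the matched non disclosed probe set).

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def pretest_passes(floor_hits: int, floor_n: int,
                   disclosed_hits: int, disclosed_n: int,
                   plain_hits: int, plain_n: int) -> bool:
    """Hypothetical helper: apply both suppression pretest conditions.

    floor_*     : pooled mentions / responses for disclosed pretest entities
    disclosed_* : mentions / responses for probe entities carrying the disclosure
    plain_*     : mentions / responses for matched probe entities without it
    """
    floor_rate = floor_hits / floor_n
    lower, _ = wilson_interval(floor_hits, floor_n)
    floor_ok = floor_rate >= 0.05 and lower > 0.0
    probe_ok = (disclosed_hits / disclosed_n) >= 0.5 * (plain_hits / plain_n)
    return floor_ok and probe_ok
```

For example, `pretest_passes(10, 100, 8, 100, 12, 100)` clears both conditions, while a pooled floor of 4 mentions in 100 responses fails regardless of the probe result.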

If either condition fails, disclosed research entities are treated as systematically suppressed, the design is declared confounded, and the entity construction, market selection, or footprint is revised before the confirmatory study opens. The confirmatory study does not proceed against a suppressed environment.

Sample Size, Power, and the Pilot Gate

The confirmatory entity count is not fixed in advance at a convenient number. It is set by an a priori power analysis driven by the variance observed in the pilot.

Phase 1, feasibility pilot, thirty day window. Three entities per arm, twelve total. The pilot estimates the pooled standard deviation of the primary outcome across entities and engines, confirms that the automated capture pipeline runs cleanly, confirms that indexing parity is achievable, and confirms that blind scoring reaches the required reliability. The pilot is explicitly non confirmatory. Pilot data are used only to estimate the variance inputs for the power analysis, and no pilot effect estimate informs the threshold. No hypothesis test is performed on pilot data, and pilot entities are not carried into the confirmatory analysis, so the confirmatory test is not contaminated.

Phase 2, power adjustment gate. Using the pilot pooled standard deviation, an a priori power analysis sets the confirmatory entity count needed to detect the smallest effect size of interest at alpha .05, two sided, with power of .80. The floor is five entities per arm. If the power analysis returns a required count that a single operator cannot field, the protocol does not proceed to an underpowered run. The reconciliation rule is locked in one direction: the smallest effect size of interest is not lowered after the pilot. Only the entity count may rise to meet it. If the required count is infeasible, the confirmatory study does not open, and that outcome is recorded publicly rather than resolved by relaxing the threshold. An underpowered run is not executed, because an underpowered run settles nothing.
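As a rough illustration of the gate, the sketch below converts a pilot pooled standard deviation into an entity count per arm. It assumes a normal approximation for a two arm comparison of entity level mention rates, which is a simplification. The locked power calculation in the appendix governs, and a calculation that accounts for small samples would generally return a somewhat larger count. The function name and its inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def entities_per_arm(pooled_sd: float, sesoi: float = 0.15,
                     alpha: float = 0.05, power: float = 0.80,
                     floor: int = 5) -> int:
    """Approximate entities per arm to detect `sesoi` (two sided alpha).

    Assumption: normal approximation with equal variance in both arms,
    with pooled_sd expressed on the same proportion scale as sesoi.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * pooled_sd / sesoi) ** 2
    # The protocol floor of five entities per arm always applies.
    return max(floor, ceil(n))
```

Under this approximation a pilot standard deviation of 0.10 gives seven entities per arm and 0.15 gives sixteen, while anything that computes below five is raised to the floor. The threshold argument is never lowered to make the count feasible: only the count moves.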

Phase 3, confirmatory study, one hundred and twenty day window. The entity count set by Phase 2 is built and locked. Because entities are few relative to repeated observations, entity level variance, not observation count, is the binding constraint on power, and the analysis is specified accordingly below, including small cluster handling.

Intervention Specification

The full build specifications for Arms A, B, and C are fixed before the baseline window and deposited as locked appendices, so that another party can construct each arm without interpretation.

Arm A, KFO build. Constructed per the five operating principles: a precise canonical definition, explicit originating authority and provenance, machine readable query mapping, explicit conceptual boundary defense, and the planned monitoring protocol.

Arm B, content and SEO build. An equal volume of ordinary content and standard SEO at the same page count and word count, produced without the KFO method.

Arm C, structure only build. An equal volume of content with structural density, schema, and internal linking matched to Arm A, with the conceptual mapping, originating authority attribution, and canonical routing removed and replaced by generic non canonical placeholders per the locked Arm C template.

Arm D. No build.

Volume is held constant across Arms A, B, and C. The only differences are the presence of the KFO method in A, its absence in B, and the presence of structure without knowledge formation in C.

Baseline Verification and Indexing Parity

Before the measurement window opens, three conditions are verified and recorded, each against a fixed tolerance rather than an unattainable absolute.

Indexing parity. At least ninety percent of each arm’s built pages must be indexed on each measured surface, and the indexed page proportions across arms must fall within ten percentage points of one another. This bounds discovery speed as a confound without demanding perfect parity, which is not controllable.

Baseline equality. The between arm differences on the primary outcome at baseline must fall within five percentage points, a band deliberately tighter than the smallest effect size of interest, so the arms start level relative to the effect being hunted.

Suppression pretest passed. The gate above must have cleared.

Settling window. After indexing parity and baseline equality are verified, a fixed fourteen day settling window elapses before the measurement window opens, so that newly indexed source environments have time to propagate into the retrieval and ranking systems of the engines. This prevents the measurement clock, and the zero environment rule in particular, from running against environments that have not yet had a chance to form. The settling window is fixed in advance and is identical across all four arms.

The measurement window does not open until the three verified conditions hold and the settling window has elapsed.
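The tolerances in this section reduce to a few comparisons. The sketch below is illustrative only: the function names and the dictionary inputs are assumptions, with rates expressed as proportions between 0 and 1. Indexing parity is checked for the arms that have built pages, and baseline equality across all four arms.

```python
from datetime import date, timedelta


def indexing_parity_ok(indexed_share_by_arm: dict[str, float]) -> bool:
    """Each built arm at least 90 percent indexed, spread within 10 points."""
    shares = list(indexed_share_by_arm.values())
    return min(shares) >= 0.90 and max(shares) - min(shares) <= 0.10


def baseline_equal_ok(baseline_rate_by_arm: dict[str, float]) -> bool:
    """Largest between arm gap in baseline mention rate within 5 points."""
    rates = list(baseline_rate_by_arm.values())
    return max(rates) - min(rates) <= 0.05


def earliest_open(verified_on: date, settling_days: int = 14) -> date:
    """First day the measurement window may open after verification."""
    return verified_on + timedelta(days=settling_days)
```

With these definitions, `earliest_open(date(2025, 1, 1))` returns 15 January 2025. The date is illustrative and not a scheduled start.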

Measurement

Prompt set. A frozen, preregistered set of forty unbranded category questions that name no entity and no page, written and locked before any data is collected, and deposited as a locked appendix. Thirty are substantive category prompts. Ten are negative control prompts in unrelated spaces, used to detect system wide drift and cross contamination, on which no movement is expected.

Engines. Six AI systems: ChatGPT, Claude, Gemini, Copilot, Perplexity, and Grok. This is the testing set AGR already uses across its corpus, chosen for ecological validity because these are the surfaces real users encounter. Two caveats are recorded and handled in the analysis. First, some of these surfaces are built on shared underlying base models, so they are not treated as fully independent. Second, an automated interface may not reproduce the consumer application surface exactly, so the capture method for each engine is recorded and held constant.

Procedure. An automated execution harness, deposited as a locked appendix, issues the frozen prompt set across every engine on a fixed cadence of once every seventy two hours, under controlled accounts and locations, for the full window. Each response is captured verbatim with its prompt, engine, capture method, model version, generation parameters where the interface exposes them, timestamp, account, and location. Automated execution removes manual data collection burden and human delivery bias, which is what makes repeated multi engine measurement feasible for a small operator. The exact interface, client, rate handling, and prompt wrapper for each engine are documented in a locked appendix so that an outside party reproduces the same capture conditions.
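For orientation, the capture record and the run schedule could be represented as follows. This is a sketch, not the deposited harness or the locked data capture schema: the class name, the field names, and the treatment of the window as half open (a run at the start, none at the closing instant) are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CaptureRecord:
    """One verbatim response with the metadata listed in the procedure."""
    prompt_id: str
    engine: str
    capture_method: str
    model_version: str
    timestamp: datetime
    account: str
    location: str
    response_text: str
    # Only populated where the interface exposes generation parameters.
    generation_params: dict = field(default_factory=dict)


def run_schedule(start: datetime, window_days: int = 120,
                 cadence_hours: int = 72) -> list[datetime]:
    """Run times from `start` every `cadence_hours`, before the window closes."""
    end = start + timedelta(days=window_days)
    step = timedelta(hours=cadence_hours)
    runs = []
    t = start
    while t < end:
        runs.append(t)
        t += step
    return runs
```

Under the half open assumption, a one hundred and twenty day window at a seventy two hour cadence yields forty runs, so forty prompts across six engines would schedule 9,600 responses in total.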

Outcome Variables and Operational Coding

Primary outcome, confirmatory. Unbranded category mention rate: for each unbranded category prompt response, a binary indicator of whether the target entity is named, aggregated to a proportion across prompts, engines, and runs. The coding rules are fixed in the locked codebook and summarized here:

  • A mention is counted once per response regardless of how many times the entity is named in that response. The unit is the response, not the token.
  • Aliases and exact name variants of the entity count as a mention. Generic category language that does not identify the entity does not.
  • A mention counts whether it appears in a narrative sentence or in a list.
  • Position is not scored in the primary outcome. Prominence, whether the entity is named first or buried, is recorded and analyzed only as a secondary outcome, so the primary test is not inflated by position.
  • Ambiguous strings that cannot be coded by rule are escalated to blind human adjudication.
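The rule codable part of these coding rules can be sketched as below. This is not the locked codebook. The entity name and aliases are hypothetical, and matching case insensitively on whole names is an assumption made for illustration. Ambiguous strings would still go to blind human adjudication rather than through this function.

```python
import re


def is_mention(response: str, aliases: list[str]) -> int:
    """Return 1 if any alias appears in the response, else 0.

    The unit is the response: repeated naming still scores 1.
    """
    for alias in aliases:
        # Match the alias as a whole name, not as part of a longer word.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, response, flags=re.IGNORECASE):
            return 1
    return 0


def mention_rate(responses: list[str], aliases: list[str]) -> float:
    """Proportion of responses that name the entity at least once."""
    return sum(is_mention(r, aliases) for r in responses) / len(responses)


# Hypothetical entity used only for illustration.
ALIASES = ["Example Ridge Lodge", "Example Ridge"]
```

A response that names the entity three times, whether in a sentence or in a list, contributes one mention. A response that refers only to "a luxury lodge on a ridge" contributes none.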

Secondary outcomes, preregistered, exploratory. A binary primary outcome is deliberately conservative and discards information about frequency and competitive position within a response. To recover that signal without inflating the primary test, two continuous secondary measures are preregistered: within response share of voice, the target entity’s share of all category relevant entities named in a response, and normalized rank position of the first target mention within a response. Attribution accuracy, scored by a strict binary codebook, and descriptive consistency against the entity’s own locked source record, are also preregistered as secondary. Secondary outcomes are analyzed with correction for multiple comparisons and do not determine the primary result.
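One possible reading of the two continuous measures is sketched below. The exact definitions belong to the locked codebook. The choices here are assumptions: share of voice is counted over entity mentions rather than distinct entities, and rank is scaled so that 1.0 means named first and 0.0 means named last. The input is assumed to be the ordered list of category relevant entity names already extracted from one response.

```python
from typing import Optional


def share_of_voice(named: list[str], target: str) -> float:
    """Target mentions as a share of all entity mentions in one response."""
    if not named:
        return 0.0
    return named.count(target) / len(named)


def normalized_rank(named: list[str], target: str) -> Optional[float]:
    """Position of the first target mention among distinct entities.

    Returns 1.0 if the target is named first, 0.0 if named last,
    and None if the target is not named at all.
    """
    order = list(dict.fromkeys(named))  # distinct names in first mention order
    if target not in order:
        return None
    if len(order) == 1:
        return 1.0
    return 1 - order.index(target) / (len(order) - 1)
```

For a response naming entities in the order A, T, B, T, the target T has a share of voice of 0.5 and a normalized rank of 0.5 under these definitions.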

Blind scoring and reliability. All captured responses are stripped of arm, engine, date, and model identifiers before scoring. Rule codable items are scored by the automated codebook. Items requiring judgment are scored by two raters blind to arm and hypothesis, with a third as referee on disagreement. At least one of the raters is independent of AGR. Inter rater reliability is established and reported, and must reach an intraclass correlation or kappa of at least 0.80 for the scoring to be accepted. The codebook, including at least one worked example of scoring a real response, is deposited as a locked appendix.
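For the two rater case, the kappa statistic named above can be computed as in the sketch below. It shows Cohen's kappa for categorical codes only. Whether the locked codebook uses kappa or an intraclass correlation for a given item is not decided here, and the function name is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters coding the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must code the same non-empty item set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Scoring is accepted only when the reported statistic reaches the 0.80 threshold stated above.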

Analysis Plan and Decision Rule

Model. The primary confirmatory analysis is a mixed effects logistic regression of the binary mention outcome, with arm and time as fixed effects and their interaction as the effect of interest, on the repeated prompt level observations. The data are hierarchical: prompts within runs within engines within entities within arms. The model specifies random intercepts for entity, with the arm contrast estimated on the probability scale. Shared base models are handled explicitly rather than left to interpretation: underlying base model is entered as a fixed effect, with cluster robust standard errors by base model family. Because the entity count is small, the analysis uses a small cluster correction, and if the random effects structure does not support stable estimation at the realized entity count, the preregistered fallback treats entity as a fixed effect with cluster robust standard errors. The exact model, the random effects specification, the base model handling, and the fallback are fixed in the locked analysis appendix, which also includes a runnable analysis script with simulated data.

Smallest effect size of interest. A between arm difference in unbranded mention rate of fifteen percentage points. The KFO arm must exceed each control arm by at least this margin on the primary outcome. This is AGR’s declared, preregistered practical threshold, not a measured property of the field. It is set on three stated grounds, with the full reasoning deposited as a locked appendix. First, practical relevance: AGR treats a category presence on the order of fifteen percent as the working line between an entity that is a named option and one that is largely absent, so a fifteen point gap between arms is the difference AGR considers worth acting on. Second, resource logic: the feasible entity count only powers detection of a large effect, so the smallest effect size of interest is set at the smallest effect that is both practically meaningful and detectable at that count, and the confirmatory entity count is scaled at the pilot gate to detect it. Third, convention: fifteen points corresponds to roughly a medium or larger standardized effect, above the conventional smallest effect size floor used in preregistered work. The same fifteen point value serves as the equivalence bound. This dual use is a deliberate choice: the threshold defines both the smallest effect worth detecting and the bound within which an effect is treated as negligible. A consequence, stated plainly, is that the protocol tests for a material effect of at least this size, not for the mere existence of any nonzero effect.

Decision rule.

  • H1, H2, and H3 are tested as minimum effect tests: the KFO arm must exceed each of the do nothing, content, and structure only arms on the primary outcome by at least the smallest effect size of interest.
  • The kill conditions are tested as equivalence tests: if the KFO arm’s advantage over a given control falls within the equivalence bound around zero, that is, within fifteen percentage points of zero in either direction, that comparison is treated as showing no meaningful effect.
  • A result supports KFO only if H1, H2, and H3 all clear the minimum effect threshold. Failure of any one is a falsification of the corresponding claim.
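The decision rule can be illustrated with a normal approximation applied to an estimated arm contrast and its standard error on the probability scale. This is a sketch under stated assumptions, not the locked analysis script: the function names are hypothetical, the contrast and standard error are assumed to come from the fitted model, and the one sided alpha of .05 per test is an assumption exposed as a parameter.

```python
from statistics import NormalDist

SESOI = 0.15  # fifteen percentage points, as stated in the protocol


def clears_minimum_effect(diff: float, se: float, alpha: float = 0.05) -> bool:
    """One sided test of H0: diff <= SESOI against H1: diff > SESOI."""
    p = 1 - NormalDist().cdf((diff - SESOI) / se)
    return p < alpha


def within_equivalence_bound(diff: float, se: float, alpha: float = 0.05) -> bool:
    """Two one sided tests against the bounds -SESOI and +SESOI."""
    p_lower = 1 - NormalDist().cdf((diff + SESOI) / se)  # H0: diff <= -SESOI
    p_upper = NormalDist().cdf((diff - SESOI) / se)      # H0: diff >= +SESOI
    return max(p_lower, p_upper) < alpha


def supports_kfo(contrasts: dict[str, tuple[float, float]],
                 alpha: float = 0.05) -> bool:
    """True only if every contrast (H1, H2, H3) clears the minimum effect."""
    return all(clears_minimum_effect(d, s, alpha) for d, s in contrasts.values())
```

With illustrative numbers, a contrast of 0.25 with standard error 0.03 clears the minimum effect test, a contrast of 0.02 with standard error 0.04 falls within the equivalence bound, and a contrast of 0.12 with standard error 0.05 does neither. Under the rule above, that last case fails to support the corresponding hypothesis.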

Run Validity and Adequacy Criteria

To prevent the phrase “an underpowered run settles nothing” from becoming an open escape hatch, the conditions that make a run valid, and the conditions that invalidate it, are fixed in advance and applied blind to arm.

A run is valid only if, before the window opened, indexing parity and baseline equality held at the tolerances above and the suppression pretest passed, and if, during the window, at least ninety percent of scheduled prompt by engine by run observations were successfully captured over the full one hundred and twenty days, on at least four of the six engines. An engine that falls below the ninety percent capture floor is excluded from the primary analysis and reported, not imputed, and the run remains valid if at least four engines hold. If an entire base model family drops out, that fact is reported and a preregistered sensitivity analysis re estimates the result without that family, because such a failure is not missing at random. A run that meets these conditions is adequate, and its result, support or falsification, stands.
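The capture portion of the validity rule can be sketched as follows. The function names and the per engine count dictionaries are assumptions, and the lower case engine keys are placeholders. The pre window conditions and the base model family sensitivity analysis are outside this sketch.

```python
def retained_engines(captured: dict[str, int], scheduled: dict[str, int],
                     floor: float = 0.90) -> list[str]:
    """Engines whose captured share of scheduled observations meets the floor."""
    return sorted(
        engine for engine, total in scheduled.items()
        if captured.get(engine, 0) / total >= floor
    )


def capture_valid(captured: dict[str, int], scheduled: dict[str, int],
                  min_engines: int = 4) -> bool:
    """The run passes the capture criterion if enough engines are retained."""
    return len(retained_engines(captured, scheduled)) >= min_engines
```

Engines below the floor are dropped from the primary analysis and reported. Their missing observations are not imputed.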

A run is invalid, and is published as an environmental failure rather than as support or falsification, only under conditions fixed in advance and symmetric across arms: capture holding on fewer than four engines, a major model ecosystem change that suspends measurement, or the zero environment condition below. Invalidity is never declared on the basis of the direction of the result.

The zero environment rule. If the suppression pretest passed but, during the confirmatory window, the aggregate mention rate across all four arms remains at zero for sixty consecutive days, the test environment is declared retrieval suppressed and the run is invalid. This is symmetric: it does not favor KFO, because the KFO arm is also at zero, and it cannot be invoked to rescue a KFO arm that simply failed to beat the controls while the controls were surfaced.
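A minimal check for this rule is sketched below. It assumes the streak is measured from the first run with zero aggregate mentions to the latest consecutive run with zero aggregate mentions, using run day offsets. The locked validity appendix governs the exact counting convention, and the function name is hypothetical.

```python
def zero_environment(runs: list[tuple[int, int]], min_days: int = 60) -> bool:
    """Detect a zero streak spanning at least `min_days`.

    runs: (day_offset, aggregate_mentions_across_all_four_arms) per run.
    """
    streak_start = None
    for day, mentions in sorted(runs):
        if mentions == 0:
            if streak_start is None:
                streak_start = day
            if day - streak_start >= min_days:
                return True
        else:
            # Any mention in any arm resets the streak.
            streak_start = None
    return False
```

Because the input is the aggregate across all four arms, a run in which the controls are surfaced and the KFO arm is not can never trigger the rule.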

Threats to Validity and Controls

  • Model updates mid study. Model version recorded at every run; analysis stratifies by version when versions change.
  • Interface versus consumer surface. Capture method recorded per engine and held constant; differences between an automated interface and the consumer application are documented as a bound on inference.
  • Shared base models. Engines built on a common base model are not treated as fully independent; base model enters the analysis as a fixed effect with cluster robust standard errors. The inference is read as evidence across six consumer facing AI surfaces, not across six independent AI ecosystems.
  • Disclosure construct validity. Addressed by the suppression pretest and the disclosure effect probe, and stated as a residual limit in scope.
  • Discovery speed confound. Indexing parity verified to tolerance before the window opens.
  • Entity ineligibility. Eligibility gate enforced before enrollment, so entities do not fail for reasons unrelated to KFO.
  • Analyst degrees of freedom. Primary outcome, model, threshold, exclusions, base model handling, and codebook all preregistered and deposited; scoring blind; analysis script deposited.
  • Floor effects. Handled by the suppression pretest and the symmetric zero environment rule.

What a Result Would and Would Not Establish

A clean separation of the KFO arm from all three controls would establish that the KFO method produces a cross query, cross time formation effect distinct from ordinary content and SEO and from structural density alone, for disclosed research entities, in the tested category and engines, over the tested window. It would not establish the commercial magnitude of that effect for a real bookable property against established incumbents, that the effect generalizes to other industries or categories, or that it persists indefinitely as models change. A single positive execution is evidence for the operationalized claim but not definitive validation absent independent replication, particularly given that AGR originated the framework. A failure of any hypothesis would count against KFO as operationalized here. The inference is bounded to what was tested.

Commitment to Publish

If Americas Great Resorts initiates data collection under this protocol, the following self binding publication mandate applies, regardless of whether the result supports or falsifies KFO. Within fourteen days of the close of the measurement window, a cryptographic hash of the raw, unedited data is posted to the public materials repository. Within ninety days of the close of the window, the complete unedited dataset, including raw response logs and blind scores, the analysis, and a written report, are released to open access on Zenodo. This mandate is triggered by the initiation of data collection and does not obligate AGR to begin the test on any schedule. Once begun, it cannot be quietly abandoned.
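The hash commitment can be produced with a standard library call. The protocol specifies only a cryptographic hash, so the choice of SHA-256 here is an assumption, as are the function names. The sketch reads the data in chunks so that large raw logs need not fit in memory.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hex digest of everything readable from a binary stream."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: str) -> str:
    """Hex digest of a file's raw bytes, read without modification."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)
```

Anyone holding the later released dataset can recompute the digest and compare it with the value posted within the fourteen day deadline.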

Registration, Versioning, and Locked Appendices

This protocol and its full appendix package are deposited together under a single coordinated Zenodo record and a matching GitHub release tag, accompanied by a manifest that lists every appendix and its hash, so there is no ambiguity about what constitutes the locked runnable package. Zenodo assigns a timestamp and a persistent DOI, and the materials repository is fixed by its GitHub commit hash, recorded in the registration metadata. The deposited version is read only. Any later change is recorded as a new version with a dated, public changelog entry describing what changed and why, so the original locked version remains verifiable. The final report includes a deviations from protocol section that discloses any departure from this specification.

The following locked appendices are deposited with the registration and together constitute the runnable package, so that an independent party can execute the study without involvement from AGR:

  1. The frozen prompt set, forty prompts: thirty substantive category prompts and ten negative controls.
  2. The scoring codebook, including binary rules and at least one worked example.
  3. The Arm A KFO intervention specification.
  4. The Arm B content and SEO specification.
  5. The Arm C structure only specification, including the placeholder template.
  6. The power assumptions and the a priori power calculation.
  7. The justification for the fifteen percentage point smallest effect size of interest.
  8. The run validity, adequacy, and exclusion rules, including the suppression pretest thresholds.
  9. The data capture schema.
  10. The automated execution harness script and the per engine interface documentation.
  11. The analysis specification and a runnable analysis script with simulated data.

References

Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., and Deshpande, A. (2024). GEO: Generative Engine Optimization. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 5-16.

Lakens, D., Scheel, A. M., and Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259-269.

Center for Open Science. Preregistration. https://www.cos.io/initiatives/prereg

The documented AI mechanisms underlying the claim under test are cited in full in the companion article and the academic framework paper.
