How I orchestrated multiple AI systems to produce a submission-ready research paper in a few days
A behind-the-scenes look at using Claude, ChatGPT, and a DGX Spark as an AI research team.
I just published an independent statistical analysis of one of the most debated claims in archaeology — that Göbekli Tepe's Pillar 43 encodes a 13,000-year-old astronomical message. 13 experiments, ~1,500 lines of Python, a 15-page paper with figures, tables, and 27 references.
The whole thing — from first reading the original paper to having a submission-ready manuscript — took a few days.
Here's how I used AI at every step, what worked, what didn't, and what I learned about orchestrating LLMs as a research team.
I'm an independent researcher. No university affiliation, no department, no grad students. My hardware is a Mac Mini M4 for orchestration, a Mac Studio M3 Ultra running local LLMs through MLX, and an NVIDIA DGX Spark for heavy computation.
My approach: use different AI systems for different roles, the same way a research team divides labor. Not one AI doing everything. Multiple systems, each with a specific job, each checking the others.
I started by feeding Claude the original Sweatman & Tsikritsis (2017) paper, the Notroff et al. (2017) rebuttal, and the revised Sweatman & Gerogiorgis (2025) probability analysis. I asked it to identify what a quantitative rebuttal would need to test.
Claude identified the two-component structure that became the paper's backbone: the date-coincidence claim and the visual-similarity claim are mathematically independent, so they should be tested separately. None of the published papers made that decomposition: neither Sweatman nor his critics had evaluated the two components independently.
This is where LLMs are genuinely useful for research: not as search engines, but as reasoning partners that can read multiple papers and identify structural gaps that no single paper addresses.
Claude wrote the Python script through an iterative conversation. Not a single prompt that produced a finished script — that never works for anything nontrivial. Instead, I'd describe what the next experiment should test and why. Claude would write the function. I'd look at the output, decide whether the result answered the question I actually cared about, and steer the next step.
My role was scoping and sequencing, not line-by-line code review. I decided which experiments mattered, which results raised new questions, and when to stop. Claude handled the implementation.
The script started with 4 experiments. By the end it had 13, each one added because the previous results raised a new question. The rank-sum decomposition (top 3 vs bottom 3) came from asking "where is the significance actually concentrated?" The coverage fraction came from a ChatGPT reviewer asking "what fraction of the era is near some event anyway?" The circular-shift test came from a reviewer pointing out that my original negative control was just re-deriving the same null.
Each experiment was a conversation, not a prompt. The script evolved through maybe 20 iterations over a few days.
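To make the coverage-fraction question concrete, here is a minimal sketch of one way to compute it. It is not the code from the repository: the function name, the event years, the tolerance, and the era bounds are all placeholders chosen for illustration.

```python
def coverage_fraction(event_years, tolerance, era_start, era_end):
    """Fraction of [era_start, era_end] within `tolerance` years of any event."""
    if era_end <= era_start:
        raise ValueError("era_end must be greater than era_start")

    # One window per event, clipped to the era under study.
    windows = []
    for year in event_years:
        lo = max(era_start, year - tolerance)
        hi = min(era_end, year + tolerance)
        if lo < hi:
            windows.append((lo, hi))

    # Merge overlapping windows so shared years are counted only once.
    windows.sort()
    covered = 0.0
    cur_lo = cur_hi = None
    for lo, hi in windows:
        if cur_hi is None or lo > cur_hi:
            if cur_hi is not None:
                covered += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        covered += cur_hi - cur_lo

    return covered / (era_end - era_start)


# Hypothetical event years (negative = BC) and a hypothetical 100-year tolerance.
events = [-10950, -10700, -10650, -9500]
print(coverage_fraction(events, tolerance=100, era_start=-11000, era_end=-9000))
```

With these made-up inputs, 600 of the 2,000 years fall inside some window, so a randomly chosen date would land "near an event" about 30% of the time. A high coverage fraction means a date match carries little evidential weight on its own.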
Adversarial review is the part most people skip, and it's the most valuable.
After each major version of the script, I sent the full code and output to ChatGPT and asked it to review it as a hostile peer reviewer. Not "does this look good?" — "find everything wrong with this."
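A request along these lines could be assembled with a small helper like the one below. The helper name and the prompt wording are hypothetical, not the actual prompt used; only the "find everything wrong with this" framing comes from the workflow described above. The resulting string is pasted or sent to whichever reviewer model you use.

```python
def build_review_prompt(code: str, output: str) -> str:
    """Assemble a hostile-reviewer request (illustrative wording only)."""
    return "\n\n".join([
        "Act as a hostile peer reviewer for a statistics paper.",
        "Find everything wrong with this: flawed null models, overstated "
        "claims, and wording that the numbers do not support.",
        "--- SCRIPT ---\n" + code,
        "--- OUTPUT ---\n" + output,
    ])


# Placeholder script and output, just to show the shape of the prompt.
prompt = build_review_prompt("print('hello')", "hello")
print(prompt)
```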
ChatGPT caught real issues every time:
- The original "negative control" experiment was just re-deriving my existing null distribution with extra steps. ChatGPT identified that 10,000 random panels producing 0 hits wasn't informative — it was expected from the threshold. It suggested the circular-shift test instead, which actually tests something new (whether the absolute event placement matters).
- The bottom-3 probability (1 in 187, or about 0.005) is below 0.01. My original wording called this "statistically ordinary." ChatGPT flagged it: "That's significant by conventional thresholds. Your wording is wrong." Fixed.
- The multiverse analysis originally treated all paths as equally defensible. ChatGPT pointed out that dropping animals post hoc is a much stronger degree of freedom than choosing center vs. window anchor. That led to the primary vs. extended multiverse split.
- The identity permutation test originally multiplied P(rank-sum) × P(identity) and presented it as a new combined p-value. ChatGPT caught that this isn't a clean joint probability because the identity factor conditions on the specific rank multiset. Removed.
Each of these was a real error that would have weakened the paper if published. Different AI systems have different blind spots. Using multiple models as sequential reviewers catches things that a single model (or a single human) is likely to miss.
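The circular-shift idea from the first item in the list can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the `hits` statistic, the target/event terminology, the tolerance, and all numbers are hypothetical. The point is that shifting every event by the same random offset, wrapping around the era, preserves the spacing between events while randomizing their absolute placement.

```python
import random


def hits(targets, events, tolerance):
    """Count targets lying within `tolerance` years of at least one event."""
    return sum(any(abs(t - e) <= tolerance for e in events) for t in targets)


def circular_shift_pvalue(targets, events, tolerance, era_start, era_end,
                          n_shifts=2000, seed=0):
    """Estimate how often a circularly shifted event set matches as well."""
    rng = random.Random(seed)
    span = era_end - era_start
    observed = hits(targets, events, tolerance)

    at_least = 0
    for _ in range(n_shifts):
        offset = rng.uniform(0, span)
        # Same offset for every event; wrap around the end of the era.
        shifted = [era_start + ((e - era_start + offset) % span) for e in events]
        # Distances are linear, not circular: a simplification of this sketch.
        if hits(targets, shifted, tolerance) >= observed:
            at_least += 1

    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_shifts + 1)


# Hypothetical inputs: one target date and four events in a 2,000-year era.
obs, p = circular_shift_pvalue(
    targets=[-10950],
    events=[-10940, -10700, -10650, -9500],
    tolerance=50,
    era_start=-11000,
    era_end=-9000,
)
print(obs, p)
```

Unlike redrawing random panels from the same null, this changes only where the events sit on the timeline, so a small p-value says something specific about absolute placement.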
One of the paper's most important arguments — the semantic-geometric circularity problem — came out of a conversation about whether to build a blind visual matching study.
I had already built the web app. Participants would see animal carvings and tap which constellation stick figures looked most similar. Then I stopped and thought about it: the stick figures are abstract dot-and-line patterns. The carvings are realistic animals. Nobody would match them. They'd guess randomly.
But if I used traditional constellation illustrations instead (the artistic depictions with the mythological figures), participants would match scorpion to Scorpius trivially — because both are scorpions. That doesn't prove anything about astronomy.
This led to the three-layer analysis that became Section 5.2 of the paper: did ancient people see constellations at all → did they group the same stars → did they assign the same animals? Sweatman's statistical test operates only at layer 3 and assumes layers 1 and 2 are settled. They aren't.
The decision not to run the blind study — and to explain why it can't fully resolve the question — makes the paper stronger than running a study with an ambiguous result.
Claude produced the structural outline and most of the draft text. My role was to steer the argument, decide what to keep, reject weak framing, and use ChatGPT as an adversarial reviewer to pressure-test the logic and wording. I was acting less like a traditional solo author writing line by line, and more like a research director coordinating multiple fast but imperfect collaborators.
Specific things the review process caught:
- "The first independent quantitative evaluation" → "To our knowledge, the first independent quantitative decomposition." Protects against a reviewer finding some obscure prior attempt.
- "Within the published scoring framework, three animal-constellation matches rank 1st" — the "within the published scoring framework" qualifier was added after a reviewer pointed out that stating "rank 1st" without qualification implies objective fact rather than one researcher's assessment.
- The AI disclosure section was moved from Acknowledgments to its own section, because some journals require that.
The LaTeX conversion, table formatting, figure placement, and bibliography were all handled by Claude.
Role separation. Claude as the technical reasoning partner. ChatGPT as the hostile reviewer. Local models for quick iteration. Each system has different strengths and different blind spots. Using them together catches more errors than any single system.
Iterative development. Not "write me a paper." Instead: build one experiment, run it, look at the output, decide what question comes next, build the next experiment. The script grew organically from the results, not from a predefined plan.
Treating AI review as real review. Every piece of feedback from ChatGPT was evaluated on its merits, the same way you'd evaluate a human reviewer's comments. Some suggestions were wrong and got rejected. Most were right and improved the work.
Autonomous agents did not work. Before this project, I tried a Karpathy-style autonomous loop in which an LLM would write code, run it, evaluate the output, and iterate without human intervention. The model would get stuck in loops, make changes that broke earlier functionality, or optimize for the wrong metric. Human-in-the-loop was dramatically more productive.
Single-model workflows did not work either. Using only one AI system for everything produces blind spots. Claude is excellent at reasoning but occasionally too agreeable: it'll write code that does what you asked rather than what you need. ChatGPT is better at adversarial critique but less consistent in long technical conversations. Using both compensates for each system's weaknesses.
The key bottleneck was not typing speed or code generation. It was judgment: which questions mattered, which experiments actually added value, which critiques were right, and when the paper was strong enough to stop iterating. AI provided throughput. I provided direction and final responsibility.
The best metaphor I've found: AI is a team of postdocs who can write code in any language, format papers in any style, and work 24 hours a day, but have no taste and no judgment about what matters. You supply the research direction. They supply the execution.
This paper would have taken weeks or months working alone. It took days. Not because the AI did the thinking — but because it handled the mechanical parts fast enough that I could focus entirely on the decisions that actually shape the research.
- Full pipeline: https://github.com/simossss/gobekli-pillar43-statistics
- Paper: https://doi.org/10.5281/zenodo.20364947
- Previous project (SAR tomography reproduction): https://doi.org/10.5281/zenodo.19574701