I Built an AI Research Lab in My Terminal


This is an open application for the Perplexity Research Residency. I am not a PhD researcher. I am a builder who has spent the last year running what amounts to an applied research lab out of a terminal window. Here is what I have learned, and what I would do with three months of compute and mentorship.


The Setup

Every day I work inside a system I built from scratch: a development environment where AI agents are not assistants I talk to, but infrastructure I build on. The system manages dozens of concurrent AI sessions, each with persistent memory, tool access, browser automation, and the ability to spawn sub-agents for parallel work.

This is not a wrapper around ChatGPT. It is a working operating environment where the boundaries between human intent and machine execution are deliberately blurred, tested, and refined.


The system includes:

  • Agent orchestration that manages parallel sub-agents with state tracking, dependency resolution, and automatic recovery when things go wrong
  • Persistent memory that survives across sessions, storing not just facts but behavioral feedback, project context, and reference pointers to external systems
  • Browser automation that lets agents navigate real websites, fill forms, extract data, and verify deployed applications
  • A verification framework that checks whether agents actually achieved their goals, not just whether they completed their tasks
  • A stress detection system that monitors my own messages for signs of flow disruption, classifies root causes, and adjusts agent behavior in real time

That last one sounds strange. It is also the most interesting piece of research in the whole system.


The Research Questions I Am Already Answering

I did not set out to do research. I set out to get work done faster. But the problems I kept hitting are the same ones the field is struggling with, and the solutions I built are worth examining.

1. How do you make agent tool use reliable?

The industry consensus is that tool use is fragile. Agents hallucinate function calls, misinterpret outputs, and fail silently. The standard fix is better prompting or fine-tuning.

I found a different answer: design the tool layer so failure is cheap and verification is automatic.

My system does not trust that an agent used a tool correctly. After every significant action, a verification loop checks the actual state of the world (file system, browser DOM, git history) against the intended outcome. When there is a mismatch, the system does not retry blindly. It diagnoses the category of failure and routes to the appropriate recovery strategy.

This is not prompt engineering. It is systems engineering applied to AI workflows. And it works far more reliably than asking the model to "be more careful."
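A minimal sketch of what a verification loop in this style might look like. All names here (`FailureKind`, `verify`, `diagnose`) and the failure taxonomy are hypothetical illustrations; the post does not publish the real categories or recovery strategies.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional

class FailureKind(Enum):
    # Toy failure taxonomy for illustration only.
    SILENT_NOOP = auto()   # tool reported success but the world did not change
    UNKNOWN = auto()       # mismatch we cannot classify yet

@dataclass
class ActionResult:
    intended: str   # the outcome the agent claimed to achieve
    observed: str   # what a fresh read of the world actually shows

def verify(check_world: Callable[[], str], intended: str) -> ActionResult:
    """Re-read actual state (file system, DOM, git) instead of trusting the agent."""
    return ActionResult(intended=intended, observed=check_world())

def diagnose(result: ActionResult) -> Optional[FailureKind]:
    """On mismatch, classify the failure so recovery can be routed, not retried blindly."""
    if result.observed == result.intended:
        return None  # verified: intended outcome matches the world
    if result.observed == "":
        return FailureKind.SILENT_NOOP
    return FailureKind.UNKNOWN
```

The point of the pattern is that the check reads the world again rather than re-asking the model what it did, so a confident hallucination cannot pass verification.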

2. How should AI systems manage context across sessions?

Every AI product struggles with this. Context windows are finite. Conversations end. Users come back the next day and the AI has amnesia.

I built a memory architecture that separates context into four types:

  • User memories (who I am, how I work, what I know)
  • Feedback memories (corrections and confirmations, with reasons)
  • Project memories (what is happening, why, and by when)
  • Reference memories (where to find things in external systems)

Each type has different write triggers, different staleness profiles, and different retrieval patterns. The system verifies memories against current state before acting on them, because a memory that says "function X exists in file Y" is a claim about the past, not the present.
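As a sketch, the taxonomy above could be modeled as a record type plus per-type staleness windows. The concrete durations and the `usable` helper below are invented for illustration; the post does not state the real write triggers or thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

# Hypothetical staleness profiles per memory type (illustrative values only).
STALENESS = {
    "user": timedelta(days=90),      # who I am changes slowly
    "feedback": timedelta(days=30),  # corrections stay relevant for a while
    "project": timedelta(days=7),    # project state moves fast
    "reference": timedelta(days=1),  # pointers into external systems rot quickly
}

@dataclass
class Memory:
    kind: str        # "user" | "feedback" | "project" | "reference"
    claim: str       # e.g. "function X exists in file Y"
    written_at: datetime

    def is_stale(self, now: datetime) -> bool:
        return now - self.written_at > STALENESS[self.kind]

def usable(mem: Memory, verify_now: Callable[[str], bool], now: datetime) -> bool:
    """A memory is a claim about the past; re-check stale claims against the present."""
    if mem.is_stale(now):
        return verify_now(mem.claim)
    return True
```

The design choice worth noting: staleness does not delete a memory, it just demotes it from "trusted" to "verify before acting on it."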

This is a small-scale implementation of a problem Perplexity is working on directly. Your memory feature for AI assistants shipped in November 2025. The questions I am answering in my terminal are the same questions your team is answering at scale: what to remember, when to forget, and how to avoid acting on stale state.

3. When should a human intervene in an agentic workflow?

This is the question nobody wants to talk about because the answer undermines the "fully autonomous agent" narrative. But it is the most important question for making AI agents useful in practice.

My stress detection system is a direct attempt to answer it empirically. Every message I send is scanned for signals of flow disruption: frustration markers, repeated corrections, context switches that suggest the agent lost the thread. The system classifies root causes (did the agent make an assumption it should not have? did it change something without asking? did it fail to verify before acting?) and adjusts behavior in real time.

Over months of data, clear patterns emerged. Three prevention rules handle the vast majority of flow disruptions:

  1. Verify before changing. Never modify something based on assumptions about its current state.
  2. Smallest possible fix. When something breaks, do the minimum intervention, not a refactor.
  3. Admit uncertainty. When the agent does not know, say so, instead of guessing confidently.

These sound obvious. They are also the three things every AI system gets wrong under pressure, because the training incentive is to be helpful and confident, not cautious and honest.
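A toy sketch of the signal-scanning half of this loop, mapping disruption signals to the three prevention rules above. The marker regex, thresholds, and signal-to-rule mapping are all invented for illustration; the real classifier and its training data are not described in detail here.

```python
import re

# Illustrative frustration markers; the real signal set is richer than this.
FRUSTRATION = re.compile(r"\b(no|again|stop|wrong)\b", re.IGNORECASE)

def flow_disruption_signals(messages: list[str]) -> dict[str, int]:
    """Scan recent user messages for frustration markers and repeated corrections."""
    frustration = sum(bool(FRUSTRATION.search(m)) for m in messages)
    # Repeated correction: the same instruction restated in consecutive messages.
    repeats = sum(
        1 for a, b in zip(messages, messages[1:])
        if a.strip().lower() == b.strip().lower()
    )
    return {"frustration_markers": frustration, "repeated_corrections": repeats}

def adjust_behavior(signals: dict[str, int]) -> list[str]:
    """Map signals to prevention rules (an assumed mapping, for illustration)."""
    rules = []
    if signals["frustration_markers"]:
        rules.append("verify before changing")
    if signals["repeated_corrections"]:
        rules.append("smallest possible fix")
    if not rules:
        rules.append("admit uncertainty")  # default cautious posture
    return rules
```

Even this crude version captures the core idea: the system treats the human's messages as telemetry about the agent's behavior, not just as instructions.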


What I Would Research at Perplexity

Perplexity is building the infrastructure layer for AI-powered information retrieval. You have search, you have models, you have agents, and you have a browser (Comet). The problems I have been solving in my one-person lab are the same problems you face at scale:

Proposed focus: Reliable multi-step agent workflows for complex information tasks.

Specifically:

  1. Verification architectures for search agents. When a Perplexity agent follows a multi-step research path (search, read, synthesize, cite), how do you verify the chain? My verification loop approach could be tested against Perplexity's actual search pipelines to measure whether post-hoc verification catches errors that prompt-level guardrails miss.

  2. Context persistence for research sessions. A user doing deep research across multiple sessions needs the system to remember not just what was found, but what was tried, what was ruled out, and why. My four-type memory taxonomy could be evaluated as a framework for structuring this kind of research-session state.

  3. Human-agent handoff points in agentic search. When should a search agent ask for clarification vs. make an assumption and proceed? My stress-signal data suggests there are predictable patterns. With Perplexity's scale, we could test whether those patterns generalize beyond a single user.


Why This Format

I am publishing this as a blog post instead of filling out a form because the work speaks better than credentials do. You can evaluate my technical thinking, my writing, and my research instincts from this page. A resume would tell you I have been in tech for years. This post tells you what I actually think about and how I solve problems.

The system I described is not theoretical. It runs every day. The code exists. The patterns are tested. What I lack is compute at scale, collaborators who are working on the same problems, and three months to focus on nothing else.

That is exactly what the residency offers.


About Me

I am Brian Sowards. I run an AI consulting practice (sowards.ai) focused on helping teams adopt AI workflows that stick. Before that, I spent years building software and leading engineering teams.

My background is non-traditional by research standards. I do not have a PhD. I do not have publications. What I have is a working system that demonstrates the ideas in this post, hundreds of hours of empirical data on human-AI collaboration, and the ability to ship things that work.

If you are evaluating Research Resident applications and this resonates, I would welcome the conversation. You can reach me at brian@sowards.ai or book time at sowards.ai/contact.
