Hackazona / Cognite
An Industrial AI Agent in 48 Hours
Built at Hackazona, Cognite's industrial AI hackathon. A working anomaly triage agent that reasons over 3 million sensor readings, the maintenance history, and the technical docs sitting in PDFs.
The Problem Cognite Framed
Cognite ran a hackathon on a simulated offshore oil and gas platform. Six months of operations. 96 assets. 175 sensors producing 3 million rows of 15-minute readings. 84 maintenance work orders. 7 real failures hidden in the timeline, with misleading anomalies layered on top. A folder of SOPs, P&IDs, and equipment manuals in PDF form.
The brief: build a next-generation industrial AI agent that can reason over all of that to help a field manager triage anomalies, trace root causes, and pull the right documentation when something is going wrong.
The harder problem underneath it: a field manager does not know what they do not know. The expertise that matters lives in the heads of a small number of subject matter experts. The SMEs only get pulled in reactively, after something breaks. By the time they are looped in, production is already down.
I called this the tribal engineer problem. The visualization below is how I framed it for the opening of the hackathon deck.
The map shows the Phoenix metro area during Hackazona. The dark blots are where the tribal knowledge lives. A refinery sitting on the coast has no way to reach into that map. The agent is the bridge.
What I Built
A dashboard that runs the full anomaly triage loop end to end.
The top left shows the top six prioritized questions the agent pulled out of the timeseries. Each question is a thesis, not a reading. "Is E-301 fouled? Outlet temperature trending above normal while flow is dropping." "Is P-101 safe to keep running? Vibration peaked at 9.0 millimeters per second." The questions are the agent's working hypotheses, written in the language a field manager would use in a shift handoff.
Click any question and the right side populates. A 3D sensor timeline renders the breach window. The raw sensor logs are one click away. A Run Root Cause Analysis button fires a grounded LLM call that reads the event context, cross-references the maintenance history and the SOP corpus, and returns a markdown report with an actionable to-do list.
A Telegram panel in the right sidebar lets the field manager push the question straight to a scoped Telegram channel for the asset, so the technicians on the ground can chime in with what they are seeing. The channel is created on the fly if it does not already exist.
The system does five things in sequence:
- Anomaly detection across 3 million sensor readings in DuckDB, filtered down to the events that actually matter.
- Question generation that turns each event into a thesis a human would recognize.
- Context building that pulls the related sensors, recent maintenance, nearby failures, and relevant SOPs.
- Root cause analysis via GPT-4o, with the full context packed into a single prompt.
- Field communication via Telegram, scoped to the asset and the question.
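The first step in that sequence, turning raw readings into a short list of events, could be sketched roughly as follows. This is a minimal illustration and not the detection logic the project used: the fixed threshold, the `min_points` noise filter, the sensor name `P-101.vibration`, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    sensor: str
    start: datetime
    end: datetime
    peak: float


def _to_event(sensor, run):
    # Summarize a run of consecutive breaching readings as one event.
    return Event(sensor, run[0][0], run[-1][0], max(v for _, v in run))


def find_breach_events(readings, sensor, limit, min_points=2):
    """Group consecutive readings above `limit` into events.

    `readings` is a time-ordered list of (timestamp, value) pairs.
    Runs shorter than `min_points` are dropped as likely noise.
    """
    events, run = [], []
    for ts, value in readings:
        if value > limit:
            run.append((ts, value))
            continue
        if len(run) >= min_points:
            events.append(_to_event(sensor, run))
        run = []
    if len(run) >= min_points:  # a breach still open at the end of the data
        events.append(_to_event(sensor, run))
    return events


# Hypothetical 15-minute vibration readings in mm/s; the limit is made up.
t0 = datetime(2025, 1, 1)
values = [3.1, 3.0, 7.9, 9.0, 8.2, 3.2, 7.6, 3.0]
readings = [(t0 + timedelta(minutes=15 * i), v) for i, v in enumerate(values)]
events = find_breach_events(readings, "P-101.vibration", limit=7.1)
```

With these sample values, the three consecutive readings above the limit collapse into a single event, and the isolated 7.6 spike is discarded. The same idea applies whether the grouping runs in Python or in SQL inside DuckDB.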
The Architecture Choices That Mattered
Three decisions shaped what the agent could actually do in 48 hours.
DuckDB over a hosted time series database. The 3-million-row timeseries file is heavy, but DuckDB reads it straight off disk with zero setup. I loaded it once into a prebuilt platform.duckdb file, and the backend opens that file read-only on startup. Sub-second queries against arbitrary sensor windows. No infra to stand up.
Questions as the primary abstraction, not alarms. The instinct is to surface every threshold breach. That is noise. Ranking events by severity and rolling up the top six into natural language questions made the dashboard feel like a briefing instead of a monitor. A field manager reads six lines and knows what to ask about.
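A rough sketch of that ranking and roll-up step is below. The `Event` fields, the numeric `score`, the "medium" and "low" severity levels, the question template, and the `V-201` event are assumptions for illustration; only the two quoted questions come from the dashboard described above.

```python
from dataclasses import dataclass

# Lower rank sorts first. "medium" and "low" are assumed levels.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Event:
    asset: str
    severity: str
    score: float      # hypothetical anomaly score; larger means worse
    hypothesis: str   # phrased to complete "Is <asset> ...?"
    evidence: str


def top_questions(events, limit=6):
    """Rank events by severity, then score, and phrase the top ones as questions."""
    ranked = sorted(events, key=lambda e: (SEVERITY_RANK[e.severity], -e.score))
    return [f"Is {e.asset} {e.hypothesis}? {e.evidence}" for e in ranked[:limit]]


events = [
    # V-201 is a made-up event used only to show a lower-ranked item.
    Event("V-201", "high", 0.4, "drifting out of calibration",
          "Level reading diverges from its redundant sensor."),
    Event("P-101", "critical", 0.9, "safe to keep running",
          "Vibration peaked at 9.0 millimeters per second."),
    Event("E-301", "critical", 0.7, "fouled",
          "Outlet temperature trending above normal while flow is dropping."),
]
questions = top_questions(events, limit=2)
for q in questions:
    print(q)
```

Capping the list is the point of the design: everything below the cutoff still exists in the event table, but the field manager only sees what is worth asking about.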
RAG on the SOP corpus, folded into the RCA prompt. Instead of standing up a separate retrieval pipeline, the RCA service builds the full event context, formats it for the model, and appends the relevant documents inline. Slower prompts, richer answers, no vector database to maintain. The tradeoff is fine at 48-hour scale.
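One way the inline approach might look is sketched here. The function name `build_rca_prompt`, the dictionary shapes, the section headings, and the rule that a document counts as relevant when it mentions the asset tag are all assumptions for the example; the actual RCA service may select and format documents differently.

```python
def build_rca_prompt(event, work_orders, documents, max_doc_chars=2000):
    """Pack event context and matching documents into one prompt string."""
    asset = event["asset"]

    # Assumption: a document is "relevant" if its text mentions the asset tag.
    relevant = [d for d in documents if asset.lower() in d["text"].lower()]

    history = [
        f"- {wo['date']}: {wo['summary']}"
        for wo in work_orders
        if wo["asset"] == asset
    ]

    sections = [
        "You are assisting a field manager with root cause analysis.",
        f"## Event\nAsset: {asset}\nQuestion: {event['question']}",
        "## Recent maintenance\n" + ("\n".join(history) or "(none on record)"),
    ]
    for doc in relevant:
        # Truncate long documents so the prompt stays within a size budget.
        sections.append(f"## Document: {doc['title']}\n{doc['text'][:max_doc_chars]}")
    sections.append(
        "Return a markdown report with the most likely cause, "
        "the supporting evidence, and an ordered action list."
    )
    return "\n\n".join(sections)


# Hypothetical inputs, invented for the example.
event = {"asset": "E-301", "question": "Is E-301 fouled?"}
work_orders = [
    {"asset": "E-301", "date": "2025-03-02", "summary": "Cleaned tube bundle."},
    {"asset": "P-101", "date": "2025-03-10", "summary": "Replaced bearing."},
]
documents = [
    {"title": "Cooler manual", "text": "E-301 produced water cooler: cleaning procedure."},
    {"title": "Pump manual", "text": "P-101 export pump: vibration limits."},
]
prompt = build_rca_prompt(event, work_orders, documents)
```

The resulting string would then be sent as a single message to the model. Because nothing is indexed ahead of time, every call pays for the full document text, which is the slower-prompt tradeoff described above.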
What It Looks Like in Practice
A field manager opens the dashboard. The status bar shows 141 events, 6 critical, 33 high. The first question is about E-301, the produced water cooler. Outlet temperature is trending above normal while flow is dropping. Classic fouling signature, but it could also be a fouled inlet filter, a control valve stuck partially closed, or sensor drift.
They click Run Root Cause Analysis. The agent reads the last four months of sensor data for E-301, the maintenance work orders that touched it, the failure history for the produced water system, and the cooler's equipment manual. It comes back with a structured report: the most likely cause, the sensor evidence, the related assets to check, and an ordered action list that starts with isolation and ends with a recommended preventive maintenance update.
They scroll to the Telegram panel. The channel for E-301 does not exist yet, so they hit create. A channel is spun up with the on-call technicians, the question is posted as the opening message, and the conversation starts.
The whole loop from anomaly to technician conversation is about 30 seconds.
The Takeaway for Non-Oil-and-Gas Teams
The tribal engineer problem is not unique to refineries. Every organization has a small number of people who hold the working knowledge of how the business actually runs, and the rest of the team pays a tax every time those people are not in the room.
The pattern Hackazona made obvious: AI agents are good at generating hypotheses, pulling the right context, and handing a human a starting point. They are not good at being the final answer. The system works because the agent narrows the field to six questions and hands the expert something concrete to react to. The expert does the last mile.
That shape transfers directly to sales ops, legal review, customer support triage, and any domain where the judgment calls are the bottleneck.
What I Would Build Next
Three things, in order:
- Feedback writes back into the agent. The SME's response in the Telegram channel should update the question's data section and the RCA prompt for the next event. The agent should get smarter inside a single shift.
- A case library. Every resolved anomaly becomes a retrievable case. New events get similarity-matched against resolved ones before the LLM call, so the agent gets to ride on the work that has already been done.
- A handoff artifact. At the end of every RCA, the system should generate a one-page SOP update proposal that either patches an existing document or proposes a new one. The tribal knowledge that got surfaced in the chat becomes a durable artifact instead of evaporating into the channel history.
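The case library idea could start as simply as the sketch below. Nothing here is built yet: the symptom tags, the case records, the choice of Jaccard similarity, and the 0.5 threshold are all assumptions chosen to show the shape of the matching step, and embeddings or sensor-shape comparison would be reasonable alternatives.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of tags, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_cases(event_tags, cases, threshold=0.5):
    """Return resolved cases whose tags overlap the new event, best match first."""
    scored = [(jaccard(event_tags, case["tags"]), case) for case in cases]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [case for score, case in scored if score >= threshold]


# Hypothetical resolved cases and symptom tags.
cases = [
    {"id": "case-1", "tags": {"outlet_temp_high", "flow_low"}, "resolution": "Cleaned cooler."},
    {"id": "case-2", "tags": {"vibration_high"}, "resolution": "Replaced bearing."},
]
new_event_tags = {"outlet_temp_high", "flow_low", "dp_high"}
matches = similar_cases(new_event_tags, cases)
```

Any matches would be prepended to the RCA context before the LLM call, so the model starts from what the team already resolved.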
The dashboard shipped. The rest is a roadmap I would take to a paying customer, not a hackathon judge.
Built solo at Hackazona, Cognite's industrial AI hackathon, April 2026. Stack: FastAPI, DuckDB, React with Three.js, GPT-4o, Telethon. If this pattern resonates for a domain your team is wrestling with, book a call and we can talk through where the tribal knowledge is hiding in your business.