AI as a Business Tool, Not a Business Mind

A great deal of public discussion about AI still misses the point in both directions. On one side, it is treated as though it were a kind of digital judgment that can simply hand a business the answer. On the other, it is dismissed as little more than polished autocomplete. Neither description is especially useful in practice. Large language models can be powerful. They can also be wrong, shallow, or misaligned with what the user actually needs. The value, at least in my own work, is not in treating AI as a business mind. It is in understanding what kind of tool it is, how it works at a high level, what kinds of tasks different systems are better suited for, and how much more useful it becomes when the human user remains actively involved throughout the process.[2][3][4]
Large Language Model Overview
At a basic level, a large language model is trained on very large bodies of human-produced text. That text is broken into smaller units called tokens, which are not always whole words. They can be words, parts of words, punctuation, or other subword units. Modern LLMs are generally built on transformer architectures, which rely on attention mechanisms to model relationships within a sequence, and many generative models are trained autoregressively, meaning they generate likely continuations from the context already in front of them.[1][2]
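As a toy illustration of the autoregressive loop described above, here is a deliberately tiny bigram "model" over whitespace tokens. Real LLMs use subword tokens and transformer attention rather than raw word counts, and the corpus here is invented for the example; only the loop's shape carries over: score possible continuations, append one, repeat.

```python
from collections import Counter, defaultdict

# Toy corpus, split into whitespace "tokens" (real systems use subword units).
corpus = "the mill ships lumber . the mill dries lumber . the kiln dries lumber .".split()

# Count which token follows which -- a bigram stand-in for "likely continuations."
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def generate(token, steps):
    """Autoregressive generation: repeatedly append the most likely next token."""
    out = [token]
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])  # greedy next-token choice
    return " ".join(out)

print(generate("the", 4))
```

Even this trivial mechanism produces fluent-looking continuations of its training data, which is part of why "just predicting the next word" sounds dismissive yet undersells what prediction at scale can do.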
That is why I think the common phrase that an LLM is “just predicting the next word” is technically grounded but still incomplete. Yes, next-token prediction is part of the mechanism. But that alone does not explain why the system can summarize, compare, classify, rewrite, answer questions, adapt to instructions, or behave differently when the user gives it different examples and constraints. Brown et al. showed that tasks and demonstrations can be specified through text interaction alone, and that larger models become more capable in zero-shot, one-shot, and few-shot settings.[2] So while the mechanism may be described simply, the practical behavior is broader than that description often suggests.
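The zero-, one-, and few-shot distinction is simpler than it sounds: it is just how many worked demonstrations the prompt carries. A minimal sketch, with the prompt layout and example content invented for illustration rather than taken from any vendor's format:

```python
def build_prompt(instruction, demonstrations, query):
    """Specify a task through text alone, in the spirit of few-shot prompting.

    `demonstrations` may be empty (zero-shot), hold one input/output pair
    (one-shot), or several (few-shot).
    """
    lines = [instruction, ""]
    for example_input, example_output in demonstrations:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    lines += [f"Input: {query}", "Output:"]  # the model continues from here
    return "\n".join(lines)

few_shot = build_prompt(
    "Classify the sentiment as positive or negative.",
    [("Lumber demand is strong this quarter.", "positive"),
     ("The kiln was down for two days.", "negative")],
    "Export orders rose again in March.",
)
print(few_shot)
```

The same function with an empty demonstration list yields a zero-shot prompt, which is why the number of examples is a user-controlled lever rather than a property of the model.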
At the same time, usefulness should never be confused with factuality. Modern chat-oriented systems are commonly tuned to follow instructions and align more closely with user intent. InstructGPT describes this directly: larger base models can be untruthful, toxic, or simply not helpful, and fine-tuning with human feedback is used to make them better aligned with what users are asking them to do.[3] TruthfulQA is important here for the opposite reason. It shows that language models can still generate false answers that reflect misconceptions learned from human text.[4] That is why I do not think an LLM should be asked to be the answer. It should be used to help the user work toward an answer while the user remains responsible for checking what is true.[3][4]
Different Tools for Different Kinds of Work
I do not treat AI as a single monolithic system, and I do not use every model the same way. Different platforms expose different tools, different source pathways, and different strengths. That matters in practice, because choosing the tool is part of the workflow.[10][11][12][15]
When I am researching current public topics such as EUDR, trade policy, or changes in the national and global economy and how they affect the forestry products industry, I tend to use Gemini and its Deep Research tool. I do not use it as a button that produces a finished paper. I use it as part of an iterative, human-led exploration: research, discuss, refine, research again. Google’s own documentation describes Gemini Deep Research as an in-depth and real-time research tool that uses Google Search by default and can also incorporate other sources such as Gmail, Drive, uploads, and NotebookLM notebooks.[10] That matches the kind of live, public-web research work I use it for.
When I already have the relevant source material in hand and need structured analysis, I tend to move that work into ChatGPT. That is especially true when I need a report or summary produced in a specific way, with my own context, priorities, and formatting expectations carried through the result. OpenAI describes ChatGPT deep research as a system that can reason, research, and synthesize into a documented report using uploaded files, the public web, specific sites, and enabled apps, while keeping the user in control.[11] It also documents data analysis with uploaded files, where ChatGPT can work directly with user-provided data.[14] In my own repeated use, that has made ChatGPT more useful to me for source-driven analytical work than for open-ended public-web exploration.
And when I am running a repeatable, instruction-driven workflow on source material already in hand, I use ChatGPT Agent Mode. OpenAI describes ChatGPT agent as a mode that can reason, research, take actions, work with uploaded files, and do so while the user remains in control.[12] That is much closer to the way I use it when I am generating summary reports from meeting transcripts or dense industry reports. By contrast, Google describes Gemini Agent as a supervised multi-step task tool in which Gemini can act under user supervision and may require confirmation for some web tasks.[15] I do not treat one as universally superior to the other. But in my own work, I have consistently found ChatGPT Agent Mode more faithful and reliable when I need specific instructions followed in a controlled, repeatable way.[12][15]
Sometimes I also use one system to critique or pressure-test the work of another. I do not do that because I think one model will magically “fix” the other. I do it because a second pass from a different system or a fresh review context can expose weaknesses, omissions, or assumptions that were easier to miss in the original flow. Recent work on Cross-Context Review found that separating production and review into independent sessions improved error detection compared with same-session self-review.[16] That does not prove that every second-model review is better. But it does support the broader logic that independent review conditions can matter.
Structure Matters More Than the “Perfect Prompt”
When I use AI seriously, I do not approach it as a generic chatbot exchange. I structure the interaction.
That structure usually includes a few core things:
- a defined lens or role,
- defined source boundaries,
- a clearly stated task,
- an expected output structure,
- and some form of verification or reevaluation.
None of that guarantees truth. It does something more practical than that. It reduces ambiguity, narrows the field of possible responses, and makes it more likely that the output will be relevant, usable, and easier to evaluate.[3][16][17] Research on prompt sensitivity shows that LLM performance can vary substantially across prompt variants, and that few-shot examples can reduce some of that sensitivity.[17] In other words, the way the task is framed matters.
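To make that concrete, the five elements can be carried in a small reusable template. The field names and the example wording here are my own illustration, not any vendor's format or my actual production instructions:

```python
# Five structural elements of a serious AI interaction, held as data so
# they can be reviewed and reused rather than retyped each time.
STRUCTURE = {
    "role": "You are an analyst for a forestry products business.",
    "sources": "Use only the attached transcript. Do not add outside facts.",
    "task": "Summarize decisions and action items from the meeting.",
    "output": "Return sections: Decisions, Action Items (owner and due date), Open Questions.",
    "verification": "After drafting, re-check each item against the transcript and flag anything unsupported.",
}

def render(structure):
    """Assemble the elements into one labeled prompt."""
    return "\n\n".join(f"[{key.upper()}]\n{value}" for key, value in structure.items())

prompt = render(STRUCTURE)
print(prompt)
```

Holding the structure as data rather than free text is what makes it easy to test one variant against another, which matters given how sensitive model output can be to framing.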
This is also why I do not put much stock in the idea of a “perfect prompt.” In practice, useful AI work is iterative. The user clarifies the goal, adds missing context, narrows the scope, tests the first result, identifies what is weak or unsupported, and then refines the exchange. The model contributes pattern recognition, organization, comparison, and synthesis. The user contributes context, judgment, correction, and accountability.[3][16]
That process also becomes more useful when it is formalized. OpenAI’s documentation on Skills describes them as reusable workflows that tell ChatGPT exactly how to do a specific task so it can do that task better and more consistently.[13] I am not using that idea in an abstract sense. It is directly relevant to how I work. When the instructions for a recurring workflow are formalized, saved, reused, and adjusted over time, AI use starts to look less like casual prompting and more like an actual business process.[13]
Video Conference Summary: A Case Study in a Human-in-the-Loop (HITL) Workflow
One example in my own work is a weekly meeting-summary workflow. The meeting takes place via video conference with leadership members spread throughout Southwestern Virginia. It is held in Microsoft Teams, and Teams provides the AI-generated transcript.[5] We do not rely on the Teams recap, because we cannot fully control and audit the transcript's accuracy before the recap is generated, which limits its value. The transcript therefore becomes the raw source material, not the finished product.
A multi-speaker business meeting is not a clean input environment. A 30–45 minute call with six to ten speakers may include presentation segments, questions and answers, interruptions, short back-and-forth exchanges, overlapping speech, and industry-specific terminology. Research on meeting transcription shows that multi-speaker settings remain difficult because of speaker overlap, speaker variability, noise, and other real-world conditions.[8] Research on transcript-based summarization shows that ASR errors degrade the resulting summaries, reducing coherence, fluency, and factual consistency with the source.[9] More recent work on meeting summarization with LLMs still describes the task as error-prone, with risks of hallucinations, omissions, and irrelevancies.[18]
When I use AI to generate meeting summaries, I run a structured, multi-step, human-in-the-loop workflow in ChatGPT Agent Mode after the transcript exists. This workflow is designed to reduce ambiguity and control how the system interprets the raw material before accepting any summary as useful.
In practical terms, the workflow looks like this:
Step 1: Define the lexicon and transcript-cleaning rules.
Step 2: Clean the transcript into a more usable record.
Step 3: Manually verify what the system cannot reliably resolve.
Step 4: Apply a formalized summarization and report structure.
Step 5: Reevaluate the output against the source material.
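The five steps above can be sketched as a pipeline. Each function here is a stand-in for a step that, in practice, mixes model calls with human review; the helpers, lexicon entries, and names are illustrative, not my actual instruction set.

```python
def clean(transcript, lexicon):
    # Steps 1-2: apply the pre-defined lexicon to the raw transcript.
    for wrong, right in lexicon.items():
        transcript = transcript.replace(wrong, right)
    return transcript

def verify(transcript, human_corrections):
    # Step 3: a meeting participant resolves what the system cannot.
    for wrong, right in human_corrections.items():
        transcript = transcript.replace(wrong, right)
    return transcript

def summarize(transcript, template):
    # Step 4: a formalized report structure shapes the output.
    return template.format(body=transcript)

def reevaluate(summary, transcript):
    # Step 5: flag summary words with no support in the source record.
    body = summary.removeprefix("Summary: ")
    return [word for word in body.split() if word not in transcript.split()]

raw = "Kill-dried inventory discussed by Jon Smith."
checked = verify(clean(raw, {"Kill-dried": "Kiln-dried"}), {"Jon": "John"})
report = summarize(checked, "Summary: {body}")
print(report, reevaluate(report, checked))
```

The point of the sketch is the ordering: cleanup and human verification happen before summarization, and a support check happens after it, so errors are caught where they enter rather than after they compound.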
Step 1: Define the Lexicon Before Summarization Begins
The first step in my process is to define the lexicon and establish the cleanup rules before I ask for a serious summary. That means names, company terms, recurring vocabulary, and domain-specific language are identified in advance. The purpose is not to revise what people said. It is to reduce ambiguity around what the transcript is trying to represent.
Generalized systems do not naturally know the language of every industry, every organization, or every conversation. Microsoft’s speech documentation explicitly notes that phrase lists can improve recognition for names, homonyms, acronyms, and words unique to an industry or organization.[6] Its custom speech documentation makes the same broader point: domain-specific vocabulary and conditions often require adaptation beyond the generic base model.[7] In plain terms, if “kiln-dried” is misheard, or a company name is rendered incorrectly, that uncertainty should be addressed before it spreads into everything downstream.
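A toy sketch of what a phrase list buys you: given competing recognition hypotheses, prefer the one containing known domain phrases. This is not how Azure's phrase lists are implemented internally; the entries and scoring are invented purely to illustrate the biasing idea.

```python
# Illustrative domain lexicon, defined before any summarization begins.
DOMAIN_PHRASES = ["kiln-dried", "EUDR", "board feet"]

def score(hypothesis):
    """Count how many known domain phrases a hypothesis contains."""
    return sum(1 for phrase in DOMAIN_PHRASES if phrase.lower() in hypothesis.lower())

# Two plausible transcriptions of the same audio; bias toward the known term.
hypotheses = ["the kill dried stock is low", "the kiln-dried stock is low"]
best = max(hypotheses, key=score)
print(best)
```

The same principle applies after the fact: a lexicon agreed in advance tells the cleanup step which spelling is correct, instead of leaving that judgment to a generic model.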
Step 2: Clean the Transcript into a More Usable Record
The second step is transcript cleaning. This is where I use ChatGPT Agent Mode and a formalized instruction set to standardize names, terms, and speaker references so the transcript is clearer and more internally consistent. The goal is not stylistic polish. It is factual legibility. I want a cleaner representation of what was actually said, by whom, and in what operational context.
In a business meeting, speaker attribution is not cosmetic for us. It helps establish who owns a responsibility, who introduced a concern, and where a given point sits inside the organization. If the raw transcript muddles the terms, the names, or the speaker attribution, the summary built on top of it becomes less reliable before summarization has even begun.[8][9]
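A minimal sketch of this cleaning step, assuming a "Speaker: text" line format (transcript formats vary by platform, and the names and fixes here are invented). Term substitutions use word boundaries so that fixing one name does not silently rewrite a longer one:

```python
import re

# Illustrative correction tables built from the Step 1 lexicon.
TERM_FIXES = {r"\bkill[- ]dried\b": "kiln-dried"}
SPEAKER_FIXES = {"Jon S": "John Smith"}

def clean_line(line):
    """Normalize the speaker label, then fix domain terms in the spoken text."""
    speaker, _, text = line.partition(": ")
    speaker = SPEAKER_FIXES.get(speaker, speaker)
    for pattern, replacement in TERM_FIXES.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return f"{speaker}: {text}"

print(clean_line("Jon S: the kill dried units shipped Friday"))
```

Keeping speaker normalization separate from term correction mirrors the distinction in the text: one protects attribution, the other protects vocabulary.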
Step 3: Verify What the System Cannot Reliably Resolve
The third step is manual verification. This is one of the most important parts of the workflow. There are always cases where only a meeting participant can resolve what the transcript should say. A muffled word, a rushed name, a half-heard operational term, or an internal reference may not be recoverable from the transcript alone. The AI does not possess the situational and operational context that the participants do.
That is not a flaw in the workflow. It is the reason the workflow needs a human in it. NIST’s Generative AI Profile recommends comparing generated content against known ground truth and using human oversight as part of evaluation and risk management.[16] In my own work, that means I do not let the system have the last word on what the transcript means when I was the one in the meeting and the system was not.
Step 4: Use a Formalized Instruction Set for the Report
The fourth step is where the process becomes more than ad hoc prompting. I use a separate, formalized instruction set for summarization and report structure. That instruction set is saved and reused, not improvised from scratch every time. It explains what kind of report is needed, how the meeting should be analyzed, how the output should be structured, what should be emphasized, and what is not acceptable.
A generic summary is not a universally useful object. A format that works for one company, one team, or one stage of business development may be poorly suited to another. And as a business evolves, the reporting structure may need to evolve with it. That is why I think saved instruction systems matter so much. The goal is not to find one magic prompt. The goal is to build a repeatable workflow that produces more consistent and more useful results over time.[13][17]
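A saved instruction set can be as plain as structured data rendered into text. This is a sketch under my own field names, not the format of any platform feature; in practice something like it lives in a versioned file or a reusable Skill and evolves with the business.

```python
# Illustrative report specification: what the summary must contain and obey.
REPORT_SPEC = {
    "version": "2025-03",
    "audience": "leadership team",
    "sections": ["Decisions", "Action Items", "Risks", "Open Questions"],
    "rules": [
        "Attribute each action item to a named owner.",
        "Quote the transcript when a point is contested.",
        "Flag any statement the transcript does not support.",
    ],
}

def render_instructions(spec):
    """Turn the saved specification into instructions for the model."""
    parts = [f"Report for: {spec['audience']} (spec {spec['version']})",
             "Sections: " + ", ".join(spec["sections"])]
    parts += [f"- {rule}" for rule in spec["rules"]]
    return "\n".join(parts)

print(render_instructions(REPORT_SPEC))
```

Because the specification is data, changing the reporting structure as the business evolves is an edit to one file, not a rewrite of every future prompt.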
Step 5: Reevaluate the Output Against the Source Material
The fifth step is user-assisted reevaluation. After the summary is generated, I do not assume the first pass is final, even though the instruction set already includes verification and revision steps. I manually direct the system to go back to the source material and compare the summary against the transcript again. This is a second analytical pass meant to test whether the report is accurately supported by the meeting record.
Recent meeting-summarization research still describes LLM-generated summaries as error-prone, with risks of hallucinations, omissions, and irrelevancies.[18] More broadly, it reflects a larger principle: coherence is not proof. A summary can read well and still misrepresent the source. That is why I keep the user in the position of final reviewer and final arbiter of relevance and meaning.[16][18]
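The "second pass" idea can be sketched crudely as a lexical support check: flag summary sentences with little word overlap against the transcript. Real reevaluation is a model-assisted, human-reviewed comparison, not a keyword count; the threshold and example text here are invented for illustration.

```python
def unsupported_sentences(summary, transcript, threshold=0.5):
    """Flag summary sentences sharing too few words with the source transcript."""
    source_words = set(transcript.lower().split())
    flagged = []
    for sentence in filter(None, (s.strip() for s in summary.split("."))):
        words = sentence.lower().split()
        overlap = sum(word in source_words for word in words) / len(words)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged

transcript = "we agreed to delay the kiln upgrade until the third quarter"
summary = "The kiln upgrade is delayed until the third quarter. Budget was approved."
print(unsupported_sentences(summary, transcript))
```

A check this naive would never be the arbiter; its only job is to surface candidates for the human reviewer, which is exactly where this workflow keeps the final judgment.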
What This Process Actually Accomplishes
What I am doing in that workflow can be stated very simply: I am not asking AI to replace judgment. I am asking it to help me process complexity in a controlled way.
The process accomplishes a few specific things:
- It turns a messy raw transcript into a more legible working record.
- It reduces the impact of domain-specific transcription errors.
- It preserves speaker context that matters operationally.
- It produces summaries in a format that is actually useful to the business.
- It creates a repeatable process that can be refined over time.
- It keeps the human user in the position of final reviewer and decision-maker.
The result is not “AI wrote my report for me.” The result is that AI helps me move from messy raw material to a more coherent and usable document without pretending that the machine now owns the facts.[3][16]
The Same Logic Applies Beyond Meetings
The broader value of this approach is that it extends well beyond meetings. The same principles apply whenever I am dealing with large amounts of text, mixed-quality records, or information that is too dense to process quickly in one pass. That can include industry reports, regulatory materials, market developments, economic changes, production logs, machine timing logs, and other recurring bodies of information.
The exact material changes, but the logic stays the same. The user defines the source boundaries, the timeframe, the task, the output structure, and the verification standard. The model helps summarize, compare, reorganize, classify, and surface patterns across that material. The final judgment still belongs to the human user.[11][14][16]
This is also where the distinction between my use of Gemini and my use of ChatGPT becomes especially important. When I am exploring current public information, Gemini Deep Research is often where that work begins. I research, discuss, refine, and research again inside Gemini. But when I want to turn that research into a structured report shaped by my own context and understanding, I bring that material into ChatGPT. That is where I formalize the output. OpenAI’s deep research and data-analysis tools are explicitly designed to work across uploaded files and structured user-provided materials.[11][14] Google’s Deep Research is explicitly designed for in-depth and real-time research using Google Search by default.[10] Those are not the same job, and I do not use them as if they were.
AI as a Tool, Not a Substitute for Judgment
That, to me, is the practical difference between using AI as a business tool and treating it like a business mind. A tool can help a company process more information, more consistently, and often more quickly. A tool can help reveal what is relevant, compress what is unwieldy, and make patterns easier to see. But a tool does not own context, accountability, or truth. Those still belong to the people using it. When AI is treated as a structured collaborator inside a formalized workflow, it becomes genuinely useful. When it is treated as a substitute for disciplined human judgment, it becomes much easier to trust what should have been verified.[4][13][16]
Next Week: Part III — AI as a Tool for Human Problem Solving and Creativity
References
[1] Vaswani, A., et al. Attention Is All You Need. NeurIPS, 2017. URL: https://arxiv.org/abs/1706.03762
[2] Brown, T. B., et al. Language Models are Few-Shot Learners. NeurIPS, 2020. URL: https://arxiv.org/abs/2005.14165
[3] Ouyang, L., et al. Training language models to follow instructions with human feedback. 2022. URL: https://arxiv.org/abs/2203.02155
[4] Lin, S., Hilton, J., & Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL, 2022. URL: https://aclanthology.org/2022.acl-long.229/
[5] Microsoft Support. View live transcription in Microsoft Teams meetings. URL: https://support.microsoft.com/en-us/office/view-live-transcription-in-microsoft-teams-meetings-dc1a8f23-2e20-4684-885e-2152e06a4a8b
[6] Microsoft Learn. Improve recognition accuracy with phrase list. URL: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/improve-accuracy-phrase-list
[7] Microsoft Learn. Custom speech overview. URL: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-speech-overview
[8] Lyu, X., et al. PP-MeT: A Real-World Personalized Prompt Based Meeting Transcription System. 2023. URL: https://arxiv.org/abs/2309.16247
[9] Sharma, R., et al. Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization? ACL, 2024. URL: https://aclanthology.org/2024.acl-long.790/
[10] Google Gemini Help. Use Deep Research in Gemini Apps. URL: https://support.google.com/gemini/answer/15719111?hl=en
[11] OpenAI Help Center. Deep research in ChatGPT. URL: https://help.openai.com/en/articles/10500283-deep-research-in-chatgpt
[12] OpenAI Help Center. ChatGPT agent. URL: https://help.openai.com/en/articles/11752874-chatgpt-agent
[13] OpenAI Help Center. Skills in ChatGPT. URL: https://help.openai.com/en/articles/20001066-skills-in-chatgpt
[14] OpenAI Help Center. Data analysis with ChatGPT. URL: https://help.openai.com/en/articles/8437071-data-analysis-with-chatgpt
[15] Google Gemini Help. Use Gemini Agent for multi-step tasks in Gemini Apps. URL: https://support.google.com/gemini/answer/16596215?co=GENIE.Platform%3DAndroid&hl=en
[16] NIST. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (AI 600-1). 2024. URL: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
[17] Zhuo, J., et al. Assessing and Understanding the Prompt Sensitivity of LLMs. Findings of EMNLP, 2024. URL: https://aclanthology.org/2024.findings-emnlp.108/
[18] Kirstein, F., et al. Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions. Findings of EMNLP, 2025. URL: https://aclanthology.org/2025.findings-emnlp.1094/