The analyst’s day is full of research. Now, this is the age of AI, and AI is here to help, isn’t it? As everyone is talking about copilots and AI agents, why not use the tools at hand to do a little research on research?
NB: no one really has a good definition of an AI agent, so this might become an additional topic for research.
But I digress.
Imagine the following project, which is interesting not only for analysts, btw, but also for a variety of roles in the corporate world. Let’s call it vendor (competitor) monitoring. The job is the following:
- Research reputable sites for news about a number of vendors, relating to a set of keywords. Reputable sites are high-quality news sites, high-quality tech publications, high-quality analyst sites and, of course, the news pages of the vendors in question.
- Limit the time frame of the search to match the cadence of my information requirement, e.g., “yesterday” for a daily update or “last week” for a weekly update.
- Provide a summary of the news
- Give an assessment of how the news affects the positions of the vendors in the marketplace re the keywords in question
- Provide the news items with their assessments as a prioritized list, sorted from high impact to low impact
- Add an executive summary as a preface
- Send it to me as an email
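The steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the news items have already been researched, summarized and assessed (that part is the LLM's job and is not shown). All names here (`NewsItem`, `build_report`, `to_email`), the 1 to 5 impact scale and the "4 or higher means high impact" threshold are hypothetical choices for illustration. The email is only constructed, not sent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from email.message import EmailMessage


@dataclass
class NewsItem:
    vendor: str
    headline: str
    published: datetime
    summary: str      # short summary of the news
    assessment: str   # how the news affects the vendor's market position
    impact: int       # hypothetical score: 1 (low) to 5 (high)


def build_report(items, now, window=timedelta(days=1)):
    """Keep items inside the time window and list them from high to low impact."""
    recent = [i for i in items if now - window <= i.published <= now]
    ranked = sorted(recent, key=lambda i: i.impact, reverse=True)
    high = sum(i.impact >= 4 for i in ranked)  # assumed threshold for "high impact"
    lines = [f"Executive summary: {len(ranked)} item(s), {high} high impact."]
    for i in ranked:
        lines.append(f"[{i.impact}] {i.vendor}: {i.headline} | {i.summary} | {i.assessment}")
    return "\n".join(lines)


def to_email(report, recipient):
    """Wrap the report in an email message; sending is deliberately left out."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = "Vendor monitoring update"
    msg.set_content(report)
    return msg
```

Passing `window=timedelta(days=7)` would turn the daily update into a weekly one.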
So far, so simple. After all, a lot of folks, yours truly included, do this every day. And it takes quite some time. So, this job is a perfect one for an automated update beyond an RSS feed. And it seems like a perfect job for an LLM turned agent – or is it a copilot?
Now, the basic question is: Which one to use? After all, there are plenty, from free to not so free ones. Answering this question turns into yet another interesting experiment: Why not ask some LLMs for their evaluation of suitability? Kinda meta, but an interesting one.
So, I did just that: I asked ChatGPT 4.5, Grok 3, and four versions of Gemini (2.0 Flash Thinking Experimental, 2.0 Flash, 1.5 Pro with Deep Research, and 2.0 Pro Experimental) for their analysis of which of them is best suited for the research task at hand.
For this, I used the following simple prompt:
describe the different capabilities and limitations of Gemini 2.0 Flash Thinnking Experimental, 2.0 Flash, 1.5 pro with deep research, 2.0 pro experimental, Grok 3 and chatGPT 4.5, both with and without deep reasoning. Which model is best to support the following use case:
research the web for news on a given set of companies and a given set of topics. The news shall cover the past 2 days only
assess the news regarding their impact on the companies' market positions re the given set of topics
create this in the form of a report
do this as a daily scheduled task
accuracy, reasoning and reliability are of high importance. Speed is of lower importance.
generate a comparison table, give a recommendation and justify the recommendation.
The results are quite interesting.
- Gemini 1.5 Pro with Deep Research settles on Gemini 2.0 Flash because it offers “a balance of reasoning, accuracy, reliability, and tool use necessary for fulfilling the requirements of the specified use case. Its production-ready status, combined with its ability to handle complex analysis and generate comprehensive reports, makes it the ideal LLM for this task.” It sees Grok as the runner-up.
- Gemini 2.0 Pro Experimental recommends Gemini 1.5 Pro with Deep Research as it “offers the best balance of accuracy, reasoning, and reliability for your demanding research and reporting task, leveraging Google's strengths in both LLMs and web search.” This is closely followed by Grok 3.
- Gemini 2.0 Flash Thinking Experimental suggests doing a competitive evaluation of ChatGPT 4.5 with Deep Reasoning and Gemini 2.0 Pro Experimental, as it rates them as tied: “Which is truly "better" in practice will depend on your specific prompts, data, and tolerance for latency and cost.”
- ChatGPT 4.5 (without Deep Research) suggests going for ChatGPT 4.5 with Deep Research as the model “offers enhanced reasoning capabilities, reduced hallucination rates, and a broader knowledge base, aligning well with the requirements for accuracy, reasoning, and reliability in daily scheduled tasks. While models like Gemini 2.0 Flash Thinking Experimental and Grok 3 also provide advanced features, ChatGPT 4.5's maturity and proven track record make it a suitable choice for generating comprehensive and reliable reports.”
- Grok 3 in Deep Research mode suggests using Gemini 2.0 Pro Experimental as its “advanced reasoning capabilities, as evidenced by its performance in complex tasks and 2 million token context window, make it ideal for researching news, assessing market impact, and generating daily reports (Gemini 2.0 Pro). The integration with Google Search ensures access to recent news, and as a Google product, it likely offers high reliability for scheduled tasks, aligning with the emphasis on accuracy and reasoning over speed. While Gemini 1.5 Pro with Deep Research is tailored for research, Gemini 2.0 Pro Experimental, being a newer model, likely offers superior capabilities”. Grok sees itself as the runner-up, as it offers advanced reasoning and Deep Search “but potential biases from X data integration”.
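To see how the votes fall, the recommendations listed above can be tallied with a few lines of Python. This is just a quick sketch with some simplifying assumptions of mine: the tie reported by Gemini 2.0 Flash Thinking Experimental counts as one vote for each of the two models, ChatGPT 4.5 with Deep Research (or Deep Reasoning) is counted simply as "ChatGPT 4.5", and runner-ups are ignored.

```python
from collections import Counter

# Evaluator -> model(s) it recommended, taken from the list above.
recommendations = {
    "Gemini 1.5 Pro with Deep Research": ["Gemini 2.0 Flash"],
    "Gemini 2.0 Pro Experimental": ["Gemini 1.5 Pro with Deep Research"],
    # Reported a tie, so both models get a vote (my assumption).
    "Gemini 2.0 Flash Thinking Experimental": ["ChatGPT 4.5", "Gemini 2.0 Pro Experimental"],
    "ChatGPT 4.5": ["ChatGPT 4.5"],
    "Grok 3": ["Gemini 2.0 Pro Experimental"],
}

# Count how often each model was recommended.
tally = Counter(model for picks in recommendations.values() for model in picks)

for model, votes in tally.most_common():
    print(f"{votes}  {model}")
```

Under these assumptions, Gemini 2.0 Pro Experimental and ChatGPT 4.5 each collect two votes, while Gemini 2.0 Flash and Gemini 1.5 Pro with Deep Research get one each.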
So, what does this tell me?
There is probably a bit of self-serving bias involved in the LLMs’ assessments and suggestions. At least the Google models consistently suggest a Google model, and ChatGPT suggests itself. What is a bit confusing is that
An interesting side remark is that only Gemini 1.5 Pro with Deep Research, ChatGPT 4.5 and Grok 3 provide the sources used for the research. Perplexity does this, too. Providing references is important for validating the results.
The results delivered by these LLMs seem to favor Gemini 2.0 Pro Experimental and ChatGPT 4.5, although I am impressed by Grok 3. On the other hand, one needs to know that “experimental” means exactly that: the models are not yet fully stable.
Having said this, if one needs to perform research tasks, as many of us do, environment matters. Smaller businesses in particular often run Google Workspace. If they subscribe to the Business Standard edition (as I do), Gemini is readily available, and there is probably no immediate need to purchase an additional ChatGPT license (I have a pro subscription) or a Grok or Perplexity subscription. This matters all the more because most of these tools use a lot of the data that users provide to improve their services, particularly the free ones. Grok, in its privacy statement, explicitly recommends not entering any personal data, as it will be used.
In summary, if and when I need to do research, I’ll use Google Gemini 1.5 Pro with Deep Research and Gemini 2.0 Pro Experimental as my preferred options, simply because ChatGPT 4.5 with Deep Research only offers a limited number of runs per month. As it doesn’t cost much, I will additionally run the same research, potentially with a slightly changed prompt to cater for model differences, with ChatGPT and (if no sensitive data is involved) Grok 3. Worst case, this gives me additional food for thought.
What do you think?