The First Spin: Evaluating OpenAI’s Deep Research Tool
A chemical engineer (PhD) and a physician (MD) take OpenAI's Deep Research for a spin; here are our first impressions of the tool.
TL;DR:
Deep Research is an incredible tool and will accelerate research analysis. We feel it saved us days, and even weeks, of analysis time.
Deep Research excels at case-study-style analysis but struggles with research such as meta-analyses and with cases where nuanced technical insight is needed.
It often relies on blogs instead of primary peer-reviewed sources; because blogs can be non-peer-reviewed personal opinions, this risks incorporating bias into scientific analysis.
The tool sometimes fails to engage in multi-turn refinement, making it less effective for strategic research.
Outputs sometimes lack completeness (e.g., for the meta-analysis it did not provide explicitly requested artifacts such as PRISMA checklists and inclusion criteria).
Potential future product functionality could help alleviate the pain points we saw, e.g.:
“Research Depth Mode” → Ensures only peer-reviewed sources are used.
“Multi-Turn Questioning Mode” → Engages in dialogue before finalizing the research question and results.
“Reference Verification System” → Prevents citing blogs over original research.
Summary: OpenAI’s Deep Research tool promises to revolutionize technical research by rapidly synthesizing information, generating structured reports, and offering analytical insights. In this article, we critically assess its effectiveness across various research domains—ranging from meta-analyses and case study evaluations to factual analysis and implications for policy. While the tool excels in accelerating preliminary research, we identify key areas where it falls short, including its reliance on non-authoritative sources, difficulty in handling nuanced interpretations, and limited iterative clarification.
How We Evaluated OpenAI’s Deep Research Tool
To assess the Deep Research tool, we tested it on topics within our domains of expertise. I (Shamel) have a decade of experience in autoignition chemistry, while Farah is a fellowship-trained physician specializing in obesity and lifestyle medicine. Rather than designing adversarial prompts to exploit known LLM weaknesses, we approached this evaluation as we would in real-world research workflows.
We structured our evaluation around:
Case Study Analysis (e.g., evaluating healthcare policies and their impact on public programs)
Research Analysis (e.g., meta-analyses and systematic reviews of peer-reviewed literature)
Factual Analysis of Recent Trends (e.g., implications of changes in obesity diagnosis criteria)
For each category, we conducted tests using multiple iterations of prompts:
Running the prompt as initially conceived.
Refining the prompt using OpenAI’s “o3-mini-high” or “4o” model to improve clarity.
Iteratively adjusting our approach based on observed gaps.
Findings and Insights
1. Case Study Analysis: Strengths and Weaknesses
Example Prompt: AI-Powered Nutrition Apps Analysis (ChatGPT link)
Prompt Summary: We asked the Deep Research tool to analyze the rapid rise of AI-powered nutrition tracking apps, evaluating their pros and cons, accuracy, and alignment with modern nutrition guidelines.
Key Takeaways:
The tool provided a well-structured discussion of the benefits and drawbacks, significantly accelerating analysis that would typically take an expert 1–2 days.
However, it frequently referenced blog-style opinion pieces instead of original peer-reviewed sources, leading to potential biases in the output.
This raises concerns about how the tool navigates AI-generated content ecosystems, as many blogs increasingly summarize research rather than providing original insights.
Example where a blog was used instead of the article:
Some top apps demonstrate remarkably high image-recognition accuracy, correctly identifying foods in pictures over 90% of the time in tests. This referenced sydney.edu.au but could have used the article in the Nutrients journal directly instead. The model crashed when the issue was pointed out and a rerun was requested.
The final downloadable Word doc did not actually contain the report.
2. Meta-Analysis and Deep Technical Research
Example Prompt: Meta-Analysis on CGM and Weight Loss (ChatGPT link)
Prompt Summary: We tasked the tool with generating a meta-analysis of continuous glucose monitoring (CGM) for weight loss in non-diabetic adults, specifying statistical outputs, inclusion/exclusion criteria, and a PRISMA checklist.
Key Takeaways:
The tool successfully identified relevant RCTs and generated a structured report within 9 minutes, an enormous time-saving advantage.
Its reasoning for excluding a specific RCT was flawed, highlighting a limitation in critical evaluation rather than just data aggregation (e.g., it confused the “risk” of a medical condition with a medical “diagnosis” of cancer).
However, it failed to include explicitly requested artifacts such as inclusion/exclusion criteria and PRISMA flowcharts. Moreover, it neither mentioned nor clarified MeSH terms. (A sketch of the kind of pooled statistical output we expected is shown after the strategies below.)
Potential strategies:
When using Deep Research, ask it to pose clarifying questions, e.g. about the in-context definition of how to include and exclude articles.
Enable multi-turn interactive questioning before finalizing the research output, e.g. asking the researcher further clarifying questions or rechecking with them before actually running the analysis.
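To make the missing statistical output concrete, here is a minimal sketch of the kind of pooled-effect computation a meta-analysis report would normally include: a DerSimonian-Laird random-effects pool with a 95% confidence interval. The effect sizes and standard errors below are made-up placeholders for illustration, not values from the CGM literature or from the tool's output.

import math

# Hypothetical per-study mean weight-loss differences (kg) and standard errors.
effects = [-1.8, -0.9, -2.4]
ses = [0.6, 0.5, 0.9]

# Fixed-effect (inverse-variance) weights and pooled estimate.
w = [1 / se**2 for se in ses]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# Cochran's Q and the DerSimonian-Laird between-study variance tau^2.
q = sum(wi * (yi - fixed)**2 for wi, yi in zip(w, effects))
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(effects) - 1)) / c)

# Random-effects weights, pooled effect, and 95% confidence interval.
w_re = [1 / (se**2 + tau2) for se in ses]
pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se_pooled = math.sqrt(1 / sum(w_re))
low, high = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
print(f"Pooled mean difference: {pooled:.2f} kg (95% CI {low:.2f} to {high:.2f})")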
Example Prompt: Low-Temperature Autoignition Factors (ChatGPT link)
Prompt Summary: We asked the tool to analyze hydrocarbon autoignition chemistry at low temperatures and generalize the controlling reaction mechanisms. This prompt is useful for PhD research on fuel reaction mechanisms.
Key Takeaways:
The output captured fundamental chemical kinetics (e.g. it picked up that ROO-to-QOOH isomerization and ROOH decomposition are rate controlling) but lacked depth in the nuanced insights critical to fuel design (the controlling sequence is rather ROO isomerization to QOOH, followed by decomposition of the ketohydroperoxide OQ’OOH to OQ’O + OH, with ROO dissociation to alkene + HO2 dominating at higher temperatures; the tool picked up a similar but not exactly correct reaction sequence). A sketch of this sequence appears after this list.
It heavily relied on review articles rather than synthesizing multiple primary recent studies to extract richer insights.
The tool omitted (or did not surface in its final output) key open-access literature that should have informed a more comprehensive answer (e.g. a fuel oxidation paper).
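For readers less familiar with the chemistry, here is a minimal sketch of the low-temperature chain-branching sequence described above, in standard R/ROO/QOOH notation. The second O2 addition step is not named explicitly in our notes but is the standard link between the steps listed.

\begin{align*}
\mathrm{R} + \mathrm{O_2} &\rightleftharpoons \mathrm{ROO} \\
\mathrm{ROO} &\rightleftharpoons \mathrm{QOOH} && \text{(internal H-atom isomerization)} \\
\mathrm{QOOH} + \mathrm{O_2} &\rightleftharpoons \mathrm{OOQOOH} && \text{(second O$_2$ addition)} \\
\mathrm{OOQOOH} &\rightarrow \mathrm{OQ'OOH} + \mathrm{OH} && \text{(ketohydroperoxide formation)} \\
\mathrm{OQ'OOH} &\rightarrow \mathrm{OQ'O} + \mathrm{OH} && \text{(chain branching)} \\
\mathrm{ROO} &\rightarrow \text{alkene} + \mathrm{HO_2} && \text{(competing path, dominant at higher temperatures)}
\end{align*}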
3. Factual Analysis and Strategic Implications
Example Prompt: Redefining Obesity Diagnosis (ChatGPT link)
Prompt Summary: We asked the model to analyze the definition of obesity. Given the drawbacks of the traditionally used BMI criteria, we asked it to redefine obesity more accurately, taking into account anthropometric measurements and staging based on obesity severity.
Key Takeaways:
It did an excellent job defining every existing factor involved in the assessment and definition of obesity.
It did a comparative analysis of BMI against other measures of obesity.
It was also able to find a recent article discussing the current Lancet consensus on the definition of obesity, which has caused a lot of uproar (and is the reason we gave it this prompt). However, it did not directly reference and analyze this important information, which will play a critical role in actually redefining obesity.
Our review of the Deep Research output was that it heavily picked up the viewpoint of a single authoritative body (the OMA) over that of the Lancet Commission, likely because Deep Research did not actually review the article itself but rather the opinion pieces on it. (For example, the OMA disagrees with the new “preclinical” and “clinical” obesity definitions and has strong opinions about how health coverage will be affected, whereas the Lancet Commission and its experts hold a differing viewpoint: moving from a BMI-like metric to a more robust definition of obesity to help direct resources to the correct patient population. As a physician, it is important to consider all informed views in order to guide the discussion with patients and policymakers.)
Example Prompt: Fuel Blending Optimization Strategy (ChatGPT link)
Prompt Summary: We asked the tool to explore how hydrocarbon fuels can be blended to optimize Formula 1 engine performance while remaining within regulatory constraints.
Key Takeaways:
The tool identified general blending strategies (e.g. blend in more branched compounds and fewer straight-chain hydrocarbons), which would take a PhD-level researcher months to compile manually. (A toy illustration of this strategy follows this list.)
However, it failed to uncover deeper strategic insights, such as those derived from patents and proprietary fuel research by oil and gas companies that partner with Formula 1 teams to design the fuel. We will not provide the right answer here, but our analysis found that the tool missed a key strategic insight, available in open-source literature, that could enable a breakthrough fuel.
Instead of synthesizing highly technical primary sources, it relied on blog-style articles (e.g. Ferrari chat blog), leading to incomplete conclusions.
The tool’s inability to prioritize high-value sources over surface-level summaries remains a core limitation.
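As a toy illustration of the general strategy the tool did surface (favoring branched over straight-chain components), here is a naive blend-octane estimate. The linear-by-volume mixing rule and the 85/15 split are simplifying assumptions for illustration only, not the strategic insight discussed above; real blending behavior is non-linear.

# Naive linear-by-volume estimate of a blend's research octane number (RON).
# iso-octane defines RON = 100 and n-heptane defines RON = 0 on the octane scale.
components = {
    "iso-octane (branched)": {"ron": 100.0, "vol_frac": 0.85},
    "n-heptane (straight-chain)": {"ron": 0.0, "vol_frac": 0.15},
}

blend_ron = sum(c["ron"] * c["vol_frac"] for c in components.values())
print(f"Estimated blend RON: {blend_ron:.1f}")  # 85.0 for this split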
Product Observations and Suggested Improvements
Processing Time: While official documentation states analysis may take up to 30 minutes, we found most queries completed within 5–10 minutes.
Clarifications & Follow-Ups:
The tool asks useful clarifying questions but typically limits this to a single interaction. A multi-turn iterative refinement approach (akin to an advisor-researcher dialogue) could be far more effective.
When prompted to rerun an analysis, it sometimes becomes unresponsive.
Source Selection Bias:
The tool over-relies on recent blog articles rather than authoritative sources.
It doesn’t always follow citation trails to original studies, leading to potential bias towards surface-level interpretations.
Encouraging deeper retrieval of foundational and industry-specific literature (e.g., patents and technical white papers) could significantly enhance reliability.
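To illustrate the kind of source prioritization we have in mind, here is a hypothetical sketch of a ranking heuristic that prefers primary-literature domains over blog-style ones. The domain lists, scores, and example URLs are our own illustrative assumptions, not anything from OpenAI's implementation.

from urllib.parse import urlparse

# Illustrative domain lists; a real system would use a much richer signal set.
PRIMARY_DOMAINS = ("doi.org", "pubmed.ncbi.nlm.nih.gov", "sciencedirect.com",
                   "nature.com", "mdpi.com")
BLOG_HINTS = ("blog", "medium.com", "substack.com", "forum")

def source_score(url: str) -> int:
    """Crude priority score: higher means closer to a primary source."""
    host = urlparse(url).netloc.lower()
    if any(host.endswith(d) for d in PRIMARY_DOMAINS):
        return 2
    if any(hint in host for hint in BLOG_HINTS):
        return 0
    return 1  # unknown: rank between blogs and primary literature

sources = ["https://someblog.medium.com/summary", "https://www.mdpi.com/journal/nutrients"]
sources.sort(key=source_score, reverse=True)  # primary literature first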
Other UX improvements:
It is unclear whether the “Deep Research” button needs to be activated in a follow-up query once the Deep Research agent has finished its first round of output generation.
Deep Research would say it would update the output when a follow-up query was provided, but it would then just hang or make no changes.
The final artifacts generated would not always conform to the request made in the query (e.g., if a Word file was requested for the meta-analysis paper it would contain only an outline; if a table with specific criteria was requested, the request would be ignored, etc.).
Overall First Impressions
Deep Research is one of the most significant AI agentic systems currently available. It excels at synthesizing information for case-study-style evaluations, allowing for rapid knowledge aggregation. For any type of research work, we have no doubt that the tool will accelerate the path to value creation; in many of our use cases, we think it saved us days, if not weeks, of work we would otherwise have spent on these topics. However, in fields and topics requiring extraction of critical insights from extensive factual databases, expert oversight remains crucial. While the tool is often described as capable of PhD-level research, I would argue that it serves as a very good first-draft generator, akin to the opening chapter of a dissertation, rather than an exhaustive literature synthesis. To transform its outputs into truly strategic, high-value insights, expert intervention remains essential in scientific areas.
Recommendations that could help improve the tool
Enhance Multi-Turn Questioning: A more iterative, interactive research workflow would significantly improve query refinement.
Prioritize Authoritative Sources: Implement retrieval mechanisms that emphasize peer-reviewed papers and foundational research over blogs.
Improve Completeness in Meta-Analysis Outputs: Ensure inclusion/exclusion criteria and methodology artifacts are comprehensively presented.
Enhance Handling of Nuanced Technical Insights: Instead of favoring broad summaries, the tool should strive for depth, synthesizing multiple sources into a richer, more insightful whole.



