Is Grok 4 the Most Powerful AI or the Most Biased?

INDUSTRIES

Arbisoft is your one-stop shop when it comes to your eLearning needs. Our Ed-tech services are designed to improve the learning experience and simplify educational operations.
Discover More
- "Working with Arbisoft has felt less like hiring a vendor and more like gaining a team of trusted colleagues. Their developers don’t just build what we ask, they think alongside us, offer smart suggestions, and care deeply about getting it right."
  Sarah Johnson / SVP of Product, Summit K12
Get cutting-edge travel tech solutions that cater to your users’ every need. We have been employing the latest technology to build custom travel solutions for our clients since 2007.
Discover More
- “Arbisoft has been my most trusted technology partner for now over 15 years. Arbisoft has very unique methods of recruiting and training, and the results demonstrate that. They have great teams, great positive attitudes and great communication.”
  Paul English / Co-Founder, KAYAK
As a long-time contributor to the healthcare industry, we have been at the forefront of developing custom healthcare technology solutions that have benefitted millions.
Discover More
- "I wanted to tell you how much I appreciate the work you and your team have been doing of all the overseas teams I've worked with, yours is the most communicative, most responsive and most talented."
  Matt Hasel / Program Manager, eHuman
We take pride in meeting the most complex needs of our clients and developing stellar fintech solutions that deliver the greatest value in every aspect.
Discover More
- “Arbisoft is an integral part of our team and we probably wouldn't be here today without them. Some of their team has worked with us for 5-8 years and we've built a trusted business relationship. We share successes together.”
  Jake Peters / CEO & Co-Founder, PayPerks
Unlock innovative solutions for your e-commerce business with Arbisoft’s seasoned workforce. Reach out to us with your needs and let’s get to work!
Discover More
- "The development team at Arbisoft is very skilled and proactive. They communicate well, raise concerns when they think a development approach wont work and go out of their way to ensure client needs are met."
  Veronika Sonsev / Co-Founder
Arbisoft is a holistic technology partner, adept at tailoring solutions that cater to business needs across industries. Partner with us to go from conception to completion!
Discover More
- “The app has generated significant revenue and received industry awards, which is attributed to Arbisoft’s work. Team members are proactive, collaborative, and responsive”.
  Silvan Rath / CEO, Predict.io

TL;DR

xAI launched Grok 4 on July 9, 2025, and it's now competing head-on with GPT-4, Claude, and Gemini.

Grok 4 runs on a Mixture-of-Experts Transformer trained on a 200,000-GPU cluster called Colossus.
Context window hits 256,000 tokens, roughly five to six times what GPT-4 handles.
The Heavy version uses multiple agents in parallel and scored 50.7% on Humanity's Last Exam.
AIME25 score: 100% on Grok 4 Heavy. GPQA: 88.9%. HumanEval: 97.0%.
API pricing is $0.003 per 1K input tokens, about 10x cheaper than GPT-4 Turbo.
Live web and X search built in, plus voice and real-time camera input.
No public system card, and researchers have flagged answers that lean toward Musk's views.

Worth testing for hard reasoning and long-document tasks. Skip it for quick everyday queries.

Introduction

In July 2025, Elon Musk’s xAI introduced Grok 4, its most advanced large language model yet (according to the company). This release marked a major leap in artificial intelligence. Grok 4 is designed to solve complex problems, understand images and speech, and even search the internet in real-time. It’s not just smarter than its earlier versions; it's being described by xAI as the “most intelligent model in the world.”

But what exactly makes Grok 4 so powerful? How does it compare to other leading models like OpenAI’s GPT-4, Anthropic’s Claude, or Google’s Gemini? What are its limits? And is the hype backed by real results?

This blog takes you through everything you need to know, from Grok 4’s architecture and performance to user reactions and future plans.

What Is Grok 4 and Why Is It Important?

Grok 4 is a multimodal, tool-using, real-time AI model from xAI, Elon Musk’s artificial intelligence company. It builds on earlier versions, Grok 1, 2, and 3, which rolled out between late 2023 and early 2025. Grok 4 became publicly available on July 9, 2025, through the X app (formerly Twitter), xAI’s standalone mobile apps, and an API for developers.

Unlike simple chatbots, Grok 4 is designed to:

Answer with updated information using live web search
Run code and solve math problems
Analyze photos, documents, or even your camera view
Hold long and deep conversations with a 256,000-token memory
Use multiple agents in parallel to improve reasoning

Musk calls it “PhD-level in everything.” While that’s a bold claim, Grok 4 continues to demonstrate impressive results across real academic and logic benchmarks, proving it’s more than just hype.

What Powers Grok 4? A Look at the Technology

Grok 4 is based on a Mixture-of-Experts (MoE) Transformer architecture, like GPT-4 and Claude. The key difference with MoE is that the model does not activate all its parameters for every query. It routes each input to a specialized subset of the network, which makes it more efficient to run even at very large scales.

It trains on a wide mix of data, web content, programming code, textbooks, documents, and streaming data from X.com. What sets Grok 4 apart is how much compute and reinforcement learning it uses during training.

Key Technical Facts about Grok 4

Let’s discuss some of the technicalities of Grok 4:

Architecture: Mixture-of-Experts (MoE) Transformer, only a portion of the total parameters is active for any single query.
Supercomputer: Grok 4 trains on xAI’s “Colossus,” a 200,000-GPU cluster located in Tennessee.
Training process: xAI uses 6 times more reinforcement learning compute compared to Grok 3, with a strong focus on "chain-of-thought" reasoning.
Context length: It supports up to 256,000 tokens, about 5 to 6 times more than what GPT-4 typically handles. This allows the model to read and remember large documents or long chat sessions.
Multimodal input: Grok 4 understands text, images, voice, and real-time camera input. This makes it useful for real-world tasks like reading signs, understanding graphs, or assisting while driving.
Multi-agent mode (Grok 4 Heavy): This optional “Pro” mode uses multiple AI agents working together to improve results on harder tasks.
Live internet search: Grok 4 pulls current data from the internet and from X posts, giving it access to information that most models cannot see at inference time.
Speed: Approximately 73 tokens per second, which is slower than Gemini 2.5 Flash (374 tok/sec) but faster than DeepSeek (24 tok/sec).

No official parameter size has been published, but many experts estimate that Grok 4 may exceed 1.7 trillion parameters, placing it among the largest models ever built.

How Well Does It Perform? Test Results and Benchmarks

Grok 4 stands out mainly because of its strong results in different types of benchmark tests, including academic tests and business task simulations, showcasing how modern AI solutions can scale across multiple modalities. According to xAI, the model performs much better than older versions, mostly because it uses more computing power during training and testing, not because of big changes in its design.

Humanity’s Last Exam (HLE)

The main reason behind Grok 4’s strong performance is scaling. It uses more computing power during training and even more during testing. Simply put, the more resources it gets, the better it works. This is especially clear in its scores on the Humanity’s Last Exam (HLE) benchmark.

Based on xAI’s data, when no tools are used, Grok 4 scores around 26.9% accuracy. When tools like code execution are turned on, the score rises to 41%. In its advanced multi-agent “Heavy” version, it reaches 50.7%, which is more than twice the score of earlier models that don’t use tools.

Image Illustrating Grok 2's Humanity’s Last Exam

Academic and STEM Tests

Grok 4 also performs well on regular STEM-related tests, which are often used to compare powerful language models. Here are some top results:

Benchmark	Top Competing Models	Grok 4 (No Tools)	Grok 4 Heavy
GPQA	79.6–86.4%	87.5%	88.9%
AIME25	75.5–98.8%	91.7%	100.0%
LCB (Jan–May)	72.0–74.2%	79.0%	79.4%
HMMT25	58.3–82.5%	90.0%	96.7%
USAMO25	21.7–49.4%	37.5%	61.9%

These are impressive numbers. In most areas, Grok 4 beats Claude Opus, Gemini 2.5 Pro, and GPT-4 (o3). But some people have said that the scores might favor Grok 4 by picking comparison numbers that are easier to beat.

ARC-AGI (Advanced Reasoning Tests)

Another tough and less open benchmark is ARC-AGI. It checks how well a model can handle abstract and complex reasoning problems. On ARC-AGI version 1, Grok 4 scores 66.6%, ahead of all known models. On version 2, it scores 15.9%, while Claude 4 Opus scores 8.6%.

These test details aren’t fully available to the public, so the results should be taken with a grain of salt. Still, if they are correct, Grok 4 is doing well in tasks that involve step-by-step thinking and logic.

Vending-Bench (Business Simulation Test)

xAI also tested Grok 4 in a practical simulation called Vending-Bench. This test checks if a model can run a small business, handling inventory, changing prices, talking to suppliers, and more. It's a newer test but gives real-world insights.

Here are the average results over five tries:

Image Illustrating Grok 4's Vending Bench

Rank	Model	Net Worth	Units Sold
1	Grok 4	$4,694	4,569
2	Claude Opus 4	$2,077	1,412
3	Human baseline	$844	344
4	Gemini 2.5 Pro	$789	356
5	GPT-4 (o3)	$1,843	1,363

Grok 4 did more than twice as well as its closest rival, both in sales and profits. It also stayed consistent through 300 rounds of simulation, something most models struggle with when planning over long periods.

Grok 4 performs strongly across every benchmark xAI has tested. But scores alone don’t tell the full story. For tasks involving vision, code generation, or real-time handling of complex situations, additional testing is necessary to determine if the model is a suitable fit.

How Does Grok 4 Compare to GPT Models?

Image Illustrating A Comparison of Grok 4 to GPT Models

Benchmark / Task	Grok 4 (Heavy)	GPT-4 / GPT-4o
USAMO (Olympiad math)	61.9% (1st place)	Unreported
Humanity's Last Exam (with tools)	50.7%	No public score
AIME (math)	100%	~37%
GPQA (physics)	88.9%	53.6%
HumanEval (coding)	97.0%	Unreported
MMLU (general knowledge)	86.6%	88.7%

Pricing and Usage

Model	Prompt (per 1K tokens)	Completion (per 1K tokens)	Context Window	Source
xAI Grok 4 (API)	$0.003	$0.015	256K	xAI Docs
GPT‑4 Turbo (128K)	$0.01	$0.03	128K	OpenAI Help
GPT‑4 (8K)	$0.03	$0.06	8K	OpenAI Help
GPT‑3.5 Turbo (4K)	~$0.002 (estimated)	~$0.002 (estimated)	4K	OpenAI (approx.)

GPT‑4 API usage is approximately 10x more expensive per input token than Grok 4. While ChatGPT Plus ($20/month) provides access to GPT‑4 with usage limits, Grok’s $30 (basic) and $300 (SuperGrok) plans offer near-unlimited usage for individual users.

Where Can You Use Grok 4?

Grok 4 is accessible through multiple platforms:

Inside X (Twitter): Available to Premium+ subscribers for direct chatting.
Grok iOS and Android apps: These offer full chat, voice, file uploads, and camera input.
Web interface (chat.x.ai): Available to all paid users.
Grok 4 Heavy: Available for $300/month, ideal for research and deep analysis.
API Access: Developers can use Grok through xAI’s cloud-based API, with up to 256K context and multi-agent tools.
Microsoft Azure AI Foundry: Enterprise deployment option
Coming Soon - Tesla voice assistant: Musk has confirmed that Grok will soon be inside Tesla cars to support drivers through voice.

How is the Public and Industry Reacting?

Grok 4 didn’t just bring new features. It started real conversations. As more people tried it out, their reactions showed what made it different, what they liked, and what raised questions. From everyday users to AI experts, many had something to say about how Grok 4 works in the real world.

User Reactions

Many users were impressed with Grok 4’s intelligence, especially in solving logic problems, answering tough math questions, and analyzing long documents. Some also liked the voice interaction feature, which made the experience feel more natural.

What stood out most to users was Grok’s personality. It felt more human, with a witty and slightly rebellious tone. Unlike other AI models that often sound robotic or neutral, Grok came across as more casual and fun to talk to. This made it popular for both serious tasks and more creative or playful chats.

At the same time, a few users said the model could be hard to use without proper guidance. Grok 4 requires good prompt-writing skills to get the best results. Some also found its answers a little unpredictable because of its strong personality.

Expert Opinions

AI experts were impressed by Grok 4’s benchmark scores. Its performance on complex tests like ARC-AGI and Humanity’s Last Exam showed that it has strong reasoning abilities. However, researchers also pointed out some concerns.

One major issue is the lack of transparency. Grok 4 does not have a public system card or detailed technical report. This makes it harder to understand how it was trained or how safe and fair it really is. Without this information, researchers cannot fully evaluate the model’s strengths and weaknesses.

Another concern is the model’s bias. In many responses, Grok seems to reflect Elon Musk’s personal opinions. When asked political or sensitive questions, its answers sometimes feel one-sided. Experts warn that this could be risky, especially if people rely on Grok for important decisions. They believe AI should be neutral and not show the views of any one person or company.

The Political Angle: What Could Happen Next?

There is also growing interest in how Grok 4 will respond to changes in Elon Musk’s public relationships. Recently, Musk has had a falling out with Donald Trump, which has created some tension online.

Some people are wondering if Grok’s political responses will start to shift, too. In the past, Grok has reflected some of Musk’s ideas. If Musk’s views change or if his public fights continue, Grok might start to show those changes in its answers.

This could become a big issue, especially with important elections coming up. If Grok is seen as taking sides, it could lose trust among users. Experts say that AI models must be fair and balanced, especially when used in fields like healthcare, education, or finance.

As Grok 4 gains more users, these concerns about bias and fairness will likely become even more important. People want powerful tools, but they also want them to be reliable and neutral.

What’s Next for Grok and xAI?

xAI has revealed an aggressive roadmap:

August 2025: Release of Grok Coding Assistant
September 2025: Launch of a full multimodal agent with browser and advanced image tools
October 2025: Release of Grok Video, a model that can generate or understand video
End of 2025: Grok 5 may launch with improved common sense, better safety, and more compute

xAI also plans to expand to major cloud platforms and partner with industries for real-world deployments. Tesla, X, and Optimus robot integration are all on the table.

Wrapping It Up

Grok 4 is not just another language model. It is a serious upgrade in intelligence, logic, and usability. Its performance on exams, math tasks, and simulations shows that it is already leading the next wave of advanced AI. With its real-time web access, multi-agent system, and wide context window, it pushes beyond what GPT-4, Claude, or Gemini currently offer.

But it also comes with big questions about safety, transparency, and values. The world will be watching not just what Grok can do, but how xAI chooses to shape its future.

For now, Grok 4 stands as the most powerful AI model you can use if you can handle both its brilliance and its intelligence.

Just published

Databricks Partner Tiers Explained (Bronze, Silver, Gold and Platinum) blog image

Databricks Partner Tiers Explained (Bronze, Silver, Gold and Platinum)Read More

Is Databricks a Good Fit for Mid-Market Data Teams? blog image

Is Databricks a Good Fit for Mid-Market Data Teams?Read More

Should you build, buy, or partner for AI? A cost comparison that holds up to your CFO blog image

Should you build, buy, or partner for AI? A cost comparison that holds up to your CFORead More

Explore More

Trusted by Market Leaders in Education, Travel, Finance and E-commerce since 2007

We put excellence, value and quality above all - and it shows

NPS

INDUSTRIES

Real-time Maintenance Reporting

Workflow Automation Platform

Recruitment Automation Tool

Learner Engagement Platform

Customer Feedback Analytics

School Communication Suite

Digital Learning Suite

Software Development Outsourcing

Dedicated Teams

IT Staff Augmentation

New Venture Partnership

Is Grok 4 the Most Powerful AI or the Most Biased?

TL;DR

Introduction

What Is Grok 4 and Why Is It Important?