Google Gemini Stuns with 1 Million Tokens: Why OpenAI GPT Is Still Ahead

Torsten Volk
FAUN — Developer Community 🐾
6 min read · Apr 22, 2024


OpenAI DALL-E’s depiction of the context window of an LLM

Sundar Pichai, CEO of Google, touting the one million token context window of the latest Gemini LLM reminds me very much of the age when makers of digital cameras tried to beat the competition by advertising higher and higher numbers of megapixels. Product marketers conveniently left out the fact that their cameras’ lenses were not capable of truly resolving the detail that those high megapixel counts suggested. This typically led to significantly worse image quality compared to cameras with a much lower resolution. Much like those early digital cameras, simply boasting about a larger token count in an LLM does not translate to better performance. The effectiveness of token utilization — how the model processes and interprets these tokens — is much more indicative of its actual utility and sophistication.

2048 Pages of Context versus 256 Pages

Google’s new Gemini 1.5 Pro Large Language Model (LLM) touts a context window of 1 million tokens, roughly eight times the 128,000 tokens currently supported by OpenAI’s GPT4 and five times the 200,000 tokens that Anthropic’s Claude 3 Opus can currently process in response to user input. One token can stand for part of a word, a whole word, or an entire term. To make this more tangible, we can estimate that Gemini can handle roughly 2,048 pages of question context, while GPT4 accepts about 256 pages and Claude 3 Opus about 400 pages. This means that when the user inputs text, each LLM has a different limit on how many PDF documents, how much conversation history, how much web and email context, and how much other content it can consider to provide the contextually optimal response to this user input.
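The page estimates above can be reproduced with simple arithmetic, assuming a rule-of-thumb figure of roughly 500 tokens per page (the actual count varies with the tokenizer and the text, so treat these as ballpark numbers):

```python
# Rough context-window comparison for the three models discussed.
# TOKENS_PER_PAGE is an assumption, not an official figure.
CONTEXT_WINDOWS = {
    "Gemini 1.5 Pro": 1_000_000,
    "GPT-4": 128_000,
    "Claude 3 Opus": 200_000,
}

TOKENS_PER_PAGE = 500

def pages(tokens: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Estimate how many pages of text fit into a context window."""
    return tokens // tokens_per_page

for model, window in CONTEXT_WINDOWS.items():
    print(f"{model}: {window:,} tokens ~ {pages(window):,} pages")
```

At ~500 tokens per page this yields about 2,000, 256, and 400 pages respectively, matching the order-of-magnitude comparison in the text.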

The context window provides the LLM with guidance about the user’s situational intent.

Google Gemini 1.5 Pro versus OpenAI’s GPT 4

You can try out Gemini 1.5 Pro here. After my initial testing, I can say that for my use cases GPT4 produces dramatically more relevant and concise results compared to Gemini. The Python code created by GPT4 in response to my inputs is much more likely to run without error, and GPT’s ability to extract and summarize key aspects from large quantities of text seems significantly more balanced, meaningful, and comprehensive. Please judge for yourself.

Token Memory versus Actual Model Memory

To answer the question of why Google’s new LLM does not blow away GPT4 (or Claude 3 Opus) despite its much larger context window, we need to understand the difference between token memory and actually useful model memory.

The context window provides the already trained LLM with additional information on how to interpret the user query.

For example, the user may inquire about the top speed of a jaguar. Based on the conversation history, the LLM knows that the user is interested in the jaguar, the animal, as opposed to the luxury car, because earlier parts of the conversation were all about planning a (hopefully photo) safari in Zimbabwe. Had that same conversation been about finding the right car to enjoy the German motorways in style while being able to outrun the police, our LLM would have known to look up the top speed of the car instead of that of the animal.

Google Trends chart showing 400% more Google searches for OpenAI GPT (red) compared to Google Gemini (blue)

When Does Context Window Size Matter?

Gemini being able to “see” 2,048 pages of context versus GPT4’s 256 pages means Gemini can consider roughly eight times as much context as GPT4. This could matter for complex tasks, where the user expects the LLM to “do some real high level and strategic thinking.” For example, in a business meeting scenario where a user is discussing a multi-year project plan, Gemini’s larger token memory would allow it to maintain a broader context of the project’s goals, milestones, and dependencies across multiple meeting sessions.

Sundar Pichai announcing Gemini 1.5 Pro with its 1 Million tokens context window (source: Google.com)

In contrast, GPT4’s smaller token memory might cause it to lose track of some crucial context, potentially leading to less informed or accurate responses. This difference in token memory could also impact the user’s experience in scenarios where the conversation delves into intricate topics, such as legal or financial matters, where a more extensive context is essential to provide accurate and relevant information.

Actual Model Memory Is What Counts

Key factors contributing to the effective model memory of a large language model.

The fact that GPT4 often delivers significantly better outputs than Gemini shows that other factors matter more than the raw size of the context window. For example, GPT4 might use its context window more efficiently than Gemini, by culling unnecessary context and homing in on relevant content.

Multi-Query Attention (MQA) optimizes how information is processed by letting all of a model’s query heads share a single set of keys and values instead of maintaining a separate set per head. This sharing dramatically shrinks the key-value cache the model must hold in memory, which makes long contexts far cheaper to serve and frees computational resources to focus on the parts of the input that are most relevant to each query. The model can thus maintain high performance and accuracy without the overhead of managing extensive per-head context state. This efficiency is particularly beneficial in environments where computational resources are limited or when rapid response times are essential.
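Neither OpenAI nor Google has published exactly which attention variants their flagship models use, so the following is only a minimal NumPy sketch of the MQA idea itself: several query heads attend over one shared key/value projection, so the cached K/V tensors are a fraction of the usual multi-head size. All array sizes and weight names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, wq, wk, wv, n_heads):
    """Multi-Query Attention: n_heads query projections, but a single
    shared key/value projection (the KV cache shrinks by ~n_heads x)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ wq).reshape(seq, n_heads, d_head)  # per-head queries
    k = x @ wk                                  # ONE shared key head
    v = x @ wv                                  # ONE shared value head
    out = np.empty((seq, n_heads, d_head))
    for h in range(n_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)  # (seq, seq)
        out[:, h, :] = softmax(scores) @ v
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
seq, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq, d_model))
wq = rng.normal(size=(d_model, d_model))  # queries: all heads
wk = rng.normal(size=(d_model, d_head))   # keys: one head, shared
wv = rng.normal(size=(d_model, d_head))   # values: one head, shared
y = multi_query_attention(x, wq, wk, wv, n_heads)
print(y.shape)  # (6, 16)
```

Note that `wk` and `wv` project to a single head’s width, which is where the memory saving over standard multi-head attention comes from.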

Dynamic token re-weighting is a process where a language model like GPT-4 adjusts the significance of each word or token in its context window based on its relevance to the current task. As the model processes input, it assigns weights to tokens, amplifying those that are crucial for understanding the query and diminishing the importance of less relevant ones. This selective focus allows the model to operate effectively with smaller context windows, as it concentrates computational resources on the most informative parts of the text, thereby optimizing performance and enabling more precise and contextually relevant outputs.
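As a toy illustration of the re-weighting idea (not any model’s actual implementation), the sketch below scores a handful of context tokens against a “wildlife” query and softmaxes the scores; the two-dimensional embeddings are invented purely for illustration, echoing the jaguar example from earlier.

```python
import numpy as np

def reweight_tokens(token_embs, query_emb, temperature=0.5):
    """Toy token re-weighting: score each context token against the
    current query, then softmax so the relevant tokens dominate."""
    scores = token_embs @ query_emb / temperature
    e = np.exp(scores - scores.max())
    return e / e.sum()

tokens = ["jaguar", "safari", "zimbabwe", "bmw", "autobahn"]
# Made-up 2-d embeddings: axis 0 = "wildlife", axis 1 = "cars".
embs = np.array([[0.9, 0.9],
                 [1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])
wildlife_query = np.array([1.0, 0.0])
w = reweight_tokens(embs, wildlife_query)
for t, wi in zip(tokens, w):
    print(f"{t:>9}: {wi:.2f}")  # safari/zimbabwe outweigh bmw/autobahn
```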

Language models can also differentiate by obtaining a deeper and more nuanced understanding of semantic relationships between words and phrases. This is achieved through more sophisticated training techniques that include contrastive learning, where the model learns not just to predict the next word, but also to distinguish between closely related concepts and contexts. This results in a more nuanced understanding and generation of text, allowing the LLM to produce outputs that are not only contextually appropriate but also rich in detail and subtlety.
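One common formulation of this idea is an InfoNCE-style contrastive loss, sketched below in NumPy: the model is rewarded for embedding an anchor close to its matching “positive” example and far from the negatives. The vectors here are invented for the example, and nothing is claimed about any particular model’s training recipe.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: low when the anchor is most
    similar to its positive pair, high when a negative wins instead."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(anchor, positive)] +
                    [cos(anchor, n) for n in negatives]) / temperature
    m = sims.max()  # numerically stable log-sum-exp
    return m + np.log(np.exp(sims - m).sum()) - sims[0]

anchor = np.array([1.0, 0.0])
good_pos = np.array([0.9, 0.1])   # nearly aligned with the anchor
bad_pos = np.array([0.0, 1.0])    # orthogonal to the anchor
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
l_good = info_nce_loss(anchor, good_pos, negs)
l_bad = info_nce_loss(anchor, bad_pos, negs)
print(l_good, l_bad)  # the aligned positive yields the smaller loss
```

Training on such objectives pushes the model to separate closely related concepts (jaguar-the-animal vs. Jaguar-the-car) rather than merely predicting the next word.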

Dynamic Context Windows Are More Efficient

These advanced capabilities demonstrate that the effectiveness of a language model in handling complex tasks and large contexts is not solely dependent on the size of its context window or the raw computational power. Instead, it hinges on how intelligently the model can utilize its resources to focus on what’s most relevant and ignore extraneous information, thereby delivering outputs that are both high-quality and contextually precise.

LLMs can create a dynamic context window based on the current situation and user requests.

Final Thoughts and Conclusion

In conclusion, while the impressive one million token context window of Google’s Gemini 1.5 Pro offers a theoretical advantage in handling extensive data, the practical effectiveness of a language model like GPT-4 often surpasses it due to more sophisticated mechanisms such as Multi-Query Attention and dynamic token re-weighting. These technologies enable GPT-4 to efficiently manage smaller context windows by focusing computational resources on the most relevant information, thus optimizing performance. This approach not only enhances the quality of the outputs but also demonstrates that the strategic use of technology can often outweigh sheer capacity in achieving superior results in real-world applications.


Artificial Intelligence, Cognitive Computing, Automatic Machine Learning in DevOps, IT, and Business are at the center of my industry analyst practice at EMA.