Google Gemini vs GPT-4: methodology for comparison
Professionals such as ML engineers, data scientists, and NLP researchers use particular tools to assess LLMs’ context awareness. Using these tools takes specialized skills and experience; in return, they provide a deep understanding of the topic.
Since our focus is not on “scientific” objectives but on practical, everyday evaluations of “Chat GPT vs Gemini,” we employ our own approach to model comparison. While rooted in professional methodology, it has been adapted for simplicity and clarity, both in usage and in the interpretation of results.
Professional approach to LLM context awareness evaluation
ML engineers use plenty of methods to evaluate a language model’s context awareness. These methods fall into two groups of evaluation tools. We mention them briefly, just for general understanding, to show that “context” is quite a complex term and its assessment is multidimensional:
- separate tests for a particular model’s skills evaluation, such as SQuAD (Stanford Question Answering Dataset) or NarrativeQA;
- benchmarks, or suites of tests that ML engineers use to compare models with each other; for example, GLUE (General Language Understanding Evaluation) or MT-Bench (Multi-Turn Benchmark).
All these tests measure much more than just a model’s ability to understand context, since this ability is closely related to the model’s other skills, such as comprehension, reasoning, coherence, and the ability to maintain consistency across different parts of a conversation or text.
One of the tests is known under the figurative name “Needle in a Haystack.”
You might have guessed from the name what the core of the approach is:
- “Haystack” is a pile of information that contains distractors: pieces of irrelevant information added to make the task more challenging.
- “Needle” is a piece of information that an LLM should find.
ML engineers and data scientists use different “haystacks” and “needles” to gauge how well a model deals with context. For example, when this test was first used to compare OpenAI’s GPT-4 and Anthropic’s Claude 2.1, a group of data scientists inserted a particular phrase into excerpts from essays by Paul Graham, an English-American computer scientist and writer.
While applying this approach, researchers vary mainly two parameters (a short code sketch follows this list):
- The length of the “haystack,” measured in tokens: parts of words such as syllables, stems, prefixes, and suffixes.
- The depth at which the “needle” is placed, meaning it can appear at the beginning, middle, or end of a “haystack.”
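To make this setup concrete, here is a minimal Python sketch of how such a test can be wired up. Everything in it is illustrative: the filler text, the “needle,” and the `ask_llm` callable stand in for whatever model API you use, and length is approximated in words rather than exact tokens.

```python
# A minimal sketch of a "needle in a haystack" check, written for illustration.
# `ask_llm` is a placeholder for whatever chat-model API you use.

FILLER = "Paul Graham writes about startups, programming languages, and essays. "
NEEDLE = "The secret passphrase for today is 'blue pelican'."
QUESTION = "What is the secret passphrase mentioned in the text?"
EXPECTED = "blue pelican"


def build_haystack(total_words: int, needle_depth: float) -> str:
    """Repeat filler text up to roughly `total_words` words and insert the
    needle at `needle_depth` (0.0 = beginning, 1.0 = end)."""
    words = (FILLER * (total_words // len(FILLER.split()) + 1)).split()[:total_words]
    insert_at = int(len(words) * needle_depth)
    return " ".join(words[:insert_at] + [NEEDLE] + words[insert_at:])


def run_needle_test(ask_llm, total_words: int = 800, needle_depth: float = 0.5) -> bool:
    """Return True if the model's answer contains the expected fact."""
    prompt = f"{build_haystack(total_words, needle_depth)}\n\n{QUESTION}"
    answer = ask_llm(prompt)  # e.g. a call to the OpenAI or Gemini SDK
    return EXPECTED.lower() in answer.lower()
```

Sweeping `total_words` and `needle_depth` over a grid is exactly how the two parameters described above are explored in practice.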
While we are far from using complex “scientific” methods for the “Gemini vs Chat GPT” comparison, we will use the “needle in a haystack” approach for our practical goal.
Our practical approach to LLM context awareness evaluation
In addition to using the conventional way of assessment, we decided to add some creativity to the procedure by asking the LLMs to create tests for each other.
We played the role of prompt engineers, assisting the models in creating “haystacks” and “needles.” Here is why the models needed assistance: despite their comprehensive understanding of the terms and the task in general, the contestants found it challenging to create tasks that were both complex enough and easy to evaluate clearly. Using a few-shot approach to prompting, we evaluated the models’ success by counting how many prompts were needed to get a satisfactory result.
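For illustration, here is the general shape of the few-shot prompt we iterated on. The worked example inside it is invented for demonstration purposes and is not one of the exact prompts we used.

```python
# The general shape of a few-shot prompt used to coax a model into building
# a valid task. The worked example inside is invented for demonstration.

FEW_SHOT_TASK_PROMPT = """\
You are designing a "needle in a haystack" test for another language model.

Example of a good task:
- Haystack: a text of roughly 800 tokens about the history of aviation.
- Needle: the sentence "The first powered flight lasted only 12 seconds."
- Distractors: the duration is paraphrased elsewhere with a wrong number,
  and the needle's topic reappears once in an irrelevant context.
- Question: "How long did the earliest engine-driven flight last?"
  (it paraphrases the needle instead of quoting it).

Now create a new task with the same structure:
1. a haystack of roughly 850 tokens on a different topic;
2. exactly one needle and two distractors;
3. a question that paraphrases the needle rather than repeating it.
Return the haystack, the needle, the distractors, and the question separately.
"""
```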
At the same time, we added our own task to the “Chat GPT 4 vs Gemini” competition. Showing creativity once again, we used not a specialized test but a task designed for humans: one taken from a popular textbook for English learners that specifically tests the ability to understand context.
Additionally, the task contains “traps”: distracting factors that complicate the search for the correct answer. Such tasks are standard for testing humans, so why not use one to test LLMs?
To sum up, we’ll have three rounds of competition:
- Round 1. Each model creates a task within the “Needle in a Haystack” approach.
- Round 2. Each model deals with the task created by the other.
- Round 3. Both models handle the English language test.
Ready, steady, go!
Round 1. Task creation
So, the first step is to create a task for the participants of the “Chat GPT 4 vs Gemini” battle. Let’s look at how the participants handled it.
GPT’s task for Gemini
GPT created a test that met our requirements within 5 prompts.
The text consists of approximately 864 tokens and includes two distractors — phrases or concepts repeated multiple times to test the LLM's ability to identify the most relevant information amidst potential distractions.
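If you want to check such counts yourself, one way is to use an OpenAI-style tokenizer; the sketch below assumes the tiktoken package and a hypothetical file holding the generated task. Gemini uses its own tokenizer, so its counts may differ slightly.

```python
# One way to verify the approximate token count of a generated "haystack",
# assuming an OpenAI-style tokenizer via the tiktoken package.
import tiktoken


def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Return the number of tokens `text` occupies for the given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


# Hypothetical usage: check that GPT's task is close to the reported ~864 tokens.
# haystack_text = open("gpt_task.txt", encoding="utf-8").read()
# print(count_tokens(haystack_text))
```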
Gemini’s task for GPT
We must say that Gemini made a more favorable impression as a test creator in the “Chat GPT vs Gemini” contest. Unlike GPT, it suggested making the test more challenging by adding several needles to the haystack. Moreover, Gemini emphasized that several conditions are essential to meet:
Gemini created the test within 4 prompts, showing higher efficiency in the Chat GPT 4 vs Gemini comparison.
The result of the LLM’s work is similar to GPT’s: a text of approximately 860 tokens with two distractors.
You can see the full tasks for both models in this document.
Scoring
Both models completed the task with our assistance. However, we awarded Gemini two additional points in the “Chat GPT vs Gemini” battle: one for its clear and concise explanation of the test creation approach, and another for completing the task with fewer prompts.
Round 2. Task performance
Now we have two tests at hand for the Chat GPT 4 vs Gemini contest.
Let’s briefly go over the challenges the models will take on. Although the tests were created by different LLMs, they are comparable in difficulty thanks to careful prompting.
The task is not as simple as finding a short piece of information (a “needle”) in a bigger text (a “haystack”). Each text contains two distractors, tricky formulations that make the test more rigorous.
Let’s look at the foils:
- The first distractor. The test prompt doesn’t literally repeat the piece of text from the “haystack.” For example, our “needle” is “Paris,” and it’s mentioned in the sentence “Paris is the capital of France.” In the prompt, we don’t ask directly, “What is the capital of France?” Instead, we ask, “What is the chief city of France?”
- The second distractor. We mention the “needle” in the text twice, in different contexts. In the case of “Paris,” we mention it first as the capital of France and a second time as the location of the Eiffel Tower. Since the query refers to the capital of France, the second mention isn’t relevant to the prompt.
Adding distractors helps assess a model’s ability not only to understand the meaning of words but also to recognize synonyms and to stay aware of the context, which is the primary goal of our experiment.
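To show how the pieces fit together, here is the Paris example from the list above written out as a small data structure, with a deliberately naive grader. It illustrates the idea only; it is not the grading we applied to the models’ full answers.

```python
# The "Paris" example from the text, written out as a small data structure.
# The grader below is deliberately naive and only illustrates the idea.

test_case = {
    "needle": "Paris",
    "haystack_sentences": [
        "Paris is the capital of France.",                      # relevant mention
        "Many tourists photograph the Eiffel Tower in Paris.",  # decoy mention (second distractor)
        "France exports wine and cheese worldwide.",            # plain filler
    ],
    # First distractor: the question paraphrases "capital" as "chief city"
    # instead of quoting the haystack.
    "question": "What is the chief city of France?",
    "expected_answer": "Paris",
}


def is_correct(model_answer: str, case: dict) -> bool:
    """Check whether the expected answer appears in the model's reply."""
    return case["expected_answer"].lower() in model_answer.lower()
```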
Gemini’s performance
The inputs of the test Gemini took in the Chat GPT vs Gemini contest are as follows:
- the “haystack” is a text about the history of architecture;
- the “needle” is the fact that “the Eiffel Tower can be 15 cm taller during the summer due to the expansion of iron in the heat”;
- the prompt, containing the rephrased hint “intriguing” instead of “interesting” and one false mention of the Eiffel Tower, is as follows: “What intriguing fact about the Eiffel Tower is mentioned in the text?”
Here is Gemini’s answer:
The answer is correct!
GPT’s performance
The inputs of the test GPT took in the Gemini vs GPT contest are as follows:
- the “haystack” is a text about renewable energy sources;
- the “needle” is the term “smart grid”;
- the prompt, containing two rephrased hints and one false mention of smart grids, is as follows: “What technological innovation, mentioned in the text, is being developed to manage the fluctuating and unpredictable nature of power generation from renewable sources?”
Here is GPT’s answer:
The answer is correct!
Scoring
Round 3. The English language test
We based the final round of the Google Gemini vs GPT-4 competition on a language test: strict, precise, and conventional. The test is taken from a well-known textbook for English learners: Evans, V., & Dooley, J. (2002). Upstream Proficiency Student’s Book. Express Publishing.
The task is to read the text and answer seven questions by selecting the answer that best matches the text.
Here is why a test from a textbook for English learners suits our requirements:
- such tests are created to assess students’ abilities to comprehend the text, which requires skills in grasping context;
- there are no ambiguities in the wording of either the text or the questions;
- the questions include distractors to make the task more challenging and to assess the solidity of comprehension skills.
To check whether the answers were correct, we used the answer keys available in the Teacher’s Book, which is part of the same course package.
We used identical simple prompts to ensure consistent testing conditions for both models.
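For transparency, here is a sketch of how such a round can be tallied. The answer key below is a placeholder (we can’t reproduce the Teacher’s Book key here), and the assumption is that each model returns one letter per question.

```python
# A sketch of how a multiple-choice round can be tallied, assuming each model
# returns one letter (A-D) per question. The key below is a placeholder.

ANSWER_KEY = ["B", "D", "A", "C", "C", "B", "A"]  # hypothetical, not the real key


def score(model_answers: list[str], key: list[str] = ANSWER_KEY) -> int:
    """Count how many of the model's choices match the answer key."""
    return sum(a.strip().upper() == k for a, k in zip(model_answers, key))


# Example usage with made-up responses:
# print(score(["B", "D", "A", "C", "C", "B", "D"]))  # -> 6
```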
The test results were unexpected. Here are the numbers of correct answers out of the seven questions:
- GPT — 4 answers;
- Gemini — 5 answers.
Final scoring:
Gemini is the winner!
Conclusion
While working on the Gemini vs GPT experiment, we had no intention of making it serious and “scientific,” for three reasons:
- you can find plenty of evaluation results and benchmarks to help you decide which LLM is best for your AI solution, so you hardly need another one;
- state-of-the-art LLMs are proficient in grasping contextual information; this is an established fact that doesn’t need to be proved yet again;
- if we rate well-known LLMs’ abilities on a 100-point scale, the difference between models lies somewhere between 90 and 100 points, not between 20 and 100; hence, decision-making rests more on evaluating a model’s particular strengths and weaknesses than on overall test scores.
Hence, our goals were practical, with a dash of amusement, rather than oriented toward strict “scientific” outcomes.
In addition, we were glad to show you that you can test LLMs on the go, with your own hands. In the CoSupport AI team’s opinion, hands-on testing is one of the most efficient ways to become better acquainted with AI.
We expected that both LLMs would show similar outcomes, with slight fluctuations, in the Chat GPT vs Gemini battle. Thus, the third round surprised us to some extent. It turns out our experiment illustrated an idea that is vital for decision-making: a small difference in scores can matter a great deal.
It’s much like the third round of our playful competition: just a few points can stand between a student and passing the exam that grants the certificate.
In practice, a business owner or CEO needs sufficient information to choose the foundation for their AI-driven solutions. Moreover, evaluating test results requires specialized expertise. That is why the CoSupport AI team recommends engaging ML engineers to discuss particular models’ advantages and flaws.
Certainly, we’ll be glad to help you choose the technical basis for your AI solutions and explain the details concisely. With three years of experience in ML and NLP, an AI architecture patented in the U.S., and a team of skilled ML engineers, we have every chance of becoming the advisor of your choice.