Microsoft recently released a research paper titled: Sparks of Artificial General Intelligence: Early experiments with GPT-4. As described by Microsoft:

This paper reports on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit more general intelligence than previous AI models.

The paper presents evidence that GPT-4 goes far beyond memorization, and that it has a deep and flexible understanding of concepts, skills, and domains. In fact, the breadth of tasks across which it can generalize arguably exceeds what any single human could cover.

While we have previously discussed the benefits of AGI, we should quickly summarize the general consensus on what an AGI system is. In essence, an AGI is a type of advanced AI that can generalize across multiple domains and is not narrow in scope. Examples of narrow AI include an autonomous vehicle, a chatbot, a chess bot, or any other AI that is designed for a single purpose.

An AGI, in comparison, would be able to flexibly alternate between any of the above or any other field of expertise. It’s an AI that would draw on techniques such as transfer learning and evolutionary learning, alongside established approaches such as deep reinforcement learning.

The above description of AGI matches my personal experience with using GPT-4, as well as the evidence shared in the research paper released by Microsoft.

One of the prompts outlined in the paper is for GPT-4 to write a proof of the infinitude of primes in the form of a poem.

If we analyze the requirements for creating such a poem, we realize that it requires mathematical reasoning, poetic expression, and natural language generation. This is a challenge that would exceed the capability of most humans.
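
To make the mathematical half of that prompt concrete, here is a minimal Python sketch of the classic argument usually attributed to Euclid: multiply any finite list of primes together, add one, and the result must have a prime factor that is not in the list. The function names below are hypothetical and chosen for illustration; this is not code from the paper.

```python
from math import prod


def smallest_prime_factor(n: int) -> int:
    """Return the smallest prime factor of n (n >= 2) by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # no divisor found, so n itself is prime


def new_prime_from(primes: list[int]) -> int:
    """Given a finite list of primes, return a prime that is not in the list.

    The product of the list plus one leaves remainder 1 when divided by
    every prime in the list, so none of them can be one of its factors.
    """
    candidate = prod(primes) + 1
    return smallest_prime_factor(candidate)


if __name__ == "__main__":
    print(new_prime_from([2, 3, 5]))
    print(new_prime_from([2, 3, 5, 7, 11, 13]))
```

Note that the product plus one does not have to be prime itself. In the second call it is 30031, which factors as 59 × 509, and both factors are missing from the original list. That is all the proof needs.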

The paper wanted to understand whether GPT-4 was simply producing content from memorization or was understanding context and reasoning. When asked to rewrite the poem in the style of Shakespeare, it was able to do so. This requires a multifaceted level of understanding that exceeds the ability of much of the general population, combining mathematical reasoning with a command of literary style.

How to Measure GPT-4’s Intelligence?

The question then becomes: how can we measure the intelligence of an LLM? And is GPT-4 displaying true learning or mere memorization?

The standard way of testing an AI system is to evaluate it on a set of benchmark datasets, ensuring that they are independent of the training data and that they cover a range of tasks and domains. This type of testing is nearly impossible for GPT-4 because of the enormous quantity of data it was trained on, which makes it hard to rule out that a benchmark was already seen during training.
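
One common way to check that a benchmark is independent of training data is to look for word n-gram overlap between a benchmark item and a sample of the training corpus. The sketch below is an illustration of that general idea only; it is not the method used in the paper, and the function names and the default of n = 5 are assumptions made for this example.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_sample: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus sample.

    A value near 1.0 suggests the item may have been seen during training;
    a value near 0.0 suggests it is probably new. Items shorter than n
    words have no n-grams and score 0.0.
    """
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(corpus_sample, n)) / len(item)


if __name__ == "__main__":
    question = "the quick brown fox jumps over the lazy dog"
    seen = "he said the quick brown fox jumps over the lazy dog today"
    unseen = "an entirely different sentence about large language models and benchmarks"
    print(overlap_ratio(question, seen))
    print(overlap_ratio(question, unseen))
```

A check like this only works when the training corpus is available to search, which is exactly what outside researchers lack for GPT-4.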

The paper then delves into the territory of generating novel and difficult tasks and questions that convincingly demonstrate that GPT-4 goes far beyond memorization, and that it has a deep and flexible understanding of concepts, skills, and domains.

When it comes to intelligence, GPT-4 can generate short stories and screenplays, and it can work through complicated formulas.

GPT-4 is also able to code at a very high level, both in terms of writing code from instructions and understanding existing code. GPT-4 can handle a wide range of coding tasks, from coding challenges to real world applications, from low-level assembly to high-level frameworks, from simple data structures to complex programs such as games. GPT-4 can also reason about code execution, simulate the effects of instructions, and explain the results in natural language. GPT-4 can even execute pseudocode.

GPT-4 can offer rational responses to almost any prompt. In fact, its intelligence is remarkable in its breadth and complexity.

GPT-4 Limitations

The paper then describes how the authors probed GPT-4’s responses and behaviors to verify its consistency, coherence, and correctness, and to uncover its limitations and biases.

The most obvious limitation is its tendency to hallucinate. The paper gives an example of a hallucination in which GPT-4 was confident in an answer that was completely false.

It appears that even GPT-4 understands this current limitation, as this was the response it gave when I asked it what a GPT hallucination is:

A GPT hallucination refers to a phenomenon where a Generative Pre-trained Transformer (GPT) model, like the one you are currently interacting with, produces a response that is not based on factual information or is not coherent with the context provided. These hallucinations occur when the model generates text that may seem plausible but is actually incorrect, misleading, or nonsensical.

In other words, when you train an LLM on the world’s data, how do you prevent the system from learning incorrect data? A large language model learning and regurgitating misinformation and conspiracy theories could be one of the largest pitfalls that humanity faces with large-scale adoption of LLMs. It is also a threat that is surprisingly often missed when discussing the dangers of AGI.

GPT-4 Proofs of Intelligence

The paper illustrates that no matter what type of complex prompt was directed at it, GPT-4 would exceed expectations. As stated in the paper:

Its unparalleled mastery of natural language. It can not only generate fluent and coherent text, but also understand and manipulate it in various ways, such as summarizing, translating, or answering an extremely broad set of questions. Moreover, by translating we mean not only between different natural languages but also translations in tone and style, as well as across domains such as medicine, law, accounting, computer programming, music, and more.

Mock technical interviews were given to GPT-4, and it passed them easily, meaning that a human candidate performing this way could be hired as a software engineer. A similar preliminary test of GPT-4’s competency on the Multistate Bar Exam showed an accuracy above 70%. This means that in the future we could automate many of the tasks that are currently given to lawyers. In fact, there are some startups that are now working to create robot lawyers using GPT-4.

Producing New Knowledge

One of the arguments in the paper is that the only thing left for GPT-4 to prove true levels of understanding is for it to produce new knowledge, such as proving new mathematical theorems, a feat that currently remains out of reach for LLMs.

Then again, this is the holy grail of an AGI. While there are dangers with an AGI falling into the wrong hands, the benefits of an AGI being able to quickly analyze all historical data to discover new theorems, cures, and treatments are nearly infinite.

An AGI could be the missing link towards finding cures for rare genetic diseases which currently lack private industry funding, towards curing cancer once and for all, and towards maximizing the efficiency of renewable power to remove our dependency on unsustainable energy. In fact, it could help solve any consequential problem that is fed into the AGI system. This is what Sam Altman and the team at OpenAI understand: an AGI is truly the last invention that is needed to solve most problems and to benefit humanity.

Of course, that does not solve the nuclear button problem of who controls the AGI and what their intentions are. Regardless, this paper does a phenomenal job arguing that GPT-4 is a leap forward towards achieving the dream AI researchers have had since 1956, when the Dartmouth Summer Research Project on Artificial Intelligence was first held.

While it is debatable whether GPT-4 is an AGI, it could be argued that, for the first time in human history, we have an AI system that can pass the Turing Test.