Taelin's bet and solution

X user @VictorTaelin posted a bet:

Mandatory clarifications and thoughts:

This isn’t a tokenizer issue. If you use 1 token per symbol, GPT-4 / Opus / etc. will still fail. Byte-based GPTs fail at this task too. Stop blaming the tokenizer for everything.
This tweet is meant to be an answer to the following argument. You: “GPTs can’t solve new problems”. Them: “The average human can’t either!”. You: this prompt. In other words, this is a simple “new statement” that an average human can solve easily, but current-gen AIs can’t.
The reason GPTs will never be able to solve this is that they can’t perform sustained logical reasoning. It is that simple. Any “new” problem outside of the training set, that requires even a little logical reasoning, will not be solved by GPTs. That’s what this aims to show.
A powerful GPT (like GPT-4 or Opus) is basically one that has “evolved a circuit designer within its weights”. But the rigidness of attention, as a model of computation, doesn’t allow such evolved circuit to be flexible enough. It is kinda like AGI is trying to grow inside it, but can’t due to imposed computation and communication constraints. Remember, human brains undergo synaptic plasticity all the time. There exists a more flexible architecture that, trained on much smaller scale, would likely result in AGI; but we don’t know it yet.
The cold truth nobody tells you is that most of the current AI hype is due to humans being bad at understanding scale. Turns out that, once you memorize the entire internet, you look really smart. Everyone on AI is aware of that, it is just not something they say out loud. Most just ride the waves and enjoy the show.
GPTs are still extremely powerful. They solve many real-world problems, they turn 10x devs into 1000x devs, and they’re accelerating the pace of human progress in such a way that I believe AGI is on the corner. But it will not be a GPT. Nor anything with gradient descent.
I may be completely wrong. I’m just a person on the internet. Who is often completely wrong. Read my take and make your own conclusion. You have a brain too!

A Prompt for an Unsolvable Problem

From GitHub:

A::B is a system with 4 tokens: `A#`, `#A`, `B#` and `#B`.

An A::B program is a sequence of tokens. Example:

    B# A# #B #A B#

To *compute* a program, we must rewrite neighbor tokens, using the rules:

    A# #A ... becomes ... nothing
    A# #B ... becomes ... #B A#
    B# #A ... becomes ... #A B#
    B# #B ... becomes ... nothing

In other words, whenever two neighbor tokens have their '#' facing each-other,
they must be rewritten according to the corresponding rule. For example, the
first example shown here is computed as:

    B# A# #B #A B# =
    B# #B A# #A B# =
    A# #A B# =
    B#

The steps were:
1. We replaced `A# #B` by `#B A#`.
2. We replaced `B# #B` by nothing.
3. We replaced `A# #A` by nothing.
The final result was just `B#`.

Now, consider the following program:

    A# B# B# #A B# #A #B

Fully compute it, step by step.
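As an aside, the rewrite rules in the prompt are mechanical enough to sketch in a few lines of Python. The block below is only an illustration, not part of the original prompt or of any contest tooling: the names `RULES`, `step` and `compute` are made up here, and the leftmost-first rewriting order is an assumption (the prompt does not say which pair to rewrite first).

```python
# Rewrite rules copied from the prompt: (left token, right token) -> replacement.
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}

def step(tokens):
    """Apply the leftmost applicable rule once; return None if no rule applies."""
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if pair in RULES:
            return tokens[:i] + RULES[pair] + tokens[i + 2:]
    return None

def compute(program):
    """Return every intermediate program, from the input down to the final form."""
    tokens = program.split()
    trace = [tokens]
    while (nxt := step(tokens)) is not None:
        tokens = nxt
        trace.append(tokens)
    return [" ".join(t) for t in trace]

if __name__ == "__main__":
    # The program from the prompt; each printed line is one rewriting step.
    for line in compute("A# B# B# #A B# #A #B"):
        print(line)
```

Run on the worked example `B# A# #B #A B#`, this sketch reproduces the four lines shown in the prompt. Run on the program above, it takes seven rewrites and ends at `#A B# B#`.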

However, he lost the bet and admitted defeat:

I WAS WRONG - $10K CLAIMED!

The Claim

Two days ago, I confidently claimed that “GPTs will NEVER solve the A::B problem”. I believed that: 1. GPTs can’t truly learn new problems, outside of their training set, 2. GPTs can’t perform long-term reasoning, no matter how simple it is. I argued both of these are necessary to invent new science; after all, some math problems take years to solve. If you can’t beat a 15yo in any given intellectual task, you’re not going to prove the Riemann Hypothesis. To isolate these issues and raise my point, I designed the A::B problem, and posted it here - full definition in the quoted tweet.

Reception, Clarification and Challenge

Shortly after posting it, some users provided a solution to a specific 7-token example I listed. I quickly pointed out that this wasn’t what I meant; that this example was merely illustrative, and that answering one instance isn’t the same as solving a problem (and can be easily cheated by prompt manipulation).

So, to make my statement clear, and to put my money where my mouth is, I offered a $10k prize to whoever could design a prompt that solved the A::B problem for random 12-token instances, with 90%+ success rate. That’s still an easy task, that takes an average of 6 swaps to solve; literally simpler than 3rd grade arithmetic. Yet, I firmly believed no GPT would be able to learn and solve it on-prompt, even for these small instances.
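To make the success criterion concrete, here is a rough sketch of how such a challenge could be scored. It is not the evaluator actually used for the prize: `solver` is a hypothetical callable standing in for "prompt plus model", and uniform token sampling, 50 trials and a fixed seed are all assumptions made for illustration.

```python
import random

TOKENS = ["A#", "#A", "B#", "#B"]

# Rewrite rules from the A::B prompt: (left token, right token) -> replacement.
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}

def normalize(tokens):
    """Rewrite until no rule applies and return the final token list."""
    tokens = list(tokens)
    i = 0
    while i < len(tokens) - 1:
        pair = (tokens[i], tokens[i + 1])
        if pair in RULES:
            tokens[i:i + 2] = RULES[pair]
            i = max(i - 1, 0)  # a new pair may have formed one position to the left
        else:
            i += 1
    return tokens

def random_instance(rng, length=12):
    """Draw a random program; uniform sampling is an assumption of this sketch."""
    return [rng.choice(TOKENS) for _ in range(length)]

def success_rate(solver, trials=50, length=12, seed=0):
    """Fraction of random instances where `solver` returns the correct final form.

    `solver` is hypothetical: it takes a program string and returns the
    final program as a space-separated string.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        program = random_instance(rng, length)
        answer = solver(" ".join(program)).split()
        if answer == normalize(program):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # A reference solver that just runs the rules always scores 1.0.
    reference = lambda p: " ".join(normalize(p.split()))
    print(success_rate(reference))
```

Under this sketch, a submission would clear the bar if `success_rate(solver) >= 0.9`. In practice `solver` would wrap a model call and parse the final answer out of its response, which is where most of the real difficulty lies.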

Solutions and Winner

Hours later, many solutions were submitted. Initially, all failed, barely reaching 10% success rates. I was getting fairly confident, until, later that day, @ptrschmdtnlsn and @SardonicSydney submitted a solution that humbled me. Under their prompt, Claude-3 Opus was able to generalize from a few examples to arbitrary random instances, AND stick to the rules, carrying long computations with almost zero errors. On my run, it achieved a 56% success rate.

Through the day, users @dontoverfit (Opus), @hubertyuan_ (GPT-4), @JeremyKritz (Opus), @parth007_96 (Opus) and @ptrschmdtnlsn (Opus) reached similar success rates, and @reissbaker made a pretty successful GPT-3.5 fine-tune. But it was only late that night that @futuristfrog posted a tweet claiming to have achieved near 100% success rate, by prompting alone. And he was right. On my first run, it scored 47/50, granting him the prize, and completing the challenge.

How it works!?

The secret to his prompt is… going to remain a secret! That’s because he kindly agreed to give 25% of the prize to the most efficient solution. This prompt costs $1+ per inference, so, if you think you can improve on that, you have until next Wednesday to submit your solution in the link below, and compete for the remaining $2.5k! Thanks, Bob.

How do I stand?

Corrected! My initial claim was absolutely WRONG - for which I apologize. I doubted the GPT architecture would be able to solve certain problems which it, with no margin for doubt, solved. Does that prove GPTs will cure Cancer? No. But it does prove me wrong!

Note there is still a small problem with this: it isn’t clear whether Opus is based on the original GPT architecture or not. All GPT-4 versions failed. If Opus turns out to be a new architecture… well, this whole thing would have, ironically, just proven my whole point 😅 But, for the sake of the competition, and in all fairness, Opus WAS listed as an option, so, the prize is warranted.

Who I am and what I’m trying to sell?

Wrong! I won’t turn this into an ad. But, yes, if you’re new here, I AM building some stuff, and, yes, just like today, I constantly validate my claims to make sure I can deliver on my promises. But that’s all I’m gonna say, so, if you’re curious, you’ll have to find out for yourself (:

That’s all. Thanks for all who participated, and, again - sorry for being a wrong guy on the internet today! See you.