Regarding GPT-3, there is some discussion about whether scaling up the model would turn it into an Oracle AI. I looked at the actual benchmark results (Appendix H in the paper) to see whether the measurements allow any useful predictions.

**Method:** The OpenAI team ran a suite of 63 different benchmarks (including sub-types), each in zero-, one-, and few-shot settings. Each scenario covers 8 model sizes. I looked at how results scale with model size. With only 8 measurements, any prediction carries a large uncertainty. Formally, one would test the trend function using Bayesian model selection between a linear and, e.g., a polynomial model. I did this for a few benchmarks and eyeballed the rest. So please take the following as an indication only.
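As a rough illustration of the model comparison mentioned above, here is a minimal sketch that compares a linear and a quadratic fit using the Bayesian information criterion (BIC), a common approximation to full Bayesian model selection. It assumes the trend is fitted against the base-10 logarithm of the parameter count and that the noise is Gaussian. The helper `bic_polyfit` and the `scores` array are hypothetical; the scores are invented and are not taken from the paper.

```python
import numpy as np

# Parameter counts of the 8 GPT-3 model sizes (125M up to 175B).
GPT3_SIZES = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])


def bic_polyfit(x, y, degree):
    """BIC of a least-squares polynomial fit, assuming Gaussian noise.

    Lower is better. Breaks down if the fit is exact (zero residuals).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = float(np.sum(residuals ** 2))
    k = degree + 2  # polynomial coefficients plus the noise variance
    return n * np.log(rss / n) + k * np.log(n)


log_n = np.log10(GPT3_SIZES)

# SYNTHETIC scores, invented for illustration only (not from Appendix H).
scores = np.array([22.0, 27.5, 30.0, 33.5, 36.0, 41.0, 43.0, 55.5])

bic_linear = bic_polyfit(log_n, scores, degree=1)
bic_quadratic = bic_polyfit(log_n, scores, degree=2)
print(f"BIC linear: {bic_linear:.2f}, BIC quadratic: {bic_quadratic:.2f}")
```

With only 8 points, a small BIC difference should not be read as strong evidence for either model.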

**Disclaimer:** The smallest GPT-3 model has about $1.25 \times 10^{8}$ parameters (125M), the largest about $1.75 \times 10^{11}$ (175B). That's a span of 3 orders of magnitude. Extrapolating this over many more orders of magnitude is dangerous. Thus, take these numbers only as an indication.

**Results.** For the following tests, I find an **asymptotic trend**. Scaling the model will apparently not yield fantastic results for:

- HellaSwag, LAMBADA, PIQA, CoQA, OpenBookQA, QuAC, RACE, CB, ReCoRD, WiC
- Translations - but unclear level description.
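One crude way to flag a flattening trend like the ones listed above is to compare the fitted slope over the larger models with the slope over the smaller ones. This is only a heuristic sketch, not the procedure used for this post: the function `late_to_early_slope_ratio` and the `saturating` curve are hypothetical, and the x-axis is assumed to be the base-10 logarithm of the parameter count.

```python
import numpy as np

# Parameter counts of the 8 GPT-3 model sizes (125M up to 175B).
GPT3_SIZES = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])


def late_to_early_slope_ratio(log_params, scores):
    """Slope over the larger half of the models divided by the slope
    over the smaller half. Values well below 1 suggest flattening.

    Undefined if the early slope is zero.
    """
    x = np.asarray(log_params, dtype=float)
    y = np.asarray(scores, dtype=float)
    half = len(x) // 2
    early_slope = np.polyfit(x[:half], y[:half], 1)[0]
    late_slope = np.polyfit(x[half:], y[half:], 1)[0]
    return late_slope / early_slope


log_n = np.log10(GPT3_SIZES)

# SYNTHETIC curves, invented for illustration only (not from the paper).
saturating = 80.0 * (1.0 - np.exp(-(log_n - 8.0)))
linear = 10.0 * log_n - 60.0

print("saturating:", late_to_early_slope_ratio(log_n, saturating))
print("linear:    ", late_to_early_slope_ratio(log_n, linear))
```

With 4 points per half, the ratio is noisy on real data, so it can only support an eyeball judgment, not replace it.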

In the following tests, it is **unclear if the trend is asymptotic** or better than that:

- SAT: Could be linear, could be asymptotic. If linear, it will achieve 100% at parameters.
- StoryCloze, Winograd, Winogrande, SQuADv2, DROP, COPA.

**These tests show linear scaling:**

- TriviaQA ( parameter estimate to achieve 100%)
- BoolQ ()
- MultiRC ()
- ARC ()
- SuperGLUE ()
- WSC ()
- WebQs ()
- Cycled ()
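Estimates like the ones in this list can be obtained by fitting a straight line and solving for the point where it reaches 100%. The sketch below assumes that "linear" means linear in the base-10 logarithm of the parameter count, which is an assumption on my part. The function `params_at_target` is hypothetical and the scores are invented, not taken from the paper.

```python
import numpy as np

# Parameter counts of the 8 GPT-3 model sizes (125M up to 175B).
GPT3_SIZES = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])


def params_at_target(n_params, scores, target=100.0):
    """Extrapolate a linear fit of score vs. log10(parameters) and
    return the parameter count at which it reaches `target`.

    Returns infinity if the fitted slope is not positive.
    """
    x = np.log10(np.asarray(n_params, dtype=float))
    y = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    if slope <= 0:
        return float("inf")
    return 10.0 ** ((target - intercept) / slope)


# SYNTHETIC scores, invented for illustration only: exactly 10 points
# of accuracy per order of magnitude in parameter count.
synthetic_scores = 10.0 * np.log10(GPT3_SIZES) - 60.0

estimate = params_at_target(GPT3_SIZES, synthetic_scores)
print(f"Estimated parameters for 100%: {estimate:.3g}")
```

Because the result is an exponential of a fitted ratio, small errors in the slope shift the estimate by orders of magnitude, which is one more reason to treat such numbers as indications only.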

**Some tests scale neither linearly nor asymptotically:**

- Symbol: Near exponential ()
- Arithmetic: Exponential; one-digit composite may achieve 100% at
- Reversed: Near exponential ()
- Anagrams: Polynomial ()
- ANLI: stepped, unclear
- RTE: stepped, unclear

**Summary:** About half of the tested skills will likely not scale much with larger models. The other half will (e.g., TriviaQA, SuperGLUE, arithmetic, anagrams). Going to e.g., parameters - would that make an Oracle AI? Probably it's not sufficient, but I'm interested in hearing your opinion!