Vaniver's Comments

GPT-3: a disappointing paper
I take a critical tone here in an effort to cut that hype off at the pass.

Maybe this is just my AI safety focus, or something, but I find myself annoyed by 'hype management' more often than not; I think the underlying root cause of the frustration is that it's easier to reach agreement on object-level details than interpretations, which are themselves easier than interpretations of interpretations.

Like, when I heard "GPT-3", I thought "like GPT-2, except one more," and from what I can tell that expectation is roughly accurate. The post agrees, and notes that since "one" doesn't correspond to anything here, the main thing this tells you is that this transformer paper came from people who feel like they own the GPT name instead of people who don't feel that. It sounds like you expected "GPT" to mean something more like "paradigm-breaker" and so you were disappointed, but this feels like a ding on your expectations more than a ding on the paper.

But under the hype management goal, the question of whether we should celebrate it as "as predicted, larger models continue to perform better, and astoundingly 175B parameters for the amount of training we did still hasn't converged" or criticize it as "oh, it is a mere confirmation of a prediction widely suspected" isn't a question of what's in the paper (neither take disputes that), or even of your personal take, but of what you expect the social distribution of takes to be, so that your statement exerts the right pull on the group's beliefs.


Maybe putting this another way, when I view this as "nostalgebraist the NLP expert who is following and sharing his own research taste", I like the post, as expert taste is useful even if you the reader disagree; and when I view it as "nostalgebraist the person who has goals for social epistemology around NLP" I like it less.

OpenAI announces GPT-3

But does it ever hallucinate the need to carry the one when it shouldn't?

Speculations on the Future of Fiction Writing
The movie industry has been around long enough, and is diverse enough, that I'd be very surprised if there were million-dollar bills lying around waiting to be picked up like this.

Prediction markets for box office results are more than a million dollar bill, I think, and yet reduce the power of the people who decide whether or not they get used.

Also, speaking of people caring about accuracy, it reminds me of the story Neil deGrasse Tyson tells about confronting James Cameron about the lazy fake sky in Titanic, to which Cameron responded:

Last I checked, Titanic has grossed a billion dollars worldwide. Imagine how much more it would have grossed had I gotten the sky correct.

But the ending of the story is that later they hired him to make an accurate sky for the director's cut, and he started a company that now provides that service.

It wouldn't shock me if a firm of smart rational-fic writers could do this sort of 'script doctoring' cheaply enough to be worth it to filmmakers, and the main problem is that the buyers don't know what to ask for and the sellers don't know how to find the buyers.

Studies On Slack

From The Sources of Economic Growth by Richard Nelson, but I think it's a quote from James Fisk, Bell Labs President:

If the new work of an individual proves of significant interest, both scientifically and in possible communications applications, then it is likely that others in the laboratory will also initiate work in the field, and that people from the outside will be brought in. Thus a new area of laboratory research will be started. If the work does not prove of interest to the Laboratories, eventually the individual in question will be requested to return to the fold, or leave. It is hoped the pressure can be informal. There seems to be no consensus about how long to let someone wander, but it is clear that young and newly hired scientists are kept under closer rein than the more senior scientists. However even top-flight people, like Jansky, have been asked to change their line of research. But, in general, the experience has been that informal pressures together with the hiring policy are sufficient to keep AT&T and Western Electric more than satisfied with the output of research.

[Most recently brought to my attention by this post from a few days ago]

Predictions/questions about conquistadors?

My (weakly held) take is that a category of 'usual medieval weaponry' obscures a lot of detail that turns out to be relevant. Like even talking about 'swords', a 3 foot sword made of Toledo steel is a very different beast from a macuahuitl. They're about equally sharp and long, but the steel sword is lighter, allows fighting more closely together (note that, at this time, a lot of the successful European tactics require people somewhat tightly packed working in concert), and is more durable. (The obsidian blades, while they could slice clean through people and horses, weren't very effective against mail and would break on impact with another sword.)

Predictions/questions about conquistadors?
I predict that guns weren't that big a deal; they probably were useful as surprise weapons (shocking and demoralizing enemies not used to dealing with them) but that most of the fighting would be done by swords, bows, etc.

I think you should count pikes and swords differently, here, especially if the Spaniards are using the pike square.

Trust-Building: The New Rationality Project
What we want isn't a lack of factionalism, it's unity. ... You have high trust in this network, and believe the evidence you receive from it by default.

I think one of the ways communities can differ is in their local communication norms. Rather than saying something like "all communities have local elders whose word is trusted by the community", and then trying to figure out who the elders are in various communities, you can try to measure something like "how do people react to the statements of elders, and how does that shape the statements elders make?". In some communities, criticism of elders is punished, and so they can make looser or less correct statements and the community can coordinate more easily (in part because they can coordinate around more things). In other communities, criticism of elders is rewarded, and so they can only make narrower or more precise statements, and the community can't coordinate as easily (in part because they can coordinate around fewer, perhaps higher quality, things).

It seems to me like there's a lot of value in looking at specific mechanisms there, and trying to design good ones. Communities where people do more reflexive checking of things they read, more pedantic questions, and so on do mean "less trust" in many ways and do add friction to the project of communication, but that friction does seem asymmetric in an important way.

Trust-Building: The New Rationality Project
When it feels like there's no need to explore, and all you need to do is practice your routine and enjoy what you have, the right assumption is that you are missing an opportunity. This is when exploration is most urgent.

I think good advice is often of the form "in situation X, Y is appropriate"; from a collection of such advice you can build a flowchart of observations to actions, and end up with a full policy.

Whenever there is a policy that is "regardless of the observation, do Y", I become suspicious. Such advice is sometimes right--it may be the case that Y strictly dominates all other options, or it performs well enough that it's not worth the cost of checking whether you're in the rare case where something else is superior.

Is the intended reading of this "exploitation and routine is never correct"? Is exploration always urgent?

Cortés, Pizarro, and Afonso as Precedents for Takeover

Tercios were very strong during the era Conn Nugent is pointing at; "nobody in Europe could stand up to them" is probably an exaggeration but not by much. They had a pretty good record under Ferdinand II, and then for various dynastic reasons, Spain was inherited by a Habsburg who became Holy Roman Emperor, and then immediately faced coalitions against him as the 'most powerful man in Christendom.' So we don't really get to see what would have happened had they tried to fight their way to continental prominence, since they inherited their way to it instead.

It's also not obvious that, if you have spare military capacity in 1550 (or whenever), you would want to use it conquering bits of Europe instead of conquering bits elsewhere, if the difficulty for the latter is sufficiently lower and the benefits not sufficiently higher.

Why aren’t we testing general intelligence distribution?

First, you might be interested in tests like the Wonderlic, which are not transformed to a normal variable, and instead use raw scores. [As a side note, the original IQ test was not normalized--it was a quotient!--and so the name continues to be a bit wrong to this day.]

Second, when we have variables like height, there are obvious units to use (centimeters). Looking at raw height distributions makes sense. When we discover that the raw height distribution (split by sex) is a bell curve, that tells us something about how height works.

When we look at intelligence, or results on intelligence tests, there aren't obvious units to use. You can report raw scores (i.e. number of questions correctly answered), but in order for the results to be comparable the questions have to stay the same (the Wonderlic has multiple forms, and differences between the forms do lead to differences in measured test scores). For a normalized test, you normalize each version separately, allowing you to have more variable questions and be more robust to the variation in questions (which is useful as an anti-cheating measure).

But 'raw score' just pushes the problem back a step. Why the 50 questions of the Wonderlic? Why not different questions? Replace the ten hardest questions with easier ones, and the distribution looks different. Replace the ten easiest questions with harder ones, and the distribution looks different. And for any pair of tests, we need to construct a translation table between them, so we can know what a 32 on the Wonderlic corresponds to on the ASVAB.
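To make the "swap ten questions and the distribution looks different" point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the logistic item model, the evenly spaced difficulties, and the names `simulate_raw_scores`, `base_form`, and `easier_form` are hypothetical and are not how the Wonderlic is actually built.

```python
import math
import random


def simulate_raw_scores(abilities, difficulties, rng):
    """Return number-correct scores under a simple logistic item model.

    Assumption: P(correct) = 1 / (1 + exp(-(ability - difficulty))).
    """
    scores = []
    for ability in abilities:
        correct = 0
        for difficulty in difficulties:
            p_correct = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
            if rng.random() < p_correct:
                correct += 1
        scores.append(correct)
    return scores


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: abilities drawn from a standard normal.
abilities = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Hypothetical 50-item form, difficulties spaced evenly from -2 to 2.
base_form = [-2.0 + 4.0 * i / 49 for i in range(50)]

# Same form with the ten hardest items replaced by easy ones.
easier_form = base_form[:40] + [-2.0] * 10

base_scores = simulate_raw_scores(abilities, base_form, rng)
easier_scores = simulate_raw_scores(abilities, easier_form, rng)

mean_base = sum(base_scores) / len(base_scores)
mean_easier = sum(easier_scores) / len(easier_scores)

print(f"mean raw score, base form:   {mean_base:.1f}")
print(f"mean raw score, easier form: {mean_easier:.1f}")
```

The same simulated people get noticeably higher raw scores on the easier form, so a raw "32" only means something relative to a particular set of questions.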

Using a normal distribution sidesteps a lot of this. If your test is bad in some way (like, say, 5% of the population maxing out the score on a subtest), then your resulting normal distribution will be a little wonky, but all sufficiently expressive tests can be directly compared. Because we think there's this general factor of intelligence, this also means tests are more robust to inclusion or removal of subtests than one might naively expect. (If you remove 'classics' from your curriculum, the people who would have scored well on classics tests will still be noticeable on average, because they're the people who score well on the other tests. This is an empirical claim; the world didn't have to be this way.)
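As a rough sketch of what "normalizing each version separately" can look like, the snippet below maps raw scores to a mean-100, SD-15 scale by rank. The function name `normalize_scores`, the `(rank - 0.5) / n` percentile convention, and the toy score lists are assumptions for illustration; real test publishers use their own norming samples and procedures.

```python
from statistics import NormalDist


def normalize_scores(raw_scores, mean=100.0, sd=15.0):
    """Map raw scores to a normal scale using only their rank order.

    Ties share the average of the ranks they span. The percentile for
    rank r out of n is taken as (r - 0.5) / n, one common convention.
    """
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < n and raw_scores[order[j + 1]] == raw_scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [mean + sd * standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical raw scores for the same five people on two different forms.
form_a = [10, 20, 30, 40, 50]
form_b = [30, 35, 41, 46, 49]  # an easier form: everyone scores higher

normalized_a = normalize_scores(form_a)
normalized_b = normalize_scores(form_b)

print([round(x, 1) for x in normalized_a])
print([round(x, 1) for x in normalized_b])
```

Because only the rank order within each form matters, the two forms produce identical normalized scores even though the raw numbers differ. That is the sense in which normalized tests can be compared directly.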

"Sure," you reply, "but this is true of any translation." We could have said intelligence is uniformly distributed between 0 and 100 and used percentile rank (easier to compute and understand than a normal distribution!) instead. We could have thought the polygenic model was multiplicative instead of additive, and used a lognormal distribution instead. (For example, the impact of normally distributed intelligence scores on income seems multiplicative, but if we had lognormally distributed intelligence scores it would be linear instead.) It also matters whether you get the splitting right--doing a normal distribution on height without splitting by sex first gives you a worse fit.

So in conclusion, for basically as long as we've had intelligence testing there have been normalized and non-normalized tests, and today the normalized tests are more popular. From my read, this is mostly for reasons of convenience, and partly because we expect the underlying distribution to be normal. We don't do everything we could with normalization, and people aren't looking for Gaussian mixture models in a way that might make sense.
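The point about splitting before fitting a normal distribution can be checked with a small simulation. This is only a sketch: the group means (163 and 176 cm) and the shared SD (7 cm) are illustrative assumptions, not measured values, and `excess_kurtosis` is a hypothetical helper standing in for a proper goodness-of-fit test.

```python
import random


def excess_kurtosis(values):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((x - mean) ** 2 for x in values) / n
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / (second_moment ** 2) - 3.0


rng = random.Random(1)  # fixed seed so the run is repeatable

# Two hypothetical subgroups, each normal, with different means.
group_a = [rng.gauss(163.0, 7.0) for _ in range(20000)]
group_b = [rng.gauss(176.0, 7.0) for _ in range(20000)]
pooled = group_a + group_b

print(f"group A excess kurtosis: {excess_kurtosis(group_a):+.2f}")
print(f"group B excess kurtosis: {excess_kurtosis(group_b):+.2f}")
print(f"pooled  excess kurtosis: {excess_kurtosis(pooled):+.2f}")
```

Each subgroup looks normal on this measure, while the pooled sample comes out flatter than a single bell curve. A mixture of two Gaussians would describe the pooled data better than one Gaussian does.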
