I.
Many people uncritically assume that matrix reasoning is a good test of intelligence, when in reality it plays third fiddle to tests of vocabulary and arithmetic. I think that matrix reasoning has been promoted online as a good IQ test for two reasons:
Linguistic invariance: matrix tests require no language to administer, which makes them convenient for comparisons across countries.
Subconscious association: tests of mathematical/verbal ability are not specific to psychology and are commonly found in educational institutions. Raven's-style tests, on the other hand, are only used to test IQ. This leads people to subconsciously associate IQ testing with matrix reasoning more than with other subtests.
There is no concrete evidence that either of these theories is true, but I do know that people value matrix reasoning highly as an IQ test despite the fact that it tests intelligence poorly. While matrix reasoning tests are listed in the ‘Comprehensive Online Resources’ of r/CognitiveTesting, no individual test of mathematical or verbal ability is. It’s also one of the most searched-for IQ tests on Wikipedia, indicating that it is highly notable.
II.
These are the three aspects by which an IQ test should be evaluated:
g-loading: because g is the most predictive part of the IQ test, tests that correlate more with it are better.
Reliability: a good test should measure consistently, yielding similar scores across items and across repeated administrations.
Time: if an ability can be tested in relatively little time, the time saved can be used to generate a better estimate of that ability or to test other abilities.
To determine how g-loaded matrix reasoning is relative to other good, but not comprehensive, tests of IQ, I compared the g-loadings of matrix reasoning, arithmetic reasoning, and vocabulary in factor-analytic studies. For sources, I relied on Google Search, Google Scholar, some material from my old blog, and some data Emil compiled. I found that the most g-loaded subtest was vocabulary (0.75), followed by arithmetic (0.70), and then matrix reasoning (0.69). In comparison, the SAT has a g-loading of about 0.8-0.86 based on the available evidence.
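For readers unfamiliar with where a g-loading comes from, the factor-analytic idea can be sketched in a toy example. The correlation matrix below is made up purely for illustration (real studies use much larger subtest batteries and proper factor models), and the first principal component is used as a quick stand-in for the g factor:

```python
import numpy as np

# Hypothetical correlations among three subtests: vocabulary,
# arithmetic, matrix reasoning. Illustrative numbers, not real data.
R = np.array([
    [1.00, 0.55, 0.50],
    [0.55, 1.00, 0.48],
    [0.50, 0.48, 1.00],
])

# First-principal-component loadings: a common quick proxy for
# g-loadings when a full factor model isn't fitted.
eigvals, eigvecs = np.linalg.eigh(R)
first = eigvecs[:, np.argmax(eigvals)]
loadings = first * np.sqrt(eigvals.max())
loadings *= np.sign(loadings.sum())  # fix the eigenvector's arbitrary sign
print(np.round(loadings, 2))
```

With made-up correlations like these, the subtest that correlates most strongly with the others ends up with the highest loading, which is essentially all a g-loading summarizes.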
A different method, which calculates the raw correlation between the WAIS and Raven's tests, finds that they correlate at about 0.67, according to an informal meta-analysis done by the openpsychometrics team.
Typically, the reliability of a composite IQ test is about 0.9-0.95, depending on the method used to extract the reliability coefficient. In a similar fashion, I also gathered the reliability of matrix reasoning tests. On average, the internal consistency is about 0.9, while the test-retest reliability is about 0.8, depending on the interval.
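Incidentally, these reliability figures bear on how to read the ~0.67 WAIS-Raven's correlation above. The classical correction for attenuation (a textbook psychometric formula; the specific numbers plugged in below are just the rough estimates from this section) shows what the correlation would look like if neither test had measurement error:

```python
from math import sqrt

def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: estimate the true-score
    correlation from an observed correlation and the two tests'
    reliability coefficients."""
    return r_xy / sqrt(rel_x * rel_y)

# Rough estimates from this section: observed WAIS-Raven's r of 0.67,
# ~0.95 reliability for the WAIS composite, and ~0.8 test-retest
# reliability for matrix reasoning.
print(round(disattenuate(0.67, 0.95, 0.80), 2))  # -> 0.77
```

Even after this generous correction, the estimate stays below the 0.8-0.86 range quoted for the SAT, which is consistent with the argument here.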
The poor test-retest reliability of matrix reasoning seems to be shared with the other nonverbal tests in the WAIS. This leads me to believe that daily or monthly fluctuations in cognitive ability affect tests that require intensive thinking more than those that rely on accumulated knowledge.
Lastly, it is also important for a test to be completed quickly. Different sources claim that the Raven's SPM takes between 30 and 60 minutes to complete [1] [2] [3] [4]. In comparison, the WAIS-IV takes about 70 minutes to administer, meaning that an individual could complete about 6 WAIS subtests in the time it takes to do the whole Raven's test.
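The arithmetic behind the "about 6 subtests" claim can be made explicit. The per-subtest figure below assumes the WAIS-IV's ten core subtests and takes the midpoint of the 30-60 minute Raven's range; both are my assumptions, not figures from the sources:

```python
# Back-of-the-envelope check of the "about 6 subtests" claim.
wais_total_minutes = 70     # approximate full WAIS-IV administration time
wais_core_subtests = 10     # assumption: the ten core subtests
ravens_minutes = 45         # assumption: midpoint of the 30-60 minute range

minutes_per_subtest = wais_total_minutes / wais_core_subtests
subtests_in_ravens_time = ravens_minutes / minutes_per_subtest
print(round(subtests_in_ravens_time, 1))  # -> 6.4
```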
Academic literature on the time required to complete an IQ test is surprisingly slim; all I have managed to find is one study that uses a clinical sample with an average IQ of about 85. That study found that matrix reasoning, arithmetic reasoning, and vocabulary each take about 5-6 minutes to administer in the WAIS-IV. Personally, I don’t put much stock in it.
There is also a myth that the Raven's is a very pure measure of general intelligence. This is not true, and it has been known to be untrue for quite a long time, since about the mid-20th century, yet the myth persists for unknown reasons. The test itself seems to load on visuospatial ability in addition to general intelligence, according to an analysis by Gignac.
III.
In summary, I would advise against the use of Raven's-style tests to measure intelligence, unless linguistic constraints make their administration necessary. They are not highly g-loaded, they are relatively unreliable, and they take too long to administer.
Perhaps a bigger problem is how easy the test is to practice for, or to teach to others. Since subjects differ in their exposure to such tests, this decreases validity, especially in the modern internet age of mass exposure. Online samples probably suffer even more from this.
"will look into this"