
sjiveru t1_j3o8z6y wrote

Not OP, but I think the general opinion in linguistics is fairly well reflected by this Language Log post from a good decade ago, written in response to a paper about ejectives correlating with high altitudes:

> Still, the (presumably) spurious correlations of the two word-order variables with altitude remind us of the possibility for false findings here. (...)

> Whether or not the altitude/ejective correlation reveals a causal connection, we can expect the near future to bring us a large number of spurious correlational analyses, along with a few meaningful ones. There are three reasons for this:

> (1) The existence of digital datasets makes it increasingly easy to perform quantitative checks on hypotheses about possible relationships between linguistic and non-linguistic variables;

> (2) The astronomically large number of such possible relationships guarantees that many of them should exhibit a strong pair-wise connection by chance, even if all of the distributions were statistically independent;

> (3) The distributions are not statistically independent, due to factors such as cultural and geographical diffusion.

> Note that the "file drawer effect" strongly undermines the often-made argument "But I/we made the hypothesis before we checked, we didn't just dredge for correlations and then try to explain them". The data-dredging (and the associated multiple comparisons) can (and do) occur across many unconnected investigations, with only the "significant" ones getting published.

In short, such correspondences aren't impossible, but it takes a lot of effort to show that they're not just random coincidences. Languages are incredibly complex systems, and they aren't independent of each other, which makes them extremely difficult to do statistics on. Personally, I think that for a lot of purposes (including these) the set of all human languages isn't a large enough sample: the systems are too complex and too interrelated for only seven thousand data points to be anywhere near enough to show clear trends above the background noise.
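
To make point (2) from the quote concrete, here's a minimal Python sketch of the multiple-comparisons problem. Everything in it is an assumption for illustration, not real typological data: 200 made-up "languages", 40 made-up variables drawn independently at random, and a rough normal-approximation cutoff of 1.96/sqrt(n) for "significant at p < 0.05". The function name `count_chance_hits` is hypothetical too.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_hits(n_languages=200, n_variables=40, seed=0):
    """Count variable pairs that look 'significant' in purely random data."""
    rng = random.Random(seed)
    # Each variable is independent noise: no real relationship exists anywhere.
    data = [[rng.gauss(0, 1) for _ in range(n_languages)]
            for _ in range(n_variables)]
    # Assumed cutoff: approximate two-sided p < 0.05 threshold for |r|.
    threshold = 1.96 / math.sqrt(n_languages)
    pairs = list(itertools.combinations(range(n_variables), 2))
    hits = sum(1 for i, j in pairs
               if abs(pearson(data[i], data[j])) > threshold)
    return len(pairs), hits


n_pairs, n_hits = count_chance_hits()
print(f"{n_hits} of {n_pairs} pairs pass the cutoff by chance alone")
```

With 40 variables there are 780 pairs to test, so roughly 5% of them should pass the cutoff even though nothing is related to anything. Note that this only models the independent case; it says nothing about point (3), where languages share ancestry and geography, which should make spurious hits more common rather than less.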
