The False Hope of Usable Data Analysis
I changed the regular schedule of the posts because I wanted to write down these ideas.
A few days ago, in a panel at EuBIAS, I argued again that scientists should learn how to programme. I also argued that usability of bioimage analysis was a false hope.
Now, to be sure: usability is great, but usability does not mean usable without programming skills. Good usable programming environments can be the most usable way to achieve something [1]. I find the Python environment one of the most usable for data analysis currently, although there is still a lot of work which could improve it.
§
We can build communication systems without words, but only if the vocabulary is very limited. Otherwise, people need to learn how to read [2]. I think this is a good analogy for non-programming environments.
§
The problem is that image analysis (or data analysis) is not a closed goal. Whatever we are doing today will probably be packaged into simple-to-use tools, but the problems will grow in size and complexity.
For a fixed target, like sending email or writing a blog, we can build nice tools that don't require programming. Any modern email client basically does email well enough. There is probably only a small set of behaviours we want our blogs to do (like scheduling a post) and I think we can get a small set of features that covers 95%+ of uses. There might be a need for a few hundred plugins, but not constant innovation. There is no constant pressure to do 10 times more.
But data analysis is not in the same category as sending email. It's an open-ended problem, which has been growing continuously and will keep growing. Only a full-blown artificial intelligence system will be able to deal with the sort of analyses that we will want to do in 10 years. There are even analyses that we already want to do, but for which we do not yet have the right code and tools.
§
If anything, as time has passed, I have felt more and more of a need to think in low-level terms [3].
A few years ago, push-button analysis was sufficient for most problems. Load your data into Excel, select the rows, and plot. Fit a line, compute some stats. STATA gave you a bit more power if Excel did not suffice. Now the problems have grown and push-button solutions do not scale. Not only do we have more data, we have more complex, more unstructured data.
A few years ago, pointing out that Excel can only handle 1 million rows would have made you seem like a technically-obsessed weirdo; now it is a serious limitation.
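To make the row limit concrete, here is a minimal sketch of the kind of thing a few lines of Python let you do once a table no longer fits in a spreadsheet: compute a statistic by streaming over the rows, one at a time. Everything named here is an assumption made for illustration: the column names (sample_id, intensity), the synthetic_rows stand-in for a real file, and the streaming_mean helper. It uses only the standard library.

```python
import csv
from typing import Iterable, Iterator, Tuple

# Row limit of a modern Excel worksheet (2**20 rows).
EXCEL_MAX_ROWS = 1_048_576


def synthetic_rows(n: int) -> Iterator[str]:
    """Yield CSV lines one at a time (hypothetical stand-in for a large file)."""
    yield "sample_id,intensity"
    for i in range(n):
        yield f"{i},{i % 10}"


def streaming_mean(lines: Iterable[str], column: str) -> Tuple[int, float]:
    """Return (row count, mean of `column`), keeping only one row in memory."""
    count = 0
    total = 0.0
    # csv.DictReader accepts any iterable of text lines, not just file objects.
    for row in csv.DictReader(lines):
        count += 1
        total += float(row[column])
    if count == 0:
        raise ValueError("no data rows")
    return count, total / count


if __name__ == "__main__":
    # One row more than an Excel worksheet can hold.
    n_rows, mean = streaming_mean(synthetic_rows(EXCEL_MAX_ROWS + 1), "intensity")
    print(n_rows, mean)
```

With a real file, the same function should work on an open file handle, for example one opened with open(path, newline=""), since the reader only needs an iterable of lines. Memory use stays flat however many rows there are, which is the point: the size of the table stops being the limiting factor.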
A few years ago, people were writing things like "feel free to use interpreted languages, it doesn't matter that you're losing performance compared to C; computers are super-fast, waste them." Now, there is much more interest in building implementations that are as fast as C (normally using just-in-time compilation).
This will not get better, and just saying that tools should be easier for non-programmers is missing the point.
§
Programming is like writing: a general purpose technological skill which transforms all activities. And this means that, eventually, it becomes useful (or even necessary) for many activities which are outside the core of programming (who'd have thought a salesperson would have to know how to read and write? A firefighter?).
Almost any job that does not require programming is one which can be done by a robot. Except entertainment and those jobs that Tyler Cowen, for lack of a better word, calls marketing. Tyler calls them marketing, but prostitution might be just as accurate, as it is about providing not a specific service or product, which could be provided by a machine, but the general positive feeling that comes from human contact [4].
Related
Bayes and Big Data by Cosma Shalizi
The Average is Over by Tyler Cowen
[1] If you wish, read "scripting" for "programming". I never cared much for this division.
[2] If you google for traffic signs you'll see that actually most images have at least one sign with words or images.
[3] The need to manage parallelism (as our cores multiply, but do not get faster) and memory access patterns (as data grows faster than RAM) has forced me to think about exactly what is happening in my machines.
[4] Obviously, Tyler is right to use the word marketing even if it's not a good fit. Prostitution has a strong negative charge.