Link
Only one link, but it is such a great one that it is worth making a full post about it!
This excellent essay by Maxwell Tabarrok is something that I wish I could submit as a grant application. It describes what I think is the major limitation in peptide and small-protein research right now, namely that we do not have enough high-quality data:
ML needs data. Google’s AlphaGo trained on 30 million moves from human games and orders of magnitude more from games it played against itself. The largest language models are trained on at least 60 terabytes of text. AlphaFold was trained on just over 100,000 3D protein structures from the Protein Data Bank.
The data available for antimicrobial peptides is nowhere near these benchmarks. Some databases contain a few thousand peptides each, but they are scattered, unstandardized, incomplete, and often duplicative. Data on a few thousand peptide sequences and a scattershot view of their biological properties are simply not sufficient to get accurate ML predictions for a system as complex as protein-chemical reactions. For example, the APD3 database is small, with just under 4,000 sequences, but it is among the most tightly curated and detailed.
There is a good “progress studies” angle to it too:
Expanding the dataset of peptides and including negative observations is feasible and desirable, but no one in science has an incentive to do it. Open data sets are a public good: anyone can costlessly copy-paste a dataset, so it is difficult and often socially wasteful to put it behind a paywall.
As a minor quibble, I do disagree with the following:
Non-monetary rewards in academia, like publications and prestige, point toward splashy results in big journals, not toward foundational infrastructure like open datasets.
I think that if it were generated, this dataset would result in as high-profile a paper as it gets.[1] The real problem is finding the large amount of funding that would be required. If you are in a position to write large checks (or even smaller checks, since one good thing about this idea is that it can start small and be scaled up), let me know!
[1] It would easily be publishable in one of the three top journals in the life sciences, often abbreviated as CNS (Cell, Nature, Science).
I would love to see this happen. I remember messing around with GANs trying to generate antimicrobial peptides back in 2020 and being sad at how small the datasets were.
How much do you think a million-peptide database would actually cost? And how long would it take?
Astera (https://astera.org/first-residency-cohort/) and Renaissance Philanthropy (https://renaissancephilanthropy.substack.com/) seem like they might be useful for finding funding.