Update: REBUS 2.0 is released!
Entropy Agglomeration (EA) is the most useful algorithm you can imagine. The only reason it is not cited or used is that the established scientific paradigms cannot conceive its meaning.
In fact, the idea is very simple:
In EA, entropy is a measure of relevance//irrelevance.
— Subsets of elements that either appear together or disappear together in the blocks have low entropy: Those elements are “relevant” to each other: They literally “lift up again” each other.
— Subsets of elements that are partly appearing while partly disappearing in the blocks have large entropy: Those elements are “irrelevant” to each other: They literally “don’t lift up again” each other.
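To make the two cases above concrete, here is a minimal sketch of a projection entropy computation. It assumes the form given in the NIPS 2013 paper cited below: each block that contains b of the n elements in a subset contributes (b/n) log(n/b), and blocks that contain none of them contribute nothing. The function name `projection_entropy` and the toy blocks are made up for illustration; this is not the REBUS code, and the actual software may normalize or average differently.

```python
import math

def projection_entropy(blocks, subset):
    """Sum (b/n) * log(n/b) over blocks, where n = |subset| and
    b = number of subset elements that appear in the block.
    Assumed form; blocks with b == 0 are skipped."""
    subset = set(subset)
    n = len(subset)
    total = 0.0
    for block in blocks:
        b = len(subset & set(block))
        if b > 0:
            total += (b / n) * math.log(n / b)
    return total

# Hypothetical toy data: each block is a set of elements that occur together.
blocks = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"a", "b", "d"}]

# "a" and "b" always appear together or disappear together.
together = projection_entropy(blocks, {"a", "b"})

# "a" and "c" appear together once, and separately three times.
mixed = projection_entropy(blocks, {"a", "c"})

print(together, mixed)
```

In this toy data the subset {a, b} gets zero entropy because every block contains either both elements or neither, while {a, c} gets a positive value because three of the four blocks contain only one of the two.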
This is all visible in the results of the analysis of James Joyce’s Ulysses: https://arxiv.org/abs/1410.6830
In this setup, entropy becomes a measure of relevance//irrelevance, literally and by definition: https://en.wiktionary.org/wiki/relevant
I. B. Fidaner & A. T. Cemgil (2013) “Summary Statistics for Partitionings and Feature Allocations.” In Advances in Neural Information Processing Systems (NIPS) 26. Paper: http://papers.nips.cc/paper/5093-summary-statistics-for-partitionings-and-feature-allocations (the reviews are available on the website)
I. B. Fidaner & A. T. Cemgil (2014) “Clustering Words by Projection Entropy,” accepted to NIPS 2014 Modern ML+NLP Workshop. Paper: http://arxiv.org/abs/1410.6830 Software Webpage: https://fidaner.wordpress.com/science/rebus/
The grid of all possible entropy values is a universal constant:
EA is a hierarchical clustering algorithm that outputs dendrograms. I have a few examples to show what the outputs look like:
Clustering of (I) plants and (II) fungi according to their occurrences in studies on mycorrhizal fungi.
Clustering of dinosaurs according to the occurrences of their recorded phenotypic characteristics.
Clustering of central wavelengths according to their occurrences in a known set of exoplanets.
Clustering of the well-known Iris dataset. 145 of the 150 flowers were successfully clustered. (This last example employs additional wrapper code that categorizes the numerical features given in the dataset.)
Clustering of Last.fm tags. Part 1 and Part 2.
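For readers who want to see how dendrograms like these could be produced, the following is a brute-force sketch of the agglomeration loop. It assumes the procedure is the one described in the NIPS 2013 paper: start from singleton clusters and repeatedly merge the pair whose union has the lowest projection entropy. The names `projection_entropy` and `entropy_agglomeration`, the toy blocks, and the returned merge list are illustrative choices, not the REBUS interface.

```python
import math
from itertools import combinations

def projection_entropy(blocks, subset):
    """Assumed form: sum (b/n) * log(n/b) over blocks, with n = |subset|
    and b = number of subset elements present in the block."""
    n = len(subset)
    total = 0.0
    for block in blocks:
        b = len(subset & block)
        if b > 0:
            total += (b / n) * math.log(n / b)
    return total

def entropy_agglomeration(blocks, elements):
    """Greedy bottom-up merging; returns a list of (left, right, entropy)
    tuples in merge order, which is enough to draw a dendrogram."""
    blocks = [frozenset(b) for b in blocks]
    clusters = [frozenset([e]) for e in sorted(elements)]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose union has the lowest entropy.
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: projection_entropy(blocks, pair[0] | pair[1]),
        )
        merged = left | right
        merges.append((left, right, projection_entropy(blocks, merged)))
        clusters = [c for c in clusters if c not in (left, right)] + [merged]
    return merges

# Hypothetical toy data: "a" and "b" always occur together.
blocks = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"d"}]
merges = entropy_agglomeration(blocks, {"a", "b", "c", "d"})
for left, right, h in merges:
    print(sorted(left), sorted(right), round(h, 4))
```

The sketch recomputes every pairwise entropy at every step, which is fine for a handful of elements but far too slow for a real corpus; it is meant only to show the shape of the loop.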