I recently faced a problem that seemed trivial to solve at first but (as always) was not: how do you get the taxonomic distribution of proteins with a given InterPro domain?
I started off without even knowing exactly which protein domains to look at (I wanted to look at all that were present in certain proteins). To find out, I query UniProt through BioMart to get all the annotated InterPro domains present in a group of proteins. The same query also returns a bunch of other info (other IDs, names).
We now switch to the InterPro database and get the scientific name (and other stuff) of every species that has proteins containing these InterPro domains. Again, BioMart is our friend.
To assess the distribution of these proteins across clades of species, one needs more information about the species themselves, namely where each one sits in the taxonomy. Getting there was the bit that was not so obvious to me.
We define two functions: one to get the ID of a taxon based on its name (this may actually fail if the species name has extra stuff appended, like “strain”), and one to get the record for that taxon based on its ID. The record contains a full lineage description.
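
To make the two functions concrete, here is a minimal sketch in Python that talks to the NCBI E-utilities (`esearch` and `efetch` on the `taxonomy` database) using only the standard library. This is an assumption on my part about how to implement them, not necessarily the original code: the function names `get_tax_id`, `get_tax_record` and `clade_counts`, the `fetch` parameter (there so the parsing can be tried without a network connection) and the shape of the returned dictionary are all made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter

# Base URL of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def http_get(url):
    """Download a URL and return the body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def get_tax_id(species_name, fetch=http_get):
    """Return the first taxonomy ID matching a name, or None if nothing matches.

    Names with appended text (e.g. a strain label) may return no hit.
    `fetch` is a hypothetical hook: any callable that maps a URL to XML text.
    """
    query = urllib.parse.urlencode({"db": "taxonomy", "term": species_name})
    root = ET.fromstring(fetch(f"{EUTILS}/esearch.fcgi?{query}"))
    ids = [elem.text for elem in root.findall("./IdList/Id")]
    return ids[0] if ids else None


def get_tax_record(tax_id, fetch=http_get):
    """Return a small dict (ID, scientific name, lineage) for a taxonomy ID."""
    query = urllib.parse.urlencode(
        {"db": "taxonomy", "id": tax_id, "retmode": "xml"}
    )
    root = ET.fromstring(fetch(f"{EUTILS}/efetch.fcgi?{query}"))
    taxon = root.find("./Taxon")
    if taxon is None:
        return None
    # The Lineage element is a "; "-separated string running from the root
    # of the taxonomy down to the parent of this taxon.
    lineage_text = taxon.findtext("Lineage") or ""
    return {
        "tax_id": taxon.findtext("TaxId"),
        "name": taxon.findtext("ScientificName"),
        "lineage": [p.strip() for p in lineage_text.split(";") if p.strip()],
    }


def clade_counts(records, clades):
    """Count how many records have each clade of interest in their lineage."""
    counts = Counter()
    for record in records:
        for clade in clades:
            if clade in record["lineage"]:
                counts[clade] += 1
    return counts
```

With a network connection, `get_tax_record(get_tax_id("Homo sapiens"))` should give a record whose lineage includes names such as Eukaryota and Metazoa, and `clade_counts` then tallies a list of such records against whichever clades you care about. When looping over many species, keep in mind that NCBI limits requests without an API key to about three per second, so a short pause between calls is a good idea.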