Since July, I’ve been participating in a research internship with SPARKS and the Cannabis Research Institute (University of Illinois System). The project involves creating a network graph of the endocannabinoid system to analyze structural and functional insights taken from metric, clustering, and perturbation results. But instead of taking the time here to break down all the fancy-sounding terminology I used in my last sentence, I want to talk about the biggest challenge I faced throughout the internship—and how I eventually overcame it. Two words, endless frustration: data curation.
The project required me to pull data about proteins and chemicals that interact within the endocannabinoid system. Unfortunately, the data available online is both sparse and labeled without a universal standard, so one database might refer to a protein or chemical one way while another database refers to the same thing differently. For example, search for the chemical “Noladin” in PubChem’s database, and you get an accurate best-match result. But search for “Noladin” in ChEMBL’s database, and the top result is a chemical named “Solamin”…which is not the same thing. Now imagine how this issue compounds when you’re querying across five different databases instead of only the two I just mentioned.

So, in short, scraping for data this summer was an absolute pain in the neck. And I realized that if these databases weren’t going to cooperate, I’d have to find my own tricks to make them cooperate. Here’s what I did:
- Data Normalization/Harmonization: I used an ID system to create my own standardization for proteins and chemicals across multiple databases. For proteins, I used UniProt ID, because it is recognized by most other molecular databases. Similarly, for chemicals, I used ChEMBL chemical ID (CID) for the same reason. Instead of searching these databases by a protein or chemical name (which some databases might recognize and others might not), I was able to perform consistent, low-risk queries using the IDs I had assigned to each protein and chemical.
- Alias Mapping and Data Abstraction: I created my own naming system for use within my local project. This way, a user running my code to gather information about the protein or chemical of their choosing can use whatever naming conventions they prefer. For example, some people like to abbreviate delta-9-tetrahydrocannabinol to 9-THC, while others (including me) more pedantically write Δ9-THC. I know this difference seems minimal (and almost silly), but when you’re trying to get five different databases to agree on whether to use the delta symbol, it’s actually quite frustrating. Regardless, my code allows users to call THC whatever they want. All of the following would work: 9-THC, Δ9-THC, 9THC, Δ9THC, etc. You could also name it basically anything in the world, and my code would run with it: chemical, hi, elephant, chair, explosion, etc. So long as you link the name of your choice to the appropriate ID, anything goes.
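
To make the alias idea concrete, here is a minimal sketch of how such a layer could look in Python. It is an illustration under stated assumptions, not the project's actual code: the `AliasRegistry` class, the `normalize_alias` helper, and the ID string are all hypothetical names made up for this example, and the ID is a placeholder rather than a real database identifier. The point is only that every user-chosen name resolves to one canonical ID, and the ID (never the name) is what gets sent to the databases.

```python
import unicodedata


def normalize_alias(name: str) -> str:
    """Normalize Unicode, trim whitespace, and case-fold a user-supplied name."""
    return unicodedata.normalize("NFC", name).strip().casefold()


class AliasRegistry:
    """Maps any number of user-chosen names onto one canonical ID."""

    def __init__(self) -> None:
        self._alias_to_id: dict[str, str] = {}

    def register(self, canonical_id: str, *aliases: str) -> None:
        """Link each alias to canonical_id; refuse to silently re-point an alias."""
        for alias in aliases:
            key = normalize_alias(alias)
            existing = self._alias_to_id.get(key)
            if existing is not None and existing != canonical_id:
                raise ValueError(f"{alias!r} already points to {existing}")
            self._alias_to_id[key] = canonical_id

    def resolve(self, name: str) -> str:
        """Return the canonical ID for a name, or raise KeyError if unknown."""
        try:
            return self._alias_to_id[normalize_alias(name)]
        except KeyError:
            raise KeyError(f"no ID registered for {name!r}") from None


# Placeholder ID for illustration only; not a real database identifier.
THC_ID = "CHEM-PLACEHOLDER-0001"

registry = AliasRegistry()
registry.register(THC_ID, "9-THC", "Δ9-THC", "9THC", "Δ9THC", "elephant")

# Every spelling resolves to the same ID, which is what a query would use.
for name in ("Δ9-THC", "9thc", " Elephant "):
    print(name, "->", registry.resolve(name))
```

Raising an error when an alias is already linked to a different ID is a design choice in this sketch: it keeps one name from quietly pointing at two different chemicals, which is exactly the kind of ambiguity the ID system is meant to avoid.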
The end result? Over 300 megabytes of data successfully curated and organized for what is considered a relatively under-researched system. And now my data is open-sourced so that other endocannabinoid system researchers can just work…you know…without having to go through more of the craziness that I had to.




