SMITHSONIAN API AND DATA CLEANING
1. Fetch data using the Smithsonian API: identify the query and break down the complex data structure to extract the relevant data.
2. Identify inconsistencies and parse the data with Python.
3. Document the work on GitHub.
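As a rough illustration of step 1, the sketch below builds a search request and flattens one nested record. The endpoint URL, the query parameters, and the `content` / `freetext` / `label` record layout are assumptions about the Smithsonian Open Access API, not details taken from this project, and `build_search_url`, `fetch_rows`, and `extract_fields` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Smithsonian Open Access search API.
SEARCH_URL = "https://api.si.edu/openaccess/api/v1.0/search"


def build_search_url(query, api_key, start=0, rows=100):
    """Build the search URL; the parameter names are assumptions."""
    params = {"q": query, "api_key": api_key, "start": start, "rows": rows}
    return f"{SEARCH_URL}?{urlencode(params)}"


def fetch_rows(query, api_key, rows=100):
    """Request one page of results and return the list of records."""
    with urlopen(build_search_url(query, api_key, rows=rows)) as resp:
        payload = json.load(resp)
    # Assumed response shape: {"response": {"rows": [...]}}
    return payload.get("response", {}).get("rows", [])


def extract_fields(record, wanted):
    """Collect labelled free-text notes from one record into a flat dict.

    Assumes each record looks like
    {"content": {"freetext": {<group>: [{"label": ..., "content": ...}]}}}.
    """
    flat = {}
    freetext = record.get("content", {}).get("freetext", {})
    for entries in freetext.values():
        for entry in entries:
            # "Bloom Time" -> "bloom_time"
            key = entry.get("label", "").strip().lower().replace(" ", "_")
            if key in wanted:
                flat[key] = entry.get("content")
    return flat
```

Keeping the URL builder separate from the network call makes the query easy to inspect before any request is sent, and `extract_fields` can be tried on a saved record without an API key.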
Holy Ghost Orchid: Before
{
"bloom_time": "June to October",
"pollination_syndrome": "Bee (male Euglossa)",
"fragrance": "Fragrant (strong). Sweet and fruity from afar, but spicy up close.",
"bloom_characteristics": "Erect inflorescence is 3-4.5 feet (89-140 cm) long 10-20 waxy, white flowers that open in succession. Flowers are 2\" (5 cm) across."
}
Holy Ghost Orchid: Cleaned
{
"cleaned_bloom_months": [ 6, 7, 8, 9, 10 ],
"cleaned_peak_months": [ "Unknown" ],
"pollinator_types": "Euglossine Bees",
"fragrance_strength": "Strong",
"fragrance_types": "pleasant, sweet, fruity, spicy",
"fragrance_notes": "fragrant, sweet, fruity, spicy",
"fragrant_time": "day_and_night"
}
I parsed bloom_time into month numbers by defining a month_map and a seasons_map. I also printed all the fragrance descriptors and binned them into fragrance types, which group the more specific fragrance note descriptions. The pollination syndrome entries vary in complexity, so I parsed them with keywords such as “euglossine bees”, “moths”, and “flies”, and labeled the rest “others”.
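A minimal sketch of the month and pollinator parsing could look like the following. The names month_map and seasons_map come from the description above, but their contents, the Northern Hemisphere season definitions, the keyword lists, and the helper names `month_range`, `parse_bloom_months`, and `classify_pollinator` are assumptions for illustration rather than the project's actual code.

```python
import re

# Assumed contents; only the names month_map and seasons_map are from the text.
month_map = {
    "january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
    "july": 7, "august": 8, "september": 9, "october": 10, "november": 11,
    "december": 12,
}
# Northern Hemisphere meteorological seasons (an assumption).
seasons_map = {
    "spring": [3, 4, 5], "summer": [6, 7, 8],
    "fall": [9, 10, 11], "autumn": [9, 10, 11], "winter": [12, 1, 2],
}
MONTH_NAMES = "|".join(month_map)


def month_range(start, end):
    """Inclusive month range that wraps past December (11 -> 2 gives 11, 12, 1, 2)."""
    months = [start]
    while months[-1] != end:
        months.append(months[-1] % 12 + 1)
    return months


def parse_bloom_months(text):
    """Turn free text such as 'June to October' into a list of month numbers."""
    lowered = text.lower()
    # First try an explicit range between two month names.
    span = re.search(
        rf"\b({MONTH_NAMES})\b\s*(?:to|through|-)\s*\b({MONTH_NAMES})\b", lowered
    )
    if span:
        return month_range(month_map[span.group(1)], month_map[span.group(2)])
    # Otherwise collect any individual month names, then fall back to seasons.
    months = [month_map[m] for m in re.findall(rf"\b(?:{MONTH_NAMES})\b", lowered)]
    if not months:
        for season, season_months in seasons_map.items():
            if re.search(rf"\b{season}\b", lowered):
                months.extend(season_months)
    # Drop duplicates while keeping order; unparseable text becomes ["Unknown"].
    return list(dict.fromkeys(months)) or ["Unknown"]


# Keyword rules are checked in order; the first match wins.
POLLINATOR_RULES = [
    ("Euglossine Bees", ("euglossa", "euglossine")),
    ("Moths", ("moth",)),
    ("Flies", ("fly", "flies")),
]


def classify_pollinator(text):
    """Map a free-text pollination syndrome onto a coarse category."""
    lowered = text.lower()
    for label, keywords in POLLINATOR_RULES:
        # \b at the start keeps "fly" from matching inside "butterfly".
        if any(re.search(rf"\b{k}", lowered) for k in keywords):
            return label
    return "Others"
```

With these assumptions, `parse_bloom_months("June to October")` returns `[6, 7, 8, 9, 10]` and `classify_pollinator("Bee (male Euglossa)")` returns `"Euglossine Bees"`, matching the cleaned record shown earlier.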
Another challenge was identifying the inflorescence length and number of flowers from the “bloom characteristics” field, since the wording and units were inconsistent. After multiple rounds of Python parsing, I went back and manually entered the outliers that the code had not caught. Additional code then converted all length units to inches.
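One way to approach this extraction is with regular expressions, sketched below. The patterns, the supported unit spellings, and the helper names `lengths_in_inches` and `flower_count` are assumptions for illustration; they are not the project's actual code and would still miss the kind of outliers that needed manual entry.

```python
import re

# Conversion factors to inches for the unit spellings this sketch recognizes.
TO_INCHES = {
    "feet": 12, "foot": 12, "ft": 12,
    "inches": 1, "inch": 1, "in": 1, '"': 1,
    "cm": 1 / 2.54, "mm": 1 / 25.4,
}

# A number, an optional "-upper" bound, then a unit: "3-4.5 feet", '2"', "5 cm".
LENGTH_RE = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?\s*"
    r'(feet|foot|ft|inches|inch|in\b|"|cm|mm)',
    re.IGNORECASE,
)

# A count or range followed within a few words by "flowers": "10-20 waxy, white flowers".
FLOWER_RE = re.compile(
    r"\b(\d+)(?:\s*-\s*(\d+))?\s+(?:[\w,]+\s+){0,4}?flowers\b",
    re.IGNORECASE,
)


def lengths_in_inches(text):
    """Return every measurement in the text as a (low, high) pair in inches."""
    results = []
    for low, high, unit in LENGTH_RE.findall(text):
        factor = TO_INCHES[unit.lower()]
        high = high or low  # a single value is treated as a zero-width range
        results.append((round(float(low) * factor, 2),
                        round(float(high) * factor, 2)))
    return results


def flower_count(text):
    """Return (min, max) flowers per inflorescence, or None if nothing matches."""
    match = FLOWER_RE.search(text)
    if not match:
        return None
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return (low, high)
```

Applied to the Holy Ghost Orchid text above, the first measurement found is the inflorescence length, `(36.0, 54.0)` inches, and `flower_count` gives `(10, 20)`. Deciding which measurement belongs to the inflorescence and which to the individual flower is left out of this sketch.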