
Designing a new data curation method for business analysts

Internal business analysts needed a simpler way to categorize their suppliers' products. This had historically been challenging because each supplier had its own way of describing, sorting, and labeling products, which is a headache for an analyst who wants to find every product of a single kind. As part of the data science team, I combined multiple machine learning methods to help analysts quickly classify and label products regardless of how labels differed across suppliers. This research produced a more efficient method for labeling and organizing product data. See more on my Data Science page.

Goals

​Design research goals
  • Understand suppliers' and internal analysts' reasoning for product sorting

  • Create a tool for analysts to quickly label, organize, and find the products they seek

Quantitative research goals
  • Implement a combination of unsupervised and active machine learning to improve sorting efficiency. "Unsupervised machine learning" clusters data with minimal intervention from a person, and "active machine learning" labels data with ongoing input from a person.
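
The active learning half of that goal can be illustrated with a small sketch. The project's actual models are not described here, so everything below is an assumption made for illustration: a toy word-overlap scorer stands in for a real classifier, and `ask_analyst` is a hypothetical callback representing the person in the loop. The idea shown is uncertainty sampling, where the tool asks about the item it is least sure of.

```python
from collections import defaultdict


def tokens(text):
    """Lowercase a product description and split it into a set of words."""
    return set(text.lower().split())


def label_scores(item, labeled):
    """Score each known label by its best word overlap with a labeled example."""
    scores = defaultdict(float)
    for text, label in labeled.items():
        a, b = tokens(item), tokens(text)
        overlap = len(a & b) / len(a | b)  # Jaccard similarity
        scores[label] = max(scores[label], overlap)
    return scores


def margin(item, labeled):
    """Gap between the best and second-best label score (small means uncertain)."""
    ranked = sorted(label_scores(item, labeled).values(), reverse=True)
    ranked += [0.0, 0.0]  # pad so two values always exist
    return ranked[0] - ranked[1]


def active_label(unlabeled, labeled, ask_analyst, budget):
    """Ask the analyst about the most uncertain item, up to `budget` times."""
    labeled = dict(labeled)  # copy so the caller's data is not modified
    pool = list(unlabeled)
    for _ in range(min(budget, len(pool))):
        query = min(pool, key=lambda item: margin(item, labeled))
        labeled[query] = ask_analyst(query)  # hypothetical human-in-the-loop step
        pool.remove(query)
    return labeled, pool
```

Each pass through the loop costs one question to the analyst, which is why a plain active learning workflow can feel slow on large product catalogs.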

Methods

  • Analytics

  • Machine learning

  • Customer feedback

Crucial insights

  • At the time, analysts had no standard way to handle the idiosyncratic product sorting of individual suppliers, so they hacked together different methods to overcome this challenge

  • Many available active machine learning methods that could help analysts curate and sort data were too inefficient or inflexible for them to use without a heavy investment in extra training

  • Combining unsupervised machine learning with active learning made it faster to label large data sets and gave analysts the flexibility they needed
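
One way to picture the combination is "cluster first, then ask once per cluster". The sketch below is only an illustration under stated assumptions, not the method built in this project: it uses a simple greedy word-overlap clustering, a hypothetical `ask_analyst` callback, and an arbitrary similarity threshold of 0.3.

```python
def tokens(text):
    """Lowercase a product description and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words that two descriptions have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(items, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list; its first element acts as the seed
    for item in items:
        for group in clusters:
            if jaccard(tokens(item), tokens(group[0])) >= threshold:
                group.append(item)
                break
        else:
            clusters.append([item])  # no similar cluster found, start a new one
    return clusters


def label_by_cluster(items, ask_analyst, threshold=0.3):
    """Ask the analyst once per cluster and copy the answer to every member."""
    labels = {}
    for group in cluster(items, threshold):
        answer = ask_analyst(group[0])  # one question per cluster, not per item
        for item in group:
            labels[item] = answer
    return labels
```

The number of questions now scales with the number of clusters rather than the number of products. In practice an analyst would still need a way to correct individual items that landed in the wrong cluster.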

Research impact

Stakeholder impact
  • Self and business stakeholders: I won an inventor's award for this work, and other members of the data science team were able to integrate pieces of this project into their own work.

Product impact
  • The code was designed as a general sorting tool for any text-based product data, so it could potentially affect many products and services

  • Components of the method were easily adopted in developing other products (a module I wrote for better context-specific spell correction was implemented in other product recommendation algorithms)
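
The spell correction module itself is not shown on this page. As a rough sketch of what "context-specific" could mean, the example below corrects words against a vocabulary built from the product data itself rather than a general dictionary. It uses Python's standard `difflib.get_close_matches`; the function names and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib


def build_vocabulary(descriptions):
    """Collect the words that appear in known-good product descriptions."""
    return sorted({word for text in descriptions for word in text.lower().split()})


def correct(text, vocabulary, cutoff=0.8):
    """Replace each unknown word with its closest vocabulary word, if one is close enough."""
    fixed = []
    for word in text.lower().split():
        if word in vocabulary:
            fixed.append(word)  # already a known domain word
            continue
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else word)  # keep the word if nothing is close
    return " ".join(fixed)
```

Because the vocabulary comes from the catalog, supplier-specific terms that a general spell checker would flag are treated as correct.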

The original procedure for labeling data used standard active machine learning, and was relatively slow

DataCuration2.png

The new procedure combined unsupervised and active machine learning, and allowed for more flexible and faster labeling

What I learned

  • Before this project, I had only heard of active learning. Here I got the chance to try it out, understand its limitations, and modify it to work better for potential users with no data science experience

  • There is room in every organization for cross-functional collaboration between data scientists and UX researchers, even if formal UX researchers are not available
