(Source: hoelixDE/Shutterstock.com)
In 2020, more than 4.4 billion internet users were producing a staggering amount of data through social-media posts, reviews, recommendations, and similar interactions. The insight gathered from this data is invaluable in guiding businesses and innovators through product development, marketing, and customer support. However, extracting that insight is challenging: opinion-oriented, customer-provided data is difficult for machines to interpret because of the complexities of human language and cultural context. Natural language processing (NLP) and machine learning (ML) enable computers to derive meaning from human language. Within these fields, sentiment analysis, an advancing research area in artificial intelligence (AI), helps machines process unstructured, customer-provided data and classify opinions as positive, negative, or neutral.
To understand sentiment analysis in NLP, let’s look at this simple statement from a restaurant review: “The soup was good.” An analysis of the sentiment requires three actions:
In this instance, the sentiment is unambiguously positive and concerns a particular food served at the restaurant. However, other examples are less straightforward, as in a seemingly similar clause, “The beer is cold.” Many would consider this opinion positive because they like beer this way, but cold can have a negative polarity in other contexts. For example, “The coffee is cold” uses an identical sentence structure and adjective, but many people would consider cold coffee to be negative.
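One way to picture this context dependence is a lookup keyed by both the thing being described and the opinion word, rather than by the opinion word alone. The following is only a minimal sketch: the lexicon entries, the function name `polarity`, and the fallback to "neutral" are illustrative assumptions, not part of any particular library.

```python
# Toy, hand-written lexicons. Every entry here is an illustrative assumption.
# Polarity is keyed by (aspect, opinion word) so that "cold" can differ by context.
CONTEXT_LEXICON = {
    ("beer", "cold"): "positive",
    ("coffee", "cold"): "negative",
}

# Fallback for opinion words whose polarity rarely depends on the aspect.
DEFAULT_LEXICON = {
    "good": "positive",
    "bad": "negative",
}


def polarity(aspect: str, opinion: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for an (aspect, opinion) pair."""
    key = (aspect.lower(), opinion.lower())
    if key in CONTEXT_LEXICON:
        return CONTEXT_LEXICON[key]
    # Unknown combinations fall back to the context-free lexicon, then to neutral.
    return DEFAULT_LEXICON.get(opinion.lower(), "neutral")


print(polarity("soup", "good"))    # positive
print(polarity("beer", "cold"))    # positive
print(polarity("coffee", "cold"))  # negative
```

A real system would typically learn such associations from labeled data rather than list them by hand, since the number of aspect and opinion combinations grows quickly.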
Other language complexities create additional challenges, such as sentences that contain multiple sentiments, for example: “The food was good, but the soup was cold.” Here, we have a positive sentiment about the food and a second sentiment about the soup that is negative or ambiguous, depending on the customer’s preference for soup temperature. Similarly, “The soup was hot, but the beer was cold” would express two positive sentiments for most people but remains ambiguous given potential customer contexts.
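A common first step for sentences like these is to split them into clauses and score each clause separately. The sketch below assumes a very simple rule (splitting on the conjunction "but") and a tiny hypothetical lexicon; the names `TOY_LEXICON`, `split_clauses`, and `clause_sentiments` are made up for illustration.

```python
import re

# Hypothetical word scores. "cold" and "hot" are deliberately absent because
# their polarity depends on context, so they contribute a score of 0.
TOY_LEXICON = {"good": 1, "great": 1, "bad": -1, "dirty": -1}


def split_clauses(sentence: str) -> list[str]:
    """Split a sentence on the contrastive conjunction 'but'."""
    text = sentence.strip().rstrip(".")
    parts = re.split(r",?\s+but\s+", text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def score_clause(clause: str) -> int:
    """Sum the lexicon scores of the words in one clause."""
    words = re.findall(r"[a-z']+", clause.lower())
    return sum(TOY_LEXICON.get(w, 0) for w in words)


def clause_sentiments(sentence: str) -> list[tuple[str, int]]:
    """Return each clause paired with its score."""
    return [(c, score_clause(c)) for c in split_clauses(sentence)]


for clause, score in clause_sentiments("The food was good, but the soup was cold."):
    print(f"{clause!r}: {score}")
```

The second clause scores 0 here, which reflects that a context-free lexicon cannot decide whether cold soup is good or bad.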
Modifiers further blur the line between polarities. For instance, consider the opinion: “The staff were almost too friendly.” Irony, sarcasm, and figures of speech make it even more challenging to identify sentiment correctly. Examples such as “We waited for more than an hour, really great service!” tend to be rare in training data and are extremely difficult to encode manually in a systematic manner.
Assigning polarity to opinions becomes even more challenging when considering personal, cultural, or circumstantial preferences. For example, consider analyzing customer reviews for a ryokan, a traditional Japanese guest house that is typically upscale and expensive but features a common bathing area rather than private bathrooms. Categorizing the absence or presence of something as positive or negative seems straightforward, as in “There was dirt in the shower” or “There was a pool for the kids.” However, the ryokan example demonstrates how accounting for cultural variables and personal preferences is essential in attaining usable insights from data. In Japan, guests generally consider shared bathing areas to be a positive attribute. By contrast, most European travelers would view them negatively, particularly at an expensive hotel. This example highlights just one feature and two cultures.
In NLP, sentiments can be analyzed at the whole-document level and at the paragraph and sentence levels, with results often then aggregated. Although whole-document analysis is useful, paragraph-level and sentence-level analysis can yield more granular and often more accurate results (such as identifying sentiment about a particular product feature in addition to the complete product). The challenge comes in developing a lexicon, the set of rules that machines use to classify sentiments as positive, negative, or neutral. Many free tools and resources, trained on public data, offer a starting point. For instance, software libraries such as Natural Language Toolkit (NLTK), spaCy, and TextBlob include sentiment models or support retraining with user data. If you prefer not to code, cloud offerings such as Google Cloud Platform or Microsoft Azure enable you to get started with sentiment analysis immediately: Simply paste the text to be analyzed into a browser and build your application from there.
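To make the idea of sentence-level analysis with aggregation concrete, here is a minimal sketch that uses only the Python standard library. The lexicon, the naive sentence splitter, and the choice to average scores are all assumptions made for illustration; libraries such as those named above offer their own, more capable models.

```python
import re
from statistics import mean

# Hypothetical lexicon mapping words to scores in [-1.0, 1.0].
TOY_LEXICON = {"good": 1.0, "friendly": 1.0, "dirt": -1.0, "slow": -1.0}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def sentence_score(sentence: str) -> float:
    """Average the scores of lexicon words found in the sentence (0.0 if none)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return mean(hits) if hits else 0.0


def label(score: float) -> str:
    """Map a numeric score to one of the three sentiment classes."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def analyze(text: str) -> tuple[list[tuple[str, float]], float]:
    """Score each sentence, then aggregate to a document score by averaging."""
    per_sentence = [(s, sentence_score(s)) for s in split_sentences(text)]
    document = mean(score for _, score in per_sentence) if per_sentence else 0.0
    return per_sentence, document


review = "The soup was good. There was dirt in the shower. The staff were friendly."
sentences, doc_score = analyze(review)
for sentence, score in sentences:
    print(f"{label(score):>8}: {sentence}")
print("document:", label(doc_score))
```

In this sketch the document-level label is positive overall, yet the sentence-level view still surfaces the complaint about the shower, which is the kind of detail a whole-document score can hide.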
Beyond prototyping, data sets and ML models should address language and culture complexities. This means:
Customer-provided data from social-media posts, reviews, recommendations, and the like provides invaluable insights for businesses and innovators. Complexities in natural language and culture make it difficult for AI-driven machines to understand customer opinions. However, sentiment analysis can help ensure that these aspects are captured and reflected in insights. You can get started by using freely available tools and resources, but addressing complexities in language and culture is challenging and requires significant planning, data preparation, and modeling. Awareness of these complexities is an excellent first step toward gaining useful insights and better understanding your customers and their needs.
Michael Matuschek is a Senior Data Scientist from Düsseldorf, Germany. He holds a Master’s Degree in Computer Science and a PhD in Computational Linguistics. He has worked on diverse Natural Language Processing projects across different industries as well as academia. Topics covered include Sentiment Analysis for reviews, client email classification, and ontology enrichment.