We Were Data Scientists Before Data Science Was Cool: New Challenges for the Profession

Suddenly, being a data scientist is cool. And in high demand.

Why? Because these days, data makes the world go round. Nearly every industry in our economic ecosystem is clamoring for it.

If a company, no matter the industry, is not using Big Data to chart and forecast customers' journeys, better connect with them, ferret out their wants and needs before they even know what they are (thank you, Netflix, for creating that perception in people's minds), and otherwise use the numbers to enhance the customer experience, it will be left in the dust by competitors that do.

The increased demand for data in all sectors of the economy has created a boom in the data science field. According to Forbes magazine, the fastest-growing jobs in the country today are data scientist, machine learning engineer and big data engineer. In the blink of an eye, every company needs people who can make sense of data. LinkedIn conducted a survey and found there are 6.5 times as many data scientists working today as there were just five years ago. For machine learning engineers, that multiplier jumps to 9.8.

"The field has exploded within the past four or five years," says Nuray Yurt, head of enterprise data science at Novartis. But, she points out, while the need for data pros continues to ramp up, which is a good thing on many levels for the profession, it also brings with it some challenges for the data scientists themselves.

Challenges for Data Scientists Today

The situation can be loosely compared to the disruption the corporate training field went through back when the internet was first starting to change the way every company on earth worked. People got into the training profession because they liked teaching in front of a classroom, which is where the bulk of training happened pre-internet. But very soon after the screech of dial-up technology began connecting every desk in every office to the World Wide Web, someone got the idea that training should happen online, so trainees could sit at those very desks and get the knowledge they needed on their own schedule. Suddenly, trainers had to learn an entirely new skillset — creating online learning modules. It was not what they signed up for, but it quickly became an essential part of the job.

Data scientists are finding themselves in a similar predicament today. The nuts and bolts of analyzing data are always evolving, but the core skills of the job — statistics, computing and business knowledge — remain the same. What's new for data scientists are the so-called soft skills that are becoming necessary parts of the job.

"Data scientists need to be curious, open minded, quick learners and have the right personality fit now," Yurt explains.

Communication skills are a vital part of that. Why? Because industries that are newly reliant on data, like sales, customer service and hospitality, are hiring data scientists to help them make sense of it all. And, to put it gently, the people who run those companies are not data scientists, nor have they ever had one on staff. As Yurt notes, everyone now knows data matters, but few know what it takes to gather that data, analyze it and translate it into actionable goals and strategies for companies to implement. So data scientists are suddenly in the position of emerging from the offices where they've been happily crunching numbers on their own and explaining to higher-ups, in language they can understand, what the data science actually means.

The temptation may be to "dumb down" the explanation, but Yurt says that's a mistake.

"The challenge for data scientists today is being able to communicate complex concepts to people who don't understand them without diluting the complexity," she says. That last part is the key. People in industries new to data need to understand the complexity of the process, or it diminishes the data science field as a whole. It also puts funding and potentially jobs at risk if people don't entirely get the fact that analyzing and interpreting data is a science that Hal from accounting wasn't trained for.

"We need to communicate why and how what we do makes a difference," she says.

Another challenge for data scientists is the need to be more open-minded. "We need to be OK with change," Yurt says. "Our jobs won't be the same as they always were, and we need to be OK with that."

Application of NLP to Detect Adverse Events in Patients

At the last PMSA conference, Ketan Walia, senior associate of decision science at Axtria, and his colleague Rushil Goyal, also a senior associate of decision science at Axtria, presented "Application of NLP to Detect Adverse Events in Patients," which generated a lot of interest. They looked into the automated detection of adverse drug reactions in social media text, leveraging natural language processing and machine learning, and gave conference attendees a rundown of what they found. In case you missed it, here's a recap.

Why Are Adverse Drug Reactions (ADR) Significant?

Getting a handle on ADR will significantly benefit the industry, leading to huge savings in healthcare costs and better patient compliance.

"ADR detection is a very significant task which typically doesn’t get as much traction as it needs," Walia says. "Especially considering the fact that adverse reactions related to a drug could affect the entire life cycle of the drug from clinical trials to the time it is launched in the market. Around 90% of ADR are underreported and there is often a big delay by the time they get formally reported and registered. This creates a huge lag in the system called a delayed feedback syndrome. This eventually hurts drug performance in the long run, greatly impacting safety of patients and commercial gains for the manufacturer."

ADR is a top cause of morbidity and mortality. Here are the grim stats:
  • 6.7% of hospitalized patients have a serious ADR with a fatality rate of 0.32%
  • Adverse reactions to drugs cause 100,000 deaths yearly
  • ADRs are the 4th leading cause of death in the U.S.
  • 90% are underreported
There's an urgent need for action. Adverse drug reactions, and their effect on drug approval, weigh heavily on the commercial outlook of drugs.

One place to start, Walia found, was Twitter.

Wait, Twitter?

Yes, Twitter. The social media giant is a potential gold mine of information about ADR. Its 645 million users generate about 9,100 tweets every second, some of them about their own health and response to medications.

Twitter has widely been used in other frontline industries like retail, e-commerce, consumer durables, service and more for opinion mining, customer intelligence and gauging customer satisfaction levels.

However, Twitter has not been widely used as a data source by the pharmaceutical and life science industry; it is simply not yet standard practice. Here's what led Walia and his team to Twitter:

Delayed feedback syndrome: "For our topic while doing literature review we realized that around 90% of ADR cases are underreported, which results in delayed feedback syndrome and many times ADR are officially registered only after their market launch. This hurts entire USP (unique selling point) of the product/drug and continues to affect drug performance throughout its lifecycle," Walia explains. "To mitigate these shortcomings, we were looking to build a pharmacovigilance system which could provide automated feedback, possibly in real time."

While researching, Walia and his team realized that although ADR are underreported, patients do not hesitate to go online and vent about their experiences in almost real time. So one reason to use Twitter as a data source stems from the shortcomings of the present system, and another from the nature of the problem being solved.

Lack of data sources: Not much data pertaining to ADR is collected and made publicly available for commercial use. Twitter solves this problem: all the data is publicly available, coming directly from affected patients themselves.

How Walia Used Twitter Data for Pharmacovigilance

Step 1: Data Acquisition
The first part of the process was to collect tweets as a source of potential ADR. Arizona State University collected 10,000 tweets corresponding to a list of 81 drugs from the IMS Health Top 100 Drugs. What they found was raw, unstructured data: people's thoughts, feelings and experiences. Next, it was time to remove the "noisy" information — retweets, advertisements, URL links — boiling the tweets down to the information they really needed: a patient's reaction to the drug they were taking.
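This noise-removal step can be sketched in a few lines. The filtering rules and function name below are illustrative, not the team's actual code:

```python
import re

def clean_tweets(tweets):
    """Filter out noise from raw tweets, keeping patient-experience text."""
    cleaned = []
    for text in tweets:
        if text.startswith("RT @"):                 # drop retweets
            continue
        text = re.sub(r"https?://\S+", "", text)    # strip URL links
        text = re.sub(r"@\w+", "", text)            # strip user mentions
        text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
        if text:
            cleaned.append(text)
    return cleaned
```

A real pipeline would also filter advertisements, typically by keyword lists or a small classifier, but the principle is the same: keep only first-person patient text.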

Step 2: Tweet Pre-Processing
Pre-processing involved segmentation of the raw text, sentence splitting and tokenization — preparing the words so they could later be converted into numbers and analyzed.
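A minimal sketch of that pre-processing, using simple regular expressions (the presenters' actual tooling isn't specified):

```python
import re

def preprocess(text):
    """Split raw text into sentences, then tokenize each into lowercase words."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Tokenize each sentence into lowercase word tokens
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences if s]
```

Each tweet becomes a list of token lists, ready to be mapped to numeric features in the next step.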

Step 3: Feature Engineering
The third stage in the process involved coming up with a representation of the text that a model can work with. One common starting point is the "bag of words," which represents a piece of text simply as the words it contains and how often each occurs, ignoring word order. Ketan Walia explains: "The way it works is that you feed in 'Bag of word' representation of words to this algorithm and it runs a neural network on the background and converts bag of word representation into a more generalizable vector representation called 'Word Embeddings.' Once you get word embeddings for all the words in your data set you can now feed these word embeddings instead of bag of words to a machine-learning algorithm."
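To make the bag-of-words idea concrete, here is a stdlib-only sketch; the vocabulary and example tweet are invented:

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Represent a tokenized tweet as a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocab = ["drug", "headache", "fine", "severe"]
vec = bag_of_words(["this", "drug", "gave", "me", "a", "severe", "headache"], vocab)
# vec == [1, 1, 0, 1] — word order is discarded, only counts remain
```

In the pipeline Walia describes, such count vectors are then fed to a neural network (word2vec-style) that produces dense word embeddings, which generalize better than raw counts.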

Step 4: Binary Classification
This step involved categorizing sentences as ADR or not-ADR, then testing and evaluating the model using various cross-validation techniques. Here, deep learning comes in. The main advantage of deep learning is that it can handle highly complex, unstructured data like text.
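The presenters used deep learning for this step. As a self-contained illustration of the same binary ADR/not-ADR classification task, here is a minimal multinomial Naive Bayes over tokenized sentences — a deliberately simpler stand-in, with invented training examples:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes for ADR (1) vs. not-ADR (0) sentences."""

    def fit(self, docs, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        scores = {}
        for y in (0, 1):
            total = sum(self.word_counts[y].values())
            score = math.log(self.class_counts[y] / sum(self.class_counts.values()))
            for w in tokens:
                # Laplace smoothing avoids zero probability for unseen words
                score += math.log((self.word_counts[y][w] + 1) / (total + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

docs = [["severe", "headache", "after", "dose"],
        ["felt", "dizzy", "and", "nauseous"],
        ["great", "product", "love", "it"],
        ["works", "fine", "for", "me"]]
labels = [1, 1, 0, 0]
clf = NaiveBayes().fit(docs, labels)
```

A deep model replaces the hand-built probabilities with learned representations, but the interface — tokens in, ADR/not-ADR label out, evaluated by cross-validation — is the same.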

Step 5: Named Entity Recognition
Walia and his team used a Hidden Markov Model to annotate words and phrases directly related to ADR. The model achieved 63% accuracy when trained to automatically annotate ADR-positive tweets.
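A Hidden Markov Model tags each token with a hidden state — here, ADR versus O for "other" — via Viterbi decoding. The probabilities below are toy values for illustration; in a real system they would be estimated from annotated tweets:

```python
def viterbi(tokens, states, start_p, trans_p, emit_p, default=1e-6):
    """Decode the most likely tag sequence for a token list."""
    # Initialize with start probabilities times first-token emissions
    V = [{s: start_p[s] * emit_p[s].get(tokens[0], default) for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(tokens)):
        V.append({})
        new_path = {}
        for s in states:
            # Best previous state leading into state s at position t
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(tokens[t], default), p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ("O", "ADR")
start_p = {"O": 0.8, "ADR": 0.2}
trans_p = {"O": {"O": 0.7, "ADR": 0.3}, "ADR": {"O": 0.6, "ADR": 0.4}}
emit_p = {"O": {"this": 0.3, "drug": 0.3, "gave": 0.2, "me": 0.2},
          "ADR": {"headache": 0.6, "nausea": 0.4}}
tags = viterbi(["this", "drug", "gave", "me", "headache"], states, start_p, trans_p, emit_p)
# tags == ["O", "O", "O", "O", "ADR"]
```

The tagged spans ("headache" above) are exactly the ADR mentions the framework surfaces to the user.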


Maximizing knowledge of a drug's safety profile and integrating it into commercial planning gives manufacturers greater influence with regulators, payers, and ultimately patients and prescribers.

The end result is an Artificial Intelligence based framework that can automatically stream drug-related posts online (from Twitter, in this case), interpret the text and classify whether it pertains to an ADR. If it does, the ADR-positive tweets are further analyzed within the framework to tag the words and phrases directly pertaining to the ADR, providing the most relevant and concise intelligence to the user. In short, it gives users the tools to perform pharmacovigilance and extract the most relevant information in an automated fashion.

Industrializing Machine Learning in Pharma

Last year at the PMSA conference, Daniel Kinney, senior director, data and analytics platforms for the Janssen Pharmaceutical Companies of Johnson & Johnson, and two colleagues gave a presentation that generated a lot of interest among conference attendees. "Industrializing Machine Learning in Pharma" looked at common problems and myths about industrializing AI/ML, and how best to tackle those issues.

Even one year later, the topic still resonates. In case you missed the presentation, here's a recap:

Challenges of Industrializing Machine Learning

Nobody questions that machine learning and data science have great power. Now, it's a given. Still, challenges exist. While most pharma companies work with AI/ML in different parts of the organization, few actually leverage AI/ML beyond proof of concept projects. AI/ML proponents in the pharma industry face three main challenges industrializing AI/ML to create large-scale impact.
  1. Downplaying the role of iterative automation
  2. Data readiness
  3. The adoption barrier
Let's look at each of those in more detail, and the myths behind them.

Three Myths About Machine Learning

Common myths, if not "busted," lead to an inability to show ROI when ML underperforms on poor data and short timeframes, and to inefficient one-off analyses built on unnecessarily complicated ML algorithms.

Myth: ML is a black box. If you don't already know this term, it refers to a model that produces outputs without explaining how it arrived at them. For instance, the model might report: "When targeting physicians, the year they graduated from medical school is an important variable" — without telling you why that variable matters.
Busted! Most use cases do not require, and are not fit for, black box algorithms like neural networks. If you're worried about not getting buy-in because of the black box factor, it's largely a nonissue now.

Myth: You'll see immediate improvement.
Busted! The improvement of processes depends on data availability and quality, and it takes iterations.

Myth: You'll be disrupting today's processes with ML.
Busted! Machine learning models do not need to be developed from scratch. They can incorporate business rules or sit on top of rule-based, knowledge-based systems.

Data Readiness: Garbage In, Garbage Out

Another problem is the data itself. How clean, accessible and quality-assured your data is matters enormously — but somebody's got to do that work. Sometimes data scientists and analysts feel like "data janitors," spending the majority of their time cleaning and prepping data. Creating training data can also be daunting and time consuming.

But those pain points also come with solutions.

Accessibility is no longer the issue it once was. Cloud and big data tech are increasingly available for the entire ML ecosystem. Also, publicly available data can be labeled quickly and cheaply by online resources.

Data strategy, governance and management are a big piece of this pie, and you can also use reinforcement learning to enrich training data.

The Adoption Barrier: Ability to Impact Decisions with ML Insights

The inability to impact decision-making will eventually erode the ROI of machine learning, making sustainable development of machine learning in an organization difficult. That's why it's vital to guide sales and marketing teams, who are likely not data experts. How you engage these teams — how you introduce the concept and get buy-in — can mean the difference between success and failure. The data can be spot on, but if sales and marketing don't use it, what good is it? Ways you can help encourage buy-in:
  • Field adoption: Show the relationship between increased sales and adoption.
  • Non-personal marketing: Use tools to automatically track target responses to create training data and improve marketing vendor management with a clear tracking system.
  • Other strategic decisions: Identify pain points, including financial, productivity, processes and support. Avoid "technology for its own sake."

A Word About ROI

Yes, that's what ultimately makes the world go round, but you shouldn't get too hung up on ROI at the outset. Let the models run, let some time pass and the data will get better. You'll identify new sources and the model will improve. Over time, you'll refine and refine again.

Key takeaway:
Your team doesn't need to be a machine learning unicorn. But for success, your organization should combine three key aspects of machine learning:
  • Data science. Develop and adjust your ML models. Validate and continuously monitor your model's performance.
  • Business analytics. Ensure your ML model is set up to answer important business questions. Relevancy is key. Communicate results to realize the impact of machine learning on your business.
  • Data engineering. Set up an infrastructure to enable data combination. You can even scale models developed by data scientists in big data environments.
The most important thing? Machine learning is the flash and sizzle of data science, but in order for it to really impact your business, you have to start with the endgame in mind. Design your infrastructure. Engage your customer base. And plan for success, because it will come.

Unlocking Your Brand's Hidden Potential Through Dynamic Targeting

At the most recent PMSA conference, Analytical Wizards associate principal Sreya Chatterjee and advisor James Lin gave a presentation that turned traditional targeting methods on their head. In "Unlocking Your Brand's Hidden Potential Through Dynamic Targeting," they outlined how brands can achieve deeper and wider results through dynamic targeting.

In case you missed it, here's a recap of their "dynamic" presentation.

Business Challenge

It all began when a customer came to Analytical Wizards with what has proven to be a common challenge. The pharmaceutical client wanted Analytical Wizards to look into more effective and innovative ways to address two growth challenges of a buy-and-bill brand:
  • Low conversion. Only 17% of the accounts in the market prescribed the brand.
  • Sales concentrated within a small portion of the prescribing accounts. Just 15% drove 80% of the sales.


The client wanted to enrich their current targeting by blending quintile-based segmentation with a driver analysis for adoption and growth. Goals included:
  • Accelerate sales. The client wanted to expand the breadth and depth of prescribing.
  • Target smarter. Focus promotional efforts on the right accounts (i.e., emerging adopters and growers).
  • Maximize conversions. Achieve greater impact by identifying the right promotional levers (drivers) for the targeted segments.

Examine Targeting Methods

The first step in tackling the challenge of increasing those low conversion rates and capitalizing on the opportunity for growth shown in just 15% of accounts driving 80% of the sales was to look into the client's targeting methods. Chatterjee found they had been using traditional volumetric targeting.

Chatterjee and her team uncovered some problems with volumetric and quintile-based targeting:

Traditional volumetric targeting is not designed to drive adoption or prescribing growth. While easy to communicate, the traditional volumetric targeting approach results in sub-optimal effort allocation because it does not differentiate promotional effort between:
  • Two high-market volume writers with diverging adoption likelihood (high vs. low) for the target brand.
  • Two high prescribers of target brand with different growth trajectories (growers vs. decliners).
"It is possible for a physician to be a 'high writer' but not prescribing your brand," she notes. "High prescription volume does not mean they'll prescribe yours." Similarly, quintile-based targeting does not differentiate promotional effort between two high-market volume non-prescribing accounts with different adoption likelihoods.

"We saw that there was a high prescription volume, but our modeling ranked those accounts as medium or low," she explains. "To convert those to 'high,' a change was necessary."

Solution: Dynamic Targeting

The goal for Analytical Wizards was to:
  • Expand breadth — grow the base of prescribing accounts.
  • Increase depth — more prescriptions within writing accounts.
Rather than simply try to get different results by using traditional volumetric or quintile-based targeting, Chatterjee sought to expand the breadth and increase the depth of the targeting method itself. The result: Dynamic targeting.

Dynamic targeting doesn't simply look at volume alone. It looks at other factors as well, factors that may change, grow, diminish or otherwise move in different directions. Some of those factors included:
  • Volume
  • How many competitors the physician is writing prescriptions for
  • What types of promotions the physician has been exposed to
  • What types of promotions they have responded well to
  • Promotion variables including sampling, calls, non-personal promotions
  • The type of account itself
  • Insurance factors
  • Patient pool
  • Claims information
  • What stage of disease the patient is in
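One hypothetical way to combine such factors into a single account score is a weighted sum of normalized features. The weights and feature names below are invented for illustration and are not Analytical Wizards' model:

```python
# Hypothetical weights; in practice these would be learned from historical data.
WEIGHTS = {
    "volume": 0.2,
    "competitor_share": -0.3,  # heavy competitor writing lowers adoption likelihood
    "promo_response": 0.4,     # past response to calls, samples, non-personal promos
    "growth_trend": 0.5,       # trajectory of recent prescribing (grower vs. decliner)
}

def dynamic_score(account):
    """Combine normalized account factors (0-1 scale) into one targeting score."""
    return sum(WEIGHTS[k] * account.get(k, 0.0) for k in WEIGHTS)

# A lower-volume account on a growth trajectory can outrank a high-volume decliner.
grower = {"volume": 0.4, "competitor_share": 0.2, "promo_response": 0.8, "growth_trend": 0.9}
decliner = {"volume": 0.9, "competitor_share": 0.7, "promo_response": 0.1, "growth_trend": 0.1}
```

This captures the core contrast with volumetric targeting: volume is just one input among several moving factors, so rankings shift as the market does.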
Using dynamic targeting, Chatterjee and her team were able to give recommendations for the client to achieve its goals, but, as with any change, they experienced some pushback at first.

"It's not an easy task to convince companies and sales teams to change the way they target," she explains.

Her solution? Combine the approaches. "We told them, keep your volumetric targeting. Overlay our dynamic targeting with that." The client began to see how many factors, other than volume, came into play.

"The market itself is dynamic," she says. "Everything is changing. If your targeting mechanism doesn't factor that in, you're setting yourself up for failure."

With the dynamic targeting approach, the client was able to track which accounts were most likely to prescribe and to model that behavior accurately. Looking at the factors dynamic targeting takes into account, the client gained a more accurate view of the market — and a stronger strategy for achieving its goals.