AI and Data Science: Stop the Madness!

Data scientists Anthony Scriffignano and Satyam Priyadarshy discuss the realities of AI and data science. Learn about data quality, biases, communication gaps, and building a data-driven culture. Practical advice for business and tech leaders on what works, what fails, and how to avoid AI project pitfalls.

01:00:13

Oct 18, 2024

CXOTalk episode 856 explores the critical intersection of AI and data science with Dr. Satyam Priyadarshy, CEO of ReIgnite Future and former Chief Data Scientist at Halliburton, and Dr. Anthony Scriffignano, Distinguished Fellow at the Stimson Center and former Chief Data Scientist at Dun & Bradstreet. They discuss the practical realities of implementing AI, emphasizing what truly works, what doesn't, and the reasons behind AI project failures. They examine common misconceptions, such as the idea that AI is a magic bullet, highlighting the importance of data quality, rigorous analysis, and effective communication between data scientists and business leaders.

Dr. Priyadarshy and Dr. Scriffignano also address the ethical implications of using synthetic data and offer practical advice on building a data-driven culture while managing computational costs. They explore the challenges of navigating organizational politics and biases that can hinder the success of data science initiatives. This conversation offers valuable guidance for business and technology leaders seeking to unlock the potential of AI and avoid common pitfalls.

Episode Highlights

Recognize the Pitfalls of Confirmation Bias in Data Science

  • Actively challenge existing beliefs and encourage diverse perspectives within data science teams to prevent biased data analysis.
  • Establish transparent processes that prioritize objective evaluation, minimizing the influence of preconceived notions on decision-making.

Understand the Crucial Role of Data Quality in AI

  • Invest in robust data governance frameworks to ensure data accuracy, completeness, and relevance throughout the AI lifecycle.
  • Implement data quality checks at every stage, from collection and preprocessing to model training and deployment, to maintain AI effectiveness.

Navigate the Complex Relationship Between Data Scientists and Business Leaders

  • Foster open communication and mutual understanding of roles and expertise to enhance collaboration between data scientists and business leaders.
  • Encourage clear and transparent presentation of findings while empowering leaders to ask questions without imposing predetermined outcomes.

Address the Ethical Implications of Synthetic Data in AI

  • Be aware of potential biases that synthetic data can introduce, and implement measures to prevent amplification of existing or artificial biases.
  • Ensure transparency in the use of synthetic data by establishing ethical guidelines for its generation and application.

Develop a Data-Driven Culture While Managing Costs

  • Prioritize data initiatives that align with business goals and demonstrate clear value to maximize resource efficiency.
  • Explore cost-effective data storage and processing strategies, such as cloud-based solutions and open-source tools, to manage costs effectively.

Key Takeaways

Emphasize Data Quality and Context to Maximize AI Effectiveness

Data quality and context are crucial for successful AI implementation. Business leaders should invest in robust data governance frameworks to ensure data is accurate, complete, and relevant. Organizations can make informed decisions and avoid misinterpretations that undermine AI initiatives by understanding the context in which data is collected and used.

Foster Collaboration Between Data Scientists and Business Leaders

Effective communication between data scientists and business leaders bridges understanding gaps and aligns objectives. Leaders should encourage open dialogue, allowing data scientists to present findings transparently while being receptive to insights that may challenge existing assumptions. This collaboration enables organizations to use AI effectively and drive meaningful changes.

Address Bias in Data and AI Models to Ensure Fair Outcomes

Bias in data and AI models can lead to unfair or inaccurate results. Leaders must implement strategies to find and mitigate biases in datasets and algorithms. Regularly reviewing assumptions and validating AI models maintains integrity and trust in data-driven decisions, leading to more equitable and effective outcomes.

Episode Participants

Anthony Scriffignano, Ph.D. is an internationally recognized data scientist with experience spanning over 40 years in multiple industries and enterprise domains. Scriffignano has an extensive background in advanced anomaly detection, computational linguistics, and advanced inferential methods, leveraging that background as primary inventor on multiple patents worldwide. He also has extensive experience with various boards and advisory groups. He is a Distinguished Fellow with The Stimson Center, a nonprofit, nonpartisan Washington, D.C. think tank, and a member of the OECD Network of Experts on AI working group on implementing Trustworthy AI.

Dr. Satyam Priyadarshy is the CEO of Reignite Future. He was previously Chief Data Scientist and a Technology Fellow at Halliburton. He was also the Managing Director of Halliburton’s India Center. He is often recognized as the first Chief Data Scientist of the oil and gas industry. Recently, Forbes India named him one of the top 10 outstanding business leaders. His work or profile has appeared in many places, including Chemical and Engineering News, The Scientist, Silicon India, Oil Review Middle East, Petroleum Review, Rigzone, and Forbes, among others.

Michael Krigsman is a globally recognized analyst, strategic advisor, and industry commentator, known for his deep expertise in the fields of digital transformation, innovation, and leadership. He has presented at industry events around the world and written extensively on the reasons for IT failures. His work has been referenced in the media over 1,000 times and in more than 50 books and journal articles; his commentary on technology trends and business strategy reaches a global audience.

Transcript

Michael Krigsman: Welcome to CXO Talk 856. I'm Michael Krigsman, and we are exploring AI and data science: what works, what doesn't, and what fails.

Our guests are two prominent chief data scientists. Dr. Satyam Priyadarshy is the former Chief Data Scientist and Technology Fellow at Halliburton. He is currently CEO of ReIgnite Future. Dr. Anthony Scriffignano is the former Chief Data Scientist at Dun & Bradstreet. He is now a Distinguished Fellow at the Stimson Center.

Anthony, business leaders somehow think that AI is magic and can deliver instant results. We end-users think that too. What are your thoughts on this as a data scientist?

Anthony Scriffignano: There's nothing fundamentally new about most current AI developments except for significantly increased compute power and data volume. Something that mimics human behavior well enough gains popularity. It becomes easier to interact with and understand.

This increased popularity is good, but also dangerous, requiring caution. Think about prescription drugs: necessary, but potentially harmful if misused. AI is similar.

Michael Krigsman: Let's discuss the relationship between AI and data. Everyone knows AI requires data, but how does this relationship actually function? Do hidden problems exist within organizations that don't surface? Not because data scientists intentionally hide them, but because they could if they wanted to.

Satyam Priyadarshy: No, data scientists don't typically hide anything. They present the narrative revealed by the data, regardless of its size or perceived quality. Models present patterns, and those patterns tell a story.

Sometimes, this story unsettles business leaders. They may have been operating with inefficiencies for years, and the data exposes this. Their reaction might be, "No, you can't say that! You're a data scientist, not a domain expert."

However, data is objective. It captures modifications and changes, reflecting them in the resulting patterns and narratives. This leads to excuses like, "We don't have enough data," or, "Our data is bad." Yet, many complex industries have decades of data. Why isn't it utilized? This is the real challenge.

Anthony Scriffignano: We face numerous challenges in data science and analytics. One major hurdle is confirmation bias: "I believe X is true; find data to prove it." With enough data, anything can seemingly be "proven."

Asking data scientists questions without understanding their work or assumptions creates problems. Data provides answers regardless of pre-conceived notions. However, leaders often reject answers that contradict their beliefs. They pressure data scientists to "work harder" and essentially force the data to fit a narrative.

Careful consideration of a priori assumptions is crucial – assumptions made before analysis. Data must be representative and assumptions about truth must be validated. Provenance is essential: knowing the data's origin and usage rights. Significant preparation is needed before employing AI.

Michael Krigsman: The idea of data telling a story is intriguing. Satyam, you mentioned transparency in data science. We often hear about problematic data or insufficient data despite companies having vast quantities. Can we explore this "data story" concept further?

Satyam Priyadarshy: Predictive maintenance is a common application of AI algorithms. It's essential across complex industries like manufacturing, energy, and oil and gas. Even with incomplete data (30-40% complete), a model and algorithm can be built. This generates a story, patterns, and a narrative.

The objective is creating "smart data" from existing data. This means addressing data gaps. If you doubt the data's story due to your own expertise, examine the patterns and identify why they emerged. This might indicate a need for more data. If the data needs enhancement, you must be able to explain all aspects of the predictive maintenance. This process of refining incomplete data into something smart enables full use of self-learning models.

Anthony Scriffignano: Consider the airline industry during COVID. Flights were grounded, disrupting models predicting aircraft maintenance needs. Predictions for ordering parts – greases, oils, screws, bolts – became inaccurate. Distribution management systems failed.

Why? The underlying behaviors changed drastically. The historical data used for prediction became irrelevant. Smart individuals recognized the need for intervention. Model assumptions were invalid, requiring a rethinking of distribution, ordering, maintenance scheduling, and staffing.

Mid-disruption, the Suez Canal incident further complicated matters, impacting shipping and air freight, requiring more flights. This highlights the need for constant adaptation. You cannot rely solely on static models.

Context is crucial. You must actively manage and adapt to changing conditions.

Satyam Priyadarshy: Context is indeed paramount, often overlooked. Data scientists understand the context of the data they analyze.

Michael Krigsman: How do these data-related problems arise? How can business and technology leaders break the cycle of misusing data and obtaining substandard or meaningless results?

Anthony Scriffignano: In aerospace, for instance, multiple systems contribute to launch decisions. These systems effectively "vote," and the smarter ones use different criteria. This offers protection against unforeseen issues. Having multiple systems, observing the same environment from various perspectives or using isolated equipment, safeguards against surprises.

Think of old commercials claiming, "Nine out of ten dentists recommend..." You can cherry-pick data to support anything. Systems should be representative, analyze extensive data, and ideally function independently. This ensures that a single point of failure doesn't bring down the whole system. The overall conclusion should go wrong only when several independent systems are wrong at the same time.
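As a rough illustration of the voting idea above, the following Python sketch compares one system with a majority vote of three. It assumes the systems fail independently and uses a made-up 10% error rate; neither number comes from the conversation, and correlated failures would weaken the benefit.

```python
from math import comb

def majority_wrong_probability(n_systems: int, p_wrong: float) -> float:
    """Probability that a strict majority of n independent systems is wrong."""
    needed = n_systems // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_systems, k) * p_wrong**k * (1 - p_wrong) ** (n_systems - k)
        for k in range(needed, n_systems + 1)
    )

# Hypothetical per-system error rate of 10%, assumed identical and independent.
single = majority_wrong_probability(1, 0.10)
triple = majority_wrong_probability(3, 0.10)
print(f"one system wrong:        {single:.3f}")
print(f"majority of three wrong: {triple:.3f}")
```

Under these assumptions the voted result is wrong far less often than any single system, which is the protection against surprises described above.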

Michael Krigsman: Please subscribe to our newsletter. Visit cxotalk.com. Subscribe to our YouTube channel.

Gus Bekdash on Twitter says, "Data science/AI, like statistics, involves questions that can be illuminating or misleading. Asking the right questions is crucial. What should data scientists do to ask the right questions and navigate organizational politics?"

I love this question!

Michael Krigsman: Yeah.

Anthony Scriffignano: It's so important! There are two key aspects. First, ensuring you're asking the right questions. Propositional calculus offers a framework: defining what you accept as true (axioms), what you're testing, your methods, and the justification for those methods.

Consider assumptions embedded in methods. We often focus on regressive or supervised methods, using past data to predict the future. However, applying these during times of disruption is problematic, as the future deviates from the past. The "elasticity" introduced by disruption must be considered.

The mathematical aspect is essential, but so is the political dimension. You can be right, but still fail due to organizational dynamics.

You must assess various factors. Can you be correct but still be overruled? What if powerful people insist on the wrong conclusion? This raises ethical and moral dilemmas beyond data science. These are real workplace challenges. Satyam, I'm sure you've faced such pressures in your career, especially given your work history.

You need strategies. With experience, you learn to phrase things carefully, for instance, "From my perspective...", "In my experience...", or "It appears to me..." This protects you when others are dissatisfied with your findings. They can't argue with your perspective.

Communicating these messages effectively is crucial. Advocacy and ethics become paramount for true data science. Otherwise, you may be right, but effectively silenced.

Satyam Priyadarshy: In my 20 years of experience, I've always focused on the data's narrative. I define data science as "science on the data." The methods—statistics, data mining, machine learning—serve the goal of finding value for the business.

Not all problems require neural networks, nor can all be solved with basic statistics. The focus is on understanding the specific business problem and building tailored solutions, assessing value propositions, like cost savings or increased sales, using relevant financial metrics.

I have never accepted direction from leaders dictating desired results. I advise my data scientists to, "Tell the data's story objectively. Develop resilience. Domain experts should explain unexpected results, or revisit their assumptions."

We don't impose meaning on data; we extract its inherent meaning. There is a political aspect, yes, but it often lies within leadership disagreements. Data scientists merely present the results, which leaders may not always accept.

The discussion then shifts to explaining the validity and context of those results.

Michael Krigsman: So, technology and data help reach conclusions, forming the data's story. Then we use this story to make decisions. Decisions draw upon the data, and are sometimes influenced by external factors like politics. But these factors are distinct from the data itself. Satyam, is this accurate?

Satyam Priyadarshy: Correct. Data-driven decisions are justified. However, ignoring data analysis based on experience can introduce unseen problems. While external factors are involved, they shouldn't override data insights.

Here's an example: Two leaders from different organizations collaborate on a model using a specific dataset. The model performs excellently, with high accuracy. However, one leader insists on applying the model to a different dataset (dataset C). The results are unsatisfactory. They haven't considered data drift, and dismiss the model as useless. The issue isn't the model, but the change in context and data distribution. This data drift can lead to significant issues, causing project abandonment, adding to the statistics of failed AI projects.
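One way to catch the situation in this example before blaming the model is to compare the distribution of a feature in the training data with the same feature in the new dataset. The sketch below is a minimal, standard-library illustration using the two-sample Kolmogorov-Smirnov statistic; the simulated data, the feature, and the `DRIFT_THRESHOLD` value are assumptions made for illustration, not anything described by the speakers.

```python
import random
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

rng = random.Random(0)
training = [rng.gauss(50.0, 5.0) for _ in range(2000)]      # data the model was built on
same_context = [rng.gauss(50.0, 5.0) for _ in range(2000)]  # new data, same conditions
new_context = [rng.gauss(60.0, 8.0) for _ in range(2000)]   # new data, shifted conditions

DRIFT_THRESHOLD = 0.1  # hypothetical cutoff; tune per feature and sample size
for name, sample in [("same context", same_context), ("new context", new_context)]:
    d = ks_statistic(training, sample)
    print(f"{name}: KS = {d:.3f}, drift flagged = {d > DRIFT_THRESHOLD}")
```

A flagged feature does not say the model is wrong; it says the model is being asked about data unlike what it learned from, which is the distinction drawn in the example.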

Michael Krigsman: So, they blame the data scientists for providing "wrong" data and results?

Satyam Priyadarshy: Absolutely. "Wrong model" is the typical accusation.

Anthony Scriffignano: This reminds me of post-election analysis, where half the population rejects unfavorable predictions.

Michael Krigsman: A question from TH Go on Twitter: "With growing AI popularity across industries, what's the anticipated impact on decision-making, especially regarding predictive analytics?" Anthony?

Anthony Scriffignano: Will we prioritize analytics in the future? Will decisions be based on AI recommendations, or on intuition? These technologies become ingrained in our processes, leading us to accept them without question.

I recall my early physics classes. Calculators existed, but we used slide rules. This seemingly archaic method forced us to think critically about magnitude and precision. Today, we risk blindly accepting results without understanding the underlying calculations. "60% of 47 million can't exceed 47 million"—a simple concept often lost in the age of instant computation.

Here's an anecdote illustrating Satyam's point. During a war (Korea or WWII), analysts studied why pilots were shot down. Survival rates were alarming. They examined various factors: training, mission type, ordnance. Eventually, someone analyzed bullet holes in returning aircraft, suggesting reinforcement where planes were hit.

This led to heavier, less efficient planes. The critical insight came later: surviving planes revealed where they could be hit without being downed. The focus shifted to reinforcing areas without bullet holes. This wasn't an AI or analytics triumph; it was a better question. We need to cultivate analytical thinking, not just blindly follow data density.
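The effect in this anecdote is usually called survivorship bias, and a small simulation can make it concrete. The sketch below assumes every plane takes exactly one hit, placed uniformly across four sections, and that engine hits are far more likely to down the plane; the sections and loss probabilities are invented for illustration.

```python
import random
from collections import Counter

SECTIONS = ["wings", "fuselage", "tail", "engine"]
# Hypothetical chance that a hit in each section downs the plane.
LOSS_IF_HIT = {"wings": 0.1, "fuselage": 0.1, "tail": 0.1, "engine": 0.8}

rng = random.Random(42)
all_hits, observed_hits = Counter(), Counter()
for _ in range(10_000):
    section = rng.choice(SECTIONS)            # each plane takes one hit, uniformly placed
    all_hits[section] += 1
    if rng.random() >= LOSS_IF_HIT[section]:  # plane survives and can be inspected
        observed_hits[section] += 1

for s in SECTIONS:
    print(f"{s:9s} true hits: {all_hits[s]:5d}   seen on returning planes: {observed_hits[s]:5d}")
```

In this toy setup the hits are spread evenly, yet the returning planes show few engine hits. Counting only what came back would point the reinforcement at exactly the wrong places.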

Michael Krigsman: So you advocate for critical thinking and understanding, even if it means more work?

Anthony Scriffignano: Exactly. This isn't easy, but it's crucial.

Michael Krigsman: Except for those burdened with the extra work.

Anthony Scriffignano: True. It's a leadership challenge. Positioning is key. Demanding extra work without explaining its value breeds resentment. Effective leadership explains the why, fostering understanding and motivation. The framing matters: "had to" versus "got to."

Michael Krigsman: Arsalan Khan asks, "Management sometimes uses consultants to justify their agenda (e.g., reducing FTEs, re-engineering, politics). Is AI different?"

Satyam Priyadarshy: AI, at its core, involves mimicking human intelligence with machines. In practice, "AI" often refers to the technologies employed. While AI can be a tool for such agendas, it's different. It provides data-driven insights.

AI analyzes data to uncover truths. Unless manipulated, data reveals accurate narratives. The power of AI lies in augmenting human capabilities, uncovering inefficiencies in processes and workflows. This empowers informed business transformation. Leaders benefit most when AI is used effectively and efficiently.

Its correct usage is in their best interest.

Michael Krigsman: I've heard leaders express respect for data scientists and their expertise, only to reject their findings as "wrong" and irrelevant to their needs. What's your perspective on this?

Satyam Priyadarshy: Such leaders are hindering progress and need to adapt. Embracing data-driven transformation is essential for navigating the evolving business landscape.

Michael Krigsman: Why is that?

Satyam Priyadarshy: Consider your podcast. It leverages various technologies to enhance quality. Would you revert to simpler, less effective tools? Similarly, businesses must embrace proven technologies and solutions to remain competitive. Ignoring these advancements puts them at a disadvantage in a dynamic market where consumer expectations and global dynamics constantly shift.

Adaptation is key for sustainability.

Michael Krigsman: This question resonates with those raised by Arsalan Khan, a regular listener. It centers on the disconnect between data science findings and business leader acceptance. The data scientist presents results, the businessperson rejects them.

Anthony Scriffignano: It's important to remember that we're all on the same team. Data scientists and business leaders are working toward the same organizational goals. We're collectively pushing the same rock uphill. Ideally, we should be working together, not against each other.

I understand the framing of the question, but let's reframe it to emphasize collaboration.

What I usually do in situations like this—and I have to admit, I admire Satyam’s more direct approach—is to appeal to understanding.

My martial arts training informs my approach. If facing a stronger, armed opponent, direct confrontation isn’t wise. A different tactic is needed. Similarly, when faced with resistance to data findings, direct confrontation might not be effective.

If a leader insists I'm wrong, I'll often respond with, "Perhaps I am. Help me understand your perspective." This fosters dialogue, exploring potential misunderstandings. Perhaps I misinterpret the question or need to clarify how the data relates to their concerns.

"Maybe I am wrong," or, "Maybe I misunderstood your question." Clarifying how I interpreted their question and presenting the data's perspective invites a collaborative discussion. This approach can defuse defensiveness, opening up the possibility for productive conversation and mutual understanding. It also acknowledges the possibility of my own error.

Starting with humility, even if I believe I’m right, can de-escalate tension. It encourages listening and allows us to find common ground. Perhaps I truly misunderstood, or perhaps there are other factors to consider. This open approach is much more constructive than an adversarial one.

Michael Krigsman: Gus Bekdash adds, "Anthony mentioned a significant point: data science uses history to predict the future. But the past doesn't contain the new. Does data science hinder innovation?" He acknowledges the complexity of this question. What are your thoughts?

Anthony Scriffignano: Some future projections can be informed by historical data. However, disruption, unforeseen events, and unprecedented circumstances demand different approaches. Past learnings are rarely useless; they become useful in different ways.

We often discuss supervised and unsupervised learning. Supervised learning relies on training data, using the past to inform the future. Unsupervised learning explores patterns in current data without explicit historical context. It's not a simple binary; there are other methods that blend both.

Bayesian inference offers a valuable approach for disruption, iteratively adjusting assumptions. Think of driving. Relying solely on the rearview mirror is dangerous. We look ahead, making real-time decisions based on current observations.
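To make "iteratively adjusting assumptions" concrete, here is a minimal sketch of a Bayesian update using the standard Beta-Binomial model for a failure rate. The prior, the weekly counts, and the predictive-maintenance framing are hypothetical numbers chosen for illustration.

```python
def update_beta(alpha: float, beta: float, failures: int, successes: int):
    """Conjugate Beta-Binomial update for a failure rate."""
    return alpha + failures, beta + successes

# Prior belief from the stable period: about a 2% failure rate (hypothetical).
alpha, beta = 2.0, 98.0

# Weekly observations after a disruption: (failures, non-failures), made up for illustration.
weekly_batches = [(3, 17), (4, 16), (5, 15)]

for week, (fails, ok) in enumerate(weekly_batches, start=1):
    alpha, beta = update_beta(alpha, beta, fails, ok)
    print(f"week {week}: estimated failure rate = {alpha / (alpha + beta):.3f}")
```

The estimate moves from the historical 2% toward what is currently being observed, without discarding the past outright. That matches the remark above that past learnings become useful in different ways.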

Similarly, our analytical methods should adapt. In dynamic situations, like predicting the stock market or performing predictive maintenance on long-standing equipment, different approaches are required. Supervised methods suit stable environments; other methods suit dynamic ones.

Michael Krigsman: Arsalan Khan asks, "Do data scientists, focused on the data, consider its potential inaccuracy? Should they explain this to executives before presenting AI recommendations?"

Satyam Priyadarshy: Competent data scientists analyze data within its context. This reveals underlying phenomena and potential inefficiencies. Tainted, modified, or incomplete data compromises analysis. Leaders need to understand that storing unreliable data is wasteful.

Data manipulation, unfortunately, occurs in various industries to force desired outcomes. Proper modeling on historical data reveals such manipulations. Historical data itself is crucial for building future models. Anyone claiming otherwise is mistaken. Creating models from synthetic data without context is futile.

I tested a generative AI system by asking it to design a flyer for "generative AI" based on the phrase "learn some AI." It garbled the phrase, lacking context. Data scientists must be cautious and communicate clearly with executives, explaining results in the correct context and offering explanations for observed phenomena.

Anthony Scriffignano: Michael, on the topic of data and truth, I'd like to offer a different perspective. Truth has multiple dimensions. I've worked extensively in veracity adjudication, evaluating data's truthfulness and usefulness.

The legal oath—"the truth, the whole truth, and nothing but the truth"—highlights three distinct aspects. Presenting selective truths, including falsehoods with truth, or presenting truthful information that misleads—are forms of deception. All can manipulate the perception of what is true.

Data representation matters. Does the data accurately reflect the problem? Surveys are susceptible to non-response bias. How do you know the opinions of those who didn't respond? You need mathematical methods to ensure data is representative. Data integrity is also important. Systems need stability and data needs protection from manipulation. Adversarial attacks, for example, can poison datasets subtly.
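One common mathematical method for the non-response problem is post-stratification: reweighting each group of respondents to its known share of the population. The sketch below uses invented groups, shares, and scores, and it assumes that people who did respond in a group resemble those who did not, which is itself an assumption to validate.

```python
# Known population shares per group (hypothetical), e.g. from a customer master file.
population_share = {"small": 0.70, "large": 0.30}

# Survey responses: group -> list of satisfaction scores (made-up values).
responses = {
    "small": [3, 4, 2],              # small firms responded rarely
    "large": [8, 9, 7, 8, 9, 8, 7],  # large firms responded often
}

def raw_mean(responses):
    """Average over whoever happened to respond."""
    scores = [s for group in responses.values() for s in group]
    return sum(scores) / len(scores)

def post_stratified_mean(responses, population_share):
    """Weight each group's mean by its share of the population, not of the respondents."""
    return sum(
        population_share[g] * (sum(scores) / len(scores))
        for g, scores in responses.items()
    )

print(f"raw mean:             {raw_mean(responses):.2f}")
print(f"post-stratified mean: {post_stratified_mean(responses, population_share):.2f}")
```

The gap between the two numbers shows how much the headline result depended on who chose to answer.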

Data authenticity and timeliness are key. Metadata analysis can sometimes verify authenticity. Latency, or data age, is also crucial. Data gathered now might not reflect current reality, especially in rapidly changing scenarios, such as immediate earthquake casualty counts.

All these factors and more contribute to evaluating data quality and its usability in AI. These dimensions of data integrity – representativeness, freedom from manipulation, authenticity, and timeliness – are crucial. Ignoring them is risky.

Michael Krigsman: Both of you work with large organizations with substantial budgets, sometimes spending billions on AI training. How pervasive are these data issues within large companies, assuming they aim for good data but might err?

Satyam Priyadarshy: I frequently present a slide based on 15 years of conversations, a trend that continues. Many organizations lack a data catalog, unaware of their data holdings. They claim "petabytes of data," but can't pinpoint its location. In oil and gas, for example, multiple data copies exist, yet data completeness hovers around 40-50%, with quality varying.

These statistics are common in complex industries. However, recent years have seen progress. Organizations are learning high-frequency data collection and faster analysis, thanks to IoT and cloud technologies. Historical data, even if not perfectly dimensioned, remains valuable. Emerging technologies are further enhancing data practices.

Anthony Scriffignano: Building on Satyam's point, remember the initial excitement around "big data"? We discussed its various Vs: velocity, variety, value, volume. These concepts remain relevant, though the terminology has evolved.

Today’s difference lies in edge devices (self-driving cars, satellites) generating massive datasets. Hyperspectral sensors alone create petabytes. Real-time processing is essential, forcing decisions about data retention and inference.

Consider self-driving cars. They collect immense amounts of data, but not all of it gets sent to central servers for AI analysis. Legal, privacy, and data localization constraints often dictate data handling.

Large organizations face this data deluge. They possess vast but scattered information, often siloed. Regulations complicate matters. The challenge isn't a lack of data, but accessing and integrating the right data at the right time. This decentralized information landscape necessitates advanced approaches to data management and analysis.

Michael Krigsman: Now's a great time to subscribe to the CXO TALK newsletter at cxotalk.com. You'll encounter a pop-up—yes, one of those pop-ups—inviting you to subscribe. This ensures you're notified about live shows and events. We're also active on Twitter (#cxotalk) and LinkedIn, where you can join the conversation and ask questions. Don't miss this opportunity to engage with these prominent data scientists.

A formal question from Lisbeth Shaw: "Why augment existing data with synthetic data? What are the implications? When do you have too much synthetic data?"

Anthony Scriffignano: Synthetic data is algorithmically generated to mimic real data with specific characteristics. You might expand your dataset by creating synthetic data that resembles it.

Generative AI, essentially, produces synthetic data. The "generative" aspect implies creation. We now generate more synthetic data than ever before.

Beyond generatively created synthetic data, consider situations where you lack sufficient testing data for an algorithm. You can create synthetic data within known statistical bounds to support testing. The crucial consideration is bias amplification. Synthetic data can magnify biases present in the original data, either obscuring anomalies with noise or creating artificial outliers.

I'm not against synthetic data; I caution against its misuse as a crutch. Understand your existing data before generating synthetic data.

Satyam Priyadarshy: Synthetic data has its place. If you understand the underlying physics, science, or behavioral aspects, but lack sufficient data, you can create scenarios. This aligns with the traditional concept of scenario planning. Synthetic data allows exploring, “What if this happened?” It’s a valuable tool for exploring possibilities.

This isn't about declaring "truth," but about gaining insights. However, limitations exist. LLMs, used to generate synthetic data, may produce nonsensical outputs without context. Numerical data, constrained by physics or statistical principles, is more suitable for generating useful scenarios.

This can save time. For example, one company I advise needed to differentiate objects moving at high speed, requiring a significant amount of factory-gathered data. Generating synthetic data allowed model creation and refinement once real-world data became available, speeding up the process considerably.

This reduces model development time, allowing for more rapid refinement with actual data.

Michael Krigsman: Arsalan Khan asks, "When is data too much or too little? Who decides? Does this power of deciding influence others? What federal guardrails are needed without stifling innovation?"

Let's start with the technical aspect: big versus little data, enough versus too much. Anthony or Satyam?

Anthony Scriffignano: Think of data in three categories: data you have, data you could reasonably acquire, and data you know exists but can't access. "Enough data" becomes nuanced. It's "enough" when you reach the dispositive threshold: when you have sufficient data to answer the question using your chosen method.

This isn’t necessarily enough for a good decision, just a decision. Continued data collection reaches a point of diminishing returns. When predictions stabilize within statistical limits, that's generally a good time to finalize your analysis and present your findings, along with their limitations.

Elasticity—the acceptable margin of error—is context-dependent. Launching rockets demands higher precision than, say, a marketing campaign. The "cost" of being wrong varies dramatically.
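The idea of stopping once an estimate stabilizes, with a tolerance that depends on the cost of being wrong, can be sketched as a simple stopping rule. The example below keeps collecting batches until the approximate 95% interval around a mean is narrower than a chosen tolerance. The simulated measurements, batch size, and both tolerance values are assumptions for illustration only.

```python
import random
from statistics import mean, stdev

def batches_until_stable(draw_batch, tolerance, z=1.96, max_batches=1000):
    """Collect batches until the ~95% interval half-width of the mean is within tolerance.

    Assumes each batch holds at least two values so a standard deviation exists.
    """
    data = []
    half_width = float("inf")
    for n_batches in range(1, max_batches + 1):
        data.extend(draw_batch())
        half_width = z * stdev(data) / len(data) ** 0.5  # normal approximation
        if half_width <= tolerance:
            break
    return n_batches, mean(data), half_width

rng = random.Random(1)
draw = lambda: [rng.gauss(100.0, 15.0) for _ in range(50)]  # simulated measurements

strict = batches_until_stable(draw, tolerance=0.5)  # low tolerance for error
loose = batches_until_stable(draw, tolerance=3.0)   # higher tolerance for error
print(f"strict tolerance: {strict[0]} batches, half-width {strict[2]:.2f}")
print(f"loose tolerance:  {loose[0]} batches, half-width {loose[2]:.2f}")
```

The stricter tolerance needs many more batches for the same underlying data. This mirrors the point that the amount of data that counts as "enough" depends on context.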

There's the concept of supersaturation—when additional data provides no new insights. Then there's the dispositive point: you can answer, but aren’t fully prepared. The ideal lies between these extremes. The real challenge is when your decision-making environment changes faster than data acquisition.

There’s no emergency brake in some situations. Real-world urgency may force a decision before data is fully collected. This external pressure determines the practical dispositive point. The environment forces your hand.

These are some significant considerations regarding data quantity and decision-making methodologies. I’ll pause here to allow Satyam to contribute to the second part, regarding regulation and safeguards.
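Editor's illustration: the stopping idea described above (keep collecting until estimates stabilize within an acceptable margin of error) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a method either speaker described: the function name `samples_until_stable`, the batch size, the normal-approximation interval, and the synthetic data are all hypothetical choices made for illustration. The `tolerance` parameter plays the role of the context-dependent "elasticity".

```python
import math
import random


def samples_until_stable(stream, tolerance, batch_size=50, z=1.96):
    """Consume `stream` until the approximate 95% confidence half-width
    of the running mean is at or below `tolerance`.

    The half-width is only re-checked every `batch_size` samples.
    Returns (n_used, mean, half_width), where half_width is the value
    from the most recent check (math.inf if no check happened).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    half_width = math.inf
    for value in stream:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        if n >= 2 and n % batch_size == 0:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width <= tolerance:
                break  # further data would change the estimate very little
    return n, mean, half_width


# Synthetic data purely for illustration (assumed mean 10, std 2).
rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(5000)]

# A loose tolerance (marketing campaign) versus a tight one (rocket launch).
n_loose, mean_loose, hw_loose = samples_until_stable(iter(data), tolerance=0.5)
n_tight, mean_tight, hw_tight = samples_until_stable(iter(data), tolerance=0.1)
print(n_loose, n_tight)
```

The tighter the acceptable margin of error, the more data the rule consumes before stopping, which mirrors the point that the cost of being wrong determines how much data is "enough".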

Michael Krigsman: Satyam, your thoughts on data quantity, considering our limited time.

Satyam Priyadarshy: I aim for minimal data with maximum output, achieving a positive return on investment. Why analyze terabytes if gigabytes suffice? A leaner model with fewer features can still significantly improve workflows, products, or services.

From a business standpoint, value optimization is key. A scientist might pursue data until reaching the model’s asymptotic limit. But in the business world, pragmatism dictates finding the sweet spot of cost-effectiveness.
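Editor's illustration: the contrast between chasing a model's asymptotic limit and stopping at the cost-effective sweet spot can be made concrete with a toy calculation. Everything here is an assumption for illustration only: the power-law learning curve, its parameters, the dollar value per accuracy point, and the cost per sample are hypothetical and not taken from the conversation.

```python
def net_value(n, value_per_point, cost_per_sample,
              ceiling=0.95, gap=0.40, rate=0.5):
    """Net value of training on n samples under a hypothetical power-law
    learning curve: accuracy(n) = ceiling - gap * n ** -rate.

    value_per_point is the assumed business value of one percentage
    point of accuracy; cost_per_sample covers acquisition and compute.
    """
    accuracy = ceiling - gap * n ** -rate
    return value_per_point * accuracy * 100 - cost_per_sample * n


# Hypothetical candidate dataset sizes and economics.
candidates = [100, 1_000, 10_000, 100_000, 1_000_000]
best = max(candidates,
           key=lambda n: net_value(n, value_per_point=1000,
                                   cost_per_sample=0.05))

for n in candidates:
    print(n, round(net_value(n, value_per_point=1000, cost_per_sample=0.05)))
print("best size under these assumptions:", best)
```

Under these made-up numbers, accuracy keeps creeping toward its ceiling as data grows, but net value peaks at a mid-sized dataset and then falls as data costs outpace the shrinking accuracy gains.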

Michael Krigsman: Anthony, I noticed your reaction. Any quick comments?

Anthony Scriffignano: That’s spot-on economically. Minimal data for maximum impact is the ideal, a practical counterpoint to my earlier theoretical perspective. We must consider data, people, and infrastructure costs.

Michael Krigsman: A comment from Gus Bekdash, "Feeding cow remnants to cows resulted in mad cow disease. What's the AI equivalent when its output is fed back as input?"

Satyam Priyadarshy: It’s not just about AI output as input. Any input data requires context. If the context is wrong, the output narrative is also wrong. The source of the input—AI-generated or otherwise—doesn't change this fundamental requirement. Using a model’s output as input for another model creates a new narrative shaped by the original data. Context remains paramount.

Anthony Scriffignano: Generative AI output as input creates a feedback loop, potentially amplifying biases and ignoring context. The consequences aren't fully known yet, but the situation is developing rapidly. Misinformation (repeating heard but unverified information) and disinformation (deliberately spreading falsehoods) exemplify this. Generative AI amplifies both at hypergeometric speeds.

Gen AI, consuming and generating content that mimics human creation (articles, posts), spreads these untruths. Algorithms reproduce and disseminate these embedded inaccuracies. This phenomenon is real and concerning; its full impact is yet to unfold. It’s a serious issue demanding attention.

This situation demands immediate attention and further research.
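Editor's illustration: one much-simplified way to see why feeding a model's output back in as its input can lose information is to refit a simple model, generation after generation, only on samples drawn from its own previous fit. This is a toy sketch under stated assumptions (a one-dimensional Gaussian "model", ten data points, 200 generations, a hypothetical function name `recursive_refit`); it does not model generative AI systems or misinformation directly.

```python
import random
import statistics


def recursive_refit(initial_data, generations, rng):
    """Repeatedly fit a Gaussian to the data, then replace the data with
    samples drawn from that fit. Returns the fitted std per generation."""
    data = list(initial_data)
    stds = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate
        stds.append(sigma)
        # The next "generation" sees only the previous model's output.
        data = [rng.gauss(mu, sigma) for _ in range(len(data))]
    return stds


rng = random.Random(42)
real_data = [rng.gauss(0.0, 1.0) for _ in range(10)]
stds = recursive_refit(real_data, generations=200, rng=rng)
print(f"fitted std, generation 0: {stds[0]:.3f}")
print(f"fitted std, generation 199: {stds[-1]:.3g}")
```

In this toy setup each refit slightly underestimates the spread on average, and because no fresh real-world data ever enters the loop, the errors compound: the fitted distribution narrows and drifts away from the original data.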

Michael Krigsman: Are data scientists more susceptible to spreading disinformation, misinformation, or simply confusing information?

Satyam Priyadarshy: Data scientists' role is to reveal the data's story, accurately and objectively.

Anthony Scriffignano: But if the data itself contains inaccuracies?

Satyam Priyadarshy: Yes.

Anthony Scriffignano: If the data contains lies and lacks the rigorous validation I discussed, data scientists inadvertently contribute to the problem. Veracity checks are essential.

Michael Krigsman: My apologies to data scientists for implying a tendency to distort data. That was unintentional.

Anthony Scriffignano: Data is indifferent to your feelings, Michael.

Michael Krigsman: Now I feel bad, insulted even.

Okay, a question from Hue Hoang: "What practical steps can leadership take to create a data-driven culture while managing computational costs?"

Anthony Scriffignano: A data-driven culture will look different in a chewing gum company versus an analytics firm. It depends on the company’s core business and its data-centricity. Another factor is the leadership's perspective: how receptive are they to change and data-driven approaches? The datasphere is growing exponentially. Simply accumulating data isn't enough; we’re already drowning in it.

Sense-making becomes paramount. A data-driven culture involves extracting actionable insights from data and aligning them with organizational needs. This makes your data valuable. The focus then shifts away from infrastructure costs as leaders recognize the value. Satyam phrased this better: demonstrate the data’s value.

Satyam Priyadarshy: When discussing data with leadership, context is key. Frame the discussion within the business context and link it to key performance indicators (KPIs). Demonstrate how data-driven solutions can positively impact those KPIs. This increases buy-in, enabling you to build a data-centric organization.

Michael Krigsman: Arsalan Khan asks: "Given the overlap in data collected by different organizations, why not have a centralized repository accessible to all?"

Satyam Priyadarshy: Data sharing depends on the specific industry and business context. A universal data repository isn't always feasible or desirable. Healthcare data, for instance, involves privacy concerns. While various entities (pharmacies, doctors, labs) possess patient data, sharing it centrally raises significant privacy and security issues.

Similar concerns exist in other sectors, like the energy industry, with its proprietary information. However, some areas lend themselves better to shared data. Face recognition technology, fueled by vast public datasets from social media and other sources, benefits from this broader availability. This was unimaginable 20 years ago, with limited image data. Data sharing possibilities vary significantly depending on the field and context.

Some areas are amenable to central repositories, while others present serious challenges. Caution is warranted. The academic world has explored centralized repositories (like DSpace for research papers), with varying success.

Michael Krigsman: The Mayo Clinic, with Dr. John Halamka (a past CXOTalk guest), is developing a federated data system for medical research. Participating centers share data while maintaining control, allowing researchers to query across datasets on specific diseases.

This addresses some of Arsalan's concerns, incorporating data protection measures and adhering to HIPAA regulations. For those interested in this model, I encourage further investigation.

One final question, from Roland Coffee on LinkedIn: "Anthony, how can bias introduced by data scientists be removed?" Time is short, so please be brief.

Anthony Scriffignano: Bias removal often introduces new bias. Correcting for insufficient or manipulated data is necessary, but inherent data biases should be understood and factored into the analytical process. Altering data beyond standard normalization and statistical adjustments risks introducing unintended biases and consequences.

Understanding and acknowledging bias is key. Address bias stemming from identifiable causes that you can fix. Then, incorporate this understanding into your analyses and consider the acceptable margin of error in your decisions.

Michael Krigsman: Satyam, the final word on eliminating data scientist bias.

Satyam Priyadarshy: Data scientists don't create bias. They discover it. Bias exists within the data itself. The task isn't about removing our bias, but understanding the source of the data's bias. Was it introduced through manipulation, or does it reflect a genuine aspect of the underlying process?

Is it a true outlier event, or a misinterpretation? The key is to identify the cause of the bias, not to "remove" something that may reflect a genuine phenomenon. As data scientists, we should aim to expose, not erase, these anomalies.

Michael Krigsman: Can a less experienced or less meticulous data scientist inadvertently introduce bias? Or are these sources of bias entirely separate, as you suggest?

Anthony Scriffignano: Bias is a complex issue, encompassing various types. A less experienced data scientist certainly could introduce bias. We may be thinking about different aspects here.

The question itself carries a bias, assuming younger or newer data scientists are more prone to introducing bias through their methodologies.

Human beings inherently possess biases. It's how our brains function. We constantly filter information, choosing what to focus on and what to ignore. This selective attention is a survival mechanism, essential for navigating our complex world.

This inherent human bias inevitably influences our approach to problems, even in scientific endeavors. We strive for objectivity, but remain human. If our task simply involves calculations, then bias might be less impactful. But in complex, dynamic systems dealing with disruption, our humanity, with its inherent biases, comes into play. We must acknowledge and address this.

Michael Krigsman: Satyam, your perspective on this apparent disagreement?

Satyam Priyadarshy: We, as data scientists, don't introduce bias. We identify it. No data scientist inherently creates bias. The bias resides in the data. Different models might interpret the data differently, creating varied narratives, but the core bias remains the data’s, not the scientist's.

So, two out of two data scientists agree...

Michael Krigsman: Disagree, I think.

Anthony Scriffignano: Respectfully, we're perhaps emphasizing different facets of a multifaceted issue. I believe we are in violent agreement, focusing on different aspects of the same problem.

Michael Krigsman: I detect more of a "violent disagreement."

Anthony Scriffignano: Think of it like examining different parts of an elephant. Each perspective is valid, but incomplete without the others.

Michael Krigsman: On that note, a reminder to subscribe to our newsletter at cxotalk.com for updates and show notifications. And subscribe to our YouTube channel for more in-depth discussions.

A sincere thank you to both Dr. Anthony Scriffignano and Dr. Satyam Priyadarshy for their valuable insights. I truly appreciate you both sharing your expertise with us today.

Anthony Scriffignano: Thank you. It’s been a pleasure.

Michael Krigsman: Thank you to everyone for the excellent questions. Subscribe to the newsletter for updates on future shows! Visit cxotalk.com; we have extraordinary guests lined up.

Take care, and see you next time! Goodbye.

Published Date: Oct 18, 2024

Author: Michael Krigsman

Episode ID: 856