Category Archives: Data Science

Politics, Machine Learning, and Zoom Conferences in a Pandemic: A Conversation with an Undergraduate Researcher – Caltech

In every election, after the polls close and the votes are counted, there comes a time for reflection. Pundits appear on cable news to offer theories, columnists pen op-eds with warnings and advice for the winners and losers, and parties conduct postmortems.

The 2020 U.S. presidential election in which Donald Trump lost to Joe Biden was no exception.

For Caltech undergrad Sreemanti Dey, the election offered a chance to do her own sort of reflection. Dey, an undergrad majoring in computer science, has a particular interest in using computers to better understand politics. Working with Michael Alvarez, professor of political and computational social science, Dey used machine learning and data collected during the 2020 election to find out what actually motivated people to vote for one presidential candidate over another.

In December, Dey presented her work on the topic at the fourth annual International Conference on Applied Machine Learning and Data Analytics, which was held remotely; her paper was recognized by the organizers as the best at the conference.

We recently chatted with Dey and Alvarez, who is co-chair of the Caltech-MIT Voting Project, about their research, what machine learning can offer to political scientists, and what it is like for undergrads doing research at Caltech.

Sreemanti Dey: I think that how elections are run has become a really salient issue in the past couple of years. Politics is in the forefront of people's minds because things have gotten so, I guess, strange and chaotic recently. That, along with a lot of factors in 2020, made people care a lot more about voting. That makes me think it's really important to study how elections work and how people choose candidates in general.

Sreemanti: I've learned from Mike that a lot of social science studies are deductive in nature. So, you pick a hypothesis and then you pick the data that would best help you understand the hypothesis that you've chosen. We wanted to take a more open-ended approach and see what the data itself told us. And, of course, that's precisely what machine learning is good for.

In this particular case, it was a matter of working with a large amount of data that you can't filter through yourself without introducing a lot of bias. And that could be just you choosing to focus on the wrong issues. Machine learning and the model that we used are a good way to reduce the amount of information you're looking at without bias.

Basically it's a way of reducing high-dimensional data sets to the most important factors in the data set. So it goes through a couple steps. It first groups all the features of the data into these modules so that the features within a module are very correlated with each other, but there is not much correlation between modules. Then, since each module represents the same type of features, it reduces how many features are in each module. And then at the very end, it combines all the modules together and then takes one last pass to see if it can be reduced by anything else.
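To make the procedure concrete, here is a deliberately simplified sketch in Python. It is not the fuzzy forests implementation used in the paper: the module grouping below is a greedy correlation threshold rather than proper clustering, and the within-module screen uses correlation with the outcome as a stand-in for the random-forest importances the real method relies on. All function and variable names are invented for illustration.

```python
import numpy as np

def reduce_features(X, y, corr_threshold=0.7, keep_per_module=1, keep_final=2):
    """Module-based feature screening, loosely in the spirit of fuzzy forests.

    1. Group features into modules of mutually correlated columns.
    2. Within each module, keep only the most outcome-relevant features.
    3. Pool the survivors and make one final screening pass.
    """
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))

    # Step 1: greedy grouping -- a feature joins a module when it is
    # highly correlated with that module's first ("seed") feature.
    modules, assigned = [], set()
    for i in range(n_features):
        if i in assigned:
            continue
        module = [i] + [j for j in range(i + 1, n_features)
                        if j not in assigned and corr[i, j] >= corr_threshold]
        assigned.update(module)
        modules.append(module)

    # Relevance proxy: |correlation with the outcome| (the real method
    # uses random-forest variable importances here instead).
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])

    # Step 2: screen within each module.
    survivors = []
    for module in modules:
        ranked = sorted(module, key=lambda j: relevance[j], reverse=True)
        survivors.extend(ranked[:keep_per_module])

    # Step 3: one last pass over the pooled survivors.
    survivors.sort(key=lambda j: relevance[j], reverse=True)
    return survivors[:keep_final]

# Synthetic demo: features 0 and 1 are near-duplicates driven by one
# latent signal; features 2 and 3 are uncorrelated noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=500),
    latent + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
    rng.normal(size=500),
])
y = latent + 0.5 * rng.normal(size=500)

selected = reduce_features(X, y)
print(selected)  # one feature from the correlated module, plus one other
```

The point of the grouping step is that the two redundant copies of the latent signal never compete against the unrelated features directly; only one representative of their module survives to the final pass.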

Mike: This technique was developed by Christina Ramirez (MS '96, PhD '99), a PhD graduate of our program now at UCLA. Christina is someone who I've collaborated with quite a bit. Sreemanti and I were meeting pretty regularly with Christina and getting some advice from her along the way about this project and some others that we're thinking about.

Sreemanti: I think we got pretty much what we expected, except for what the most partisan-coded issues are. Those I found a little bit surprising. The most partisan questions turned out to be about filling the Supreme Court seats. I thought that it was interesting.

Sreemanti: It's really incredible. I find it astonishing that a person like Professor Alvarez has the time to focus so much on the undergraduates in lab. I did research in high school, and it was an extremely competitive environment trying to get attention from professors or even your mentor.

It's a really nice feature of Caltech that professors are very involved with what their undergraduates are doing. I would say it's a really incredible opportunity.

Mike: I and most of my colleagues work really hard to involve the Caltech undergraduates in a lot of the research that we do. A lot of that happens in the SURF [Summer Undergraduate Research Fellowship] program in the summers. But it also happens throughout the course of the academic year.

What's a little bit unusual here is that undergraduate students typically take on smaller projects. They typically work on things for a quarter or a summer. And while they do a good job on them, they don't usually reach the point where they produce something that's potentially publication quality.

Sreemanti started this at the beginning of her freshman year and we worked on it through her entire freshman year. That gave her the opportunity to really learn the tools, read the political science literature, read the machine learning literature, and take this to a point where at the end of the year, she had produced something that was of publication quality.

Sreemanti: It was a little bit strange, first of all, because of the time zone issue. This conference was in a completely different time zone, so I ended up waking up at 4 a.m. for it. And then I had an audio glitch halfway through that I had to fix, so I had some very typical Zoom-era problems and all that.

Mike: This is a pandemic-era story with how we were all working to cope and trying to maintain the educational experience that we want our undergraduates to have. We were all trying to make sure that they had the experience that they deserved as a Caltech undergraduate and trying to make sure they made it through the freshman year.

We have the most amazing students imaginable, and to be able to help them understand what the research experience is like is just an amazing opportunity. Working with students like Sreemanti is the sort of thing that makes being a Caltech faculty member very special. And it's a large part of the reason why people like myself like to be professors at Caltech.

Sreemanti: I think I would want to continue studying how people make their choices about candidates, but maybe in a slightly different way with different data sets. Right now, from my other projects, I think I'm learning how to not rely on surveys and rely on more organic data, for example, from social media. I would be interested in trying to find a way to study people's candidate choice from their more organic interactions with other people.

Sreemanti's paper, titled "Fuzzy Forests for Feature Selection in High-Dimensional Survey Data: An Application to the 2020 U.S. Presidential Election," was presented in December at the fourth annual International Conference on Applied Machine Learning and Data Analytics, where it won the best paper award.


Startup Spotlight: This woman-led Triangle startup is democratizing big data to drive sales for businesses – WRAL TechWire

Editor's note: Startup Spotlight is a regular feature at WRAL TechWire designed to bring attention to potential emerging stars in the North Carolina innovation economy.

+++

DURHAM - Working as part of the Peace Corps in Cameroon in the early 2000s, Meghan Corroon witnessed first-hand the need for reliable data. Without it, she recalled, critical funding was wasted trying to bring clean water to villages.

Years later, she saw similar challenges while running large-scale projects for the Bill & Melinda Gates Foundation, as well as for several national governments in Africa and Asia.

"I was increasingly frustrated with the pace of innovation," she told WRAL TechWire.

So she founded Clerdata, formerly Lumen Insights. It's a Durham-based data science startup that aims to bring change, this time to the corporate world, by ending ineffective marketing strategies.

Rather than a cluttered dashboard that still doesn't answer the most important questions, Corroon said, Clerdata's product offers sophisticated modeling that is fast, lean, and automated.

In a survey of 1,000 marketers worldwide by Rakuten Marketing, respondents estimated they waste an average of 26% of their budgets on ineffective channels and strategies. And about half of respondents said they misspend at least 20% of their budgets.

Corroon, who interviewed businesses across industries for over a year before launching, said she heard similar testimonies from very small to multi-billion-dollar businesses. In today's world of advanced data science and business insights, she said she couldn't believe this was still the case: "We built Clerdata to tackle this problem."


Clerdata offers a marketing tool that tells customers which digital marketing channels are driving their sales. It operates using a proprietary algorithm that includes statistical modeling to measure the effectiveness of marketing actions on sales in brick-and-mortar stores, as well as e-commerce. This allows customers to move more quickly from insights to business decisions, Corroon said: "We believe that's the future of SaaS data products."

It also avoids using consumer web tracking data.

This is a huge deal, as Facebook lost $10 billion in ad revenue just last quarter due to this increasingly faulty data pipeline. The problem affects almost every consumer business in our modern economy.

Clerdata is already reaping the rewards. Since launching in 2018, its team has grown to five employees, and revenue is projected to grow more than 700% this year.


It has a growing list of clients, including Raleigh-based Videri Chocolate Factory; nationally distributed Mary's Gone Crackers; Peacock Alley, a luxury linens company; and Healthy Living Market, an independent grocery store chain, among others.

To date, the startup remains bootstrapped with no help from outside investors. But Corroon said she won't rule that out in the future.

"We're hyper-focused on running our own race and serving our customers a product with integrity," she said. "We're excited for the future of this adventure."


The Role Of Data Analytics In Ensuring Business Continuity – CIO Applications


Fremont, CA: As the world struggles to cope with the pandemic's aftermath, businesses are looking for new technologies to help them streamline operations and ensure continuity in the coming days. Business leaders are embracing data analytics, AI, and data science to improve processes, become more scalable, and plan for the future.

Let us look at how data analytics is playing a critical role in ensuring business continuity.

Workforce Management

The majority of teams now work remotely or on a rotating basis. In this situation, it is critical for a company to manage their workforce optimally so that only the necessary resources are available when they are required. Businesses can accurately predict how many workers are needed on a daily, weekly, or monthly basis by using data analytics. They can forecast the volume of work and assign teams accordingly. They can also remotely monitor the performance of the teams to ensure maximum efficiency. In the BPO industry, data analytics for workforce management is widely used.

Data Security

Data security has become a major concern for businesses as teams work remotely and office networks are exposed to external environments. Businesses are using data analytics to monitor system logs and identify anomalies as they transition to new, digital ways of working. IT teams use analytics to monitor all users and applications and to detect and mitigate system threats in real time. They can set up automated workflows for deviations and trigger the next set of actions without requiring any dependencies or waiting time.
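As a toy illustration of this kind of log-based anomaly detection (the numbers, threshold, and names below are invented, not from the article), one simple approach flags time buckets whose event counts deviate sharply from the rest of the series:

```python
import statistics

def flag_anomalies(counts, z_threshold=2.5):
    """Return indices of buckets whose count is a z-score outlier."""
    mean = statistics.fmean(counts)
    sd = statistics.pstdev(counts)
    if sd == 0:
        return []  # perfectly flat series: nothing to flag
    return [i for i, c in enumerate(counts)
            if abs(c - mean) / sd > z_threshold]

# Hypothetical hourly failed-login counts; the spike at hour 5 stands out.
hourly_failed_logins = [12, 9, 11, 10, 13, 160, 11, 12]
print(flag_anomalies(hourly_failed_logins))  # [5]
```

A real deployment would use rolling windows and per-signal baselines, and could trigger the automated workflows the article mentions whenever the function returns a non-empty list.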

Forecasting and Risk Mitigation

Business data is a goldmine of information that can assist business leaders in a variety of ways. Business leaders can detect problems before they occur by using AI and ML-powered data analytics. Enterprises are also devoting time and effort to analyzing historical data from the COVID timeline to determine how deeply and where the business was impacted. They can predict what factors will cause interruptions and take proactive steps to control them by using analytics and data science.


Iterative to Launch Open Source Tool, First to Train Machine Learning Models on Any Cloud Using HashiCorp’s Terraform Solution – Business Wire

SAN FRANCISCO--(BUSINESS WIRE)--Iterative, the MLOps company dedicated to streamlining the workflow of data scientists and machine learning (ML) engineers, today announced a new open source compute orchestration tool using Terraform, a solution by HashiCorp, Inc., the leader in multi-cloud infrastructure automation software.

Terraform Provider Iterative (TPI) is the first product on HashiCorp's Terraform technology stack to simplify ML training on any cloud, helping infrastructure and ML teams save significant time and money in maintaining and configuring their training resources.

Built on Terraform by HashiCorp, an open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services, TPI allows data scientists to deploy workloads without having to figure out the infrastructure.

Data scientists oftentimes need a lot of computational resources when training ML models. This may include expensive GPU instances that need to be provisioned for a training experiment and then de-provisioned to save on costs. Terraform helps teams specify and manage compute resources. TPI complements Terraform with additional functionality customized for machine learning use cases.

With TPI, data scientists only need to configure the resources they need once and are able to deploy anywhere and everywhere in minutes. Once it is configured as part of an ML model experiment pipeline, users can deploy on AWS, GCP, Azure, on-prem, or with Kubernetes.
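As a rough illustration of what such a configuration might look like, here is a minimal Terraform sketch. The `iterative_task` resource name and its fields (`cloud`, `machine`, `spot`, `script`, `storage`) are taken from TPI's published examples but are assumptions here, and should be checked against the current TPI documentation:

```hcl
# Hypothetical TPI task: provision a spot instance on AWS, run a
# training script, sync outputs back, then de-provision automatically.
resource "iterative_task" "train" {
  cloud   = "aws"   # could also be "gcp", "az", or "k8s"
  machine = "m+gpu" # a medium instance with a GPU
  spot    = 0       # bid the current spot price

  storage {
    workdir = "."        # local directory to upload
    output  = "results"  # directory to sync back when the task ends
  }

  script = <<-END
    #!/bin/bash
    pip install -r requirements.txt
    python train.py
  END
}
```

Once applied with `terraform apply`, the same configuration can in principle be pointed at a different cloud by changing the `cloud` field, which is the portability the announcement describes.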

"We chose Terraform as the de facto standard for defining the infrastructure-as-code approach," said Dmitry Petrov, co-founder and CEO of Iterative. "TPI extends Terraform to fit with machine learning workloads and use cases. It can handle spot instance recovery and lets ML jobs continue running on another instance when one is terminated."

To learn more about TPI, visit the blog.

About Iterative

Iterative.ai, the company behind Iterative Studio and popular open-source tools DVC and CML, enables data science teams to build models faster and collaborate better with data-centric machine learning tools. Iterative's developer-first approach to MLOps delivers model reproducibility, governance, and automation across the ML lifecycle, all integrated tightly with software development workflows. Iterative is a remote-first company, backed by True Ventures, Afore Capital, and 468 Capital. For more information, visit Iterative.ai.


Why it's important for Known to educate DTC clients (and their CFOs) on understanding the value of brand advertising – Digiday

Independent agency Known is not your run-of-the-mill shop, having built itself largely on the back of data science as a way to exceed creative expectations through the life of a campaign. Two-year-old Known, created by Kern Schireson and Ross Martin, two ex-media conglomerate executives (along with co-founders Brad Roth and Mark Feldstein), enjoyed a strong 2021, growing its employee count while adding a number of new clients including Grubhub, Beautycounter, Talkspace, Dapper Labs, and Invitation Homes. The next push is to land clients in the Web3.0 space.

But the agency is also refining its data science skills while helping clients (particularly in the DTC space) understand the value of brand advertising, and helping their CFOs understand that value as well. Nathan Hugenberger, Known's executive vp and chief technical officer, shared his thoughts with Digiday about those efforts.

The following interview has been edited for space and clarity.

How is Known leveraging clean rooms?

Clean rooms basically provide us with the ability to use the first-party data the client has to better target the planning, and better target the buy. And it's definitely made it possible to unlock performance when you've got a great data set. It's one of the tools that we think about when we're thinking about how to drive better performance for a client. Like, can we build something on top of their dataset? Or if it's a newer brand and their first-party data is still growing, how can we design things so that we are growing the data and adding to it over time, whether it's through experimentation or better telemetry?

Skeptic (Known's proprietary test-and-learn software to assess and optimize client campaigns) has been your secret weapon. How does it work?

We have all of the client's first-party data, all their research, all this new research we bring to the process to develop the campaign concept and develop the initial batch of creative. But we like to essentially design the campaign to be a learning campaign, a campaign that's going to help us learn and optimize and get better over time. And in that process, we can literally design this so that, in addition to being a marketing campaign, it's also a research campaign on what creative does well. And use that to literally say, "We know that when you do this, it works better, and when you do that, it doesn't." The idea that the use of data and insights stops at the brief is very old school. The day we know the least is the day when we launch, right? We only can know more after that.

It's not just why we think about our data science and technology investments the way we do. It's vital that the creative team is calling our data scientists and saying, "Which creative did well? Which one won?" When you gamify the system and are using the client's KPIs as your scoreboard, it all works a lot better.

How does Skeptic help clients improve both performance- and brand-driven advertising?

There's a whole class of marketers and advertisers who've grown up digital- and social-first, and their organizations, their boards, their leadership are really used to those kinds of metrics that you get from those kinds of systems. But once they reach a certain scale, they need to start investing in brands, they need to start building awareness. How do you teach an organization how to spend millions on brand advertising, and feel really good about it, when up until now they've been used to having, like, eight dials and speedometers every day? And if it's not showing immediate results, they're like, "Why are we spending this money?"

Aspects of this come up all the time, whether it's thinking about how to advise a client in terms of moving them and their organization into investing more in the long term. Or how to use science and data in sort of experimentation to make the brand stuff more measurable and, hence, better set up to be accountable to the board around the budget you're getting.

Part of that is helping the CMO speak the language of the CFO, though, right?

It's absolutely critical these days for CMOs to really be in tight collaboration with the CFO. We actually spend a lot of time thinking about how we make sure that we understand the internal dynamics that are going on around budgeting and KPIs. Some of this ends up getting into a little bit of management consulting. For an agency to do its best work as a partner to a company, it really needs to understand what's on the CFO's wish list. How do you make sure that you're really designing the approach in a campaign to support people on that journey? Can we run tests that answer questions that are critical for the C-suite? Having that information may mean that they continue down the right path or can make key decisions.



New book on Kaggle released with inputs from over 30 experts – Analytics India Magazine

The new Kaggle book, The Kaggle Book: Data analysis and machine learning for competitive data science, written by Konrad Banachewicz and Luca Massaron, has been released. It is suitable both for newcomers to Kaggle and for veteran users. Data analysts and scientists who want to perform better in Kaggle competitions and secure jobs with tech giants will also find it useful.

First of its kind, The Kaggle Book assembles the techniques and skills a participant needs to succeed in competitions, data science projects, and beyond. In it, two Kaggle Grandmasters walk you through the modelling strategies they have accumulated along the way, details that are not easily found elsewhere.

The book also offers Kaggle-specific tips: participants will learn techniques for approaching tasks based on image, tabular, and textual data, as well as reinforcement learning. It covers designing validation schemes and working comfortably with different evaluation metrics. Whether you want to climb the ranks of Kaggle, build data science skills, or improve the accuracy of your existing models, the book can help.


Estimating the informativeness of data | MIT News | Massachusetts Institute of Technology – MIT News

Not all data are created equal. But how much information is any piece of data likely to contain? This question is central to medical testing, designing scientific experiments, and even to everyday human learning and thinking. MIT researchers have developed a new way to solve this problem, opening up new applications in medicine, scientific discovery, cognitive science, and artificial intelligence.

In theory, the 1948 paper "A Mathematical Theory of Communication" by the late MIT Professor Emeritus Claude Shannon answered this question definitively. One of Shannon's breakthrough results is the idea of entropy, which lets us quantify the amount of information inherent in any random object, including random variables that model observed data. Shannon's results created the foundations of information theory and modern telecommunications. The concept of entropy has also proven central to computer science and machine learning.

The challenge of estimating entropy

Unfortunately, the use of Shannon's formula can quickly become computationally intractable. It requires precisely calculating the probability of the data, which in turn requires calculating every possible way the data could have arisen under a probabilistic model. If the data-generating process is very simple (for example, a single toss of a coin or roll of a loaded die), then calculating entropies is straightforward. But consider the problem of medical testing, where a positive test result is the result of hundreds of interacting variables, all unknown. With just 10 binary unknowns, there are already about 1,000 possible explanations for the data. With a few hundred, there are more possible explanations than atoms in the known universe, which makes calculating the entropy exactly an unmanageable problem.
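A small Python sketch makes both points concrete: for a simple distribution such as a loaded die, Shannon's formula is a one-liner, but the number of joint states that would have to be enumerated grows as 2^n with n binary unknowns:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A loaded die: six outcomes, trivial to handle exactly.
loaded_die = [0.25, 0.25, 0.125, 0.125, 0.125, 0.125]
print(entropy_bits(loaded_die))  # 2.5

# But with n binary unknowns, the joint distribution has 2**n states.
for n in (10, 50, 300):
    print(n, 2**n)
```

At n = 10 the count is 1,024 (the "1,000 possible explanations" above); at n = 300 it already exceeds the number of atoms in the known universe.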

MIT researchers have developed a new method to estimate good approximations to many information quantities, such as Shannon entropy, by using probabilistic inference. The work appears in a paper presented at AISTATS 2022 by authors Feras Saad '16, MEng '16, a PhD candidate in electrical engineering and computer science; Marco Cusumano-Towner PhD '21; and Vikash Mansinghka '05, MEng '09, PhD '09, a principal research scientist in the Department of Brain and Cognitive Sciences. The key insight is, rather than enumerate all explanations, to instead use probabilistic inference algorithms to first infer which explanations are probable and then use these probable explanations to construct high-quality entropy estimates. The paper shows that this inference-based approach can be much faster and more accurate than previous approaches.

Estimating entropy and information in a probabilistic model is fundamentally hard because it often requires solving a high-dimensional integration problem. Many previous works have developed estimators of these quantities for certain special cases, but the new estimators of entropy via inference (EEVI) offer the first approach that can deliver sharp upper and lower bounds on a broad set of information-theoretic quantities. An upper and lower bound means that although we don't know the true entropy, we can get a number that is smaller than it and a number that is higher than it.

"The upper and lower bounds on entropy delivered by our method are particularly useful for three reasons," says Saad. "First, the difference between the upper and lower bounds gives a quantitative sense of how confident we should be about the estimates. Second, by using more computational effort we can drive the difference between the two bounds to zero, which squeezes the true value with a high degree of accuracy. Third, we can compose these bounds to form estimates of many other quantities that tell us how informative different variables in a model are of one another."
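The paper's EEVI estimators are beyond a short snippet, but the basic idea of replacing exact enumeration with sampling can be shown in miniature. Since H(X) is the expected value of -log2 p(X), averaging that quantity over draws from p gives an unbiased estimate that tightens as the sample grows. This simple scheme assumes we can evaluate p exactly, which the harder settings addressed by the paper do not allow:

```python
import math
import random

def mc_entropy_bits(probs, n_samples=100_000, seed=0):
    """Monte Carlo estimate of H(X): average -log2 p(x) over x ~ p."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return sum(-math.log2(probs[x]) for x in draws) / n_samples

probs = [0.5, 0.25, 0.125, 0.125]
exact = -sum(p * math.log2(p) for p in probs)  # 1.75 bits
print(exact, mc_entropy_bits(probs))  # the estimate lands close to 1.75
```

Doubling the sample size shrinks the estimator's standard error by a factor of about the square root of two, which mirrors the trade-off Saad describes between computation and the gap between bounds.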

Solving fundamental problems with data-driven expert systems

Saad says he is most excited about the possibility that this method gives for querying probabilistic models in areas like machine-assisted medical diagnoses. He says one goal of the EEVI method is to be able to solve new queries using rich generative models for things like liver disease and diabetes that have already been developed by experts in the medical domain. For example, suppose we have a patient with a set of observed attributes (height, weight, age, etc.) and observed symptoms (nausea, blood pressure, etc.). Given these attributes and symptoms, EEVI can be used to help determine which medical tests for symptoms the physician should conduct to maximize information about the absence or presence of a given liver disease (like cirrhosis or primary biliary cholangitis).
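A stylized version of that test-selection question (all probabilities below are invented for illustration) scores each candidate test by the mutual information between the disease and the test result, and runs the higher-scoring test first:

```python
import math

def mutual_information_bits(joint):
    """I(D; T) in bits from a joint table joint[d][t] = p(d, t)."""
    p_d = [sum(row) for row in joint]
    p_t = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (p_d[i] * p_t[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

# Hypothetical joints p(disease, result) for two candidate tests.
test_a = [[0.08, 0.02],   # disease present: mostly positive results
          [0.10, 0.80]]   # disease absent:  mostly negative results
test_b = [[0.05, 0.05],   # result carries no information about disease
          [0.45, 0.45]]

print(mutual_information_bits(test_a))  # informative: roughly 0.15 bits
print(mutual_information_bits(test_b))  # 0.0
```

For test B the result is independent of the disease, so its mutual information is exactly zero; a physician following this criterion would order test A. EEVI's contribution is estimating such quantities when the joint table is far too large to write down.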

For insulin diagnosis, the authors showed how to use the method for computing optimal times to take blood glucose measurements that maximize information about a patient's insulin sensitivity, given an expert-built probabilistic model of insulin metabolism and the patient's personalized meal and medication schedule. As routine medical tracking like glucose monitoring moves away from doctors' offices and toward wearable devices, there are even more opportunities to improve data acquisition, if the value of the data can be estimated accurately in advance.

Vikash Mansinghka, senior author on the paper, adds, "We've shown that probabilistic inference algorithms can be used to estimate rigorous bounds on information measures that AI engineers often think of as intractable to calculate. This opens up many new applications. It also shows that inference may be more computationally fundamental than we thought. It also helps to explain how human minds might be able to estimate the value of information so pervasively, as a central building block of everyday cognition, and help us engineer AI expert systems that have these capabilities."

The paper, "Estimators of Entropy and Information via Inference in Probabilistic Models," was presented at AISTATS 2022.


Riskthinking.AI Names Carolyn A. Wilkins to Advisory Board – PR Newswire

Experienced financial leader brings 30+ years of experience to riskthinking.AI team

TORONTO, April 27, 2022 /PRNewswire/ -- Riskthinking.AI, a visionary data analytics provider for assessing climate-based financial risk, announced today that long-time Bank of Canada executive Carolyn A. Wilkins has joined the company's advisory board. Wilkins brings over three decades of experience in public policy and the financial services industry to riskthinking.AI, which works with global financial institutions, assessment firms and custodial banks.

Wilkins spent 20 years with the Bank of Canada, including serving as senior deputy governor from 2014 to 2020. During her tenure, she set monetary and financial system policies with the governing council, oversaw strategic planning and economic research, and was a member of the international Financial Stability Board. She is currently a senior research scholar at Princeton University's Griswold Center for Economic Policy, and a member of Intact Financial Corporation's board of directors. Additionally, she sits on the Bank of England's Financial Policy Committee.

Wilkins' extensive financial sector experience will be a valuable asset to riskthinking.AI as the company continues to evolve and strengthen its ClimateWisdom platform, which combines climate science, economics, data science and software to provide the most comprehensive climate risk data and analytics available. Unlike traditional climate risk solutions, riskthinking.AI enables companies to comprehensively measure and manage climate financial risks in a way that's consistent, science-based, methodologically rigorous and audit-ready.

Wilkins joins an impressive roster of advisors and board members dedicated to fostering the company's success.

Dr. Ron Dembo, CEO, riskthinking.AI, said: "Ms. Wilkins adds a perspective that riskthinking.AI needs to tailor our initiatives for the financial sector. She has made important contributions to international financial policies throughout her career and will continue to have meaningful impact through our advisory board."

Carolyn Wilkins, advisory board member, riskthinking.AI, said: "Financial institutions and other businesses need better data and methods to identify, assess and manage climate risks to make informed decisions. I look forward to working with Dr. Dembo and his team, as well as the other strategic advisors, to help develop their initiatives and enable more entities to take advantage of riskthinking.AI's innovative technology."

About riskthinking.AI

Riskthinking.AI is a visionary data analytics provider for assessing climate-based financial risk. Designed for global financial institutions, assessment firms, and custodial banks, riskthinking.AI provides the most comprehensive climate risk analysis currently available, based on hard science and mathematics. Anchored by founder Dr. Ron Dembo's 40 years of experience in algorithmic modeling, riskthinking.AI applies sophisticated stochastic prediction models to comprehensive climate data to help financial institutions apply climate risk ratings to investment assets, vehicles and portfolios for clients to make informed investment decisions about the future. Visit http://www.riskthinking.ai

SOURCE Riskthinking.AI


Dr. Tina Hernandez-Boussard: Data Science as a Path to Inclusivity and Diversity in Medicine – Ms. Magazine

Growing up in a rural community, Tina Hernandez-Boussard never thought she would go on to earn a Ph.D., much less be at the forefront of a new field intent on solving the inequities of our healthcare system through data science. However, with the support of a mentor who recognized her potential and encouraged her pursuits, Dr. Hernandez-Boussard, now a professor of medicine and biomedical data science at Stanford University, leads efforts utilizing data in medicine to better serve people from all demographics, not only those who have traditionally been the focus of biomedical research.

For Hernandez-Boussard, solving the inequities within our healthcare system is only possible when we ensure that the people who collect, analyze, and interpret data to make decisions are as diverse as those who will be affected by those decisions. Not only does this make healthcare more equitable, it also creates more empathetic medicine. Through merging health and data science, Hernandez-Boussard is uniquely situated to understand both the challenges and the opportunities in biomedicine that she and other advocates for equity in health care confront. In the wake of a pandemic that drew attention to the numerous inequities in our healthcare system for minority and low-income populations, solving these problems is not only an academic venture, but a matter of life and death.

As Hernandez-Boussard observed at last month's Women in Data Science Conference at Stanford University, one of the greatest challenges for data science in healthcare is also its greatest opportunity: creating datasets that include populations and perspectives traditionally excluded from medicine and medical research. Although data science can offer important insights into the problems we face, Hernandez-Boussard reminds us that data analysis techniques, like natural language processing (an interdisciplinary approach to computer science that mines human language for data) and machine learning, only provide answers learned from the data we feed them. When that data is unbalanced, models perform poorly for different populations.

For example, the Boussard Lab has been working to identify depressive symptoms in cancer patients undergoing chemotherapy. While it is relatively straightforward to capture the symptoms of severely depressed patients, intermediate symptoms are harder to discern, especially among diverse populations who might express these symptoms or feelings differently and traditionally haven't been researched. Diverse data scientists have the background to understand how people might communicate these symptoms across culture, gender, race, language and socioeconomic groups. To ask the right questions, data science needs to have diverse problem-solving teams who can better understand patients' voices.
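The point about unbalanced data can be made concrete with a toy experiment. The sketch below uses hypothetical synthetic data (not from the Boussard Lab) and a deliberately naive one-keyword classifier: because the underrepresented group expresses the same condition with different wording, a model trained mostly on the majority group scores perfectly on that group and no better than chance on the other.

```python
from collections import Counter

# Hypothetical synthetic data: (text, group, label). Group "A" dominates
# the training set 9:1, and the two groups express the positive class
# with different keywords -- a toy stand-in for populations that
# describe symptoms differently.
train = (
    [("i feel sad", "A", 1), ("i feel fine", "A", 0)] * 9
    + [("my mood is heavy", "B", 1), ("my mood is fine", "B", 0)]
)

def best_keyword(data):
    # "Train" a one-word model: pick the single word whose presence
    # best separates positive from negative examples.
    scores = Counter()
    for text, _, label in data:
        for word in set(text.split()):
            scores[word] += 1 if label == 1 else -1
    return scores.most_common(1)[0][0]

keyword = best_keyword(train)  # learns "sad", the majority group's cue

def predict(text):
    return 1 if keyword in text.split() else 0

test = [("today i am sad", "A", 1), ("today i am fine", "A", 0),
        ("everything is heavy", "B", 1), ("everything is fine", "B", 0)]

def accuracy(group):
    rows = [(t, l) for t, g, l in test if g == group]
    return sum(predict(t) == l for t, l in rows) / len(rows)

print(f"group A accuracy: {accuracy('A'):.2f}")  # 1.00
print(f"group B accuracy: {accuracy('B'):.2f}")  # 0.50
```

The overall accuracy here would look acceptable, which is exactly how such gaps stay hidden unless performance is reported per group.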

According to Hernandez-Boussard, one of the best ways to improve data-driven medicine is to ensure diverse teams of scientists and clinicians are thinking about the right questions to ask. For example, Hernandez-Boussard recalls the time a hospital asked for an algorithm to predict no-show appointments. Rather than simply creating such an algorithm, Hernandez-Boussard's team challenged the hospital to think about why they wanted to predict no-shows instead of using data to find ways to reduce the barriers that prevent patients from keeping their appointments. In this case, what worked best for the hospital perpetuated circumstances that restrict certain populations from accessing healthcare.

Working with diverse populations allows scientists to challenge preconceived notions of symptoms, diseases and treatments, while also enabling practitioners and patients to work together to overcome histories of harm and misinformation. For data science to effectively rise to the challenge of unraveling bias in healthcare, the task requires an additional type of diversity. Not only must data scientists ensure diverse patient voices are better incorporated into healthcare systems, but data science as a field must also seek strategies for creating diverse team science approaches to problem solving.

In addition to ensuring diversity in gender, race, ethnicity and ability in biomedical data science, Hernandez-Boussard emphasizes the importance of diversity in the backgrounds, professions and fields of study among teams studying problems in medicine. Collaboration across fields is critical, because the complexities of contemporary science and the problems confronting healthcare require multidisciplinary relationships: computer scientists partnering with clinicians, engineers working with statisticians, and social scientists bringing insights from qualitative research.

Data scientists can only rise to the challenge of healthcare inequality and become more collaborative and creative problem solvers by listening to diverse patient voices and engaging in conversations with those who push them outside their comfort zones. As lives continue to be lost as a result of incomplete data sets and single-minded solutions, Hernandez-Boussard's efforts to diversify data in healthcare have the potential to save the lives of many people who have traditionally been left behind by medicine.

Framing Data Science Problems the Right Way From the Start – MIT Sloan

The failure rate of data science initiatives, often estimated at over 80%, is far too high. We have spent years researching the reasons contributing to companies' low success rates and have identified one underappreciated issue: Too often, teams skip right to analyzing the data before agreeing on the problem to be solved. This lack of initial understanding guarantees that many projects are doomed to fail from the very beginning.

Of course, this issue is not a new one. Albert Einstein is often quoted as having said, "If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it."

Consider how often data scientists need to clean up the data on data science projects, often as quickly and cheaply as possible. This may seem reasonable, but it ignores the critical "why" questions: Why is there bad data in the first place? Where did it come from? Does it represent blunders, or legitimate data points that are just surprising? Will it occur in the future? How does the bad data impact this particular project and the business? In many cases, we find that a better problem statement is to find and eliminate the root causes of bad data.
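The difference between the two problem statements is visible even in a few lines of code. In this minimal sketch (the records, sources, and validity rule are all hypothetical), rather than silently discarding bad rows during cleanup, the bad rows are tallied by origin so the root cause can be investigated:

```python
from collections import Counter

# Hypothetical records with a "source" field; ages outside 0-120
# are treated as bad data.
records = [
    {"source": "clinic_a", "age": 34},
    {"source": "clinic_a", "age": -1},
    {"source": "clinic_b", "age": 999},
    {"source": "clinic_b", "age": 999},
    {"source": "clinic_a", "age": 58},
]

def is_valid(rec):
    return 0 <= rec["age"] <= 120

# The quick-and-cheap framing stops here: keep the clean rows, move on.
clean = [r for r in records if is_valid(r)]

# The root-cause framing also asks where the bad rows come from.
# Here the tally points at clinic_b's sentinel value 999 -- a lead
# worth investigating rather than a nuisance to delete.
bad_by_source = Counter(r["source"] for r in records if not is_valid(r))
print(bad_by_source)  # Counter({'clinic_b': 2, 'clinic_a': 1})
```

A few extra lines turn cleanup from a recurring cost into evidence about the upstream process that produces the bad data.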

Too often, we see examples where people either assume that they understand the problem and rush to define it, or they don't build the consensus needed to actually solve it. We argue that a key to successful data science projects is to recognize the importance of clearly defining the problem and to adhere to proven principles in doing so. This problem is not confined to technology teams; we find that many business, political, management, and media projects, at all levels, also suffer from poor problem definition.

Data science uses the scientific method to solve often complex (or multifaceted) and unstructured problems using data and analytics. In analytics, the term "fishing expedition" refers to a project that was never framed correctly to begin with and involves trolling the data for unexpected correlations. This type of data fishing does not meet the spirit of effective data science but is prevalent nonetheless. Consequently, defining the problem correctly needs to be step one. We previously proposed an organizational bridge between data science teams and business units, to be led by an "innovation marshal": someone who speaks the language of both the data and management teams and can report directly to the CEO. This marshal would be an ideal candidate to assume overall responsibility for ensuring that the following proposed principles are utilized.
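A small simulation illustrates why fishing expeditions are hazardous. In this purely synthetic sketch, 200 random "features" have no real relationship to a random outcome, yet screening all of them virtually guarantees that at least one correlates substantially with it by chance alone:

```python
import random

random.seed(0)  # reproducible run

n = 30  # small sample, as in many business datasets
outcome = [random.gauss(0, 1) for _ in range(n)]

def corr(xs, ys):
    # Pearson correlation coefficient, computed by hand.
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# "Fish" through 200 unrelated random features and keep the strongest
# absolute correlation with the outcome.
best = max(
    abs(corr([random.gauss(0, 1) for _ in range(n)], outcome))
    for _ in range(200)
)
print(f"strongest spurious correlation: {best:.2f}")
```

None of these correlations means anything, which is exactly why a project framed only as "find something interesting in the data" tends to deliver impressive-looking noise.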

Get the right people involved. To ensure that your problem framing has the correct inputs, you have to involve, from the beginning, all the key people whose contributions are needed to complete the project successfully. After all, data science is an interdisciplinary team sport. This team should include those who own the problem, those who will provide data, those responsible for the analyses, and those responsible for all aspects of implementation. Think of the RACI matrix (those responsible, accountable, consulted, and informed) for each aspect of the project.

Recognize that rigorously defining the problem is hard work. We often find that the problem statement changes as people work to nail it down. Leaders of data science projects should encourage debate, allow plenty of time, and document the problem statement in detail as they go. This ensures broad agreement on the statement before moving forward.

Don't confuse the problem and its proposed solution. Consider a bank that is losing market share in consumer loans and whose leadership team believes that competitors are using more advanced models. It would be easy to jump to a problem statement that looks something like "Build more sophisticated loan risk models." But that presupposes that a more sophisticated model is the solution to market share loss, without considering other possible options, such as increasing the number of loan officers, providing better training, or combating new entrants with more effective marketing. Confusing the problem and proposed solution all but ensures that the problem is not well understood, limits creativity, and keeps potential problem solvers in the dark. A better statement in this case would be "Research the root causes of market share loss in consumer loans, and propose viable solutions." This might lead to more sophisticated models, or it might not.

Understand the distinction between a proximate problem and a deeper root cause. In our first example, the unclean data is the proximate problem, whereas the root cause is whatever leads to the creation of bad data in the first place. Importantly, "We don't know enough to fully articulate the root cause of the bad data problem" is a legitimate state of affairs, demanding a small-scale subproject.

Do not move past problem definition until it meets the following criteria:

Taking the time needed to properly define the problem can feel uncomfortable. After all, we live and work in cultures that demand results and are eager to get on with it. But shortchanging this step is akin to putting the cart before the horse: it simply doesn't work. There is no substitute for probing more deeply, getting the right people involved, and taking the time to understand the real problem. All of us, data scientists, business leaders, and politicians alike, need to get better at defining the right problem the right way.

Roger W. Hoerl (@rogerhoerl) teaches statistics at Union College in Schenectady, New York. Previously, he led the applied statistics lab at GE Global Research. Diego Kuonen (@diegokuonen) is head of Bern, Switzerland-based Statoo Consulting and a professor of data science at the Geneva School of Economics and Management at the University of Geneva. Thomas C. Redman (@thedatadoc1) is president of New Jersey-based consultancy Data Quality Solutions and coauthor of The Real Work of Data Science: Turning Data Into Information, Better Decisions, and Stronger Organizations (Wiley, 2019).
