Category Archives: Data Science
The alliance aims to train and improve the skills of underrepresented communities seeking opportunity.
February 18, 2021
SoftBank Group Corp., as part of its Academy of Artificial Intelligence (AI), announced on February 18 its support for Data Science for All / Empowerment (DS4A / Empowerment). This alliance aims to train and improve the skills of underrepresented communities seeking opportunities in the field of data science.
Developed by Correlation One, DS4A / Empowerment aims to train 10,000 people over the next three years, giving priority to Afro-descendants, Latinos, women, LGBTQ+ people and United States military veterans, providing new paths to economic opportunity in one of the fastest-growing industries in the world.
The SoftBank AI Academy supports programs that complement the theoretical training of traditional technical education courses with practical lessons, including artificial intelligence and data management skills that can be immediately applied to business needs.
DS4A / Empowerment will provide training to employees of SoftBank Group International portfolio companies, including the Opportunity Fund and Latam Fund, as well as external candidates from the United States and Latin America, including Mexico.
The program is specifically designed to address gender equity and talent gaps in a field that has historically been inaccessible to many people, leading to a significant underrepresentation of women and Afro-descendants. Participants will work on real case studies that are expected to have a measurable impact on the operating performance of participating companies.
IDB Lab, the innovation laboratory of the Inter-American Development Bank Group, will join SoftBank and provide more than 10 full scholarships to underrepresented candidates in Latin America, while Beacon Council will offer 4 full scholarships for underrepresented candidates based in Miami.
Program participants will receive 13 weeks of data and analytics training (including optional Python training) while working on case studies and projects, including projects presented by SoftBank's portfolio of companies. The initiative will also link participants with mentors who will provide career development and guidance. Upon completion of the program, external participants will be connected to employment opportunities at SoftBank and leading companies in the business, financial services, technology, healthcare, consulting and consumer sectors.
DS4A Empowerment is an online program taught in English over a period of 13 weeks. Classes will be held on Saturdays from 10:00 am to 8:00 pm (Eastern Time, ET), beginning April 17, 2021.
The program registration period ends on March 7, 2021. Candidates who should consider applying include employees of SoftBank-affiliated portfolio companies in the region, as well as software engineers, technical product managers, technical marketers and anyone with a STEM background who is interested in learning data analysis. To apply and learn more about the program, interested candidates can visit the official DS4A Empowerment website.
Published on February 22, 2021
(Miami, FL - February 22, 2021) - The City of Miami has been named as a participant in Data Science for All / Empowerment (DS4A / Empowerment), a new effort designed to upskill and prepare job-seekers from underserved communities for data science careers. The initiative is being backed by SoftBank Group International (SoftBank) as part of its AI Academy, and was developed by Correlation One. It aims to train at least 10,000 people from underrepresented communities including Greater Miami over the next three years, providing new pathways to economic opportunity in the world's fastest-growing industries.
"We need talent with a deep understanding of data science to build the companies of the future," said Marcelo Claure, CEO of SoftBank Group International. "We're proud to support this effort, continue to upskill our portfolio companies and train more than 10,000 people from underrepresented communities with critical technical skills."
SoftBank's AI Academy supports programs that supplement theoretical training of traditional technical education courses with practical lessons, including AI and data skills that can be immediately applied to common business needs.
DS4A / Empowerment will provide training for SoftBank Group International portfolio company employees, including portfolio companies of the Opportunity Fund and Latin America Fund, as well as external candidates from the U.S. and Latin America. The program is specifically designed to address talent and equity gaps in a field that has historically been inaccessible for many workers, leading to significant underrepresentation of women and non-white individuals. Participants work on real-world case studies that are expected to have measurable impact on the operational performance of participating companies.
IDB Lab will join SoftBank by providing over ten full-ride Fellowships to underrepresented candidates in Latin America while the Miami-Dade Beacon Council will provide four full-ride fellowships for underrepresented candidates based in Miami.
In addition, The City of Miami will join as an impact partner by providing twenty fellowships to Miami talent and five fellowships to public sector workers.
"As Miami grows as a tech hub, it is important that we empower local entrepreneurs and the public sector to leverage the power of AI. We are proud to support the building of a diverse data-fluent community in Miami through our partnership with Correlation One and SoftBank," said Francis Suarez, the Mayor of Miami.
Participants in the program will receive 13 weeks of data and analytics training (plus optional Python training) while working on case studies and projects, including projects submitted by SoftBank portfolio companies. The initiative will also connect participants with mentors who will provide professional development and career coaching. At the end of the program, external participants will be connected with employment opportunities at SoftBank and leading enterprises across business, financial services, technology, healthcare, consulting, and consumer sectors.
"Miami's success hinges on dramatically expanding opportunity across our community and building a workforce with the skills for the jobs of tomorrow," said Matt Haggman, Executive Vice President of The Beacon Council. "This program is an important step towards creating the innovative and equitable future we can, and must, achieve."
"Training our residents to take on the jobs of the future is critical to ensuring that economic growth is shared across all communities, and to building our local talent so that more leading companies in fields like tech and data science can put down roots in Miami-Dade," said Daniella Levine Cava, Miami-Dade County Mayor. "I'm thrilled that this program is unlocking opportunities in a field that has historically been inaccessible for so many, and creating new, inclusive pathways to prosperity in one of the world's fastest-growing industries."
"The COVID-19 pandemic has both accelerated demand for data science talent and exacerbated the access gaps that kept so many aspiring workers locked out of opportunity," said Rasheed Sabar and Sham Mustafa, Co-CEOs and Co-Founders of Correlation One. "We are grateful to work with innovative employers like SoftBank that are stepping up to play a more direct role in helping the workforce prepare for the jobs of the future."
Program and Registration Details
DS4A Empowerment is an online program delivered in English over a 13-week period. Classes will convene on Saturdays from 10:00am to 8:00pm ET, beginning on April 17, 2021. Registration for the program ends on March 7, 2021. Candidates who should consider applying include employees of SoftBank-affiliated portfolio companies in the region as well as software engineers, technical product managers, technical marketers, and anyone with a STEM background who is interested in learning data analysis. To apply and find out more information about the program, interested candidates can visit the official DS4A Empowerment website: https://c1-web.correlation-one.com/ds4a-empowerment. For program-related inquiries, contact email@example.com.
About SoftBank
The SoftBank Group invests in breakthrough technology to improve the quality of life for people around the world. The SoftBank Group is comprised of SoftBank Group Corp. (TOKYO: 9984), an investment holding company that includes telecommunications, internet services, AI, smart robotics, IoT and clean energy technology providers; the SoftBank Vision Funds, which are investing up to $100 billion to help extraordinary entrepreneurs transform industries and shape new ones; and the SoftBank Latin America Fund, the largest venture fund in the region. To learn more, please visit https://global.softbank

About Correlation One
Correlation One is on a mission to build the most equitable vocational school of the future. We believe that data literacy is the most important skill for the future of work. We make data fluency a competitive edge for firms through global data science competitions, rigorous data skills assessments, and enterprise-focused data science training.
Correlation One's solutions are used by some of the most elite employers all around the world in finance, technology, healthcare, insurance, consulting and governmental agencies. Since launching in 2015, Correlation One has built an expert community of 250,000+ data scientists and 600+ partnerships with leading universities and data science organizations in the US, UK, Canada, China, and Latin America.
For SoftBank: Laura Gaviria Halaby, Laura.firstname.lastname@example.org
City of Miami Media: Stephanie Severino, sseverino@miamigov.com
For Beacon Council: Maria Budet, mbudet@beaconcouncil.com
Artificial intelligence (AI) and data science have the potential to revolutionize global health. But what exactly is AI, and what hurdles stand in the way of more widespread integration of big data in global health? Duke's Global Health Institute (DGHI) hosted a Think Global webinar on Wednesday, February 17, to dive into these questions and more.
The webinar's panelists were Andy Tatem, Ph.D., Joao Vissoci, Ph.D., and Eric Laber, Ph.D., moderated by Liz Turner, Ph.D., director of DGHI's Research Design and Analysis Core. Tatem is a professor of spatial demography and epidemiology at the University of Southampton and director of WorldPop. Vissoci is an assistant professor of surgery and global health at Duke University. Laber is a professor of statistical science and bioinformatics at Duke.
Tatem, Vissoci, and Laber all use data science to address issues in the global health realm. Tatem's work largely utilizes geospatial data sets to help inform global health decisions like vaccine distribution within a certain geographic area. Vissoci, who works with the GEMINI Lab at Duke (Global Emergency Medicine Innovation and Implementation Research), tries to leverage secondary data from health systems in order to understand issues of access to and distribution of care, as well as care delivery. Laber is interested in improving decision-making processes in healthcare spaces, attempting to help health professionals synthesize very complex data via AI.
All of their work is vital to modern biomedicine and healthcare, but, Turner said, "AI means a lot of different things to a lot of different people." Laber defined AI in healthcare simply as using data to make healthcare better. From a data science perspective, Vissoci said, "[it is] synthesizing data in an automated way to give us back information." This returned information consists of digestible trends and understandings derived from very big, very complex data sets. Tatem stated that AI has "already revolutionized what we can do" and said it is powerful "if it is directed in the right way."
"We often get sucked into a science-fiction version of AI," Laber said, but in actuality it is not some dystopian future but a set of tools that maximizes what can be derived from data.
However, as Tatem stated, "[AI] is not a magic, press-a-button scenario where you get automatic results." A huge part of the work for researchers like Tatem, Vissoci, and Laber is harmonization: working with data producers, understanding data quality, integrating data sets, cleaning data, and other back-end processes.
This comes with many caveats.
"Bias is a huge problem," said Laber. Vissoci reinforced this, stating that the models built from AI and data science will represent whatever data sources they are able to access, bias included. "We need better work in getting better data," Vissoci said.
Further, there must be more up-front listening to and communication with end-users from the very start of projects, Tatem outlined. By taking a step back and listening, tools created through AI and data science may be better met with actual uptake and less skepticism or distrust. Vissoci said that direct engagement with the people on the ground transforms data into meaningful information.
Better structures for navigating privacy issues must also be developed. "A major overhaul is still needed," said Laber. This includes things like better consent processes for patients to understand how their data is being used, although Tatem said this becomes very complex when integrating data.
Nonetheless, the future looks promising, and each panelist feels confident that the benefits will outweigh the difficulties yet to come in introducing big data to global health. One example Vissoci gave of an ongoing project deals with the influence of environmental change through deforestation in the Brazilian Amazon on Indigenous populations. Through work with heavy multidimensional data, Vissoci and his team have also been able to optimize scarce COVID-19 vaccine resources for use in areas where they can have the most impact.
Laber envisions a world with reduced or even no clinical trials if randomization and experimentation are integrated directly into healthcare systems. Tatem noted how he has seen extreme growth in the field in just the last 10 to 15 years, which seems only to be accelerating.
A lot of this work has to do with making better decisions about allocating resources, as Turner stated in the beginning of the panel. In an age of reassessment about equity and access, AI and data science could serve to bring both to the field of global health.
Post by Cydney Livingston
February 22, 2021
A new multidisciplinary collaboration between the University of Rochester's departments of biology, biomedical engineering, and optics and the Goergen Institute for Data Science will establish an innovative microscopy resource on campus, allowing for cutting-edge scientific research in biological imaging.
Michael Welte, professor and chair of the Department of Biology, is the lead principal investigator of the project, which was awarded a $1.2 million grant from the Arnold and Mabel Beckman Foundation.
"The grant supports an endeavor at the intersection of optics, data science, and biomedical research, and the University of Rochester is very strong in these areas," Welte says. "The University has a highly collaborative culture, and the close proximity of our college and medical center makes Rochester ideally suited to lead advances in biological imaging."
The project will include developing and building a novel light-sheet microscope that employs freeform optical designs devised at Rochester. The microscope, which will be housed in a shared imaging facility in Goergen Hall and is expected to be online in 2022, enables three-dimensional imaging of complex cellular structures in living samples. Researchers and engineers will continually improve the microscope, and it will eventually become a resource for the entire campus research community.
"The optical engineers working on this project will take light-sheet technology into new domains," says Scott Carney, professor of optics and director of Rochester's Institute of Optics, who is a co-principal investigator on the project. "They will transform a precise, high-end microscope into a workhorse for biologists working at the cutting edge of their disciplines to make discoveries about the very fabric of life at the cellular and subcellular level."
The microscope will produce large amounts of data that will require new methods to better collect, analyze, and store the images.
"These efforts will focus on developing algorithms for computational optical imaging and automated biological image analysis, as well as on big data management," says Mujdat Cetin, a professor of electrical and computer engineering and the Robin and Tim Wentworth Director of the Goergen Institute for Data Science. Cetin is also a co-principal investigator on the project.
While many other research microscopes illuminate objects pixel by pixel, light-sheet technology illuminates an entire plane at once. The result is faster imaging with less damage to samples, enabling researchers to study biological processes in ways previously out of reach.
In addition to funding the construction of the microscope and development of the data science component, the grant from the Arnold and Mabel Beckman Foundation supports three biological research projects:
"Not only am I excited about each of the individual projects, from intimate looks at bacteria to finding new ways to analyze images, I am absolutely thrilled about the prospect of building something even bigger and better via the close collaboration of disciplines Rochester excels at individually: optics, data science, and biomedical research," Welte says. "I believe this joint endeavor is only the first in a long line that will establish Rochester as a leader in biological imaging."
Tags: Anne S. Meyer, Arts and Sciences, Dan Bergstralh, Department of Biology, Goergen Institute for Data Science, grant, Hajim School of Engineering and Applied Sciences, James McGrath, Michael Welte, Mujdat Cetin, Richard Waugh, Scott Carney
Category: Science & Technology
UNIVERSITY PARK, Pa. Data science can be a useful tool and powerful ally in enhancing diversity. A group of data scientists is holding "Harnessing the Data Revolution to Enhance Diversity," a symposium aimed at discussing the issues, identifying opportunities and initiating the next steps toward improving equity and diversity in academia at the undergraduate and faculty levels.
The online event, organized by the Institute for Computational and Data Sciences and co-sponsored by the Office of the Vice Provost for Educational Equity and the Center for Social Data Analytics, will be held from 1 to 3:30 p.m. on March 16 and 17, and is scheduled to include 10 30-minute talks and a roundtable discussion. Organizers added that the event is designed to help form new collaborations and identify cutting-edge approaches that can enhance diversity at Penn State and in higher education across the country.
"This symposium will bring together researchers from across the computational and social sciences to explore how we can build more diverse communities of researchers that are sensitive to how computational and data science can shape how diverse populations are impacted by change," said Jenni Evans, professor of meteorology and atmospheric science and ICDS director.
Speakers from across the U.S. will discuss issues ranging from quantifying and contextualizing diversity-related issues to examining approaches that have and haven't worked in academia.
Ed O'Brien, associate professor of chemistry and ICDS co-hire, said data science offers several tools to promote diversity, equity and inclusion.
"This symposium is bringing together diverse academic communities to explore how data science can be utilized to enhance diversity, equity and inclusion," said O'Brien. "Leveraging advances in big data and artificial intelligence holds the promise of complementing and accelerating a range of initiatives in this area."
Some of the topics the speakers and participants will address include how to identify a diversity-related goal, how to quantify and contextualize the challenge of increasing diversity, and analyzing approaches that have and have not worked.
Find out more and register for the symposium at https://icds.psu.edu/diversity.
Last Updated February 17, 2021
Learn About Innovations in Data Science and Analytic Automation on an Upcoming Episode of the Advancements Series – Yahoo Finance
Explore the importance of analytics in digital transformation efforts.
JUPITER, Fla., Feb. 18, 2021 /PRNewswire-PRWeb/ -- The award-winning series Advancements with Ted Danson will focus on recent developments in data science technology in an upcoming episode, scheduled to broadcast in Q2 2021.
In this segment, Advancements will explore how Alteryx uses data science to enable its customers to solve analytics use cases. Viewers will learn how Alteryx accelerates digital transformation outcomes through analytics and data science automation, and will see how, regardless of user skill set, the code-free and code-friendly platform empowers a self-service approach to upskilling workforces while speeding analytic, high-impact outcomes at scale.
"As digital transformation accelerates across the globe, the ability to unlock critical business insights through analytics is of the utmost importance in achieving meaningful outcomes," said Alan Jacobson, chief data and analytics officer of Alteryx. "Alteryx allows data workers at almost any experience level to solve complex problems with analytics and automate processes for business insights and quick wins. We look forward to sharing our story with the Advancements audience and to exploring how analytics and data science will shape the technology landscape of the future."
The segment will also uncover how the platform accelerates upskilling across the modern workforce, while furthering digital transformation initiatives and leveraging data science analytics to drive social outcomes.
"As a proven leader in analytics and data science automation, we look forward to highlighting Alteryx and to educating viewers about the importance of analytics," said Richard Lubin, senior producer for Advancements.
About Alteryx: As a leader in analytic process automation (APA), Alteryx unifies analytics, data science, and business process automation in one, end-to-end platform to accelerate digital transformation. Organizations of all sizes, all over the world, rely on the Alteryx Analytic Process Automation Platform to deliver high-impact business outcomes and the rapid upskilling of the modern workforce. Alteryx is a registered trademark of Alteryx, Inc. All other product and brand names may be trademarks or registered trademarks of their respective owners.
For more information visit http://www.alteryx.com.
About Advancements and DMG Productions: The Advancements series is an information-based educational show targeting recent advances across a number of industries and economies. Featuring state-of-the-art solutions and important issues facing today's consumers and business professionals, Advancements focuses on cutting-edge developments, and brings this information to the public with the vision to enlighten about how technology and innovation continue to transform our world.
Backed by experts in various fields, DMG Productions is dedicated to education and advancement, and to consistently producing commercial-free, educational programming on which both viewers and networks depend.
For more information, please visit http://www.AdvancementsTV.com or call Richard Lubin at 866-496-4065.
Sarah McBrayer, DMG Productions, 866-496-4065, email@example.com
SOURCE Advancements with Ted Danson
How Intel Employees Volunteered Their Data Science Expertise To Help Costa Rica Save Lives During the Pandemic – CSRwire.com
Submitted by Intel Corporation
We Are Intel
What do you do when a terrifying pandemic that has shaken the globe threatens to overwhelm your country? For Intel employees in Costa Rica, the answer was to offer their problem-solving expertise, and over 1,000 hours of highly technical work, to help the government develop an effective response plan.
In the early days of the pandemic, the uncertainty of how quickly the virus would spread and how severely it would impact communities made healthcare availability and resources a top concern. Feeling they could use their technical expertise to help, a group of Intel employees reached out to the Caja Costarricense de Seguro Social (CCSS), Costa Rica's main agency responsible for its public health sector.
Luis D. Rojas, one of the volunteer co-leads for the Intel team, laid the groundwork with CCSS to understand where help was needed, and with that, the team quickly began iterating on a statistical model to project anticipated demand for hospital beds and ICU capacity. With their combined expertise in data science, statistical process control, and machine learning system deployment, the team was able to pool their areas of knowledge to present their model and recommendations to the CCSS agency, and even the President. Ultimately, their project became one of the key modeling systems used by the government to inform the pandemic response.
Luis and the rest of the Costa Rica team provided what's called skills-based volunteering, in which volunteers apply the skills in which they have the most expertise to help address community challenges.
"What motivated me to help was my parents," said Jonathan Sequeira Androvetto, data scientist and volunteer co-lead. "Both of my parents are in the high-risk population, and knowing that what I was doing was helping my country and my family was incredible. Volunteering in this way gave me a lot of positive energy; it was recharging."
Jonathan and the other volunteers used their expertise in data science and statistics to help the government understand how their containment policies would affect virus reproduction rates and potential hospital and ICU utilization. The team also developed a dashboard intended to be shared with local governments to summarize the state of their cities in terms of how the pandemic is behaving. The dashboard includes metrics such as the growth rates of certain reproduction rates (R) / active cases in respective cities, as well as a 21-day projection of new cases / active cases, if the R trend is sustained in the near future.
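The kind of projection the dashboard describes, extrapolating new cases 21 days ahead under a sustained reproduction rate R, can be sketched as follows. This is an illustrative simplification, not the Intel team's actual model; the ~5-day serial interval used to convert R into a daily growth factor is an assumption for illustration.

```python
# Illustrative sketch of a 21-day case projection under a sustained
# reproduction rate R. Not the actual CCSS model; the serial interval
# of ~5 days is an assumed value for illustration only.

def project_cases(active_cases: float, r: float, days: int = 21,
                  serial_interval_days: float = 5.0) -> list[float]:
    """Return daily projected case counts assuming R stays constant."""
    # Each serial interval, cases multiply by R; convert that to a
    # per-day growth factor and compound it forward.
    daily_growth = r ** (1.0 / serial_interval_days)
    return [active_cases * daily_growth ** day for day in range(1, days + 1)]

projection = project_cases(active_cases=100, r=1.4, days=21)
print(round(projection[-1]))  # with R=1.4, cases roughly quadruple over 21 days
```

Under these assumptions, even a modest R above 1 compounds quickly, which is exactly why surfacing the R trend per city on a dashboard is actionable for local governments.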
"You get back more than you give," shared volunteer co-lead and machine learning engineer Jenny Peraza, on why she went above and beyond to apply her skill set to supporting the government's pandemic response. "When I began this project, it was from a place of intellectual curiosity. It evolved into so much more, and it's been incredibly gratifying to be able to put my process excellence and statistical knowledge towards successfully containing and managing this problem."
The Costa Rica team proved that a coordinated volunteer effort, especially one that maximizes specific skills, can make an increased impact on local and national communities. From a small idea to a national effort, the team was able to come together to help a community much larger than themselves.
When the pandemic is over, both Jenny and Jonathan have plans to continue their skills-based volunteering. Jonathan shared, "Giving back to my country in this way has been phenomenal, and the positive energy and relationships I've made through the course of this project, both within and outside of Intel, have meant so much." Added Jenny, "There are a wealth of resources and opportunities out there; sometimes it's just a matter of finding the right fit for you to apply your knowledge and skills to help make your community stronger."
At Intel, corporate responsibility means doing what is right. Respecting people and the world around us. It's how we do business.
More from Intel Corporation
Industry Voices: Building ethical algorithms to confront biases: Lessons from Aotearoa New Zealand – FierceHealthcare
New Zealand, an island country of five million people in the Pacific, presents a globally relevant case study in the application of robust, ethical data science for healthcare decision-making.
With a strong data-enabled health system, the population has successfully navigated several challenging aspects of both the pandemic response of 2020 and wider health data science advancements.
New Zealand's diverse population comprises a majority of European descent, but major cohorts of the indigenous Māori population, other Pacific Islanders and Asian immigrants all make up significant numbers. Further, these groups tend to be over-represented in negative health statistics, with an equity gap that has generally increased with advances in health technology.
Adopting models from international studies presents a challenge for a society with such an emphasis on reducing the equity gap. International research has historically included many more people of European origin, meaning that advances in medical practice are more likely to benefit those groups. As more data science technologies are developed, including machine learning and artificial intelligence, the potential to exacerbate rather than reduce inequities is significant.
New Zealand has invested in health data science collaborations, particularly through a public-private partnership called Precision Driven Health (PDH). PDH puts clinicians, data scientists and software developers together to develop new models and tools to translate data into better decisions. Some of the technology and governance models developed through these collaborations have been critical in supporting the national response to the COVID-19 pandemic.
When the New Zealand government, led by Prime Minister Jacinda Ardern, called upon the research community to monitor and model the spread of COVID-19, a new collaboration emerged. PDH data scientists from Orion Health supported academics from Te Pūnaha Matatini, a university-led centre of research excellence, in developing, automating and communicating the findings of modeling initiatives.
This led to a world-first national platform, called the New Zealand Algorithm Hub. The hub hosts models that have been reviewed for appropriate use in the response to COVID-19 and makes them freely available for decision-makers to use. Models range from pandemic spread models to risk of hospitalization and mortality, as well as predictive and scheduling models used to help reduce backlogs created during the initial lockdown.
One of the key challenges in delivering a platform of this nature is the governance of decisions around which algorithms to deploy. Having had very few COVID-19 cases in New Zealand meant that it was not straightforward to assess whether an algorithm might be suitable for this unique population.
A governance group was formed with stakeholders bringing consumer, legal, Māori, clinical, ethical and data science expertise, among others. This group developed a robust process to assess suitability, inviting the community to describe how algorithms were intended to be used, how they potentially could be misused, or whether there might be other unintended consequences to manage.
The governance group placed a strong emphasis on the potential for bias to creep in. If historical records favor some people, howdo we avoid automating these? A careful review was necessary of the data thatcontributedto model development; any knownissues relating to access or data quality differences between different groups; and what assumptions were to be made when the model would indeed be deployed for a group that had never been part of any control trial.
On one level, New Zealands COVID-19 response reflects a set of national values where the vulnerable have been protected;all of society has had to sacrificefora benefit which is disproportionatelybeneficialto older and otherwise vulnerable citizens. The sense of national achievement in being able tolive freely within tightly restricted borders has meant that it is important to protect those gains and avoidcomplacency.
The algorithm hub, with validated models and secure governance, is an example ofpositive recognition of bias motivating the New Zealand data science community to act to eliminate not just a virus, butultimately a long-term equity gap in health outcomes for people.
Kevin Ross, Ph.D., is director of research at Orion Health and CEO of Precision Driven Health.
Scikit-learn is a powerful machine learning library that provides a wide variety of modules for data access, data preparation and statistical model building. It has a good selection of clean toy data sets that are great for people just getting started with data analysis and machine learning. Even better, easy access to these data sets removes the hassle of searching for and downloading files from an external data source. The library also enables data processing tasks such as imputation, data standardization and data normalization. These tasks can often lead to significant improvements in model performance.
Scikit-learn also provides a variety of packages for building linear models, tree-based models, clustering models and much more. It features an easy-to-use interface for each model object type, which facilitates fast prototyping and experimentation with models. Beginners in machine learning will also find the library useful since each model object is equipped with default parameters that provide baseline performance. Overall, Scikit-learn provides many easy-to-use modules and methods for accessing and processing data and building machine learning models in Python. This tutorial will serve as an introduction to some of its functions.
Scikit-learn provides a wide variety of toy data sets, which are simple, clean, sometimes fictitious data sets that can be used for exploratory data analysis and building simple prediction models. The ones available in Scikit-learn can be applied to supervised learning tasks such as regression and classification.
For example, it has a set called iris data, which contains information corresponding to different types of iris plants. Users can employ this data for building, training and testing classification models that can classify types of iris plants based on their characteristics.
Scikit-learn also has a Boston housing data set, which contains information on housing prices in Boston. This data is useful for regression tasks like predicting the dollar value of a house. Finally, the handwritten digits data set is an image data set that is great for building image classification models. All of these data sets are easy to load using a few simple lines of Python code.
To start, let's walk through loading the iris data. We first need to import the pandas and numpy packages:
Next, we relax the display limits on the columns and rows:
We then load the iris data from Scikit-learn and store it in a pandas data frame:
Finally, we print the first five rows of data using the head() method:
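The loading steps above can be sketched as follows (a minimal sketch; the DataFrame name df_iris is illustrative):

```python
import numpy as np  # imported alongside pandas, as in the walkthrough
import pandas as pd
from sklearn.datasets import load_iris

# Relax the display limits on columns and rows
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Load the iris data and store it in a pandas data frame
iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris["target"] = iris.target

# Print the first five rows
print(df_iris.head())
```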
We can repeat this process for the Boston housing data set. To do so, let's wrap our existing code in a function that takes a Scikit-learn data set as input:
We can call this function with the iris data and get the same output as before:
Now that we see that our function works, let's import the Boston housing data and call our function with the data:
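A sketch of such a helper follows. Note that load_boston was removed from scikit-learn in version 1.2, so this sketch substitutes the bundled diabetes data to show the same call pattern:

```python
import pandas as pd
from sklearn.datasets import load_iris, load_diabetes

def sklearn_to_df(dataset):
    """Convert a scikit-learn data set (a Bunch) into a data frame and print its head."""
    df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
    df["target"] = dataset.target
    print(df.head())
    return df

df_iris = sklearn_to_df(load_iris())         # same output as before
df_housing = sklearn_to_df(load_diabetes())  # stand-in for the removed load_boston()
```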
Finally, let's load the handwritten digits data set, which contains images of handwritten digits from zero through nine. Since this is an image data set, it's neither necessary nor useful to store it in a data frame. Instead, we can display the first five digits in the data using the visualization library matplotlib:
And if we call our function with load_digits(), we get the following displayed images:
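A minimal sketch of the digits visualization (the Agg backend and the output filename are illustrative choices so the snippet runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; lets the script run without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# Display the first five handwritten digits as 8x8 grayscale images
fig, axes = plt.subplots(1, 5, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap="gray_r")
    ax.set_title(f"Label: {label}")
    ax.axis("off")
fig.savefig("first_five_digits.png")
```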
I can't overstate the ease with which a beginner in the field can access these toy data sets. These sets allow beginners to quickly get their feet wet with different types of data and use cases such as regression, classification and image recognition.
Scikit-learn also provides a variety of methods for data processing tasks. First, let's take a look at data imputation, the process of replacing missing data. It is important because real data often contains inaccurate or missing elements, which can result in misleading results and poor model performance.
Being able to accurately impute missing values is a skill that both data scientists and industry domain experts should have in their toolbox. To demonstrate how to perform data imputation using Scikit-learn, we'll work with the University of California, Irvine's data set on household electric power consumption, which is available here. Since the data set is quite large, we'll take a random sample of 40,000 records for simplicity and store the down-sampled data in a separate csv file called hpc.csv:
As we can see, the third row (second index) contains missing values specified by ? and NaN. The first thing we can do is replace the ? values with NaN values. Let's demonstrate this with Global_active_power:
We can repeat this process for the rest of the columns:
Now, to impute the missing values, we import the SimpleImputer method from Scikit-learn. We will define an imputer object that simply imputes the mean for missing values:
And we can fit our imputer to our columns with missing values:
Store the result in a data frame:
Add back the additional date and time columns:
And print the first five rows of our new data frame:
As we can see, the missing values have been replaced.
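The imputation steps above can be sketched end to end. Because the UCI file isn't bundled with Scikit-learn, this sketch uses a small synthetic stand-in for hpc.csv, with ? marking missing readings as in the original file:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small synthetic stand-in for the down-sampled UCI power consumption file (hpc.csv)
df = pd.DataFrame({
    "Date": ["16/12/2006"] * 5,
    "Time": ["17:24:00", "17:25:00", "17:26:00", "17:27:00", "17:28:00"],
    "Global_active_power": ["4.216", "5.360", "?", "5.388", "3.666"],
    "Global_intensity": ["18.4", "23.0", "?", "23.0", "15.8"],
})

# Replace '?' with NaN, then coerce the affected columns to numeric
value_cols = ["Global_active_power", "Global_intensity"]
df[value_cols] = df[value_cols].replace("?", np.nan).astype(float)

# Impute the column mean for each missing value
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[value_cols] = imputer.fit_transform(df[value_cols])

# The Date and Time columns were untouched; print the first five rows
print(df.head())
```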
Although Scikit-learn's SimpleImputer isn't the most sophisticated imputation method, it removes much of the hassle around building a custom imputer. This simplicity is useful for beginners who are dealing with missing data for the first time. Further, it serves as a good demonstration of how imputation works. By introducing the process, it can motivate more sophisticated extensions of this type of imputation, such as using a statistical model to replace missing values.
Data standardization and normalization are also easy with Scikit-learn. Both are useful in machine learning methods that involve calculating a distance metric, like K-nearest neighbors and support vector machines. They're also useful in cases where we can assume the data are normally distributed, and for interpreting coefficients in linear models as measures of variable importance.
Standardization is the process of subtracting the mean from the values in a numerical column and scaling them to unit variance (by dividing by the standard deviation). Standardization is necessary in cases where a wide range of numerical values might artificially dominate prediction outcomes.
Let's consider standardizing Global_intensity in the power consumption data set. This column has values ranging from 0.2 to 36. First, let's import the StandardScaler() method from Scikit-learn:
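A minimal sketch with a handful of hypothetical Global_intensity readings:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical readings; the real column ranges from 0.2 to 36
intensity = np.array([[0.2], [4.0], [15.8], [23.0], [36.0]])

scaler = StandardScaler()
intensity_scaled = scaler.fit_transform(intensity)

# After standardization the column has mean ~0 and unit variance
print(intensity_scaled.mean(), intensity_scaled.std())
```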
Data normalization scales a numerical column such that its values are between 0 and 1. Normalizing data using Scikit-learn follows similar logic to standardization. Let's apply the normalizer method to the Sub_metering_2 column:
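A sketch with illustrative sample values. Note that Scikit-learn's Normalizer rescales each row to unit norm, so applied to a single column it maps every nonzero value to 1.0; MinMaxScaler is the usual choice when you want a column scaled into [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical Sub_metering_2 readings
sub_metering_2 = np.array([[0.0], [1.0], [2.0], [16.0], [38.0]])

# Normalizer works row-wise: on a one-column array, every nonzero value
# becomes 1.0 and zeros stay 0
normalized = Normalizer().fit_transform(sub_metering_2)
print(normalized.min(), normalized.max())  # 0.0 1.0

# MinMaxScaler squeezes the whole column into [0, 1]
minmax = MinMaxScaler().fit_transform(sub_metering_2)
```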
Now we see that the min and max are 0 and 1.0.
In general, you should standardize data if you can safely assume it's normally distributed. Conversely, if you can safely assume that your data isn't normally distributed, then normalization is a good method for scaling it. Given that these transformations can be applied to numerical data with just a few lines of code, the StandardScaler() and Normalizer() methods are great options for beginners dealing with data fields that have widely varying values or data that isn't normally distributed.
Scikit-learn also has methods for building a wide array of statistical models, including linear regression, logistic regression and random forests. Linear regression is used for regression tasks, specifically the prediction of continuous output like housing price. Logistic regression is used for classification tasks in which the model predicts binary or multiclass output, like predicting iris plant type from its characteristics. Random forests can be used for both regression and classification. We'll walk through how to implement each of these models using the Scikit-learn machine learning library in Python.
Linear regression is a statistical modeling approach in which a linear function represents the relationship between input variables and a scalar response variable. To demonstrate its implementation in Python, let's consider the Boston housing data set. We can build a linear regression model that uses age as an input for predicting the housing value. To start, let's define our input and output variables:
Next, let's split our data for training and testing:
Now let's import the linear regression module from Scikit-learn:
Finally, let's train, test and evaluate the performance of our model using R^2 and RMSE:
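A sketch of the full flow. Since load_boston is no longer available in recent scikit-learn releases, the bundled diabetes data stands in for the Boston set here, with its age feature playing the role of AGE:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Define input (a single feature) and output variables
data = load_diabetes()
X = data.data[:, [data.feature_names.index("age")]]
y = data.target

# Split the data for training and testing (the split and seed are illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train, test and evaluate with R^2 and RMSE
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}, R^2: {r2:.3f}")
```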
Since we use one variable to predict a response, this is a simple linear regression. But we can also use more than one variable in a multiple linear regression. Let's build a linear regression model with age (AGE), average number of rooms (RM), and pupil-to-teacher ratio (PTRATIO). All we need to do is redefine X (input) as follows:
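Because load_boston is no longer bundled with scikit-learn, this sketch uses the diabetes data's age, bmi and bp features in the roles of AGE, RM and PTRATIO:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

data = load_diabetes()

# Redefine X with three input features instead of one
cols = [data.feature_names.index(f) for f in ("age", "bmi", "bp")]
X = data.data[:, cols]
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))
```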
This gives the following improvement in performance:
Linear regression is a great method to use if you're confident that there is a linear relationship between input and output. It's also useful as a benchmark against more sophisticated methods like random forests and support vector machines.
Logistic regression is a simple classification model that predicts binary or even multiclass output. The logic for training and testing is similar to linear regression.
Let's consider the iris data for our Python implementation of a logistic regression model. We'll use sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm) to predict the type of iris plant:
We can evaluate and visualize the model performance using a confusion matrix:
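A minimal sketch of the fit-and-evaluate flow (the split fraction, seed and max_iter value are illustrative; max_iter is raised so the solver converges):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()  # the four measurement columns are the inputs
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```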
We see that the model correctly captures all of the true positives across the three iris plant classes. Similar to linear regression, logistic regression depends on a linear sum of inputs used to predict each class. As such, logistic regression models are referred to as generalized linear models. Given that logistic regression models a linear relationship between input and output, they're best employed when you know that there is a linear relationship between input and class membership.
Random forests, also called random decision forests, are statistical models for both classification and regression tasks. A random forest is basically a set of questions and answers about the data organized in a tree-like structure.
These questions split the data into subgroups so that the data in each successive subgroup are most similar to each other. For example, say we'd like to predict whether or not a borrower will default on a loan. A question that we can ask using historical lending data is whether or not the customer's credit score is below 700. The data that falls into the yes bucket will have more customers who default than the data that falls into the no bucket.
Within the yes bucket, we can further ask if the borrower's income is below $30,000. Presumably, the yes bucket here will have an even greater percentage of customers who default. Decision trees continue asking statistical questions about the data until achieving maximal separation between the data corresponding to those who default and those who don't.
Random forests extend decision trees by constructing a multitude of them. In each of these trees, we ask statistical questions on random chunks and different features of the data. For example, one tree may ask about age and credit score on a fraction of the train data. Another may ask about income and gender on a separate fraction of the training data, and so forth. Random forest then performs consensus voting across these decision trees and uses the majority vote for the final prediction.
Implementing a random forest model for both regression and classification is straightforward and very similar to the steps we went through for linear regression and logistic regression. Let's consider the regression task of predicting housing prices using the Boston housing data. All we need to do is import the random forest regressor module, initiate the regressor object, fit, test and evaluate our model:
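A sketch with default parameters, substituting the bundled diabetes data since load_boston was removed from scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = load_diabetes()  # stand-in for the removed Boston housing set
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Initiate the regressor with default parameters, then fit and evaluate
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R^2:", r2_score(y_test, y_pred))
```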
We see a slight improvement in performance compared to linear regression.
The random forest object takes several parameters that can be modified to improve performance. The three I'll point out here are n_estimators, max_depth and random_state. You can check out the documentation for a full description of all random forest parameters.
The parameter n_estimators is simply the number of decision trees that the random forest is made up of. Max_depth measures the longest path from the first question to a question at the base of the tree. Random_state is how the algorithm randomly chooses chunks of the data for question-asking.
Since we didn't specify any values for these parameters, the random forest module automatically selects a default value for each parameter. The default value for n_estimators was 10 in Scikit-learn versions before 0.22 and is 100 in later releases; it sets how many decision trees the forest is built from. The default value for max_depth is None, which means there is no cut-off for the length of the path from the first question to the last question at the base of the decision tree. This can be roughly understood as placing no limit on the number of questions we ask about the data. The default value for random_state is None, which means that on each model run, different chunks of data will be randomly selected and used to construct the decision trees in the random forest. This results in slight variations in output and performance.
Despite using default values, we achieve pretty good performance. This accuracy demonstrates the power of random forests and the ease with which the data science beginner can implement an accurate random forest model.
Let's see how to specify n_estimators, max_depth and random_state. We'll choose 100 estimators, a max depth of 10 and a random state of 42:
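A sketch with those parameters made explicit (the bundled diabetes data stands in for the removed Boston set):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Explicit hyperparameters: 100 trees, depth capped at 10, fixed random seed
# so results are reproducible across runs
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
print("R^2:", rf.score(X_test, y_test))
```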
We see that we get a slight improvement in both MSE and R^2. Further, specifying random_state makes our results reproducible since it ensures the same random chunks of data are used to construct the decision trees.
Applying random forest models to classification tasks is very straightforward. Let's do this for the iris classification task:
And the corresponding confusion matrix is just as accurate:
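A minimal sketch of the classification version (the split and seed are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Same fit/predict pattern as the regressor, but for class labels
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

cm = confusion_matrix(y_test, rf_clf.predict(X_test))
print(cm)
```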
Random forests are a great choice for building a statistical model since they can be applied to a wide range of prediction use cases, including classification, regression and even unsupervised clustering tasks. They're a fantastic tool that every data scientist should have in their back pocket. In the context of Scikit-learn, they're extremely easy to implement and modify for improvements in performance. This enables fast prototyping and experimentation with models, which leads to accurate results faster.
Finally, all the code in this post is available on GitHub.
Overall, Scikit-learn provides many easy-to-use tools for accessing benchmark data, performing data processing, and training, testing and evaluating machine learning models. All of these tasks require relatively few lines of code, making the barrier to entry for beginners in data science and machine learning research quite low. Users can quickly access toy data sets and familiarize themselves with different machine learning use cases (classification, regression, clustering) without the hassle of finding a data source, then downloading and cleaning the data. Upon becoming familiar with different use cases, the user can then easily port over what they've learned to more real-life applications.
Further, new data scientists unfamiliar with data imputation can quickly pick up how to use the SimpleImputer package in Scikit-learn and implement some standard methods for replacing missing or bad values in data. This can serve as the foundation for learning more advanced methods of data imputation, such as using a statistical model for predicting missing values. Additionally, the standard scaler and normalizer methods make data preparation for advanced models like neural networks and support vector machines very straightforward. This is often necessary in order to achieve satisfactory performance with more complicated models like support vector machines and neural networks.
Finally, Scikit-learn makes building a wide variety of machine learning models very easy. Although I've only covered three in this post, the logic for building other widely used models such as support vector machines and K-nearest neighbors is very similar. The library is also well suited to beginners with limited knowledge of how these algorithms work under the hood, given that each model object comes with default parameters that provide baseline performance. Whether the task is model benchmarking with toy data, preparing and cleaning data, or evaluating model performance, Scikit-learn is a fantastic tool for building machine learning models for a wide variety of use cases.
Willis Towers Watson enhances its human capital data science capabilities globally with the addition of the Jobable team – GlobeNewswire
LONDON, Feb. 16, 2021 (GLOBE NEWSWIRE) -- Willis Towers Watson (NASDAQ: WLTW), a leading global advisory, broking and solutions company, today announced a group hire of the entire team from Jobable, a Hong Kong-based human capital analytics and software company.
The team brings to Willis Towers Watson (WTW) its expertise in human capital data science and software development. Combining the capabilities of Jobable and WTW will enhance the company's leadership in helping organisations drive digital transformation and uncover the insights within their human capital data.
Former Jobable Chief Executive Officer Richard Hanson joins WTW as Global Head of Data Science for Talent & Rewards, along with his Jobable co-founder, Luke Byrne. In his new role, Hanson will continue to be based in Hong Kong, working to identify and capture global revenue opportunities whilst actively contributing to WTW's thought leadership initiatives. Byrne, formerly Jobable's Chief Operating Officer, will help drive the transition process.
Mark Reid, Global Leader, Work and Rewards at WTW, said, "Throughout our partnership with Jobable, we experienced first-hand their capabilities across data science, software design and development. The Jobable team often provided a valuable point of differentiation to our clients' work. Whilst we have already shared numerous commercial successes together, the prospect of building on this proven track record, discovering new synergies and fully leveraging Richard and his team's expertise is truly a compelling one."
Welcoming the new colleagues, Shai Ganu, Global Leader, Executive Compensation at WTW, commented, "With client demands evolving at speed and often with increasing complexity, the addition of Richard and his team's capabilities will sharpen our competitive edge. We are excited to be able to apply data science in all our Data-Software-Advisory offerings, and ultimately help our clients find solutions to critical and emerging people challenges."
For Byrne and Hanson, the team move marks the beginning of a new journey, from founding their start-up to now growing the business at an enterprise level. Byrne remarked, "We are tremendously proud of Jobable's achievements over the past six years. Joining WTW is the perfect way for us to ensure that we can amplify the impact of our work going forward. We are truly excited to see how our combination of skill sets and experience can benefit WTW's clients and their people for years to come."
Bringing the Jobable team to WTW is the culmination of a successful multi-year global partnership between the two companies, marked by notable achievements such as the design and development of SkillsVue, innovative skill-based compensation modelling software launched in 2019. In addition, WTW introduced WorkVue, award-winning AI-driven job reinvention software also developed by the Jobable team, in 2020. Jobable has also consistently delivered its unique data analysis and insights to support WTW's advisory work with corporate clients and government agencies worldwide.
The Jobable team will add a wealth of expertise and capabilities to WTW's technology team, including full-stack software development, data engineering, DevOps, natural language processing, ETL, topic modeling, word embedding, deep learning, predictive analytics, web scraping, UX/UI design and rapid prototyping.
About Willis Towers Watson
Willis Towers Watson (NASDAQ: WLTW) is a leading global advisory, broking and solutions company that helps clients around the world turn risk into a path for growth. With roots dating to 1828, Willis Towers Watson has 45,000 employees serving more than 140 countries and markets. We design and deliver solutions that manage risk, optimise benefits, cultivate talent, and expand the power of capital to protect and strengthen institutions and individuals. Our unique perspective allows us to see the critical intersections between talent, assets and ideas: the dynamic formula that drives business performance. Together, we unlock potential. Learn more at willistowerswatson.com.
Clara Goh: +65 6958 2542 | firstname.lastname@example.org