Category Archives: Data Science

How to Plan for Your Next Career Move in Data Science and Machine Learning | by TDS Editors | Jul, 2024 – Towards Data Science

Written by admin on July 19, 2024 — Leave a Comment

Feeling inspired to write your first TDS post? Were always open to contributions from new authors.

Data science and machine learning professionals are facing uncertainty from multiple directions: the global economy, AI-powered tools and their effects on job security, and an ever-shifting tech stack, to name a few. Is it even possible to talk about recession-proofing or AI-proofing ones career these days?

The most honest answer we can give is we dont really know, because as weve seen with the rise of LLMs in the past couple of years, things can and do change very quickly in this field (and in tech more broadly). That, however, doesnt mean we should just resign ourselves to inaction, let alone despair.

Even in challenging times, there are ways to assess the situation, think creatively about our current position and what changes wed like to see, and come up with a plan to adjust our skills, self-presentation, and mindset accordingly. The articles weve selected this week each tackle one (or more) of these elements, from excelling as an early-career data scientist to becoming an effective communicator. They offer pragmatic insights and a healthy dose of inspiration for practitioners across a wide range of roles and career stages. Lets dive in!

BlazeFace: How to Run Real-time Object Detection in the Browser – Towards Data Science

Written by admin on July 19, 2024 — Leave a Comment

A step-by-step guide to training a BlazeFace model, from the Python training pipeline to the JavaScript demo through model conversion. 11 min read

Thanks to libraries such as YOLO by Ultralytics, it is fairly easy today to make robust object detection models with as little as a few lines of code. Unfortunately, those solutions are not yet fast enough to work in a web browser on a real-time video stream at 30 frames per second (which is usually considered the real-time limit for video applications) on any device. More often than not, it will run at less than 10 fps on an average mobile device.

The most famous real-time object detection solution on web browser is Googles MediaPipe. This is a really convenient and versatile solution, as it can work on many devices and platforms easily. But what if you want to make your own solution?

In this post, we propose to build our own lightweight, fast and robust object detection model, that runs at more than 30 fps on almost any devices, based on the BlazeFace model. All the code used for this is available on my GitHub, in the blazeface folder.

The BlazeFace model, proposed by Google and originally used in MediaPipe for face detection, is really small and fast, while being robust enough for easy object detection tasks such as face detection. Unfortunately, to my knowledge, no training pipeline of this model is available online on GitHub; all I could find is this inference-only model architecture. Through this post, we will train our own BlazeFace model with a fully working pipeline and use it on browser with a working JavaScript code.

More specifically, we will go through the following steps:

Lets get started with the model training.

As usual when training a model, there are a few typical steps in a training pipeline:

Lets go through those steps together.

We are going to use a subset of the Open Images Dataset V7, proposed by Google. This dataset is made of about 9 million images with many annotations (including bounding boxes, segmentation masks, and many others). The dataset itself is quite large and contains many types of images.

For our specific use case, I decided to select images in the validation set fulfilling two specific conditions:

The script to download and build the dataset under those strict conditions is provided in the GitHub, so that anyone can reproduce it. The downloaded dataset with this script contains labels in the YOLO format (meaning box center, width and height). In the end, the downloaded dataset is made of about 3k images and 8k faces, that I have separated into train and validation set with a 80%-20% split ratio.

From this dataset, typical preprocessing it required before being able to train a model. The data preprocessing code I used is the following:

As we can see, the preprocessing is made of the following steps:

Optionally, this code allows for data augmentation using Albumentations. For the training, I used the following data augmentations:

Those augmentations will allow us to have a more robust, regularized model. After all those transformations and augmentations, the input data may look like the following sample:

As we can see, the preprocessed images have grey borders because of augmentation (with rotation or translation) or padding (because the original image did not have a square aspect ratio). They all contain faces, although the context might be really different depending on the image.

Important Note:

Face detection is a highly sensitive task with significant ethical and safety considerations. Bias in the dataset, such as underrepresentation or overrepresentation of certain facial characteristics, can lead to false negatives or false positives, potentially causing harm or offense. See below a dedicated section about ethical considerations.

Now that our data can be loaded and preprocessed, lets go to the next step: building the model.

In this section, we will build the model architecture of the original BlazeFace model, based on the original article and adapted from the BlazeFace repository containing inference code only.

The whole BlazeFace architecture is rather simple and is mostly made of what the papers author call a BlazeBlock, with various parameters.

The BlazeBlock can be defined with PyTorch as follows:

As we can see from this code, a BlazeBlock is simply made of the following layers:

N.B.: You can read the PyTorch documentation for more about these layers: Conv2D layer and BatchNorm2D layer.

This block is repeated many times with different input parameters, to go from a 128-pixel image up to a typical object detection prediction using tensor reshaping in the final stages. Feel free to have a look at the full code in the GitHub repository for more about the implementation of this architecture.

Before moving to the next section about training the model, note that there are actually two architectures:

As you can imagine, the 256-pixel architecture is slightly larger, but still lightweight and sometimes more robust. This architecture is also implemented in the provided code, so that you can use it if you want.

N.B.: The original BlazeFace model not only predicts a bounding box, but also six approximate face landmarks. Since I did not have such labels, I simplified the model architecture to predict only the bounding boxes.

Now that we can build a model, lets move on to the next step: training the model.

For anyone familiar with PyTorch, training models such as this one is usually quite simple and straightforward, as shown in this code:

As we can see, the idea is to loop over your data for a given number of epochs, one batch at a time, and do the following:

I am not getting into all the details for clarity in this post, but feel free to navigate through the code to get a better sense of the training part if needed.

After training on 100 epochs, I had the following results on the validation set:

As we can see on those results, even if the object detection is not perfect, it works pretty well for most cases (probably the IoU threshold was not optimal, leading sometimes to overlapping boxes). Keep in mind its a very light model; it cant exhibit the same performances as a YOLOv8, for example.

Before going to the next step about converting the model, lets have a short discussion about ethical and safety considerations.

Lets go over a few points about ethics and safety, since face detection can be a very sensitive topic:

To address these concerns, anyone willing to build a product on such topic should focus on:

N.B.: A useful approach to address these concerns is to examine what Google did for their own face detection and face landmarks models.

Again, the used dataset is intended solely for educational purposes. Anyone willing to use it should exercise caution and be mindful of its limitations when interpreting results. Lets now move to the next step with the model conversion.

Remember that our goal is to make our object detection model work in a web browser. Unfortunately, once we have a trained PyTorch model, we can not directly use it in a web browser. We first need to convert it.

Currently, to my knowledge, the most reliable way to run a deep learning model in a web browser is by using a TFLite model with TensorFlow.js. In other words, we need to convert our PyTorch model into a TFLite model.

N.B.: Some alternative ways are emerging, such as ExecuTorch, but they do not seem to be mature enough yet for web use.

As far as I know, there is no robust, reliable way to do so directly. But there are side ways, by going through ONNX. ONNX (which stands for Open Neural Network Exchange) is a standard for storing and running (using ONNX Runtime) machine learning models. Conveniently, there are available libraries for conversion from torch to ONNX, as well as from ONNX to TensorFlow models.

To summarize, the conversion workflow is made of the three following steps:

This is exactly what the following code does:

This code can be slightly more cryptic than the previous ones, as there are some specific optimizations and parameters used to make it work properly. One can also try to go one step further and quantize the TFLite model to make it even smaller. If you are interested in doing so, you can have a look at the official documentation.

N.B.: The conversion code is highly sensitive of the versions of the libraries. To ensure a smooth conversion, I would strongly recommend using the specified versions in the requirements.txt file on GitHub.

On my side, after TFLite conversion, I finally have a TFLite model of only about 400kB, which is lightweight and quite acceptable for web usage. Next step is to actually test it out in a web browser, and to make sure it works as expected.

On a side note, be aware that another solution is currently being developed by Google for PyTorch model conversion to TFLite format: AI Edge Torch. Unfortunately, this is quite new and I couldnt make it work for my use case. However, any feedback about this library is very welcome.

Now that we finally have a TFLite model, we are able to run it in a web browser using TensorFlow.js. If you are not familiar with JavaScript (since this is not usually a language used by data scientists and machine learning engineers) do not worry; all the code is provided and is rather easy to understand.

I wont comment all the code here, just the most relevant parts. If you look at the code on GitHub, you will see the following in the javascript folder:

If we take a step back, all we need to do in the JavaScript code is to loop over the frames of the camera feed (either a webcam on a computer or the front-facing camera on a mobile phone) and do the following:

We wont comment the image preprocessing since this would be redundant with the Python preprocessing, but feel free to have a look at the code. When it comes to making an inference with a TFLite model in JavaScript, its fairly easy:

The tricky part is actually the postprocessing. As you may know, the output of a SSD object detection model is not directly usable: this is not the bounding boxes locations. Here is the postprocessing code that I used:

In the code above, the model output is postprocessed with the following steps:

This is exactly what has been done in Python too to display the resulting bounding boxes, if it may help you get a better understanding of that part.

Finally, below is a screenshot of the resulting web browser demo:

As you can see, it properly detects the face in the image. I decided to use a static image from Unsplash, but the code on GitHub allows you to run it on your webcam, so feel free to test it yourself.

Before concluding, note that if you run this code on your own computer or smartphone, depending on your device you may not reach 30 fps (on my personal laptop having a rather old 2017 Intel Core i58250U, it runs at 36fps). If thats the case, a few tricks may help you get there. The easiest one is to run the model inference only once every N frames (N to be fine tuned depending on your application, of course). Indeed, in most cases, from one frame to the next, there are not many changes, and the boxes can remain almost unchanged.

I hope you enjoyed reading this post and thanks if you got this far. Even though doing object detection is fairly easy nowadays, doing it with limited resources can be quite challenging. Learning about BlazeFace and converting models for web browser gives some insights into how MediaPipe was built, and opens the way to other interesting applications such as blurring backgrounds in video call (like Google Meets or Microsoft Teams) in real time in the browser.

Google and Howard University to train underrepresented high schoolers in data analyticsand its free to students – Fortune

Written by admin on July 19, 2024 — Leave a Comment

If youre thinking about going to college to, say, pursue a degree in data science, heres a statistic to listen to: college graduates make $1.2 million more, on average, during their lifetime than those who dont.

Thats according to data compiled by the Association of Public and Land-Grant Universities. For those early in their career, obtaining a degree is about more than simply getting a diplomait means a substantially greater change at having a higher salary, and thus, having greater social and economic mobility.

UNC Kenan-Flaglers top-ranked online MBA is a top choice for experienced professionals with strong undergraduate performance. You can earn your degree at your own pacein 18 to 36 monthswithout sacrificing academic quality and rigor. Access lifelong career benefits and join a global community of over 44,000 alumni with an AACSB-accredited online MBA from UNC-Chapel Hill. GMAT waivers available. Learn more today.

But for many students, especially low-income and underrepresented high schoolers, the idea of going to college may be a daunting one to say the leastwith many not believing they could be worthy enough to pursue a career, especially in-demand, high-paying jobs in the STEM field. A new partnership, shared exclusively with Fortune, between Google, Howard University, and the National Education Equity Lab is hoping to reverse that hesitation.

Starting this fall, students at under-resourced, Title 1 high schools will be able to learn the fundamentals of data analytics via Googles industry-recognized certification program as well as gain six college credits from Howardall at no cost to students.

Its a critical pipeline opportunity for students in our high schools that have not had an opportunity to be exposed to potential careers in data analytics (who) will (otherwise) be behind the curve because of this absence of exposure, explains Leslie Cornfield, founder and CEO of the National Ed Equity Lab.

Since 2019, the organization has reached more than 25,000 disadvantaged students in many of the countrys largest school districtsincluding in New York, Los Angeles, Chicago, Miami, Nashville, and Denverwith opportunities to realize they can actually do well in rigorous college courses. The lab has been so successful in such a short amount of time that even The New York Times wrote a feature on it.

During the year-long program, students will take Googles Data Analytics Professional Certificationthe most popular of any certification on Coursera and learn the essentials of data analytics, including exposure to Tableau and SQL via hands-on projects and case studies. While a classroom teacher will help students with the asynchronous certification coursework, a teaching fellow from Howard will also connect with students virtually to help with mastery of materials as well as provide mentoring and advice.

Ben Vinson, president of Howard University, says his school chose to participate in the program to not only expand their own walls to classrooms across the countrybut also as part of Howards commitment to truth, service, and uplifting underserved communities.

This is an increasingly technical world, and data is the new coin of the realm, if you will, in a fast-paced, technologically, AI-fueled environment, explains Vinson. And so, data science then becomes a key core skill for future success, and we want underserved populations, minority populations to be very successful in these highly technical fields.

Data science and data analytics are among the fastest growing fields not only in the tech space, but across all occupations. Over the next 10 years, CompTIA predicts that demand for data scientists and data analysts will grow by 304%.

Lisa Gevelber, founder and leader of Grow with Google, the companys tech training arm, says Google believes data analytics is critical for individuals to learn due to its rapid growththat pays well, too.

Its really good to have durable skills, and also to be able to prove to people that you have those skills. And the best way to demonstrate that you have these skills are through real experience or through a credential, she tells Fortune.

As part of Grow with Google, students receive access to an employer consortium, made up of more than 150 companies such as Marriott, Siemens, and Google itself, who are eager to connect with talentand maybe one day hire them. Students are also provided access to a job board of entry-level data analytics jobs in which their certification would make them qualified for, and Grow with Google even provides tips on writing resumes and other job success help.

Its really designed not just to teach them about their field, but to help them get a great job that will hopefully lead to a long term career, Gevelber says.

Students part of this new partnership also get access to a new AI essentials course from Google. All together, the program aims to make students confident to further master college-level courses and prepare for the job market.

We want to break down barriers to success. We want to elevate the performance of students in these areas. And we want to inspire their confidence in ways that prepare them not just to, for the skills in these fields, but also to be more confident about going to college and succeeding, Vinson adds.

Vinson says college is essential for unlocking the totality of ones future, and thus students should never stop dreaming.

Always realize that the next step to fulfilling your dreams is entering the doors of college because that allows you to gain the capacity, the confidence, and the know-how to implement those dreams in the future.

Clarification: A previous version of this article noted that this new program was completely free. The program is no cost to scholars, but districts and schools pay the enrollment fees. Fortune apologizes for this confusion.

Check out all ofFortunesrankings of degree programs, and learn more about specificcareer paths.

See the original post:

Google and Howard University to train underrepresented high schoolers in data analyticsand its free to students - Fortune

Analytics and Data Science News for the Week of July 19; Updates from Microsoft, Qlik, Teradata & More – Solutions Review

Written by admin on July 19, 2024 — Leave a Comment

Solutions Review Executive Editor Tim King curated this list of notable analytics and data science news for the week of July 19, 2024.

Keeping tabs on all the most relevant analytics and data science news can be a time-consuming task. As a result, our editorial team aims to provide a summary of the top headlines from the last week, in this space. Solutions Review editors will curate vendor product news, mergers and acquisitions, venture capital funding, talent acquisition, and other noteworthy analytics and data science news items.

By broadening its scope beyond NFT indexing, bitsCrunch aims to address various data analytics, security, and validation needs across different blockchain applications, positioning itself as a comprehensive solution provider for the entire Web3 industry.

Read on for more

Databricks customers already enjoy fast, simple and reliable serverless compute forDatabricks SQLandDatabricks Model Serving. The same capability is now available for all ETL workloads on the Data Intelligence Platform, including Apache Spark and Delta Live Tables.

Read on for more

The integrated ArcGIS Enterprise and IBM Maximo Application Suite product is designed to give clients access to a more holistic view of asset locations, conditions, and performance, as well as augmented maintenance scheduling and resource allocation capabilities.

Read on for more

It confirms their ability to craft tailored Microsoft analytics solutions, leveraging powerful tools like Microsoft Fabric, Azure Synapse Analytics, Azure Data Lake, Azure Data Factory, and Azure Databricks. These capabilities enable them to meet the unique needs of their customers, helping them harness the power of their data and drive meaningful business outcomes.

Read on for more

Microsoft Purview is a comprehensive set of solutions that can help organizations secure, govern, and manage their data, wherever it lives. The unification of data security and governance capabilities in Microsoft Purview reflects our belief that our customers need a simpler approach to data.

Read on for more

This version of the gateway will ensure that the reports that you publish to the Power BI Service and refresh via the gateway will go through the same query execution logic/run-time as in the July version of Power BI Desktop. View the data connectivity section ofPower BIs July 2024 feature summaryfor updates on relevant connectors.

Read on for more

Qlik Talend Cloud delivers AI-augmented data integration capabilities from no-code to pro-code, enabling businesses to maintain data integrity and accelerate AI adoption at all levels. With new AI-augmented data pipelines, Qlik Talend Cloud sets a new standard in data quality and reliability, crucial for generating value from diverse data sources in a flexible and scalable environment.

Read on for more

Watch this space each week as our editors will share upcoming events, new thought leadership, and the best resources from Insight Jam, Solutions Reviews enterprise tech community for business software pros. The goal? To help you gain a forward-thinking analysis and remain on-trend through expert advice, best practices, predictions, and vendor-neutral software evaluation tools.

With the next Solutions Spotlight event, the team at Solutions Review has partnered with Informatica, a leader in data management and integration, to provide viewers with this unique webinar. Hear from industry experts like Siddharth Mishra and Abhilash Reddy to learn how to effortlessly prepare quality data and deliver it so its timely and relevant for GenAI.

This month on The Jam Session, host Bob Eve and guests John Santaferraro, Roy Hasson, and Andrew Madson are talking data architecture. The panel discusses the engineering marvels (and disasters) theyve seen in their careers, and offer valuable insights on data quality, metadata management, real-time analytics, and the future of AI agent architectures.

Watch on YouTube

For consideration in future data management news roundups, send your announcements to the editor: tking@solutionsreview.com.

Excerpt from:

Analytics and Data Science News for the Week of July 19; Updates from Microsoft, Qlik, Teradata & More - Solutions Review

Center for Community-Engaged Artificial Intelligence and the Connolly Alexander Institute for Data Science launch summer research program – Tulane…

Written by admin on July 19, 2024 — Leave a Comment

July 16, 2024 9:30 AM

The Center for Community-Engaged Artificial Intelligence (CEAI) at Tulane University, in partnership with the Connolly Alexander Institute for Data Science (CAIDS), established the Community-Engaged Artificial Intelligence and Data Science Summer Research Program. This innovative program aims to foster research into human-centered artificial intelligence (AI) with a focus on social impact, emphasizing the importance of building meaningful relationships with diverse communities throughout the AI lifecycle.

The Community-Engaged AI and Data Science Summer Research Program supports research into how AI technologies are developed and deployed in ways that are socially beneficial, inclusive, effective, fair, transparent and accountable. By involving communities in all stages of the AI process from design through deployment the program seeks to create AI solutions that address real-world challenges and promote equity.

This summer, the program awarded funds to support three groundbreaking research projects, each receiving $10,000 to advance their work:

1. LandmarkAI: Recognizing Real Estate Development Threats to Unregistered National Historic Landmarks

2. Choc Forest Watch: Supporting Local Conservation in a Biodiversity Hotspot

3. Transforming a Traditional Evidence-Based Intervention: AI-Enhanced Support for Young Adults With Substance Use Disorders

"We are excited to support these innovative projects that exemplify the potential of AI and data science to address societal challenges," said Dr. Aron Culotta, director of CEAI. "By engaging communities directly in the research process, the goal is to develop AI solutions that are not only technologically advanced but also socially responsible and impactful."

The Community-Engaged AI and Data Science Summer Research Program represents a significant step toward creating AI technologies that are deeply attuned to the needs and values of diverse communities. CEAI and CAIDS are committed to continuing this important work and fostering a collaborative environment where technology and society co-evolve for the betterment of all.

For more information about the program and the funded projects, visit CEAI and CAIDS.

About CEAI

The Center for Community-Engaged Artificial Intelligence (CEAI), funded by Tulanes Office of Research, is dedicated to promoting the development and deployment of artificial intelligence technologies that are inclusive, equitable, and socially beneficial. CEAI focuses on integrating community engagement into AI research to ensure that technological advancements serve the public good.

About CAIDS

The Connolly Alexander Institute for Data Science (CAIDS) at Tulane University is dedicated to cultivating a comprehensive understanding of data science concepts and methodologies among the Tulane community. By fostering data literacy, CAIDS aims to empower individuals to tackle real-world challenges and drive innovation in the data-driven landscape.

Original post:

Center for Community-Engaged Artificial Intelligence and the Connolly Alexander Institute for Data Science launch summer research program - Tulane...

Heather Barker travels to New Zealand to present at International Association of Statistics Education Roundtable – Today at Elon

Written by admin on July 19, 2024 — Leave a Comment

Heather Barker, assistant professor of mathematics and statistics, joined a small group of 40 statistics educators at the sixth IASE roundtable at the University of Auckland in New Zealand from July 2-5, 2024.

Share this page on Facebook Share this page on LinkedIn Email this page to a friend Print this page

Heather Barker, assistant professor of mathematics and statistics at Elon University, recently presented at the International Association for Statistical Education (IASE) Roundtable Conference held at the University of Auckland in New Zealand. The conference, held from July 2 to July 5, 2024, brought together only 40 to 50 participants from around the world to discuss the theme of connecting data and people for inclusive statistics and data science education.

This unique four-day event focused on the human-driven aspects of data creation and usage, aiming to promote inclusivity in the teaching and learning of statistics and data science. Attendees shared their passion, expertise and experiences through a mixture of papers and workshops, along with ample opportunities for in-person discussions.

Barkers paper, titled Integrating Social Justice into Data Investigation: A Framework for Statistical Analysis of Police-Involved Deaths, explores the intersection of teaching mathematics for social justice principles and a data investigation process framework. The study utilized a dataset documenting police-involved deaths, demonstrating how statistical techniques can be applied to real-world data. She introduced the Social Justice Data Investigation Process framework, which highlights the interplay between statistical habits of mind, social justice teaching and the data investigation process. Preliminary insights from high school and college students who participated in this data investigation were also shared.

The IASE Roundtable Conference is distinctive in its approach, requiring presenters to share their papers with participants two months before the meeting for review. Presenters then revise their papers based on this feedback and use it to shape their presentations. After the conference, participants provide further feedback on all presentations, leading to one final round of revisions before the papers are published in the conference proceedings by the end of the year.

This participation not only highlights Barkers dedication and expertise in the field of statistics and data science education but also underscores the universitys commitment to fostering inclusive and impactful research.

Funding for the research was made possible through a mini-grant from the Center for Advancement in Teaching and Learning.

More here:

Heather Barker travels to New Zealand to present at International Association of Statistics Education Roundtable - Today at Elon

Diversifying the Data Science Workforce: The DataMosaic Program – CDC

Written by admin on July 19, 2024 — Leave a Comment

Description

The Division of Global Migration and Quarantine (DGMQ) Office of Innovation, Development, Evaluation, and Analytics (IDEA) intentionally established a diverse team of fellows to develop DataMosaic, a CARES Act-funded proof-of-concept program that seeks to connect and harness population movement, epidemiological, and genomic data to increase public health response efficiency. In addition to IDEA's core analytics team, DataMosaic consists of 11 fellows from different programs, including Oak Ridge Institute for Science and Education (ORISE), Public Health Analytics and Modeling (PHAM) Program, and Public Health Informatics Fellowship Program (PHIFP), as well as health communications fellows, who receive intermediate and advanced data science upskilling to help fill critical needs within the agency. IDEA aims to refine health equity science by leveraging the lived experiences of the diverse team of fellows in data science approaches and methodologies. The DataMosaic team created an analytics platform that combines data from disparate sources to enhance analyses.

In the spring of 2021, IDEA launched the DataMosaic Program with three goals in mind:

Through the DataMosaic Program, cutting-edge data science skills were provided to members of the public health workforce who historically have been underrepresented in the data science field. Fellows were presented with several professional development modules, including how to create a CV, build a federal resume, develop a BLUF ("Bottom Line Up Front") statement, select a peer-reviewed journal, and code their own professional website. IDEA also organized a data science career panel consisting of several CDC employees, fellows of other programs, contractors, consultants, and industry workers so DataMosaic fellows could learn about different ways they could apply their skills after their fellowship.

All fellows received one-on-one mentoring routinely, as well as project mentoring. Fellows were able to propose their own projects, recruit collaborators within and beyond the IDEA team, assist other groups within and beyond DGMQ, participate in emergency response deployments, and join ongoing program evaluation projects. The fellows created and led their own study group, and surveyed topics including machine learning, infectious disease modeling, and health equity. First-year fellows held a showcase for DGMQ in June 2022 to present their work from the past year.

DataMosaic fellows collaborated within and beyond DGMQ to provide innovative and advanced analytic solutions to public health problems. Currently, fellows are involved in over 20 active projects, both within and beyond DGMQ. Several manuscripts are in development for publication, including for a project that uses supervised machine learning to identify areas with low COVID-19 vaccine uptake.

DataMosaic is helping to build a robust and diverse public health workforce. Fellows have taken the skills they've learned through the DataMosaic program and have gone on to excel in public and private sector roles as data scientists, analysts, and engineers, including contracting positions (both at CDC and beyond), industry positions, and full-time positions at CDC. Additionally, some fellows have transitioned into graduate school or other CDC fellowships, continuing to advance their professional skills and training.

Diversifying the Data Science Workforce: The DataMosaic Program - CDC

Bipartisan Senate bill proposes $12B for DOEs AI work to spur energy breakthroughs, other advances – Utility Dive

Written by admin on July 19, 2024 — Leave a Comment

Dive Brief:

FASST will transform the vast repositories of scientific data produced at DOE user facilities to be AI-ready and build the next-generation of highly energy efficient AI supercomputers, the agency said Tuesday.

DOE in April published a pair of reports concluding AI can help manage the U.S. electric grid, including reducing emissions and lowering costs but also warning that AI could expose the country to a host of risks, including cyber or physical grid attacks, and supply chain compromises, if deployed navely.

AI is an innovative technology that can help unleash breakthroughs in energy technologies and enhance our national security, Secretary of Energy Jennifer Granholm said in a July 16 statement.FASST builds on DOE's role as the nation's steward of advanced supercomputing and research infrastructure.

According to a fact sheet distributed by DOE, the initiative will focus on four related areas:

AI-accelerated scientific discoveries can lead to affordable batteries for electric vehicles, breakthroughs in fusion energy, new cancer-fighting drugs, and help assure our national security, the agency said.

Sen. Joe Manchin,I-W. Va., and Sen.Lisa Murkowski, R-Alaska, introduced legislation to advance the FASST initiative on July 10.It requires the secretary of energy to report to Congress annually on the "progress, findings, and expenditures of the initiative's programs, and sets an annual budget of $2.4 billion over five years.

As AI technology takes the world by storm, the United States needs to meet the moment quickly and effectively before our adversaries do, Manchin said in a statement. Deploying our existing lab infrastructure and scientific expertise for AI instead of starting from scratch will also safeguard taxpayer dollars and allow for us to move quickly.

The legislation also establishes a network of AI research clusters built on DOEs existing infrastructure, calls for an AI risk evaluation and mitigation program to evaluate security risks,and directs the Federal Energy Regulatory Commission to initiate a rulemaking around the use of advanced computing to expedite the interconnection queue process. The legislation also directs DOE to study the growing energy demand of data centers and AI.

The Electric Power Research Institute in May published a report concluding data centers could consume 9% of the United States electricity generation by 2030, about double the amount consumed today.AI queries require about ten times the electricity of traditional internet searches, the report said.

AI can be used to reduce energy use in data centers and advanced manufacturing streamlining operations identifying new ways of processing data and information, West Monroes DeCotis said. The Departments efforts should set a standard by which AI capabilities and use can advance innovations and discoveries in scientific, energy and natural security communities.

Excerpt from:

Bipartisan Senate bill proposes $12B for DOEs AI work to spur energy breakthroughs, other advances - Utility Dive

Research Centers and Initiatives – University of Illinois Urbana-Champaign

Written by admin on July 19, 2024 — Leave a Comment

ACE Center for Evolvable Computing The ACE Center aims to devise novel technologies for scalable distributed computing that will improve the performance and energy efficiency of diverse applications by 100x over the expected computer systems of 2030.

Artificial Intelligence for Future Agricultural Resilience, Management, and Sustainability Institute (AIFARMS) AIFARMS brings together researchers in artificial intelligence and agriculture, combining their expertise to promote advances in agriculture through AI research areas such as computer vision, machine learning, data science, soft object manipulation, and intuitive human-robot interaction.

Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE) Co-sponsored by the Siebel School of Computing and Data Science and CSL, AICE is a collaboration to develop intelligent conversational systems that demonstrate contextual understanding and emotional intelligence and allow for personalization and the ability to interpret non-verbal communication while being ethical and fair.

Center for Digital Agriculture CDA was formed to help agricultural producers, researchers, and industries keep pace with how technology transforms how we feed and support a growing global population.

Center for Exascale-enabled Scramjet DesignCEESD will develop physics-faithful predictive simulations enabled by advanced computer science methods to leverage massive-scale computational resources to advance scramjet designs that leverage advanced high-temperature composite materials.

Coordinated Science Laboratory Led by professorKlara Nahrstedt, CSL is a premier, multidisciplinary research laboratory that focuses on information technology at the crossroads of circuits, computing, control, and communications.

c3.ai Digital Transformation Institute (DTI) C3.ai engages scientists to research and train practitioners in the Science of Digital Transformation, which operates at the intersection of artificial intelligence, machine learning, cloud computing, the Internet of Things, big data analytics, organizational behavior, public policy, and ethics.

Illinois-Insper Partnership Insper and the University of IllinoisUrbana-Champaign are working to support research and educational collaborations between the two institutions.

IMMERSE: Center for Immersive Computing IMMERSE brings expertise in immersive technologies, applications, and human factors to perform research, educate a new workforce, and build infrastructure, enabling a new immersive computing era.

Inclusive and Intelligent Technologies for Education (INVITE) The INVITE Institute seeks to reframe how AI-based educational technologies interact with learners fundamentally. INVITE develops AI techniques to track and promote skills that underlie successful learning and contribute to academic success: persistence, academic resilience, and collaboration.

Internet of Battlefield Things (IoBT) IoBT will connect soldiers with smart technology in armor, radios, weapons, and other objects to shorten the latency of decision cycles, improve the resiliency of tactical battlefield analytics, and push tailored machine intelligence to the point of need.

Mind in Vitro: an NSF Expedition in Computing This expedition imagines computers and robots that are human-designed, living, and programmable but whose behaviors are not specified and instead emerge. These systems will grow, heal, learn, and explore.

Molecular Maker Lab Institute (MMLI) The Molecule Maker Lab Institute focuses on developing new AI-enabled tools to accelerate automated chemical synthesis to advance the discovery and manufacture of novel materials and bioactive compounds.

National Center for Supercomputing Applications Led by professor Bill Gropp, NCSAis a hub of transdisciplinary research and digital scholarship where researchers unite to address research grand challenges for the benefit of science and society. Current research focus areas include digital agriculture, bio, health sciences, earth and environment, astronomy, and many more.

Read the rest here:

Research Centers and Initiatives - University of Illinois Urbana-Champaign

Understanding Conditional Probability and Bayes Theorem | by Sachin Date | Jul, 2024 – Towards Data Science

Written by admin on July 19, 2024 — Leave a Comment

Photo by Stephen Cobb on Unsplash A primer on two concepts that form the substrate of regression analysis

Few incidents in history exemplify how thoroughly conditional probability is woven into human thought, as remarkably as the events of September 26, 1983.

Just after midnight on September 26, a Soviet Early Warning Satellite flagged a possible ballistic missile launch from the United States directed toward to the Soviet Union. Lt. Col. Stanislav Petrov, the duty officer on shift at a secret EWS control center outside Moscow, received the warning on his screens. He had only minutes to decide whether to flag the signal as legitimate and alert his superior.

The EWS was specifically designed to detect ballistic missile launches. If Petrov had told his boss that the signal was real, the Soviet leadership would have been well within the bounds of reason to interpret it as the start of a nuclear strike on the USSR. To complicate matters, the Cold War had reached a terrifying crescendo in 1983, boosting the probability in the minds of the Soviets that the signal from the EWS was, in fact, the real thing.

Accounts differ on exactly when Petrov informed his superior about the alarm and what information was exchanged between the two men, but two things are certain: Petrov chose to disbelieve what the EWS alarm was implying namely, that the United States had launched a nuclear missile against the Soviet Union and Petrovs superiors deferred to his judgement.

With thousands of nuclear-tipped missiles aimed at each other, neither superpower would have risked retaliatory annihilation by launching only one or a few missiles at the other. The bizarre calculus of nuclear war meant that if either side had to start it, they must do so by launching a tidy portion of their arsenal all at once. No major nuclear power would be so stupid as to start a nuclear war with just a few missiles. Petrov was aware of this doctrine.

Given that the EWS detected a solitary launch, in Petrovs mind the probability of its being real was vanishingly small despite the acutely heightened tensions of his era.

So Petrov waited. Crucial minutes passed. Soviet ground radars failed to detect any incoming missiles making it almost certain that it was a false alarm. Soon, the EWS fired four more alarms. Adopting the same line of logic, Petrov chose to flag all of them as not genuine. In reality, all alarms turned out to be false.

If Stanislav Petrov had believed the alarms were real, you might not be reading this article today, as I would not have written it.

The 1983 nuclear close call is an unquestionably extreme example of how human beings compute probabilities in the face of uncertainty, even without realizing it. Faced with additional evidence, we update our interpretation our beliefs about what weve observed, and at some point, we act or choose not to act based on those beliefs. This system of conditioning our beliefs on evidence plays out in our brain and in our gut every single day, in every walk of life from a surgeons decision to risk operating on a terminally ill cancer patient, to your decision to risk stepping out without an umbrella on a rainy day.

The complex probabilistic machinery that our biological tissue so expertly runs is based upon a surprisingly compact lattice of mathematical concepts. A key piece of this lattice is conditional probability.

In the rest of this article, Ill cover conditional probability in detail. Specifically, Ill cover the following:

Lets begin with the definition of conditional probability.

Conditional probability is the probability of event A occurring given that events B, C, D, etc., have already occurred. It is denoted as P(A | {B, C, D}) or simply P(A | B, C, D).

The notation P(A|B, C, D) is often pronounced as probability of A given B, C, D. Some authors also represent P(A|B, C, D) as P(A|B; C; D).

We assume that events B, C, D jointly influence the probability of A. In other words, event A does not occur independently of events B, C, and D. If event A is independent of events B, C, D, then P(A|B, C, D) equals the unconditional probability of A, namely P(A).

A subtle point to stress here is that when A is conditioned upon multiple other events, the probability of A is influenced by the joint probability of those events. For example, if event A is conditioned upon events B, C, and D, its the probability of the event (B C D) that A is conditions on.

Thus, P(A | B,C,D) is the same as saying P(A |B C D).

Well delve into the exact relation of joint probability with conditional probability in the section on Bayes theorem. Meanwhile, the thing to remember is that the joint probability P(A B C D) is different from the conditional probability P(A | B C D).

Lets see how to calculate conditional probability.

Every summer, millions of New Yorkers flock to the sun-splashed waters of the citys 40 or so beaches. With the visitors come the germs, which happily mix and multiply in the warm seawater of the summer months. There are at least a dozen different species and subspecies of bacteria that can contaminate seawater and, if ingested, can cause a considerable amount of, to put it delicately, involuntary expectoration on the part of the beachgoer.

Given this risk to public health, from April through October of each year, the New York City Department of Health and Mental Hygiene (DOHMH) closely monitors the concentration of enterococci bacteria a key indicator of seawater contamination in water samples taken from NYCs many beaches. DOHMH publishes the data it gathers on the NYC OpenData portal.

The following are the contents of the data file pulled down from the portal on 1st July 2024.

The data set contains 27425 samples collected from 40 beaches in the NYC area over a period of nearly 20 years from 2004 to 2024.

Each row in the data set contains the following pieces of information:

Before we learn how to calculate conditional probabilities, lets see how to calculate the unconditional (prior) probability of an event. Well do that by asking the following simple question:

What is the probability that the enterococci concentration in a randomly selected sample from this dataset exceeds 100 MPN per 100 ml of seawater?

Lets define the problem using the language of statistics.

Well define a random variable X to represent the enterococci concentration in a randomly selected sample from the dataset.

Next, well define an event A such that A occurs if, in a randomly selected sample, X exceeds 100 MPN/100 ml.

We wish to find the unconditional probability of event A, namely P(A).

Well calculate P(A) as follows:

From the data set of 27425 samples, if you count the samples in which the enterococci concentration exceeds 100, youll find this count to be 2414. Thus, P(A) is simply this count divided by the total number of samples, namely 27425.

Now suppose a crucial piece of information flows in to you: The sample was collected on a Monday.

In light of this additional information, can you revise your estimate of the probability that the enterococci concentration in the sample exceeds 100?

In other words, what is the probability of the enterococci concentration in a randomly selected sample exceeding 100, given that the sample was collected on a Monday?

To answer this question, well define a random variable Y to represent the day of the week on which the random sample was examined. The range of Y is [Monday, Tuesday,,Sunday].

Let B be the event that Y is a Monday.

Recall that A represents the event that X > 100 MPN/100 ml.

Now, we seek the conditional probability P(A | B).

In the dataset, 10670 samples happen to fall on a Monday. Out of these 10670 samples, 700 have an enterococci count exceeding 100. To calculate P(A | B), we divide 700 by 10670. Here, the numerator represents the event A B (A and B), while the denominator represents the event B.

We see that while the unconditional probability of the enterococci concentration in a sample is 0.08802 (8.8%), this probability drops to 6.56% when new evidence is gathered, namely that the samples were all collected on Mondays.

Conditional probability has the nice interpretative quality that the probability of an event can be revised as new pieces of evidence are gathered. This aligns well with our experience of dealing with uncertainty.

Heres a way to visualize unconditional and conditional probability. Each blue dot in the chart below represents a unique water sample. The chart shows the distribution of enterococci concentrations by the day of the week on which the samples were collected.

The green box contains the entire data set. To calculate P(A), we take the ratio of the number of samples in the green box in which the concentration exceeds 100 MPN/100 ml to the total number of samples in the green box.

The orange box contains only those samples that were collected on a Monday.

To calculate P(A | B), we take the ratio of the number of samples in the orange box in which the concentration exceeds 100 MPN/100 ml to the total number of samples in the orange box.

Now, lets make things a bit more interesting. Well introduce a third random variable Z. Let Z represent the month in which a random sample is collected. A distribution of enterococci concentrations by month, looks like this:

Suppose you wish to calculate the probability that the enterococci concentration in a randomly selected sample exceeds 100 MPN/100 ml, conditioned upon two events:

As before, let A be the event that the enterococci concentration in the sample exceeds 100.

Let B be the event that the sample was collected on a Monday.

Let C be the event that the sample was collected in July (Month 7).

You are now seeking the conditional probability: P (A | (B C)), or simply P (A | B, C).

Lets use the following 3-D plot to aid our understanding of this situation.

The above plot shows the distribution of enterococci concentration plotted against the day of the week and month of the year. As before, each blue dot represents a unique water sample.

The light-yellow plane slices through the subset of samples collected on Mondays i.e. on day of week=0. There happen to be 10670 samples lying along this plane.

The light-red plane slices through the subset of the samples collected in the month of July i.e., month = 7. There are 6122 samples lying along this plane.

The red dotted line marks the intersection of the two planes. There are 2503 samples (marked by the yellow oval) lying along this line of intersection. These 2503 samples were collected on July Mondays.

Among this subset of 2503 samples, are 125 samples in which the enterococci concentration exceeds 100 MPN/100 ml. The ratio of 125 to 2503 is the conditional probability P(A | B, C). The numerator represents the event A B C, while the denominator represents the event B C.

We can easily extend the concept of conditional probability to additional events D, E, F, and so on, although visualizing the additional dimensions lies some distance beyond what is humanly possible.

Now heres a salient point: As new events occur, the conditional probability doesnt always systematically decrease (or systematically increase). Instead, as additional evidence is factored in, conditional probability can (and often does) jump up and down in no apparent pattern, also depending on the order in the events are factored into the calculation.

Lets approach the job of calculating P(A | B) from a slightly different angle, specifically, from a set-theoretic angle.

Lets denote the entire dataset of 27425 samples as the set S.

Recall that A is the event that the enterococci concentration in a randomly selected sample from S exceeds 100.

From S, if you pull out all samples in which the enterococci concentration exceeds 100, youll get a set of size 2414. Lets denote this set as S_A. As an aside, note that event A occurs for every single sample in S_A.

Recall that B is the event that the sample falls on a Monday. From S, if you pull out all samples collected on a Monday, youll get a set of size 10670. Lets denote this set as S_B.

The intersection of sets S_A and S_B, denoted as S_A S_B, is a set of 700 samples in which the enterococci concentration exceeds 100 and the sample was collected on a Monday. The following Venn diagram illustrates this situation.

The ratio of the size of S_A S_B to the size of S is the probability that a randomly selected sample has an enterococci concentration exceeding 100 and was collected on a Monday. This ratio is also known as the joint probability of A and B, denoted P(A B). Do not mistake the joint probability of A and B for the conditional probability of A given B.

Using set notation, we can calculate the joint probability P(A B) as follows:

Now consider a different probability: the probability that a sample selected at random from S falls on a Monday. This is the probability of event B. From the overall dataset of 27425 samples, there are 10670 samples that fall on a Monday. We can express P(B) in set notation, as follows:

What I am leading up to with these probabilities is a technique to express P(A | B) using P(B) and the joint probability P(A B). This technique was first demonstrated by an 18th century English Presbyterian minister named Thomas Bayes (17011761) and soon thereafter, in a very different sort of way, by the brilliant French mathematician Pierre-Simon Laplace (17491827).

In their endeavors on probability, Bayes (and Laplace) addressed a problem that had vexed mathematicians for several centuries the problem of inverse probability. Simply put, they sought the solution to the following problem:

Knowing P(A | B), can you calculate P(B | A) as a function of P(A | B)?

While developing a technique for calculating inverse probability, Bayes indirectly proved a theorem that became known as Bayes Theorem or Bayes rule.

Bayes theorem not only allows us to calculate inverse probability, it also enables us to link three fundamental probabilities into a single expression:

When expressed in modern notation it looks like this:

Plugging in the values of P(A B) and P(B), we can calculate P(A | B) as follows:

The value 0.06560 for P(A | B) is of course the same as what we arrived at by another method earlier in the article.

Bayes Theorem itself is stated as follows:

In the above equation, the conditional probability P(B | A) is expressed in terms of:

Its in this form that Bayes theorem achieves phenomenal levels of applicability.

In many situations, P(B | A) cannot be easily estimated but its inverse, P(A | B), can be. The unconditional priors P(A) and P(B) can also be estimated via one of two commonly used techniques:

The point is, Bayes theorem gives you the conditional probability you seek but cannot easily estimate directly, in terms of its inverse probability which you can easily estimate and a couple of priors.

This seemingly simple procedure for calculating conditional probability has turned Bayes theorem into a priceless piece of computational machinery.

Bayes theorem is used in everything from estimating student performance on standardized test scores to hunting for exoplanets, from diagnosing disease to detecting cyberattacks, from assessing risk of bank failures to predicting outcomes of sporting events. In Law Enforcement, Medicine, Finance, Engineering, Computing, Psychology, Environment, Astronomy, Sports, Entertainment, Education there is scarcely any field left in which Bayes method for calculating conditional probabilities hasnt been used.

Lets return to the joint probability of A and B.

We saw how to calculate P(A B) using sets as follows:

Its important to note that whether or not A and B are independent of each, P(A B) is always the ratio of |S_A S_B| to |S|.

The numerator in the above ratio is calculated in one of the following two ways depending on whether A is independent of B:

Thus, when A and B are independent events, |S_A S_B| is calculated as follows:

The principle of conditional probability can be extended to any number of events. In the general case, the probability of an event E_s conditioned upon the occurrence of m other events E_1 through E_m can be written as follows:

We make the following observations about equation (1):

Now heres something interesting: In equation (1), if you rename the event E_s as y, and rename events E_1 through E_m as x_1 through X_m respectively, equation (1) suddenly acquires a whole new interpretation.

And thats the topic of the next section.

There is a triad of concepts upon which every single regression model rests:

Even within this illustrious trinity, conditional probability commands a preeminent position for two reasons:

Ill illustrate this using two very commonly used, albeit very dissimilar, regression models: the Poisson model, and the linear model.

Consider the task of estimating the daily counts of bicyclists on New York Citys Brooklyn Bridge. This data actually exists: for 7 months during 2017, the NYC Department of Transportation counted the number of bicyclists riding on all East River bridges. The data for the Brooklyn bridge looked like this:

Data such as these, which contain strictly whole-numbered values, can often be effectively modeled using a Poisson process and the Poisson probability distribution. Thus, to estimate the daily count of bicyclists, you would:

Consequently, the probability of observing a particular value of y, say y_i, will be given by the following Probability Mass Function of the Poisson probability distribution:

See more here:

Understanding Conditional Probability and Bayes Theorem | by Sachin Date | Jul, 2024 - Towards Data Science

Cloud Hosting

Category Archives: Data Science

How to Plan for Your Next Career Move in Data Science and Machine Learning | by TDS Editors | Jul, 2024 – Towards Data Science

BlazeFace: How to Run Real-time Object Detection in the Browser – Towards Data Science

Google and Howard University to train underrepresented high schoolers in data analyticsand its free to students – Fortune

Analytics and Data Science News for the Week of July 19; Updates from Microsoft, Qlik, Teradata & More – Solutions Review

Center for Community-Engaged Artificial Intelligence and the Connolly Alexander Institute for Data Science launch summer research program – Tulane…

Heather Barker travels to New Zealand to present at International Association of Statistics Education Roundtable – Today at Elon

Diversifying the Data Science Workforce: The DataMosaic Program – CDC

Bipartisan Senate bill proposes $12B for DOEs AI work to spur energy breakthroughs, other advances – Utility Dive

Research Centers and Initiatives – University of Illinois Urbana-Champaign

Understanding Conditional Probability and Bayes Theorem | by Sachin Date | Jul, 2024 – Towards Data Science

Recent Posts

Categories

Archives

Media Sites

Pages

Site admin