Category Archives: Data Science

David Mongeau to step down, interim director for country’s only HSI data science school announced – The University of Texas at San Antonio

A nationally recognized leader in the data science and artificial intelligence community, Mongeau brought to UTSA a distinguished record in leading research institutes and training programs, as well as in developing partnerships across government, industry, academia and the philanthropic community.

Under his leadership, the School of Data Science has recorded numerous achievements, including receiving $1.2 million in gift funding for data science, AI and machine learning student training and research programs. In addition to the undergraduate and graduate degree and certificate programs comprising the School of Data Science, school leaders are now developing a new certificate program in data engineering.

In partnership with the Association for Computing Machinery at UTSA and the National Security Agency, the School of Data Science in 2022 launched the annual Rowdy Datathon competition.

In April 2023, the school hosted the inaugural UTSA Draper Data Science Business Plan Competition, which highlights data science applications and student entrepreneurship; the second annual competition will be held at San Pedro I later this spring.

Also in 2023, the school hosted its inaugural Los Datos Conference. The school also now serves as administrative host to the university's annual RowdyHacks competition; more than 500 students from across Texas participated in the 9th annual RowdyHacks at San Pedro I last weekend.

Mongeau has been driven to increase the reach and reputation of the School of Data Science from and back to San Antonio. In October 2023, the School of Data Science hosted the annual meeting of the Academic Data Science Alliance, bringing together more than 200 data science practitioners, researchers and educators from across the country to UTSA. The school also invested nearly $400,000 to create opportunities for UTSA students and faculty to pursue projects and participate in national data science and AI experiences at, for example, University of Chicago, University of Michigan, University of Washington, and the U.S. Census Bureau.

Through a collaboration with San Antonio-based start-up Skew the Script, the school has reached 20,000 high school teachers and 400,000 high school students with open-source training in statistics and math, which are core to success in data science and AI.

"I consider myself so fortunate to have been part of the creation of the School of Data Science at UTSA," said Mongeau. "I thank the school's dedicated staff and core faculty for their commitment to the school, which is having an enduring impact on our students, the next generation of diverse data scientists who have embraced the school's vision to make our world more equitable, informed and secure. These Roadrunners are destined to become industry leaders and continue to advance the frontiers of data science and AI."

Immediately prior to joining UTSA, Mongeau served as executive director of the Berkeley Institute for Data Science at the University of California, Berkeley. As executive director, he set the strategic direction for the institute, expanded industry and foundation engagement, and applied data science and AI in health care, climate change, and criminal justice.

Notably, he also initiated three data science fellowship programs and forged partnerships to enhance opportunities for legal immigrants and refugees in data science careers.


How Google Used Your Data to Improve their Music AI – Towards Data Science

MusicLM fine-tuned on user preferences

MusicLM, Google's flagship text-to-music AI, was originally published in early 2023. Even in its basic version, it represented a major breakthrough and caught the music industry by surprise. However, a few weeks ago, MusicLM received a significant update. Here's a side-by-side comparison for two selected prompts:

Prompt: Dance music with a melodic synth line and arpeggiation:

Prompt: a nostalgic tune played by accordion band

This increase in quality can be attributed to a new paper by Google Research titled "MusicRL: Aligning Music Generation to Human Preferences." Apparently, this upgrade was considered so significant that they decided to rename the model. However, under the hood, MusicRL is identical to MusicLM in its key architecture. The only difference: finetuning.

An AI model built from scratch starts with zero knowledge and essentially does random guessing. Through training on data, the model extracts useful patterns and displays increasingly intelligent behavior as training progresses. One downside to this approach is that training from scratch requires a lot of data. Finetuning means taking an existing model and adapting it to a new task, or to approach the same task differently. Because the model has already learned the most important patterns, much less data is required.

For example, a powerful open-source LLM like Mistral7B can be trained from scratch by anyone, in principle. However, the amount of data required to produce even remotely useful outputs is gigantic. Instead, companies use the existing Mistral7B model and feed it a small amount of proprietary data to make it solve new tasks, whether that is writing SQL queries or classifying emails.
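To make the contrast concrete, here is a toy sketch of the idea, with an invented one-parameter model and made-up numbers, not Mistral7B or MusicLM: a model "pretrained" on plenty of data from one task is then adapted to a nearby task with only five examples.

```python
# Toy sketch of finetuning (invented model and data): reuse weights learned
# on a large dataset and adapt them with a small one.

def train(w, data, lr, epochs):
    """Plain gradient descent for a one-parameter model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x   # gradient of squared error
    return w

# "Pretraining": many examples from the task y = 2x.
pretrain_data = [(x, 2.0 * x) for x in range(1, 11)] * 10
w_pretrained = train(0.0, pretrain_data, lr=1e-4, epochs=20)

# "Finetuning": a handful of examples from the nearby task y = 2.2x.
finetune_data = [(x, 2.2 * x) for x in range(1, 6)]
w_finetuned = train(w_pretrained, finetune_data, lr=1e-3, epochs=50)

print(round(w_pretrained, 2), round(w_finetuned, 2))
```

The finetuning step starts from the pretrained weight rather than zero, so it reaches the new task's solution with far fewer examples, which is the whole point of the technique.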

The key takeaway is that finetuning does not change the fundamental structure of the model. It only adapts its internal logic slightly to perform better on a specific task. Now, let's use this knowledge to understand how Google finetuned MusicLM on user data.

A few months after the MusicLM paper, a public demo was released as part of Google's AI Test Kitchen. There, users could experiment with the text-to-music model for free. However, you might know the saying: if the product is free, YOU are the product. Unsurprisingly, Google is no exception to this rule. When using MusicLM's public demo, you were occasionally confronted with two generated outputs and asked to state which one you preferred. Through this method, Google was able to gather 300,000 user preferences within a couple of months.

As you can see from the screenshot, users were not explicitly informed that their preferences would be used for machine learning. While that may feel unfair, it is important to note that many of our actions on the internet are being used for ML training, whether it is our Google search history, our Instagram likes, or our private Spotify playlists. In comparison to these rather personal and sensitive cases, music preferences on the MusicLM playground seem negligible.

It is good to be aware that user data collection for machine learning is happening all the time and usually without explicit consent. If you are on LinkedIn, you might have been invited to contribute to so-called collaborative articles. Essentially, users are invited to provide tips on questions in their domain of expertise. Here is an example of a collaborative article on how to write a successful folk song (something I didn't know I needed).

Users are incentivized to contribute, earning them a Top Voice badge on the platform. However, my impression is that no one actually reads these articles. This leads me to believe that these thousands of question-answer pairs are being used by Microsoft (owner of LinkedIn) to train an expert AI system on these data. If my suspicion is accurate, I would find this example much more problematic than Google asking users for their favorite track.

But back to MusicLM!

The next question is how Google was able to use this massive collection of user preferences to finetune MusicLM. The secret lies in a technique called Reinforcement Learning from Human Feedback (RLHF) which was one of the key breakthroughs of ChatGPT back in 2022. In RLHF, human preferences are used to train an AI model that learns to imitate human preference decisions, resulting in an artificial human rater. Once this so-called reward model is trained, it can take in any two tracks and predict which one would most likely be preferred by human raters.

With the reward model set up, MusicLM could be finetuned to maximize the predicted user preference of its outputs. This means that the text-to-music model generated thousands of tracks, each track receiving a rating from the reward model. Through the iterative adaptation of the model weights, MusicLM learned to generate music that the artificial human rater likes.
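A minimal sketch of the reward-model idea, assuming nothing about MusicRL's actual architecture: pairwise human preferences train a scorer via the Bradley-Terry / logistic formulation, so that preferred items end up scoring higher. The feature vectors and data below are invented for illustration.

```python
import math
import random

# Hedged sketch (not MusicRL's code): a reward model trained on pairwise
# preferences. Each "track" is a feature vector; the model learns a linear
# score such that preferred tracks score higher (Bradley-Terry model).

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, epochs=100):
    """pairs: list of (preferred, rejected) feature vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            # P(preferred beats rejected) = sigmoid(score_p - score_r)
            margin = score(w, preferred) - score(w, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))
            # gradient ascent on the log-likelihood of the observed choice
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return w

# Toy data: raters always prefer the track with the higher first feature.
random.seed(0)
pairs = []
for _ in range(200):
    a = [random.random(), random.random()]
    b = [random.random(), random.random()]
    pairs.append((a, b) if a[0] > b[0] else (b, a))

w = train_reward_model(pairs, dim=2)
# The learned model should rank a high-first-feature track above a low one.
print(score(w, [0.9, 0.5]) > score(w, [0.1, 0.5]))
```

Once such a reward model exists, the generator can be tuned to maximize its score, which is the finetuning loop the article describes.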

In addition to the finetuning on user preferences, MusicLM was also finetuned on two other criteria:

1. Prompt Adherence: MuLan, Google's proprietary text-to-audio embedding model, was used to calculate the similarity between the user prompt and the generated audio. During finetuning, this adherence score was maximized.

2. Audio Quality: Google trained another reward model on user data to evaluate the subjective audio quality of its generated outputs. These user data seem to have been collected in separate surveys, not in MusicLM's public demo.
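The prompt-adherence objective reduces to an embedding similarity. MuLan itself is proprietary, but the scoring step can be sketched generically with cosine similarity over hypothetical embedding vectors:

```python
import math

# Generic sketch of embedding similarity (MuLan itself is not public):
# embed the prompt and the audio into the same vector space, then score
# adherence as the cosine similarity between the two embeddings.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for a prompt and two candidate audio clips.
prompt_emb = [0.9, 0.1, 0.3]
audio_on_topic = [0.8, 0.2, 0.4]
audio_off_topic = [0.1, 0.9, 0.1]

print(cosine_similarity(prompt_emb, audio_on_topic) >
      cosine_similarity(prompt_emb, audio_off_topic))
```

During finetuning, the generator is pushed toward outputs whose embedding sits closer to the prompt's embedding, i.e. toward a higher adherence score.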

The new, finetuned model seems to reliably outperform the old MusicLM; listen to the samples provided on the demo page. Of course, a selected public demo can be deceiving, as the authors are incentivized to showcase examples that make their new model look as good as possible. Hopefully, we will get to test out MusicRL in a public playground soon.

However, the paper also provides a quantitative assessment of subjective quality. For this, Google conducted a study and asked users to compare two tracks generated for the same prompt, giving each track a score from 1 to 5. Using this metric, with the fancy-sounding name Mean Opinion Score (MOS), we can compare not only the number of direct-comparison wins for each model but also the average rater score.
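As a toy illustration of how a MOS is computed (the ratings below are invented, not the paper's data), it is simply the mean of the 1-5 scores each model's tracks received:

```python
# Mean Opinion Score (MOS): average the 1-5 ratings raters assigned.
# The ratings below are made up for the example, not from the paper.

def mean_opinion_score(ratings):
    return sum(ratings) / len(ratings)

ratings_model_a = [4, 5, 4, 3, 5]   # hypothetical ratings for model A
ratings_model_b = [3, 3, 4, 2, 3]   # hypothetical ratings for model B

mos_a = mean_opinion_score(ratings_model_a)
mos_b = mean_opinion_score(ratings_model_b)
print(mos_a, mos_b)   # 4.2 3.0
```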

Here, MusicLM represents the original MusicLM model. MusicRL-R was only finetuned for audio quality and prompt adherence. MusicRL-U was finetuned solely on human feedback (the reward model). Finally, MusicRL-RU was finetuned on all three objectives. Unsurprisingly, MusicRL-RU beats all other models in direct comparison as well as on the average ratings.

The paper also reports that MusicRL-RU, the fully finetuned model, beat MusicLM in 87% of direct comparisons. The importance of RLHF can be shown by analyzing the direct comparisons between MusicRL-R and MusicRL-RU. Here, the latter had a 66% win rate, reliably outperforming its competitor.

Although the difference in output quality is noticeable, qualitatively as well as quantitatively, the new MusicLM is still quite far from human-level outputs in most cases. Even on the public demo page, many generated outputs sound rhythmically odd, fail to capture key elements of the prompt, or suffer from unnatural-sounding instruments.

In my opinion, this paper is still significant, as it is the first attempt at using RLHF for music generation. RLHF has been used extensively in text generation for more than one year. But why has this taken so long? I suspect that collecting user feedback and finetuning the model is quite costly. Google likely released the public MusicLM demo with the primary intention of collecting user feedback. This was a smart move and gave them an edge over Meta, which has equally capable models, but no open platform to collect user data on.

All in all, Google has pushed itself ahead of the competition by leveraging proven finetuning methods borrowed from ChatGPT. While even with RLHF, the new MusicLM has still not reached human-level quality, Google can now maintain and update its reward model, improving future generations of text-to-music models with the same finetuning procedure.

It will be interesting to see if and when other competitors like Meta or Stability AI will be catching up. For us as users, all of this is just great news! We get free public demos and more capable models.

For musicians, the pace of the current developments may feel a little threatening, and for good reason. I expect to see human-level text-to-music generation in the next 1-3 years. By that, I mean text-to-music AI that is at least as capable at producing music as ChatGPT was at writing texts when it was released. Musicians must learn about AI and how it can already support them in their everyday work. As the music industry is being disrupted once again, curiosity and flexibility will be the primary keys to success.


4 Emerging Strategies to Advance Big Data Analytics in Healthcare –

February 28, 2024 - While the potential for big data analytics in healthcare has been a hot topic in recent years, the possible risks of using these tools have received just as much attention.

Big data analytics technologies have demonstrated their promise in enhancing multiple areas of care, from medical imaging and chronic disease management to population health and precision medicine. These algorithms could increase the efficiency of care delivery, reduce administrative burdens, and accelerate disease diagnosis.

But despite all the good these tools could achieve, the harm these algorithms could cause is nearly as significant.

Concerns about data access and collection, implicit and explicit bias, and issues with patient and provider trust in analytics technologies have hindered the use of these tools in everyday healthcare delivery.

Healthcare researchers and provider organizations are working to solve these issues, facilitating the use of big data analytics in clinical care for better quality and outcomes.


In this primer, HealthITAnalytics will explore how improving data quality, addressing bias, prioritizing data privacy, and building providers' trust in analytics tools can advance the four types of big data analytics in healthcare.

In healthcare, it's widely understood that the success of big data analytics tools depends on the value of the information used to train them. Algorithms trained on inaccurate, poor-quality data can yield erroneous results, leading to inadequate care delivery.

However, obtaining quality training data is complex and time-intensive, leaving many organizations without the resources to build effective models.

Researchers across the industry are working to overcome this challenge.

In 2019, a team from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) developed an automated system to gather more data from images to train machine learning models, synthesizing a massive dataset of distinct training examples.


This approach is beneficial for use cases in which high-quality images are available, but there are too few to develop a robust dataset. The synthesized dataset can be used to improve the training of machine learning models, enabling them to detect anatomical structures in new scans.

This image segmentation approach helps address one of the major data quality issues: insufficient data points.

But what about cases with a wealth of relevant data but of varying quality, or with challenges in synthesizing the data?

In these cases, its useful to begin by defining and exploring some common healthcare analytics concepts.

Data quality, as the name suggests, is a way to measure the reliability and accuracy of the data. Addressing quality is critical to healthcare data generation, collection, and processing.


If the data collection process yielded a sufficient number of data points but there is a question of quality, stakeholders can look at the data's structure and identify whether converting the datasets into a common format is appropriate. This is known as data standardization, and it can help ensure that the data are consistent, which is necessary for effective analysis.

Data cleaning - flagging and addressing data abnormalities - and data normalization, the process of organizing data, can take standardization even further.
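A small sketch of what standardization and cleaning can look like in practice. The field names, source schemas, and records below are hypothetical, invented purely for illustration:

```python
# Illustrative sketch: standardize records from two invented source schemas
# into one common format, then clean obvious abnormalities.

def standardize(record, source):
    """Map source-specific field names and units onto one shared schema."""
    if source == "ehr_a":
        return {"patient_id": record["pid"], "weight_kg": record["weight_kg"]}
    if source == "ehr_b":
        # hypothetical source B stores weight in pounds; convert to kilograms
        return {"patient_id": record["id"],
                "weight_kg": record["weight_lb"] * 0.453592}
    raise ValueError(f"unknown source: {source}")

def clean(records, lo=2.0, hi=300.0):
    """Flag and drop physiologically implausible weights."""
    return [r for r in records if lo <= r["weight_kg"] <= hi]

raw_a = [{"pid": "A1", "weight_kg": 70.0},
         {"pid": "A2", "weight_kg": 7000.0}]   # A2: likely data-entry error
raw_b = [{"id": "B1", "weight_lb": 154.0}]

records = [standardize(r, "ehr_a") for r in raw_a]
records += [standardize(r, "ehr_b") for r in raw_b]
records = clean(records)
print([(r["patient_id"], round(r["weight_kg"], 1)) for r in records])
```

After standardization both sources share one schema and one unit, and cleaning drops the implausible record, which is exactly the consistency the paragraph above describes.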

Tools like the United States Core Data for Interoperability (USCDI) and USCDI+ can help in cases where a healthcare organization doesn't have enough high-quality data.

In scenarios with a large amount of data, synthesizing the data for analysis creates another potential hurdle.

As seen throughout the COVID-19 pandemic, when data related to the virus became available globally, healthcare leaders faced the challenge of creating high-quality datasets to help researchers answer vital questions about the virus.

In 2020, the White House Office of Science and Technology Policy issued a call to action for experts to synthesize an artificial intelligence (AI) algorithm-friendly COVID-19 dataset to bolster these efforts.

The dataset represents an extensive machine-readable coronavirus literature collection - including over 29,000 articles at the time of creation - designed to help researchers sift through and analyze the data more quickly.

By promoting collaboration among researchers, healthcare institutions, and other stakeholders, initiatives like this can support the efficient synthesis of large-scale, high-quality datasets.

As healthcare organizations become increasingly reliant on analytics algorithms to help them make care decisions, bias is a major hurdle to the safe and effective deployment of these tools.

Tackling algorithmic bias requires stakeholders to be aware of how biases are introduced and reproduced at every stage of algorithm development and deployment. In many algorithms, bias can be baked in almost immediately if the developers rely on biased data.

The US Department of Health and Human Services (HHS) Office of Minority Health (OMH) indicates that lack of diversity in an algorithm's training data is a significant source of bias. Further, bias can be coded into algorithms based on developers' beliefs or assumptions, including implicit and explicit biases.

If, for example, a developer incorrectly assumes that symptoms of a particular condition are more common or severe in one population than another, the resulting algorithm could be biased and perpetuate health disparities.

Some have suggested that bringing awareness to potential biases can remedy the issue of algorithmic bias, but research suggests that a more robust approach is required. One study published in the Future Healthcare Journal in 2021 demonstrated that while bias training can help individuals recognize biases in themselves and others, it is not an effective debiasing strategy.

The OMH recommends best practices beyond bias training, encouraging developers to work with diverse stakeholders to ensure that algorithms are adequately developed, validated, and reviewed to maximize utility and minimize harm.

In scenarios where diverse training data for algorithms is unavailable, techniques like synthetic data can help minimize potential biases.

In terms of algorithm deployment and monitoring, the OMH suggests that the tools should be implemented gradually and that users should have a way to provide feedback to the developers for future algorithm improvement.

To this end, developers can work with experts and end-users to understand what clinical measures are important to providers, according to researchers from the University of Massachusetts Amherst.

In recent years, healthcare stakeholders have increasingly developed frameworks and best practices to minimize bias in clinical algorithms.

A panel of experts convened by the Agency for Healthcare Research and Quality (AHRQ) and the National Institute on Minority Health and Health Disparities (NIMHD) published a special communications article in the December 2023 issue of JAMA Network Open outlining five principles to address the impact of algorithm bias on racial and ethnic disparities in healthcare.

The framework guides healthcare stakeholders to mitigate and prevent bias at each stage of an algorithms life cycle by promoting health equity, ensuring algorithm transparency, earning trust by engaging patients and communities, explicitly identifying fairness issues, and establishing accountability for equity and fairness in outcomes from algorithms.

When trained using high-quality data and deployed in settings that will be monitored and adjusted to minimize biases, algorithms can help address disparities in maternal health, preterm births, and social determinants of health (SDOH).

In algorithm development, data privacy and security are high on the list of concerns. Legal, privacy, and cultural obstacles can keep researchers from accessing the large, diverse data sets needed to train analytics technologies.

Over the years, experts have worked to craft approaches that can balance the need for data access against the need to protect patient privacy.

In 2020, a team from the University of Iowa (UI) set out to develop a solution to this problem. With a $1 million grant from the National Science Foundation (NSF), UI researchers created a machine learning platform to train algorithms with data from around the world.

The tool is a decentralized, asynchronous solution called ImagiQ, and it relies on an ecosystem of machine learning models so that institutions can select models that work best for their populations. Using the platform, organizations can upload and share the models, but not patient data, with each other.

The researchers indicated that traditional machine learning methods require a centralized database where patient data can be directly accessed for use in model training, but these approaches are often limited by practical issues like information security, patient privacy, data ownership, and the burden on health systems tasked with creating and maintaining those centralized databases.

ImagiQ helps overcome some of these challenges, but it is not the only framework to do so.

Researchers from the University of Pittsburgh Swanson School of Engineering were awarded $1.7 million from the National Institutes of Health (NIH) in 2022 to advance their efforts to develop a federated learning (FL)-based approach to achieve fairness in AI-assisted medical screening tools.

FL is a privacy-protection method that enables researchers to train AI models across multiple decentralized devices or servers holding local data samples without exchanging them.

The approach is useful for improving model performance without compromising data privacy, as AI trained on one institutions data typically does not generalize well on data from another.
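The FL training loop can be sketched as federated averaging (FedAvg): each site trains on its own local data and shares only model weights, which a server averages. The toy one-parameter model and invented site datasets below are for illustration only, not any institution's actual system:

```python
# Minimal federated-averaging (FedAvg) sketch with an invented model:
# no patient data leaves a site; only trained weights are exchanged.

def local_train(w, data, lr=0.01, steps=100):
    """Local gradient descent for y = w * x on one site's own data."""
    for _ in range(steps):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

def federated_round(global_w, site_datasets):
    # Each site starts from the shared global weight and trains locally...
    local_weights = [local_train(global_w, d) for d in site_datasets]
    # ...then the server averages the returned weights.
    return sum(local_weights) / len(local_weights)

# Two hypothetical "hospitals" with slightly different data distributions.
site_a = [(x / 10, 3.0 * x / 10) for x in range(1, 11)]   # roughly y = 3.0x
site_b = [(x / 10, 3.2 * x / 10) for x in range(1, 11)]   # roughly y = 3.2x

w = 0.0
for _ in range(5):
    w = federated_round(w, [site_a, site_b])
print(round(w, 2))
```

The averaged model lands between the two sites' local optima, so both contribute without either ever exposing raw records.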

However, FL is not a perfect solution, as experts from the University of Southern California (USC) Viterbi School of Engineering pointed out at the 2023 International Workshop on Health Intelligence. They posited that FL brings forth multiple concerns, such as its ability to make predictions based on what it's learned from its training data and the hurdles presented by missing data and the data harmonization process.

The research team presented a framework for addressing these challenges, but there are other tools healthcare stakeholders can use to prioritize data privacy, such as confidential computing or blockchain. These tools center on making the data largely inaccessible and resistant to tampering by unauthorized parties.

Alternatives that do not require significant investments in cloud computing or blockchain are also available to stakeholders through privacy-enhancing technologies (PETs), three of which are particularly suited to healthcare use cases.

Algorithmic PETs like encryption, differential privacy, and zero-knowledge proofs protect data privacy by altering how the information is represented while ensuring it is usable. Often, this involves modifying the changeability or traceability of healthcare data.

In contrast, architectural PETs focus on the structure of data or computation environments, rather than how those data are represented, to enable users to exchange information without exchanging any underlying data. Federated learning, secure multi-party computation, and blockchain fall into this PET category.

Augmentation PETs, as the name suggests, augment existing data sources or create fully synthetic ones. This approach can help enhance the availability and utility of data used in healthcare analytics projects. Digital twins and generative adversarial networks are commonly used for this purpose.
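As one concrete example from the algorithmic PET category above, the Laplace mechanism of differential privacy adds calibrated noise to an aggregate query so that no individual record is identifiable while the statistic stays usable. This is a generic textbook sketch with toy records, not a production implementation:

```python
import math
import random

# Hedged sketch of differential privacy via the Laplace mechanism:
# perturb a count query with noise scaled to the query's sensitivity.

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon, rng):
    """Count matching records; a count query has sensitivity 1,
    so Laplace noise with scale 1/epsilon yields epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
patients = [{"diabetic": i % 4 == 0} for i in range(1000)]  # toy records
noisy = private_count(patients, lambda p: p["diabetic"], epsilon=1.0, rng=rng)
print(round(noisy))  # close to the true count of 250
```

A smaller epsilon means more noise and stronger privacy; the released count remains useful for analytics while masking any single patient's contribution.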

But even the most robust data privacy infrastructure cannot compensate for a lack of trust in big data analytics tools.

Just as patients need to trust that analytics algorithms can keep their data safe, providers must trust that these tools can deliver information in a functional, reliable way.

The issue of trustworthy analytics tools has recently taken center stage in conversations around how Americans interact with AI - knowingly and unknowingly - in their daily lives. Healthcare is one of the industries where advanced technologies present the most significant potential for harm, leading the federal government to begin taking steps to guide the deployment and use of algorithms.

In October 2023, President Joe Biden signed the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, which outlines safety, security, privacy, equity, and other standards for how industry and government should approach AI innovation.

The order's directives are broad, as they are designed to apply to all US industries, but it does lay out some industry-specific directives for those looking at how it will impact healthcare. Primarily, the executive order provides a framework for creating standards, laws, and regulations around AI and establishes a roadmap of subsequent actions that government agencies, like HHS, must take to build such a framework.

However, this process will take months, and more robust regulation of healthcare algorithms could take even longer, leading industry stakeholders to develop their own best practices for using analytics technologies in healthcare.

One such effort is the National Academy of Medicine (NAM) Artificial Intelligence Code of Conduct (AICC), a collaboration among healthcare, research, and patient advocacy groups to create a national architecture for responsible AI use in healthcare.

In a 2024 interview with HealthITAnalytics, NAM leadership emphasized that this governance infrastructure is necessary to gain trust and improve healthcare as advanced technologies become more ubiquitous in care settings.

However, governance structure must be paired with education and clinician support to obtain buy-in from providers.

Some of this can start early, as evidenced by recent work from the University of Texas (UT) health system to incorporate AI training into medical school curriculum. Having staff members dedicated to spearheading analytics initiatives, such as a chief analytics officer, is another approach that healthcare organizations can use to make providers feel more comfortable with these tools.

These staff can also work to bolster trust at the enterprise level by focusing on creating a healthcare data culture, gaining provider buy-in from the top down, and having strategies to address concerns about clinician overreliance on analytics technologies.

With healthcare organizations increasingly leveraging big data analytics tools for enhanced insights and streamlined care processes, overcoming data quality, bias, privacy, and security issues and fostering user trust will be critical for successfully using these models in clinical care.

As research evolves around AI, machine learning, and other analytics algorithms, the industry will keep refining these tools for improved patient care.


Data Science Market: Unleashing Insights with AI and Machine Learning, Embracing a 31.0% CAGR and to Grow USD … – GlobeNewswire

Covina, Feb. 28, 2024 (GLOBE NEWSWIRE) -- According to a recent research study, the Data Science Market was valued at about USD 80.5 Billion in 2024 and is expected to grow at a CAGR of 31.0% to reach a value of USD 941.8 Billion by 2034.

What is Data Science?

Market Overview:

Data science is a multidisciplinary field that involves extracting insights and knowledge from data using various scientific methods, algorithms, processes, and systems. It combines aspects of statistics, mathematics, computer science, and domain expertise to analyze complex data sets and solve intricate problems.

The primary goal of data science is to extract valuable insights, patterns, trends, and knowledge from structured and unstructured data.


Regional Analysis:

Regional insights highlight the diverse market dynamics, regulatory landscapes, and growth drivers shaping the Data Science Market across different geographic areas. Understanding regional nuances and market trends is essential for stakeholders to capitalize on emerging opportunities and drive market expansion in the Data Science sector.

The North America market is estimated to witness the fastest growth over the forecast period. The adoption of cloud computing services, including infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS), has accelerated in North America. Cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer scalable, cost-effective solutions for data storage, processing, and analytics, driving adoption among enterprises.

Report scope:

By End-User - Banking and Financial Institutions (BFSI), Telecommunication, Transportation and Logistics, Healthcare, and Manufacturing

Europe - UK, Germany, Spain, France, Italy, Russia, Rest of Europe

Asia Pacific - Japan, India, China, South Korea, Australia, Rest of Asia-Pacific

Latin America - Brazil, Mexico, Argentina, Rest of Latin America

Middle East & Africa - South Africa, Saudi Arabia, UAE, Rest of Middle East & Africa


Computer scientist traces her trajectory from stunt flying to a startup – GeekWire

Computer scientist Cecilia Aragon tells her life story at the Women's Leadership Conference, presented by the Bellevue Chamber. (GeekWire Photo / Alan Boyle)

BELLEVUE, Wash. - Three decades ago, Cecilia Aragon made aviation history as the first Latina to earn a place on the U.S. Unlimited Aerobatic Team. She went on to write a book about it, titled "Flying Free."

Today, she's still flying free, as a professor and data scientist at the University of Washington and as the co-founder of a Seattle startup that aims to commercialize her research.

Aragon recounted her personal journey today during a talk at the Women's Leadership Conference, presented by the Bellevue Chamber. The conference brought nearly 400 attendees to Bellevue's Meydenbauer Center to hear about topics ranging from financial literacy to sports management.

Aragon's aerobatic days began in 1985, when she accepted an invitation from a co-worker to take a ride in his flying club's Piper Cherokee airplane.

"The first thing I thought was, 'I'm the person who's scared of climbing a stepladder. I'm scared of going in an elevator,'" she recalled.

But then she thought of her Chilean-born father. "I heard my father's voice, saying, 'What is stopping you from doing whatever you want?'" she said. She swallowed her fears, climbed into the plane, and was instantly hooked.

"It's so gorgeous to fly out over the water and see the sun glinting up on the water, like a million gold coins," she said. "And when we got down to the ground, I said, 'I want to take flying lessons. I want to be the pilot of my own life.'"

Aragon said she went through three flight instructors, but gradually overcame her fears. "I learned to turn fear into excitement," she said. The excitement reached its peak in 1991, when she was named to the U.S. aerobatic team and went on to win bronze medals at the U.S. national and world aerobatic championships.

That wasn't the only dream that Aragon has turned into reality. After leaving the aerobatic team, she worked as a computer scientist at NASA's Ames Research Center in Silicon Valley, earned her Ph.D. at Berkeley and became a staff scientist at Lawrence Berkeley National Laboratory. Aragon joined UW's faculty in 2010 and is now the director of the university's Human-Centered Data Science Lab.

"I love it," she said. "My students amaze me and excite me every single day."

Aragon's research focuses on how people make sense of vast data sets, using computer algorithms and visualizations. She holds several patents relating to visual representations of travel data, and with the help of UW's CoMotion Labs and Mobility Innovation Center, Aragon and her teammates have turned that data science into a startup called Traffigram.

For the past year, Traffigram's small team has been working in semi-stealth mode to develop software that can analyze multiple travel routes, determine the quickest way to get from Point A to Point B, and present the information in an easy-to-digest format. Aragon is the venture's chief scientist, and her son, Ken Aragon, is co-founder and CEO.

"It's a family business," she told GeekWire. "We've gotten a great response from potential customers so far, and we've raised some money."

So how does creating a startup compare with aerobatic stunt flying?

"I think there are a lot of similarities, because it's very risky," Aragon said. "As they have told me many times, most startup businesses fail. You know, that's just like what they told me with aerobatics - that very few people make the U.S. aerobatic team, and it's probably not going to happen. I said, 'Yeah, but I'm going to enjoy the path I believe in.' So I believe in the mission we have, to make transportation more accessible to everyone."

Originally posted here:

Computer scientist traces her trajectory from stunt flying to a startup - GeekWire

Why LLMs are not Good for Coding. Challenges of Using LLMs for Coding | by Andrea Valenzuela | Feb, 2024 – Towards Data Science

Self-made image

Over the past year, Large Language Models (LLMs) have demonstrated astonishing capabilities thanks to their natural language understanding. These advanced models have not only redefined the standards in Natural Language Processing but have also made their way into countless applications and services.

There has been rapidly growing interest in using LLMs for coding, with some companies striving to turn natural language processing into code understanding and generation. This effort has already surfaced several challenges that remain to be addressed when using LLMs for coding. Despite these obstacles, the trend has led to the development of AI code generator products.

Have you ever used ChatGPT for coding?

While it can be helpful in some instances, it often struggles to generate efficient and high-quality code. In this article, we will explore three reasons why LLMs are not inherently proficient at coding out of the box: the tokenizer, the complexity of context windows when applied to code, and the nature of the training itself.

Identifying the key areas that need improvement is crucial to transforming LLMs into more effective coding assistants!

The LLM tokenizer is responsible for converting the user's input text, written in natural language, into a numerical format that the LLM can understand.

The tokenizer processes raw text by breaking it down into tokens. Tokens can be whole words, parts of words (subwords), or individual characters, depending on the tokenizers design and the requirements of the task.

Since LLMs operate on numerical data, each token is given an ID that depends on the LLM's vocabulary. Each ID is then associated with a vector in the LLM's latent high-dimensional space. To perform this last mapping, LLMs use learned embeddings, which are fine-tuned during training and capture complex relationships and nuances in the data.
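The first two steps above can be sketched in a few lines of Python. This is a toy greedy longest-match subword tokenizer with a made-up vocabulary and made-up IDs, purely for illustration; real LLM tokenizers (e.g. BPE-based ones) learn their vocabularies from data.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, illustrative
# vocabulary. Both the subwords and their IDs here are invented for the
# example, not taken from any real LLM.
vocab = {"un": 0, "believ": 1, "able": 2, "token": 3, "izer": 4, "s": 5}

def tokenize(text):
    """Split text into known subwords, left to right, longest match first."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no subword in the vocabulary matches at position {i}")
    return tokens

def encode(text):
    """Map each subword to its integer ID, as an LLM input layer expects."""
    return [vocab[t] for t in tokenize(text)]

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(encode("tokenizers"))      # [3, 4, 5]
```

Note how "unbelievable" never appears in the vocabulary, yet it is still representable as a sequence of known subwords; that is precisely what lets tokenizers cope with open vocabularies before the embedding lookup happens.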

If you are interested in playing around with different LLM tokenizers and see how they

Follow this link:

Why LLMs are not Good for Coding. Challenges of Using LLMs for Coding | by Andrea Valenzuela | Feb, 2024 - Towards Data Science

Virtual Data Science and Analytics Day | Student Affairs – The University of Virginia

Register now for the Spring Virtual Data Science and Analytics Day. This evening event provides a unique opportunity for students to engage with organizations looking to hire for data science and analytics roles. Throughout this virtual event, you will have an opportunity to participate in one-on-one conversations and group sessions with employers and alumni. Spaces are first-come, first-served, so sign up in advance.

See original here:

Virtual Data Science and Analytics Day | Student Affairs - The University of Virginia

iPhone Creator Suggests Opinions Drive Innovation, not Data – Towards Data Science

Source: DALL-E

"We need to be data-driven," says everyone. And yes, I agree 90% of the time, but it shouldn't be taken as a blanket statement. Like everything else in life, recognizing where it does and doesn't apply is important.

In a world obsessed with data, it's the bold, opinionated decisions that break through to revolutionary innovation.

The Economist wrote about the rumoured, critical blunders of McKinsey in the 1980s, during the early days of the mobile phone era. AT&T asked McKinsey to project the size of the mobile phone market.

McKinsey, presumably after rigorous projections, expert calls, and data crunching, shared that the estimated total market would be about 900,000 phones. They based it on data, specifically data from that time, when the phone was bulky, heavy, and a necessary evil for people on the move. Data lags.

AT&T initially pulled out, in part due to those recommendations, before diving back into the market to compete. Some weren't as lucky. Every strategy consultant in South Korea knows the rumours of McKinsey giving similar advice to one of the largest conglomerates that used to go head-to-head with Samsung: LG. LG pulled out of the market, losing even the chance to take a shot at becoming a global leader in this estimated 500-billion-dollar market.

Today, the World Economic Forum shares in a recent analysis that there are more smartphones than people on Earth, with roughly 8.6 billion phone subscriptions.

Tony Fadell, the designer and builder of the iPhone and Nest, shares in his book "Build" that decisions are driven by some proportion of opinions and data, and that decisions for the very first version of a revolutionary product, as opposed to an evolutionary one, are by definition opinion-driven. The two are useful for different types of innovation:

See original here:

iPhone Creator Suggests Opinions Drive Innovation, not Data - Towards Data Science

Advanced Selection from Tensors in Pytorch | by Oliver S | Feb, 2024 – Towards Data Science

In some situations, you'll need to do some advanced indexing/selection with PyTorch, e.g., to answer the question: how can I select elements from Tensor A following the indices specified in Tensor B?

In this post we'll present the three most common methods for such tasks, namely torch.index_select, torch.gather, and torch.take. We'll explain all of them in detail and contrast them with one another.

Admittedly, one motivation for this post was me forgetting how and when to use which function, ending up googling, browsing Stack Overflow, and consulting the, in my opinion, relatively brief and not too helpful official documentation. Thus, as mentioned, here we do a deep dive into these functions: we motivate when to use which, give examples in 2D and 3D, and show the resulting selection graphically.

I hope this post will bring clarity about said functions and remove the need for further exploration. Thanks for reading!

And now, without further ado, let's dive into the functions one by one. For all of them, we first start with a 2D example and visualize the resulting selection, and then move to a somewhat more complex example in 3D. Further, we re-implement the executed operation in simple Python so that you can look at pseudocode as another source of information on what these functions do. In the end, we summarize the functions and their differences in a table.

torch.index_select selects elements along one dimension, while keeping the others unchanged. That is: keep all elements from all other dimensions, but pick elements in the target dimension following the index tensor. Let's demonstrate this with a 2D example, in which we select along dimension 1:
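The example's code block did not survive the page extraction. As a stand-in, here is a minimal sketch of the described selection re-implemented in plain Python (the post re-implements each operation in simple Python anyway); the tensor values and indices are illustrative, not the author's originals.

```python
# What torch.index_select(A, dim=1, index=torch.tensor([0, 2])) does,
# re-implemented on a plain nested list. Values are illustrative.
A = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11]]   # shape [3, 4]
idx = [0, 2]           # indices into dimension 1 (columns)

# For every element along dimension 0 (each row), pick the columns in idx;
# all other dimensions are kept unchanged.
picked = [[row[j] for j in idx] for row in A]
print(picked)  # [[0, 2], [4, 6], [8, 10]] -- shape [3, 2]
```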

The resulting tensor has shape [len_dim_0, num_picks]: for every element along dimension 0, we have picked the same elements from dimension 1. Let's visualize this:

Read more here:

Advanced Selection from Tensors in Pytorch | by Oliver S | Feb, 2024 - Towards Data Science

Mosaic Data Science’s Neural Search Solution Named the Top Insight Engine of 2024 by CIO Review – Newswire

Press Release Feb 28, 2024 10:00 EST

Mosaic Data Science has been recognized as the Top Insight Engines Solutions Provider of 2024 by CIO Review magazine for its Neural Search Engine framework.

LEESBURG, Va., February 28, 2024 - In a significant acknowledgment of its pioneering efforts in the realm of insight engines, Mosaic Data Science has been recognized as the Top Insight Engines Solutions Provider of 2024 by CIO Review magazine for its Neural Search Engine framework. The accolade is a testament to Mosaic's ability to address and solve complex customer challenges using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) architectures, positioning the company at the forefront of innovation in the Generative AI landscape.

The Neural Search Engine has revolutionized how businesses comb through vast amounts of data, automating text, image, video, and audio information retrieval from all corporate documents and significantly enhancing efficiency and productivity. With its advanced modeling and architecture frameworks, Neural Search provides firms with a robust set of templates for the secure tuning of AI models, tailoring them to an organization's specific data and requirements.

Mosaic's Neural Search Engine is designed for versatility. Whether organizations have already deployed a production-grade AI search system and seek assistance with nuanced queries or contextualized results, or are exploring the right LLM for their needs, Mosaic offers a custom-built, cutting-edge solution. The engine's ability to understand the nuances of human language and deliver actionable insights empowers businesses to make informed, data-driven decisions, effectively transforming how companies access and leverage information.
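To ground the terminology: the retrieval half of a RAG-style search system ranks documents by vector similarity between an embedding of the query and embeddings of each document. The sketch below uses toy bag-of-words vectors and an invented document set purely for illustration; real systems use learned neural embeddings, and nothing here reflects Mosaic's actual implementation.

```python
import math

def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word occurs in text."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["search", "invoice", "contract", "audio", "report"]
documents = {
    "doc1": "quarterly report with audit report figures",
    "doc2": "signed contract and invoice archive",
    "doc3": "audio transcript search index",
}

# Rank documents by similarity to the query embedding, best match first.
query = embed("find the invoice contract", vocabulary)
ranked = sorted(documents,
                key=lambda d: cosine(embed(documents[d], vocabulary), query),
                reverse=True)
print(ranked[0])  # doc2 -- it shares "invoice" and "contract" with the query
```

In a full RAG pipeline, the top-ranked documents would then be passed to an LLM as context for generating the final answer.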

The Insight Engines award from CIO Review highlights Mosaic's commitment to a vendor-agnostic approach, ensuring seamless integration with existing data sources, infrastructure, AI, and governance tools. By adopting Mosaic's Neural Search Engine, businesses can embrace the future of search technology without discarding their current investments, taking what works and integrating it.

The recognition includes a feature in the print edition of CIO Review's Insight Engines special. This accolade is not just a win for Mosaic but a win for the future of efficient, intelligent search solutions that cater to the evolving needs of businesses.

Source: Mosaic Data Science

Read the original post:

Mosaic Data Science's Neural Search Solution Named the Top Insight Engine of 2024 by CIO Review - Newswire