Category Archives: Data Science

Multilingual RAG, Algorithmic Thinking, Outlier Detection, and Other Problem-Solving Highlights – Towards Data Science


Feeling inspired to write your first TDS post? We're always open to contributions from new authors.

When we think about problem-solving, our focus tends to be on the solving part: the powerful hack, a new magical tool, a few lines of code that make everything click into place. In reality, a lot has to happen before these final touches can work: from developing a solid understanding of what the problem actually is, to sketching out a workable process that ensures consistent success rather than just a temporary band-aid.

This week's highlights stand out for their holistic approach to finding effective solutions to occasionally thorny challenges. They offer a glimpse into practitioners' mindsets as they explore their available resources (data, tools, and time, to name a few) and weigh the pros and cons of different workflows. We think they might just inspire you to view whatever project you're working on at the moment from a new perspective. Enjoy your reading!


Computers and chemistry | UDaily – University of Delaware

Soham Jariwala has a unique perspective: He took the Hackathon course in 2022 and this spring is a project mentor alongside Vasu Venkateshwaran from W. L. Gore & Associates. Jariwala, a doctoral alumnus from the chemical and biomolecular engineering department and now a modeling and simulation scientist at Gore, was not officially part of the NRT program but took the course to gain hands-on experience with using machine learning tools for industry projects, an experience that he said was formative in helping him succeed in his current role.

"In a traditional classroom, you have a limited perspective on how projects are conducted in industry," Jariwala said. "In the Hackathon class, you have a problem that even the industry experts don't know the answer to. As a team, you bring your own expertise, brainstorm ideas, find the best approach, and learn about other areas in order to reach a decision."

The summer after completing the Hackathon course, students in the NRT-MIDAS program have the option of either completing a summer internship or a two-week teaching workshop.

Alison Shapiro, a chemical engineering doctoral candidate working in the lab of Allan & Myra Ferguson Distinguished Professor Thomas H. Epps, III, worked at Dow last summer as part of the company's cable and wire department. During her time at Dow, she looked for ways to recycle and more sustainably fabricate the insulating, protective polymers that coat electrical wires, and conducted life cycle assessments for candidate materials.

Shapiro said that understanding the vernacular used to talk about chemicals during the Hackathon course was extremely helpful in completing her internship projects.

"At Dow, they had the same way of talking about [formulations] as we did in the Hackathon class, which is different from how most academic research was done," she said. "That was one of the things that translated over the most, and I initially had no idea that it was going to be so helpful."

Sean Farrington, who is also part of the first NRT cohort, is a doctoral candidate working under Unidel Robert L. Pigford Chair Norman Wagner and Arthur B. Metzner Professor Antony Beris. Last summer he completed a teaching workshop developed by NRT core faculty member and associate professor Joshua Enszer, which involved presentations about class preparation and teaching strategies followed by each student delivering a mock lecture and receiving constructive feedback.

Farrington, who is currently a TA in the chemical engineering department, regularly uses a list of action verbs provided by Enszer when preparing to teach. He said the workshop was invaluable, not only for his current career plans of working in academia: "No matter what job you have, you are always going to have to teach people something, and to do so you need to figure out exactly what your learning outcomes are."

Outside of the program's coursework and professional development activities, NRT-MIDAS also fosters a strong sense of community that reaches across multiple departments.

This includes a biweekly NRT community hour organized by Johnston and Jayaraman, where all members of the NRT-MIDAS community get together during the lunch hour. Along with socializing over pizza, students get to hear invited speakers discuss their research in academic and national laboratories, learn about various STEM careers in industry, publishing, and teaching, and attend professional development workshops on topics such as data ethics, responsible conduct of research, and science communication.

As NRT program coordinator, Johnston plays a key role in helping foster this sense of community, from helping students become comfortable with public speaking and communication during their outreach activities to hosting monthly individual advising meetings with all of the trainees, which Johnston said is the highlight of her week.

"Not only do I enjoy getting to know our students, but they also provide valuable information on what they need from us as a program," Johnston said. "These meetings have helped influence our professional skill community hour topics and have given us the ability to really cater to the needs of the students in our program."

During the summer, students work in teams to complete an outreach activity that showcases STEM research and data science concepts for a variety of non-scientific audiences.

In the summer of 2022, the first cohort created videos to help explain their research and pique other students' interest in science.


International Team of Researchers Advance Groundwater Resilience Through AI and Data Science – EnterpriseAI


Qlik meets user needs with realistic approach to generative AI – TechTarget

While some vendors fed the hype, Qlik took a pragmatic approach to generative AI.

The business analytics vendor considered its existing capabilities when generative AI surged in popularity following OpenAI's release of ChatGPT in late 2022. It considered its customers' needs. And it considered what it needed to add to meet those customers' needs.

From there, Qlik came up with a realistic strategy, one with trustworthy data at its core, and one that its users believe in.

"I always check on what other vendors are doing," said Angel Monjars, Qlik platform manager at C40 Cities, a network of nearly 100 cities working together to combat climate change. "I have to stay in touch with everything that's out there. I'm confident that Qlik is on the right track."

At this point, generative AI is nothing new. But after the launch of ChatGPT, it suddenly embodied the technology that could finally enable true natural language processing. It was the technology that, when combined with enterprise data, could reduce or even eliminate coding and enable anyone within an organization to work with data, rather than just a small percentage of specialists.

Within months, data management and analytics vendors such as Microsoft, ThoughtSpot, Tableau, Alteryx and Informatica were among the many to unveil plans to augment their platforms with generative AI, introducing tools that would make their platforms smarter and simpler. But the tools were under development rather than nearing general availability.

Some of those tools eventually made their way through the preview process. For example, Microsoft first unveiled its Copilot for Power BI in May 2023, but didn't make it generally available until one year later. Other generative AI systems, however, after more than a year, are still not generally available.

Qlik, conversely, didn't quickly grab attention when generative AI became the rage. It didn't publicize every time it came up with an idea. It didn't introduce tools in development that promised to eliminate the difficulties that have existed for decades that make data management and analytics specialized skills.

It didn't buy into the hype surrounding generative AI.

"Over the last year, there were a whole heck of a lot of people making a lot of noise about [generative AI]. We did a bit of the opposite," said Nick Magnuson, Qlik's head of AI. "We took a step back and asked some key questions about how we wanted to plan an ecosystem."

Qlik might have lost out on some publicity in the process. But according to Susan Dean, director of business technology at heavy equipment manufacturer Takeuchi U.S., the ecosystem Qlik is developing serves customers' needs. And what it is now revealing related to generative AI is accelerating quickly.

"Definitely," she said when asked whether Qlik is providing the necessary tools for generative AI development, in an interview earlier this week during this year's Qlik user conference. "I'm very excited to see what's next. They just keep getting better. The leaps and bounds from last year's Qlik [conference] to this year's is night and day."

In a sense, Qlik has always been pragmatic. It's part of why the vendor is still relevant 31 years after it was founded, while onetime competitors such as Business Objects, Cognos and Information Builders have been swallowed up by other vendors and essentially disappeared.

Based in King of Prussia, Pa., Qlik is a longtime analytics vendor that has evolved as business intelligence has evolved.

When data was kept on premises and analytics was a specialized skill for experts only, Qlik provided a platform to meet the needs of data analysts. When Tableau rose to prominence touting self-service analytics, Qlik adapted and developed self-service tools to meet the needs of business users.

When cloud computing emerged and enterprises migrated their data operations to the various clouds, Qlik complemented its on-premises capabilities with a cloud-based version of its platform. When that was no longer enough, Qlik identified data integration as an opportunity for growth and over the past six years has methodically built up a data integration suite to complement its analytics capabilities.

Now, the vendor is taking that same strategic approach to AI as it creates an environment for customers to develop AI models and applications and apply generative AI to existing data products.

"Despite what everyone else is doing, what matters most is our customer needs," Magnuson said. "That's where we're focused."

To meet customers' need for a trusted foundation for generative AI, Qlik started by combining its existing AI and machine learning capabilities in a single environment it calls Staige.

Unveiled in September, Staige includes AutoML, which is a tool that enables users to perform predictive analysis, and Insight Advisor, a natural language interface that lets customers query and analyze structured data and provides natural language responses with accompanying visuals.

In addition, Staige provides automated code generation capabilities, integrations with generative AI capabilities from third-party providers such as OpenAI and Hugging Face, and an advisory council to provide guidance for customers getting started with AI.

While Qlik's existing capabilities combined with third-party integrations was a start, Qlik needed more capabilities to effectively provide a foundation for developing trusted AI models and applications.

One thing missing was support for unstructured data, such as text and audio files, which is estimated to now make up more than 80% of all data.

To add support for unstructured data, Qlik acquired Kyndi in January and on June 4 unveiled Qlik Answers. The tool, scheduled for general availability this summer, uses retrieval-augmented generation to enable customers to query and analyze unstructured data with natural language in the same way Insight Advisor enables natural language interactions with structured data.

Furthermore, Qlik Answers provides data lineage information so that users can trace the data used to inform the tool's responses and ensure that those responses can be trusted.

Also missing was the data management component -- the integration layer that would enable customers to build applications using quality data from the start rather than look back later to see if the data they already used could be trusted.

Therefore, to complement Answers, the vendor on June 4 unveiled Qlik Talend Cloud, which is likewise scheduled for general availability this summer. The suite, which comes a little more than a year after Qlik completed its acquisition of Talend, is a data integration environment that forms the foundation for ensuring the quality of data used to train generative AI models and applications. Included are governance capabilities and tools such as a trust score.

Combined, Qlik Answers and Qlik Talend Cloud succeed at providing quality data for AI models and applications, according to Mike Leone, an analyst at TechTarget's Enterprise Strategy Group.

"Qlik Answers and Qlik Talend Cloud can work together to deliver a trusted data foundation for AI and fuel innovation from AI," he said.

In addition, the acquisition of Kyndi was critical, Leone continued.

"Kyndi is really that enabling factor for Qlik to extend the delivery of predictive AI and generative AI more broadly and at scale," he said. "I like Qlik's focus on unstructured data as it's often overlooked and underutilized."

Given the foundation that's now been formed by addressing practical needs, customers can begin using Qlik as a foundation for developing generative AI capabilities.

"After we saw what Qlik presented, the possibilities [for using generative AI] are open now," Monjars said.

C40 has been using Insight Advisor and other AI and machine learning tools, but had previously been hesitant to add any generative AI capabilities given its strict data security and data compliance requirements, he continued.

"A very real component we saw is the ability to analyze unstructured data, and there's a lot of knowledge there," Monjars said.

By grounding its generative AI plans in reality rather than making promises it might not be able to keep, Qlik is serving the needs of its customers.

But that pragmatic approach might have come with a cost, according to Donald Farmer, founder and principal of TreeHive Strategy.

Data management rivals Databricks and Snowflake have broadcast seemingly every move while creating environments for AI development. Tech giants AWS, Google and Microsoft have similarly maintained a steady presence in the collective mindset. And many of the more specialized vendors have introduced large swaths of capabilities even when they're only just starting to build them.

Qlik's comparatively quiet approach might have resulted in slow customer growth.

Farmer spent nearly 20 years in product development, including a stint at Qlik as vice president of innovation and design. Now, he heads a consulting firm that works with companies to develop analytics and AI strategies. While the evidence is anecdotal, he noted that Qlik's resonance with potential new customers seems to be slowing.

"Qlik still remains a significant vendor, but with one caveat," Farmer said. "There is very little sign of them gaining traction with greenfield customers. The trickle of new logos is slow. Mostly, they are adding more value to existing clients. But to be fair, they are adding significant value."

Qlik Answers could be a means of adding new users, according to Magnuson.

When Qlik added automated machine learning capabilities with its 2021 acquisition of Big Squid and turned that technology into AutoML, it drew in new customers, he said. Once generally available, Qlik Answers, though tightly integrated with the rest of the Qlik ecosystem, will be available as a standalone tool and could likewise be a way to draw new customers.

"We've made a conscious decision as part of a strategy to offer these solutions to a new buying agenda," Magnuson said. "We know a lot of people are generating new budgets to acquire technology. ... Answers potentially gives us a new opportunity to have a conversation with someone where we can open up a net new opportunity."

Regardless of whether Qlik's practical approach to generative AI brings in a significant number of new customers, what Qlik is doing in terms of technological innovation and support for that technology works for the vendor's existing users, according to Dean.

When Dean joined Takeuchi U.S. in 2018, the company had one analyst keeping its data in Excel spreadsheets. Dean subsequently led the company's transition to Qlik, beginning with a single application. Now, Takeuchi U.S. uses Qlik not only in its administrative office, but also with each of its hundreds of dealers.

But while Takeuchi U.S. -- a subsidiary of Japan-based Takeuchi Manufacturing -- is a sizable organization, it does not boast a big roster of data scientists. Dean is part of a team of three BI analysts.

To do more advanced analytics than just developing dashboards and reports, Takeuchi U.S. needs assistance. One of the main reasons the company has remained with Qlik is the relationship Dean and her team have with the vendor and the support they receive.

"My partnership with Qlik is what keeps me," Dean said. "They work with us."

Takeuchi U.S. now uses AutoML. And once a major undertaking to implement an ERP system is finished next spring, the company wants to build new analytics applications to discover insights related to the performance of its excavators, wheel loaders and other products.

"I'll definitely set up demos with [Qlik] to figure out what will suit us when the time is right," Dean said.

While Qlik Answers and Qlik Talend Cloud in some ways complete the foundation for trusted data that Qlik targeted as its role in enabling generative AI development, the vendor nevertheless plans to develop additional capabilities.

Most notably, it aims to enable customers to query and analyze structured and unstructured data together, according to Magnuson. The acquisition of Kyndi led to Qlik Answers, which enables customers to operationalize unstructured data. But that's just a beginning.

"[Qlik Answers] is starting us on this bigger journey to develop strength and muscle around unstructured content that puts us in a position to provide value to customers by integrating both structured and unstructured data in a single analytics experience," Magnuson said.

Monjars likewise noted that Qlik's enablement of access to unstructured data is significant. From a technological standpoint, Qlik is meeting C40's needs. But where he said he'd like to see more investment from Qlik is in another practical area: increasing awareness.

Qlik provides its own data literacy program. But its customer base is not as big as that of some other platforms such as Power BI, so it is sometimes difficult to find new employees who don't need to be trained to use Qlik, Monjars noted.

"Qlik is doing what we need, but it's a little hard to find people who are Qlik-trained," he said. "A given professional maybe learns Power BI before they learn Qlik, so that affects the availability of people out there. It would be helpful if Qlik were more of a household name and people made it a priority to learn Qlik coming out of school."

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.


Exploring RAG Applications Across Languages: Conversing with the Mishnah – Towards Data Science


I'm excited to share my journey of building a unique Retrieval-Augmented Generation (RAG) application for interacting with rabbinic texts. MishnahBot aims to provide scholars and everyday users with an intuitive way to query and explore the Mishnah interactively. It can help solve problems such as quickly locating relevant source texts or summarizing a complex debate about religious law and extracting the bottom line.

I had the idea for such a project a few years back, but I felt the technology wasn't ripe yet. Now, with advancements in large language models and RAG capabilities, it is pretty straightforward.

This is what our final product will look like, which you could try out here:

RAG applications are gaining significant attention, for improving accuracy and harnessing the reasoning power available in large language models (LLMs). Imagine being able to chat with your library, a collection of car manuals from the same manufacturer, or your tax documents. You can ask questions, and receive answers informed by the wealth of specialized knowledge.

There are two emerging trends in improving language model interactions: Retrieval-Augmented Generation (RAG) and increasing context length, potentially by allowing very long documents as attachments.

One key advantage of RAG systems is cost-efficiency. With RAG, you can handle large contexts without drastically increasing the query cost, which can become expensive. Additionally, RAG is more modular, allowing you to plug and play with different knowledge bases and LLM providers. On the other hand, increasing the context length directly in language models is an exciting development that can enable handling much longer texts in a single interaction.

For this project, I used AWS SageMaker for my development environment, AWS Bedrock to access various LLMs, and the LangChain framework to manage the pipeline. Both AWS services are user-friendly and charge only for the resources used, so I really encourage you to try it out yourselves. For Bedrock, you'll need to request access to Llama 3 70b Instruct and Claude Sonnet.

Let's open a new Jupyter notebook and install the packages we will be using:
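The original install commands were not preserved here. Based on the tools named in the article (LangChain, ChromaDB, sentence-transformers, and boto3 for Bedrock), the dependency list was likely along these lines; the exact package names and the absence of version pins are assumptions:

```bash
pip install langchain langchain-community boto3 chromadb sentence-transformers
```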

The dataset for this project is the Mishnah, an ancient Rabbinic text central to Jewish tradition. I chose this text because it is close to my heart and also presents a challenge for language models since it is a niche topic. The dataset was obtained from the Sefaria-Export repository, a treasure trove of rabbinic texts with English translations aligned with the original Hebrew. This alignment facilitates switching between languages in different steps of our RAG application.

Note: The same process applied here can be applied to any other collection of texts of your choosing. This example also demonstrates how RAG technology can be utilized across different languages, as shown with Hebrew in this case.

First we will need to download the relevant data. We will use git sparse-checkout since the full repository is quite large. Open the terminal window and run the following.
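The exact commands were not preserved. A sparse checkout of the Sefaria-Export repository that pulls only the Mishnah JSON would look roughly like the following; the repository URL is real, but the subdirectory path is an assumption about the repo layout:

```bash
git clone --no-checkout https://github.com/Sefaria/Sefaria-Export.git
cd Sefaria-Export
git sparse-checkout init --cone
git sparse-checkout set json/Mishnah
git checkout master
```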

And voila! We now have the data files that we need:

Now let's load the documents in our Jupyter notebook environment:
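The loading code itself is missing from this copy. Sefaria's JSON exports store each tractate's text as nested lists (chapters containing individual mishnayot), so a minimal sketch of the flattening step might look like this; the record layout and field names are my own assumptions, and the inline sample stands in for a file you would read with json.load:

```python
import json

def flatten_mishnah(tractate_name, chapters):
    """Flatten nested chapter -> mishnah lists into flat records,
    each tagged with a human-readable reference."""
    records = []
    for chap_idx, chapter in enumerate(chapters, start=1):
        for mish_idx, passage in enumerate(chapter, start=1):
            records.append({
                "ref": f"{tractate_name} {chap_idx}:{mish_idx}",
                "text": passage,
            })
    return records

# Tiny inline sample standing in for one tractate's JSON export;
# a real file would be parsed with json.load(open(path)).
sample = {
    "title": "Berakhot",
    "text": [
        ["From when may one recite the Shema in the evening? ...",
         "From when may one recite the Shema in the morning? ..."],
    ],
}

docs = flatten_mishnah(sample["title"], sample["text"])
print(docs[0]["ref"])
```

Keeping a reference alongside each passage pays off later, when the RAG system needs to cite which mishnah a retrieved passage came from.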

And take a look at the data:

Looks good, we can move on to the vector database stage.

Next, we vectorize the text and store it in a local ChromaDB. In one sentence, the idea is to represent text as dense vectors (arrays of numbers) such that texts that are semantically similar will be close to each other in vector space. This is the technology that will enable us to retrieve the relevant passages given a query.

We opted for a lightweight vectorization model, all-MiniLM-L6-v2, which can run efficiently on a CPU. This model provides a good balance between performance and resource efficiency, making it suitable for our application. While state-of-the-art models like OpenAI's text-embedding-3-large may offer superior performance, they require substantial computational resources, typically running on GPUs.

For more information about embedding models and their performance, you can refer to the MTEB leaderboard which compares various text embedding models on multiple tasks.

Here's the code we will use for vectorizing (it should take only a few minutes to run on this dataset on a CPU machine):
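The vectorization code did not survive in this copy. The article uses all-MiniLM-L6-v2 with ChromaDB; as a dependency-free illustration of the underlying idea only (text becomes a dense vector, and similar texts score high under cosine similarity), here is a toy sketch with a hashed bag-of-words "embedding" in place of the real model:

```python
import math
import zlib
from collections import Counter

def embed(text, dim=256):
    """Toy embedding: a hashed bag-of-words vector, L2-normalized.
    A real system would use a trained model (the article uses
    all-MiniLM-L6-v2); this only illustrates the text -> vector step."""
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(token.encode()) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

corpus = [
    "work forbidden on the sabbath",
    "blessings recited over fruit",
    "laws of damages between neighbors",
]
index = [(doc, embed(doc)) for doc in corpus]

query = "which work is forbidden on the sabbath"
query_vec = embed(query)
best_doc, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))
print(best_doc)
```

A trained embedding model improves on this in exactly the way that matters: it scores paraphrases as close even when they share no words, which hashed token counts cannot do.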

With our dataset ready, we can now create our Retrieval-Augmented Generation (RAG) application in English. For this, we'll use LangChain, a powerful framework that provides a unified interface for various language model operations and integrations, making it easy to build sophisticated applications.

LangChain simplifies the process of integrating different components like language models (LLMs), retrievers, and vector stores. By using LangChain, we can focus on the high-level logic of our application without worrying about the underlying complexities of each component.

Here's the code to set up our RAG system:
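The setup code is missing from this copy. The article wires a vector-store retriever to a Bedrock-hosted LLM through LangChain; as a library-free sketch of the same retrieve-then-generate flow, here is a minimal pipeline with a word-overlap retriever and a stub in place of the LLM (all names and the prompt wording are my own assumptions):

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query: a crude stand-in
    for the vector-store retriever in the real pipeline."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, passages):
    context = "\n\n".join(passages)
    return ("Answer the question based only on the following context:\n"
            f"{context}\n\nQuestion: {query}\nAnswer:")

def rag_answer(query, documents, llm):
    passages = retrieve(query, documents)
    return llm(build_prompt(query, passages))

# Stub LLM: parrots the top retrieved passage. The real pipeline
# would call a model served through AWS Bedrock here.
echo_llm = lambda prompt: prompt.splitlines()[1]

docs = ["Mishnah Berakhot opens with the evening Shema.",
        "Mishnah Shabbat lists thirty-nine categories of work."]
answer = rag_answer("When is the evening Shema recited?", docs, echo_llm)
print(answer)
```

The point of the shape, retrieve first and then hand only the retrieved passages to the model, is that the model's answer is grounded in your corpus rather than its training memory.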

Alright! Let's try it out! We will use a query related to the very first paragraphs of the Mishnah.

That seems pretty accurate.

Let's try a more sophisticated question:

Very nice.

I tried that out, and here's what I got:

The response is long and not to the point, and the answer that is given is incorrect (reaping is the third type of work in the list, while selecting is the seventh). This is what we call a hallucination.

While Claude is a powerful language model, relying solely on an LLM to generate responses from memorized training data, or even from internet searches, lacks the precision and control offered by a custom database in a Retrieval-Augmented Generation (RAG) application. Here's why:

This structured retrieval process ensures users receive the most accurate and relevant answers, leveraging both the language generation capabilities of LLMs and the precision of custom data retrieval.

Finally, we will address the challenge of interacting in Hebrew with the original Hebrew text. The same approach can be applied to any other language, as long as you are able to translate the texts to English for the retrieval stage.

Supporting Hebrew interactions adds an extra layer of complexity since embedding models and large language models (LLMs) tend to be stronger in English. While some embedding models and LLMs do support Hebrew, they are often less robust than their English counterparts, especially the smaller embedding models that likely focused more on English during training.

To tackle this, we could train our own Hebrew embedding model. However, another practical approach is to leverage a one-time translation of the text to English and use English embeddings for the retrieval process. This way, we benefit from the strong performance of English models while still supporting Hebrew interactions.

In our case, we already have professional human translations of the Mishnah text into English. We will use this to ensure accurate retrievals while maintaining the integrity of the Hebrew responses. Here's how we can set up this cross-lingual RAG system:

For generation, we use Claude Sonnet since it performs significantly better on Hebrew text compared to Llama 3.

Here is the code implementation:
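The implementation itself is not preserved here. The flow described above (translate the Hebrew query to English, retrieve over the English translations, then answer in Hebrew from the aligned Hebrew passages) can be sketched with injected components; the function shape and stub components below are my own assumptions standing in for Llama 3 (translation), the vector store (retrieval), and Claude Sonnet (Hebrew generation):

```python
def cross_lingual_answer(hebrew_query, aligned_corpus,
                         translate, retrieve_en, generate_he):
    """aligned_corpus is a list of (english_text, hebrew_text) pairs.
    Retrieval happens in English; generation happens in Hebrew."""
    english_query = translate(hebrew_query)
    english_texts = [en for en, _ in aligned_corpus]
    hits = retrieve_en(english_query, english_texts)
    # Map the English hits back to their aligned Hebrew originals.
    hebrew_context = [he for en, he in aligned_corpus if en in hits]
    return generate_he(hebrew_query, hebrew_context)

corpus = [
    ("From when may one recite the evening Shema?",
     "מאימתי קורין את שמע בערבית"),
    ("The main categories of work are forty less one.",
     "אבות מלאכות ארבעים חסר אחת"),
]
# Stubs for demonstration only.
translate = lambda q: "evening Shema"
retrieve_en = lambda q, texts: [t for t in texts if "Shema" in t]
generate_he = lambda q, ctx: ctx[0]

answer = cross_lingual_answer("מאימתי קורין את שמע", corpus,
                              translate, retrieve_en, generate_he)
print(answer)
```

The design choice worth noting is that the Hebrew text never goes through an embedding model at all: the human-made alignment does the cross-lingual work, and the stronger English models handle retrieval.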

Let's try it! We will use the same question as before, but in Hebrew this time:

We got an accurate, one word answer to our question. Pretty neat, right?

The translation with Llama 3 Instruct posed several challenges. Initially, the model produced nonsensical results no matter what I tried. (Apparently, Llama 3 Instruct is very sensitive to prompts starting with a newline character!)

After resolving that issue, the model tended to output the correct response but then continue with additional irrelevant text, so stopping the output at a newline character proved effective.

Controlling the output format can be tricky. Some strategies include requesting a JSON format or providing examples with few-shot prompts.

In this project, we also remove vowels from the Hebrew texts, since most Hebrew text online does not include vowels and we want the context for our LLM to resemble the text it saw during pretraining.

Building this RAG application has been a fascinating journey, blending the nuances of ancient texts with modern AI technologies. My passion for making the library of ancient rabbinic texts more accessible to everyone (myself included) has driven this project. This technology enables chatting with your library, searching for sources based on ideas, and much more. The approach used here can be applied to other treasured collections of texts, opening up new possibilities for accessing and exploring historical and cultural knowledge.

It's amazing to see how all this can be accomplished in just a few hours, thanks to the powerful tools and frameworks available today. Feel free to check out the full code on GitHub, and play with the MishnahBot website.

Please share your comments and questions, especially if you're trying out something similar. If you want to see more content like this in the future, do let me know!


Principal Foundation and EVERFI from Blackbaud Reach 26,000 U.S. Students with Growing National Data Science … – PR Newswire

DataSetGo, a first-of-its-kind digital curriculum, is opening doors to data science careers at more than 400 schools, with $50,000 in recent awards to those who show promise in the field

DES MOINES, Iowa, June 4, 2024 /PRNewswire/ -- Principal Foundation, a global nonprofit organization committed to helping people and communities build financially secure futures, and EVERFI from Blackbaud, the leader in powering social impact through education, announce the second and biggest year of DataSetGo, a first-of-its-kind interactive digital curriculum that teaches high school students the fundamentals of data science and its value in daily life, the workforce, and the world.

Since its inception in 2022, DataSetGo has reached over 26,000 high school students in over 400 schools throughout the U.S. In the 2023-2024 academic year, the program reached over 17,000 new students and 200 additional schools across ten states, including New York, Texas, and California.

Last fall, DataSetGo expanded to include DataSetGo Distinguished Scholars, a new national program that equips students to explore postsecondary education and workforce opportunities, including those in the rapidly growing field of data science.

Data science roles can be found in nearly every industry, and according to the World Economic Forum, up to 1.4 million new jobs could be created in data science and data analytics between 2023 and 2027.

"Having learned more about the possible job opportunities has opened several possibilities I had no idea existed. It would be a dream to learn more about data analysis and science to eventually make a profession out of it one day," wrote Jarod Story, a Distinguished Scholar who attends Irving High School near Dallas, Texas.

All the schools that use the DataSetGo curriculum are in low- to moderate-income communities, where educators have given the program high marks. The research-backed curriculum was designed to align with national educational standards and is provided at no cost to educators through a strategic partnership between Principal Foundation and EVERFI.

"This program [DataSetGo] is totally awesome. I'm so overwhelmingly proud of my students," said LaTara Meyers, a teacher at H.D. Woodson High School in Washington, D.C., whose student Amaya Bostic is among the Distinguished Scholars.

The full list of ten Distinguished Scholars was announced in May. Each student received a $5,000 award, for a total of $50,000.

"These impressive students seized the opportunity to learn about data science and the doors it could open for them," said Jo Christine Miles, Director, Principal Foundation and Community Relations, Principal. "We're thrilled to provide awards that will help them continue to pursue their dreams."

"In nearly every industry, data science skills are in high demand. DataSetGo ensures that students are aware of the opportunities and equipped to pursue them because then, their career options are endless," said Ray Martinez, co-founder and President of EVERFI from Blackbaud.

Six of the ten Distinguished Scholars ("national award winners") were selected from a national pool of essay submissions that detailed how students plan to apply what they learned through DataSetGo in their careers and lives.

The other four Scholars ("local award winners") were selected from schools in Brooklyn, N.Y.; Minneapolis, Minn.; Washington, D.C.; and Dallas, Texas, that participated in DataSetGo virtual or in-person learning sessions. Three of these local award winners were selected throughout the school year.

The final local 2023-2024 Distinguished Scholar was announced at an in-person event in Brooklyn, New York on Tuesday, May 13. Hosted by EVERFI and Principal Foundation, the event celebrated the DataSetGo program and featured sessions with guest speakers who rely on data science in their careers in professional sports, artificial intelligence, and entertainment.

Below are the 2023-2024 DataSetGo Distinguished Scholars. The entry window for the 2024-2025 competition will open September 15.

National award winners:

Local award winners:

For more information about DataSetGo or the DataSetGo Distinguished Scholars award program, visit https://principal.everfi.com.

About Principal Foundation

Principal Financial Group Foundation, Inc. ("Principal Foundation") is a duly recognized 501(c)(3) entity focused on providing philanthropic support to programs that build financial security in the communities where Principal Financial Group, Inc. ("Principal") operates. While Principal Foundation receives funding from Principal, Principal Foundation is a distinct, independent, charitable entity. Principal Foundation does not practice any form of investment advisory services and is not authorized to do so. Established in 1987, Principal Foundation works with organizations that are helping to shape and support the journey to financial security by ensuring access to essential needs, fostering social and cultural connections, and promoting financial inclusion. 3609043-052024

About EVERFI from Blackbaud

EVERFI from Blackbaud (NASDAQ: BLKB) is an international technology company driving social impact through education to address the most challenging issues affecting society ranging from financial wellness to mental health to workplace conduct and other critical topics. Founded in 2008, EVERFI's Impact-as-a-Service solution and digital educational content have reached more than 45 million learners globally. In 2020, the company was recognized as one of the World's Most Innovative Companies by Fast Company and was featured on Fortune Magazine's Impact 20 List. The company was also named to the 2021 GSV EdTech 150, a list of the most transformative growth companies in digital learning. Blackbaud acquired EVERFI in December 2021. To learn more about EVERFI, please visit everfi.com or follow us on Facebook, Instagram, LinkedIn, or Twitter @EVERFI.

Blackbaud Forward-looking Statements

Except for historical information, all the statements, expectations, and assumptions contained in this news release are forward-looking statements that involve a number of risks and uncertainties, including statements regarding expected benefits of products and product features. Although Blackbaud attempts to be accurate in making these forward-looking statements, it is possible that future circumstances might differ from the assumptions on which such statements are based. In addition, other important factors that could cause results to differ materially include the following: general economic risks; uncertainty regarding increased business and renewals from existing customers; continued success in sales growth; management of integration of acquired companies and other risks associated with acquisitions; risks associated with successful implementation of multiple integrated software products; the ability to attract and retain key personnel; risks associated with management of growth; lengthy sales and implementation cycles, particularly in larger organizations; technological changes that make our products and services less competitive; and the other risk factors set forth from time to time in the SEC filings for Blackbaud, copies of which are available free of charge at the SEC's website at http://www.sec.gov or upon request from Blackbaud's investor relations department. All Blackbaud product names appearing herein are trademarks or registered trademarks of Blackbaud, Inc.

Media Contact: Zevenia Dennis, [emailprotected]

SOURCE Principal Foundation

Read this article:

Principal Foundation and EVERFI from Blackbaud Reach 26,000 U.S. Students with Growing National Data Science ... - PR Newswire

Jet Sweep: Route Optimization to Visit Every NFL Team at Home – Towards Data Science

10 min read

Most people in the sports industry or avid fans have entertained the thought, "Wouldn't it be cool to visit every NFL stadium, NBA arena, or MLB ballpark in my life?" While this feels incredibly out of reach from where I'm sitting, I follow basketball

Read more from the original source:

Jet Sweep: Route Optimization to Visit Every NFL Team at Home - Towards Data Science

Thinking, Fast and Slow, with LLMs and PDDL | by Nikolaus Correll | Jun, 2024 – Towards Data Science

"ChatGPT can make mistakes. Check important info." is now written right underneath the prompt, and we have all gotten used to the fact that ChatGPT stoically makes up anything from dates to entire references. But what about basic reasoning? Looking at a simple tower-rearranging task from the early days of Artificial Intelligence (AI) research, we

Go here to read the rest:

Thinking, Fast and Slow, with LLMs and PDDL | by Nikolaus Correll | Jun, 2024 - Towards Data Science

STEM job market trends: High-demand skills and top-paying roles – Research & Development World


More here:

STEM job market trends: High-demand skills and top-paying roles - Research & Development World

The One Billion Row Challenge in Julia | by Vikas Negi | Jun, 2024 – Towards Data Science

A recent release of Julia such as 1.10 is recommended. For those wanting to use a notebook, the repository shared above also contains a Pluto file, for which Pluto.jl needs to be installed. The input data file for the challenge is unique for everyone and needs to be generated using this Python script. Keep in mind that the file is about 15 GB in size.

Additionally, we will be running benchmarks using the BenchmarkTools.jl package. Note that this does not impact the challenge; it's only meant to collect proper statistics to measure and quantify the performance of the Julia code.

The structure of the input data file measurements.txt is as follows (only the first five lines are shown):
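The sample rows did not survive extraction; lines in measurements.txt look roughly like the following (the station names here are only illustrative, since each reader generates their own file):

```
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
```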

The file contains a billion lines (also known as rows or records). Each line has a station name followed by the ; separator and then the recorded temperature. The number of unique stations can be up to 10,000. This implies that the same station appears on multiple lines. We therefore need to collect all the temperatures for all distinct stations in the file, and then calculate the required statistics. Easy, right?

My first attempt was to simply parse the file one line at a time, and then collect the results in a dictionary where every station name is a key and the temperatures are added to a vector of Float64 to be used as the value mapped to the key. I expected this to be slow, but our aim here is to get a number for the baseline performance.
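The code for this baseline was lost in extraction; a minimal sketch of such an approach (the function name is my own) might look like:

```julia
# Baseline: parse the file line by line, collecting every temperature per station
function parse_naive(fname)
    stations = Dict{String, Vector{Float64}}()
    for line in eachline(fname)
        name, temp = split(line, ';')
        # get! inserts an empty vector the first time a station is seen
        push!(get!(stations, name, Float64[]), parse(Float64, temp))
    end
    return stations
end
```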

Once the dictionary is ready, we can calculate the necessary statistics:
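The statistics step was also stripped from the excerpt; assuming the dictionary layout above, it could be as simple as:

```julia
# Min, mean, and max per station from the collected temperature vectors
function compute_stats(stations)
    Dict(name => (minimum(t), sum(t) / length(t), maximum(t))
         for (name, t) in stations)
end
```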

The output of all the data processing needs to be displayed in a certain format. This is achieved by the following function:
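The original formatting function is missing from the excerpt; the challenge's expected output is an alphabetically sorted "{name=min/mean/max, ...}" line with one decimal place, which a sketch (names are mine) could produce like this:

```julia
using Printf

# Emit "{name=min/mean/max, ...}" sorted by station name, one decimal place
function print_output(stats)
    entries = String[]
    for (name, (mn, avg, mx)) in sort!(collect(stats), by = first)
        push!(entries, @sprintf("%s=%.1f/%.1f/%.1f", name, mn, avg, mx))
    end
    println("{", join(entries, ", "), "}")
end
```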

Since this implementation is expected to take long, we can run a simple test by timing the following with @time, only once:
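The timed call itself was stripped from the excerpt; with baseline helpers along the lines of the hypothetical parse_naive and compute_stats named earlier, it would be something like:

```julia
# Single end-to-end run, timed once (helper names are illustrative)
@time stats = compute_stats(parse_naive("measurements.txt"))
```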

Our poor man's implementation takes about 526 seconds, so ~ 9 minutes. It's definitely slow, but not that bad at all!

Instead of reading the input file one line at a time, we can try to split it into chunks, and then process all the chunks in parallel. Julia makes it quite easy to implement a parallel for loop. However, we need to take some precautions while doing so.

Before we get to the loop, we first need to figure out how to split the file into chunks. This can be achieved using memory mapping to read the file. Then we need to determine the start and end positions of each chunk. It's important to note that each line in the input data file ends with a newline character, which has 0x0a as the byte representation. So each chunk should end at that character to ensure that we don't make any errors while parsing the file.

The following function takes the number of chunks num_chunks as an input argument, then returns an array with each element as the memory-mapped chunk.
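The chunking function did not survive extraction; a sketch using Mmap (function and variable names are mine) could be:

```julia
using Mmap

# Memory-map the file and split it into chunks that each end on a newline (0x0a)
function get_chunks(fname, num_chunks)
    data = mmap(open(fname, "r"))   # Vector{UInt8} backed by the file
    sz = length(data)
    approx = cld(sz, num_chunks)    # approximate chunk size in bytes
    chunks = []
    start = 1
    while start <= sz
        stop = min(start + approx - 1, sz)
        while stop < sz && data[stop] != 0x0a  # extend to the next newline
            stop += 1
        end
        push!(chunks, view(data, start:stop))
        start = stop + 1
    end
    return chunks
end
```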

Since we are parsing station and temperature data from different chunks, we also need to combine them in the end. Each chunk will first be processed into a dictionary as shown before. Then, we combine all chunks as follows:
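The merging step was also lost in extraction; with the dictionary-of-vectors layout, combining per-chunk results amounts to concatenating each station's temperatures (names are mine):

```julia
# Merge per-chunk dictionaries by concatenating each station's temperature vector
function combine_chunks(dicts)
    merged = Dict{String, Vector{Float64}}()
    for d in dicts, (name, temps) in d
        append!(get!(merged, name, Float64[]), temps)
    end
    return merged
end
```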

Now we know how to split the file into chunks, and how we can combine the parsed dictionaries from the chunks at the end. However, the desired speedup can only be obtained if we are also able to process the chunks in parallel. This can be done in a for loop. Note that Julia should be started with multiple threads julia -t 12 for this solution to have any impact.

Additionally, we now want to run a proper statistical benchmark. This means that the challenge should be executed a certain number of times, and we should then be able to visualize the distribution of the results. Thankfully, all of this can be easily done with BenchmarkTools.jl. We cap the maximum number of samples to 10, maximum time for the total run to be 20 minutes and enable garbage collection (will free up memory) to execute between samples. All of this can be brought together in a single script. Note that the input arguments are now the name of the file fname and the number of chunks num_chunks.
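The parallel driver and benchmark invocation were stripped from the excerpt; a sketch is below, where get_chunks, parse_chunk, and combine_chunks are hypothetical stand-ins for the chunking, per-chunk parsing, and merging routines described above:

```julia
using BenchmarkTools

# Each task fills its own dictionary, so no locking is needed
function process_parallel(fname, num_chunks)
    chunks = get_chunks(fname, num_chunks)
    dicts = Vector{Any}(undef, length(chunks))
    Threads.@threads for i in eachindex(chunks)
        dicts[i] = parse_chunk(chunks[i])
    end
    return combine_chunks(dicts)
end

# At most 10 samples, a 20-minute budget, and GC between samples
# (start Julia with `julia -t 12` so the threads are available)
# @benchmark process_parallel($fname, $num_chunks) samples=10 seconds=1200 gcsample=true
```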

Benchmark results along with the inputs used are shown below. Note that we have used 12 threads here.

Multi-threading provides a big performance boost; we are now down to just over 2 minutes. Let's see what else we can improve.

Until now, our approach has been to store all the temperatures and then determine the required statistics (min, mean, and max) at the very end. However, the same can already be achieved while we parse every line from the input file. We replace the existing value each time a new value is found that is either larger (for the maximum) or smaller (for the minimum). For the mean, we sum all the values and keep a separate counter of how many times a temperature for a given station has been seen.

Overall, our new logic looks like the following:
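The code for the running-statistics version is missing from the excerpt; one way to sketch it (type and function names are mine) is a tuple of (min, max, sum, count) per station:

```julia
const Stats = Tuple{Float64, Float64, Float64, Int}  # (min, max, sum, count)

# Update running statistics for one parsed (station, temperature) pair;
# no temperature vectors are stored anymore
function update!(d::Dict{String, Stats}, name, temp)
    mn, mx, s, c = get!(d, name, (Inf, -Inf, 0.0, 0))
    d[name] = (min(mn, temp), max(mx, temp), s + temp, c + 1)
    return d
end

# The mean is recovered at the end as sum / count
mean_of(s::Stats) = s[3] / s[4]
```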

The function to combine all the results (from different chunks) also needs to be updated accordingly.
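With running statistics, the updated merge (a sketch, assuming the (min, max, sum, count) tuples above) is elementwise:

```julia
# Combining two chunks' stats: min of mins, max of maxes, summed sums and counts
merge_stats(a, b) = (min(a[1], b[1]), max(a[2], b[2]), a[3] + b[3], a[4] + b[4])
```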

Let's run a new benchmark and see if this change improves the timing.

The median time seems to have improved, but only slightly. It's a win, nonetheless!

Our previous logic to calculate and save the min and max temperatures can be further simplified. Moreover, following the suggestion from this Julia Discourse post, we can make use of views (using @view) when parsing the station names and temperature data. This has also been discussed in the Julia performance manual. Since we are using a slice expression for parsing every line, @view helps us avoid the cost of allocation and copying.
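To illustrate the idea (this is my own minimal sketch, not the article's exact code): slicing a string allocates a copy, while @view produces a SubString into the original data.

```julia
# Split one "name;temp" line without copying: @view yields SubStrings
function parse_line(line::AbstractString)
    i = findfirst(';', line)
    name = @view line[1:prevind(line, i)]   # SubString, no allocation
    temp = parse(Float64, @view line[i+1:end])
    return name, temp
end
```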

The rest of the logic remains the same. Running the benchmark now gives the following:

Whoa! We managed to get down to almost a minute. It seems switching to a view does make a big difference. Perhaps there are further tweaks that could be made to improve performance even further. In case you have any suggestions, do let me know in the comments.

Restricting ourselves only to base Julia was fun. However, in the real world, we will almost always be using packages and thus making use of existing efficient implementations for performing the relevant tasks. In our case, CSV.jl (parsing the file in parallel) and DataFrames.jl (performing groupby and combine) will come in handy.

The function below parses the file in parallel using CSV.jl and then computes the per-station statistics with a DataFrames.jl groupby and combine:
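The package-based version was stripped from the excerpt; a sketch of such a function (names and keyword choices are mine) could be:

```julia
using CSV, DataFrames, Statistics

# Parse in parallel with CSV.jl, then aggregate with groupby/combine
function process_csv(fname; ntasks = 12)
    df = CSV.read(fname, DataFrame;
                  header = ["station", "temp"], delim = ';',
                  types = [String, Float64], ntasks = ntasks)
    combine(groupby(df, :station),
            :temp => minimum => :min,
            :temp => mean => :mean,
            :temp => maximum => :max)
end
```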

We can now run the benchmark in the same manner as before.

The performance using CSV.jl and DataFrames.jl is quite good, albeit slower than our base Julia implementation. When working on real-world projects, these packages are an essential part of a data scientist's toolkit. It would thus be interesting to explore if further optimizations are possible using this approach.

See more here:

The One Billion Row Challenge in Julia | by Vikas Negi | Jun, 2024 - Towards Data Science