
Jet Sweep: Route Optimization to Visit Every NFL Team at Home – Towards Data Science

10 min read

Most people in the sports industry or avid fans have entertained the thought, "Wouldn't it be cool to visit every NFL stadium, NBA arena, or MLB ballpark in my life?" While this feels incredibly out of reach from where I'm sitting, I follow basketball

Read more from the original source:

Jet Sweep: Route Optimization to Visit Every NFL Team at Home - Towards Data Science

Read More..

Thinking, Fast and Slow, with LLMs and PDDL | by Nikolaus Correll | Jun, 2024 – Towards Data Science

"ChatGPT can make mistakes. Check important info." is now written right underneath the prompt, and we all got used to the fact that ChatGPT stoically makes up anything from dates to entire references. But what about basic reasoning? Looking at a simple tower rearranging task from the early days of Artificial Intelligence (AI) research, we

Go here to read the rest:

Thinking, Fast and Slow, with LLMs and PDDL | by Nikolaus Correll | Jun, 2024 - Towards Data Science

Read More..

STEM job market trends: High-demand skills and top-paying roles – Research & Development World


More here:

STEM job market trends: High-demand skills and top-paying roles - Research & Development World

Read More..

The One Billion Row Challenge in Julia | by Vikas Negi | Jun, 2024 – Towards Data Science

A recent release of Julia such as 1.10 is recommended. For those wanting to use a notebook, the repository shared above also contains a Pluto file, for which Pluto.jl needs to be installed. The input data file for the challenge is unique for everyone and needs to be generated using this Python script. Keep in mind that the file is about 15 GB in size.

Additionally, we will be running benchmarks using the BenchmarkTools.jl package. Note that this does not impact the challenge itself; it's only meant to collect proper statistics to measure and quantify the performance of the Julia code.

The structure of the input data file measurements.txt is as follows (only the first five lines are shown):
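The five sample lines did not survive extraction; the format is one station;temperature pair per line, for example (station names illustrative, taken from the challenge's public description):

```
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
```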

The file contains a billion lines (also known as rows or records). Each line has a station name followed by the ; separator and then the recorded temperature. The number of unique stations can be up to 10,000. This implies that the same station appears on multiple lines. We therefore need to collect all the temperatures for all distinct stations in the file, and then calculate the required statistics. Easy, right?

My first attempt was to simply parse the file one line at a time, and then collect the results in a dictionary where every station name is a key and the temperatures are added to a vector of Float64 to be used as the value mapped to the key. I expected this to be slow, but our aim here is to get a number for the baseline performance.
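The original listing was stripped in extraction; a minimal sketch of this baseline approach (the function name parse_baseline is my own, not the article's):

```julia
# Baseline: read the file line by line and collect every temperature
# into a Dict mapping station name => Vector{Float64}.
function parse_baseline(fname::AbstractString)
    stations = Dict{String, Vector{Float64}}()
    for line in eachline(fname)
        name, temp = split(line, ';')
        push!(get!(stations, String(name), Float64[]), parse(Float64, temp))
    end
    return stations
end
```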

Once the dictionary is ready, we can calculate the necessary statistics:

The output of all the data processing needs to be displayed in a certain format. This is achieved by the following function:
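The original function was stripped in extraction; a hedged sketch of what the statistics and formatting step could look like (format_stats is an assumed name; the challenge's output format is station=min/mean/max, sorted alphabetically, rounded to one decimal):

```julia
using Statistics  # stdlib, for mean

# Compute min/mean/max per station and render them in the
# "{name=min/mean/max, ...}" format required by the challenge.
function format_stats(stations::Dict{String, Vector{Float64}})
    entries = map(sort!(collect(keys(stations)))) do name
        t = stations[name]
        string(name, '=', round(minimum(t), digits=1), '/',
               round(mean(t), digits=1), '/', round(maximum(t), digits=1))
    end
    return string("{", join(entries, ", "), "}")
end
```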

Since this implementation is expected to take a long time, we can run a simple test by timing it with @time, executing the following only once:

Our poor man's implementation takes about 526 seconds, so roughly 9 minutes. It's definitely slow, but not that bad at all!

Instead of reading the input file one line at a time, we can try to split it into chunks, and then process all the chunks in parallel. Julia makes it quite easy to implement a parallel for loop. However, we need to take some precautions while doing so.

Before we get to the loop, we first need to figure out how to split the file into chunks. This can be achieved using memory mapping to read the file. Then we need to determine the start and end positions of each chunk. It's important to note that each line in the input data file ends with a newline character, which has 0x0a as its byte representation. So each chunk should end at that character to ensure that we don't make any errors while parsing the file.

The following function takes the number of chunks num_chunks as an input argument, then returns an array with each element as a memory-mapped chunk.
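The listing itself was stripped in extraction; a sketch of what such a function could look like (get_chunks is an assumed name; the idea is to extend each chunk to the next 0x0a byte so no line is cut in half):

```julia
using Mmap

# Memory-map the file and split it into up to num_chunks views,
# each ending on a newline (0x0a) byte.
function get_chunks(fname::AbstractString, num_chunks::Int)
    data = open(io -> Mmap.mmap(io), fname)   # Vector{UInt8} over the whole file
    n = length(data)
    chunks = Vector{AbstractVector{UInt8}}()
    start = 1
    for i in 1:num_chunks
        stop = i == num_chunks ? n : max(min(start + n ÷ num_chunks - 1, n), start)
        while stop < n && data[stop] != 0x0a   # extend to the end of the line
            stop += 1
        end
        push!(chunks, @view data[start:stop])
        start = stop + 1
        start > n && break
    end
    return chunks
end
```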

Since we are parsing station and temperature data from different chunks, we also need to combine them in the end. Each chunk will first be processed into a dictionary as shown before. Then, we combine all chunks as follows:
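With the Dict{String,Vector{Float64}} layout from before, this merge is a one-liner using Base's mergewith, which concatenates the temperature vectors of stations appearing in more than one chunk (a sketch; combine_results is an assumed name):

```julia
# Merge per-chunk results; vcat concatenates vectors for duplicate stations.
combine_results(dicts) = reduce((a, b) -> mergewith(vcat, a, b), dicts)
```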

Now we know how to split the file into chunks, and how we can combine the parsed dictionaries from the chunks at the end. However, the desired speedup can only be obtained if we are also able to process the chunks in parallel. This can be done in a parallel for loop. Note that Julia should be started with multiple threads (julia -t 12) for this solution to have any impact.
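A sketch of the threaded loop (parse_chunk and process_chunks are assumed names, not the article's exact code; each task fills its own slot of a results array, so no locking is needed):

```julia
# Parse one chunk of raw bytes into a station => temperatures dictionary.
function parse_chunk(chunk::AbstractVector{UInt8})
    d = Dict{String, Vector{Float64}}()
    for line in split(String(copy(chunk)), '\n'; keepempty=false)
        name, temp = split(line, ';')
        push!(get!(d, String(name), Float64[]), parse(Float64, temp))
    end
    return d
end

# Process all chunks in parallel, one iteration per chunk, then merge.
function process_chunks(chunks)
    results = Vector{Dict{String, Vector{Float64}}}(undef, length(chunks))
    Threads.@threads for i in eachindex(chunks)
        results[i] = parse_chunk(chunks[i])
    end
    return reduce((a, b) -> mergewith(vcat, a, b), results)
end
```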

Additionally, we now want to run a proper statistical benchmark. This means that the challenge should be executed a certain number of times, and we should then be able to visualize the distribution of the results. Thankfully, all of this can be easily done with BenchmarkTools.jl. We cap the maximum number of samples at 10 and the maximum time for the total run at 20 minutes, and enable garbage collection (to free up memory) between samples. All of this can be brought together in a single script. Note that the input arguments are now the name of the file fname and the number of chunks num_chunks.
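Such a harness could look roughly like this (a hedged sketch: process_file is a placeholder name for the full parsing pipeline, not the article's actual function):

```julia
using BenchmarkTools

# Assumed placeholder for the full pipeline (chunk, parse in parallel,
# combine, format); not the article's actual function name.
# process_file(fname, num_chunks) = ...

fname = "measurements.txt"
num_chunks = 48
# at most 10 samples, at most 20 minutes in total, GC between samples
b = @benchmark process_file($fname, $num_chunks) samples=10 seconds=1200 gcsample=true
display(b)
```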

Benchmark results along with the inputs used are shown below. Note that we have used 12 threads here.

Multi-threading provides a big performance boost; we are now down to just over 2 minutes. Let's see what else we can improve.

Until now, our approach has been to store all the temperatures, and then determine the required statistics (min, mean, and max) at the very end. However, the same can already be achieved while we parse every line from the input file. We replace the existing value each time a new value that is either larger (for the maximum) or smaller (for the minimum) is found. For the mean, we sum all the values and keep a separate counter of how many times a temperature for a given station has been seen.

Overall, our new logic looks like the following:
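The listing was stripped in extraction; the running-statistics idea can be sketched with a small mutable struct (StationStats and update! are assumed names):

```julia
# Running statistics per station: no vectors of raw temperatures needed.
mutable struct StationStats
    min::Float64
    max::Float64
    sum::Float64
    count::Int
end

# Fold one observation into the per-station accumulator.
function update!(d::Dict{String, StationStats}, name::AbstractString, t::Float64)
    s = get!(() -> StationStats(Inf, -Inf, 0.0, 0), d, String(name))
    s.min = min(s.min, t)
    s.max = max(s.max, t)
    s.sum += t
    s.count += 1
    return d
end

# The mean is recovered at the end as s.sum / s.count.
```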

The function to combine all the results (from different chunks) also needs to be updated accordingly.

Let's run a new benchmark and see if this change improves the timing.

The median time seems to have improved, but only slightly. It's a win, nonetheless!

Our previous logic to calculate and save the min and max temperatures can be further simplified. Moreover, following the suggestion from this Julia Discourse post, we can make use of views (using @view) when parsing the station names and temperature data. This has also been discussed in the Julia performance manual. Since we are using a slice expression for parsing every line, @view helps us avoid the cost of allocation and copying.
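The effect can be seen on a single line: slicing a String normally allocates a new String, while @view yields a SubString backed by the original bytes (a small sketch of the idea, not the article's parsing code):

```julia
line = "Hamburg;12.0"
sep = findfirst(';', line)
name = @view line[1:sep-1]      # SubString: no copy, no allocation
temp = @view line[sep+1:end]
value = parse(Float64, temp)    # parse accepts the SubString directly
```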

The rest of the logic remains the same. Running the benchmark now gives the following:

Whoa! We managed to get down to almost a minute. It seems switching to a view does make a big difference. Perhaps there are further tweaks that could be made to improve performance even further. In case you have any suggestions, do let me know in the comments.

Restricting ourselves only to base Julia was fun. However, in the real world, we will almost always be using packages and thus making use of existing efficient implementations for performing the relevant tasks. In our case, CSV.jl (parsing the file in parallel) and DataFrames.jl (performing groupby and combine) will come in handy.

The function below performs the following tasks:
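The task list and listing under this sentence did not survive extraction. Broadly, the idea is to read the file with CSV.jl (which parses in parallel) and then aggregate with DataFrames.jl. A hedged sketch (function and column names are my own assumptions):

```julia
using CSV, DataFrames, Statistics

function process_csv(fname::AbstractString)
    # CSV.jl parses the file using multiple threads by default
    df = CSV.read(fname, DataFrame; delim=';', header=["station", "temp"],
                  types=[String, Float64])
    # group by station, then reduce each group to min/mean/max
    gdf = groupby(df, :station)
    return combine(gdf, :temp => minimum => :min,
                        :temp => mean => :mean,
                        :temp => maximum => :max)
end
```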

We can now run the benchmark in the same manner as before.

The performance using CSV.jl and DataFrames.jl is quite good, albeit slower than our base Julia implementation. When working on real-world projects, these packages are an essential part of a data scientist's toolkit. It would thus be interesting to explore whether further optimizations are possible using this approach.

See more here:

The One Billion Row Challenge in Julia | by Vikas Negi | Jun, 2024 - Towards Data Science

Read More..

Optimizing Memory Consumption for Data Analytics Using Python From 400 to 0.1 – Towards Data Science

Christopher Tao, Towards Data Science

There are many articles that tell us how to improve the performance of our code. Of course, performance is critical, especially when we use Python for data analytics activities.

View original post here:

Optimizing Memory Consumption for Data Analytics Using Python From 400 to 0.1 - Towards Data Science

Read More..

Hridesh Rajan named new dean of Tulane University School of Science and Engineering – Tulane University

June 03, 2024 12:00 PM


Hridesh Rajan, Kingland professor and chair of the Department of Computer Science at Iowa State University, has been named the new dean of Tulane University's School of Science and Engineering.

"A distinguished scholar and innovative leader, Hridesh brings an impressive breadth of knowledge and experience to this vital role. Bringing Hridesh to Tulane will elevate the School of Science and Engineering to new levels of excellence," President Michael A. Fitts and Provost Robin Forman wrote in a message to the Tulane community.

The message also noted that Rajan's selection followed an extensive national search that attracted an exceptionally strong pool of candidates.

"Joining Tulane SSE represents a unique opportunity for me to contribute to an institution that aligns with my values and to lead a school poised to make significant contributions to solving the pressing challenges of our time."

Hridesh Rajan, Dean of the School of Science and Engineering

At Iowa State University, Rajan led the development of cutting-edge new degree programs in artificial intelligence and computer science while implementing a cross-campus transdisciplinary research initiative of faculty and students interested in the foundations and applications of data science. He launched numerous other efforts that facilitated interdisciplinary faculty collaboration, guided the successful reaccreditation of ISU's computer science bachelor's program and greatly increased seed grants for graduate research.

Rajan developed new instructional methods that boosted the success rates of students and helped usher in a period of remarkable growth in enrollment, including a 45 percent increase in female students, as well as increases in faculty, staff and research funding.

Rajan, who will join Tulane July 1, cited the School of Science and Engineering's interdisciplinary strengths in areas vital to the future of humanity (health, energy, climate science and AI) as major draws to the new position.

"Joining Tulane SSE represents a unique opportunity for me to contribute to an institution that aligns with my values and to lead a school poised to make significant contributions to solving the pressing challenges of our time through transdisciplinary research, education and community outreach," he said.

Rajan earned both a PhD and an MS in computer science from the University of Virginia, and a Bachelor of Technology degree in computer science and engineering from the Indian Institute of Technology. He arrived at ISU in 2005 and served three years as the founding professor-in-charge of data science programs.

A Fulbright scholar, ACM Distinguished Scientist and fellow of the American Association for the Advancement of Science, Rajan said he recognizes Tulane's unique positioning at the center of health, energy, climate research, data science, artificial intelligence and other fields.

"Working closely with Tulane administration, the SSE Board of Advisors, the SSE executive committee, and our dedicated faculty, staff and students, our collective efforts will focus on enhancing interdisciplinary research, fostering innovation, and growing a strong, inclusive community that supports academic excellence and groundbreaking discoveries," he said.

Throughout his career Rajan has displayed a deep commitment to increased access for students from all backgrounds. At ISU he helped increase annual philanthropic commitments by an astounding 643 percent and worked continually to promote more inclusivity, greater representation and higher success rates for all students. His strategic vision led to the creation of an inclusive departmental plan extending through 2032.

An accomplished and award-winning researcher with more than 125 publications, Rajan focuses his research on data science, software engineering and programming languages, where he is best known for his design of the Boa programming language and infrastructure, which democratizes access to large-scale data-driven science and engineering.

Rajan will join Tulane as Kimberly Foster, who led the School of Science and Engineering through six successful and transformative years, steps down.

Read more:

Hridesh Rajan named new dean of Tulane University School of Science and Engineering - Tulane University

Read More..

Understanding You Only Cache Once | by Matthew Gunton | Jun, 2024 – Towards Data Science

To understand the changes made here, we first need to discuss the Key-Value Cache. Inside of the transformer we have three vectors that are critical for attention to work: key, value, and query. From a high level, attention is how we pass along critical information about the previous tokens to the current token so that it can predict the next token. In the example of self-attention with one head, we multiply the query vector on the current token with the key vectors from the previous tokens and then normalize the resulting matrix (we call the resulting matrix the attention pattern). We then multiply the value vectors with the attention pattern to get the updates to each token. This data is then added to the current token's embedding so that it now has the context to determine what comes next.

We create the attention pattern for every single new token we create, so while the queries tend to change, the keys and the values are constant. Consequently, the current architectures try to reduce compute time by caching the key and value vectors as they are generated by each successive round of attention. This cache is called the Key-Value Cache.

While architectures like encoder-only and encoder-decoder transformer models have had success, the authors posit that the autoregression shown above, and the speed it allows its models, are the reason why decoder-only models are the most commonly used today.

To understand the YOCO architecture, we have to start out by understanding how it sets out its layers.

For one half of the model, we use one type of attention to generate the vectors needed to fill the KV Cache. Once it crosses into the second half, it will use the KV Cache exclusively for the key and value vectors respectively, now generating the output token embeddings.

This new architecture requires two types of attention: efficient self-attention and cross-attention. We'll go into each below.

Efficient Self-Attention (ESA) is designed to achieve a constant inference memory. Put differently, we want the cache complexity to depend not on the input length but on the number of layers in our block. In the equation below, the authors abstracted away ESA, but the remainder of the self-decoder is consistent, as shown below.

Let's go through the equation step by step. X^l is our token embedding and Y^l is an intermediary variable used to generate the next token embedding X^l+1. In the equation, ESA is Efficient Self-Attention, LN is the layer normalization function, which here was always Root Mean Square Norm (RMSNorm), and finally SwiGLU. SwiGLU is defined as below:

Here swish(x) = x * sigmoid(x), applied to Wg * x, where Wg is a trainable parameter. We then find the element-wise product (Hadamard product) between that result and X*W1, before multiplying that whole product by W2. The goal with SwiGLU is to get an activation function that will conditionally pass different amounts of information through the layer to the next token.
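Numerically, SwiGLU can be sketched in a few lines (matrix shapes are chosen purely for illustration):

```julia
# swish (SiLU) activation and the SwiGLU block:
# SwiGLU(X) = (swish.(X * Wg) .* (X * W1)) * W2, with Wg, W1, W2 trainable.
sigmoid(x) = 1 / (1 + exp(-x))
swish(x) = x * sigmoid(x)
swiglu(X, Wg, W1, W2) = (swish.(X * Wg) .* (X * W1)) * W2
```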

Now that we see how the self-decoder works, lets go into the two ways the authors considered implementing ESA.

First, they considered what is called Gated Retention. Retention and self-attention are admittedly very similar, with the authors of the "Retentive Network: A Successor to Transformer for Large Language Models" paper saying that the key difference lies in the activation function: retention removes the softmax, allowing for a recurrent formulation. They use this recurrent formulation, along with its parallelizability, to drive memory efficiencies.

To dive into the mathematical details:

We have our typical matrices of Q, K, and V, each of which is multiplied by the learnable weights associated with that matrix. We then find the Hadamard product between the weighted matrices and the scalar γ. The goal in using γ is to create exponential decay, while we then use the D matrix to help with causal masking (stopping future tokens from interacting with current tokens) and activation.
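Concretely, the decay-and-masking matrix D combines both roles, with γ a scalar in (0, 1) as in the retention paper (a sketch):

```julia
# D[i, j] = γ^(i-j) for j ≤ i (past tokens, exponentially decayed),
# and 0 for j > i (future tokens are masked out).
decay_matrix(n::Int, γ::Float64) = [j <= i ? γ^(i - j) : 0.0 for i in 1:n, j in 1:n]
```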

Gated Retention is distinct from retention via the decay value: here the matrix W is used to make the decay data-driven, allowing our ESA to adapt to the input.

Sliding Window ESA introduces the idea of limiting how many tokens the attention window should cover. While in regular self-attention all previous tokens are attended to in some way (even if their value is 0), in Sliding Window ESA we choose some constant value C that limits the size of these matrices. This means that during inference time the KV cache can be reduced to a constant complexity.

To again dive into the math:

We have our matrices being scaled by their corresponding weights. Next, we compute the head similarly to how multi-head attention is computed, where B acts both as a causal mask and as a constraint ensuring that only the C most recent tokens are attended to.
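The mask B can be sketched the same way, with C as the window size (an illustration, not the paper's exact formulation):

```julia
# B[i, j] is true only when token j is in the causal window of token i:
# j ≤ i (no future tokens) and i - j < C (at most C tokens back).
window_mask(n::Int, C::Int) = [j <= i && i - j < C for i in 1:n, j in 1:n]
```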

Read more here:

Understanding You Only Cache Once | by Matthew Gunton | Jun, 2024 - Towards Data Science

Read More..

Neo4j Announces Collaboration with Snowflake for Advanced AI Insights & Predictive Analytics USA – English – PR Newswire

Neo4j knowledge graphs, graph algorithms, and ML tools are fully integrated within Snowflake - with zero ETL & requiring no specialist graph expertise

SAN FRANCISCO, June 4, 2024 /PRNewswire/ -- Graph database and analytics leader Neo4j today announced at Snowflake's annual user conference, Snowflake Data Cloud Summit 2024, a partnership with Snowflake to bring its fully integrated native graph data science solution within the Snowflake AI Data Cloud. The integration enables users to instantly execute more than 65 graph algorithms, eliminates the need to move data out of their Snowflake environment, and empowers them to leverage advanced graph capabilities using the SQL programming language, environment, and tooling that they already know.

The offering removes complexity, management hurdles, and learning curves for customers seeking graph-enabled insights crucial for AI/ML, predictive analytics, and GenAI applications. The solution features the industry's most extensive library of graph algorithms to identify anomalies and detect fraud, optimize supply chain routes, unify data records, improve customer service, power recommendation engines, and hundreds of other use cases. Anyone who uses Snowflake SQL can get more projects into production faster, accelerate time-to-value, and generate more accurate business insights for better decision-making.

Neo4j graph data science is an analytics and machine learning (ML) solution that identifies and analyzes hidden relationships across billions of data points to improve predictions and discover new insights. Neo4j's library of graph algorithms and ML modeling enables customers to answer questions like what's important, what's unusual, and what's next. Customers can also build knowledge graphs, which capture relationships between entities, ground LLMs in facts, and enable LLMs to reason, infer, and retrieve relevant information more accurately and effectively. Neo4j graph data science customers include Boston Scientific, Novo Nordisk, OrbitMI, and Zenapse, among many others.

"By 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021, facilitating rapid decision-making across the enterprise," predicts Gartner in its Emerging Tech Impact Radar: Data and Analytics, November 20, 2023 report. Gartner also notes, "Data and analytics leaders must leverage the power of large language models (LLMs) with the robustness of knowledge graphs for fault-tolerant AI applications," in the November 2023 report AI Design Patterns for Knowledge Graphs and Generative AI.

Neo4j with Snowflake: new offering capabilities and benefits

Enterprises can harness and scale their secure, governed data natively in Snowflake and augment it with Neo4j's graph analytics and reasoning capabilities for more efficient and timely decision-making, saving customers time and resources.

Supporting quotes

Greg Steck, VP Consumer Analytics, Texas Capital Bank

"At Texas Capital Bank, we're built to help businesses and their leaders succeed. We use Snowflake and Neo4j for critical customer 360 and fraud use cases where relationships matter. We are excited about the potential of this new partnership. The ability to use Neo4j graph data science capabilities within Snowflake will accelerate our data applications and further enhance our ability to bring our customers long-term success."

Jeff Hollan, Head of Applications and Developer Platform, Snowflake

"Integrating Neo4j's proven graph data science capabilities with the Snowflake AI Data Cloud marks a monumental opportunity for our joint customers to optimize their operations. Together, we're equipping organizations with the tools to extract deeper insights, drive innovation at an unprecedented pace, and set a new standard for intelligent decision-making."

Sudhir Hasbe, Chief Product Officer, Neo4j

"Neo4j's leading graph analytics combined with Snowflake's unmatched scalability and performance redefines how customers extract insights from connected data while meeting users in the SQL interfaces where they are today. Our native Snowflake integration empowers users to effortlessly harness the full potential of AI/ML, predictive analytics, and Generative AI for unparalleled insights and decision-making agility."

The new capabilities are available for preview and early access, with general availability later this year on Snowflake Marketplace. For more information, read our blog post or contact us for a preview of Neo4j on Snowflake AI Data Cloud.

To learn more about how organizations are building next gen-apps on Snowflake, click here.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

About Neo4j

Neo4j, the Graph Database & Analytics leader, helps organizations find hidden patterns and relationships across billions of data connections deeply, easily, and quickly. Customers leverage the structure of their connected data to reveal new ways of solving their most pressing business problems, from fraud detection, customer 360, knowledge graphs, supply chain, personalization, IoT, and network management, and more, even as their data grows. Neo4j's full graph stack delivers powerful native graph storage with native vector search capability, data science, advanced analytics, and visualization, with enterprise-grade security controls, scalable architecture, and ACID compliance. Neo4j's dynamic open-source community brings together over 250,000 developers, data scientists, and architects across hundreds of Fortune 500 companies, government agencies, and NGOs. Visit

Contact: [emailprotected]

© 2024 Neo4j, Inc. Neo Technology, Neo4j, Cypher, Neo4j Bloom, Neo4j Graph Data Science Library, Neo4j Aura, and Neo4j AuraDB are registered trademarks or trademarks of Neo4j, Inc. All other marks are owned by their respective companies.


Here is the original post:

Neo4j Announces Collaboration with Snowflake for Advanced AI Insights & Predictive Analytics USA - English - PR Newswire

Read More..

Effective Strategies for Managing ML Initiatives | by Anna Via | Jun, 2024 – Towards Data Science

Embracing uncertainty, right people, and learning from the data

Picture by Cottonbro, on Pexels

This blog post is an updated version of part of a conference talk I gave at GOTO Amsterdam last year. The talk is also available to watch online.

Providing value and positive impact through machine learning product initiatives is not an easy job. One of the main reasons for this complexity is the fact that, in ML initiatives developed for digital products, two sources of uncertainty intersect. On one hand, there is the uncertainty related to the ML solution itself (will we be able to predict what we need to predict with good enough quality?). On the other hand, there is the uncertainty related to the impact the whole system will be able to provide (will users like this new functionality? will it really help solve the problem we are trying to solve?).

All this uncertainty means failure in ML product initiatives is something relatively frequent. Still, there are strategies to manage and improve the probabilities of success (or at least to survive through them with dignity!). Starting ML initiatives on the right foot is key. I discussed my top learnings in that area in a previous post: start with the problem (and define how predictions will be used from the beginning), start small (and maintain small if you can), and prioritize the right data (quality, volume, history).

However, starting a project is just the beginning. The challenge to successfully manage an ML initiative and provide a positive impact continues throughout the whole project lifecycle. In this post, I'll share my top three learnings on how to survive and thrive during ML initiatives:

It is really hard (impossible even!) to plan ML initiatives beforehand and to develop them according to that initial plan.

The most popular project plan for ML initiatives is the ML Lifecycle, which splits the phases of an ML project into business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Although these phases are drawn as consecutive steps, in many representations of this lifecycle you'll find arrows pointing backward: at any point in the project, you might learn something that forces you to go back to a previous phase.

This translates into projects where it is really hard to know when they will finish. For example, during the evaluation step, you might realize thanks to model explainability techniques that a specific feature wasn't well encoded, and this forces you to go back to the data preparation phase. It could also happen that the model isn't able to predict with the quality you need, which might force you to go back to the beginning, to the business understanding phase, to redefine the project and business logic.

Whatever your role in an ML initiative or project is, it is key to acknowledge things won't go according to plan, to embrace all this uncertainty from the beginning, and to use it to your advantage. This is important both for managing stakeholders (expectations, trust) and for yourself and the rest of the team (motivation, frustration). How?

Any project starts with people. The right combination of people, skills, perspectives, and a network that empowers you.

The days when Machine Learning (ML) models were confined to the Data Scientist's laptop are over. Today, the true potential of ML is realised when models are deployed and integrated into the company's processes. This means more people and skills need to collaborate to make that possible (Data Scientists, Machine Learning Engineers, Backend Developers, Data Engineers).

The first step is identifying the skills and roles that are required to successfully build the end-to-end ML solution. However, more than a group of roles covering a list of skills is required. Having a diverse team that can bring different perspectives and empathize with different user segments has proven to help teams improve their ways of working and build better solutions (why having a diverse team will make your products better).

People don't talk about this enough, but the key people to deliver a project go beyond the team itself. I refer to these other people as the network. The network is people you know are really good at specific things, whom you trust to ask for help and advice when needed, and who can unblock, accelerate, or empower you and the team. The network can be your business stakeholders, manager, staff engineers, user researchers, data scientists from other teams, or customer support team. Ensure you build your own network and identify the ally you can go to depending on each specific situation or need.

A project is a continuous learning opportunity, and many times learnings and insights come from checking the right data and monitors.

In ML initiatives there are three big groups of metrics and measures that can bring a lot of value in terms of learnings and insights: model performance monitoring, service performance, and final impact monitoring. In a previous post I dive deeper into this topic.

Checking the right data and monitors while developing or deploying ML solutions is key to:

Effectively managing ML initiatives from beginning to end is a complex task with multiple dimensions. In this blog post I shared, based on my experience first as a Data Scientist and later as an ML Product Manager, the factors I consider key when dealing with an ML project: embracing uncertainty, surrounding yourself with the right people, and learning from the data.

I hope these insights help you successfully manage your ML initiatives and drive positive impact through them. Stay tuned for more posts about the intersection of Machine Learning and Product Management 🙂

Read the original here:

Effective Strategies for Managing ML Initiatives | by Anna Via | Jun, 2024 - Towards Data Science

Read More..

How to Deploy ML Solutions with FastAPI, Docker, and GCP – Towards Data Science

This is the 5th article in a larger series on Full Stack Data Science. In this article, I walk through the deployment of an ML-based search API. While we could do this in countless ways, here I discuss a simple 3-step approach that can be applied to almost any machine learning solution. The example code is freely available on the GitHub repository.

More here:

How to Deploy ML Solutions with FastAPI, Docker, and GCP - Towards Data Science

Read More..