Category Archives: Data Science

Introducing Seaborn Objects: One Ring to Rule Them All! – Towards Data Science

Quick Success Data Science

One plotting ring to rule them all!

Image: "One ring to Plot them all" (by DALL-E 2)

Have you started using the new Seaborn Objects System for plotting with Python? You definitely should; it's a wonderful thing.

Introduced in late 2022, the new system is based on the Grammar of Graphics paradigm that powers Tableau and R's ggplot2. This makes it more flexible, modular, and intuitive. Plotting with Python has never been better.

In this Quick Success Data Science project, you'll get a quick-start tutorial on the basics of the new system. You'll also get several useful cheat sheets compiled from the Seaborn Objects official docs.

We'll use the following open-source libraries for this project: pandas, Matplotlib, and seaborn. You can find installation instructions in each of the previous hyperlinks. I recommend installing these in a virtual environment or, if you're an Anaconda user, in a conda environment dedicated to this project.

The goal of Seaborn has always been to make Matplotlib, Python's primary plotting library, both easier to use and nicer to look at. As part of this, Seaborn has relied on declarative plotting, where much of the plotting code is abstracted away.

The new system is designed to be even more intuitive and to rely less on difficult Matplotlib syntax. Plots are built incrementally, using interchangeable marker types. This reduces the number of things you need to remember while allowing for a logical, repeatable workflow.

The use of a modular approach means you don't need to remember a dozen or more method names like barplot() or scatterplot() to build plots. Every plot is now initiated with a single Plot() class.

The Plot() class sets up the blank canvas for your graphic. Enter the following code to see an example (shown using JupyterLab):
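
The original notebook code isn't reproduced in this excerpt; a minimal sketch consistent with the description, using seaborn's built-in tips example dataset, would be:

```python
import seaborn.objects as so
from seaborn import load_dataset

tips = load_dataset("tips")  # seaborn's built-in example dataset

# Plot() alone just sets up the canvas with labeled axes; no data is
# drawn until a mark such as so.Dot() is layered on with .add()
so.Plot(tips, x="total_bill", y="tip")
```

In JupyterLab, the Plot object renders automatically when it is the last expression in a cell; chaining .add(so.Dot()) onto it would draw a scatter of points on the canvas.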


The Road to Biology 2.0 Will Pass Through Black-Box Data – Towards Data Science

AI-first Biotech

This year marks perhaps the zenith of expectations for AI-based breakthroughs in biology, transforming it into an engineering discipline that is programmable, predictable, and replicable. Drawing insights from AI breakthroughs in perception, natural language, and protein structure prediction, we endeavour to pinpoint the characteristics of biological problems that are most conducive to being solved by AI techniques. Subsequently, we delineate three conceptual generations of bio-AI approaches in the biotech industry and contend that the most significant future breakthrough will arise from the transition away from traditional white-box data, understandable by humans, to novel high-throughput, low-cost AI-specific black-box data modalities developed in tandem with appropriate computational methods.

46 min read

This post was co-authored with Luca Naef.

The release of ChatGPT by OpenAI in November 2022 thrust Artificial Intelligence into the global public spotlight [1]. It likely marked the first instance where even people far from the field realised that AI is imminently and rapidly altering the very foundations of how humans will work in the near future [2]. A year down the road, once the limitations of ChatGPT and similar systems have become better understood [3], the initial doom predictions, ranging from the habitual panic about future massive job replacement by AI to declarations that OpenAI would be the bane of Google, have given way to impatience: "why is it so slow?", in the words of Sam Altman, the CEO of OpenAI [4]. Familiarity breeds contempt, as the saying goes.

We are now seeing the same frenetic optimism around AI in the biological sciences, with hopes that are probably best summarised by DeepMind


Largest-ever map of universe's active supermassive black holes – EurekAlert

Image: An infographic explaining the creation of a new map of around 1.3 million quasars from across the visible universe.

Credit: ESA/Gaia/DPAC; Lucy Reading-Ikkanda/Simons Foundation; K. Storey-Fisher et al. 2024

Astronomers have charted the largest-ever volume of the universe with a new map of active supermassive black holes living at the centers of galaxies. Called quasars, the gas-gobbling black holes are, ironically, some of the universe's brightest objects.

The new map logs the location of about 1.3 million quasars in space and time, the furthest of which shone bright when the universe was only 1.5 billion years old. (For comparison, the universe is now 13.7 billion years old.)

"This quasar catalog is different from all previous catalogs in that it gives us a three-dimensional map of the largest-ever volume of the universe," says map co-creator David Hogg, a senior research scientist at the Flatiron Institute's Center for Computational Astrophysics in New York City and a professor of physics and data science at New York University. "It isn't the catalog with the most quasars, and it isn't the catalog with the best-quality measurements of quasars, but it is the catalog with the largest total volume of the universe mapped."

Hogg and his colleagues present the map in a paper published March 18 in The Astrophysical Journal. The paper's lead author, Kate Storey-Fisher, is a postdoctoral researcher at the Donostia International Physics Center in Spain.

The scientists built the new map using data from the European Space Agency's Gaia space telescope. While Gaia's main objective is to map the stars in our galaxy, it also inadvertently spots objects outside the Milky Way, such as quasars and other galaxies, as it scans the sky.

"We were able to make measurements of how matter clusters together in the early universe that are as precise as some of those from major international survey projects, which is quite remarkable given that we got our data as a bonus from the Milky Way-focused Gaia project," Storey-Fisher says.

Quasars are powered by supermassive black holes at the centers of galaxies and can be hundreds of times as bright as an entire galaxy. As the black hole's gravitational pull spins up nearby gas, the process generates an extremely bright disk and sometimes jets of light that telescopes can observe.

The galaxies that quasars inhabit are surrounded by massive halos of invisible material called dark matter. By studying quasars, astronomers can learn more about dark matter, such as how much it clumps together.

Astronomers can also use the locations of distant quasars and their host galaxies to better understand how the cosmos expanded over time. For example, scientists have already compared the new quasar map with the oldest light in our cosmos, the cosmic microwave background. As this light travels to us, it is bent by the intervening web of dark matter, the same web mapped out by the quasars. By comparing the two, scientists can measure how strongly matter clumps together.

"It has been very exciting to see this catalog spurring so much new science," Storey-Fisher says. "Researchers around the world are using the quasar map to measure everything from the initial density fluctuations that seeded the cosmic web to the distribution of cosmic voids to the motion of our solar system through the universe."

The team used data from Gaia's third data release, which contained 6.6 million quasar candidates, and data from NASA's Wide-Field Infrared Survey Explorer and the Sloan Digital Sky Survey. By combining the datasets, the team removed contaminants such as stars and galaxies from Gaia's original dataset and more precisely pinpointed the distances to the quasars. The team also created a map showing where dust, stars and other nuisances are expected to block our view of certain quasars, which is critical for interpreting the quasar map.

"This quasar catalog is a great example of how productive astronomical projects are," says Hogg. "Gaia was designed to measure stars in our own galaxy, but it also found millions of quasars at the same time, which give us a map of the entire universe."

ABOUT THE FLATIRON INSTITUTE

The Flatiron Institute is the research division of the Simons Foundation. The institute's mission is to advance scientific research through computational methods, including data analysis, theory, modeling and simulation. The institute's Center for Computational Astrophysics creates new computational frameworks that allow scientists to analyze big astronomical datasets and to understand complex, multi-scale physics in a cosmological context.

Journal: The Astrophysical Journal
Method of Research: Observational study
Subject of Research: Not applicable
Article Publication Date: 18-Mar-2024

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.


Understanding Impact of Advanced Retrievers on RAG Behavior through Visualization – Towards Data Science

13 min read

LLMs have become adept at text generation and question-answering, including some smaller models such as Gemma 2B and TinyLlama 1.1B. Yet even such performant pre-trained models may not perform well when queried about documents they did not see during training. In such a scenario, supplementing your question with relevant context from the documents is an effective approach. This approach, termed Retrieval-Augmented Generation (RAG), has gained significant popularity due to its simplicity and effectiveness.

The retriever is a key component of a RAG system; it obtains relevant document chunks from a back-end vector store. In a recent survey paper on the evolution of RAG systems, the authors classified such systems into three categories: Naive, Advanced and Modular [1]. Within the advanced category, post-retrieval optimization techniques, such as summarizing and re-ranking retrieved documents, have been identified as key improvements over the naive approach.
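
To make the idea concrete, here is a minimal, library-agnostic sketch of what a naive retriever does. The embedding arrays and function name are illustrative placeholders, not the article's actual stack:

```python
import numpy as np

def naive_retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k chunks most similar to the query (cosine similarity)."""
    sims = chunk_vecs @ query_vec  # dot product of each chunk embedding with the query
    sims = sims / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity

# Advanced retrievers add a post-retrieval step on top of this, e.g.
# re-ranking the k hits with a cross-encoder or summarizing them
# before they are stuffed into the LLM prompt.
```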

In this article, we will look at how a naive retriever and two advanced retrievers influence RAG behavior. To better represent and characterize their influence, we will visualize the document vector space, along with the related documents, in 2-D using the visualization library renumics-spotlight. This library boasts powerful features for visualizing the intricacies of document embeddings, yet it is easy to use. For our LLM of choice, we will use TinyLlama 1.1B Chat, a compact model without a proportional drop in accuracy [2], which makes it ideal for rapid experimentation.
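
For reference, spotlight is typically launched directly on a pandas DataFrame. A minimal sketch (the dataframe contents and column names here are made up for illustration):

```python
import pandas as pd
from renumics import spotlight

# Each row pairs a document chunk with its embedding vector
df = pd.DataFrame({
    "text": ["chunk one ...", "chunk two ..."],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Opens an interactive browser UI; declaring the column as an Embedding
# enables the 2-D similarity map of the vector space
spotlight.show(df, dtype={"embedding": spotlight.Embedding})
```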

Disclaimer: I don't have any affiliation with Renumics or its creators. This article provides an unbiased view of the library based on my personal experience, with the intention of making this knowledge more widely available.

Table of Contents
1.0 Environment and Key Components
2.0 Design and Implementation
2.1 Module LoadVectorize
2.2 The main Module
3.0 Knobs on Spotlight UI
4.0 Comparison of Retrievers
5.0 Closing Remarks


Probably the Best Data Visualisation for Showing Many-to-Many Proportion In Python – Towards Data Science

How to draw a fancy chord chart with links using PyCirclize

In my previous article, I introduced the Python library PyCirclize. It can help us generate very nice Circos charts (or chord charts, if you like) with very little effort. If you want to know how it can make data visualisation well-rounded, please don't miss out.

However, don't worry if you are only interested in chord charts with links. This article will make sure you understand how to draw this type of chart.

In this article, I'll introduce another type of chord chart that PyCirclize can draw: a chord chart with links. It visualizes proportional relationships between many-to-many entities very well, arguably better than any other common diagram type.

Before we start, just make sure to use pip to install the library as follows. Then we are all good to go. Let's explore this fancy chart together!
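
```
pip install pycirclize
```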

As usual, let's start with something abstract but easy to follow. The purpose is to show you what the chart looks like and what the basic way of plotting it is. Let me put the full code and the diagram at the beginning.
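
The article's exact code isn't reproduced in this excerpt; the sketch below follows the matrix-based pattern from PyCirclize's documentation, with made-up values and labels:

```python
import pandas as pd
from pycirclize import Circos

# Toy many-to-many matrix: rows are sources, columns are targets,
# each cell is the size of the flow between the two entities
matrix_df = pd.DataFrame(
    [
        [10, 16, 7],
        [4, 9, 10],
        [17, 13, 7],
    ],
    index=["S1", "S2", "S3"],
    columns=["T1", "T2", "T3"],
)

# Build the chord chart with links directly from the matrix
circos = Circos.initialize_from_matrix(
    matrix_df,
    space=5,                           # gap (in degrees) between sectors
    cmap="tab10",                      # colormap for the sectors
    label_kws=dict(size=12),           # sector label styling
    link_kws=dict(ec="black", lw=0.5), # link edge styling
)
circos.savefig("chord_chart.png")
```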


Optimizing Pandas Code: The Impact of Operation Sequence – Towards Data Science

PYTHON PROGRAMMING. Learn how to rearrange your code to achieve significant speed improvements. 9 min read

Pandas offers a fantastic framework for operating on dataframes. In data science, we work with small, big and sometimes very big dataframes. While analyzing small ones can be blazingly fast, even a single operation on a big dataframe can take noticeable time.

In this article, I will show that you can often make this time shorter with something that costs practically nothing: the order of operations on a dataframe.

Imagine the following dataframe:
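
The dataframe definition isn't reproduced in this excerpt; a stand-in of the same shape (a million rows, 25 columns named 'a' through 'y', integer values) can be built like this:

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
cols = list(string.ascii_lowercase[:25])  # columns 'a' through 'y'
df = pd.DataFrame(
    rng.integers(0, 100_000, size=(1_000_000, 25)),
    columns=cols,
)
```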

With a million rows and 25 columns, it's big. Many operations on such a dataframe will take noticeable time on current personal computers.

Imagine we want to filter the rows to keep those that satisfy the condition a < 50_000 and b > 3000, and to select five columns: take_cols=['a', 'b', 'g', 'n', 'x']. We can do this in the following way:
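
The original code block is missing from this excerpt; a sketch consistent with the description that follows:

```python
take_cols = ['a', 'b', 'g', 'n', 'x']

# Version 1: select the required columns first, then filter the rows
sub = df[take_cols]
res1 = sub[(sub["a"] < 50_000) & (sub["b"] > 3000)]
```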

In this code, we take the required columns first, and then we perform the filtering of rows. We can achieve the same with a different order of operations, first performing the filtering and then selecting the columns:
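
Again, the original block isn't shown; a matching sketch:

```python
# Version 2: filter the rows first, then select the required columns
sub = df[(df["a"] < 50_000) & (df["b"] > 3000)]
res2 = sub[take_cols]
```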

We can achieve the very same result by chaining Pandas operations. The corresponding pipes of commands are as follows:
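
A sketch of the two chained versions, using filter() and query() as one plausible way to express the pipes:

```python
# Version 3: chained pipe, columns first, then rows
res3 = df.filter(take_cols).query("a < 50000 and b > 3000")

# Version 4: chained pipe, rows first, then columns
res4 = df.query("a < 50000 and b > 3000").filter(take_cols)
```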

Since df is big, the four versions will probably differ in performance. Which will be the fastest and which will be the slowest?

Let's benchmark these operations. We will use the timeit module:
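
A benchmark sketch that wraps each version in a function and times it:

```python
from timeit import timeit

def cols_then_rows():
    sub = df[take_cols]
    return sub[(sub["a"] < 50_000) & (sub["b"] > 3000)]

def rows_then_cols():
    sub = df[(df["a"] < 50_000) & (df["b"] > 3000)]
    return sub[take_cols]

def chained_cols_first():
    return df.filter(take_cols).query("a < 50000 and b > 3000")

def chained_rows_first():
    return df.query("a < 50000 and b > 3000").filter(take_cols)

# Average over several runs; the timings will depend on how selective
# the filter is and how many of the 25 columns are kept.
for fn in (cols_then_rows, rows_then_cols, chained_cols_first, chained_rows_first):
    print(f"{fn.__name__}: {timeit(fn, number=10):.3f} s")
```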


What Does it Take to Get into Data Engineering in 2024? – Towards Data Science

Career advice for aspiring data practitioners 14 min read

If you are reading this, you have probably been considering a career change lately. I am assuming that you want to learn something close to software engineering and database design. It doesn't matter what your background is, marketing, analytics or finance: you can do this! This story is meant to help you find the fastest way to enter the data space. Many years ago I did the same and have never regretted it. The technology space, and especially data, is full of wonders and perks, not to mention remote working and the massive benefit packages offered by the leading IT companies; it makes you capable of doing magic with files and numbers. In this story, I'll try to summarise a set of skills and possible projects that could be accomplished within a two-to-three-month timeframe. Imagine: just a few months of active learning and you are ready for your first job interview.

Any sufficiently advanced technology is indistinguishable from magic.

Indeed, why not data analytics or data science? I think the answer resides in the nature of this role, as it combines the most difficult parts of those worlds. To become a data engineer, you would need to learn software engineering and database design, machine learning (ML) models, and to understand data modelling and Business Intelligence (BI) development.

Data engineering is the fastest-growing job, according to DICE. They conducted research demonstrating that there is a gap, so be quick.

While data scientist has long been considered the sexiest job in the market, it now seems there is a certain lack of data engineers. I can see massive demand in this area, not only for experienced and highly qualified engineers but also for entry-level roles. Data engineering has been one of the fastest-growing careers in the UK over the last five years, ranking 13th on LinkedIn's list of the most in-demand jobs in 2023 [1].


FGV EMAp holds graduation ceremony for Brazil's first data science and AI course – Portal FGV

On March 1, 2024, Fundação Getulio Vargas School of Applied Mathematics (FGV EMAp) held the graduation ceremony for the first group of students to complete its Data Science and Artificial Intelligence Course. The ceremony, held in FGV's main building in Rio de Janeiro, was attended by 38 undergraduate students, of whom 21 are studying applied mathematics, 13 are studying data science and four are doing a dual degree.

According to the school's dean, César Camacho, EMAp's position in the job market is extremely satisfactory and underscores the institution's quality. "In the past, a degree in engineering, law or medicine was enough to ensure a promising career. However, society has become more complex and diverse in its demands, so as well as a degree, people need to invest in constant professional development in line with the sophisticated technological advances that are taking place. Our figures show that 100% of EMAp graduates are swiftly hired, including those who choose to do a master's or doctorate," he said.

Yuri Saporito, the coordinator of the Data Science and Artificial Intelligence Course and the class sponsor, expressed his gratitude for taking part in this moment alongside students who made a difference over the four years they were together. "I couldn't have wished for a better first class. You were attentive, interested and very participative. This course was envisioned in 2018 during an internal FGV meeting, when I realized that a degree in data science and artificial intelligence would open our doors to students who hadn't previously considered FGV as an option. It was a huge effort, but we managed to implement the course in mid-2019 and our first class, you guys, started in March 2020," he said.

Tiago da Silva Henrique, a student on the data science course and a CDMC scholarship holder, was mentioned as an academic highlight. He said he was grateful for the institutional recognition of his performance during his degree. According to him, the graduation ceremony marked the beginning of a long professional journey. He said that the school was very intellectually demanding but offered a safe environment with well-defined objectives for students' personal development. "More difficult decisions and greater challenges are yet to be faced by me and my colleagues. My feeling, then, is one of proactivity, knowing that my next steps will possibly determine the subsequent years of my career," he said. He also emphasized the course's pioneering status, at the forefront of a fast-growing industry that will potentially benefit from modern technical training. "The recent commercial rise of applications based on large language models, such as ChatGPT, reinforces this point," he concluded.

Twenty-five of the students came from the Center for the Development of Mathematics and Science (FGV CDMC) project. Created by FGV in 2017, this project aims to identify talented youngsters in the country's government schools and offer them the possibility of taking FGV's undergraduate and graduate courses in Rio de Janeiro.

In recent years, this talent selection project has offered university scholarships to outstanding government school students and medal winners in national mathematics Olympiads. These students are invited to apply to take any of the undergraduate courses offered by FGV in Rio de Janeiro: Applied Mathematics, Data Science and Artificial Intelligence, Economics, Administration, Social Sciences, Law and Digital Communication.


Analytics and Data Science News for the Week of March 15: Updates from Quantexa, Alation, Matillion, and More – Solutions Review

Solutions Review Executive Editor Tim King curated this list of notable analytics and data science news for the week of March 15, 2024.

Keeping tabs on the most relevant analytics and data science news can be time-consuming. As a result, our editorial team aims to provide a summary of the top headlines from the last week in this space. Solutions Review editors will curate vendor product news, mergers and acquisitions, venture capital funding, talent acquisition, and other noteworthy analytics and data science news items.

Alation, Inc., the data intelligence company, announced its business lineage to provide business users with broad and rich visibility into data's journey as it flows and transforms through systems. This visibility increases trust in the data and accelerates time to insights. Business lineage is an extension of Alation's end-to-end data lineage, providing an abstraction layer to filter views of data flows and visualize relationships across technical, system, and business metadata. This unified approach offers a complete view so that all users can confidently and effectively harness information to unlock the full potential of an organization.

Read on for more.

Quantexa, a decision intelligence solution provider for the public and private sectors, used the backdrop of QuanCon24, its annual customer and partner conference, to reveal its Decision Intelligence Platform roadmap and provide an update on Q Assist, a generative artificial intelligence (AI) assistant that previewed in July last year. Quantexa also announced a partnership with Microsoft. Dan Higgins, Quantexa's Chief Product Officer, was joined by Kate Rosenshine, Global Technology Director of Strategic Partnerships at Microsoft.

Read on for more.

Matillion's Data Productivity Cloud is now available for Databricks, enabling users to access the power of Delta Lake within their data engineering. The Data Productivity Cloud with Databricks brings no-code data ingestion and transformation capabilities that are purpose-built for Databricks, enabling users to quickly build data pipelines at scale, which can be used in AI and analytics projects.

Read on for more.

In partnership with universities and colleges globally, data and AI leader SAS is helping students and educators prepare for an AI-driven economy. To recognize significant contributions to data analytics education, the SAS Educator Awards honor university educators who excel at integrating SAS analytic tools within their academic institutions. Winners are nominated and chosen based on their use of SAS and commitment to preparing early career talent.

Read on for more.

On March 27, Solutions Review will host a Solutions Spotlight webinar with Amplitude, a digital analytics and event-tracking platform. During the hour-long presentation, attendees will gain a deeper understanding of Amplitude's platform and see how companies can apply data insights to product-led growth workflows and use them in their ongoing marketing efforts. The webinar will also feature a Q&A section with Laura Schaffer, Vice President of Growth at Amplitude.

Read on for more.

On March 22, Solutions Review will host a Solutions Spotlight webinar with Alteryx, an analytics solutions provider. Join Director of Product Management Sarah Welch and Manager of Product Management David Cooperberg to learn how the Alteryx AI Platform for Enterprise Analytics offers integrated generative and conversational AI, data preparation, advanced analytics, and automated reporting capabilities. Register now to reserve your seat for the webinar, scheduled for 12:00 pm Eastern Time.

Read on for more.

SoundCommerce, a retail data platform provider, announced a new partnership with Cordial, the marketing platform that powers billions of high-conversion email, SMS, and mobile app messages based on data. PacSun is the first consumer brand to take advantage of the new partnership, launching on Cordial with SoundCommerce data in less than 90 days. By leveraging SoundCommerce's actionable data insights alongside Cordial's personalized messaging solutions, retail brands can provide real-time, data-driven interactions that foster customer loyalty and maximize revenue opportunities.

Read on for more.

For consideration in future news roundups, send your announcements to the editor: tking@solutionsreview.com.


Alteryx’s Steve Harris Explains How AI Is Changing Data Analytics – GovCon Wire

At the end of 2023, Steve Harris became the president and general manager of the newly established public sector business unit at Alteryx, a data science and analytics tools provider in the GovCon market. Executive Mosaic recently sat down with Harris, a six-time Wash100 Award winner, to learn more about how artificial intelligence is shaping the data analytics industry and how Alteryx is embracing the technology.

Harris previously served as chief revenue officer for Ellucian, and prior to that role, he spent more than two decades at Dell Technologies. Read below for Harris' full Executive Spotlight interview.

Tell me about the current state of the artificial intelligence market. Where are you seeing new opportunities in AI, and where do you think the market is heading?

There's a tremendous amount of opportunity, and with that can come confusion. There's a lot of curiosity and potential as well as risk. I like to compare it to the very early days of cloud, where the bad actors were scaling faster than cyber protection capabilities. It's a huge emerging market. There's a massive fragmentation of players.

What are some of the biggest opportunity areas you see with generative AI? How is Alteryx approaching that?

AI is when technology simulates the actions of a person, and it's based on machine learning. Generative AI can actually produce content and emulate the way that a human would create content. That can create opportunity in a number of ways, and particularly, I see worthwhile uses in analytics.

Given that generative AI is only as good, or just as bad, as the data it's applied to, Alteryx is in a terrific position with our entire AI Platform for Enterprise Analytics, where we help people understand incredibly complex data sets in a number of ways.

At Alteryx, we believe that analytics can empower all employees to make faster, more insightful and more confident decisions regardless of technical skill level. Organizations can drive smarter, faster decisions and automate analytics to improve revenue performance, manage costs and mitigate risks across their organizations.

Alteryx AiDIN is the AI engine that powers the Alteryx AI platform, bringing enterprise-grade machine learning and generative AI for faster time to value, streamlined innovation, improved operations and enhanced governance.

We're a leader in AI for analytics. Our platform is a leader in no-code, easy-to-use technology that allows users across the organization to turn data into insights. The generative AI that we build into our platform is incredibly useful because it's very hard for people to understand their data without studying it for a long time. That's where generative AI comes in and does that data study almost instantly. It's applied to a quality data set: people's own data and the third-party data that they choose to bring in as part of the reference. Generative AI has huge potential and an equal amount of risk today.

What are some of the key challenges agencies face as they try to use their data for decision advantage or to better understand their organizations?

There are many silos of data. Alteryx is here to say that you don't have to embark on any significant or large-scale data management projects in order to get data. We bring the analytics to the data: any kind of data, any place. The data never leaves your environment; we only take the data set from each source of data that is part of that query, and we make it extremely easy and understandable for a layman to get to an accurate, clean data set.

Then, because we are a no-code platform, we really attack that other major issue, which is making accessible technology available so that the people who are closest to the data have the power to transform that data, create business intelligence and bring data to decisions. These are the keys to the kingdom.

And on another note, this whole conversation is about a journey toward data literacy. Everybody likes to talk about the most exciting or interesting parts of the journey, big model AI, generative AI, disparate data sets, merging data, but it's all part of the journey towards data literacy, not only for the staff and administrators of our agencies but also for our citizenry.

U.S. citizens need to know when something they read online was produced by generative AI, because that could impact how much they trust what they're seeing. If generative AI is producing the content on a government website, I no longer trust it, because I have the data literacy to know that that generative AI could be taking information from sources that I don't consider authoritative. Data literacy is a really important overarching topic.

Which other emerging technologies do you anticipate will have the greatest impact on the federal landscape in the next few years?

The truth is that the problems haven't changed dramatically over the last three to five years. We're still talking about JADC2, multi-domain command and control, cyber, cloud. Organizations are still concerned about where to put their data and their workloads. And we're still talking about the lack of analytics and bringing data to decisions. I think the most disruptive technology is going to be the IT modernization of the legacy technology that exists across the spectrum of the federal government. We have really mature technologies that are highly addressable today that just are not being brought to bear.

The disruptor will be enabling the transition from outdated legacy systems to robust and contemporary technology solutions. Positioned as the premier AI platform for enterprise analytics, we boast a proud 27-year legacy. Our platform is trusted by 49 percent of the world's top 2,000 companies, along with numerous government agencies globally. This underscores the vast potential for growth and innovation ahead. Our platform exemplifies the shift from technical debt and antiquated technologies to embracing and expanding modern technological capabilities that can have a compelling, positive impact on people and organizations.

I think the most disruptive technology will be the least proprietary technology. When you think about some of the market leaders being more rapidly adopted in the federal space, those technologies tend to be more black box in nature: less of a software company and more of a proprietary technology with many services, not only to implement but also to maintain. That's the definition of legacy technology. If you're stepping into a legacy IT model as a way to modernize, I think there's a lot of danger there.

I do think that some of the big AI models, and the machine learning that can happen, assisted or unassisted, are going to have a tremendous impact on some of the biggest problems. Big model AI is going to help make a huge difference, taking those huge solutions and applying them to hundreds of thousands of small problems and decisions made every day.
