Category Archives: Data Science

Beyond Data Scientist: 10 Data Science Roles are Worth Applying For – Analytics Insight

In the ever-evolving landscape of data science, the demand for skilled professionals continues to soar. While the role of a data scientist is well known, the field includes many other specialized positions, each offering unique opportunities for career growth and impact. In this article, we explore 10 data science roles beyond the traditional data scientist position that are worth considering and applying for in today's job market.

Data engineers play a crucial role in designing, building, and maintaining the infrastructure needed to support data-driven applications and analytics. They are responsible for managing large volumes of data, optimizing data pipelines, and ensuring data quality and reliability.

Machine learning engineers focus on developing and deploying machine learning models into production environments. They work closely with data scientists to turn research prototypes into scalable and robust machine learning systems that can make predictions and decisions autonomously.

Data analysts are responsible for interpreting and analyzing data to extract insights and inform decision-making. They work with stakeholders to understand business requirements, perform data analysis, and communicate findings through reports, dashboards, and visualizations.

BI developers design and build business intelligence solutions that enable organizations to gather, store, and analyze data to support strategic decision-making. They develop data models, design dashboards, and create interactive reports to provide actionable insights to business users.

Data architects design and implement the overall structure of an organization's data ecosystem. They define data architecture standards, design data models, and develop strategies for data integration, storage, and governance to ensure data consistency and accessibility across the organization.

Data product managers oversee the development and delivery of data-driven products and services. They work closely with cross-functional teams to define product requirements, prioritize features, and ensure alignment with business objectives and user needs.

In addition to the traditional data scientist role, there are specialized positions such as NLP (Natural Language Processing) Scientist, Computer Vision Scientist, and AI Research Scientist. These roles focus on applying advanced techniques and algorithms to solve specific problems in areas such as language understanding, image recognition, and artificial intelligence.

With increasing concerns about data privacy and regulatory compliance, organizations are hiring data privacy officers to ensure that data handling practices adhere to legal and ethical standards. Data privacy officers develop and implement privacy policies, conduct privacy impact assessments, and oversee compliance efforts.

Data governance managers are responsible for establishing and enforcing policies, procedures, and standards for managing data assets effectively. They work with stakeholders to define data governance frameworks, establish data quality metrics, and monitor compliance with data governance policies.

Data science consultants provide strategic advice and technical expertise to help organizations leverage data science and analytics to solve business challenges and achieve strategic goals. They work on a project basis, collaborating with clients to develop customized solutions and drive innovation through data-driven insights.

In conclusion, the field of data science offers a diverse array of career opportunities beyond the traditional data scientist role. Whether you're passionate about engineering data pipelines, building machine learning models, or driving strategic decision-making, there are plenty of exciting roles to explore within the data science domain. By considering these 10 data science roles and their respective responsibilities, you can identify the best fit for your skills, interests, and career aspirations in the dynamic world of data science.

Avoiding abuse and misuse of T-test and ANOVA: Regression for categorical responses – Towards Data Science

We compare the models using the loo package (9, 10) for leave-one-out cross-validation. For an alternative approach using the WAIC criterion (11), I suggest you read this post, also published by TDS Editors.

Under this scheme, the models have very similar performance. In fact, the first model is slightly better for out-of-sample predictions. Accounting for variance did not help much in this particular case, where (perhaps) relying on informative priors can unlock the next step of scientific inference.

I would appreciate your comments or feedback letting me know if this journey was useful to you. If you want more quality content on data science and other topics, you might consider becoming a Medium member.

In the future, you may find an updated version of this post on my GitHub site.

1. M. Bieber, J. Gronewold, A.-C. Scharf, M. K. Schuhmann, F. Langhauser, S. Hopp, S. Mencl, E. Geuss, J. Leinweber, J. Guthmann, T. R. Doeppner, C. Kleinschnitz, G. Stoll, P. Kraft, D. M. Hermann, Validity and Reliability of Neurological Scores in Mice Exposed to Middle Cerebral Artery Occlusion. Stroke. 50, 2875-2882 (2019).

2. P.-C. Bürkner, M. Vuorre, Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science. 2, 77-101 (2019).

3. G. Gigerenzer, Mindless statistics. The Journal of Socio-Economics. 33, 587-606 (2004).

4. P.-C. Bürkner, brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software. 80 (2017), doi:10.18637/jss.v080.i01.

5. H. Wickham, M. Averick, J. Bryan, W. Chang, L. D. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, M. Kuhn, T. L. Pedersen, E. Miller, S. M. Bache, K. Müller, J. Ooms, D. Robinson, D. P. Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke, K. Woo, H. Yutani, Welcome to the tidyverse. Journal of Open Source Software. 4, 1686 (2019).

6. D. Makowski, M. S. Ben-Shachar, D. Lüdecke, bayestestR: Describing effects and their uncertainty, existence and significance within the Bayesian framework. Journal of Open Source Software. 4, 1541 (2019).

7. R. V. Lenth, emmeans: Estimated marginal means, aka least-squares means (2023) (available at https://CRAN.R-project.org/package=emmeans).

8. R. McElreath, Statistical Rethinking (Chapman and Hall/CRC, 2020; http://dx.doi.org/10.1201/9780429029608).

9. A. Vehtari, J. Gabry, M. Magnusson, Y. Yao, P.-C. Bürkner, T. Paananen, A. Gelman, loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models (2022) (available at https://mc-stan.org/loo/).

10. A. Vehtari, A. Gelman, J. Gabry, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 27, 1413-1432 (2016).

11. A. Gelman, J. Hwang, A. Vehtari, Understanding predictive information criteria for Bayesian models. Statistics and Computing. 24, 997-1016 (2013).

Using AI and Data Science to ReNOURISH Food Deserts – University of California San Diego

Twenty-four million Americans live in food deserts where ultraprocessed foods are abundant and fresh food is scarce, giving rise to large health disparities in diabetes and related cardiometabolic diseases. To address this problem, an interdisciplinary team of researchers from UC San Francisco and UC San Diego conceptualized the NOURISH platform, winning support last year from the U.S. National Science Foundation (NSF) Convergence Accelerator program to design the tool. Now, with continued NSF and U.S. Department of Agriculture funding, the team of experts has moved into the platform-building phase.

NOURISH is meant to provide small business owners in food desert communities with access to loans and grants, online maps that optimize the placement of fresh food outlets for foot traffic, help with navigating the convoluted business permitting process and AI-enabled guidance on affordable ways to locally source fresh ingredients.

"Our solution complements government efforts to get fresh food into food deserts by incentivizing grocery stores and big box outlets to sell more fresh food," said Laura Schmidt of UC San Francisco, principal investigator for the project. "But our approach builds upon the often-overlooked assets of these communities, including the entrepreneurial talent of small business owners, rich and diverse food heritages, and an unmet demand for fresh food."

Under the leadership of Amarnath Gupta, a team of computer scientists, software developers and students at the San Diego Supercomputer Center (SDSC) at UC San Diego is combining government, private sector and crowdsourced information to create dynamic, interactive maps of local food systems across the U.S. Gupta is a leading computer scientist in the Cyberinfrastructure and Convergence Research and Education (CICORE) Division at SDSC, directed by Ilkay Altintas.

"NOURISH embodies our vision at SDSC's CICORE Division, where our deep expertise in data science and knowledge management seamlessly integrates with the diverse needs of our interdisciplinary and cross-sector partners. Together, we co-create solutions that are not just equitable but deeply impactful, tackling complex societal challenges head-on," said Altintas, who also serves as SDSC's chief data science officer. "In an era where access to equitable access to fresh food is still hard, NOURISH emerges as a solution to leverage cutting-edge technology to bridge the gap between communities and an ecosystem of entrepreneurship, innovation and cultural diversity. We look forward to seeing the growing impact of this project over the years to come."

Accessible from a mobile phone in multiple languages, the NOURISH platform will include patented recommendation algorithms that customize business plans based on local consumer preferences for price, convenience and flavor.

"Recent advances in scalable data systems and artificial intelligence give us an unprecedented opportunity to use NOURISH to democratize data access, creating a more level playing field between large food companies and small businesses," Gupta said.

Small businesses have relatively low start-up costs, are adaptive to local needs and can help to keep economic resources circulating within low-income communities. Community partners assisting with NOURISH also emphasize the benefits of promoting culturally appropriate food.

A major asset of so-called food deserts is immigrants who bring diverse cuisines featuring traditional dishes that are typically healthier than the standard American diet. "This platform will help people from the community make wholesome food for the community," said Paul Watson, a California-based food equity advocate and director of community engagement for NOURISH.

Other scientists on the team include Keith Pezzoli and Ilya Zaslavsky (UC San Diego), Hans Taparia (New York University), Tera Fazzino (University of Kansas) and Matthew Lange (IC-FOODS). In 2024-25, the NOURISH team will test the platform in lower-income areas within San Diego and Imperial counties in California, and then scale it nationally.

Researcher uses data science to address homelessness | UNC-Chapel Hill – The University of North Carolina at Chapel Hill

In the U.S., more than 650,000 people don't have homes, up 12% in 2023. That's the largest jump seen since the government began collecting this data in 2007. The Triangle is no exception. More than 6,000 people identify as homeless in Raleigh and Wake County. Durham now has twice as many unsheltered individuals as in 2020.

These numbers drive Hsun-Ta Hsu, who's spent the last decade working with some of the largest homeless populations in the country, in Los Angeles and St. Louis, using innovative tools to address this problem.

In July 2023, Hsu's unique skillset led him to Carolina, where he is a professor in both the UNC School of Social Work and the new UNC School of Data Science and Society.

"Dr. Hsu is a prime example of how interdisciplinary data science can create insights that transform a seemingly intractable, multilevel social issue into something solvable," SDSS Dean Stan Ahalt says.

Ramona Denby-Brinson, dean of social work, agrees about Hsu's skills. "His work advances our understanding of neighborhood structures, the development of effective intervention programs and services, and how we can employ social networks in more practical terms to produce better health and behavioral outcomes for the unhoused."

A human right

Hsu learned about social work in high school when his adviser recommended that he major in it based on his background and interests. In college, he earned bachelor's, master's and doctoral degrees in social work.

"I had relatives and people I was close to who were suffering with mental health-related issues, including suicide attempts and substance abuse," he shares. "When I was younger, I didn't know how to deal with it. So I was really thinking about that and I wanted to do something about it."

In 2010, at the start of his doctoral program at the University of Southern California, Hsu got his first look at the 50-block area of Los Angeles known as Skid Row. "I saw a young mother in a wheelchair breastfeeding her baby, surrounded by tents, bad smells and extreme poverty," Hsu recalls. "That's not OK. To me, housing is a human right."

Hsu analyzed data from interviews with people housed by the Los Angeles Homeless Service Authority. He documented neighborhood characteristics for 50 blocks, a time-consuming, labor-intensive process that he thought technology could improve.

After a summer fellowship at the USC Center for Artificial Intelligence in Society, he started developing a mapping tool that uses machine learning to automate the identification of objects like garbage and broken-down cars.

Community-centered research

Since 2010, cities across the country have used another tool, the vulnerability index, to prioritize who gets housing.

"It's a triage tool like we use in the emergency room," Hsu explains. "We are measuring how vulnerable one is on the street and then bumping them up on the priority list to get them housing."

In 2019, Hsu teamed up with CAIS researcher Eric Rice to improve this tool by combining demographic data with feedback from community stakeholders. The stakeholders said they want to be considered for housing based on assets, not deficits. This feedback helped Hsu and Rice revise the vulnerability index survey to include questions focused on an individual's positive traits.

Now Hsu is bringing this project model to rural communities, where nearly 87,000 Americans experience homelessness. Hsu believes his research in both rural and urban homeless populations will aid future projects in North Carolina and beyond.

"Homelessness is a national issue," Ahalt stresses. "This research will create a replicable process that can be used in North Carolina and across the country."

Read more about Hsun-Ta Hsu's work.

From Economics to Electrocardiograms, Data Science Projects Get a Boost From New Seed Grants – University of Utah Health Sciences

Explaining Data Evolution
Anna Fariha, PhD (School of Computing); Nina de Lacy, MD (Department of Psychiatry)

Scalable and Information-Rich Sequence Search Over SRA for Advanced Biological Analyses
Prashant Pandey, PhD (School of Computing); Aaron Quinlan, PhD (Departments of Human Genetics and Biomedical Informatics)

Connecting the Metabolite-Protein Interactome: Precision Diet and Drug Synergy for Enhanced Cancer Care
Mary Playdon, PhD (Departments of Nutrition & Integrative Physiology and Population Health Sciences); Kevin Hicks, PhD (Department of Biochemistry); Aik Choon Tan, PhD (Departments of Oncology and Biomedical Informatics)

Information Theoretic Approaches to Causal Inference
Ellis Scharfenaker, PhD (Department of Economics); Braxton Osting, PhD (Department of Mathematics)

Automated Live Meta-Analysis of Clinical Outcomes Using Generative AI
Fatemeh Shah-Mohammadi, PhD (Department of Biomedical Informatics); Joseph Finkelstein, MD, PhD (Department of Biomedical Informatics)

Modeling the Effect of Artificial Nature Exposure on Brain Health in Bed-bound Populations Using Variational Autoencoders
Elliot Smith, PhD (Department of Neurosurgery); Jeanine Stefanucci, PhD (Department of Psychology)

Using Controlled Animal ECG Recordings for Machine Learning-Based Prediction of Myocardial Ischemia Outcomes
Tolga Tasdizen, PhD (Department of Electrical & Computer Engineering and School of Computing); Ben Steinberg, MD (Division of Cardiovascular Medicine); Rob MacLeod, PhD (Departments of Biomedical Engineering and Internal Medicine)

How to Empower Pandas with GPUs. A quick introduction to cuDF, an NVIDIA | by Naser Tamimi | Apr, 2024 – Towards Data Science

A quick introduction to cuDF, an NVIDIA framework for accelerating Pandas

Pandas remains a crucial tool in data analytics and machine learning endeavors, offering extensive capabilities for tasks such as data reading, transformation, cleaning, and writing. However, its efficiency with large datasets is somewhat limited, hindering its application in production environments or for constructing resilient data pipelines, despite its widespread use in data science projects.

Similar to Apache Spark, Pandas loads the data into memory for computation and transformation. But unlike Spark, Pandas is not a distributed compute platform, and therefore everything must be done on a single system's CPU and memory (single-node processing). This feature limits the use of Pandas in two ways:

The first issue is addressed by frameworks such as Dask. Dask DataFrame helps you process large tabular data by parallelizing Pandas on a distributed cluster of computers. In many ways, Pandas empowered by Dask is similar to Apache Spark (although Spark can still handle large datasets more efficiently, which is why it is a preferred tool among data engineers).

Although Dask enables parallel processing of large datasets across a cluster of machines, in reality, the data for most machine learning projects can be accommodated within a single system's memory. Consequently, employing a cluster of machines for such projects might be excessive. Thus, there is a need for a tool that efficiently executes Pandas operations in parallel on a single machine, addressing the second issue mentioned earlier.

Whenever someone talks about parallel processing, the first word that comes to most engineers' minds is GPU. For a long time, running Pandas on a GPU for efficient parallel computing was only a wish. That wish came true with the introduction of NVIDIA RAPIDS cuDF. cuDF (pronounced KOO-dee-eff) is a GPU DataFrame library for ...
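As a rough sketch of what this can look like in practice, the snippet below runs a small group-by aggregation through cuDF when it is installed and falls back to plain Pandas otherwise. The column names and values are made up for illustration, and the fallback is an assumption added only so the example also runs on machines without an NVIDIA GPU.

```python
import pandas as pd

try:
    # cuDF mirrors much of the Pandas API but executes on an NVIDIA GPU.
    import cudf as xdf
    ON_GPU = True
except Exception:
    # No GPU or no RAPIDS install: fall back to Pandas (assumption for portability).
    xdf = pd
    ON_GPU = False

# Hypothetical sales data; the column names are illustrative only.
df = xdf.DataFrame({
    "store": ["a", "b", "a", "b", "a"],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# The same call works on either backend: mean sales per store.
mean_sales = df.groupby("store")["sales"].mean()

# Convert back to a Pandas Series so downstream code sees one type.
result = mean_sales.to_pandas() if ON_GPU else mean_sales
result = result.sort_index()
print(result)
```

Because the DataFrame code is identical on both backends, the only GPU-specific lines are the import and the final conversion back to Pandas.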

5 Data Analyst Projects to Land a Job in 2024 – KDnuggets

I got my first data analytics internship back in 2020.

Since then, I've transitioned into a senior-level full-time role, landed multiple freelance data analytics gigs, and consulted for companies in different parts of the world.

During this time, I have reviewed resumes for data analyst positions and even shortlisted candidates for jobs.

And I noticed one thing that separated the most prominent applicants from everyone else.

Projects.

Even if you have zero experience in the data industry and no technical background, you can stand out from everyone else and get hired solely based on the projects you display on your resume.

In this article, I'm going to show you how to create projects that help you stand out from the competition and land your first data analyst job.

If you're reading this article, you probably already know that it is important to display projects on your resume.

You might even have built a few projects of your own after taking an online course or boot camp.

However, many data analytics projects do more harm to your portfolio than good. These projects can actually lower your chances of getting a job and must be avoided at all costs.

For example, if you've taken the popular Google Data Analytics Certificate on Coursera, you've probably done the capstone project that comes with this certification.

However, over 2 million other people have enrolled in the same course and have potentially completed the same capstone project.

Chances are, recruiters have seen these projects on the resumes of hundreds of applicants and will not be impressed by them.

A similar logic applies to any other project that has been created many times.

Creating a project using the Titanic, Iris, or Boston Housing dataset on Kaggle can be a valuable learning experience, but should not be displayed on your portfolio.

If you want a competitive edge over other people, you need to stand out.

Here's how.

A project that stands out must be unique.

Pick a project that:

Much of the advice on data analytics projects on the Internet is inaccurate and unhelpful.

You will be told to create generic projects like an analysis of the Titanic dataset, projects that add no real value to your resume.

Unfortunately, the people telling you to do these things aren't even working in the data industry, so you must be discerning when taking this advice.

In this article, I will be showing you examples of real people who have landed jobs in data analytics because of their portfolio projects.

You will learn about the types of projects that actually get people hired in this field so that you can potentially build something similar.

The first project is a dashboard displaying job trends in the data industry.

I found this project in a video created by Luke Barousse, a former lead data analyst who also specializes in content creation.

Here is a screenshot of this dashboard:

The above dashboard is called SkillQuery, and it displays the top technologies and skills that employers are looking for in the data industry.

For instance, we can tell by looking at the dashboard that the top language that employers are looking for in data scientists is Python, followed by SQL and R.

This project is so valuable because it solves an actual problem.

Every job-seeker wants to know the top skills that employers are looking for in their field so they can prepare accordingly.

SkillQuery helps you do exactly this, in the form of an interactive dashboard that you can play around with.

The creator of this project has displayed crucial data analytics skills such as Python, web scraping, and data visualization.

You can find a link to this project's GitHub repository here.

This project was created to predict whether a person will be approved for a credit card or not.

I found it in the same video created by Luke Barousse, and the creator of this project ended up getting a full-time role as a data analyst.

The credit card approval model was deployed as a Streamlit application:

You simply need to answer the questions displayed on this dashboard, and the app will tell you whether or not you have been approved for a credit card.

Again, this is a creative project that solves a real-world problem with a user-friendly dashboard, which is why it stood out to employers.

The skills displayed in this project include Python, data visualization, and cloud storage.

This project, which I created a few years ago, involves conducting sentiment analysis on content from YouTube and Twitter.

I've always enjoyed watching YouTube videos and was particularly fascinated by channels that created makeup tutorials on the platform.

At that time, a huge scandal surfaced on YouTube involving two of my favorite beauty influencers: James Charles and Tati Westbrook.

I decided to analyze this scandal by scraping data on YouTube and Twitter.

I built a sentiment analysis model to gauge public sentiment of the feud and even created visualizations to understand what people were saying about these influencers.

Although this project had no direct business application, it was interesting since I analyzed a topic I was passionate about.

I also wrote a blog post outlining my findings, which you can find here.

The skills demonstrated in this project include web scraping, API usage, Python, data visualization, and machine learning.

This is another project that was created by me.

In this project, I built a K-Means clustering model with Python using a dataset on Kaggle.

I used variables such as gender, age, and income to create various segments of mall customers:

Since the dataset used for this project is popular, I tried to differentiate my analysis from the rest.

After developing the segmentation model, I went a step further by creating consumer profiles for each segment and devising targeted marketing strategies.

Because of these additional steps I took, my project was tailored to the domain of marketing and customer analytics, increasing my chances of getting hired in the field.

I have also created a tutorial on this project, providing a step-by-step guide for building your own customer segmentation model in Python.

The skills demonstrated in this project include Python, unsupervised machine learning, and data analysis.
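As a minimal sketch of the K-Means idea behind this kind of segmentation, the snippet below clusters a handful of made-up customers by age and income using a plain-Python version of Lloyd's algorithm. The data, the `kmeans` helper, and the choice of two clusters are assumptions for illustration, not my original code or the Kaggle dataset; a real analysis would typically scale the features first and use a library implementation.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
    return labels, centroids


# Hypothetical customers as (age, annual income in thousands of dollars).
customers = [(22, 25), (25, 30), (27, 28), (48, 90), (52, 95), (55, 88)]

labels, centroids = kmeans(customers, k=2)
for customer, label in zip(customers, labels):
    print(customer, "-> segment", label)
```

Each resulting segment can then be summarized (for example, by its average age and income) to build the kind of consumer profile described above.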

The final project on this list is a dashboard displaying insights on Udemy courses:

I found this project in a Medium article written by Zach Quinn, who currently is a senior data engineer at Forbes.

Back when he was just starting out, Zach says that this dashboard landed him a data analyst job offer from a reputable company.

And it's easy to see why.

Zach went beyond simply using SQL and Python to process and analyze data.

He has incorporated data communication best practices into this dashboard, making it engaging and visually appealing.

Just by looking at the dashboard, you can gain key insights about Udemy's courses, its students' interests, and its competitors.

The dashboard also demonstrates metrics that are vital to businesses, such as customer engagement and market trends.

Among all the projects listed in this article, I like this one the most since it goes beyond technical skills and displays the analyst's adeptness in data storytelling and presentation.

Here is a link to Zach's article where he provides the code and steps taken to create this project.

I hope that the projects described in this article have inspired you to create one of your own.

If you don't have any project ideas or face obstacles when developing your own, I recommend utilizing generative AI models for assistance.

ChatGPT, for example, can provide a wealth of project ideas and even generate fake datasets, allowing you to hone your analytical skills.

Engaging with ChatGPT for data analysis will allow you to learn new technologies faster and become more efficient, helping you stand out from the competition.

If you'd like to learn more about using ChatGPT and generative AI for data analysis, you can watch my video tutorial on the topic.

Natassha Selvaraj is a self-taught data scientist with a passion for writing. Natassha writes on everything data science-related, a true master of all data topics. You can connect with her on LinkedIn or check out her YouTube channel.

Tips for Getting the Generation Part Right in Retrieval Augmented Generation – Towards Data Science

Results from experiments to evaluate and compare GPT-4, Claude 2.1, and Claude 3.0 Opus

My thanks to Evan Jolley for his contributions to this piece

New evaluations of RAG systems are published seemingly every day, and many of them focus on the retrieval stage of the framework. However, the generation aspect, how a model synthesizes and articulates this retrieved information, may hold equal if not greater significance in practice. Many use cases in production are not simply returning a fact from the context, but also require synthesizing the fact into a more complicated response.

We ran several experiments to evaluate and compare the generation capabilities of GPT-4, Claude 2.1 and Claude 3 Opus. This article details our research methodology, results, and model nuances encountered along the way, as well as why this matters to people building with generative AI.

Everything needed to reproduce the results can be found in this GitHub repository.

While retrieval is responsible for identifying and retrieving the most pertinent information, it is the generation phase that takes this raw data and transforms it into a coherent, meaningful, and contextually appropriate response. The generative step is tasked with synthesizing the retrieved information, filling in gaps, and presenting it in a manner that is easily understandable and relevant to the user's query.

In many real-world applications, the value of RAG systems lies not just in their ability to locate a specific fact or piece of information but also in their capacity to integrate and contextualize that information within a broader framework. The generation phase is what enables RAG systems to move beyond simple fact retrieval and deliver truly intelligent and adaptive responses.

The initial test we ran involved generating a date string from two randomly retrieved numbers: one representing the month and the other the day. The models were tasked with:

For example, random numbers 4827143 and 17 would represent April 17th.
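The list of steps did not survive in this excerpt, so the sketch below is only one mapping that is consistent with the example above: it assumes the month comes from the last digit of the first number plus one, and that the second number is used directly as the day. Both the rule and the helper names are assumptions for illustration, not the exact procedure from the experiment.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def ordinal(day: int) -> str:
    """Return the day with an English ordinal suffix, e.g. 17 -> '17th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def date_string(first_number: int, second_number: int) -> str:
    # ASSUMPTION: month = last digit of the first number, plus one (1 to 10).
    month = first_number % 10 + 1
    # ASSUMPTION: the second number is already a valid day of the month.
    return f"{MONTHS[month - 1]} {ordinal(second_number)}"


print(date_string(4827143, 17))  # prints "April 17th"
```

A reference function like this gives a ground-truth string to compare against whatever the model generates.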

These numbers were placed at varying depths within contexts of varying length. The models initially had quite a difficult time with this task.

While neither model performed well, Claude 2.1 significantly outperformed GPT-4 in our initial test, almost quadrupling its success rate. It was here that Claude's verbose nature (providing detailed, explanatory responses) seemed to give it a distinct advantage, resulting in more accurate outcomes compared to GPT-4's initially concise replies.

Prompted by these unexpected results, we introduced a new variable to the experiment. We instructed GPT-4 to "explain yourself then answer the question," a prompt that encouraged a more verbose response akin to Claude's natural output. The impact of this minor adjustment was profound.
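To make the change concrete, here is a small sketch of how such an instruction might be added to a prompt. Only the "explain yourself then answer the question" wording comes from the experiment; the `build_prompt` helper and the surrounding template are hypothetical.

```python
def build_prompt(context: str, question: str, explain_first: bool = True) -> str:
    """Assemble a prompt, optionally asking the model to reason before answering."""
    instruction = (
        "Explain yourself then answer the question."
        if explain_first
        else "Answer the question."
    )
    # Hypothetical layout: instruction first, then the retrieved context, then the query.
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion: {question}"


prompt = build_prompt(
    context="...retrieved documents go here...",
    question="What date do the two numbers represent?",
)
print(prompt)
```

Keeping the instruction behind a flag makes it easy to run the same evaluation with and without the extra line and compare accuracy.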

GPT-4's performance improved dramatically, achieving flawless results in subsequent tests. Claude's results also improved to a lesser extent.

This experiment not only highlights the differences in how language models approach generation tasks but also showcases the potential impact of prompt engineering on their performance. The verbosity that appeared to be Claude's advantage turned out to be a replicable strategy for GPT-4, suggesting that the way a model processes and presents its reasoning can significantly influence its accuracy in generation tasks. Overall, adding the seemingly minute "explain yourself" line to our prompt played a role in improving the models' performance across all of our experiments.

We conducted four more tests to assess prevailing models' ability to synthesize and transform retrieved information into various formats:

Unsurprisingly, each model exhibited strong performance in string concatenation, reaffirming previous understanding that text manipulation is a fundamental strength of language models.

As for the money formatting test, Claude 3 and GPT-4 performed almost flawlessly. Claude 2.1's performance was generally poorer overall. Accuracy did not vary considerably across token length, but was generally lower when the needle was closer to the beginning of the context window.
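The exact test definitions are not reproduced in this excerpt, so the sketch below only illustrates what a string-concatenation target and a money-formatting target might look like, plus a simple check of a model response against them. The function names, the dollar format, and the scoring rule are all assumptions.

```python
def concatenate(first: str, second: str) -> str:
    """Expected answer for a concatenation task: the two retrieved strings joined."""
    return first + second


def format_money(amount: float) -> str:
    """Expected answer for a money task: US-style dollars with two decimals."""
    return f"${amount:,.2f}"


def contains_expected(response: str, expected: str) -> bool:
    """Assumed scoring rule: a verbose response passes if it contains the target string."""
    return expected in response


expected = format_money(1234567.5)
print(expected)  # prints "$1,234,567.50"
print(contains_expected("The total comes to $1,234,567.50.", expected))  # prints "True"
```

A containment check like this tolerates verbose answers, which matters when comparing models that differ in how much they explain.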

Despite stellar results in the generation tests, Claude 3s accuracy declined in a retrieval-only experiment. Theoretically, simply retrieving numbers should be an easier task than manipulating them as well making this decrease in performance surprising and an area where were planning further testing to examine. If anything, this counterintuitive dip only further confirms the notion that both retrieval and generation should be tested when developing with RAG.

By testing various generation tasks, we observed that while both models excel in menial tasks like string manipulation, their strengths and weaknesses become apparent in more complex scenarios. LLMs are still not great at math! Another key result was that the introduction of the explain yourself prompt notably enhanced GPT-4s performance, underscoring the importance of how models are prompted and how they articulate their reasoning in achieving accurate results.

These findings have broader implications for the evaluation of LLMs. When comparing models like the verbose Claude and the initially less verbose GPT-4, it becomes evident that the evaluation criteria must extend beyond mere correctness. The verbosity of a models responses introduces a variable that can significantly influence their perceived performance. This nuance may suggest that future model evaluations should consider the average length of responses as a noted factor, providing a better understanding of a models capabilities and ensuring a fairer comparison.

Link:

Tips for Getting the Generation Part Right in Retrieval Augmented Generation - Towards Data Science

Exploring Real and Virtual Spaces with Data | by TDS Editors | Apr, 2024 – Towards Data Science

There's always new and exciting terrain to discover in the field of geospatial data: from practical applications that help us better understand physical topography and social infrastructure, to theoretical approaches that allow us to navigate abstract spaces.

It's been a while since we've covered this topic in the Variable, so this week we're delighted to share a selection of recent articles that offer fascinating glimpses into work across the wide range of use cases that geospatial data encompasses. From beginner-friendly tutorials to more advanced theoretical questions, we're certain you'll find a lot here to pique your interest regardless of your background and level of experience.

Read the rest here:

Exploring Real and Virtual Spaces with Data | by TDS Editors | Apr, 2024 - Towards Data Science

A Proof of the Central Limit Theorem | by Sachin Date | Apr, 2024 – Towards Data Science

Let's return to our parade of topics. An infinite series forms the basis for generating functions, which is the topic I will cover next.

The trick to understanding generating functions is to appreciate the usefulness of a label maker.

Imagine that your job is to label all the shelves of newly constructed libraries, warehouses, storerooms, pretty much anything that requires an extensive application of labels. Anytime they build a new warehouse in Boogersville or revamp a library in Belchertown (I am not entirely making these names up), you get a call to label its shelves.

So imagine then that you just got a call to label a shiny new warehouse. The aisles in the warehouse go from 1 through 26 (labeled A through Z), and each aisle runs 50 spots deep and 5 shelves tall.

You could just print out 6500 labels like so:

A.1.1, A.1.2, …, A.1.5, A.2.1, …, A.2.5, …, A.50.1, …, A.50.5, B.1.1, …, B.2.1, …, B.50.5, … and so on until Z.50.5.

And you could present yourself along with your suitcase stuffed with 6500 fluorescent dye-coated labels at your local airport for a flight to Boogersville. It might take you a while to get through airport security.

Or here's an idea. Why not program the sequence into your label maker? Just carry the label maker with you. At Boogersville, load the machine with a roll of tape, and off you go to the warehouse. At the warehouse, you press a button on the machine, and out flows the entire sequence for aisle A.

Your label maker is the generating function for this, and other sequences like this one:

A.1.1, A.1.2, …, A.1.5, A.2.1, …, A.2.5, …, A.50.1, …, A.50.5
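To make the analogy concrete, here is a minimal sketch of such a label maker in Python. The function name `label_maker` and its parameters are my own illustrative choices; the point is only that a small rule produces all the labels on demand, so nobody has to carry the full list around.

```python
from itertools import product
from string import ascii_uppercase


def label_maker(aisles=ascii_uppercase, spots=50, shelves=5):
    """Yield labels such as 'A.1.1' one at a time instead of storing them all."""
    # product() varies the last index fastest: A.1.1, A.1.2, ..., A.1.5, A.2.1, ...
    for aisle, spot, shelf in product(aisles, range(1, spots + 1), range(1, shelves + 1)):
        yield f"{aisle}.{spot}.{shelf}"


# Materialize the sequence only to inspect it (26 aisles x 50 spots x 5 shelves).
labels = list(label_maker())
aisle_a = list(label_maker(aisles="A"))
print(len(labels), labels[0], labels[-1])
```

Pressing the button for aisle A alone corresponds to `label_maker(aisles="A")`.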

In math, a generating function is a mathematical function that you design for generating sequences of your choosing so that you don't have to remember the entire sequence.

If your proof uses a sequence of some kind, it's often easier to substitute the sequence with its generating function. That instantly saves you the trouble of lugging around the entire sequence across your proof. Any operations, like differentiation, that you planned to perform on the sequence, you can instead perform on its generating function.

But wait, there's more. All of the above advantages are magnified whenever the generating function has a closed form, like the formula for e to the power x that we saw earlier.

A really simple generating function is the one shown in the figure below for the following infinite sequence: 1, 1, 1, 1, 1, …:

As you can see, a generating function is actually a series.
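Written out in LaTeX, the standard generating function for the all-ones sequence is the geometric series. The name G(x) is my own notational assumption, and the closed form on the right holds only for |x| < 1:

```latex
G(x) = \sum_{k=0}^{\infty} 1 \cdot x^{k} = 1 + x + x^{2} + x^{3} + \cdots = \frac{1}{1 - x}, \qquad |x| < 1
```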

A slightly more complex generating function, and a famous one, is the one that generates a sequence of (n+1) binomial coefficients:

Each coefficient nCk gives you the number of different ways of choosing k out of n objects. The generating function for this sequence is the binomial expansion of (1 + x) to the power n:

In both examples, it's the coefficients of the x terms that constitute the sequence. The x terms raised to different powers are there primarily to keep the coefficients apart from each other. Without the x terms, the summation will just fuse all the coefficients into a single number.
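A small sketch can show the coefficients being "kept apart". The code below expands (1 + x)^n by repeated polynomial multiplication and reads off the coefficients; the helper names `poly_mul` and `expand_one_plus_x` are illustrative assumptions, not anything from a library.

```python
from math import comb


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists [c0, c1, c2, ...]."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def expand_one_plus_x(n):
    """Return the coefficients of (1 + x)^n, lowest power first."""
    coeffs = [1]
    for _ in range(n):
        coeffs = poly_mul(coeffs, [1, 1])  # multiply by (1 + x)
    return coeffs


coeffs = expand_one_plus_x(5)
print(coeffs)       # [1, 5, 10, 10, 5, 1]
print(sum(coeffs))  # 32, what you get when x = 1 fuses the terms together
```

Setting x = 1 is exactly the "fusing" described above: the individual coefficients collapse into the single number 2^n.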

The two examples of generating functions I showed you illustrate applications of the modestly named Ordinary Generating Function. The OGF has the following general form:

Another greatly useful form is the Exponential Generating Function (EGF):

It's called exponential because the factorial term in the denominator grows extremely fast (faster than exponentially, in fact), causing the values of the successive terms to diminish rapidly.
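For reference, these are the standard textbook forms of the two generating functions in LaTeX. The symbols G and a_k are notational assumptions on my part:

```latex
\text{OGF:}\quad G(x) = \sum_{k=0}^{\infty} a_k \, x^{k}
\qquad\qquad
\text{EGF:}\quad G(x) = \sum_{k=0}^{\infty} a_k \, \frac{x^{k}}{k!}
```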

The EGF has a remarkably useful property: its k-th derivative, when evaluated at x = 0, isolates the k-th element of the sequence, a_k. See below for how the 3rd derivative of the above-mentioned EGF, when evaluated at x = 0, gives you the coefficient a_3. All other terms disappear into nothingness:
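The property is easy to check on a truncated EGF. The sketch below, with an arbitrary made-up sequence, stores the polynomial as a list of exact fractions a_k / k!, differentiates it k times, and reads off the constant term. All function names are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial


def egf_coefficients(seq):
    """Polynomial coefficients a_k / k! of a (truncated) EGF."""
    return [Fraction(a, factorial(k)) for k, a in enumerate(seq)]


def differentiate(poly):
    """Differentiate a polynomial given as [c0, c1, c2, ...]."""
    return [k * c for k, c in enumerate(poly)][1:]


def kth_derivative_at_zero(poly, k):
    """Differentiate k times, then evaluate at x = 0 (the constant term)."""
    for _ in range(k):
        poly = differentiate(poly)
    return poly[0] if poly else Fraction(0)


seq = [7, 3, 9, 4, 8, 2]  # arbitrary illustrative sequence a_0 .. a_5
egf = egf_coefficients(seq)
recovered = [kth_derivative_at_zero(egf, k) for k in range(len(seq))]
print(recovered == seq)  # True
```

Each differentiation pulls one factor of k down in front of the term, so after k rounds the k! in the denominator has been cancelled and every lower-order term is gone; setting x = 0 then removes the higher-order ones.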

Our next topic, the Taylor series, makes use of the EGF.

The Taylor series is a way to approximate a function using an infinite series. The Taylor series for the function f(x) goes like this:

In evaluating the first two terms, we use the fact that 0! = 1! = 1.

f^(0)(a), f'(a), f''(a), etc. are the 0-th, 1st, 2nd, etc. derivatives of f(x) evaluated at x = a. f^(0)(a) is simply f(a). The value a can be anything as long as the function is infinitely differentiable at x = a, that is, its k-th derivative exists at x = a for all k from 1 through infinity.

In spite of its startling originality, the Taylor series doesn't always work well. It creates poor quality approximations for functions such as 1/x or 1/(1-x) which march off to infinity at certain points in their domain such as at x = 0, and x = 1 respectively. These are functions with singularities in them. The Taylor series also has a hard time keeping up with functions that fluctuate rapidly. And then there are functions whose Taylor-series-based expansions will converge at a pace that will make continental drift seem recklessly fast.
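Here is a rough numerical illustration of both behaviors, assuming expansions around a = 0 and an order-10 Taylor polynomial (both are my choices for the example). The series for e^x settles down quickly, while the series for 1/(1 - x) struggles badly as x approaches the singularity at x = 1.

```python
import math


def taylor_exp(x, r):
    """Taylor polynomial of order r for e^x around a = 0."""
    return sum(x ** k / math.factorial(k) for k in range(r + 1))


def taylor_geometric(x, r):
    """Taylor polynomial of order r for 1/(1 - x) around a = 0."""
    return sum(x ** k for k in range(r + 1))


# Absolute errors of the order-10 approximations.
exp_error = abs(taylor_exp(1.0, 10) - math.e)
near_error = abs(taylor_geometric(0.5, 10) - 1 / (1 - 0.5))
far_error = abs(taylor_geometric(0.99, 10) - 1 / (1 - 0.99))

print(exp_error, near_error, far_error)
```

The first two errors are tiny, while the third is most of the true value of 100: close to the singularity, eleven terms are nowhere near enough.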

But let's not be too withering about the Taylor series' imperfections. What is really astonishing about it is that such an approximation works at all!

The Taylor series happens to be one of the most studied, and most used, mathematical artifacts.

On some occasions, the upcoming proof of the CLT being one such occasion, you'll find it useful to split the Taylor series in two parts as follows:

Here, I've split the series around the index r. Let's call the two pieces T_r(x) and R_r(x). We can express f(x) in terms of the two pieces as follows:

T_r(x) is known as the Taylor polynomial of order r evaluated at x=a.

R_r(x) is the remainder or residual from approximating f(x) using the Taylor polynomial of order r evaluated at x=a.

By the way, did you notice a glint of similarity between the structure of the above equation, and the general form of a linear regression model consisting of the observed value y, the modeled value β_cap·X, and the residual e?

But let's not dim our focus.

Returning to the topic at hand, Taylor's theorem, which we'll use to prove the Central Limit Theorem, is what gives the Taylor series its legitimacy. Taylor's theorem says that as x → a, the remainder term R_r(x) converges to 0 faster than the polynomial (x − a) raised to the power r. Shaped into an equation, the statement of Taylor's theorem looks like this:
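In LaTeX, using the symbols T_r(x) and R_r(x) defined above, the split of the series and the limit statement can be written as follows. This is the standard (Peano) form of the theorem, given here as a sketch rather than a copy of the original equation:

```latex
f(x) = \underbrace{\sum_{k=0}^{r} \frac{f^{(k)}(a)}{k!} (x - a)^{k}}_{T_r(x)}
     + \underbrace{\sum_{k=r+1}^{\infty} \frac{f^{(k)}(a)}{k!} (x - a)^{k}}_{R_r(x)},
\qquad
\lim_{x \to a} \frac{R_r(x)}{(x - a)^{r}} = 0
```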

One of the great many uses of the Taylor series lies in creating a generating function for the moments of a random variable, which is what we'll do next.

The k-th moment of a random variable X is the expected value of X raised to the k-th power.

This is known as the k-th raw moment.

The k-th moment of X around some value c is known as the k-th central moment of X. It's simply the k-th raw moment of (X − c):

The k-th standardized moment of X is the k-th central moment of X divided by the k-th power of the standard deviation of X:
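As a quick sanity check of the three definitions, here is a sketch that computes them for a small made-up data set, treating the eight values as an equally weighted population (so expectations become plain averages). The data and the function names are assumptions for illustration; following the usual convention, the central moment defaults to c = the mean.

```python
def raw_moment(xs, k):
    """k-th raw moment: the average of x^k."""
    return sum(x ** k for x in xs) / len(xs)


def central_moment(xs, k, c=None):
    """k-th moment around c; c defaults to the mean."""
    if c is None:
        c = raw_moment(xs, 1)
    return sum((x - c) ** k for x in xs) / len(xs)


def standardized_moment(xs, k):
    """k-th central moment divided by the k-th power of the standard deviation."""
    sigma = central_moment(xs, 2) ** 0.5
    return central_moment(xs, k) / sigma ** k


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only

mean = raw_moment(data, 1)              # 5.0
variance = central_moment(data, 2)      # 4.0
skewness = standardized_moment(data, 3)
print(mean, variance, skewness)
```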

The first 5 moments of X have specific values or meanings attached to them as follows:

After the 4th moment, the interpretations become assuredly murky.

With so many moments flying around, wouldn't it be terrific to have a generating function for them? That's what the Moment Generating Function (MGF) is for. The Taylor series makes it super-easy to create the MGF. Let's see how to create it.

We'll define a new random variable tX where t is a real number. Here's the Taylor series expansion of e to the power tX around t = 0:

Let's apply the Expectation operator on both sides of the above equation:

By the linearity (and scaling) rule of expectation, E(aX + bY) = aE(X) + bE(Y), we can move the Expectation operator inside the summation as follows:

Recall that E[X^k] are the raw moments of X for k = 0, 1, 2, 3, …

Let's compare Eq. (2) with the general form of an Exponential Generating Function:

What do we observe? We see that E[X^k] in Eq. (2) are the coefficients a_k in the EGF. Thus Eq. (2) is the generating function for the moments of X, and so the formula for the Moment Generating Function of X is the following:
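In LaTeX, the standard definition of the MGF, together with the series obtained by taking expectations term by term, reads as follows (the notation M_X(t) matches the one used in the rest of this section):

```latex
M_X(t) = E\left[e^{tX}\right] = \sum_{k=0}^{\infty} E\left[X^{k}\right] \frac{t^{k}}{k!}
```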

The MGF has many interesting properties. We'll use a few of them in our proof of the Central Limit Theorem.

Remember how the k-th derivative of the EGF when evaluated at x = 0 gives us the k-th coefficient of the underlying sequence? We'll use this property of the EGF to pull out the moments of X from its MGF.

The zeroth derivative of the MGF of X evaluated at t = 0 is obtained by simply substituting t = 0 in Eq. (3). M_X(t=0) evaluates to 1. The first, second, third, etc. derivatives of the MGF of X evaluated at t = 0 are denoted by M_X'(t=0), M_X''(t=0), M_X'''(t=0), etc. They evaluate respectively to the first, second, third, etc. raw moments of X as shown below:

This gives us our first interesting and useful property of the MGF. The k-th derivative of the MGF evaluated at t = 0 is the k-th raw moment of X.

The second property of MGFs which we'll find useful in our upcoming proof is the following: if two random variables X and Y have identical Moment Generating Functions, then X and Y have identical Cumulative Distribution Functions:

If X and Y have identical MGFs, it implies that their mean, variance, skewness, kurtosis, and all higher order moments (whatever humanly unfathomable aspects of reality those moments might represent) are all one-to-one identical. If every single property exhibited by the shapes of the CDFs of X and Y is correspondingly the same, you'd expect their CDFs to also be identical.

The third property of MGFs we'll use is the following one, which applies when X is scaled by a and translated by b:

The fourth property of MGFs that we'll use applies to the MGF of the sum of n independent, identically distributed random variables:
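For reference, the third and fourth properties in their standard textbook form are shown below in LaTeX. The second identity uses independence for the product and identical distribution for the final power; the original equation labels are not reproduced here:

```latex
M_{aX + b}(t) = e^{bt} \, M_X(at)
\qquad\qquad
M_{X_1 + X_2 + \cdots + X_n}(t) = \prod_{i=1}^{n} M_{X_i}(t) = \left[ M_X(t) \right]^{n}
```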

A final result, before we prove the CLT, is the MGF of a standard normal random variable N(0, 1) which is the following (you may want to compute this as an exercise):

Speaking of the standard normal random variable, as shown in Eq. (4), the first, second, third, and fourth derivatives of the MGF of N(0, 1) when evaluated at t = 0 will give you the first moment (mean) as 0, the second moment (variance) as 1, the third moment (skew) as 0, and the fourth moment (kurtosis) as 3.
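These moments can be read straight off the power series of the standard normal MGF, e^(t²/2) = Σ t^(2m) / (2^m · m!), because the k-th derivative at t = 0 equals k! times the coefficient of t^k. The sketch below does exactly that with exact fractions; the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial


def normal_mgf_coefficient(k):
    """Coefficient of t^k in e^(t^2/2) = sum over m of t^(2m) / (2^m * m!)."""
    if k % 2:  # odd powers of t never appear in the series
        return Fraction(0)
    m = k // 2
    return Fraction(1, 2 ** m * factorial(m))


def normal_raw_moment(k):
    """k-th derivative of the MGF at t = 0, i.e. k! times the t^k coefficient."""
    return factorial(k) * normal_mgf_coefficient(k)


moments = [normal_raw_moment(k) for k in range(5)]
print(moments)  # the 0th through 4th raw moments of N(0, 1)
```

The list comes out as 1, 0, 1, 0, 3: total probability 1, mean 0, variance 1, zero skew, and a fourth moment of 3.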

And with that, the machinery we need to prove the CLT is in place.

Let X_1, X_2, …, X_n be n i.i.d. random variables that form a random sample of size n. Assume that we've drawn this sample from a population that has a mean μ and variance σ².

Let X_bar_n be the sample mean:

Let Z_bar_n be the standardized sample mean:

The Central Limit Theorem states that as n tends to infinity, Z_bar_n converges in distribution to N(0, 1), i.e. the CDF of Z_bar_n becomes identical to the CDF of N(0, 1), which is often represented by the Greek letter Φ (phi):
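Before proving the statement, it can help to watch it happen. The following is a small simulation sketch, not part of the proof: it draws samples from an exponential population with μ = 1 and σ = 1 (my choice of a clearly non-normal population), standardizes each sample mean, and compares the empirical CDF at z = 1 with Φ(1). The sample size, trial count, and seed are arbitrary assumptions.

```python
import math
import random


def standardized_sample_mean(rng, n, mu=1.0, sigma=1.0):
    """Draw n Exponential(1) values and return (x_bar - mu) / (sigma / sqrt(n))."""
    xs = [rng.expovariate(1.0) for _ in range(n)]
    x_bar = sum(xs) / n
    return (x_bar - mu) / (sigma / math.sqrt(n))


def standard_normal_cdf(z):
    """Phi(z), the CDF of N(0, 1), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


rng = random.Random(0)  # fixed seed so the run is repeatable
z_values = [standardized_sample_mean(rng, n=200) for _ in range(2000)]

empirical = sum(z <= 1.0 for z in z_values) / len(z_values)
theoretical = standard_normal_cdf(1.0)
print(empirical, theoretical)
```

The two printed numbers should land close to each other even though the underlying population is heavily skewed.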

To prove this statement, we'll use the property of the MGF (see Eq. 5) that if the MGFs of X and Y are identical, then so are their CDFs. Here, it'll be sufficient to show that as n tends to infinity, the MGF of Z_bar_n converges to the MGF of N(0, 1), which as we know (see Eq. 8) is e to the power t²/2. In short, we'd want to prove the following identity:

Let's define a random variable Z_k as follows:

We'll now express the standardized mean Z_bar_n in terms of Z_k as shown below:
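In LaTeX, the standard definitions behind these steps can be sketched as follows. The symbols μ and σ denote the population mean and standard deviation, and the last equality is the expression of the standardized mean in terms of Z_k:

```latex
\bar{X}_n = \frac{1}{n} \sum_{k=1}^{n} X_k,
\qquad
Z_k = \frac{X_k - \mu}{\sigma},
\qquad
\bar{Z}_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}
          = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} Z_k
          = \sum_{k=1}^{n} \frac{Z_k}{\sqrt{n}}
```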

Next, we apply the MGF operator on both sides of Eq. (9):

By construction, Z_1/√n, Z_2/√n, …, Z_n/√n are independent random variables. So we can use property (7a) of MGFs which expresses the MGF of the sum of n independent random variables:

By their definition, Z_1/√n, Z_2/√n, …, Z_n/√n are also identically distributed random variables. So we award ourselves the liberty to assume the following:

Z_1/√n = Z_2/√n = … = Z_n/√n = Z/√n.

Therefore, using property (7b), we get:

Finally, we'll also use property (6) to express the MGF of a random variable (in this case, Z) that is scaled by a constant (in this case, 1/√n) as follows:

With that, we have converted our original goal of finding the MGF of Z_bar_n into the goal of finding the MGF of Z/√n.

M_Z(t/√n) is a function like any other function that takes (t/√n) as a parameter. So we can create a Taylor series expansion of M_Z(t/√n) at t = 0 as follows:

Next, we split this expansion into two parts. The first part is a finite series of three terms corresponding to k = 0, k = 1, and k = 2. The second part is the remainder of the infinite series:

In the above series, M, M', M'', etc. are the 0-th, 1st, 2nd, and so on derivatives of the Moment Generating Function M_Z(t/√n) evaluated at (t/√n) = 0. We've seen that these derivatives of the MGF happen to be the 0-th, 1st, 2nd, etc. moments of Z.

The 0-th moment, M(0), is always 1. Recall that Z is, by its construction, a standardized random variable with mean 0 and variance 1. Hence, its first moment (mean), M'(0), is 0, and its second moment (variance), M''(0), is 1. With these values in hand, we can express the above Taylor series expansion as follows:

Another way to express the above expansion of M_Z is as the sum of a Taylor polynomial of order 2 which captures the first three terms of the expansion, and a residue term that captures the summation:

We've already evaluated the order-2 Taylor polynomial. So our task of finding the MGF of Z is now further reduced to calculating the remainder term R_2.

Before we tackle the task of computing R_2, let's step back and review what we want to prove. We wish to prove that as the sample size n tends to infinity, the standardized sample mean Z_bar_n converges in distribution to the standard normal random variable N(0, 1):

To prove this we realized that it was sufficient to prove that the MGF of Z_bar_n will converge to the MGF of N(0, 1) as n tends to infinity.

And that led us on a quest to find the MGF of Z_bar_n shown first in Eq. (10), and which I am reproducing below for reference:

But it is really the limit of this MGF as n tends to infinity that we not only wish to calculate, but also show to be equal to e to the power t²/2.

To make it to that goal, we'll unpack and simplify the contents of Eq. (10) by sequentially applying result (12) followed by result (11) as follows:

Here we come to an uncomfortable place in our proof. Look at the equation on the last line in the above panel. You cannot just force the limit on the R.H.S. into the large bracket and zero out the yellow term. The trouble with making such a misinformed move is that there is an n looming large in the exponent of the large bracket, the very n that wants to march away to infinity. But now get this: I said you cannot force the limit into the large bracket. I never said you cannot sneak it in.

So we shall make a sly move. We'll show that the remainder term R_2 colored in yellow independently converges to zero as n tends to infinity no matter what its exponent is. If we succeed in that endeavor, common-sense reasoning suggests that it will be legal to extinguish it out of the R.H.S., exponent or no exponent.

To show this, we'll use Taylor's theorem which I introduced in Eq. (1), and which I am reproducing below for your reference:

We'll bring this theorem to bear upon our pursuit by setting x to (t/√n), and r to 2 as follows:

Next, we set a = 0, which instantly allows us to switch the limit:

(t/√n) → 0, to,

n → ∞, as follows:

Now we make an important and not entirely obvious observation. In the above limit, notice how the L.H.S. will tend to zero as long as n tends to infinity, independent of what value t has, as long as it's finite. In other words, the L.H.S. will tend to zero for any finite value of t since the limiting behavior is driven entirely by the (√n)² in the denominator. With this revelation comes the luxury to drop t² from the denominator without changing the limiting behavior of the L.H.S. And while we're at it, let's also swing over the (√n)² to the numerator as follows:

Let this result hang in your mind for a few seconds, for you'll need it shortly. Meanwhile, let's return to the limit of the MGF of Z_bar_n as n tends to infinity. We'll make some more progress on simplifying the R.H.S. of this limit, and then sculpting it into a certain shape:

It may not look like it, but with Eq. (14), we are literally two steps away from proving the Central Limit Theorem.

All thanks to Jacob Bernoulli's blast-from-the-past discovery of the product-series based formula for e.
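Bernoulli's formula says that (1 + x/n)^n tends to e^x as n tends to infinity. A quick numerical sketch, assuming x = t²/2 with an arbitrary t = 1.5 (my choice for illustration), shows the expression closing in on e to the power t²/2, which is the shape the limit of the MGF needs to take:

```python
import math


def bernoulli_power(x, n):
    """The finite-n expression (1 + x/n)^n from Bernoulli's formula for e^x."""
    return (1.0 + x / n) ** n


t = 1.5                       # arbitrary finite value of t
target = math.exp(t ** 2 / 2)  # MGF of N(0, 1) evaluated at t

# Absolute gap between the finite-n expression and the limit, for growing n.
errors = [abs(bernoulli_power(t ** 2 / 2, n) - target) for n in (10, 1000, 100000)]
print(target, errors)
```

The gap shrinks steadily as n grows, which is all the final step of the proof asks of this expression.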

So this will be the point to fetch a few balloons, confetti, party horns or whatever.

Ready?

Here we go:

View post:

A Proof of the Central Limit Theorem | by Sachin Date | Apr, 2024 - Towards Data Science