Category Archives: Data Science

Computational Thinking is What You Need | by Louis Chan | Mar, 2024 – Towards Data Science


Since early last year, when we led the development of an enterprise-level GenAI-as-a-service platform, we have understandably been bombarded with questions like "What are the art of the possibles for …?" or "Can LLM do …?"

In this blog post, we will dive into a critical skill that will enable you to answer all these questions better: computational thinking. By the end of this blog post, you will have the answers to these questions.

Computational thinking is a problem-solving framework that algorithmically breaks down a task into what I like to call atomic tasks. It involves designing a step-by-step algorithmic approach to solving a problem, identifying similarities and inefficiencies, and evaluating the relative importance of each step.

Imagine cooking a dish.

Recipe sourcing, grocery shopping, ingredient prep, cooking steps, and dishing are the atomic tasks.
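The decomposition can be sketched in a few lines of Python. The task names mirror the cooking example above; the dependency structure is an illustrative assumption:

```python
# Toy illustration of computational thinking: the dish is decomposed into
# atomic tasks with dependencies, then executed in a valid order.
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (assumed ordering, for illustration)
atomic_tasks = {
    "recipe sourcing": set(),
    "grocery shopping": {"recipe sourcing"},
    "ingredient prep": {"grocery shopping"},
    "cooking steps": {"ingredient prep"},
    "dishing": {"cooking steps"},
}

order = list(TopologicalSorter(atomic_tasks).static_order())
print(order)  # recipe sourcing first, dishing last
```

Once a task is atomic, you can reason about each step's cost and importance independently, which is exactly what the framework asks for.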


Pandas: From Messy To Beautiful. This is how to make your pandas code | by Anna Zawadzka | Mar, 2024 – Towards Data Science

Scripting around a pandas DataFrame can turn into an awkward pile of (not-so-)good old spaghetti code. My colleagues and I use this package a lot, and while we try to stick to good programming practices, like splitting code into modules and unit testing, sometimes we still get in the way of one another by producing confusing code.

I have gathered some tips and pitfalls to avoid in order to make pandas code clean and infallible. Hopefully you'll find them useful too. We'll get some help from Robert C. Martin's classic Clean Code, applied specifically to the context of the pandas package. TL;DR at the end.

Let's begin by observing some faulty patterns inspired by real life. Later on, we'll try to rephrase that code in order to favor readability and control.

Pandas DataFrames are value-mutable [2, 3] objects. Whenever you alter a mutable object, it affects the exact same instance that you originally created, and its physical location in memory remains unchanged. In contrast, when you modify an immutable object (e.g., a string), Python creates a whole new object at a new memory location and swaps the reference for the new one.

This is the crucial point: in Python, objects are passed to functions by assignment [4, 5]. See the graph: the value of df was assigned to the variable in_df when it was passed to the function as an argument. Both the original df and the in_df inside the function point to the same memory location (the numeric value in parentheses), even though they go by different variable names. During the modification of its attributes, the location of the mutable object remains unchanged. Now all other scopes can see the changes too, since they reach the same memory location.
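A minimal sketch of the mechanics (the function name modify_df follows the post; the name and name_len columns are illustrative stand-ins):

```python
import pandas as pd

def modify_df(in_df: pd.DataFrame) -> pd.DataFrame:
    # in_df is the very same object the caller passed in, not a copy.
    in_df["name_len"] = in_df["name"].str.len()
    return in_df

df = pd.DataFrame({"name": ["Anna", "Robert"]})
out = modify_df(df)

print(out is df)                 # True: one object under two names
print("name_len" in df.columns)  # True: the change is visible outside
```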

Actually, since we have modified the original instance, it's redundant to return the DataFrame and assign it to a variable; a version of the function that returns nothing has the exact same effect.

Heads-up: the function now returns None, so be careful not to overwrite df with None if you do perform the assignment: df = modify_df(df).
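A sketch of the return-less variant, with the pitfall from the heads-up included (column names are illustrative):

```python
import pandas as pd

def modify_df(in_df: pd.DataFrame) -> None:
    # Mutation alone does the job; returning in_df would be redundant.
    in_df["name_len"] = in_df["name"].str.len()

df = pd.DataFrame({"name": ["Anna", "Robert"]})
modify_df(df)
print(df["name_len"].tolist())  # [4, 6]: the caller's DataFrame changed

# The pitfall: assigning the None return value overwrites the reference.
oops = modify_df(df)
print(oops)  # None
```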

In contrast, if the object is immutable, its memory location changes with each modification. Since a string cannot be modified in place (strings are immutable), the new string is created on top of the old one, but as a brand-new object, claiming a new location in memory. The returned string is not the same string, whereas the returned DataFrame was the exact same DataFrame.
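The contrast can be checked with id() in CPython; the string value and DataFrame columns here are made up for illustration:

```python
import pandas as pd

s = "red"
old_string_id = id(s)
s = s + " green"   # strings are immutable: a brand-new object
                   # claims a new location in memory

df = pd.DataFrame({"x": [1, 2]})
old_df_id = id(df)
df["y"] = [3, 4]   # in-place mutation: same object, same location

print(id(s) == old_string_id)  # False
print(id(df) == old_df_id)     # True
```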

The point is, mutating DataFrames inside functions has a global effect. If you don't keep that in mind, you may:

We'll fix that problem later, but here is another don't before we pass on to the do's.

The design from the previous section is actually an anti-pattern called "output argument" [1 p.45]. Typically, the inputs of a function are used to create an output value. If the sole point of passing an argument to a function is to modify it, so that the input argument changes its state, then it challenges our intuitions. Such behavior is called a side effect [1 p.44] of a function; side effects should be well documented and minimized because they force the programmer to remember the things that go on in the background, therefore making the script error-prone.

When we read a function, we are used to the idea of information going into the function through arguments and out through the return value. We don't usually expect information to be going out through the arguments. [1 p.41]

Things get even worse if the function has a double responsibility: to modify the input and to return an output. Consider a function that does both.

Such a function returns a value as you would expect, but it also permanently modifies the original DataFrame. The side effect takes you by surprise: nothing in the function signature indicated that our input data was going to be affected. In the next step, we'll see how to avoid this kind of design.
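The original snippet is not reproduced here, but a hedged reconstruction of such a double-duty function might look like this (the function and column names are assumptions):

```python
import pandas as pd

def modify_and_return(df: pd.DataFrame) -> int:
    # Double responsibility: computes a result AND mutates the input.
    df["name_len"] = df["name"].str.len()
    return df["name_len"].max()

df = pd.DataFrame({"name": ["Anna", "Robert"]})
longest = modify_and_return(df)

print(longest)                   # 6
print("name_len" in df.columns)  # True: a surprise for the caller
```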

To eliminate the side effect, we can create a new temporary variable instead of modifying the original DataFrame. The notation lengths: pd.Series indicates the datatype of the variable.

This function design is better in that it encapsulates the intermediate state instead of producing a side effect.
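A sketch of that design, assuming the same illustrative name column (the lengths: pd.Series annotation is the one mentioned above):

```python
import pandas as pd

def max_name_length(df: pd.DataFrame) -> int:
    # The intermediate state stays in a local variable; df is untouched.
    lengths: pd.Series = df["name"].str.len()
    return lengths.max()

df = pd.DataFrame({"name": ["Anna", "Robert"]})
print(max_name_length(df))  # 6
print(list(df.columns))     # ['name']: no side effect
```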

Another heads-up: please be mindful of the differences between a deep and a shallow copy [6] of elements from the DataFrame. In the example above, each element of the original df["name"] Series was transformed, so the old DataFrame and the new variable share no elements. However, if you directly assign one of the original columns to a new variable, the underlying elements still share the same references in memory.
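A sketch of the difference (whether a write to the original shows through the shallow view depends on whether copy-on-write is enabled in your pandas version, so only the deep copy's behavior is asserted here):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Anna", "Robert"]})

shallow = df["name"]               # may share memory with df
deep = df["name"].copy(deep=True)  # independent copy in new memory

df.loc[0, "name"] = "Joanna"

# Under classic (pre-copy-on-write) pandas semantics the change shows
# through `shallow`; with copy-on-write, pandas copies on first write
# instead. Either way, the deep copy is never affected:
print(deep.iloc[0])        # Anna
print(df.loc[0, "name"])   # Joanna
```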

You can print out the DataFrame after each step to observe the effect. Remember that creating a deep copy will allocate new memory, so it's good to reflect on whether your script needs to be memory-efficient.

Maybe for whatever reason you want to store the result of that length computation. It's still not a good idea to append it to the DataFrame inside the function, because of the side effect it introduces as well as the accumulation of multiple responsibilities inside a single function.

I like the One Level of Abstraction per Function rule that says:

We need to make sure that the statements within our function are all at the same level of abstraction.

Mixing levels of abstraction within a function is always confusing. Readers may not be able to tell whether a particular expression is an essential concept or a detail. [1 p.36]

Also, let's employ the Single Responsibility Principle [1 p.138] from OOP, even though we're not focusing on object-oriented code right now.

Why not prepare your data beforehand? Let's split the data preparation and the actual computation into separate functions.

The individual task of creating the name_len column has been outsourced to another function. It does not modify the original DataFrame, and it performs one task at a time. Later we retrieve the max element by passing the new column to another dedicated function. Notice how the aggregating function is generic for Collections.
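A hedged sketch of such a split (find_max_element is the name used later in the post; compute_lengths and the empty-collection fallback are assumptions):

```python
from collections.abc import Collection

import pandas as pd

def compute_lengths(names: pd.Series) -> pd.Series:
    # Data preparation only: returns a new Series, mutates nothing.
    return names.str.len()

def find_max_element(collection: Collection) -> int:
    # Generic aggregation: works on any Collection, not just pandas objects.
    return max(collection) if collection else 0

df = pd.DataFrame({"name": ["Anna", "Robert"]})
name_lengths = compute_lengths(df["name"])
print(find_max_element(name_lengths.tolist()))  # 6
```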

Let's brush the code up with the following steps:

The way we have split the code really makes it easy to go back to the script later, take the entire function and reuse it in another script. We like that!

There is one more thing we can do to increase the level of reusability: pass column names as parameters to functions. The refactoring goes a little bit over the top, but sometimes it pays off for the sake of flexibility or reusability.
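A sketch of what parameterizing the column names might look like (the function name add_length_column is invented for illustration):

```python
import pandas as pd

def add_length_column(df: pd.DataFrame, source_col: str,
                      target_col: str) -> pd.DataFrame:
    # Column names are parameters, so the function works on any dataset.
    result = df.copy()
    result[target_col] = result[source_col].str.len()
    return result

df = pd.DataFrame({"name": ["Anna", "Robert"]})
out = add_length_column(df, source_col="name", target_col="name_len")

print(out["name_len"].tolist())  # [4, 6]
print(list(df.columns))          # ['name']: the original is untouched
```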

Did you ever figure out that your preprocessing was faulty after weeks of experiments on the preprocessed dataset? No? Lucky you. I actually had to repeat a batch of experiments because of broken annotations, which could have been avoided if I had tested just a couple of basic functions.

Important scripts should be tested [1 p.121, 7]. Even if the script is just a helper, I now try to test at least the crucial, most low-level functions. Let's revisit the steps that we made from the start:

1. I am not happy to even think of testing this; it's very redundant and we have paved over the side effect. It also tests a bunch of different features: the computation of name length and the aggregation of the result for the max element. Plus, it fails. Did you see that coming?

2. This is much better: we have focused on one single task, so the test is simpler. We also don't have to fixate on column names like we did before. However, I think that the format of the data gets in the way of verifying the correctness of the computation.

3. Here we have cleaned up the desk. We test the computation function inside out, leaving the pandas overlay behind. It's easier to come up with edge cases when you focus on one thing at a time. I figured out that I'd like to test for None values that may appear in the DataFrame, and I eventually had to improve my function for that test to pass. A bug caught!

4. We're only missing a test for find_max_element.
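A minimal pytest-style sketch of that test, with a stand-in implementation of find_max_element so the snippet runs on its own (the real function's signature may differ):

```python
def find_max_element(collection):
    # Assumed behavior: largest element, or 0 for an empty collection.
    return max(collection) if collection else 0

def test_find_max_element():
    assert find_max_element([3, 1, 2]) == 3
    assert find_max_element([2, 2]) == 2
    assert find_max_element([]) == 0

test_find_max_element()
print("all assertions passed")
```

Under pytest, the test function would be collected automatically; calling it directly just makes the snippet self-checking.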

One additional benefit of unit testing that I never forget to mention is that it is a way of documenting your code: someone who doesn't know it (like you from the future) can easily figure out the inputs and expected outputs, including edge cases, just by looking at the tests. Double gain!

These are some tricks I found useful while coding and reviewing other people's code. I'm far from telling you that one or another way of coding is the only correct one: you take what you want from it, and you decide whether you need a quick scratch or a highly polished and tested codebase. I hope this thought piece helps you structure your scripts so that you're happier with them and more confident about their infallibility.

If you liked this article, I would love to know about it. Happy coding!

TL;DR

There's no one and only correct way of coding, but here are some inspirations for scripting with pandas:

Don'ts:

- don't mutate your DataFrame too much inside functions, because you may lose control over what gets appended to or removed from it, and where,

- don't write methods that mutate a DataFrame and return nothing, because that's confusing.

Dos:

- create new objects instead of modifying the source DataFrame and remember to make a deep copy when needed,

- perform only similar-level operations inside a single function,

- design functions for flexibility and reusability,

- test your functions, because this helps you design cleaner code, secure it against bugs and edge cases, and document it for free.

The graphs were created by me using Miro. The cover image was also created by me using the Titanic dataset and GIMP (smudge effect).


Data science classes, bootcamps and certificates in NYC – Time Out

Data science is booming, thanks to the exponential increase in available data, advancements in technology and computing power, and the high demand for data-driven insights to inform decisions across all sectors. Data science classes and bootcamps in NYC offer the perfect opportunity to master essential data science skills like Python programming, machine learning, and data analysis. You'll learn how to extract insights from complex datasets, build predictive models, and create data visualizations. NYC is a hub of innovation and technology, where you'll have unparalleled access to industry experts, networking opportunities, and real-world projects. Whether you're a seasoned professional looking to upskill or a curious beginner eager to explore the possibilities of data science, NYC offers the ideal environment to thrive in this rapidly evolving field.

Recommended: Best certificate programs in NYC | Best coding bootcamps in NYC | Best coding classes & bootcamps near you | Best data science classes and programs for high school students | Best digital marketing classes and certificates in NYC


How the New Breed of LLMs is Replacing OpenAI and the Likes – DataScienceCentral.com – Data Science Central

Of course, OpenAI, Mistral, Claude and the likes may adapt. But will they manage to stay competitive in this evolving market? Last week Databricks launched DBRX. It clearly shows the new trend: specialization, lightweight, combining multiple LLMs, enterprise-oriented, and better results at a fraction of the cost. Monolithic solutions where you pay by the token encourage the proliferation of models with billions or trillions of tokens, weights and parameters. They are embraced by companies such as Nvidia, because they use a lot of GPU and make chip producers wealthy. One of the drawbacks is the cost incurred by the customer, with no guarantee of positive ROI. The quality may also suffer (hallucinations).

In this article, I discuss the new type of architecture under development. Hallucination-free, they achieve better results at a fraction of the cost and run much faster. Sometimes without GPU, sometimes without training. Targeting professional users rather than the layman, they rely on self-tuning and customization. Indeed, there is no universal evaluation metric: laymen and experts have very different ratings and expectations when using these tools.

Much of this discussion is based on the technology that I develop for a Fortune 100 company. I show the benefits, but also potential issues. Many of my competitors are moving in the same direction.

Before diving into the architecture of new LLMs, let's first discuss the current funding model. Many startups get funding from large companies such as Microsoft, Nvidia or Amazon. It means that they have to use their cloud solutions, services and products. The result is high costs for the customer. Startups that rely on vendor-neutral VC funding face a similar challenge: you cannot raise VC money by saying that you could do better and charge 1000x less. VC firms expect to make billions of dollars, not mere millions. To maintain this ecosystem, players spend a lot of money on advertising and hype. In the end, if early investors can quickly make big money through acquisitions, it is a win. What happens when clients realize ROI is negative is unimportant, as long as it does not happen too soon! But can investors even achieve this short-term goal?

The problem is compounded by the fact that researchers believe deep neural networks (DNN) are the panacea, with issues simply fixed by using bigger data, multiple transforms to make DNN work, or front-end patches such as prompt engineering, to address foundational back-end problems. Sadly, no one works on ground-breaking innovations outside DNNs. I am an exception.

In the end, very few self-funded entrepreneurs can compete, offering a far less expensive alternative with no plan on becoming a billionaire. I may be the only one able to survive and thrive, long-term. My intellectual property is open-source, patent-free, and comes with extensive documentation, source code, and comparisons. It appeals to large, traditional corporations. The word is out; it is no longer a secret. In turn, it puts pressure on big players to offer better LLMs. They can see how I do it and implement the same algorithms on their end. Or come up with their own solutions independently. Either way, the new type of architecture is pretty much the same in all cases, not much different from mine. The new Databricks LLM (DBRX) epitomizes this trend. Mine is called xLLM.

Surprisingly, none of the startups working on new LLMs consider monetizing their products via advertising: blending organic output with sponsored results relevant to the user prompt. I am contemplating doing it, with a large client interested in signing up when the option is available.

As concisely stated by one of my clients, the main issues to address are:

In addition to blending specialized LLMs (one per top category, each with its own set of embeddings and other summary tables), a new trend is emerging. It consists of blending multiple LLMs focusing on the same topic, each one with its own flavor: technical, general, or based on different parameters. Then, combining these models just like XGBoost combines multiple small decision trees to get the best from all. In short, an ensemble method.
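A toy sketch of the blending idea (all model names, scores, and weights here are invented; real systems would combine far richer signals than a single score per answer):

```python
def blend_scores(scores_per_model: dict[str, dict[str, float]],
                 weights: dict[str, float]) -> str:
    # Weighted ensemble: each specialized model scores candidate answers,
    # and the blend picks the answer with the highest combined score.
    combined: dict[str, float] = {}
    for model, scores in scores_per_model.items():
        for answer, score in scores.items():
            combined[answer] = combined.get(answer, 0.0) + weights[model] * score
    return max(combined, key=combined.get)

# Hypothetical scores from a "technical" and a "general" flavor of model:
scores = {
    "technical": {"A": 0.9, "B": 0.4},
    "general":   {"A": 0.3, "B": 0.8},
}
print(blend_scores(scores, {"technical": 0.7, "general": 0.3}))  # A
```

Weighting the technical flavor more heavily favors answer A here; shifting the weights toward the general flavor flips the choice, which is the point of letting the user (or a tuning step) control the blend.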

Note that speed and accuracy result from using many small, specialized tables (embeddings and so on) as opposed to a big table with long, fixed-size embedding vectors and expensive semantic / vector search. The user selects the categories that best match their prompt. In my case, there is no neural network involved and no GPU needed, yet no latency and no hallucinations. Liability is further reduced with a local implementation, and explainable AI.

Carefully selecting input sources (in many cases, corporate repositories augmented with external data) and smart crawling to reconstruct the hidden structure (underlying taxonomy, breadcrumbs, navigation links, headings, and so on), are critical components of this architecture.

For details about xLLM (technical implementation, comparing output with OpenAI and the likes on the same prompts, Python code, input sources, and documentation), see here. I also offer a free course on the topic, here.

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier), and patent owner (one patent related to LLM). Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.


Do not over-think about ‘outliers’, use a student-t distribution instead – Towards Data Science

A Student's t-distribution is nothing more than a Gaussian distribution with heavier tails. In other words, we can say that the Gaussian distribution is a special case of the Student's t-distribution. The Gaussian distribution is defined by the mean (μ) and the standard deviation (σ). The Student's t-distribution, on the other hand, adds an additional parameter, the degrees of freedom (df), which controls the thickness of the tails. This parameter assigns greater probability to events further from the mean. This feature is particularly useful for small sample sizes, such as in biomedicine, where the assumption of normality is questionable. Note that as the degrees of freedom increase, the Student's t-distribution approaches the Gaussian distribution. We can visualize this using density plots.

Note in Figure 1 that the hill around the mean gets smaller as the degrees of freedom decrease, as probability mass moves to the tails, which are thicker. This property is what gives the Student's t-distribution its reduced sensitivity to outliers. For more details on this matter, you can check this blog.

We load the required libraries:

So, let's skip data simulations and get serious. We'll work with real data I have acquired from mice performing the rotarod test.

First, we load the dataset into our environment and set the corresponding factor levels. The dataset contains IDs for the animals, a grouping variable (Genotype), an indicator for the two different days on which the test was performed (day), and different trials for the same day. For this article, we model only one of the trials (Trial3). We will save the other trials for a future article on modeling variation.

As the data handling implies, our modeling strategy will be based on Genotype and Day as categorical predictors of the distribution of Trial3.

In biomedical science, categorical predictors, or grouping factors, are more common than continuous predictors. Scientists in this field like to divide their samples into groups or conditions and apply different treatments.

Let's have an initial view of the data using Raincloud plots, as shown by Guilherme A. Franchi, PhD, in this great blog post.

Figure 2 looks different from the original by Guilherme A. Franchi, PhD, because we are plotting two factors instead of one. However, the nature of the plot is the same. Pay attention to the red dots: these are the ones that can be considered extreme observations that tilt the measures of central tendency (especially the mean) toward one direction. We also observe that the variances are different, so also modeling sigma can give better estimates. Our task now is to model the output using the brms package.


The Many Pillars of Getting the Most Value From Your Organization’s Data – Towards Data Science

Photo by Choong Deng Xiang on Unsplash

Let me introduce you to Sarah, a talented and passionate data scientist, who just landed her dream job at GreenEnv, a large company that makes eco-friendly cleaning products. GreenEnv has tons of data on customers, products, and other areas of the business. They hired Sarah to unlock the hidden potential within this data, uncovering market trends, competitive advantages, and more.

Her first task: analyze customer demographics and buying habits to create targeted marketing campaigns. Confident in her abilities and excited to apply data science methods, Sarah dived into the customer database. But her initial excitement quickly faded. The data was a mess: inconsistent formatting, misspelled names, and duplicate entries everywhere. Data quality was terrible. There were variations of names like Jhon Smith and Micheal Brown alongside entries like Jhonn Smtih and Michealw Brown. Emails had extra spaces and even typos like gnail.com instead of gmail.com, along with many other inaccuracies. Sarah realized the hard job ahead of her: data cleaning.

Inconsistent formatting, missing values, and duplicates would lead to skewed results, giving an inaccurate picture of GreenEnv's customer base. Days turned into weeks as Sarah tirelessly cleaned the data, fixing inconsistencies, filling in gaps, and eliminating duplicates. It was a tedious process, but essential to ensure her analysis was built on a solid foundation.

Who cares about data quality?

Every year, poor data quality costs organizations an average of $12.9 million. [1]

Thankfully, after weeks of cleaning and organizing this messy data, Sarah was able to get the job done, or at least this part of it.

Her next challenge came when she ventured into product data, aiming to identify top-selling items and recommend future opportunities. However, she encountered a different problem: a complete lack of metadata. Product descriptions were absent, and categories were ambiguous. Basically, there wasn't enough data to help Sarah understand the product data. Sarah realized the importance of metadata management: structured information about the data itself. Without it, understanding and analyzing the data was almost impossible.

Research Shows Most Data Has Inaccuracies

Research by Experian reveals that businesses believe around 29% of their data is inaccurate in some way. [2]

Frustrated but determined, Sarah reached out to different departments to piece together information about the products. She discovered that each department used its own internal jargon and classification systems. Marketing and sales referred to the same cleaning product by different names.

As Sarah delved deeper, she found that datasets were kept in separate applications by different departments, and that outdated storage systems were struggling to handle the growing volume of data; Sarah had to wait a long time for her queries to execute. She also noticed that there were no clear rules on who can access what data and under what terms. Without centralized control and proper access controls, the risk of unauthorized access to sensitive information increases, potentially leading to data breaches and compliance violations. The lack of data governance, a set of rules and procedures for managing data, was evident.

Data Breaches Can Be Costly

According to the Ponemon Institute, the average cost of a data breach in 2023 is $4.45 million globally, an all-time high record, with costs varying by industry and location. [3]

Each of the above issues and hurdles in Sarah's story highlights the interconnectedness of many pillars: data quality, metadata management, and data governance all played a crucial role in accessing and utilizing valuable insights at GreenEnv.

Sarah's journey is a common one for data scientists and analysts. Many organizations have massive amounts of data, and everyone knows the saying: data is the new electricity. Every organization wants to make the most of its data, as it is a very valuable asset. But most people mistakenly believe that simply hiring a data analyst or data scientist is enough to unlock this value. There are many pillars to getting the most value from data, and organizations need to account for and pay attention to these. The keyword here is data management.

Did you know…

86% of organizations say they believe investing in data management directly impacts their business growth. [4]


Data-Science-Powered Research by Seattle Children’s and Microsoft Shows Promise of Predicting SIDS and Other … – PR Newswire

REDMOND, Wash., March 28, 2024 /PRNewswire/ -- More than 155 international researchers and data scientists met this week in the Pacific Northwest to share with each other the newest insights into the causes of Sudden Infant Death Syndrome (SIDS). The event was sponsored by The Center for Integrative Brain Research at Seattle Children's and Microsoft AI for Good Lab. Among the many topics attendees discussed was groundbreaking new research that suggests genetic testing at birth may hold the promise of detecting SIDS risk and potentially other causes of sudden death later in life.

SIDS is the leading cause of death of infants one month to one year old in the US and other developed countries. The new findings, resulting from a partnership between the Center for Integrative Brain Research at Seattle Children's and data scientists at Microsoft, come from the first-ever whole genome sequencing of 145 infants who succumbed to SIDS. The Aaron Matthew SIDS Research Foundation funds the database, which is maintained and managed at Seattle Children's Research Institute.

In a study soon to be published in the American Journal of Medical Genetics, researchers identify novel genes associated with Sudden Unexplained Infant Deaths (SUID), which includes SIDS. Some of these genes are important for detecting and responding to hypoxia, low levels of oxygen in body tissues. These vulnerabilities could increase children's susceptibility to death caused by sleeping face down.

For decades, medical professionals have found a correlation between the sleeping position of infants and SIDS. This research suggests why that risk exists for certain infants. The study also identified genes associated with Sudden Cardiac Death, which could also explain why some children are particularly vulnerable to SIDS. However, because not every child with this vulnerability will succumb to SIDS, those who survive may be vulnerable to Sudden Cardiac Death later in life. Sudden Cardiac Death is responsible for 360,000 fatalities annually in the United States.

"Scientific research sometimes leads to surprises," said Jan-Marino Ramirez, PhD, Director of the Center for Integrative Brain Research at Seattle Children's. "One surprise in our research leads to an exciting question: what if a genetic test at birth could not only predict the risk of SIDS, but also terminal cardiac problems well into adulthood? Preventative treatments exist for these dangerous conditions, and early detection could save lives."

John Kahan, the former Microsoft Vice President and Chief Data Analytics Officer who co-founded The Aaron Matthew SIDS Research Foundation with his wife Heather Kahan, organized the first SIDS Summit while he was working at Microsoft.

"Thanks to this collaboration between world-class researchers and data scientists armed with cutting-edge AI, we can now use genetic data to predict children at high risk of SIDS, which claims approximately 3,200 children a year," Kahan said. "We are getting far closer to enabling medical professionals to bring preventative treatments to children who exhibit these risks, and potentially to far more people: those susceptible to Sudden Cardiac Death later in life."

Juan M. Lavista Ferres, PhD, MS, Microsoft Chief Data Scientist and Director of the AI for Good Lab at Microsoft, is among those who hosted the Summit this week. Dr. Lavista was the lead researcher who used big data to estimate that 22% of Sudden or Unexplained Infant Deaths in the United States can be directly attributed to maternal smoking during pregnancy, which led to the assertion that SIDS rates can be reduced through education programs about the risks of smoking during pregnancy.

"The learning from this collaboration with SIDS researchers is proving, once again, the power AI has to scale human expertise," Dr. Lavista said. "It's a privilege for my team to put AI in the hands of some of the leading medical researchers in the world, and to see the number of potentially life-saving outcomes that flow from their work, partly through their access to AI."

The new findings on the mechanisms of SIDS were among many issues discussed at the Seventh Annual SIDS Summit, hosted by Ramirez, Kahan, and Lavista. Other research discussed included:

About the Aaron Matthew SIDS Research Guild at Seattle Children's Hospital

The Aaron Matthew SIDS Research Guild at Seattle Children's Hospital was named in honor of Aaron Matthew Kahan, son of Heather Kahan and John B. Kahan. Aaron died of SIDS days after his birth in 2003. The Guild board includes leaders from Seattle Children's Hospital's Integrative Brain Research Institute, Microsoft, Accenture, Marriott Hotels, Adobe, Tata Consulting Services, and VMLY&R.

SOURCE Aaron Matthew SIDS Research Guild


Advancing drug discovery with AI: introducing the KEDD framework – EurekAlert

Image: A simple but effective feature fusion framework that jointly incorporates biomolecular structures, knowledge graphs, and biomedical texts for AI drug discovery. Credit: Yizhen Luo, Institute for AI Industry Research (AIR), Tsinghua University

A transformative study published in Health Data Science, a Science Partner Journal, introduces a groundbreaking end-to-end deep learning framework, known as Knowledge-Empowered Drug Discovery (KEDD), aimed at revolutionizing the field of drug discovery. This innovative framework adeptly integrates structured and unstructured knowledge, enhancing the AI-driven exploration of molecular dynamics and interactions.

Traditionally, AI applications in drug discovery have been constrained by their focus on singular tasks, neglecting the rich tapestry of structured and unstructured data that could enrich their predictive accuracy. These limitations are particularly pronounced when dealing with novel compounds or proteins, where existing knowledge is scant or absent, often hampered by the prohibitive costs of manual data annotation.

Professor Zaiqing Nie, from Tsinghua University's Institute for AI Industry Research, emphasizes the enhancement potential of AI in drug discovery through KEDD. This framework synergizes data from molecular structures, knowledge graphs, and biomedical literature, offering a comprehensive approach that transcends the limitations of conventional models.

At its core, KEDD employs robust representation learning models to distill dense features from various data modalities. Following this, it integrates these features through a fusion process and leverages a predictive network to ascertain outcomes, facilitating its application across a spectrum of AI-facilitated drug discovery endeavors.
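The paper's exact architecture is not spelled out here, but the fusion-then-predict idea can be sketched in a few lines of NumPy: one dense feature vector per modality is concatenated and fed to an (untrained, randomly initialized) predictive head. All dimensions, variable names, and the sigmoid head below are illustrative, not taken from KEDD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders each produce a dense feature vector.
structure_feat = rng.normal(size=128)  # e.g. from a molecular-structure encoder
kg_feat = rng.normal(size=64)          # e.g. from a knowledge-graph embedding
text_feat = rng.normal(size=256)       # e.g. from a biomedical-text encoder

# Simple late fusion: concatenate, then a linear head (weights random here,
# learned in a real system).
fused = np.concatenate([structure_feat, kg_feat, text_feat])  # shape (448,)
W = rng.normal(size=(1, fused.size)) * 0.01
logit = float(W @ fused)
prob = 1.0 / (1.0 + np.exp(-logit))  # e.g. probability of a drug-target interaction

print(fused.shape, prob)
```

In a real pipeline, the fusion step would be trained jointly with the head, and missing modalities would be handled with masking rather than assumed present.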

The study substantiates KEDD's effectiveness, showcasing its ability to outperform existing AI models in critical drug discovery tasks. Notably, KEDD demonstrates resilience in the face of the 'missing modality problem,' where a lack of documented data on new drugs or proteins could undermine analytical processes. This resilience stems from its innovative use of sparse attention and modality masking techniques, which harness the power of existing knowledge bases to inform predictions and analyses.

Looking forward, Yizhen Luo, a key contributor to the KEDD project, outlines ambitious plans to enhance the framework's capabilities, including the exploration of multimodal pre-training strategies. The overarching objective is to cultivate a versatile, knowledge-driven AI ecosystem that accelerates biomedical research, delivering timely insights and recommendations to advance therapeutic discovery and development.

Health Data Science

Toward Unified AI Drug Discovery with Multimodal Knowledge

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.

Read the original here:

Advancing drug discovery with AI: introducing the KEDD framework - EurekAlert

8 Things Most Data Science Programs Don’t Teach (But You Should Know) Part 2 – Towards Data Science

MIT calls this the missing semester of your CS education

10 min read

What data science and software engineering have in common is writing code. But while code is the main outcome of software engineering, data science projects typically end with models, results, and reports. Consequently, in data science the quality, structure, and delivery of code are often an afterthought at best.

The implicit expectation with data science projects is that the results reported at the end can be trusted.

This means that if someone asked you to re-run your own or somebody else's analysis, you would be able to obtain the same results, regardless of how much time has passed since you first performed the analysis.

Similarly, if you are developing a component for a product, the implicit expectation is that the component you developed delivers the best performance reasonably achievable within the requirements of the product.

These statements may seem obvious, but satisfying both expectations can be quite difficult.
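As a small, concrete illustration of the first expectation: reproducibility usually starts with seeding every source of randomness explicitly. The function name and seed value below are made up for illustration, not a prescribed convention:

```python
import random

import numpy as np

SEED = 42

def reproducible_analysis(seed: int = SEED) -> float:
    """Toy analysis whose result is stable across re-runs because
    every source of randomness is seeded explicitly."""
    random.seed(seed)                    # seed Python's stdlib RNG
    rng = np.random.default_rng(seed)    # seed NumPy's RNG
    sample = rng.normal(loc=0.0, scale=1.0, size=1_000)
    return float(sample.mean())

# Re-running months later yields the identical number.
assert reproducible_analysis() == reproducible_analysis()
```

Seeding is only the start, of course: pinned dependency versions and versioned input data are just as necessary for the "same results regardless of time passed" guarantee.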

If you don't believe me, think about your past projects.

Continued here:

8 Things Most Data Science Programs Don't Teach (But You Should Know) Part 2 - Towards Data Science

Create Mixtures of Experts with MergeKit | by Maxime Labonne | Mar, 2024 – Towards Data Science

MoEs also come with their own set of challenges, especially in terms of fine-tuning and memory requirements. The fine-tuning process can be difficult due to the model's complexity: expert usage must be balanced during training so that the gating weights learn to select the most relevant experts. In terms of memory, even though only a fraction of the total parameters are used during inference, the entire model, including all experts, needs to be loaded into memory, which requires high VRAM capacity.
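The routing behind "only a fraction of the total parameters are used during inference" can be illustrated with a minimal top-k gating sketch. This is plain NumPy for illustration, not any particular implementation; the function name and expert count are made up:

```python
import numpy as np

def top_k_route(gate_logits: np.ndarray, k: int = 2):
    """Pick the k highest-scoring experts for one token and renormalize
    their gate weights with a softmax over just those k logits."""
    top = np.argsort(gate_logits)[-k:][::-1]  # indices of the k best experts
    w = np.exp(gate_logits[top] - gate_logits[top].max())
    return top, w / w.sum()

rng = np.random.default_rng(0)
num_experts = 8
logits = rng.normal(size=num_experts)  # router output for one token
experts, weights = top_k_route(logits, k=2)

# Only 2 of the 8 expert FFNs run for this token; their outputs are
# combined with these normalized weights.
print(experts, weights)
```

In training, a load-balancing loss is typically added on top of this so that the router does not collapse onto a couple of favorite experts, which is exactly the balancing difficulty mentioned above.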

More specifically, there are two essential parameters when it comes to MoEs:

Historically, MoEs have underperformed dense models. However, the release of Mixtral-8x7B in December 2023 shook things up and showed impressive performance for its size. Additionally, GPT-4 is also rumored to be an MoE, which would make sense as it would be a lot cheaper to run and train for OpenAI compared to a dense model. In addition to these recent excellent MoEs, we now have a new way of creating MoEs with MergeKit: frankenMoEs, also called MoErges.

The main difference between true MoEs and frankenMoEs is how they're trained. In the case of true MoEs, the experts and the router are trained jointly. In the case of frankenMoEs, we upcycle existing models and initialize the router afterward.

In other words, we copy the weights of the layer norm and self-attention layers from a base model, and then copy the weights of the FFN layers found in each expert. This means that besides the FFNs, all the other parameters are shared. This explains why Mixtral-8x7B with eight experts doesn't have 8*7 = 56B parameters, but about 45B. This is also why using two experts per token gives the inference speed (FLOPs) of a 12B dense model instead of 14B.
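The parameter arithmetic above can be checked with rough numbers. The dimensions below are approximate Mistral-7B figures used for a back-of-the-envelope estimate, not values taken from the article:

```python
# Approximate Mistral-7B dimensions (for illustration only).
d_model, d_ffn, n_layers = 4096, 14336, 32

ffn_per_layer = 3 * d_model * d_ffn       # gated FFN: up, gate, down projections
ffn_total = n_layers * ffn_per_layer      # ~5.6B FFN parameters per expert
mistral_total = 7.24e9                    # total parameters of one Mistral-7B
shared = mistral_total - ffn_total        # embeddings, attention, layer norms

total_8_experts = shared + 8 * ffn_total  # full model that must sit in memory
active_2_experts = shared + 2 * ffn_total # parameters actually used per token

print(f"total:  {total_8_experts / 1e9:.1f}B")   # far less than 8*7 = 56B
print(f"active: {active_2_experts / 1e9:.1f}B")  # near a ~13B dense model
```

Because only the FFNs are duplicated, the total lands in the mid-40B range rather than 56B, and the per-token compute is close to a 12-13B dense model rather than 14B.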

FrankenMoEs are about selecting the most relevant experts and initializing them properly. MergeKit currently implements three ways of initializing the routers:

As you can guess, the hidden initialization is the most efficient to correctly route the tokens to the most relevant experts. In the next section, we will create our own frankenMoE using this technique.

To create our frankenMoE, we need to select n experts. In this case, we will rely on Mistral-7B thanks to its popularity and relatively small size. However, eight experts like in Mixtral is quite a lot, as we need to fit all of them in memory. For efficiency, I'll only use four experts in this example, with two of them engaged for each token and each layer. In this case, we will end up with a model with 24.2B parameters instead of 4*7 = 28B parameters.

Here, our goal is to create a well-rounded model that can do pretty much everything: write stories, explain articles, code in Python, etc. We can decompose this requirement into four tasks and select the best expert for each of them. This is how I decomposed it:

Now that we've identified the experts we want to use, we can create the YAML configuration that MergeKit will use to create our frankenMoE. This uses the mixtral branch of MergeKit. You can find more information about how to write the configuration on this page. Here is our version:
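The article's actual configuration file is not reproduced in this excerpt; below is an illustrative sketch of the general shape of a MergeKit mixtral-branch config. The model names and prompt lists are placeholders, not necessarily those used for Beyonder-4x7B-v3:

```yaml
# Illustrative sketch only -- model names and prompts are placeholders.
base_model: mistralai/Mistral-7B-v0.1
gate_mode: hidden  # initialize the routers from hidden states
experts:
  - source_model: example/chat-expert-7B
    positive_prompts: ["chat", "assistant", "tell me", "explain", "help"]
  - source_model: example/code-expert-7B
    positive_prompts: ["code", "python", "function", "script", "algorithm"]
  - source_model: example/story-expert-7B
    positive_prompts: ["storywriting", "write", "scene", "story", "character"]
  - source_model: example/math-expert-7B
    positive_prompts: ["reason", "math", "mathematics", "solve", "count"]
```

One entry per expert, with positive prompts that should trigger it, matches the prompt-writing advice that follows.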

For each expert, I provide five basic positive prompts. You can be a bit fancier and write entire sentences if you want. The best strategy consists of using real prompts that should trigger a particular expert. You can also add negative prompts to do the opposite.

Once this is ready, you can save your configuration as config.yaml. In the same folder, we will download and install the mergekit library (mixtral branch).

If your computer has enough RAM (roughly 24-32 GB of RAM), you can run the following command:

If you don't have enough RAM, you can shard the models instead as follows (it will take longer):

This command automatically downloads the experts and creates the frankenMoE in the merge directory. For the hidden gate mode, you can also use the --load-in-4bit and --load-in-8bit options to compute hidden states with lower precision.

Alternatively, you can copy your configuration into LazyMergekit, a wrapper I made to simplify model merging. In this Colab notebook, you can input your model name, select the mixtral branch, specify your Hugging Face username/token, and run the cells. After creating your frankenMoE, it will also upload it to the Hugging Face Hub with a nicely formatted model card.

I called my model Beyonder-4x7B-v3 and created GGUF versions of it using AutoGGUF. If you can't run GGUF versions on your local machine, you can also perform inference using this Colab notebook.

To get a good overview of its capabilities, I evaluated it on three different benchmarks: the Nous benchmark suite, EQ-Bench, and the Open LLM Leaderboard. This model is not designed to excel in traditional benchmarks, as the code and role-playing models generally do not apply to those contexts. Nonetheless, it performs remarkably well thanks to strong general-purpose experts.

Nous: Beyonder-4x7B-v3 is one of the best models on the Nous benchmark suite (evaluation performed using LLM AutoEval) and significantly outperforms the v2. See the entire leaderboard here.

EQ-Bench: It's also the best 4x7B model on the EQ-Bench leaderboard, outperforming older versions of ChatGPT and Llama-2-70b-chat. Beyonder is very close to Mixtral-8x7B-Instruct-v0.1 and Gemini Pro, which are (supposedly) much bigger models.

Open LLM Leaderboard: Finally, it's also a strong performer on the Open LLM Leaderboard, significantly outperforming the v2 model.

On top of these quantitative evaluations, I recommend checking the model's outputs in a more qualitative way using a GGUF version on LM Studio. A common way of testing these models is to gather a private set of questions and check their outputs. With this strategy, I found that Beyonder-4x7B-v3 is quite robust to changes in the user and system prompts compared to other models, including AlphaMonarch-7B. This is pretty cool as it improves the usefulness of the model in general.

FrankenMoEs are a promising but still experimental approach. The trade-offs, like higher VRAM demand and slower inference speeds, can make it challenging to see their advantage over simpler merging techniques like SLERP or DARE TIES. In particular, frankenMoEs with just two experts might not perform as well as a simple merge of the same two models. However, frankenMoEs excel in preserving knowledge, which can result in stronger models, as demonstrated by Beyonder-4x7B-v3. With the right hardware, these drawbacks can be effectively mitigated.

In this article, we introduced the Mixture of Experts architecture. Unlike traditional MoEs that are trained from scratch, MergeKit facilitates the creation of MoEs by ensembling experts, offering an innovative approach to improving model performance and efficiency. We detailed the process of creating a frankenMoE with MergeKit, highlighting the practical steps involved in selecting and combining different experts to produce a high-quality MoE.

Thanks for reading this article. I encourage you to try making your own frankenMoEs using LazyMergekit: select a few models, create your config based on Beyonder's, and run the notebook to create your own models! If you liked this article, please follow me on Hugging Face and X/Twitter @maximelabonne.

Read more:

Create Mixtures of Experts with MergeKit | by Maxime Labonne | Mar, 2024 - Towards Data Science