
Pandas: From Messy To Beautiful. This is how to make your pandas code | by Anna Zawadzka | Mar, 2024 – Towards Data Science

Scripting around a pandas DataFrame can turn into an awkward pile of (not-so-)good old spaghetti code. My colleagues and I use this package a lot, and while we try to stick to good programming practices, like splitting code into modules and unit testing, sometimes we still get in the way of one another by producing confusing code.

I have gathered some tips and pitfalls to avoid in order to make pandas code clean and infallible. Hopefully you'll find them useful too. We'll get some help from Robert C. Martin's classic Clean Code, applied specifically to the context of the pandas package. TL;DR at the end.

Let's begin by observing some faulty patterns inspired by real life. Later on, we'll try to rephrase that code in order to favor readability and control.

Pandas DataFrames are value-mutable [2, 3] objects. Whenever you alter a mutable object, it affects the exact same instance that you originally created, and its physical location in memory remains unchanged. In contrast, when you modify an immutable object (e.g. a string), Python creates a whole new object at a new memory location and swaps the reference to the new one.

This is the crucial point: in Python, objects get passed to the function by assignment [4, 5]. See the graph: the value of df has been assigned to the variable in_df when it was passed to the function as an argument. Both the original df and the in_df inside the function point to the same memory location (numeric value in parentheses), even if they go by different variable names. During the modification of its attributes, the location of the mutable object remains unchanged. Now all other scopes can see the changes too, because they reach the same memory location.

Actually, since we have modified the original instance, it's redundant to return the DataFrame and assign it to the variable. This code has the exact same effect:
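For illustration, here is a minimal sketch of what such an in-place modification could look like (the function and column names are placeholders, since the original listing is not reproduced in this excerpt):

```python
import pandas as pd

def modify_df(df: pd.DataFrame) -> None:
    # Mutates the caller's DataFrame in place: df inside and outside
    # the function are the same object, so no return is needed.
    df["name_len"] = df["name"].str.len()

df = pd.DataFrame({"name": ["Alice", "Bob"]})
modify_df(df)
print(df.columns)  # the new column is visible outside the function
```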

Heads-up: the function now returns None, so be careful not to overwrite the df with None if you do perform the assignment: df = modify_df(df).

In contrast, if the object is immutable, it will change its memory location during the modification, just like in the example below. Since the red string cannot be modified (strings are immutable), the green string is created on top of the old one, but as a brand new object, claiming a new location in memory. The returned string is not the same string, whereas the returned DataFrame was the exact same DataFrame.
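A quick way to see the difference is to compare object identities before and after the change (a hypothetical illustration, not taken from the original post):

```python
import pandas as pd

s = "red"
print(id(s))
s += " green"            # strings are immutable: a brand new object is created
print(id(s))             # different id, new memory location

df = pd.DataFrame({"x": [1, 2]})
print(id(df))
df["y"] = df["x"] * 2    # DataFrames are mutable: modified in place
print(id(df))            # same id, same memory location
```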

The point is, mutating DataFrames inside functions has a global effect. If you don't keep that in mind, you may:

We'll fix that problem later, but here is another don't before we pass to the do's.

The design from the previous section is actually an anti-pattern called output argument [1 p.45]. Typically, inputs of a function will be used to create an output value. If the sole point of passing an argument to a function is to modify it, so that the input argument changes its state, then it challenges our intuitions. Such behavior is called a side effect [1 p.44] of a function; side effects should be well documented and minimized, because they force the programmer to remember the things that happen in the background, making the script error-prone.

When we read a function, we are used to the idea of information going into the function through arguments and out through the return value. We don't usually expect information to be going out through the arguments. [1 p.41]

Things get even worse if the function has a double responsibility: to modify the input and to return an output. Consider this function:
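A sketch of what such a double-duty function could look like (the names are assumptions made for illustration, not the author's original code):

```python
import pandas as pd

def find_max_name_length(df: pd.DataFrame) -> int:
    # Returns a value AND silently mutates the input DataFrame.
    df["name_len"] = df["name"].str.len()   # hidden side effect
    return df["name_len"].max()
```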

It does return a value as you would expect, but it also permanently modifies the original DataFrame. The side effect takes you by surprise - nothing in the function signature indicated that our input data was going to be affected. In the next step, we'll see how to avoid this kind of design.

To eliminate the side effect, in the code below we have created a new temporary variable instead of modifying the original DataFrame. The notation lengths: pd.Series indicates the datatype of the variable.
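A possible side-effect-free version, sketched with the same hypothetical column names:

```python
import pandas as pd

def find_max_name_length(df: pd.DataFrame) -> int:
    # The intermediate result lives in a local variable; the input stays untouched.
    lengths: pd.Series = df["name"].str.len()
    return lengths.max()
```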

This function design is better in that it encapsulates the intermediate state instead of producing a side effect.

Another heads-up: please be mindful of the differences between deep and shallow copy [6] of elements from the DataFrame. In the example above we have modified each element of the original df["name"] Series, so the old DataFrame and the new variable have no shared elements. However, if you directly assign one of the original columns to a new variable, the underlying elements still have the same references in memory. See the examples:
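For example (an illustration with invented data; the behaviour of the shallow case depends on your pandas version and copy-on-write settings):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"]})

# Shallow: the new variable refers to the same underlying data.
names = df["name"]
names.iloc[0] = "ALICE"   # with copy-on-write disabled, df sees this change too

# Deep: an independent copy with its own memory.
names_copy = df["name"].copy(deep=True)
names_copy.iloc[0] = "X"  # df is unaffected
```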

You can print out the DataFrame after each step to observe the effect. Remember that creating a deep copy will allocate new memory, so it's good to reflect on whether your script needs to be memory-efficient.

Maybe for whatever reason you want to store the result of that length computation. It's still not a good idea to append it to the DataFrame inside the function, because of the side effect as well as the accumulation of multiple responsibilities inside a single function.

I like the One Level of Abstraction per Function rule that says:

We need to make sure that the statements within our function are all at the same level of abstraction.

Mixing levels of abstraction within a function is always confusing. Readers may not be able to tell whether a particular expression is an essential concept or a detail. [1 p.36]

Also, let's employ the Single Responsibility Principle [1 p.138] from OOP, even though we're not focusing on object-oriented code right now.

Why not prepare your data beforehand? Let's split data preparation and the actual computation into separate functions:
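A sketch of how that split might look (the function and column names mirror the text but are assumptions, not the original listing):

```python
from collections.abc import Collection

import pandas as pd

def prepare_name_len(df: pd.DataFrame) -> pd.DataFrame:
    # Returns a new DataFrame with the extra column; the input is left untouched.
    return df.assign(name_len=df["name"].str.len())

def find_max_element(collection: Collection) -> int:
    # Generic aggregation: works for any collection of comparable items.
    return max(collection) if len(collection) else 0

df = pd.DataFrame({"name": ["Alice", "Bob", "Charlotte"]})
max_len = find_max_element(prepare_name_len(df)["name_len"])  # 9
```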

The individual task of creating the name_len column has been outsourced to another function. It does not modify the original DataFrame and it performs one task at a time. Later we retrieve the max element by passing the new column to another dedicated function. Notice how the aggregating function is generic for Collections.

Let's brush the code up with the following steps:

The way we have split the code really makes it easy to go back to the script later, take the entire function and reuse it in another script. We like that!

There is one more thing we can do to increase the level of reusability: pass column names as parameters to functions. The refactoring is going a little bit over the top, but sometimes it pays for the sake of flexibility or reusability.
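For instance, a hypothetical parameterized version (the argument names are invented for illustration):

```python
import pandas as pd

def add_length_column(df: pd.DataFrame, source_col: str, target_col: str) -> pd.DataFrame:
    # The caller decides which column to read from and which one to create.
    return df.assign(**{target_col: df[source_col].str.len()})

df = pd.DataFrame({"name": ["Alice", "Bob"]})
df = add_length_column(df, source_col="name", target_col="name_len")
```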

Did you ever figure out that your preprocessing was faulty after weeks of experiments on the preprocessed dataset? No? Lucky you. I actually had to repeat a batch of experiments because of broken annotations, which could have been avoided if I had tested just a couple of basic functions.

Important scripts should be tested [1 p.121, 7]. Even if the script is just a helper, I now try to test at least the crucial, most low-level functions. Let's revisit the steps we took from the start:

1. I am not happy to even think of testing this; it's very redundant and we have paved over the side effect. It also tests a bunch of different features: the computation of name length and the aggregation of the result for the max element. Plus, it fails. Did you see that coming?

2. This is much better: we have focused on one single task, so the test is simpler. We also don't have to fixate on column names like we did before. However, I think that the format of the data gets in the way of verifying the correctness of the computation.

3. Here we have cleaned up the desk. We test the computation function inside out, leaving the pandas overlay behind. It's easier to come up with edge cases when you focus on one thing at a time. I figured out that I'd like to test for None values that may appear in the DataFrame, and I eventually had to improve my function for that test to pass. A bug caught!
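Such a test might look like the following (the function name and its None-handling behaviour are assumptions made for illustration):

```python
import pandas as pd

def compute_length(name) -> int:
    # Hypothetical pure computation: length of a name, robust to missing values.
    return 0 if pd.isna(name) else len(name)

def test_compute_length():
    assert compute_length("Alice") == 5
    assert compute_length("") == 0
    assert compute_length(None) == 0   # the edge case that exposed the bug
```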

4. We're only missing the test for find_max_element:
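Reusing the find_max_element sketch from earlier, such a test could be as simple as this (again an illustration, not the author's original test):

```python
def test_find_max_element():
    assert find_max_element([1, 5, 3]) == 5
    assert find_max_element([]) == 0        # empty input does not crash
    assert find_max_element((2,)) == 2      # any collection works, not only lists
```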

One additional benefit of unit testing that I never forget to mention is that it is a way of documenting your code: someone who doesn't know it (like you from the future) can easily figure out the inputs and expected outputs, including edge cases, just by looking at the tests. Double gain!

These are some tricks I found useful while coding and reviewing other people's code. I'm far from telling you that one or another way of coding is the only correct one; take what you want from it, and decide whether you need a quick scratch or a highly polished and tested codebase. I hope this thought piece helps you structure your scripts so that you're happier with them and more confident about their infallibility.

If you liked this article, I would love to know about it. Happy coding!

TL;DR

There's no one and only correct way of coding, but here are some inspirations for scripting with pandas:

Donts:

- don't mutate your DataFrame too much inside functions, because you may lose control over what gets appended to or removed from it, and where,

- don't write methods that mutate a DataFrame and return nothing, because that's confusing.

Dos:

- create new objects instead of modifying the source DataFrame and remember to make a deep copy when needed,

- perform only similar-level operations inside a single function,

- design functions for flexibility and reusability,

- test your functions, because this helps you design cleaner code, secure it against bugs and edge cases, and document it for free.

The graphs were created by me using Miro. The cover image was also created by me using the Titanic dataset and GIMP (smudge effect).

Read more from the original source:

Pandas: From Messy To Beautiful. This is how to make your pandas code | by Anna Zawadzka | Mar, 2024 - Towards Data Science

Read More..

How the New Breed of LLMs is Replacing OpenAI and the Likes – DataScienceCentral.com – Data Science Central

Of course, OpenAI, Mistral, Claude and the likes may adapt. But will they manage to stay competitive in this evolving market? Last week Databricks launched DBRX. It clearly shows the new trend: specialization, lightweight, combining multiple LLMs, enterprise-oriented, and better results at a fraction of the cost. Monolithic solutions where you pay by the token encourage the proliferation of models with billions or trillions of tokens, weights and parameters. They are embraced by companies such as Nvidia, because they use a lot of GPU and make chip producers wealthy. One of the drawbacks is the cost incurred by the customer, with no guarantee of positive ROI. The quality may also suffer (hallucinations).

In this article, I discuss the new type of architecture under development. Hallucination-free, they achieve better results at a fraction of the cost and run much faster. Sometimes without GPU, sometimes without training. Targeting professional users rather than the layman, they rely on self-tuning and customization. Indeed, there is no universal evaluation metric: laymen and experts have very different ratings and expectations when using these tools.

Much of this discussion is based on the technology that I develop for a Fortune 100 company. I show the benefits, but also potential issues. Many of my competitors are moving in the same direction.

Before diving into the architecture of new LLMs, let's first discuss the current funding model. Many startups get funding from large companies such as Microsoft, Nvidia or Amazon. It means that they have to use their cloud solutions, services and products. The result is high costs for the customer. Startups that rely on vendor-neutral VC funding face a similar challenge: you cannot raise VC money by saying that you could do better and charge 1000x less. VC firms expect to make billions of dollars, not mere millions. To maintain this ecosystem, players spend a lot of money on advertising and hype. In the end, if early investors can quickly make big money through acquisitions, it is a win. What happens when clients realize ROI is negative is unimportant. As long as it does not happen too soon! But can investors even achieve this short-term goal?

The problem is compounded by the fact that researchers believe deep neural networks (DNN) are the panacea, with issues simply fixed by using bigger data, multiple transforms to make DNN work, or front-end patches such as prompt engineering, to address foundational back-end problems. Sadly, no one works on ground-breaking innovations outside DNNs. I am an exception.

In the end, very few self-funded entrepreneurs can compete, offering a far less expensive alternative with no plan of becoming a billionaire. I may be the only one able to survive and thrive, long-term. My intellectual property is open-source, patent-free, and comes with extensive documentation, source code, and comparisons. It appeals to large, traditional corporations. The word is out; it is no longer a secret. In turn, it puts pressure on big players to offer better LLMs. They can see how I do it and implement the same algorithms on their end. Or come up with their own solutions independently. Either way, the new type of architecture is pretty much the same in all cases, not much different from mine. The new Databricks LLM (DBRX) epitomizes this trend. Mine is called xLLM.

Surprisingly, none of the startups working on new LLMs consider monetizing their products via advertising: blending organic output with sponsored results relevant to the user prompt. I am contemplating doing it, with a large client interested in signing up when the option is available.

As concisely stated by one of my clients, the main issues to address are:

In addition to blending specialized LLMs (one per top category, with its own set of embeddings and other summary tables), a new trend is emerging. It consists of blending multiple LLMs focusing on the same topic, each one with its own flavor: technical, general, or based on different parameters. Then, combining these models just like XGBoost combines multiple small decision trees to get the best from all. In short, an ensemble method.

Note that speed and accuracy result from using many small, specialized tables (embeddings and so on) as opposed to a big table with long, fixed-size embedding vectors and expensive semantic / vector search. The user selects the categories that best match his prompt. In my case, there is no neural network involved, no GPU needed, yet no latency and no hallucinations. Liability is further reduced with a local implementation, and explainable AI.
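To make the idea concrete, here is a rough, hypothetical sketch (not the actual xLLM code) of routing a prompt to user-selected category tables and blending their scored results, ensemble-style:

```python
from collections import defaultdict

# Hypothetical per-category lookup tables: keyword -> list of (snippet, weight).
category_tables = {
    "statistics": {"variance": [("Definition of variance ...", 0.9)]},
    "nlp":        {"embedding": [("Short note on embeddings ...", 0.8)]},
}

def query(prompt_keywords, selected_categories, category_weights=None):
    # Blend results from several small specialized tables instead of one giant index.
    weights = category_weights or {c: 1.0 for c in selected_categories}
    scored = defaultdict(float)
    for cat in selected_categories:
        table = category_tables.get(cat, {})
        for kw in prompt_keywords:
            for snippet, score in table.get(kw, []):
                scored[snippet] += weights[cat] * score
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(query(["variance"], ["statistics", "nlp"]))
```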

Carefully selecting input sources (in many cases, corporate repositories augmented with external data) and smart crawling to reconstruct the hidden structure (underlying taxonomy, breadcrumbs, navigation links, headings, and so on), are critical components of this architecture.

For details about xLLM (technical implementation, comparing output with OpenAI and the likes on the same prompts, Python code, input sources, and documentation), see here. I also offer a free course on the topic, here.

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner (one related to LLM). Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.

Read the rest here:

How the New Breed of LLMs is Replacing OpenAI and the Likes - DataScienceCentral.com - Data Science Central

Read More..

Data science classes, bootcamps and certificates in NYC – Time Out

Data science is booming, thanks to the exponential increase in available data, advancements in technology and computing power, and the high demand for data-driven insights to inform decisions across all sectors. Data science classes and bootcamps in NYC offer the perfect opportunity to master essential data science skills like Python programming, machine learning, and data analysis. You'll learn how to extract insights from complex datasets, build predictive models, and create data visualizations. NYC is a hub of innovation and technology, where you'll have unparalleled access to industry experts, networking opportunities, and real-world projects. Whether you're a seasoned professional looking to upskill or a curious beginner eager to explore the possibilities of data science, NYC offers the ideal environment to thrive in this rapidly evolving field.


Continued here:

Data science classes, bootcamps and certificates in NYC - Time Out

Read More..

Do not over-think about ‘outliers’, use a student-t distribution instead – Towards Data Science

A Student's t-distribution is nothing more than a Gaussian distribution with heavier tails. In other words, we can say that the Gaussian distribution is a special case of the Student's t-distribution. The Gaussian distribution is defined by the mean (μ) and the standard deviation (σ). The Student's t-distribution, on the other hand, adds an additional parameter, the degrees of freedom (df), which controls the thickness of the distribution. This parameter assigns greater probability to events further from the mean. This feature is particularly useful for small sample sizes, such as in biomedicine, where the assumption of normality is questionable. Note that as the degrees of freedom increase, the Student's t-distribution approaches the Gaussian distribution. We can visualize this using density plots:
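The original post builds these plots in R; as a rough equivalent, the comparison can be sketched in Python (an assumption for illustration, not the author's code):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-5, 5, 500)
for df_ in (1, 3, 30):
    plt.plot(x, stats.t.pdf(x, df=df_), label=f"Student's t, df={df_}")
plt.plot(x, stats.norm.pdf(x), "k--", label="Gaussian")
plt.legend()
plt.title("The Student's t-distribution approaches the Gaussian as df grows")
plt.show()
```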

Note in Figure 1 that the hill around the mean gets smaller as the degrees of freedom decrease, as a result of the probability mass going to the tails, which become thicker. This property is what gives the Student's t-distribution its reduced sensitivity to outliers. For more details on this matter, you can check this blog.

We load the required libraries:

So, let's skip data simulations and get serious. We'll work with real data I have acquired from mice performing the rotarod test.

First, we load the dataset into our environment and set the corresponding factor levels. The dataset contains IDs for the animals, a grouping variable (Genotype), an indicator for the two different days on which the test was performed (day), and different trials for the same day. For this article, we model only one of the trials (Trial3). We will save the other trials for a future article on modeling variation.

As the data handling implies, our modeling strategy will be based on Genotype and Day as categorical predictors of the distribution of Trial3.

In biomedical science, categorical predictors, or grouping factors, are more common than continuous predictors. Scientists in this field like to divide their samples into groups or conditions and apply different treatments.

Let's have an initial view of the data using Raincloud plots, as shown by Guilherme A. Franchi, PhD, in this great blog post.

Figure 2 looks different from the original by Guilherme A. Franchi, PhD, because we are plotting two factors instead of one. However, the nature of the plot is the same. Pay attention to the red dots: these are the ones that can be considered extreme observations that tilt the measures of central tendency (especially the mean) toward one direction. We also observe that the variances are different, so also modeling sigma can give better estimates. Our task now is to model the output using the brms package.

See the rest here:

Do not over-think about 'outliers', use a student-t distribution instead - Towards Data Science

Read More..

The Many Pillars of Getting the Most Value From Your Organization’s Data – Towards Data Science

Photo by Choong Deng Xiang on Unsplash

Let me introduce you to Sarah, a talented and passionate data scientist, who just landed her dream job at GreenEnv, a large company that makes eco-friendly cleaning products. GreenEnv has tons of data on customers, products, and other areas of the business. They hired Sarah to unlock the hidden potential within this data, uncovering market trends, competitive advantages, and more.

Her first task: analyze customer demographics and buying habits to create targeted marketing campaigns. Confident in her abilities and excited to apply data science methods, Sarah dived into the customer database. But her initial excitement quickly faded. The data was a mess: inconsistent formatting, misspelled names, and duplicate entries everywhere. Data quality was terrible. There were variations of names like Jhon Smith and Micheal Brown alongside entries like Jhonn Smtih and Michealw Brown. Emails had extra spaces and even typos like gnail.com instead of gmail.com, along with many other inaccuracies. Sarah realized the hard job ahead of her: data cleaning.

Inconsistent formatting, missing values, and duplicates would lead to skewed results, giving an inaccurate picture of GreenEnv's customer base. Days turned into weeks as Sarah tirelessly cleaned the data, fixing inconsistencies, filling in gaps, and eliminating duplicates. It was a tedious process, but essential to ensure her analysis was built on a solid foundation.
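As a loose illustration of that kind of cleanup (the column names and rules are invented for the example, not taken from GreenEnv's data):

```python
import pandas as pd

customers = pd.DataFrame({
    "name": ["Jhon Smith ", "jhon smith", "Micheal Brown"],
    "email": [" jhon@gnail.com", "jhon@gmail.com", "micheal@gmail.com"],
})

# Normalize formatting, fix a known typo, and drop duplicate records.
customers["name"] = customers["name"].str.strip().str.title()
customers["email"] = (customers["email"].str.strip().str.lower()
                      .str.replace("@gnail.com", "@gmail.com", regex=False))
customers = customers.drop_duplicates(subset=["email"])
```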

Who cares about data quality?

Every year, poor data quality costs organizations an average of $12.9 million. [1]

Thankfully, after weeks of cleaning and organizing this messy data, Sarah was able to get the job done, or at least this part of it.

Her next challenge came when she ventured into product data, aiming to identify top-selling items and recommend future opportunities. However, she encountered a different problem: a complete lack of metadata. Product descriptions were absent, and categories were ambiguous. Basically, there wasn't enough information to help Sarah understand the product data. Sarah realized the importance of metadata management: structured information about the data itself. Without it, understanding and analyzing the data was almost impossible.

Research Shows Most Data Has Inaccuracies

Research by Experian reveals that businesses believe around 29% of their data is inaccurate in some way. [2]

Frustrated but determined, Sarah reached out to different departments to piece together information about the products. She discovered that each department used its own internal jargon and classification systems. Marketing and sales referred to the same cleaning product by different names.

As Sarah delved deeper, she found that datasets were kept in separate applications by different departments and that outdated storage systems were struggling to handle the growing volume of data; she had to wait a long time for her queries to execute. Sarah also noticed that there were no clear rules on who could access what data and under what terms. Without centralized control and proper access controls, the risk of unauthorized access to sensitive information increases, potentially leading to data breaches and compliance violations. The lack of data governance, a set of rules and procedures for managing data, was evident.

Data Breaches Can Be Costly

According to the Ponemon Institute, the average cost of a data breach in 2023 is $4.45 million globally, an all-time high record, with costs varying by industry and location. [3]

Each of the above issues and hurdles in Sarah's story highlighted the interconnectedness of many pillars: data quality, metadata management, and data governance all played a crucial role in accessing and utilizing valuable insights at GreenEnv.

Sarah's journey is a common one for data scientists and analysts. Many organizations have massive amounts of data, and everyone knows the saying: "Data is the new electricity." Every organization wants to make the most of its data, as it's a very valuable asset. But most people mistakenly believe that simply hiring a data analyst or data scientist is enough to unlock this value. There are many pillars to getting the most value from data, and organizations need to account for and pay attention to these. The keyword here is data management.

Did you know..

86% of organizations say they believe investing in data management directly impacts their business growth. [4]

Read the original:

The Many Pillars of Getting the Most Value From Your Organization's Data - Towards Data Science

Read More..

8 Things Most Data Science Programs Don’t Teach (But You Should Know) Part 2 – Towards Data Science

MIT calls this "the missing semester of your CS education."

What data science and software engineering have in common is writing code. But while code is the main outcome of software engineering, data science projects typically end with models, results, and reports. Consequently, in data science the quality, structure, and delivery of code is often an afterthought at best.

The implicit expectation with data science projects is that the results reported at the end can be trusted.

This means that if someone asked you to re-run your or somebody else's analysis, you would be able to obtain the same results, regardless of how much time has passed since you first performed the analysis.

Similarly, if you are developing a component for a product, the implicit expectation is that the component you developed represents the best performance that is reasonably possible within the requirements of the product.

These statements may seem obvious, but satisfying both expectations can be quite difficult.

If you don't believe me, think about your past projects.

Continued here:

8 Things Most Data Science Programs Don't Teach (But You Should Know) Part 2 - Towards Data Science

Read More..

Data-Science-Powered Research by Seattle Children’s and Microsoft Shows Promise of Predicting SIDS and Other … – PR Newswire

REDMOND, Wash., March 28, 2024 /PRNewswire/ -- More than 155 international researchers and data scientists met this week in the Pacific Northwest to share with each other the newest insights into the causes of Sudden Infant Death Syndrome (SIDS). The event was sponsored by The Center for Integrative Brain Research at Seattle Children's and Microsoft AI for Good Lab. Among the many topics attendees discussed was groundbreaking new research that suggests genetic testing at birth may hold the promise of detecting SIDS risk and potentially other causes of sudden death later in life.

SIDS is the leading cause of death of infants one month to one year old in the US and other developed countries. The new findings, resulting from a partnership between the Center for Integrative Brain Research at Seattle Children's and data scientists at Microsoft, come from the first-ever whole genome sequencing of 145 infants who succumbed to SIDS. The Aaron Matthew SIDS Research Foundation funds the database, which is maintained and managed at Seattle Children's Research Institute.

In a study, soon to be published in the American Journal of Medical Genetics, researchers identify novel genes associated with Sudden Unexplained Infant Deaths (SUID), which includes SIDS. Some of these genes are important for detecting and responding to hypoxia, low levels of oxygen in body tissues. These vulnerabilities could increase children's susceptibility to death caused by sleeping face down.

For decades, medical professionals have found a correlation between the sleeping position of infants and SIDS. This research suggests why that risk exists for certain infants. The study also identified genes associated with Sudden Cardiac Death, which could also explain why some children are particularly vulnerable to SIDS. However, because not every child with this vulnerability will succumb to SIDS, those who survive may be vulnerable to Sudden Cardiac Death later in life. Sudden Cardiac Death is responsible for 360,000 fatalities annually in the United States.

"Scientific research sometimes leads to surprises," said Jan-Marino Ramirez, PhD, Director of the Center for Integrative Brain Research at Seattle Children's. "One surprise in our research leads to an exciting question: what if a genetic test at birth could not only predict the risk of SIDS, but also terminal cardiac problems well into adulthood? Preventative treatments exist for these dangerous conditions, and early detection could save lives."

John Kahan, the former Microsoft Vice President and Chief Data Analytics Officer who co-founded The Aaron Matthew SIDS Research Foundation with his wife Heather Kahan, organized the first SIDS Summit while he was working at Microsoft.

"Thanks to this collaboration between world-class researchers and data scientists armed with cutting edge AI, we can now use genetic data to predict children at high risk of SIDS, which claims approximately 3,200 children a year," Kahan said. "We are getting far closer to enabling medical professionals to bring preventative treatments to children who exhibit these risks, and potentially to far more people: those susceptible to Sudden Cardiac Death later in life."

Juan M. Lavista Ferres, PhD, MS, Microsoft Chief Data Scientist and the Director of the AI For Good Lab at Microsoft, is among those who hosted the Summit this week. Dr. Lavista was the lead researcher who used big data to estimate that 22% of Sudden or Unexplained Infant Deaths in the United States can be directly attributed to maternal smoking during pregnancy, which led to the assertion that SIDS rates can be reduced through education programs about the risks of smoking during pregnancy.

"The learning from this collaboration with SIDS researchers is proving, once again, the power AI has to scale human expertise," Dr. Lavista said. "It's a privilege for my team to put AI in the hands of some of the leading medical researchers in the world, and to see the number of potentially life-saving outcomes that flow from their work, partly through their access to AI."

The new findings on the mechanisms of SIDS were among many issues discussed at the Seventh Annual SIDS Summit, hosted by Ramirez, Kahan, and Lavista. Other research discussed included:

About the Aaron Matthew SIDS Research Guild at Seattle Children's Hospital: The Aaron Matthew SIDS Research Guild at Seattle Children's Hospital was named in honor of Aaron Matthew Kahan, son of Heather Kahan and John B. Kahan. Aaron died of SIDS days after his birth in 2003. The Guild board includes leaders from Seattle Children's Hospital's Integrative Brain Research Institute, Microsoft, Accenture, Marriott Hotels, Adobe, Tata Consulting Services, and VMLY&R.

SOURCE Aaron Matthew SIDS Research Guild

See the article here:

Data-Science-Powered Research by Seattle Children's and Microsoft Shows Promise of Predicting SIDS and Other ... - PR Newswire

Read More..

Advancing drug discovery with AI: introducing the KEDD framework – EurekAlert

Image: A simple but effective feature fusion framework that jointly incorporates biomolecular structures, knowledge graphs, and biomedical texts for AI drug discovery. (Credit: Yizhen Luo, Institute for AI Industry Research (AIR), Tsinghua University)

A transformative study published in Health Data Science, a Science Partner Journal, introduces a groundbreaking end-to-end deep learning framework, known as Knowledge-Empowered Drug Discovery (KEDD), aimed at revolutionizing the field of drug discovery. This innovative framework adeptly integrates structured and unstructured knowledge, enhancing the AI-driven exploration of molecular dynamics and interactions.

Traditionally, AI applications in drug discovery have been constrained by their focus on singular tasks, neglecting the rich tapestry of structured and unstructured data that could enrich their predictive accuracy. These limitations are particularly pronounced when dealing with novel compounds or proteins, where existing knowledge is scant or absent, often hampered by the prohibitive costs of manual data annotation.

Professor Zaiqing Nie, from Tsinghua University's Institute for AI Industry Research, emphasizes the enhancement potential of AI in drug discovery through KEDD. This framework synergizes data from molecular structures, knowledge graphs, and biomedical literature, offering a comprehensive approach that transcends the limitations of conventional models.

At its core, KEDD employs robust representation learning models to distill dense features from various data modalities. Following this, it integrates these features through a fusion process and leverages a predictive network to ascertain outcomes, facilitating its application across a spectrum of AI-facilitated drug discovery endeavors.

The study substantiates KEDD's effectiveness, showcasing its ability to outperform existing AI models in critical drug discovery tasks. Notably, KEDD demonstrates resilience in the face of the 'missing modality problem,' where lack of documented data on new drugs or proteins could undermine analytical processes. This resilience stems from its innovative use of sparse attention and modality masking techniques, which harness the power of existing knowledge bases to inform predictions and analyses.
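As a very rough sketch of the general pattern described here (feature fusion with a mask for missing modalities), and not KEDD's actual implementation:

```python
import torch
import torch.nn as nn

class MaskedFusion(nn.Module):
    # Toy fusion head: concatenate per-modality features, zeroing out missing ones.
    def __init__(self, dims, hidden=128, out=1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sum(dims), hidden), nn.ReLU(), nn.Linear(hidden, out)
        )

    def forward(self, feats, masks):
        # feats: list of (batch, dim) tensors; masks: list of (batch, 1) 0/1 tensors.
        fused = torch.cat([f * m for f, m in zip(feats, masks)], dim=-1)
        return self.head(fused)

# Dummy structure, knowledge-graph, and text features for a batch of 4 molecules.
feats = [torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 64)]
masks = [torch.ones(4, 1), torch.zeros(4, 1), torch.ones(4, 1)]  # knowledge graph missing
model = MaskedFusion([32, 16, 64])
print(model(feats, masks).shape)  # torch.Size([4, 1])
```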

Looking forward, Yizhen Luo, a key contributor to the KEDD project, outlines ambitious plans to enhance the framework's capabilities, including the exploration of multimodal pre-training strategies. The overarching objective is to cultivate a versatile, knowledge-driven AI ecosystem that accelerates biomedical research, delivering timely insights and recommendations to advance therapeutic discovery and development.

Health Data Science

Toward Unified AI Drug Discovery with Multimodal Knowledge

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.

Read the original here:

Advancing drug discovery with AI: introducing the KEDD framework - EurekAlert

Read More..

Generation Google Scholarship 2024 for Women in Computer Science in Ireland (up to 5000 EUR) Opportunity Desk – Opportunity Desk

Deadline: April 23, 2024

Applications are open for the Generation Google Scholarship 2024 for Women in Computer Science in Ireland. The Generation Google Scholarship: for women in computer science in Ireland was established to help aspiring computer scientists excel in technology and become leaders in the field.

A group of undergraduate students who identify as women will be chosen from the applicant pool, and scholarships will be awarded based on the strength of each candidate's impact on diversity, demonstrated leadership, and academic background. The program is open to qualified students who meet all the minimum qualifications. All students who identify as women interested in computer science are strongly encouraged to apply.

Benefits

Eligibility

To be eligible to apply, applicants must:

Application

You will be asked to complete an online application which includes:

Essay Questions:

The two short answer essay questions below are intended to assess your problem solving skills and commitment to diversity, equity, and inclusion. Each response to the two questions below should be 500 words or less.

IMPORTANT: Before starting the application, have the following ready for upload:

Click here to apply

For more information, visit Generation Google Scholarship.

Excerpt from:

Generation Google Scholarship 2024 for Women in Computer Science in Ireland (up to 5000 EUR) Opportunity Desk - Opportunity Desk

Read More..

10 of the highest-paying programming jobs right now – Fortune

In our modern age, computers impact practically every aspect of daily life. But before we can type an address into our smartphone or book a restaurant reservation online, a programmer has to create the software these programs rely on. Still, programmers aren't as in demand as they once were. Following the programming hiring boom of the pandemic, fewer programmers are now needed in the workforce, especially as programming becomes increasingly automated in the age of AI.

The U.S. Bureau of Labor Statistics anticipates an 11% decline in employment for computer programmers between 2022 and 2032. Still, BLS projects that there will be about 6,700 job openings per year over that decade because of workers transferring to other occupations or retiring. "The amount of code that you write is going down, but the impact that you have with that code that you write is going up," explains George Heineman, an assistant professor of computer science at Worcester Polytechnic Institute in Worcester, Massachusetts, of the current need for programmers.


For those who are interested in pursuing programming jobs, here are 10 of the field's top-paying roles.

Highest-paying cities: San Diego, Calif. ($298,291), New York, N.Y. ($225,432), Phoenix, Ariz. ($216,956), according to Indeed.

Description: This C-suite position oversees a company's IT department and research and development department. Part of this job includes researching new technology and finding weaknesses that can be fixed with new IT technology.

Heineman says this C-suite position carries a huge amount of responsibility, and that it's more about hiring people than actual programming. "You don't get to do the good stuff. You hire people who get to do the good stuff," Heineman explains. "It's understanding the mission of the company and how to translate that mission into action."

Education: It's common for chief technology officers to have a bachelor's in IT, business or cybersecurity; an MBA can provide business acumen and leadership skills.

Highest-paying cities: New York, N.Y. ($188,965), Cupertino, Calif. ($183,159), Santa Clara, Calif. ($182,851), according to Indeed.

Description: Machine learning engineers create software that can run automatically and contend with problems it encounters by learning to improve upon its tasks without assistance from humans. This wide-ranging skill can be applied to virtual assistants like Amazon's Alexa, self-driving cars and recommendation algorithms.

"You don't have to be a programmer to be a machine learning engineer, although a lot of programmers go in that direction," Heineman says. "They really understand how to take that very specific domain and model it and run these machine learning algorithms through their paces."

Education: Machine learning engineers are usually required to have at least a bachelor's in a related field; most job postings require a master's in computer science, data science, software engineering or a similar field.

Highest-paying cities: Bolinas, Calif. ($181,995), Kensington, Ky. ($178,838), Summitview, Wash. ($173,838), according to ZipRecruiter.

Description: AI engineers research and develop machines that simulate the thinking patterns and behavior of humans. Using machine learning and artificial intelligence, AI engineers create applications and systems that assist companies to increase profits and efficiency, cut costs, and make better business decisions.

Education: AI engineers typically hold a bachelor's in a related field such as IT, computer science, data science or statistics. Though not usually required, it's also common for AI engineers to hold a master's degree in a field like data science or computer science.

Highest-paying cities: San Jose, Calif. ($170,922), San Francisco, Calif. ($159,685), Washington, D.C. ($154,236), according to Indeed.

Description: Cloud computing allows companies to access large-scale storage without maintaining their own physical servers. Cloud architects set up these clouds for companies, maintain systems and communicate with third-party cloud servicers. Cloud architects must be deeply knowledgeable about security to protect the cloud.

"You can't just start off as a cloud architect. You need experience," Heineman says. "There is no single cloud. There's not just one programming language, or there's not just one computer architecture."

Education: Earning a bachelor's degree in computer science or a related field is often preferred by employers. A cloud architecture certification can also be helpful.

Highest-paying cities: Palo Alto, Calif. ($169,540), Bellevue, Wash. ($167,827), Redmond, Wash. ($141,286), according to Indeed.

Description: Data science professionals can take large sets of raw data, revise it, and analyze it to reveal actionable insights. Prevalent in industries like finance, health care, and technology, this role is especially useful for its ability to take data that was once incomprehensible and turn it into something constructive.

"Information is data with meaning, and that's the job of a data scientist: get the data to be meaningful," says Paulus Wahjudi, chair and professor of the Department of Computer Sciences and Electrical Engineering at Marshall University.

Education: It's not strictly necessary, but a bachelor's degree in computer science is useful for this role.

Highest-paying cities: San Diego, Calif. ($168,874), Herndon, Va. ($160,279), Chicago, Ill. ($156,416), according to Indeed.

Description: Requiring both IT and business skills, an enterprise architect ensures that a company's technology is in line with its business goals. Enterprise architects set IT standards, buy software or get an IT department to create it, based on their analysis of an employer's business goals.

Education: Enterprise architect jobs usually require a four-year degree in data science, computer science, or a similar field. These roles often require five to 10 years of experience and a master's degree in a related field.

Highest-paying cities: Palo Alto, Calif. ($159,261), San Francisco, Calif. ($151,315), and Herndon, Va. ($151,190), according to Indeed.

Description: DevOps engineers work to improve software development processes by coordinating all teams involved with a products development. This role updates and maintains software processes with the aim of fixing bugs and improving user experience.

Education: Employers usually prefer DevOps engineers who have a bachelor's degree in computer programming, software engineering, or a related field.

Highest-paying cities: San Francisco, Calif. ($154,204), Irving, Texas ($144,535), Charlotte, N.C. ($139,145), according to Indeed.

Description: A full stack developer can do pretty much anything related to computer programming. With the back-end team, they help manage servers and create databases; with the front-end team, they assist with the creation of parts of the project that are client-facing. Full stack developers are in high demand because of their ability to assist at any stage of a project.

Education: Full stack developers usually have at least a bachelor's in computer engineering, information technology, computer science, or a related field. Some have certificates or specialized degrees in AI, web development, information security, or database management.

Highest-paying cities: Washington, D.C. ($124,316), New York, N.Y. ($121,596), and Boston, Mass. ($120,549), according to Indeed.

Description: As the name of the role suggests, database developers oversee developing databases. In modern times, most companies constantly record and store data that's used to conduct data analysis, record the company's history, and comply with regulations. Databases and data warehouses are necessary to securely store this data and must be crafted to meet the needs of each individual business. After creating these databases, a database developer must constantly maintain them.

"The databases, in some ways, are so optimized that they can run themselves, but you still need someone to know how to model the data, and that's what a database developer does," Heineman says. "They could have moderate programming skills, [but] that's not really the strength. It's modeling skills."

Education: Database developers usually have a bachelor's degree in computer science, software engineering, or a similar field.

Highest-paying cities: Tallahassee, Fla. ($121,435), San Francisco, Calif. ($118,145), and Washington, D.C. ($98,807), according to Indeed.

Description: It is the job of a systems administrator to keep a company's hardware and software up and running securely. From managing operating systems and servers to updating and installing new software to providing tech support, a systems administrator must be able to take on any task required of them.

Having good people skills is a boon for this role, as systems administrators must often help non-technical employees. "They're the one that gets called 24/7," says Wahjudi. "If anything goes wrong, it's your responsibility to get it back up."

Education: To be a systems administrator, most jobs require a bachelor's in a field related to information or computer science. Some positions may require an associate's or postsecondary degree.

Though the need for programmers has fallen since the boom of the pandemic, both Heineman and Wahjudi say that programming skills are useful and transferable. While demand may have slowed for these positions, people are still being hired for high-paying jobs that use programming.

"We tell our students that a computer is stupid," Wahjudi says. "It's only as smart as the programmer."

Read more from the original source:

10 of the highest-paying programming jobs right now - Fortune

Read More..