Category Archives: Data Science

Data-Science-Powered Research by Seattle Children’s and Microsoft Shows Promise of Predicting SIDS and Other … – PR Newswire

REDMOND, Wash., March 28, 2024 /PRNewswire/ -- More than 155 international researchers and data scientists met this week in the Pacific Northwest to share with each other the newest insights into the causes of Sudden Infant Death Syndrome (SIDS). The event was sponsored by The Center for Integrative Brain Research at Seattle Children's and Microsoft AI for Good Lab. Among the many topics attendees discussed was groundbreaking new research that suggests genetic testing at birth may hold the promise of detecting SIDS risk and potentially other causes of sudden death later in life.

SIDS is the leading cause of death of infants one month to one year old in the US and other developed countries. The new findings, resulting from a partnership between the Center for Integrative Brain Research at Seattle Children's and data scientists at Microsoft, come from the first-ever whole genome sequencing of 145 infants who succumbed to SIDS. The Aaron Matthew SIDS Research Foundation funds the database, which is maintained and managed at Seattle Children's Research Institute.

In a study, soon to be published in the American Journal of Medical Genetics, researchers identify novel genes associated with Sudden Unexplained Infant Deaths (SUID), which includes SIDS. Some of these genes are important for detecting and responding to hypoxia (low levels of oxygen in body tissues). These vulnerabilities could increase a child's susceptibility to death caused by sleeping face down.

For decades, medical professionals have found a correlation between the sleeping position of infants and SIDS. This research suggests why that risk exists for certain infants. The study also identified genes associated with Sudden Cardiac Death, which could also explain why some children are particularly vulnerable to SIDS. However, because not every child with this vulnerability will succumb to SIDS, those who survive may be vulnerable to Sudden Cardiac Death later in life. Sudden Cardiac Death is responsible for 360,000 fatalities annually in the United States.

"Scientific research sometimes leads to surprises," said Jan-Marino Ramirez, PhD, Director of the Center for Integrative Brain Research at Seattle Children's. "One surprise in our research leads to an exciting question: what if a genetic test at birth could not only predict the risk of SIDS, but also terminal cardiac problems well into adulthood? Preventative treatments exist for these dangerous conditions, and early detection could save lives."

John Kahan, the former Microsoft Vice President and Chief Data Analytics Officer who co-founded The Aaron Matthew SIDS Research Foundation with his wife Heather Kahan, organized the first SIDS Summit while he was working at Microsoft.

"Thanks to this collaboration between world-class researchers and data scientists armed with cutting edge AI, we can now use genetic data to predict children at high risk of SIDS, which claims approximately 3,200 children a year," Kahan said. "We are getting far closer to enabling medical professionals to bring preventative treatments to children who exhibit these risks, and potentially to far more people: those susceptible to Sudden Cardiac Death later in life."

Juan M. Lavista Ferres, PhD, MS, Microsoft Chief Data Scientist and the Director of the AI For Good Lab at Microsoft, is among those who hosted the Summit this week. Dr. Lavista was the lead researcher who used big data to estimate that 22% of Sudden or Unexplained Infant Deaths in the United States can be directly attributed to maternal smoking during pregnancy, which led to the assertion that SIDS rates can be reduced through education programs about the risks of smoking during pregnancy.

"The learning from this collaboration with SIDS researchers is proving, once again, the power AI has to scale human expertise," Dr. Lavista said. "It's a privilege for my team to put AI in the hands of some of the leading medical researchers in the world, and to see the number of potentially life-saving outcomes that flow from their work, partly through their access to AI."

The new findings on the mechanisms of SIDS were among many issues discussed at the Seventh Annual SIDS Summit, hosted by Ramirez, Kahan, and Lavista. Other research discussed included:

About the Aaron Matthew SIDS Research Guild at Seattle Children's Hospital

The Aaron Matthew SIDS Research Guild at Seattle Children's Hospital was named in honor of Aaron Matthew Kahan, son of Heather Kahan and John B. Kahan. Aaron died of SIDS days after his birth in 2003. The Guild board includes leaders from Seattle Children's Hospital's Integrative Brain Research Institute, Microsoft, Accenture, Marriott Hotels, Adobe, Tata Consulting Services, and VMLY&R.

SOURCE Aaron Matthew SIDS Research Guild

8 Things Most Data Science Programs Don’t Teach (But You Should Know) Part 2 – Towards Data Science

MIT calls this the missing semester of your CS education

What data science and software engineering have in common is writing code. But while code is the main outcome of software engineering, data science projects typically end with models, results, and reports. Consequently, in data science the quality, structure, and delivery of code is often an afterthought at best.

The implicit expectation with data science projects is that the results reported at the end can be trusted.

This means that if someone asked you to re-run your own or somebody else's analysis, you would be able to obtain the same results, regardless of how much time has passed since you first performed the analysis.

Similarly, if you are developing a component for a product, the implicit expectation is that the component you developed represents the best possible performance given what is reasonably possible within the requirements of the product.

These statements may seem obvious, but satisfying both expectations can be quite difficult.

If you don't believe me, think about your past projects.

Create Mixtures of Experts with MergeKit | by Maxime Labonne | Mar, 2024 – Towards Data Science

MoEs also come with their own set of challenges, especially in terms of fine-tuning and memory requirements. The fine-tuning process can be difficult due to the model's complexity, with the need to balance expert usage during training to properly train the gating weights to select the most relevant ones. In terms of memory, even though only a fraction of the total parameters are used during inference, the entire model, including all experts, needs to be loaded into memory, which requires high VRAM capacity.

More specifically, there are two essential parameters when it comes to MoEs:

Historically, MoEs have underperformed dense models. However, the release of Mixtral-8x7B in December 2023 shook things up and showed impressive performance for its size. Additionally, GPT-4 is also rumored to be an MoE, which would make sense as it would be a lot cheaper to run and train for OpenAI compared to a dense model. In addition to these recent excellent MoEs, we now have a new way of creating MoEs with MergeKit: frankenMoEs, also called MoErges.

The main difference between true MoEs and frankenMoEs is how they're trained. In the case of true MoEs, the experts and the router are trained jointly. In the case of frankenMoEs, we upcycle existing models and initialize the router afterward.

In other words, we copy the weights of the layer norm and self-attention layers from a base model, and then copy the weights of the FFN layers found in each expert. This means that besides the FFNs, all the other parameters are shared. This explains why Mixtral-8x7B with eight experts doesn't have 8*7 = 56B parameters, but about 45B. This is also why using two experts per token gives the inference speed (FLOPs) of a 12B dense model instead of 14B.
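To make "two experts per token" concrete, here is a minimal sketch of sparse top-k routing for a single token. It is an illustration of the general idea only, not MergeKit's or Mixtral's implementation: the function names, the toy scalar "experts", and the router scores are all assumptions made for the example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, gate_logits, experts, k=2):
    """Combine the outputs of the k selected experts for one token."""
    return sum(weight * experts[i](x) for i, weight in route_top_k(gate_logits, k))


if __name__ == "__main__":
    # Four toy "experts": each scalar function stands in for an expert FFN.
    experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
    logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token
    print(route_top_k(logits))
    print(moe_layer(3.0, logits, experts))
```

Only the selected experts are evaluated for a token, which is why compute scales with the number of active experts while memory still has to hold all of them.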

FrankenMoEs are about selecting the most relevant experts and initializing them properly. MergeKit currently implements three ways of initializing the routers:

As you can guess, the hidden initialization is the most efficient to correctly route the tokens to the most relevant experts. In the next section, we will create our own frankenMoE using this technique.

To create our frankenMoE, we need to select n experts. In this case, we will rely on Mistral-7B thanks to its popularity and relatively small size. However, eight experts like in Mixtral is quite a lot, as we need to fit all of them in memory. For efficiency, I'll only use four experts in this example, with two of them engaged for each token and each layer. In this case, we will end up with a model with 24.2B parameters instead of 4*7 = 28B parameters.
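The parameter arithmetic can be checked with a back-of-the-envelope sketch. It assumes a dense 7B-class model of about 7.24B parameters (a figure not given in the article) and ignores the small router weights; the split between shared and FFN parameters is then inferred from the 24.2B total quoted above rather than read from any model file.

```python
# Rough parameter counts for a sparse MoE assembled from 7B-class models.
# Assumption (not from the article): one dense expert model has ~7.24B parameters.
DENSE_TOTAL_B = 7.24        # assumed size of one dense model, in billions
FRANKEN_4X_TOTAL_B = 24.2   # four-expert total quoted in the article, in billions

# total(n) = shared + n * ffn, and total(1) = DENSE_TOTAL_B, so:
ffn_b = (FRANKEN_4X_TOTAL_B - DENSE_TOTAL_B) / (4 - 1)
shared_b = DENSE_TOTAL_B - ffn_b


def moe_total_params(n_experts: int) -> float:
    """Parameters held in memory: shared layers plus every expert's FFN."""
    return shared_b + n_experts * ffn_b


def moe_active_params(experts_per_token: int) -> float:
    """Parameters used per token: shared layers plus the selected FFNs."""
    return shared_b + experts_per_token * ffn_b


if __name__ == "__main__":
    print(f"shared: {shared_b:.2f}B, FFN per expert: {ffn_b:.2f}B")
    for n in (4, 8):
        print(f"{n} experts: {moe_total_params(n):.1f}B total")
    print(f"2 experts per token: {moe_active_params(2):.1f}B active")
```

Under these assumptions the eight-expert total lands in the mid-40s of billions and the two-expert active count near 12-13B, the same ballpark as the Mixtral figures mentioned earlier.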

Here, our goal is to create a well-rounded model that can do pretty much everything: write stories, explain articles, code in Python, etc. We can decompose this requirement into four tasks and select the best expert for each of them. This is how I decomposed it:

Now that we've identified the experts we want to use, we can create the YAML configuration that MergeKit will use to create our frankenMoE. This uses the mixtral branch of MergeKit. You can find more information about how to write the configuration on this page. Here is our version:

For each expert, I provide five basic positive prompts. You can be a bit fancier and write entire sentences if you want. The best strategy consists of using real prompts that should trigger a particular expert. You can also add negative prompts to do the opposite.

Once this is ready, you can save your configuration as config.yaml. In the same folder, we will download and install the mergekit library (mixtral branch).

If your computer has enough RAM (roughly 24-32 GB of RAM), you can run the following command:

If you don't have enough RAM, you can shard the models instead as follows (it will take longer):

This command automatically downloads the experts and creates the frankenMoE in the merge directory. For the hidden gate mode, you can also use the --load-in-4bit and --load-in-8bit options to compute hidden states with lower precision.

Alternatively, you can copy your configuration into LazyMergekit, a wrapper I made to simplify model merging. In this Colab notebook, you can input your model name, select the mixtral branch, specify your Hugging Face username/token, and run the cells. After creating your frankenMoE, it will also upload it to the Hugging Face Hub with a nicely formatted model card.

I called my model Beyonder-4x7B-v3 and created GGUF versions of it using AutoGGUF. If you can't run GGUF versions on your local machine, you can also perform inference using this Colab notebook.

To get a good overview of its capabilities, it has been evaluated on three different benchmarks: Nous benchmark suite, EQ-Bench, and the Open LLM Leaderboard. This model is not designed to excel in traditional benchmarks, as the code and role-playing models generally do not apply to those contexts. Nonetheless, it performs remarkably well thanks to strong general-purpose experts.

Nous: Beyonder-4x7B-v3 is one of the best models on Nous benchmark suite (evaluation performed using LLM AutoEval) and significantly outperforms the v2. See the entire leaderboard here.

EQ-Bench: It's also the best 4x7B model on the EQ-Bench leaderboard, outperforming older versions of ChatGPT and Llama-2-70b-chat. Beyonder is very close to Mixtral-8x7B-Instruct-v0.1 and Gemini Pro, which are (supposedly) much bigger models.

Open LLM Leaderboard: Finally, it's also a strong performer on the Open LLM Leaderboard, significantly outperforming the v2 model.

On top of these quantitative evaluations, I recommend checking the model's outputs in a more qualitative way using a GGUF version on LM Studio. A common way of testing these models is to gather a private set of questions and check their outputs. With this strategy, I found that Beyonder-4x7B-v3 is quite robust to changes in the user and system prompts compared to other models, including AlphaMonarch-7B. This is pretty cool as it improves the usefulness of the model in general.

FrankenMoEs are a promising but still experimental approach. The trade-offs, like higher VRAM demand and slower inference speeds, can make it challenging to see their advantage over simpler merging techniques like SLERP or DARE TIES. In particular, when you use frankenMoEs with just two experts, they might not perform as well as if you had simply merged the two models. However, frankenMoEs excel in preserving knowledge, which can result in stronger models, as demonstrated by Beyonder-4x7B-v3. With the right hardware, these drawbacks can be effectively mitigated.

In this article, we introduced the Mixture of Experts architecture. Unlike traditional MoEs that are trained from scratch, MergeKit facilitates the creation of MoEs by ensembling experts, offering an innovative approach to improving model performance and efficiency. We detailed the process of creating a frankenMoE with MergeKit, highlighting the practical steps involved in selecting and combining different experts to produce a high-quality MoE.

Thanks for reading this article. I encourage you to try to make your own FrankenMoEs using LazyMergeKit: select a few models, create your config based on Beyonder's, and run the notebook to create your own models! If you liked this article, please follow me on Hugging Face and X/Twitter @maximelabonne.

A Collection Of Free Data Science Courses From Harvard, Stanford, MIT, Cornell, and Berkeley – KDnuggets

Free courses are very popular on our platform, and we've received many requests from both beginners and professionals for more resources. To meet the demand of aspiring data scientists, we are providing a collection of free data science courses from the top universities in the world.

University professors and technical assistants teach these courses and cover topics such as math, probability, programming, databases, data analytics, data processing, data analysis, and machine learning. By the end of these courses, you'll have gained the skills required to master data science and become job-ready.

Link: 5 Free University Courses to Learn Computer Science

If you're considering switching to a career in data, it's crucial to learn computer science fundamentals. Many data science job applications include a coding interview section where you'll need to solve problems using a programming language of your choice.

This compilation offers some of the best free university courses to help you master foundations like computer hardware/software. You will learn Python, data structures and algorithms, as well as essential tools for software engineering.

Link: 5 Free University Courses to Learn Python

A curated list of five online courses offered by renowned universities like Harvard, MIT, Stanford, University of Michigan, and Carnegie Mellon University. These courses are designed to teach Python programming to beginners, covering fundamentals such as variables, control structures, data structures, file I/O, regular expressions, object-oriented programming, and computer science concepts like recursion, sorting algorithms, and computational limits.

Link: 5 Free University Courses to Learn Databases and SQL

It is a list of free database and SQL courses offered by renowned universities such as Cornell, Harvard, Stanford, and Carnegie Mellon University. These courses cover a wide range of topics, from the basics of SQL and relational databases to advanced concepts like NoSQL, NewSQL, database internals, data models, database design, distributed data processing, transaction processing, query optimization, and the inner workings of modern analytical data warehouses like Google BigQuery and Snowflake.

Link: 5 Free University Courses on Data Analytics

Compilation of online courses and resources available for individuals interested in pursuing data science, machine learning, and artificial intelligence. It highlights courses from prestigious institutions like Harvard, MIT, Stanford, Berkeley, covering topics such as Python for data science, statistical thinking, data analytics, mining massive data sets, and an introduction to artificial intelligence.

Link: 5 Free University Courses to Learn Data Science

A comprehensive list of free online courses from Harvard, MIT, and Stanford, designed to help individuals learn data science from the ground up. It begins with an introduction to Python programming and data science fundamentals, followed by courses covering computational thinking, statistical learning, and the mathematics behind data science concepts. The courses cover a wide range of topics, including programming, statistics, machine learning algorithms, dimensionality reduction techniques, clustering, and model evaluation.

Link: 9 Free Harvard Courses to Learn Data Science - KDnuggets

It outlines a data science learning roadmap consisting of 9 free courses offered by Harvard. It starts with learning programming basics in either R or Python, followed by courses on data visualization, probability, statistics, and productivity tools. It then covers data pre-processing techniques, linear regression, and machine learning concepts. The final step involves a capstone project that allows learners to apply the knowledge gained from the previous courses to a hands-on data science project.

Free online courses from top universities are an incredible resource for anyone looking to break into the field of data science or upgrade their current skills. This curated collection contains a list of courses that covers all the key areas - from core computer science and programming with Python, to databases and SQL, data analytics, machine learning, and full data science curricula. With courses taught by world-class professors, you can gain comprehensive knowledge and hands-on experience with the latest data science tools and techniques used in industry.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

Text Embeddings, Classification, and Semantic Search | by Shaw Talebi – Towards Data Science

Imports

We start by importing dependencies and the synthetic dataset.

Next, we'll generate the text embeddings. Instead of using the OpenAI API, we will use an open-source model from the Sentence Transformers Python library. This model was specifically fine-tuned for semantic search.

To see the different resumes in the dataset and their relative locations in concept space, we can use PCA to reduce the dimensionality of the embedding vectors and visualize the data on a 2D plot (code is on GitHub).

From this view we see the resumes for a given role tend to clump together.

Now, to do a semantic search over these resumes, we can take a user query, translate it into a text embedding, and then return the nearest resumes in the embedding space. Here's what that looks like in code.
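A minimal sketch of that lookup, assuming the embeddings have already been computed. The toy three-dimensional vectors and the helper names `cosine_similarity` and `semantic_search` are illustrative and are not taken from the article's code, which uses a Sentence Transformers model to produce the vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return the indices of the top_k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical embeddings; a real model returns hundreds of dimensions.
    roles = ["Data Engineer", "Data Scientist", "ML Engineer", "Data Engineer"]
    resume_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.2, 0.5, 0.8], [0.8, 0.2, 0.1]]
    query_vec = [1.0, 0.1, 0.0]  # stands in for the embedded user query
    for idx in semantic_search(query_vec, resume_vecs, top_k=2):
        print(roles[idx])
```

Cosine similarity is one common choice of distance here; other metrics such as Euclidean distance or dot product could be swapped in without changing the overall flow.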

Printing the roles of the top 10 results, we see almost all are data engineers, which is a good sign.

Let's look at the resume of the top search result.

Although this is a made-up resume, the candidate likely has all the necessary skills and experience to fulfill the user's needs.

Another way to look at the search results is via the 2D plot from before. Here's what that looks like for a few queries (see plot titles).

While this simple search example does a good job of matching particular candidates to a given query, it is not perfect. One shortcoming is when the user query includes a specific skill. For example, in the query "Data Engineer with Apache Airflow experience", only 1 of the top 5 results has Airflow experience.

This highlights that semantic search is not better than keyword-based search in all situations. Each has its strengths and weaknesses.

Thus, a robust search system will employ so-called hybrid search, which combines the best of both techniques. While there are many ways to design such a system, a simple approach is applying keyword-based search to filter down results, followed by semantic search.
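The simple hybrid approach described above (keyword filter first, semantic ranking second) might be sketched as follows. The function name `hybrid_search`, the substring-based keyword match, and the toy two-dimensional embeddings are assumptions made for illustration; a production system would more likely use a proper keyword index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hybrid_search(query_vec, keyword, docs, top_k=5):
    """Keep documents whose text contains the keyword, then rank by similarity.

    docs is a list of (text, embedding) pairs; returns matching texts, best first.
    """
    keyword = keyword.lower()
    candidates = [(text, vec) for text, vec in docs if keyword in text.lower()]
    candidates.sort(key=lambda doc: cosine_similarity(query_vec, doc[1]), reverse=True)
    return [text for text, _ in candidates[:top_k]]


if __name__ == "__main__":
    # Hypothetical resumes with made-up embeddings.
    docs = [
        ("Data engineer, built Airflow pipelines", [0.8, 0.2]),
        ("Data engineer, Spark and Kafka", [0.9, 0.1]),
        ("Analyst who scheduled reports with Airflow", [0.3, 0.7]),
    ]
    query_vec = [1.0, 0.0]  # stands in for the embedded query
    for text in hybrid_search(query_vec, "airflow", docs):
        print(text)
```

Filtering first guarantees that every result mentions the requested skill, while the semantic step still orders the survivors by how well they match the rest of the query.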

Two additional strategies for improving search are using a Reranker and fine-tuning text embeddings.

A Reranker is a model that directly compares two pieces of text. In other words, instead of computing the similarity between pieces of text via a distance metric in the embedding space, a Reranker computes such a similarity score directly.

Rerankers are commonly used to refine search results. For example, one can return the top 25 results using semantic search and then refine to the top 5 with a Reranker.

Fine-tuning text embeddings involves adapting an embedding model for a particular domain. This is a powerful approach because most embedding models are based on a broad collection of text and knowledge. Thus, they may not optimally organize concepts for a specific industry, e.g. data science and AI.

Although everyone seems focused on the potential for AI agents and assistants, recent innovations in text-embedding models have unlocked countless opportunities for simple yet high-value ML use cases.

Here, we reviewed two widely applicable use cases: text classification and semantic search. Text embeddings enable simpler and cheaper alternatives to LLM-based methods while still capturing much of the value.

Transform Accelerator Announces Data Science and AI Startups Selected for Cohort 3 – Polsky Center for … – Polsky Center for Entrepreneurship and…

Published on Thursday, March 21, 2024

Following the success of its first and second cohorts, Transform adds seven new early-stage companies utilizing advances in data science and AI.

The University of Chicago's Polsky Center for Entrepreneurship and Innovation and Data Science Institute today announced the seven early-stage companies accepted into the third cohort of the Transform accelerator for data science and AI startups.

Powered by the Polsky Center's Deep Tech Ventures, Transform provides full-spectrum support for the startups accepted into the accelerator, including access to business and technical training, industry mentorship, venture capital connections, and funding opportunities.

The seven startups will receive approximately $250,000 in total investment, including $25,000 in funding, credits for Google for Startups, workspace in the Polsky Exchange on Chicago's South Side, and access to industry mentors, technical advisors and student talent from the University of Chicago Department of Computer Science, Data Science Institute (DSI), and the Chicago Booth School of Business.

Transform Cohort 3:

"I am excited to welcome cohort three into Transform. This cycle was particularly competitive and we are delighted with the seven companies we selected," said Shyama Majumdar, director of Transform. "We have a good mix of healthcare, construction, manufacturing, and fintech companies represented as we continue to see generative AI startups leading the way, which is reflected in this cohort. After the success of cohort 2, we are ready to run with cohort 3 and help pave their way to success."

The accelerator launched in Spring 2023 with its inaugural cohort and those startups are already seeing success. Echo Labs, a transcription platform in the previous cohort, has scaled up, hiring software engineers to meet the demand of partnerships with 150 universities to pilot their product. Blackcurrant, an online business-to-business marketplace for buying and selling hydrogen and member of the first cohort, recently was awarded a $250,000 co-investment from the George Shultz Innovation Fund after participating in the program.

"The continued success of Transform startups has been very encouraging," said David Uminsky, executive director of the Data Science Institute. "The wide range of sectors this new cohort serves demonstrates AI's increasing impact on business."

Transform is partly supported by corporate partners McAndrews, Held & Malloy, Ltd and venture partner, True Blue Partners, a Silicon Valley-based venture capital firm investing in early-stage AI companies, founded by Chicago Booth alum Sunil Grover, MBA '99.

"Transform is providing the fertile ground necessary to help incubate the next generation of market leaders," said Grover, who also is a former engineer with nearly two decades of experience helping build companies as an entrepreneur, investor, and advisor. "Advancements in deep tech present a unique interdisciplinary opportunity to re-imagine every aspect of the business world. This, I believe, will lead to creation of innovative new businesses that are re-imagined, ground up, to apply the capabilities these new technologies can enable."

Vanderbilt to establish a college dedicated to computing, AI and data science – University Business

Provost and Vice Chancellor for Academic Affairs C. Cybele Raver said a dedicated college will enable Vanderbilt to keep making groundbreaking discoveries at the intersections of computing and other disciplines and will more effectively leverage advanced computing to address some of society's most pressing challenges.

Many of the specific details about the college, including its departments, degree programs and research infrastructure, will be informed by the recommendations of a task force on connected computing composed of faculty from across the university.

Cracking the Apache Spark Interview: 80+ Top Questions and Answers for 2024 – Simplilearn

Apache Spark is a unified analytics engine for processing large volumes of data. It can run workloads up to 100 times faster than Hadoop MapReduce and offers over 80 high-level operators that make it easy to build parallel apps. Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access data from multiple sources.

This article covers the most important Apache Spark interview questions that you might face in a Spark interview. The questions have been grouped into sections based on the various components of Apache Spark, and after going through this article you should be able to answer most of the questions asked in your next Spark interview.

| Apache Spark | MapReduce |
| --- | --- |
| Spark processes data in batches as well as in real-time | MapReduce processes data in batches only |
| Spark runs almost 100 times faster than Hadoop MapReduce | Hadoop MapReduce is slower when it comes to large scale data processing |
| Spark stores data in the RAM i.e. in-memory. So, it is easier to retrieve it | Hadoop MapReduce data is stored in HDFS and hence takes a long time to retrieve the data |
| Spark provides caching and in-memory data storage | Hadoop is highly disk-dependent |

Apache Spark has 3 main categories that comprise its ecosystem. Those are:

This is one of the most frequently asked spark interview questions, and the interviewer will expect you to give a thorough answer to it.

Spark applications run as independent processes that are coordinated by the SparkSession object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply operations repeatedly to the data so they can benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or can be saved to the disk.

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark and are embedded in Spark Core. RDDs are immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel. RDDs are split into partitions, which can be processed on different nodes of a cluster.

RDDs are created by either transformation of existing RDDs or by loading an external dataset from stable storage like HDFS or HBase.

Here is what the architecture of an RDD looks like:

When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action. This behavior, known as lazy evaluation, helps optimize the overall data processing workflow.
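The idea can be illustrated without Spark at all. The following is a toy, pure-Python stand-in for an RDD, written only to show the pattern; the class name `LazyDataset` and its methods are hypothetical and are not part of the Spark API.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, pending=()):
        self._source = source            # any iterable of input records
        self._pending = tuple(pending)   # recorded (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new dataset and does no work yet.
        return LazyDataset(self._source, self._pending + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._pending + (("filter", predicate),))

    def collect(self):
        # Action: replays every recorded step over the source data.
        results = []
        for record in self._source:
            keep = True
            for kind, fn in self._pending:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                results.append(record)
        return results


if __name__ == "__main__":
    calls = []

    def double(x):
        calls.append(x)  # lets us observe when the function actually runs
        return 2 * x

    lazy = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
    print(len(calls))      # nothing has run yet
    print(lazy.collect())  # the action triggers the recorded work
```

Because nothing runs until the action, an engine that works this way can see the whole chain of steps before executing it, which is what makes optimizations such as pipelining several transformations into one pass possible.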

Apache Spark stores data in-memory for faster processing and building machine learning models. Machine learning algorithms require multiple iterations and different conceptual steps to create an optimal model. Graph algorithms traverse through all the nodes and edges to generate a graph. Iterative workloads like these that need low latency can see increased performance from in-memory storage.

To trigger the clean-ups, you need to set the parameter spark.cleaner.ttl.

There are a total of 4 steps that can help you connect Spark to Apache Mesos.

Parquet is a columnar format that is supported by several data processing systems. With the Parquet file, Spark can perform both read and write operations.

Some of the advantages of having a Parquet file are:

Shuffling is the process of redistributing data across partitions that may lead to data movement across the executors. The shuffle operation is implemented differently in Spark compared to Hadoop.

Shuffling has 2 important compression parameters:

spark.shuffle.compress checks whether the engine would compress shuffle outputs or not

spark.shuffle.spill.compress decides whether to compress intermediate shuffle spill files or not

Shuffling occurs while joining two tables or while performing byKey operations such as groupByKey or reduceByKey.
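To see why a byKey operation forces data movement, here is a small pure-Python simulation of a hash-partitioned shuffle followed by grouping. It is a conceptual sketch only: the function names are hypothetical, partitions are plain lists, and Spark's actual shuffle (with map-side files, spilling, and compression) is far more involved.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Deterministic hash partitioner (crc32 avoids Python's per-process hash salt)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) records so equal keys land in the same partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            shuffled[partition_for(key, num_partitions)].append((key, value))
    return shuffled


def group_by_key(partitions, num_partitions):
    """Shuffle, then group values per key inside each output partition."""
    grouped = []
    for partition in shuffle_by_key(partitions, num_partitions):
        groups = defaultdict(list)
        for key, value in partition:
            groups[key].append(value)
        grouped.append(dict(groups))
    return grouped


if __name__ == "__main__":
    # Two input partitions; the key "a" appears in both of them.
    data = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
    for index, part in enumerate(group_by_key(data, num_partitions=2)):
        print(index, part)
```

Records with the same key start out on different partitions, so they must be moved to a common partition before they can be grouped, and that movement is the shuffle.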

Spark uses a coalesce method to reduce the number of partitions in a DataFrame.

Suppose you want to read data from a CSV file into an RDD having four partitions.

This is how a filter operation is performed to remove all the multiples of 10 from the data.

The RDD has some empty partitions. It makes sense to reduce the number of partitions, which can be achieved by using coalesce.

This is what the resulting RDD would look like after applying coalesce.
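The sequence above can be simulated in plain Python, with a list of lists standing in for the four partitions. The sample numbers and the helper names are made up for illustration, and the merge rule is a simplification: Spark's own coalesce also takes data locality into account when deciding which partitions to combine.

```python
def filter_partitions(partitions, predicate):
    """Apply a filter inside each partition; the partition count is unchanged."""
    return [[x for x in partition if predicate(x)] for partition in partitions]


def coalesce(partitions, num_partitions):
    """Merge existing partitions into at most num_partitions without reshuffling records."""
    if num_partitions >= len(partitions):
        return [list(p) for p in partitions]  # this sketch never increases the count
    merged = [[] for _ in range(num_partitions)]
    for index, partition in enumerate(partitions):
        # Map contiguous runs of old partitions onto each new partition.
        target = index * num_partitions // len(partitions)
        merged[target].extend(partition)
    return merged


if __name__ == "__main__":
    # Hypothetical data already split into four partitions.
    partitions = [[1, 2, 3], [10, 20, 30], [4, 5, 6], [40, 50, 60]]

    # Remove all multiples of 10; two partitions become empty.
    filtered = filter_partitions(partitions, lambda x: x % 10 != 0)
    print(filtered)

    # Reduce to two partitions by merging neighbours.
    print(coalesce(filtered, 2))
```

Whole partitions are merged rather than individual records being redistributed, which is why coalesce is typically cheaper than a full repartition when only reducing the partition count.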

Consider the following cluster information:

Here is how to identify the number of cores:

To calculate the number of executors:

Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:

There are 2 ways to convert a Spark RDD into a DataFrame:

import com.mapr.db.spark.sql._

val df = sc.loadFromMapRDB()
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()

You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object:

def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
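A minimal sketch of the createDataFrame route is shown here. The column names, sample rows, and object name are hypothetical; only `Row`, `StructType`, `StructField`, and `createDataFrame` come from the Spark SQL API.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RowRddToDataFrame {
  // Builds a DataFrame from an RDD[Row] plus an explicit schema.
  def build(spark: SparkSession): DataFrame = {
    // Step 1: an RDD of Rows (sample data, purely illustrative).
    val rowRdd = spark.sparkContext.parallelize(Seq(Row("Peter", 34), Row("Maria", 29)))
    // Step 2: a schema whose fields match the structure of each Row.
    val schema = StructType(Seq(
      StructField("first_name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)
    ))
    // Step 3: apply the schema to the RDD.
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-to-dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    try build(spark).printSchema()
    finally spark.stop()
  }
}
```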

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type. An RDD partitions its data across the nodes of the cluster and can recover from node failures.

The Resilient Distributed Dataset (RDD) in Spark supports two types of operations. These are:

A transformation generates a new RDD from existing RDDs in Spark. Each transformation takes an existing RDD as input and produces one or more RDDs as output. Because RDDs are immutable, the input RDDs are not changed.

Applying transformations also builds an RDD lineage that includes all parent RDDs of the final RDD. This lineage is also called the RDD operator graph or RDD dependency graph. It is a logical execution plan: a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.

An RDD action works on the actual dataset. Unlike a transformation, an action does not produce a new RDD; actions are RDD operations that return non-RDD values. These values are returned to the driver or written to external storage systems. Triggering an action is what sets the lazily defined RDDs in motion.

An action is therefore how data is sent from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.
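The difference between the two kinds of operation can be sketched as follows. The object name and the sample words are hypothetical. `map` is a transformation that only records lineage, while `reduce` is an action that runs the job and returns a plain value to the driver.

```scala
import org.apache.spark.sql.SparkSession

object LazyVsAction {
  // Sums the lengths of a few sample words using one transformation and one action.
  def totalLength(spark: SparkSession): Int = {
    val words = spark.sparkContext.parallelize(Seq("spark", "rdd", "dag", "spark"))
    // Transformation: returns a new RDD and computes nothing yet.
    val lengths = words.map(_.length)
    // The lineage recorded so far can be inspected without running a job.
    println(lengths.toDebugString)
    // Action: triggers execution and returns a non-RDD value (an Int) to the driver.
    lengths.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-vs-action-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(totalLength(spark)) // prints 16
    finally spark.stop()
  }
}
```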

This is another frequently asked Spark interview question. A lineage graph is a dependency graph between existing RDDs and new RDDs. It means that the dependencies between RDDs are recorded in a graph, rather than the original data.

An RDD lineage graph is needed when we want to compute a new RDD or recover lost data from a lost persisted RDD. Spark does not replicate data in memory by default, so if any data is lost, it can be rebuilt using the RDD lineage. The lineage graph is also called an RDD operator graph or RDD dependency graph.

A Discretized Stream (DStream) is a continuous sequence of RDDs and the basic abstraction in Spark Streaming. The RDDs in the sequence are all of the same type and together represent a continuous stream of data. Every RDD contains data from a specific interval.

DStreams in Spark take input from many sources, such as Kafka, Flume, Kinesis, or TCP sockets. A DStream can also be generated by transforming another input stream. DStreams give developers a high-level API and fault tolerance.

Caching, also known as persistence, is an optimization technique for Spark computations. Similar to RDDs, DStreams also allow developers to persist the stream's data in memory. That is, using the persist() method on a DStream will automatically persist every RDD of that DStream in memory. It helps to save interim partial results so they can be reused in subsequent stages.

For input streams that receive data over the network, the default persistence level is set to replicate the data to two nodes for fault tolerance.

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication costs.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value

res0: Array[Int] = Array(1, 2, 3)

Moving forward, let us understand the Spark interview questions for experienced candidates.

DataFrame can be created programmatically with three steps:

a) map(func)

b) transform(func)

c) filter(func)

d) count()

The correct answer is c) filter(func).

This is one of the most frequently asked Spark interview questions where the interviewer expects a detailed answer (and not just a yes or no!). Give as detailed an answer as possible here.

Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save the data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and resume from where it stopped.

There are 2 types of data for which we can use checkpointing in Spark.

Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.

Data Checkpointing: Here, we save the RDD to reliable storage because some stateful transformations need it. In these transformations, the upcoming RDD depends on the RDDs of previous batches.
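Both kinds of checkpointing can be sketched with the Spark Streaming API. Everything specific below is an assumption made for illustration: the checkpoint directory, the socket host and port, the five-second batch interval, and the object and function names. A production job would point the directory at fault-tolerant storage such as HDFS.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  // Hypothetical local directory; use fault-tolerant storage in a real deployment.
  val checkpointDir = "/tmp/spark-checkpoint"

  // Stateful update: each batch depends on the state carried over from earlier batches.
  def runningCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  // Builds a fresh context; only called when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir) // enables metadata and data checkpointing
    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int](runningCount _) // stateful, so checkpointing is required
    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise build a new context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

On restart after a failure, `getOrCreate` rebuilds the context from the saved metadata instead of calling `createContext` again.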

In networking, a sliding window controls the transmission of data packets between computers. The Spark Streaming library borrows the idea for windowed computations, in which transformations on RDDs are applied over a sliding window of data.
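A windowed computation might look like the sketch below. The durations and names are hypothetical choices. The one firm constraint from the Spark Streaming API is that the window length and the slide interval must both be multiples of the batch interval.

```scala
import org.apache.spark.streaming.{Duration, Seconds}
import org.apache.spark.streaming.dstream.DStream

object WindowSketch {
  // Illustrative durations: 10-second batches, a 30-second window, sliding every 10 seconds.
  val batchInterval: Duration = Seconds(10)
  val windowLength: Duration = Seconds(30)
  val slideInterval: Duration = Seconds(10)

  // Counts each word over the last 30 seconds of data, recomputed every 10 seconds.
  def windowedCounts(words: DStream[String]): DStream[(String, Int)] =
    words
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, windowLength, slideInterval)
}
```

With these values each result covers the three most recent batches, so consecutive windows overlap by two batches.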

DISK_ONLY - Stores the RDD partitions only on the disk

MEMORY_ONLY_SER - Stores the RDD as serialized Java objects, with one byte array per partition

MEMORY_ONLY - Stores the RDD as deserialized Java objects in the JVM. If the RDD is not able to fit in the memory available, some partitions won't be cached

OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory

MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the RDD is not able to fit in the memory, additional partitions are stored on the disk

MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk
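Choosing one of these levels is a single call to `persist`. The sketch below uses MEMORY_AND_DISK as an arbitrary example; the object name and sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Persists a small RDD with an explicit storage level and reports the level in effect.
  def persistAndReport(spark: SparkSession): StorageLevel = {
    val numbers = spark.sparkContext.parallelize(1 to 1000)
    // Partitions that do not fit in memory are written to disk instead of being dropped.
    numbers.persist(StorageLevel.MEMORY_AND_DISK)
    numbers.count() // first action materializes and caches the partitions
    numbers.getStorageLevel
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(persistAndReport(spark).description)
    finally spark.stop()
  }
}
```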

See the rest here:

Cracking the Apache Spark Interview: 80+ Top Questions and Answers for 2024 - Simplilearn

Adding Temporal Resiliency to Data Science Applications | by Rohit Pandey | Mar, 2024 – Towards Data Science

Modern applications almost exclusively store their state in databases and also read any state they require to perform their tasks from databases. We'll concern ourselves with adding resilience to the processes of reading from and writing to these databases, making them highly reliable.

The obvious way to do this is to improve the quality of the hardware and software comprising the database so our reads and writes never fail. But this runs into diminishing returns: once we're already at high availability, pouring more money in moves the needle only marginally. Adding redundancy to achieve high availability quickly becomes a much better strategy.

So, what does this high reliability via adding redundancy to the architecture look like? We remove single points of failure by spending more money on redundant systems. For example, we can maintain redundant copies of the data so that if one copy gets corrupted or damaged, the others can be used to repair it. Another example is having a redundant database that can be read from and written to when the primary one is unavailable. We'll call these kinds of solutions, where additional memory, disk space, hardware or other physical resources are allotted to ensure high availability, "spatial redundancy". But can we get high reliability (going beyond the characteristics of the underlying databases and other components) without spending any additional money? That's where the idea of temporal redundancy comes in.

If spatial redundancy is running with redundant infrastructure, then temporal redundancy is running more with existing infrastructure.

Temporal redundancy is typically much cheaper than spatial redundancy. It can also be easier to implement.
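One common way to "run more" on existing infrastructure is to retry a failed read or write after a pause. The sketch below is an assumption about what that could look like, not the article's own implementation: the helper name, the doubling delay, and the attempt limit are all illustrative choices.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object TemporalRetry {
  // Retries `op` up to `attemptsLeft` times, doubling the wait after each failure.
  // Returns the first success, or the last failure once attempts run out.
  @tailrec
  def retry[A](attemptsLeft: Int, delayMillis: Long)(op: () => A): Try[A] =
    Try(op()) match {
      case success @ Success(_) => success
      case Failure(_) if attemptsLeft > 1 =>
        Thread.sleep(delayMillis) // wait for the bad window in time to pass
        retry(attemptsLeft - 1, delayMillis * 2)(op)
      case failure @ Failure(_) => failure
    }
}
```

A database read would be wrapped as `TemporalRetry.retry(3, 100L)(() => readRow())`, where `readRow` stands for any hypothetical operation that may fail transiently. No extra hardware is involved; the only cost is added latency on the failing calls.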

The idea is that when reliability compromising events happen to our applications and databases, they tend to be restricted to certain windows in time. If the

Read more from the original source:

Adding Temporal Resiliency to Data Science Applications | by Rohit Pandey | Mar, 2024 - Towards Data Science

Digital Transformation in Finance: Challenges and Benefits – Data Science Central

Digital transformation is no longer a choice but a necessity for financial institutions looking to stay competitive in the modern business world. From improving customer experience to increasing operational efficiency and enhancing security, the benefits in finance are numerous. Still, with benefits come challenges and risks that must be addressed to ensure successful implementation. In this article, we will discuss the advantages and challenges of digital transformation in the finance sector, as well as successful examples of companies that have used it to their advantage.

Digital transformation in finance is the process of implementing advanced digital technologies to improve financial processes, services, and customer experiences. It involves integrating technologies such as big data analytics, cloud computing, artificial intelligence, blockchain, and robotic process automation to automate and streamline financial operations. This process aims to enhance efficiency, reduce costs, mitigate risks, and provide more personalized services to customers. By using digital technologies, financial institutions can gain a competitive advantage in the market and stay ahead of rapidly evolving customer needs and preferences.

The finance industry has traditionally been slow to adopt new technologies, but their arrival has made it important for financial institutions to embrace transformation. Digital transformation enables financial institutions to offer personalized services, reduce costs, increase efficiency, mitigate risks, and improve customer experiences. By embracing it, financial institutions can leverage data and analytics to make more informed decisions and enhance their operations. Digital transformation can also help financial institutions stay ahead of the competition by enabling them to create new products and services that cater to the evolving needs of their customers. Thus, digital transformation is crucial for financial institutions to stay relevant and thrive in today's competitive landscape.

Digital transformation is reshaping the financial industry, providing numerous benefits to both financial institutions and their customers. In this section, we will explore some of its key benefits in finance, including enhanced customer experience, increased efficiency, improved data analysis, enhanced security, and competitive advantage.

Digital transformation enhances customer experience: financial institutions can provide personalized services and improve accessibility through different digital channels. This can lead to increased customer satisfaction and loyalty.

Digital transformation can help financial institutions automate and streamline different processes, leading to cost savings, faster turnaround times, and improved accuracy.

It enables financial institutions to leverage advanced analytics tools and algorithms to make more informed decisions and identify new business opportunities.

Digital transformation can improve security by implementing advanced cybersecurity measures such as encryption, biometric authentication, and real-time monitoring. This can protect financial institutions from cyber risks and ensure the safety of customer data.

It can also provide financial institutions with a competitive advantage by enabling them to create new products and services that cater to the evolving needs of their customers. Financial institutions that adopt digital transformation are able to stay ahead of the competition and stay relevant in today's digital era.

Digital transformation in finance is revolutionizing the financial sector, with a broad range of impacts affecting businesses and customers alike. From the disruption of traditional business models to increased competition and greater personalization, the benefits and challenges of this transformation are far-reaching. In this section, we'll explore the major ways in which digital transformation is impacting the financial industry.

It's disrupting traditional business models in the financial industry by creating new ways of delivering financial services, such as peer-to-peer lending, robo-advisory services, and mobile payments. As a result, traditional financial institutions are facing intense competition from digital-only startups and fintech companies that are more adaptable and agile.

Digital transformation has significantly increased competition in the financial industry, as customers now have access to a wider range of financial services and providers. This has forced traditional financial institutions to improve their services, reduce costs, and innovate to remain competitive.

It has enabled financial institutions to automate and streamline different processes, resulting in faster turnaround times, reduced costs, and enhanced accuracy. For example, digital processes can help financial institutions handle customer onboarding and loan processing more efficiently.

It has also enabled personalized services based on customer experiences and preferences, leading to increased customer satisfaction and loyalty. By using data analytics, financial institutions can offer personalized investment advice and customized product recommendations.

Digital transformation in finance has made financial services more convenient and accessible for customers, who can now access their accounts and conduct transactions through multiple digital channels such as mobile apps, online apps, and chatbots.

It has also brought new security risks to the financial industry, as financial transactions and customer data are highly exposed to cyber threats. Financial institutions must implement robust security measures to protect themselves and their customers from potential cyber attacks.

Digital transformation in finance isn't without its challenges and risks. In this section, we will explore some of the common obstacles that financial institutions face when undergoing this process.

One of the common challenges in digital transformation is resistance to change from employees and customers. It isn't easy to introduce new technologies and processes, and some individuals may feel uncomfortable or threatened by the changes. Proper communication and training are necessary to ensure a smooth transition.

The adoption of new technologies may require the replacement or integration of legacy systems and processes. These systems can be outdated and incompatible with modern tools, which can create obstacles and delays in digital transformation. Upgrading legacy systems and processes can be expensive and time-consuming, but it's necessary to ensure a smooth transition.

Digital transformation generates an enormous amount of data, and managing that data can be a significant challenge for financial institutions. Data management includes collecting, processing, storing, and analyzing data, which can be time-consuming and require significant resources. Effective data management is essential to realize the full benefits of digital transformation.

This process introduces new cybersecurity risks, including data breaches, phishing attacks, and ransomware. Financial institutions must take adequate measures to protect themselves and their customers from these risks. This includes implementing strong cybersecurity policies, training employees on best practices, and investing in cybersecurity technologies.

Digital transformation has become a necessity for financial institutions to remain competitive in today's market. While there are challenges and risks associated with digital transformation, the benefits are numerous, including enhanced customer experience, increased efficiency, and improved data analysis. Successful examples such as JPMorgan Chase, Ally Financial, Capital One, Goldman Sachs, and Mastercard show how digital transformation can lead to improved business outcomes.

With the right strategy and implementation approach, financial institutions can navigate the challenges and reap the rewards of digital transformation. At Aeologic Technologies, we strive to provide innovative solutions that enable financial institutions to achieve their digital transformation objectives and stay ahead of the curve.

Read this article:

Digital Transformation in Finance: Challenges and Benefits - Data Science Central