Category Archives: Data Science

Read This Before You Take Any Free Data Science Course – KDnuggets

In today's digital age, the quote by Michael Hakvoort, "If you're not paying for the product, then you are the product", has never been more relevant. While we often think of this in relation to social media platforms like Facebook, it also applies to seemingly harmless free resources such as YouTube courses.

Sure, the platform earns revenue through ads, but what about the time, energy, and motivation you invest? As data becomes increasingly valuable, it's essential to carefully evaluate the potential impact of free data science courses on your learning journey.

With so many options available, it can be overwhelming to determine which ones will provide real value. That's why taking a step back to consider some critical factors before diving into any free resource is crucial. By doing so, you'll ensure that you make the most out of your learning experience while avoiding common pitfalls associated with free courses.

Free courses often provide a one-size-fits-all curriculum, which might not align with your specific learning needs or skill level. They might cover fundamental concepts but lack the depth required for a comprehensive understanding or for tackling complex, real-world problems. Some free courses may have all the necessary ingredients to solve real-world data problems, but they lack structure, leaving you confused about where to start.

Learning a programming language alone can be challenging, especially if you come from a non-technical background. Data Science is a field that demands a hands-on approach. The free courses often offer limited opportunities for interactive learning, such as live coding sessions, quizzes, projects, or instructor feedback. This passive learning experience might prevent you from applying concepts effectively, and eventually, you will give up on learning.

The internet is flooded with free courses, making it challenging to discern the quality and credibility of the content. Some might be outdated or taught by individuals with limited expertise (Fake Gurus). Investing your time in a course that doesn't offer accurate or up-to-date information can be counterproductive.


Unlike paid courses, free resources do not come with external accountability measures such as deadlines or grades, making it easy to lose momentum and abandon the course midway. The lack of financial commitment means that students must rely solely on their internal drive and discipline to stay motivated and committed to completing the course. College is a good example of this: students think long and hard before leaving college because of the costs involved. Most students complete their bachelor's degree in part because they have taken out a student loan and need to pay it back.

Networking is a significant part of building a career in data science. Free courses typically lack the community aspect found in paid programs, such as peer interaction, mentorship, or alumni networks, which are invaluable for career growth and opportunities. There are Slack and Discord groups available but they are usually community-driven and may be inactive. However, in a paid course, there are moderators and community managers who are responsible for making networking easier between students.

Paid courses often provide career services, such as resume reviews, certification, job placement assistance, and interview preparation. These services are essential for individuals transitioning into a data science role but are typically unavailable in free programs. It is crucial to have guidance throughout the hiring process and know how to handle technical interview questions.

While not always necessary, certifications can boost your resume and credibility. Free courses may offer certificates, but they often don't carry the same weight as those from accredited institutions (Harvard / Stanford) or recognized platforms. Employers might not value them as highly, which could impact your job prospects. Additionally, certification exams evaluate key skills essential for working with data in any job. They assess your coding, data management, data analysis, reporting, and presentation abilities.

While free courses on data science can be a valuable resource for initial learning or brushing up on skills, they have certain limitations. It's important to consider these limitations against your personal goals, learning style, financial situation, and career aspirations. To ensure a well-rounded and effective learning experience, you should consider supplementing free resources with other forms of learning or investing in a paid bootcamp.

In the end, the most crucial factor that will help you become a professional data scientist is your dedication and focus on achieving your goals. You will not learn anything if you lack the drive required, no matter how much money you spend on a course. So, before you dive into the world of data, think carefully about whether this is the right path for you.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

Original post:

Read This Before You Take Any Free Data Science Course - KDnuggets

UC San Diego’s Online Master of Data Science Program Paves the Way Forward – University of California San Diego

Sharpening students' competitive edge in the rapidly expanding data science industry, the curriculum consists of three foundational courses, three core courses, three electives and one capstone project. The foundational and core courses offer students the essential background knowledge and central material needed to develop a fundamental understanding of the program as a whole.

Meanwhile, the distinctive elective choices allow students to tailor their experience according to their interests, including options like human-centered artificial intelligence as well as data fairness and ethics. During their capstone projects, students can explore diverse areas such as music, oceanography and computer vision.

"The online Master of Data Science program provides a deep understanding of the foundations of data science, while also ensuring that graduates have practical real-world skills and experiences," added Danks, who also teaches the program's data ethics course.

Recently, Martinez transitioned into a new role standing up machine learning pipelines for the Air Force. She formerly served as the Data Fabric Division Chief for the Chief Information Office at Space Systems Command out of Los Angeles. She attributes her success in both of these roles to the preparation provided by the MDS program. After she graduates, she hopes to continue advancing her career in data analysis, lead research projects under the mentorship of UC San Diego's professors, and earn a doctorate.

The Fall 2024 application is now open for those interested in joining the next cohort of the MDS program. The priority deadline is March 15, 2024; final deadline is June 5, 2024. For more information, eligibility requirements and more, please visit the Master of Data Science website.

Continue reading here:

UC San Diego's Online Master of Data Science Program Paves the Way Forward - University of California San Diego

CONSTANCE AND MARTIN SILVER ENDOWED PROFESSORSHIP (OPEN RANK) IN DATA SCIENCE AND … – The Chronicle of Higher Education

The New York University Silver School of Social Work (SSSW) invites applications for an endowed open rank professorship in data science and prevention. The endowed professor will play a leadership role in the School's newly established Constance and Martin Silver Center on Data Science and Social Equity. The appointment will begin on September 1, 2024.

The endowed professorship and Center have been created by a visionary gift to harness the emerging power of big data and related data sciences for transformational social impact. The overall mission of the Center is to help Silver and the social work profession at large develop data- and evidence-informed interventions to equitably tackle some of society's most pressing problems. A copy of the press release that provides more information regarding the gift is here.

NYU is a leader in data science, machine learning and artificial intelligence, and related areas, with the Center for Data Science, Courant Institute, Tandon School of Engineering, and Steinhardt School (PRIISM Center). Extensive supercomputer resources are available through NYU's High Performance Computing research infrastructure. The Silver School possesses deep expertise in a variety of areas including poverty, racism, policy analysis, prevention science, substance use and misuse, integrated health, mental health interventions and services, criminal justice, homelessness, child welfare, aging, and evidence-based practice. Leveraging the tools of data science, in alignment with existing expertise at the Silver School, represents a unique opportunity to elevate social work's visibility in this field.

SSSW is an international leader in social work research and education and offers outstanding support for scholarship. Located in Greenwich Village on Washington Square Park and in one of the world's great urban research universities, the SSSW is one of the 14 schools and colleges of NYU, the largest private university in the USA, with students from all 50 states and more than 100 nations. The diversity of the academic community and the incomparable resources of New York City enrich the academic programs and campus life of NYU and the SSSW.

In compliance with NYC's Pay Transparency Act, the annual base salary range for this position varies according to rank, with Assistant Professor salaries ranging from $75,000 to $115,000, Associate Professor salaries ranging from $100,000 to $150,000, and Full Professor salaries ranging from $175,000 to $275,000. New York University considers factors such as (but not limited to) the scope and responsibilities of the position, the candidate's work experience, education/training, key skills, internal peer equity, as well as market and organizational considerations when extending an offer.

Qualifications

A doctoral degree is required, along with a track record of scholarship in applying data science methodology to understand and forge advances in preventing and addressing social inequities. In addition to conducting research, the successful candidate may teach in the School's academic programs (BSW, MSW, PhD, and/or DSW), and be fully engaged in professional service at the SSSW and New York University (NYU). Applicants should exhibit a commitment to social and economic justice and possess an ability to facilitate conversations concerning privilege, oppression, and intersecting social identities.

Application Instructions

Review of applications will begin on a rolling basis. We invite candidates to apply via Interfolio and upload a letter of application, curriculum vitae, and an equity, diversity, and inclusion statement.

Applicants should apply using this link:

http://apply.interfolio.com/138951

Candidates should also upload a research statement, a teaching statement, and two representative publications as Additional Documents for their application to be considered complete. We request submission of materials at applicants' earliest convenience and would welcome confidential discussions with prospective candidates.

Equal Employment Opportunity Statement

For people in the EU, click here for information on your privacy rights under GDPR: http://www.nyu.edu/it/gdpr

NYU is an Equal Opportunity Employer and is committed to a policy of equal treatment and opportunity in every aspect of its recruitment and hiring process without regard to age, alienage, caregiver status, childbirth, citizenship status, color, creed, disability, domestic violence victim status, ethnicity, familial status, gender and/or gender identity or expression, marital status, military status, national origin, parental status, partnership status, predisposing genetic characteristics, pregnancy, race, religion, reproductive health decision making, sex, sexual orientation, unemployment status, veteran status, or any other legally protected basis. Women, racial and ethnic minorities, persons of minority sexual orientation or gender identity, individuals with disabilities, and veterans are encouraged to apply for vacant positions at all levels.

Sustainability Statement

NYU aims to be among the greenest urban campuses in the country and carbon neutral by 2040. Learn more at nyu.edu/sustainability

Follow this link:

CONSTANCE AND MARTIN SILVER ENDOWED PROFESSORSHIP (OPEN RANK) IN DATA SCIENCE AND ... - The Chronicle of Higher Education

Unlocking Data from Graphs: How to Digitise Plots and Figures with WebPlotDigitizer – Towards Data Science

Unlocking digital potential from static image data: going from paper to digital. Image generated using DALL·E by the author.

When working within data science, geoscience or petrophysics, we often come across data or charts that are only available in image form, such as those within publications. However, the underlying data is not provided, which makes it difficult to use in our interpretation or research.

This is where a tool like WebPlotDigitizer becomes really useful. This online tool helps us take those charts from images and turn them into data that we can use for further research and analysis.

There are a number of areas in petrophysics and geoscience where digitising charts can be very beneficial.

In this article, we will see how we can use the WebPlotDigitizer to extract data from a scatter plot made with synthetic data. In most cases, the quality of the figures we may deal with will likely be poorer.

Also, it is important to remember that when we use data from external sources, we should always cite where it came from, as well as how the data was obtained.

After capturing the image from the publication, it is time to load it into the WebPlotDigitizer.

To do this, we first navigate to:

File -> Load Image Files

Here, we can choose what type of plot we are dealing with.
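Once the axes are calibrated and the points are marked, WebPlotDigitizer can export the digitised values as a plain CSV file for further analysis. A minimal sketch of reading such an export back into Python follows; the column names and values are illustrative assumptions, not output from a real chart:

```python
import csv
import io

# WebPlotDigitizer exports digitised points as a simple two-column CSV.
# The header and values below are made up for illustration.
exported = """X,Y
1.02,10.5
2.48,19.9
3.97,30.2
"""

points = []
for row in csv.DictReader(io.StringIO(exported)):
    # Values arrive as strings; convert to float before analysis
    points.append((float(row["X"]), float(row["Y"])))

print(points)
```

In practice you would pass the exported file path to `open()` instead of the inline `io.StringIO` used here to keep the sketch self-contained.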

View original post here:

Unlocking Data from Graphs: How to Digitise Plots and Figures with WebPlotDigitizer - Towards Data Science

Analytics and Data Science News for the Week of January 12; Updates from Databricks, FICO, Power BI & More – Solutions Review

Solutions Review Executive Editor Tim King curated this list of notable analytics and data science news for the week of January 12, 2024.

Keeping tabs on all the most relevant analytics and data science news can be a time-consuming task. As a result, our editorial team aims to provide a summary of the top headlines from the last week in this space. Solutions Review editors will curate vendor product news, mergers and acquisitions, venture capital funding, talent acquisition, and other noteworthy analytics and data science news items.

Akkio's Build-On Package lets agencies create fully branded generative BI experiences for their clients with no lengthy integration project required. Digital agencies can white-label Akkio's products through a custom URL or a fully embeddable API to unveil new data service offerings to drive revenue, optimize efficiency, and provide added value to clients.

Read on for more.

Autoscaling is essential because it allows resources to dynamically adjust to fluctuating demands. This ensures optimal performance and cost-efficiency, as processing needs can vary significantly over time, and it helps maintain a balance between computational power and expenses without requiring manual intervention.

Read on for more.

EagleAI will help retailers and grocers across the globe better meet their customers' wants and needs individually, optimize promotional spending, increase ROI, and enable true one-to-one engagement that ultimately drives loyalty.

Read on for more.

The program provides students with real-world practitioner challenges and imparts technical skills to prepare them for data science careers in the financial services industry where firms are operationalizing AI, analytics, and machine learning.

Read on for more.

The Kinetica database converts natural language queries to SQL, and returns answers within seconds, even for complex and unknown questions. Further, Kinetica converges multiple modes of analytics such as time series, spatial, graph, and machine learning that broadens the types of questions that can be answered.

Read on for more.

Composable CDPs are new technical architectures for managing and activating customer data for marketing programs. A company can transform an existing cloud data warehouse into a central repository of customer data. It enables businesses to personalize emails, advertising, and other customer experiences more quickly, economically, and effectively than traditional solutions.

Read on for more.

Watch this space each week as our editors will share upcoming events, new thought leadership, and the best resources from Insight Jam, Solutions Review's enterprise tech community for business software pros. The goal? To help you gain a forward-thinking analysis and remain on-trend through expert advice, best practices, trends and predictions, and vendor-neutral software evaluation tools.

With the next Solutions Spotlight event, the team at Solutions Review has partnered with leading reliability vendor Monte Carlo to provide viewers with a unique webinar called "Driving Data Warehouse Cost Optimization and Performance." Hear from our panel of experts on best practices for optimizing Snowflake query performance with cost governance; native Snowflake features such as cost and workload optimization; and Monte Carlo's new Performance Dashboard for query optimization across your Snowflake environment.

Read on for more.

Solutions Review hosted its biggest Insight Jam LIVE ever, with 18 hours of expert panels featuring more than 100 thought leaders, sponsored by Satori and Monte Carlo. Also, part of this largest-ever Insight Jam LIVE was a call for 2024 enterprise tech & AI predictions, and wow, did the community oblige!

Read on for more.

For our 5th annual Insight Jam LIVE! Solutions Review editors sourced this resource guide of analytics and data science predictions for 2024 from Insight Jam, its new community of enterprise tech experts.

Read on for more.

For our 5th annual Insight Jam LIVE! Solutions Review editors sourced this resource guide of AI predictions for 2024 from Insight Jam, its new community of enterprise tech experts.

Read on for more.

For consideration in future analytics and data science news roundups, send your announcements to the editor: tking@solutionsreview.com.

More:

Analytics and Data Science News for the Week of January 12; Updates from Databricks, FICO, Power BI & More - Solutions Review

The 11 Best AI Tools for Data Science to Consider in 2024 – Solutions Review

Solutions Review's listing of the best AI tools for data science is an annual sneak peek of the top tools included in our Buyer's Guide for Data Science and Machine Learning Platforms. Information was gathered via online materials and reports, conversations with vendor representatives, and examinations of product demonstrations and free trials.

The editors at Solutions Review have developed this resource to assist buyers in search of the best AI tools for data science to fit the needs of their organization. Choosing the right vendor and solution can be a complicated process, one that requires in-depth research and often comes down to more than just the solution and its technical capabilities. To make your search a little easier, we've profiled the best AI tools for data science all in one place. We've also included platform and product line names and introductory software tutorials straight from the source so you can see each solution in action.

Note: The best AI tools for data science are listed in alphabetical order.

Platform: DataRobot Enterprise AI Platform

Related products: Paxata Data Preparation, Automated Machine Learning, Automated Time Series, MLOps

Description: DataRobot offers an enterprise AI platform that automates the end-to-end process for building, deploying, and maintaining AI. The product is powered by open-source algorithms and can be leveraged on-prem, in the cloud, or as a fully managed AI service. DataRobot includes several independent but fully integrated tools (Paxata Data Preparation, Automated Machine Learning, Automated Time Series, MLOps, and AI applications), and each can be deployed in multiple ways to match business needs and IT requirements.

Platform: H2O Driverless AI

Related products: H2O 3, H2O AutoML for ML, H2O Sparkling Water for Spark Integration, H2O Wave

Description: H2O.ai offers a number of AI and data science products, headlined by its commercial platform H2O Driverless AI. H2O is a fully open-source, distributed in-memory machine learning platform with linear scalability. H2O supports widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning and more. H2O.ai has also developed AutoML functionality that automatically runs through all the algorithms to produce a leaderboard of the best models.

Platform: IBM Watson Studio

Related products: IBM Cloud Pak for Data, IBM SPSS Modeler, IBM Decision Optimization, IBM Watson Machine Learning

Description: IBM Watson Studio enables users to build, run, and manage AI models at scale across any cloud. The product is a part of IBM Cloud Pak for Data, the company's main data and AI platform. The solution lets you automate AI lifecycle management, govern and secure open-source notebooks, prepare and build models visually, deploy and run models through one-click integration, and manage and monitor models with explainable AI. IBM Watson Studio offers a flexible architecture that allows users to utilize open-source frameworks like PyTorch, TensorFlow, and scikit-learn.

https://www.youtube.com/watch?v=rSHDsCTl_c0

Platform: KNIME Analytics Platform

Related products: KNIME Server

Description: KNIME Analytics Platform is an open-source platform for creating data science workflows. It enables the creation of visual workflows via a drag-and-drop-style graphical interface that requires no coding. Users can choose from more than 2,000 nodes to build workflows, model each step of analysis, control the flow of data, and ensure work is current. KNIME can blend data from any source and shape data to derive statistics, clean data, and extract and select features. The product leverages AI and machine learning and can visualize data with classic and advanced charts.

Platform: Looker

Related products: Powered by Looker

Description: Looker offers a BI and data analytics platform built on LookML, the company's proprietary modeling language. The product's application for web analytics touts filtering and drilling capabilities, enabling users to dig into row-level details at will. Embedded analytics in Powered by Looker utilizes modern databases and an agile modeling layer that allows users to define data and control access. Organizations can use Looker's full RESTful API or the schedule feature to deliver reports by email or webhook.

Platform: Azure Machine Learning

Related products: Azure Data Factory, Azure Data Catalog, Azure HDInsight, Azure Databricks, Azure DevOps, Power BI

Description: The Azure Machine Learning service lets developers and data scientists build, train, and deploy machine learning models. The product features productivity for all skill levels via a code-first and drag-and-drop designer, and automated machine learning. It also features expansive MLOps capabilities that integrate with existing DevOps processes. The service touts responsible machine learning so users can understand models with interpretability and fairness, as well as protect data with differential privacy and confidential computing. Azure Machine Learning supports open-source frameworks and languages like MLflow, Kubeflow, ONNX, PyTorch, TensorFlow, Python, and R.

Platform: Qlik Analytics Platform

Related products: QlikView, Qlik Sense

Description: Qlik offers a broad spectrum of BI and analytics tools, headlined by the company's flagship offering, Qlik Sense. The solution enables organizations to combine all their data sources into a single view. The Qlik Analytics Platform allows users to develop, extend and embed visual analytics in existing applications and portals. Embedded functionality is done within a common governance and security framework. Users can build and embed Qlik as simple mashups or integrate within applications, information services or IoT platforms.

Platform: RapidMiner Studio

Related products: RapidMiner AI Hub, RapidMiner Go, RapidMiner Notebooks, RapidMiner AI Cloud

Description: RapidMiner offers a data science platform that enables people of all skill levels across the enterprise to build and operate AI solutions. The product covers the full lifecycle of the AI production process, from data exploration and data preparation to model building, model deployment, and model operations. RapidMiner provides the depth that data scientists need, but simplifies AI for everyone else via a visual user interface that streamlines the process of building and understanding complex models.

Platform: SAP Analytics Cloud

Related products: SAP BusinessObjects BI, SAP Crystal Solutions

Description: SAP offers a broad range of BI and analytics tools in both enterprise and business-user driven editions. The company's flagship BI portfolio is delivered via on-prem (BusinessObjects Enterprise) and cloud (BusinessObjects Cloud) deployments atop the SAP HANA Cloud. SAP also offers a suite of traditional BI capabilities for dashboards and reporting. The vendor's data discovery tools are housed in the BusinessObjects solution, while additional functionality, including self-service visualization, is available through the SAP Lumira tool set.

Platform: Sisense

Description: Sisense makes it easy for organizations to reveal business insight from complex data in any size or format. The product allows users to combine data and uncover insights in a single interface without scripting, coding or assistance from IT. Sisense is sold as a single-stack solution with a back end for preparing and modeling data. It also features expansive analytical capabilities, and a front-end for dashboarding and visualization. Sisense is most appropriate for organizations that want to analyze large amounts of data from multiple sources.

Platform: Tableau Desktop

Related products: Tableau Prep, Tableau Server, Tableau Online, Tableau Data Management

Description: Tableau offers an expansive visual BI and analytics platform, and is widely regarded as the major player in the marketplace. The company's analytic software portfolio is available through three main channels: Tableau Desktop, Tableau Server, and Tableau Online. Tableau connects to hundreds of data sources and is available on-prem or in the cloud. The vendor also offers embedded analytics capabilities, and users can visualize and share data with Tableau Public.

More here:

The 11 Best AI Tools for Data Science to Consider in 2024 - Solutions Review

Are autonomous labs the future of science? | by Batman | Jan, 2024 – Medium

Photo by Hyundai Motor Group on Unsplash

The scientific community is on the brink of a revolution, driven by the emergence of autonomous labs.

These cutting-edge facilities mark a significant shift in how research and experiments are conducted.

As technology advances, the potential impact of autonomous labs on the scientific landscape becomes increasingly evident.

One key advantage of autonomous labs is the unparalleled efficiency they bring to experimentation.

Traditionally, scientists spent substantial time and effort on manual tasks, often prone to human error.

With autonomous labs, these repetitive tasks are automated, allowing researchers to focus on the core aspects of their work.

This streamlined process not only accelerates the pace of experiments but also enhances the reliability of results.

Moreover, autonomous labs operate round the clock without the need for constant human supervision.

This continuous workflow ensures that experiments can be conducted efficiently, leading to faster data generation and analysis.

The elimination of downtime associated with traditional labs results in a significant boost to overall productivity in the scientific research domain.

Precision and accuracy are paramount in scientific research. Autonomous labs leverage state-of-the-art technologies, such as advanced sensors and artificial intelligence, to ensure precise data collection and analysis.

These technologies significantly reduce the margin of error, providing researchers with more reliable and reproducible results.

Furthermore, the integration of machine learning algorithms within autonomous labs enables real-time data interpretation.

The ability to analyze data on the fly allows researchers to adapt their experimental approaches dynamically, enhancing the quality of research outcomes.

The synergy between automation and intelligent data processing marks a substantial leap forward in scientific methodology.

Read this article:

Are autonomous labs the future of science? | by Batman | Jan, 2024 - Medium

Run Mixtral-8x7B on Consumer Hardware with Expert Offloading – Towards Data Science

Activation pattern of Mixtral-8x7B's expert sub-networks. Source (CC BY).

While Mixtral-8x7B is one of the best open large language models (LLMs), it is also a huge model with 46.7B parameters. Even when quantized to 4-bit, the model can't be fully loaded on a consumer GPU (e.g., an RTX 3090 with 24 GB of VRAM is not enough).
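A back-of-envelope estimate shows why 24 GB falls short. The sketch below counts weight bytes only, ignoring activations, the KV cache, and quantization metadata, all of which push the real footprint higher:

```python
# Rough memory estimate for Mixtral-8x7B's weights at various precisions.
# Overheads (activations, KV cache, quantization scales) are ignored.
PARAMS = 46.7e9  # total parameter count

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed to store the weights alone."""
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(bits):.1f} GB")
```

At 4 bits per parameter the weights alone come to roughly 23.4 GB, which leaves essentially no headroom on a 24 GB card once activations and the KV cache are added.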

Mixtral-8x7B is a mixture of experts (MoE). It is made of 8 expert sub-networks of 6 billion parameters each.

Since only 2 of the 8 experts are active for each token during decoding, the 6 remaining experts can be moved, or offloaded, to another device, e.g., CPU RAM, to free up some of the GPU VRAM. In practice, this offloading is complicated.

Choosing which experts to activate is a decision taken at inference time, for each input token and each layer of the model. Naively moving some parts of the model to CPU RAM, as with Accelerate's device_map, would create a communication bottleneck between the CPU and the GPU.
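To see why a smarter policy beats naive placement, here is a toy sketch of the offloading idea: only a few experts fit in VRAM at once, and an expert reused by consecutive tokens should not be re-copied from CPU RAM. The LRU eviction policy below is an assumption chosen for illustration, not a claim about mixtral-offloading's exact implementation:

```python
from collections import OrderedDict

# Toy model of expert offloading: only `capacity` experts fit on the GPU
# at once; the rest live in CPU RAM and are copied over on demand.
# Eviction is least-recently-used (an illustrative assumption).
class ExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.on_gpu = OrderedDict()  # expert_id -> "weights"
        self.loads = 0               # number of CPU -> GPU copies

    def fetch(self, expert_id: int) -> str:
        if expert_id in self.on_gpu:
            self.on_gpu.move_to_end(expert_id)  # cache hit: mark as recent
        else:
            self.loads += 1  # simulate a costly CPU -> GPU transfer
            if len(self.on_gpu) >= self.capacity:
                self.on_gpu.popitem(last=False)  # evict least-recent expert
            self.on_gpu[expert_id] = f"weights-{expert_id}"
        return self.on_gpu[expert_id]

cache = ExpertCache(capacity=2)
for expert in [0, 3, 0, 0, 3, 5]:  # experts selected per token/layer
    cache.fetch(expert)
print(cache.loads)  # fewer transfers than fetches when experts repeat
```

Because consecutive tokens often reuse the same experts, caching turns most fetches into hits, which is exactly the traffic a naive static device_map cannot avoid.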

Mixtral-offloading (MIT license) is a project that proposes a much more efficient solution to reduce VRAM consumption while preserving a reasonable inference speed.

In this article, I explain how mixtral-offloading implements expert-aware quantization and expert offloading to save memory and maintain a good inference speed. Using this framework, we will see how to run Mixtral-8x7B on consumer hardware and benchmark its inference speed.

The tutorial section is also available as a notebook that you can find here:

Get the notebook (#37)

MoE language models often allocate distinct experts to sub-tasks, but not consistently across long token sequences. Some experts are active in short 2-4 token sequences, while others have intermittent gaps in their activation. This is well illustrated by the following figure:

Follow this link:

Run Mixtral-8x7B on Consumer Hardware with Expert Offloading - Towards Data Science

Attacker Targets Hadoop YARN, Flink Servers in Stealthy Campaign – Dark Reading

A threat actor is targeting a common misconfiguration in Hadoop YARN and Apache Flink to try to drop Monero cryptominers in environments running the two big data technologies.

What makes the campaign especially notable is the adversary's use of sophisticated evasion techniques, such as rootkits, packed ELF binaries, directory content deletion, and system configuration modifications to bypass typical threat detection mechanisms.

Researchers from Aqua Nautilus uncovered the campaign when they spotted new attacks hitting one of their cloud honeypots recently. One attack exploited a known misconfiguration in a feature in Hadoop YARN called ResourceManager that manages resources for applications running on a Hadoop cluster. The other targeted a similarly known misconfiguration in Flink that, like the YARN issue, gives attackers a way to run arbitrary code on affected systems.

Hadoop YARN (Yet Another Resource Negotiator) is a resource management subsystem of the Hadoop ecosystem for big data processing. Apache Flink is a relatively widely used open source stream and batch processor for event-driven data analytics and data pipeline applications.

Assaf Morag, lead researcher for Aqua Nautilus, says the YARN misconfiguration gives attackers a way to send an unauthenticated API request to create new applications. The Flink misconfiguration allows an attacker to upload a Java archive (JAR) file that contains malicious code to a Flink server.

"Both misconfigurations permit remote code execution, implying that an attacker could potentially gain complete control over the server," Morag says. Given that these servers are used for data processing, their misconfigurations present a data exfiltration risk. "Furthermore, these servers are typically interconnected with other servers within the organization, which could facilitate lateral movement by the attacker," Morag says.
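On the YARN side, the root cause is a ResourceManager REST API reachable without authentication. A common hardening step is to require authentication on Hadoop's web endpoints and enable YARN ACLs. The property names below are standard Hadoop settings, but the values are illustrative; consult the security documentation for your Hadoop version:

```xml
<!-- core-site.xml: require Kerberos/SPNEGO auth on Hadoop HTTP endpoints -->
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <value>false</value>
</property>

<!-- yarn-site.xml: enable YARN ACLs so only named admins manage apps -->
<property>
  <name>yarn.acl.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.admin.acl</name>
  <value>hadoop-admins</value>
</property>
```

Keeping the ResourceManager web UI and REST port off the public internet (firewalling port 8088) matters at least as much as these settings.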

In the attack on Aqua Nautilus' honeypots, the adversary exploited the misconfiguration in Hadoop YARN to send an unauthenticated request to deploy a new application. The attacker was then able to execute remote code on the misconfigured YARN server by sending a POST request asking it to launch the new application using the attacker's command. To establish persistence, the attacker first deleted all cron jobs or scheduled tasks on the YARN server and created a new cron job.
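A quick defensive check is to see whether your own ResourceManager answers an unauthenticated request to the standard new-application REST endpoint, which is the entry point this campaign abuses. The endpoint path and default port below are the standard YARN REST API; the verdict-mapping helper is illustrative:

```python
import urllib.error
import urllib.request

# Standard YARN ResourceManager REST endpoint for creating applications
NEW_APP_PATH = "/ws/v1/cluster/apps/new-application"

def interpret(status):
    """Map the HTTP status of an unauthenticated POST to a verdict."""
    if status == 200:
        return "EXPOSED: unauthenticated app creation allowed"
    if status in (401, 403):
        return "OK: endpoint requires authentication"
    return f"UNKNOWN: unexpected status {status}"

def probe(host, port=8088, timeout=5):
    """POST with no credentials to the new-application endpoint."""
    url = f"http://{host}:{port}{NEW_APP_PATH}"
    req = urllib.request.Request(url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return interpret(resp.status)
    except urllib.error.HTTPError as e:
        return interpret(e.code)
```

Run this only against infrastructure you own; a 200 response means anyone on the network can submit applications, i.e., run arbitrary commands, on your cluster.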

Aqua's analysis of the attack chain showed the attacker using the command to delete the content of the /tmp directory on the YARN server, downloading a malicious file to the /tmp directory from a remote command-and-control server, executing the file, and then again deleting the contents of the directory. Aqua researchers found the secondary payload from the C2 server to be a packed ELF (Executable and Linkable Format) binary that served as a downloader for two different rootkits and a Monero cryptocurrency miner. Malware detection engines on VirusTotal did not detect the secondary ELF binary payload, Aqua said.

"As these servers are designed for processing big data, they possess high CPU capabilities," Morag says. "The attacker is exploiting this fact to run cryptominers, which also require a substantial amount of CPU resources."

Morag says the attack is noteworthy for the different techniques the attacker used to conceal their malicious activity. These included the use of a packer to obfuscate the ELF binary, the use of stripped payloads to make analysis more challenging, an embedded payload within the ELF binary, file and directory permissions modifications, and the use of two rootkits to hide the cryptominer and shell commands.

See the original post here:

Attacker Targets Hadoop YARN, Flink Servers in Stealthy Campaign - Dark Reading

Five Key Trends in AI and Data Science for 2024 – MIT Sloan Management Review

This column series looks at the biggest data and analytics challenges facing modern companies and dives deep into successful use cases that can help other organizations accelerate their AI progress.

Artificial intelligence and data science became front-page news in 2023. The rise of generative AI, of course, drove this dramatic surge in visibility. So, what might happen in the field in 2024 that will keep it on the front page? And how will these trends really affect businesses?

During the past several months, we've conducted three surveys of data and technology executives. Two involved MIT's Chief Data Officer and Information Quality Symposium attendees: one sponsored by Amazon Web Services (AWS) and another by Thoughtworks (not yet published). The third survey was conducted by Wavestone, formerly NewVantage Partners, whose annual surveys we've written about in the past. In total, the new surveys involved more than 500 senior executives, perhaps with some overlap in participation.

Surveys don't predict the future, but they do suggest what those people closest to companies' data science and AI strategies and projects are thinking and doing. According to those data executives, here are the top five developing issues that deserve your close attention:

As we noted, generative AI has captured a massive amount of business and consumer attention. But is it really delivering economic value to the organizations that adopt it? The survey results suggest that although excitement about the technology is very high, value has largely not yet been delivered. Large percentages of respondents believe that generative AI has the potential to be transformational; 80% of respondents to the AWS survey said they believe it will transform their organizations, and 64% in the Wavestone survey said it is the most transformational technology in a generation. A large majority of survey takers are also increasing investment in the technology. However, most companies are still just experimenting, either at the individual or departmental level. Only 6% of companies in the AWS survey had any production application of generative AI, and only 5% in the Wavestone survey had any production deployment at scale.

Surveys suggest that though excitement about generative AI is very high, value has largely not yet been delivered.

Production deployments of generative AI will, of course, require more investment and organizational change, not just experiments. Business processes will need to be redesigned, and employees will need to be reskilled (or, probably in only a few cases, replaced by generative AI systems). The new AI capabilities will need to be integrated into the existing technology infrastructure.

Perhaps the most important change will involve data: curating unstructured content, improving data quality, and integrating diverse sources. In the AWS survey, 93% of respondents agreed that data strategy is critical to getting value from generative AI, but 57% had made no changes to their data thus far.

Companies feel the need to accelerate the production of data science models. What was once an artisanal activity is becoming more industrialized. Companies are investing in platforms, processes and methodologies, feature stores, machine learning operations (MLOps) systems, and other tools to increase productivity and deployment rates. MLOps systems monitor the status of machine learning models and detect whether they are still predicting accurately. If they're not, the models might need to be retrained with new data.
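The monitoring loop at the heart of such MLOps systems can be sketched simply: track live accuracy over a rolling window of predictions and flag retraining when it drops below a threshold. The window size and threshold below are illustrative choices, not values from any particular tool:

```python
from collections import deque

class AccuracyMonitor:
    """Rolling-window accuracy check, the core of an MLOps drift alarm."""
    def __init__(self, window=500, threshold=0.90):
        self.window = deque(maxlen=window)   # recent hit/miss outcomes
        self.threshold = threshold

    def record(self, prediction, actual):
        """Log one labeled prediction once ground truth arrives."""
        self.window.append(prediction == actual)

    def needs_retraining(self):
        """True when rolling accuracy has fallen below the threshold."""
        if not self.window:
            return False
        accuracy = sum(self.window) / len(self.window)
        return accuracy < self.threshold
```

Production systems add nuance (statistical tests, input-distribution drift, delayed labels), but the retrain-when-accuracy-degrades trigger is the common core.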

Producing data models, once an artisanal activity, is becoming more industrialized.

Most of these capabilities come from external vendors, but some organizations are now developing their own platforms. Although automation (including automated machine learning tools, which we discuss below) is helping to increase productivity and enable broader data science participation, the greatest boon to data science productivity is probably the reuse of existing data sets, features or variables, and even entire models.

In the Thoughtworks survey, 80% of data and technology leaders said that their organizations were using or considering the use of data products and data product management. By data product, we mean packaging data, analytics, and AI in a software product offering, for internal or external customers. It's managed from conception to deployment (and ongoing improvement) by data product managers. Examples of data products include recommendation systems that guide customers on what products to buy next and pricing optimization systems for sales teams.

But organizations view data products in two different ways. Just under half (48%) of respondents said that they include analytics and AI capabilities in the concept of data products. Some 30% view analytics and AI as separate from data products and presumably reserve that term for reusable data assets alone. Just 16% say they don't think of analytics and AI in a product context at all.

We have a slight preference for a definition of data products that includes analytics and AI, since that is the way data is made useful. But all that really matters is that an organization is consistent in how it defines and discusses data products. If an organization prefers a combination of data products and analytics and AI products, that can work well too, and that definition preserves many of the positive aspects of product management. But without clarity on the definition, organizations could become confused about just what product developers are supposed to deliver.

Data scientists, who have been called "unicorns" and the holders of "the sexiest job of the 21st century" because of their ability to make all aspects of data science projects successful, have seen their star power recede. A number of changes in data science are producing alternative approaches to managing important pieces of the work. One such change is the proliferation of related roles that can address pieces of the data science problem. This expanding set of professionals includes data engineers to wrangle data, machine learning engineers to scale and integrate the models, translators and connectors to work with business stakeholders, and data product managers to oversee the entire initiative.

Another factor reducing the demand for professional data scientists is the rise of citizen data science, wherein quantitatively savvy businesspeople create models or algorithms themselves. These individuals can use AutoML, or automated machine learning tools, to do much of the heavy lifting. Even more helpful to citizens is the modeling capability available in ChatGPT called Advanced Data Analysis. With a very short prompt and an uploaded data set, it can handle virtually every stage of the model creation process and explain its actions.
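At its core, the "heavy lifting" AutoML tools automate is a search over candidate models scored on held-out data. The toy below illustrates that loop for a one-dimensional regression task using only the standard library; real tools search thousands of model families and hyperparameters, and all names here are illustrative:

```python
import statistics

def _fit_linear(xs, ys):
    """Closed-form least-squares line through (xs, ys)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def tiny_automl(train, valid):
    """Toy AutoML loop: fit several candidate models on `train`,
    score each on `valid` by mean squared error, return the best.

    train, valid: lists of (x, y) pairs.
    """
    xs, ys = zip(*train)
    candidates = {
        "mean":   lambda x: statistics.fmean(ys),   # constant baseline
        "last":   lambda x: ys[-1],                 # naive carry-forward
        "linear": _fit_linear(xs, ys),              # least-squares line
    }
    def mse(model):
        return statistics.fmean((model(x) - y) ** 2 for x, y in valid)
    best = min(candidates, key=lambda name: mse(candidates[name]))
    return best, candidates[best]
```

Everything else AutoML products layer on top (feature engineering, ensembling, explanations) elaborates this same search-and-validate loop.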

Of course, there are still many aspects of data science that do require professional data scientists. Developing entirely new algorithms or interpreting how complex models work, for example, are tasks that haven't gone away. The role will still be necessary but perhaps not as much as it was previously and without the same degree of power and shimmer.

This past year, we began to notice that increasing numbers of organizations were cutting back on the proliferation of technology and data chiefs, including chief data and analytics officers (and sometimes chief AI officers). That CDO/CDAO role, while becoming more common in companies, has long been characterized by short tenures and confusion about the responsibilities. We're not seeing the functions performed by data and analytics executives go away; rather, they're increasingly being subsumed within a broader set of technology, data, and digital transformation functions managed by a supertech leader who usually reports to the CEO. Titles for this role include chief information officer, chief information and technology officer, and chief digital and technology officer; real-world examples include Sastry Durvasula at TIAA, Sean McCormack at First Group, and Mojgan Lefebvre at Travelers.

This evolution in C-suite roles was a primary focus of the Thoughtworks survey, and 87% of respondents (primarily data leaders but some technology executives as well) agreed that people in their organizations are either completely, to a large degree, or somewhat confused about where to turn for data- and technology-oriented services and issues. Many C-level executives said that collaboration with other tech-oriented leaders within their own organizations is relatively low, and 79% agreed that their organization had been hindered in the past by a lack of collaboration.

We believe that in 2024, we'll see more of these overarching tech leaders who have all the capabilities to create value from the data and technology professionals reporting to them. They'll still have to emphasize analytics and AI because that's how organizations make sense of data and create value with it for employees and customers. Most importantly, these leaders will need to be highly business-oriented, able to debate strategy with their senior management colleagues, and able to translate it into systems and insights that make that strategy a reality.

Thomas H. Davenport (@tdav) is the President's Distinguished Professor of Information Technology and Management at Babson College, a fellow of the MIT Initiative on the Digital Economy, and senior adviser to the Deloitte Chief Data and Analytics Officer Program. He is coauthor of All in on AI: How Smart Companies Win Big With Artificial Intelligence (HBR Press, 2023) and Working With AI: Real Stories of Human-Machine Collaboration (MIT Press, 2022). Randy Bean (@randybeannvp) is an industry thought leader, author, founder, and CEO and currently serves as innovation fellow, data strategy, for global consultancy Wavestone. He is the author of Fail Fast, Learn Faster: Lessons in Data-Driven Leadership in an Age of Disruption, Big Data, and AI (Wiley, 2021).

Excerpt from:

Five Key Trends in AI and Data Science for 2024 - MIT Sloan Management Review