Category Archives: Data Science

Python Profiling Tools: A Tutorial – Built In

Profiling is a software engineering task in which software bottlenecks are analyzed programmatically. This process includes analyzing memory usage, the number of function calls and the runtime of those calls. Such analysis is important because it provides a rigorous way to detect parts of a software program that may be slow or resource inefficient, ultimately allowing for the optimization of software programs.

Profiling has use cases across almost every type of software program, including those used for data science and machine learning tasks. This includes extraction, transformation and loading (ETL) and machine learning model development. In Python, you can profile ETL work built on the Pandas library, including operations like reading in data, merging data frames, performing groupby operations, typecasting and missing value imputation.

Identifying bottlenecks in machine learning software is an important part of our work as data scientists. For instance, consider a Python script that reads in data and performs several operations on it for model training and prediction. Suppose the steps in the machine learning pipeline are reading in data, performing a groupby, splitting the data for training and testing, fitting three types of machine learning models, making predictions for each model type on the test data, and evaluating model performance. For the first deployed version, the runtime might be a few minutes.

After a data refresh, however, imagine that the script's runtime increases to several hours. How do we know which step in the ML pipeline is causing the problem? Software profiling allows us to detect which part of the code is responsible so we can fix it.

Another example relates to memory. Consider the memory usage of the first version of a deployed machine learning pipeline. This script may run for an hour each month and use 100 GB of memory. In the future, an updated version of the model, trained on a larger data set, may run for five hours each month and require 500 GB of memory. This increase in resource usage is to be expected with an increase in data set size. Detecting such an increase may help data scientists and machine learning engineers decide if they would like to optimize the memory usage of the code in some way. Optimization can help prevent companies from wasting money on unnecessary memory resources.

Python provides useful tools for profiling software in terms of runtime and memory. One of the most basic and widely used is the timeit module, which offers an easy way to measure the execution times of software programs. The Python memory_profiler module allows you to measure the memory usage of lines of code in your Python script. You can easily implement both of these tools with just a few lines of code.

We will work with the credit card fraud data set and build a machine learning model that predicts whether or not a transaction is fraudulent. We will construct a simple machine learning pipeline and use Python profiling tools to measure runtime and memory usage. This data has an Open Database License and is free to share, modify and use.


To start, let's import the Pandas library and read our data into a Pandas data frame:
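A minimal sketch (here and in the snippets that follow, file names, function names and specific parameter values are assumptions reconstructed from the text):

    import pandas as pd

    df = pd.read_csv('creditcard.csv')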

Next, let's relax the display limits for columns and rows using the Pandas method set_option():
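For example, passing None removes the default caps:

    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)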

Next, let's display the first five rows of data using the head() method:
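    print(df.head())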

Next, to get an idea of how big this data set is, we can use the len() function to see how many rows there are:
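    print(len(df))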

And we can do something similar for counting the number of columns. We can access the columns attribute from our Pandas data frame object and use the len() function to count the number of columns:
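    print(len(df.columns))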

We can see that this data set is relatively large: 284,807 rows and 31 columns. Further, it takes up 150 MB of space. To demonstrate the benefits of profiling in Python, we'll start with a small subsample of this data on which we'll perform ETL and train a classification model.

Let's proceed by generating a small subsample data set. Let's take a random sample of 10,000 records from our data. We will also pass a value for random_state, which will guarantee that we select the same set of records every time we run the script. We can do this using the sample() method on our Pandas data frame:
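A sketch, with the random_state value assumed:

    df_sample = df.sample(n=10000, random_state=42)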

Next, we can write the subsample of our data to a new csv file:
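Assuming a file name of creditcard_subsample.csv:

    df_sample.to_csv('creditcard_subsample.csv', index=False)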

Now we can start building out the logic for data preparation and model training. Let's define a method that reads in our csv file, stores it in a data frame and returns it:
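One way this might look, with the name read_data assumed:

    def read_data(filename):
        # Read a csv file into a Pandas data frame
        df = pd.read_csv(filename)
        return df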

Next, let's define a function that selects a subset of columns in the data. The function will take a data frame and a list of columns as inputs and return a new data frame with the selected columns:
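A sketch, with the name data_prep assumed:

    def data_prep(df, columns):
        # Return a new data frame restricted to the selected columns
        return df[columns].copy()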

Next, let's define a method that specifies the model inputs and outputs and returns these values:
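In this sketch, the feature columns and the target column are passed in by name:

    def define_inputs_outputs(df, input_columns, output_column):
        # Define model inputs X and output y
        X = df[input_columns]
        y = df[output_column]
        return X, y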

We can then define a method used for splitting data for training and testing. First, at the top of our script, let's import the train_test_split method from the model_selection module in Scikit-learn:
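    from sklearn.model_selection import train_test_split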

Now we can define our method for splitting our data:
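A sketch, with the test_size and random_state values assumed:

    def split_data(X, y):
        # Split inputs and output into training and test sets
        return train_test_split(X, y, test_size=0.2, random_state=42)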

Next, we can define a method that will fit a model of our choice to our training data. Let's start with a simple logistic regression model. We can import the logistic regression class from the linear models module in Scikit-learn:
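    from sklearn.linear_model import LogisticRegression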

We will then define a method that takes our training data and an input that specifies the model type. We will use the model type parameter later on to define and train a more complex model:
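A sketch of the training function; the model_type parameter gains a second option later:

    def fit_model(X_train, y_train, model_type='logistic_regression'):
        # Define and train a model of the requested type
        if model_type == 'logistic_regression':
            model = LogisticRegression()
        model.fit(X_train, y_train)
        return model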

Next, we can define a method that takes our trained model and test data as inputs and returns predictions:
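    def predict(model, X_test):
        # Generate predictions on the test inputs
        return model.predict(X_test)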

Finally, let's define a method that evaluates our predictions. We'll use average precision, which is a useful performance metric for imbalanced classification problems. An imbalanced classification problem is one where one of the targets has significantly fewer examples than the other target(s). In this case, most of the transaction data correspond to legitimate transactions, whereas a small minority of transactions are fraudulent:
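A sketch using Scikit-learn's average_precision_score:

    from sklearn.metrics import average_precision_score

    def evaluate(y_test, y_pred):
        # Score predictions with average precision
        ap = average_precision_score(y_test, y_pred)
        print('Average precision:', ap)
        return ap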

Now we have all of the logic in place for our simple ML pipeline. Let's execute this logic for our small subsample of data. First, let's define a main function that we'll use to execute our code. In this main function, we'll read in our subsampled data:
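    def main():
        # Read in the subsampled data
        df = read_data('creditcard_subsample.csv')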

Next, we use the data prep method to select our columns. Let's select V1, V2, V3, Amount and Class:
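    # inside main(), continuing the sketch
    df = data_prep(df, ['V1', 'V2', 'V3', 'Amount', 'Class'])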

Let's then define inputs and output. We will use V1, V2, V3 and Amount as inputs; Class will be the output:
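    # inside main()
    X, y = define_inputs_outputs(df, ['V1', 'V2', 'V3', 'Amount'], 'Class')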

We'll split our data for training and testing:
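    # inside main()
    X_train, X_test, y_train, y_test = split_data(X, y)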

Fit our data:
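    # inside main()
    model = fit_model(X_train, y_train, model_type='logistic_regression')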

Make predictions:
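    # inside main()
    y_pred = predict(model, X_test)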

And, finally, evaluate model predictions:
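    # inside main()
    evaluate(y_test, y_pred)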

We can then execute the main function with the following logic:
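    if __name__ == '__main__':
        main()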

Executing it prints our model's average precision score as output.

Now we can use some profiling tools to monitor memory usage and runtime.

Let's start by monitoring runtime. Let's import the default_timer from the timeit module in Python:
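    from timeit import default_timer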

Next, let's see how long it takes to read our data into a Pandas data frame. We define start and end time variables and print the difference to see how much time has elapsed:
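A sketch of the timing pattern:

    start = default_timer()
    df = read_data('creditcard_subsample.csv')
    end = default_timer()
    print('read_data took', end - start, 'seconds')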

If we run our script, we see that it takes 0.06 seconds to read in our data.

Let's do the same for each step in the ML pipeline. We'll calculate runtime for each step and store the results in a dictionary:
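A sketch of an instrumented main(), reusing the function names assumed above:

    def main():
        runtimes = {}

        start = default_timer()
        df = read_data('creditcard_subsample.csv')
        runtimes['read_data'] = default_timer() - start

        start = default_timer()
        df = data_prep(df, ['V1', 'V2', 'V3', 'Amount', 'Class'])
        X, y = define_inputs_outputs(df, ['V1', 'V2', 'V3', 'Amount'], 'Class')
        runtimes['data_prep'] = default_timer() - start

        start = default_timer()
        X_train, X_test, y_train, y_test = split_data(X, y)
        runtimes['split_data'] = default_timer() - start

        start = default_timer()
        model = fit_model(X_train, y_train, model_type='logistic_regression')
        runtimes['fit_model'] = default_timer() - start

        start = default_timer()
        y_pred = predict(model, X_test)
        runtimes['predict'] = default_timer() - start

        start = default_timer()
        evaluate(y_test, y_pred)
        runtimes['evaluate'] = default_timer() - start

        print(runtimes)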

Upon executing, the script prints the runtime of each step.

We see that reading in the data and fitting it are the most time-consuming operations. Let's rerun this with the large data set. At the top of our main function, we change the file name to this:
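    # inside main()
    df = read_data('creditcard.csv')  # full data set instead of the subsample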

And now, let's rerun our script.

We see that, when we use the full data set, reading the data into a data frame takes 1.6 seconds, compared to the 0.07 seconds it took for the smaller data set. Pinpointing the data-reading step as the source of the increased runtime is important for resource management. Understanding bottleneck sources like these can prevent companies from wasting resources like compute time.

Next, let's modify our model training method such that CatBoost is a model option:
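A sketch of the extended function, assuming the catboost package is installed (verbose=False silences per-iteration output; LogisticRegression is imported as before):

    from catboost import CatBoostClassifier

    def fit_model(X_train, y_train, model_type='logistic_regression'):
        # Define and train a model of the requested type
        if model_type == 'logistic_regression':
            model = LogisticRegression()
        elif model_type == 'catboost':
            model = CatBoostClassifier(verbose=False)
        model.fit(X_train, y_train)
        return model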

Let's rerun our script, now specifying a CatBoost model:
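    # inside main()
    model = fit_model(X_train, y_train, model_type='catboost')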

The results tell a very different story.

We see that by using a CatBoost model instead of logistic regression, we increased our runtime from ~2 seconds to ~22 seconds, a more than tenfold increase in runtime from changing a single line of code. Imagine if this increase in runtime happened for a script that originally took 10 hours: runtime would increase to over 100 hours just by switching the model type.

Another important resource to keep track of is memory. We can use the memory_profiler module to monitor memory usage line by line in our code. First, let's install the memory_profiler package in the terminal using pip:
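The package is published on PyPI as memory-profiler:

    pip install memory-profiler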

We can then simply add @profile before each function definition. For example:
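A sketch, decorating the data-reading function:

    from memory_profiler import profile

    @profile
    def read_data(filename):
        df = pd.read_csv(filename)
        return df

With the decorator in place, running the script as usual prints a line-by-line memory report for each decorated function.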

And so on.

Now, let's run our script using the logistic regression model type. Let's look at the step where we fit the model. We see that memory usage for fitting our logistic regression model is around 4.4 MB.

Now, let's rerun this for CatBoost.

We see that memory usage for fitting our CatBoost model is 13.3 MB. This corresponds to a threefold increase in memory usage. For our simple example, this isn't a huge deal, but if a company deploys a newer version of a production model and it goes from using 100 GB of memory to 300 GB, this can be significant in terms of resource cost. Further, having tools like this that can point to where the increase in memory usage is occurring is useful.

The code used in this post is available on GitHub.


Monitoring resource usage is an important part of software, data and machine learning engineering. Understanding runtime dependencies in your scripts, regardless of the application, is important in virtually all industries that rely on software development and maintenance. In the case of a newly deployed machine learning model, an increase in runtime can have a negative impact on business. A significant increase in production runtime can result in a diminished experience for a user of an application that serves real-time machine learning predictions.

For example, if the UX requirements are such that a user shouldn't have to wait more than a few seconds for a prediction result and this suddenly increases to minutes, the result can be frustrated customers who may eventually seek out a better, faster tool.

Understanding memory usage is also crucial because instances may occur in which excessive memory usage isn't necessary. This usage can translate to thousands of dollars being wasted on memory resources that aren't needed. Consider our example of switching the logistic regression model for the CatBoost model. What mainly contributed to the increased memory usage was the CatBoost package's default parameters. These default parameters may result in unnecessary calculations being done by the package.

By understanding this dynamic, the researcher can modify the parameters of the CatBoost class. If this is done well, the researcher can retain the model accuracy while decreasing the memory requirements for fitting the model. Being able to quickly identify memory and runtime bottlenecks using these profiling tools is an essential skill for engineers and data scientists building production-ready software.


The Top Five Key Data Visualization Techniques to Utilize Right Now – Solutions Review

The editors at Solutions Review highlight five key data visualization techniques to utilize right now to enhance data storytelling.

Data visualization is the graphical or visual representation of large amounts of data in the form of charts, graphs and maps, which helps analysts identify patterns and trends in the data. In this article, we will discuss data science visualization, the applications and benefits of data visualization, and the different types of analysis, such as univariate, bivariate and multivariate data visualization. We will also discuss the top data visualization techniques, including line charts, histograms, pie charts, area plots and scatter plots, which can help you understand your data better.

So before going on to the techniques of data visualization, let us look at what data science visualization is and its benefits.

Data science visualization refers to the graphical representation of data using various graphics such as charts, plots, maps, graphs, and infographics. Data visualization makes it easier for human beings to understand the data by analyzing patterns and trends, which helps in generating valuable business insights. Data visualization can even help identify critical relationships in various charts and plots that may prove fruitful for businesses.

Data scientists analyze, interpret and visualize various large datasets regularly with the help of multiple data visualization tools such as Tableau, Sisense, Microsoft Power BI, Microsoft Excel, Looker, and Zoho Analytics.

Today, data visualization is applied in fields such as healthcare, finance, marketing, data science, the military, e-commerce and education, as it helps organize data in ways that are not possible through traditional techniques and thus enables faster data processing.

Data visualization is a technique for representing data graphically that enables faster data processing and improved business decisions.

The three different types of analysis for data visualization are univariate, bivariate and multivariate analysis.

Data visualization techniques involve generating graphical or visual representations of data to identify patterns, trends, correlations and dependencies and gain valuable insights. Let us have a look at the five most commonly used data visualization techniques:

Line Chart: A line chart or line plot displays information as a series of data points connected by straight line segments. A line chart displays the relationship between two variables on the X and Y axes. It is most commonly used to compare several variables and analyze trends.

Histogram: A histogram is a graphical representation of a set of numerical data in the form of connected rectangular blocks. Unlike a bar chart, a histogram represents only quantitative data. It is used to spot unusual observations or gaps in a large dataset.

Pie Chart: A pie chart represents data in a circular statistical graphic form. It records data in the form of numbers, percentages or degrees. It is the most common form of graphical representation used in business presentations to depict data related to orders, sales, revenue, profit, loss, etc. Pie charts are divided into sectors that each represent a percentage of the whole.

Area Plot: An area plot is a special form of line chart in which the region below the line is filled with color, rather than the data simply being connected by a continuous line, to highlight the distance between different variables. It helps show the rise and fall of data, changes over time, categorical breakdowns and so on.

Scatter Plot: A scatter plot is a graphical representation used to observe and display the relationship between two variables. It uses dots to illustrate values for two variables on the horizontal and vertical axes. Scatter plots are used to monitor the relationship between variables.
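As a minimal illustrative sketch on toy data, each of the five techniques can be drawn with a few lines of Python and matplotlib:

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(10)
    y = rng.random(10)

    fig, axes = plt.subplots(1, 5, figsize=(20, 3))

    axes[0].plot(x, y)                               # line chart: trend across a variable
    axes[1].hist(rng.normal(size=1000), bins=30)     # histogram: distribution of numeric data
    axes[2].pie([40, 30, 20, 10])                    # pie chart: shares of a whole
    axes[3].fill_between(x, y)                       # area plot: line chart with filled region
    axes[4].scatter(rng.random(50), rng.random(50))  # scatter plot: relationship of two variables

    for ax, title in zip(axes, ['Line chart', 'Histogram', 'Pie chart', 'Area plot', 'Scatter plot']):
        ax.set_title(title)
    plt.show()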

Data science visualization is an essential tool that helps you analyze hidden patterns and relationships between variables through the graphical presentation of data, and it is a must-have skill for data scientists and data analysts who derive insights from complex business data. In this article, we discussed various data visualization techniques, including line charts, histograms, scatter plots and others.

Tim is Solutions Review's Editorial Director and leads coverage on big data, business intelligence, and data analytics. A 2017 and 2018 Most Influential Business Journalist and 2021 "Who's Who" in data management and data integration, Tim is a recognized influencer and thought leader in enterprise business software. Reach him via tking at solutionsreview dot com.


Hospital IQ Wins Best Predictive Analytics Solution in 2022 MedTech Breakthrough Awards – Yahoo Finance

Recognition marks third consecutive year Hospital IQ has been named MedTech Breakthrough winner

NEWTON, Mass., May 12, 2022--(BUSINESS WIRE)--Hospital IQ, a leading intelligent workflow automation provider for hospital operations, announced today it has been named the winner of the MedTech Breakthrough Award for Best Predictive Analytics Solution. This recognition marks the third consecutive year Hospital IQ has been selected as a MedTech Breakthrough Award winner, previously winning the Health Administration Innovation Award in both 2021 and 2020.

Hospital IQ earned this accolade for its latest innovation in predictive analytics, which accurately predicts patient demand, automates workflows and drives cross-functional team action to improve hospital operational efficiencies and optimize capacity to provide patient care. The Hospital IQ solution leverages data and predictive analytics to provide health system leaders with enterprise-wide visibility into variables across all areas of hospital operations in real-time, including admissions, discharges, staff scheduling and more. Driven by intelligent data analytics, the system identifies potential future challenges in census, boarding, and staffing levels throughout the health system, and turns these predictions into actionable notifications. When a future scenario in which sub-optimal conditions will occur is identified, the Hospital IQ platform alerts the proper stakeholders and provides recommended actions to mitigate problems before they arise.

"Were honored to receive our third consecutive MedTech Breakthrough award win, and were proud to be recognized as the leading innovator in predictive analytics for our clients and the patients they serve," said Rich Krueger, CEO of Hospital IQ. "Success in the post-pandemic future of healthcare will require systematic improvements regarding how healthcare organizations manage processes and people, and core to that improvement is the move from reactive to proactive processes. Predictive analytics allow hospitals and health systems to operate more effectively and strategically utilizing data science to predict whats coming along with real-time insight and transparent communication across the enterprise, allowing them to see more patients, align staffing to demand, provide greater care, and support staff satisfaction."

The Hospital IQ solution is also equipped to promote team success through its coordinated communication platform and automated notifications, keeping everyone informed as patient throughput and capacity utilization goals are achieved and sustained, which eliminates confusion, keeps everyone on the same page and helps with staff morale and retention. The platform can be deployed for use cases across the most impactful areas of the hospital including inpatient, perioperative and infusion, enabling health systems to customize the toolset to their unique needs and further enhancing enterprise-wide performance.


The MedTech Breakthrough Awards honor excellence and recognize the innovation, hard work and success in a range of health and medical technology categories, including Clinical Administration, Telehealth, Patient Engagement, Electronic Health Records (EHR), mHealth, Medical Devices, Medical Data and many more. This year, nearly 4,000 nominations from around the world were submitted, representing the most competitive program to date.

About Hospital IQ

Hospital IQ provides an intelligent workflow automation solution for hospital operations that uses artificial intelligence to anticipate and direct actions, enabling health systems to achieve and sustain peak operational performance to improve patient access, clinical outcomes, and financial performance. Hospital IQ's cloud-based software platform combines advanced data analytics, machine learning, and simulation technology with an easy-to-use, intuitive user interface to deliver optimized surgical resource alignment, patient flow, and staff scheduling capabilities. Hundreds of leading hospitals and health systems rely on Hospital IQ to help them make the right operational decisions the first time, every time. To learn more, visit http://www.hospiq.com.

View source version on businesswire.com: https://www.businesswire.com/news/home/20220512005230/en/

Contacts

Laura Bastardi, Matter for Hospital IQ, Hospital_IQ@matternow.com


EDU introduces MSc in Data Analytics for the first time in the country – The Business Standard

East Delta University (EDU) started a specialised master's program titled MSc in Data Analytics and Design Thinking for Business.

A pioneer in offering contemporary degree programmes, EDU is the first and only university in Bangladesh to introduce a degree in Data Analytics and Design Thinking, reads a press release.

The MSc in Data Analytics and Design Thinking for Business degree blends an overview of data analytics with the skills needed to design creative and successful business models.

This degree allows the learners to focus on data-driven business designs, marketing and human resources management with the appropriate use of data science.

Alongside the analytics module, students will study strategic marketing in this digital era and creative HRM practices with extensive use of data science, reads the statement.

The architect of the programme, EDU Founder and Vice-chairman Sayeed Al Noman, expressed his philosophy regarding this one-of-a-kind programme: "The innovative business design makes for substantive and emotional distinction which can certainly lead towards making a lasting effect in the human psyche."

"In this era, the design has become profoundly crucial in business spheres, and companies are consistently seeking to recognise and optimise the strategic edge that creative design can bring."

"To help companies solve today's business issues in new ways, business executives aim to improve their innovative and operational thinking skills. Such a program was not offered before in our country and to relinquish this void, EDU came forward with an internationally acclaimed faculty pool and a curriculum that resonates with the global requirements too," he said.

There are four modules, including the Data Analytics module, Design Thinking for Business module, Creative Design for Marketing and HRM module, and Application of Data Analytics for Business module.

Participants will learn to understand and analyse data from the cores, including data design, data handling and decision making, and data visualization and interpretation. It will include a range of business courses linked to business designs for creative thinking and practices.

The MSc in Data Analytics and Design Thinking for Business program will prepare students for careers that apply and manage modern data science to solve critical business challenges.

This degree aims to turn big data into actionable intelligence. To that end, business analysts use a variety of statistical and quantitative methods, computational tools, and predictive models as well as their knowledge of business, marketing, HRM, the corporate world, and the economy to make data-driven decisions and design thinking for modern business, reads the statement.

University Vice-chancellor Professor Sikandar Khan stated that "this crafted programme may open a new avenue to materialise the dream of the present government to build a Digital Bangladesh. The university got motivated to offer this one-of-a-kind master's programme considering the dream of the government of Bangladesh. To transform the country from traditional to digital requires a group of talented and technology-centric young people who can combine the ideas of business with the latest technology and data science."

EDU authorities declared that applicants and graduates from any background are eligible for this program. The unique curriculum mapping of this unconventional programme allows applicants with any undergraduate degree to apply for admissions.

The graduates from non-relevant disciplines will also be equipped with necessary techniques from the tailor-made course.

The total program cost is Tk419,000, but the university authority is offering a special 70% waiver on tuition fees. A special waiver on admission fees will also be available.

After all these waivers, the programme will cost Tk167,000. In addition, students can avail up to a 100% scholarship based on academic merit and work experience.


Analytics and Data Science News for the Week of May 6; Updates from Domino Data Lab, Gartner, Starburst, and More – Solutions Review

The editors at Solutions Review have curated this list of the most noteworthy analytics and data science news items for the week of May 6, 2022.

Keeping tabs on all the most relevant data management news can be a time-consuming task. As a result, our editorial team aims to provide a summary of the top headlines from the last month, in this space. Solutions Review editors will curate vendor product news, mergers and acquisitions, venture capital funding, talent acquisition, and other noteworthy data science and analytics news items.

New capabilities will recommend the optimal size for a development environment, thereby improving the model development experience for data science teams. Integrated workflows in Domino 5.2 automate model deployment to Snowflake's Data Cloud and enable the power of in-database computation, as well as model monitoring and continuous identification of new production data to update data drift and model quality calculations that drive better business decisions.

Read on for more.

Analyst house Gartner, Inc. has released its newest research highlighting four emerging solution providers that data and analytics leaders should consider as complements to their existing architectures. The 2022 Cool Vendors in Analytics and Data Science report features information on startups that offer some disruptive capability or opportunity not common to the marketplace.

Read on for more.

An integration plugin now ships with every Starburst Enterprise instance and features include: scalable attribute-based access control (ABAC), sensitive data discovery and classification, data policy enforcement and advanced policy building, and dynamic data masking auditing.

Read on for more.

For consideration in future data analytics news roundups, send your announcements to tking@solutionsreview.com.



Dataiku Named to Forbes AI 50 List of Top AI Companies Shaping the Future – GlobeNewswire

New York, May 12, 2022 (GLOBE NEWSWIRE) -- Dataiku, the platform for Everyday AI, today announced it has been named to the Forbes AI 50, a list of the top private companies in North America using artificial intelligence to transform industries and shape the future. Dataiku is the only AI platform that empowers anyone from technical staff to business leadership to simply and quickly design, deploy, govern, and manage AI and analytics applications.

To create the list, Forbes, in partnership with Sequoia Capital, evaluated over 400 submissions from the U.S. and Canada. An algorithm identified the top 100 companies with the highest quantitative scores. A panel of expert AI judges then reviewed the finalists to hand-pick the 50 most compelling companies based on their use of AI-enabled technology, business models, and financials.

"At Dataiku we help all industries, from pharma to financing, truckstops to chicken farms, make AI part of an organization's everyday activities. Dataiku is proud to be recognized by Forbes as one of North America's top AI companies shaping the future," said Florian Douetteau, co-founder and CEO of Dataiku. "Being on the Forbes AI 50 list is an honor and encourages us to work even harder on the Everyday AI journey, enabling our customers to turn intangible data into tangible results, from the mundane to the moonshot."

This recognition comes at an exciting time for Dataiku, arriving within a one-week span of several other company milestones.


About Dataiku

Dataiku is the platform for Everyday AI that allows companies to leverage one central solution to design, deploy, govern, and manage AI and analytics applications. Since its founding in 2013, the company has been the leader in democratizing data and empowering organization-wide collaboration. Today, more than 450 companies worldwide use Dataiku to integrate and streamline their use of data, analytics, and AI, driving diverse use cases from fraud detection and customer churn prevention, to predictive maintenance and supply chain optimization. Stay connected with us on our blog, Twitter (@dataiku) and on LinkedIn.

About Gartner

Gartner, Market Guide for Multipersona Data Science and Machine Learning Platforms, 2 May 2022, Pieter den Hamer, et al. Gartner, Market Guide for DSML Engineering Platforms, 2 May 2022, Afraz Jaffri, et al.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved. Gartner does not endorse any vendor, product or service depicted in its research publications and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.


Ten Year Trends, Covid’s Impact, Where Dagdigian Was Wrong – Bio-IT World

By Allison Proffitt

May 12, 2022 | At Bio-IT World's 20th anniversary event last week, Chris Dagdigian and friends from BioTeam once again closed out programming with a rapid-fire look at the bio-IT landscape and an IT trends assessment, calling out what is working, what's not, and how it's all evolving.

Repeating a format he introduced last fall, Dagdigian led a panel of speakers, each commenting on their own experience with bio-IT trends. This year all four speakers were experienced BioTeam consultants, but Dagdigian also flagged Twitter friends who can speak freely, including James Cuff (@DrCuff, hear his Trends from the Trenches podcast here), Chris Dwan (@fdmnts), @hpcguru, and Corey Quinn (@QuinnyPig).

From his personal vantage point, having given an IT trends talk at Bio-IT World since 2009, Dagdigian began by outlining the trends that have held firm over the past decade. He still starts every year, Dagdigian said, with existential dread: science has always changed more rapidly than IT can keep up, and most certainly faster than your IT budget renews. This remains a problem, Dagdigian said, and there is a real risk when IT builds the wrong solution for the scientist.

Cloud, he repeated, remains a capability play, not a cost savings strategy. Capability and flexibility still justify cloud adoption; they do not, however, justify a multi-cloud approach. A multi-cloud strategy is definitely dumb, Dagdigian said, while a hybrid cloud approach is absolutely fine. Multi-cloud requires developers to devolve applications to the lowest common API denominator. It's a degraded experience, he said, unless you were all in on Kubernetes, which can reasonably port between AWS, Google Cloud, and Microsoft Azure. In his trademark bluntness, Dagdigian said any company with a multi-cloud strategy is a red flag for poor senior leadership.

As in years past, moving and managing data is a pain, Dagdigian said, and he again threatened to call out scientists who build careers and publication lists on data-intensive science but refuse to take responsibility for their own data.

"It's a shared responsibility model. My job as an IT person is to provide you with safe, durable storage options that are fit for purpose and aligned with what you're trying to do. The combo between science and IT is to provide end users with tools to manage, govern, pull actionable insights, understand what we're actually storing. But finally end users have to take some responsibility. That's the sort of missing piece of the equation. It is wildly inappropriate for IT to make a lot of storage and data management decisions," he said.

Dagdigian deemed many of the problems that we've struggled with in years past to be solved problems, including compute, networking, and storage. He called compute mostly a financial planning exercise and flagged Internet2 and PetaGene as solid networking options that are no longer hard, risky, or exotic.

He pointed to many vendors in the Bio-IT space that can help with storage, all with strong track records and referenceable customers. He advised starting with object storage or scale-out NAS, only exploring something else if business or scientific needs require it.

So Smug, So Wrong

But one of the great attractions of Dagdigian's annual insights is his willingness, even delight, in pointing out his past errors. He flagged his own failed storage prediction with zeal: "The future of scientific data at rest is object storage," he recounted on a slide, attributing the quote to "some jerk."

It sounded good! Object storage can be deployed on premises and in the cloud. Metadata tagging is fantastic for scientific data management and search. And object storage is purpose built for a FAIR future in which humans are not the dominant data consumers.

"I am completely, utterly, totally wrong on this," he said. We're still using POSIX, Windows- or Linux-flavored storage.

It turns out, Dagdigian conceded, scientists do still assume that humans are doing most of the folder browsing, and neither commercial code nor open-source code is object-aware. Scientists who just need to transform data in R or Python don't have the bandwidth to learn object storage.

In fact, he flagged the death of tiered storage. Machine learning and AI have upended long-held storage design patterns in the past three to four years, he said.

"The concept of having an archive tier or a nearline tier or a slow tier doesn't make a lot of sense. If you're talking about machine learning or AI, you're really churning through all of your data, old and new, all the time. You're constantly reevaluating, retraining, pulling out different training sets," Dagdigian said. "I no longer can get away with tiers of different speed and capacity. If I need to satisfy the ML and AI people, you pretty much need one single tier of performant storage."

The vendor landscape on this new storage structure is crowded, he said, but he highlighted VAST Data, Weka, Hammerspace, and DellEMC.

COVID-Era Trends

Next, Dagdigian turned his attention to the trends arising over the past year or two, starting with one of his biggest obnoxious problems: the scarcity of GPUs on the cloud, particularly in Amazon's US-East-1 region. One Boston-based client is building their first AWS footprint in US-East-2 simply because "we cannot get the GPUs that we need, particularly for computational chemistry workloads."

An increasingly attractive alternative, Dagdigian said, is launching clusters in co-location spaces. He highlighted Markley Group as New England's best-connected co-location facility and gave a quick outline of the solution he's placing there: 1.5 TB of RAM, 40 CPU cores, and four Nvidia Tesla V100 GPUs for about $70,000. As part of a hybrid cloud solution, a co-lo cluster creates a hedge against cloud GPU scarcity or rising cost. He recommends using such solutions to soak up computational chemistry, simulations, and other persistent workloads.

From his current personal BioTeam workload, Dagdigian is integrating AWS ParallelCluster with Schrodinger computational chemistry tools and SLURM's license-aware job scheduling. It may be a niche use case, he conceded, but an autoscaling HPC grid that understands that you cannot run a job unless a particular license is available is "a magic nirvana for us."

Finally, he zipped through a few mini-trends that seem to be at reality inflection points.


Chinese rover detects water existed on Mars more recently than thought – UPI News

Scientists used instruments to analyze rocks and minerals on the surface of Mars, finding evidence there was substantial liquid water on the planet more recently than previously thought. Photo courtesy of the China National Space Administration

May 11 (UPI) -- Nearly one year after landing on Mars, scientists say China's Zhurong rover collected data indicating water may have existed on the planet over a longer period of time than previously thought.

A study published Wednesday in the journal Science Advances said Zhurong detected evidence that the Utopia Planitia basin had "substantial" liquid water during its most recent epoch of geologic history -- the Amazonian. Scientists previously believed this time period, about 700 million years ago, to be cold and dry and liquid water activities to be "extremely limited."

Before assessing the new data, scientists believed that Mars lost much of its water after its Hesperian period, about 3 billion years ago.

The Zhurong rover touched down on Mars' surface on May 15, 2021, as part of the Tianwen-1 mission. The main point of the mission was to search for signs of life, ice and water.

Scientists from China's National Space Science Center and the Chinese Academy of Sciences analyzed data gathered from a laser-induced breakdown spectrometer, telescopic microimaging camera and short-wave infrared spectrometer to study minerals to determine the amount of liquid water that would have been at the site millions of years ago.

NASA's Curiosity Mars rover used two different cameras to create this panoramic selfie, composed of 60 images, in front of Mont Mercou, a rock outcrop that stands 20 feet tall, on March 26, 2021, the 3,070th Martian day, or sol, of the mission. These were combined with 11 images taken by the Mastcam on the mast, or "head," of the rover on March 16. The hole visible to the left of the rover is where its robotic drill sampled a rock nicknamed "Nontron." The Curiosity team is nicknaming features in this part of Mars using names from the region around the village of Nontron in southwestern France. Photo courtesy of NASA/JPL-Caltech/MSSS


Podcast: All Things Data with Guest Sean Owen – Newsroom | University of St. Thomas – University of St. Thomas Newsroom

In the ever-evolving technology landscape, data analytics and data strategy continue to play a larger role in economics and business models. Director of the Center for Applied Artificial Intelligence at the University of St. Thomas, Dr. Manjeet Rege, co-hosts the "All Things Data" podcast with adjunct professor and Innovation Fellow Dan Yarmoluk. The podcast provides insight into the significance of data science as it relates to business models, business economics and delivery systems. Through informative conversation with leading data scientists, business model experts, technologists and futurists, Rege and Yarmoluk discuss how to utilize, harness, and deploy data science, data-driven strategies, and enable digital transformations.

Rege and Yarmoluk spoke with Sean Owen on Cloudera solutions, big data, and data science skills. Owen is the current Director of Data Science at Cloudera, a hybrid data cloud company. Before Cloudera, Sean founded Myrrix Ltd (now the Oryx project) to commercialize large-scale real-time recommender systems on Apache Hadoop. He is an Apache Spark Committer and co-authored Advanced Analytics with Spark. He was a Committer and VP for Apache Mahout, and co-author of Mahout in Action. Previously, Sean was a senior engineer at Google.

Here are some highlights from their conversation.

Q. How would you describe the difference between data science and big data?

A. These are big terms, I think they mean different things to different people. I tend to think of big data as a movement that started right after the ".com" bust. Early 2000s, when the availability of data increased dramatically with the rise of the web and then mobile. Suddenly there was a huge amount of data being generated that one could collect. It also began to get cheaper and cheaper to store data. So big data was a name for this phenomenon. We suddenly went from a data-scarce world to one where you could collect as much data as you cared to. Data science is obviously a mixture of data and statistics, as well as engineering and computer science. And it's necessary these days because you can't really separate the software issues from the analytic issues. When you are doing analytics today you are working with software. So those worlds have come together, and I think data helps to propel those worlds together.

Q. How does Cloudera differentiate itself from other hybrid cloud systems?

A. We like to present Cloudera as an enterprise data hub. It is a big generic platform. It is a place to store data, process data, and secure it to do analytics and machine learning. It's a big Swiss army knife. We are looking to help you solve your problems. I think Cloudera offers more scale on its platform compared to competitors. Cloudera has made better decisions about what to center around the core of its platform and what packages to surround itself with.

Q. What is the market needing to do to harness the power of big data?

A. Let's think about the ingredients there. We are going to need data and we are going to need software and some skills, and then we need to figure out what to do with it. Two of those elements are pretty easy: software is free and computers are cheap. I think data is one of the remaining differentiators in this new era of big data and data analytics. What differentiates Company A from Company B is who has better data and who is better organized about data collection. One thing that can't hurt anyone is investing in collecting data intelligently. You have to have a purpose too. Data by itself just sits there. It has to be mined and interpreted to have real value.

Listen to their conversation here:


IBM’s AutoAI Has The Smarts To Make Data Scientists A Lot More Productive But What’s Scary Is That It’s Getting A Whole Lot Smarter – Forbes


I recently had the opportunity to discuss current IBM artificial intelligence developments with Dr. Lisa Amini, an IBM Distinguished Engineer and the Director of IBM Research Cambridge, home to the MIT-IBM Watson AI Lab. Dr. Amini was previously Director of Knowledge & Reasoning Research in the Cognitive Computing group at IBM's TJ Watson Research Center in New York. Dr. Amini earned her Ph.D. in Computer Science from Columbia University. Dr. Amini and her team are part of IBM Research tasked with creating the next generation of automated AI and data science.

I was interested in automation's impact on the lifecycles of artificial intelligence and machine learning and centered our discussion around next-generation capabilities for AutoAI.

AutoAI automates the highly complex process of finding and optimizing the best ML model, features, and model hyperparameters for your data. AutoAI does what otherwise would need a team of specialized data scientists and other professional resources, and it does it much faster.

AI model building can be challenging

How Much Automation Does a Data Scientist Want?

Building AI and machine learning models is a multifaceted process that involves gathering requirements and formulating the problem. Before model training begins, data must be acquired, assessed, and preprocessed to identify and correct data quality issues.

Because the process is so complex, data scientists and ML engineers typically create ML pipelines to link those steps together for reuse each time data and models are refined. Pipelines handle data cleansing and manipulation operations for model training, testing and deployment, and inference. Constructing and tuning a pipeline is not only complex but also labor-intensive. It requires a team of trained resources who understand data science, plus subject-matter experts knowledgeable about the model's purpose and outputs.

It is a lengthy process because there are many design choices to be made, plus a myriad of tuning adjustments for various data processing and modeling stages.

The pipeline's high degree of complexity makes it a prime candidate for automation.

IBM AutoAI automates model building across the entire AI lifecycle


According to Dr. Amini, AutoAI does in minutes what would typically take hours to days for a whole team of data scientists. Automated functions include data preparation, model development, feature engineering, and hyperparameter optimization.


End-to-end automation of the entire model building process can result in significant resource savings.

AutoAI provides a significant productivity boost. Even a person with basic data science skills can automatically select, train, and tune a high-performing ML model with customized data in just a few mouse clicks.

However, expert data scientists can rapidly iterate on potential models and pipelines, and experiment with the latest models, feature engineering techniques, and fairness algorithms. This can all be done without having to code pipelines from scratch.

Future AI automation projects

IBM Research is working on several next-generation AI automation projects, such as next-generation algorithms to handle new data types, bring new automated quality and fairness, and dramatically boost scale and performance.

Dr. Amini provided a deep dive into two especially interesting next-generation capabilities for scaling enterprise AI: AutoAI for Decisions and Semantic Data Science.

AutoAI for improved decision making

Time series forecasting is one of the most popular, but also one of the most difficult, types of predictive analytics. It uses historical data to predict the timing of future results. Time series forecasting is commonly used for financial planning, inventory, and capacity planning. The time dimension within a dataset makes analysis difficult and requires more advanced data handling.


IBM's AutoAI product already supports time series forecasting, automating the steps of building predictive models.

Dr. Amini explained that, in many settings, after a time series forecast is created the next step is to leverage that forecast for improved decision-making.

For example, a data scientist might build a time series forecasting model for product demand, but the model can also be used as input for inventory restocking decisions, with the goal of maximizing profit by reducing costly over-stocking and avoiding lost sales due to stock outages.

Simple heuristics are sometimes used for inventory restocking decisions, such as determining when inventory should be restocked and by how much. In other cases, a more systematic approach, called decision optimization, is leveraged to build a prescriptive model to complement the predictive time series forecasting model.
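As a toy sketch of this predictive-to-prescriptive handoff (illustrative only, with assumed numbers, not IBM's implementation):

    # A point forecast from a time series model feeds a restocking decision.
    forecast_demand = 120   # units predicted for next period (assumed value)
    current_stock = 40

    # Simple heuristic: order up to the forecasted demand level.
    order_quantity = max(0, forecast_demand - current_stock)
    print('Heuristic order quantity:', order_quantity)

    # Decision optimization would instead model holding and stock-out costs
    # explicitly and search for the order quantity that maximizes expected
    # profit under forecast uncertainty.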

Prescriptive analytics (as opposed to predictive analytics) use sophisticated mathematical modeling techniques and data structures for decision optimization and leverage expertise that is in short supply. However, products for automated decision optimization pipeline generation created directly from data, like AutoAI for predictive models, do not exist today.

Multi-model pipelines


Dr. Amini explained that the best results are obtained by using both machine learning and decision optimization. To support that capability, IBM researchers are working on multi-model pipelines that could accommodate the needs of predictive and prescriptive models. Multi-model pipelines will allow business analysts and data scientists to use a common model to discuss aspects of the problem from each other's perspectives. Such a product would also promote and improve collaboration between diverse but equally essential resources.

Automation for Deep Reinforcement Learning

The new capability to automate pipeline generation for decision models is now available through the Early Access program from IBM Research. It leverages deep reinforcement learning to learn an end-to-end model from data to decision policy. The technology, called AutoDO (Automated Decision Optimization), leverages reinforcement learning (RL) models and gives data scientists the capability to train machine learning models to perform sequential decision-making under uncertainty. Automation for reinforcement learning (RL) is critical because RL algorithms are highly sensitive to internal hyperparameters. Therefore, they require significant expertise and manual effort to tune them to specific problems and data sets.

Dr. Amini explained that the technology automatically selects the best reinforcement learning model to use according to the data and the problem. Using advanced search strategies, it also selects the best configuration of hyperparameters for the model.

The system can automatically search historical data sets or any gym-compatible environment to automatically generate, tune, and rank the best RL pipeline. The system supports various flavors of reinforcement learning, including online and offline learning and model-free and model-based algorithms.
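The general idea can be sketched in miniature. This toy (not AutoDO itself) trains a tabular Q-learning agent on a tiny corridor environment under several hyperparameter configurations and ranks the configurations by the return they achieve:

    import numpy as np

    N_STATES, GOAL = 5, 4  # tiny corridor; reaching the goal state pays 1.0

    def run_episode(q, alpha, epsilon, gamma=0.9):
        # One episode of epsilon-greedy tabular Q-learning
        state, total = 0, 0.0
        for _ in range(20):
            if np.random.rand() < epsilon:
                action = np.random.randint(2)      # explore: random left/right
            else:
                action = int(np.argmax(q[state]))  # exploit: greedy action
            next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
            reward = 1.0 if next_state == GOAL else 0.0
            # Standard Q-learning update
            q[state, action] += alpha * (reward + gamma * q[next_state].max() - q[state, action])
            state, total = next_state, total + reward
            if state == GOAL:
                break
        return total

    def score_config(alpha, epsilon, episodes=200):
        np.random.seed(0)
        q = np.zeros((N_STATES, 2))
        returns = [run_episode(q, alpha, epsilon) for _ in range(episodes)]
        return float(np.mean(returns[-50:]))  # average return once learning settles

    # Miniature automated RL: search configurations and rank them by return
    configs = [(a, e) for a in (0.1, 0.5) for e in (0.05, 0.3)]
    best = max(configs, key=lambda c: score_config(*c))
    print('Best (alpha, epsilon):', best)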

Scaling AI with automation

Automation for reinforcement learning tackles two pressing problems for scaling AI in the enterprise.

First, it provides automation for sequential decision-making problems where uncertainty may weaken heuristic and even formal optimization models that don't utilize historical data.

Secondly, it brings an automated, systematic approach to the challenging reinforcement learning model building domain.

Semantic Data Science

State-of-the-art automated ML products like AutoAI can efficiently analyze historical data to create and rank custom machine learning pipelines. It includes automated feature engineering, which expands and augments the feature space of data to optimize model performance. Automated methods currently rely on statistical techniques to explore the feature space.

However, if a data scientist understands the semantics of the data, it is possible to leverage domain knowledge to expand the feature space to increase model accuracy. This expansion can be done using complementary data from internal or external data sources. Feature space is the group of features used to characterize data. For example, if the data is about cars, the feature space could be (Ford, Tesla, BMW).
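A toy sketch of the idea (illustrative only, not IBM's implementation): domain knowledge about car brands can be joined in as complementary data, expanding the feature space beyond the original columns:

    import pandas as pd

    cars = pd.DataFrame({'brand': ['Ford', 'Tesla', 'BMW'],
                         'mileage': [40000, 12000, 30000]})

    # Complementary data encoding domain knowledge about each brand
    brand_info = pd.DataFrame({'brand': ['Ford', 'Tesla', 'BMW'],
                               'country': ['USA', 'USA', 'Germany'],
                               'segment': ['mass-market', 'premium', 'premium']})

    # The merge expands the feature space from (brand, mileage)
    # to (brand, mileage, country, segment)
    expanded = cars.merge(brand_info, on='brand')
    print(expanded)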

Complementary feature transformations may be found in existing Python scripts or in relationships described in the literature. Even knowing which features and transformations are relevant, however, a user must have sufficient technical skills to decipher and translate them from code and documents.


New semantic power for data scientists

Dr. Amini described another powerful new capability created by IBM Research, called Semantic Data Science, that automatically detects semantic concepts for a given dataset. Semantic concepts capture the meaning of words and sentences, providing a way for meanings to be represented. Once AutoAI has detected the proper semantic concepts, the program uses those concepts in a broad search for relevant features and feature engineering operations that may be present in existing code, data, and literature.

AutoAI can use these new, semantically-rich features to improve the accuracy of generated models and provide human-readable explanations with these generated features.


Even without having domain expertise to assess these semantic concepts or new features, a data scientist can still run AutoAI experiments. However, data scientists who want to understand and interact with the discovered semantic concepts can use the Semantic Feature Discovery visual explorer to explore discovered relationships.

Users can go directly from the visual explorer into the Python code or document where the new feature originated simply by clicking the Sources hyperlink.


The Semantic Data Science capability is also available as an IBM Research Early Access offering. Some of the capabilities are even available for experimentation on IBM's API Hub.

Dr. Amini concluded our conversation and summed up the vast research effort IBM is pouring into AutoAI with a single, efficient sentence:

"We want AutoAI and Semantic Data Science to do what an expert data scientist would want to do but may not always have the time or domain knowledge to do by themselves."


