Category Archives: Data Science

PyCharm vs. Spyder: Choosing the Right Python IDE – Unite.AI

Python is immensely popular among developers and data scientists due to its simplicity, versatility, and robustness, making it one of the most used programming languages in 2023. With hundreds of thousands of packages available, the Python ecosystem continues to evolve with better tools, plugins, and community support.

When we talk about Python development, Integrated Development Environments (IDEs) take center stage, allowing developers to enhance their coding experience. Two popular IDEs for Python development are PyCharm and Spyder. This article briefly compares PyCharm vs. Spyder to help developers make an informed choice.

Before comparing PyCharm vs. Spyder to determine the best IDE for Python development, it's essential to understand what these tools entail.

PyCharm is a product by JetBrains that offers a feature-rich integrated development environment for Python. The IDE comes in two editions: PyCharm Community and PyCharm Professional. The former is a free, open-source version, while the latter is a paid version for full-stack development. Both versions support several features, including code completion, code analysis, debugging tools, and integration with various version control systems. The Professional edition further includes frameworks for web development and data science.

Spyder, or the Scientific Python Development Environment, is an open-source IDE primarily focused on data science and scientific computing in Python. It's part of the Anaconda distribution, a popular package manager and distribution platform for Python. Spyder provides comprehensive tools for advanced data analysis, visualization, and scientific development. It features automatic code completion, code analysis, and vertical/horizontal screen splits with a multi-language editor pane that developers can use for creating and modifying source files. Moreover, developers can extend Spyder's functionality with powerful plugins.

Several similarities and differences exist between these two IDEs. Below, we compare them against various dimensions, including code editing and navigation features, debugging capability, support for integrated tools, customizability, performance, usability, community support, and pricing.

Both PyCharm and Spyder offer powerful code editing and navigation features, making it easy for developers to write and understand code across files. Spyder provides comparable code completion and navigation, but it is less robust than PyCharm's code editing, which offers context-aware recommendations for faster development. For instance, PyCharm sorts completion suggestions by priority, based on how other developers have coded in similar scenarios.

PyCharm leads this category with its advanced code analysis and completion capabilities.

PyCharm's Professional edition includes a JavaScript debugger that supports various debugging modes, including remote debugging. It also provides a visual Python debugger with breakpoints, variable inspection, and step-by-step execution.

Spyder integrates pdb, Python's built-in source-level debugger, which lets developers set conditional breakpoints and inspect stack frames. Its Variable Explorer is particularly helpful for checking variable states at different breakpoints.
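As a rough illustration of the conditional-breakpoint workflow both debuggers support, the sketch below drops into pdb only when a suspicious value appears; the function and data are hypothetical, not taken from either IDE's documentation.

```python
import pdb


def rolling_mean(values, window):
    """Compute a simple rolling mean, pausing in the debugger on bad input."""
    means = []
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]
        # Acts like a conditional breakpoint: only stop when a negative
        # value sneaks into the current window.
        if any(v < 0 for v in chunk):
            pdb.set_trace()
        means.append(sum(chunk) / window)
    return means


if __name__ == "__main__":
    print(rolling_mean([3, 5, -1, 4, 6], window=2))
```

At the pdb prompt, commands such as p chunk (print a variable) and where (show the stack) expose the same information Spyder's Variable Explorer and PyCharm's visual debugger surface graphically.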

While Spyder's debugging capabilities are robust, PyCharm's visual debugger has the edge in more complex debugging scenarios.

PyCharm has extensive integration with third-party tools and services. For instance, it has built-in support for version control systems like Git, SVN, Perforce, etc. The professional edition supports web development frameworks, such as Django, Flask, Angular, etc., making it an excellent choice for full-stack development.

Spyder, primarily a data science and scientific computing utility, comes with numerous libraries and tools, such as NumPy, SciPy, Matplotlib, and Jupyter Notebooks. Also, it shares all libraries that come with the Anaconda distribution. However, Spyder only supports Git for version control.

Overall, PyCharm pulls ahead of Spyder in this category, since it offers integration with a more diverse set of tools through plugins.

PyCharm offers a high level of visual customization, allowing developers to tailor the IDE to their workflow and preferences. They can change fonts and colors, adjust code style, configure keyboard shortcuts, and more.

Spyder is less customizable than PyCharm. The most a user can do is change the user interface (UI) theme, choosing among a few light and dark styles.

Again, PyCharm takes the win in the customization category.

While performance can vary depending on the size and complexity of the projects, Spyder is relatively faster than PyCharm. Since PyCharm has many plugins installed by default, it consumes more system resources than Spyder.

As such, Spyder's lightweight architecture can make it a better choice for data scientists who work on large datasets and complex data analysis.

Spyder is the clear winner in the performance category.

PyCharm has many customization options for its user interface (UI). Developers benefit from an intuitive navigation system with a clean layout. However, its extensive feature set means it has a steep learning curve, especially for beginners.

In contrast, Spyder's interface is much more straightforward. Like RStudio, it has a variable explorer pane, a console, a plot visualization section, and a code editor, all on a single screen. The simplified view is best for data scientists who want a holistic view of model results alongside diagnostic charts and data frames. Also, Spyder's integration with Jupyter Notebooks makes data exploration and visualization easier for those new to data science.

Overall, Spyder is ideal for beginners, while PyCharm is more suited to experienced Python developers.

PyCharm has a free and a paid version. The free Community edition is suitable for individual developers and teams working on a small scale. The paid Professional edition comes in two variants, for organizations and individuals. The organization plan costs USD 24.90 per month, while the individual plan costs USD 9.90 per month.

In contrast, Spyder is open-source and entirely free to use. It comes as part of the Anaconda distribution, which is also open-source and free.

In terms of cost, Spyder is the clear winner. However, for Python development overall, it is up to practitioners and organizations to choose based on their business requirements.

Both PyCharm and Spyder have active communities that provide extensive support to users. PyCharm benefits from JetBrains' strong reputation and rich experience in building Python development tools. As such, developers can utilize its large user community and get help from a dedicated support team. They also have access to many tutorials, help guides, and plugins.

Spyder leverages the Anaconda community for user support. With an active data science community, Spyder benefits from the frequent contributions of data scientists who provide help through forums and online resources, data science tutorials, frameworks, and computation libraries.

Again, it is up to the practitioners and organizations to choose a community that aligns with their task or business requirements.

Choosing between PyCharm and Spyder can be challenging. It's helpful to consider some of their use cases so practitioners can decide which IDE is better for their task.

PyCharm is ideal for full-stack developers as the IDE features several web and mobile app development tools and supports end-to-end testing. It's best for working on large-scale projects requiring extensive collaboration across several domains.

Spyder, in contrast, is suitable for data scientists, researchers, and statisticians. Its lightweight architecture allows users to perform exploratory data analysis and run simple ML models for experimentation. Instructors can use this IDE to teach students the art of data storytelling and empower them to train machine learning models efficiently.

The choice between PyCharm and Spyder ultimately depends on user needs, as both IDEs offer robust features for specific use cases.

PyCharm is best for experienced professionals who can benefit from its advanced web development tools, making it an excellent choice for building web and mobile apps. Users wishing to learn data science or work on related projects should go for Spyder.


360DigiTMG: Your Path to Success in Data Science and AI Careers – mid-day.com

360DigiTMG

360DigiTMG, a globally recognized leader in IT training, has introduced a groundbreaking Professional Data Science and AI with Placement Guarantee Program. The goal of this extensive program is to provide students with the knowledge and abilities they need to succeed in the fast-paced fields of data science and artificial intelligence, opening the door to lucrative careers in these emerging fields.

360DigiTMG is leading the way globally in offering expert training in many different fields. The institute, which was founded in 2013, is well-known in India and other countries. 360DigiTMG significantly transforms the careers of its students by continuously offering them an exceptional learning experience. It has branches in Malaysia, the United States, East Asia, Australia, the United Kingdom, the Netherlands, and the Middle East in addition to its main office in India.

Data Science and AI: A Trending Career

The combination of data science and AI has transformed businesses. Automation, output, cost reduction, and creativity are now common uses of AI. PwC predicts that AI will affect 50% of human jobs within five years, increasing demand for AI experts. AI job openings have increased by 119% in the past three years, doubling the demand for AI skills.

A growing emphasis is being placed on creating ethical and impartial AI technologies amid discussions about how AI will affect employment. Leading tech firms, including Microsoft and Google, have created ethics committees to monitor AI's effects on society. Data silos have been dismantled by data management platforms, allowing businesses to gain insightful data. AI is also making it possible to customize services, especially in the finance industry, where analytics are essential for retaining customers.

The public cloud market is predicted to be dominated by AI platforms in the upcoming years as Google, AWS, and Microsoft expand their AI cloud offerings. More businesses will use real-time analytics to make data-driven decisions by spotting hidden patterns. Other fields experiencing rapid development include IoT applications, Patent Analytics, market sizing tools, and Earning Transcripts analysis.

360DigiTMG's Data Science & AI Professional Course with Placement Guarantee

Leading professionals in India have received training from 360DigiTMG. Through its outstanding training programs created to meet industry needs, the institute is dedicated to transforming careers. The curriculum of each certification program at 360DigiTMG is carefully crafted to reflect the most recent market trends.

The Professional Data Science and AI with Placement Guarantee Program provides a solid foundation in math, statistics, calculus, linear algebra, probability, data mining, and regression analysis. Along with NLP libraries and OpenCV for coding machine learning algorithms, the program also covers Python programming for data mining and machine learning.

The main strength of the program is its thorough treatment of machine learning, deep learning, and neural networks. This includes subjects like feedforward and backward propagation, activation and loss functions, non-linear activation functions, convolutional neural networks (CNNs), recurrent neural networks (RNNs), GANs, reinforcement learning, and Q-learning. For IT enthusiasts looking to design and develop AI applications, it is a complete package.

To provide students with practical experience, 360DigiTMG collaborates with renowned businesses like Panasonic India Innovation Center and Innodatatics, which is accredited by UTM, Malaysia. For concept review, students have access to a Learning Management System (LMS), and the institute provides 100% job assistance to help students land jobs at prestigious companies.

Key Features of the Professional Course in Data Science & AI with Placement Guarantee

Who Should Sign Up For The Program

Since its establishment, 360DigiTMG has helped countless people transform their careers by providing top-notch training opportunities.

The Professional Data Science and AI with Placement Guarantee Program is ideal for individuals aspiring to become Data Scientists, AI experts, Business Analysts, Data Analytics developers, recent graduates seeking careers in Data Science, Machine Learning, and AI, professionals transitioning into Data Science, academicians, researchers, and students entering the IT industry.

Since it was started, the program has attracted 3472 learners who have embarked on a journey towards a promising future in Data Science and AI, and yours could be the next success story.

For additional information about 360DigiTMG and its Professional Data Science and AI with Placement Guarantee Program, please visit https://360digitmg.com/.


Why Some Top Runners Prefer to Train Without a GPS Watch – The New York Times

As a decorated college runner at Notre Dame and then at the University of Tennessee, Dylan Jacobs dabbled with a device that many of his teammates considered indispensable.

But on those rare occasions when Jacobs succumbed to peer pressure and slapped a GPS watch around his wrist, he almost immediately remembered why he had resisted the temptation in the first place.

"The runs just felt so much longer," said Jacobs, 23, a three-time N.C.A.A. champion who recently turned pro. "That was one of my main problems with it. I wasn't enjoying myself or looking around. Instead, I was kind of looking at the watch every quarter-mile to see how much longer I had left."

GPS watches (popular brands are Garmin, Suunto and Coros) come equipped with satellite technology and heart rate monitors to produce a buffet of functions. Want to know how far and how fast you've run, or how many milliliters of sweat you dumped in Central Park last weekend? How about your average stride length? Your cadence? The list goes on.

For many, GPS watches are a remarkably useful training tool. But there are other runners, including world-class runners like Jacobs, who have a hard time understanding the fuss. To them, a smorgasbord of data is more hindrance than help. And get this: Some runners don't wear watches at all.

"I like to focus more on the feel of everything and not worry too much about the time," Jacobs said.

Heather MacLean, an Olympic 1,500-meter runner, recalled a period of her life when she enjoyed the utility of a GPS watch. As a student at the University of Massachusetts, she grew to understand the value of sleep (and, more important, that she was not getting enough of it) while working in a neuroscience laboratory. So she began using a Garmin Forerunner to monitor her rest and adjust her schedule.

Later, as a first-year pro with Team New Balance Boston, MacLean tried to be consistent about wearing a GPS watch but was hampered by a couple of issues. First, she was always forgetting to charge it.

"I would just let it die all the time, and I'm super lazy with that kind of stuff," she said.

Second, MacLean realized her watch was draining the fun from her runs. It was especially apparent to her during a low-key stretch when she was simply trying to build fitness.

"I hated that every run I went on, I felt like I had to check my pace and my distance and whatever else," she said. "So I just decided that I was going to lay off it for a while and switch to a regular watch."

She never went back. MacLean, 28, who now wears an Armitron Dragonfly that she said she picked up for $10 at Walmart, acknowledged that there were certain workouts when a GPS watch would come in handy, like when she did a tempo run by herself. (Tempo runs are faster than easy jogs, and frequently run at a prescribed pace.) But Mark Coogan, her coach, has long prioritized effort over pace, and MacLean logs her training in minutes rather than in miles.

"I know I'm at the elite level now, so not everything is going to be joyful," MacLean said. "But when there are things that bring me a lot of joy, I'm going to invest in them. And one of those things is the ability to avoid focusing on my pace during my runs."

Without the pressure of feeling as if she needs to account for every mile (or, perish the thought, post her workouts for public inspection on Strava, the exercise-tracking platform), MacLean has also gotten better about listening to her body. She has no qualms about bailing on an extra workout if she is feeling beat.

"And I'll tell Mark that I'm going for a walk instead," MacLean said. "And he's like, 'OK!'"

Sam Prakel was a high school standout in Versailles, Ohio, when the assistant coach of his cross-country team introduced him to the magic of GPS watches. Prakel invested in one. It was a mistake from the start.

"I just started running too fast on all my runs," Prakel said, "and it became harder to recover from them because I was so focused on my pace. I learned pretty quickly that it wasn't good for me."

Prakel opted instead for a Timex Ironman, which he wore through his freshman year at the University of Oregon. When the band snapped in his sophomore year, he ordered another. Prakel, 28, has worn the same no-frills watch ever since, through his time at Oregon, where he was a five-time all-American, and in more recent years as a pro miler for Adidas. He has never needed to change its battery.

The reigning U.S. indoor champion in the men's 1,500 and 3,000 meters, Prakel has a system that works for him, which is a throwback in a sense. What did runners do before the advent of GPS watches? They estimated. In Prakel's case, a 65-minute run is roughly equivalent to 10 miles and a half-hour jog is good for four miles. He does not need to be precise.

"As long as I do the same things every week and keep it consistent, that's all that matters," he said, adding: "I feel like I'm in a better place when I don't have all that data to worry about."

For some runners, aesthetics also matter. Luke Houser, a junior at the University of Washington who won an N.C.A.A. championship in the men's indoor mile last winter, wears a vintage-inspired Casio with a digital display and a gold metal band. His teammates simply refer to it as "the gold Casio."

"I just think it looks cool," he said. "I've never been interested in cadence or heart rate, which I don't think is ever that accurate anyway. All you need to know is how you feel and the time. That does the job."

Kieran Lumb, who recently broke his own Canadian record in the men's 3,000 meters, is well aware that he is the type of person who is susceptible to the sweet lure of data.

At the University of British Columbia, Lumb majored in electrical engineering. Later, while running at Washington, he earned a master's degree in information systems. And for the longest time, no one who knew him was surprised that he maintained an Excel spreadsheet to catalog his sleep, workouts and something he called "rated perceived fatigue."

"Just trying to do a little bit of data science on myself," he said.

The twist is that Lumb, 25, who now runs professionally for the athletics apparel brand On, has not worn a GPS watch since he was a competitive cross-country skier growing up in Canada. He made the switch as a college freshman to a Casio calculator watch that didn't even have a proper lap function for track workouts.

"So I'd just have to remember all my splits," he said, "and it was awesome."

Lumb noted that because many runners are naturally competitive, they can become obsessed with numbers. And the business of making it to the top of the heap as an elite runner can be especially taxing.

As a result, Lumb's coach, Andy Powell, tries to keep things as simple as possible. For Lumb, that has meant ditching his Excel folder in favor of Powell's old-school approach: weekly workout sheets that his runners fill out and file in three-ring binders.

"There's something nice about slowing down and writing it by hand that I find almost endearing," Lumb said. "It's taken a while for me to be less neurotic, but it's liberating."


One Model lands $41M to bring data science-powered insights to HR – TechCrunch


One Model, a platform that uses AI to help employers make decisions about recruiting, hiring, promotions, layoffs and general workplace planning, today announced that it's raised $41 million in a funding round led by Riverwood Capital.

Christopher Butler, CEO of One Model, said that the capital will be used to boost several of the company's growth initiatives, particularly in the areas of technology, product development, customer success and go-to-market.

"One Model's people analytics product roadmap will be expanded to solve problems for a diverse array of data science, analyst, people manager and C-level audiences, delivering tailored content proactively through alerts, notifications and individualized reporting," Butler told TechCrunch via email. "Further investment in One AI will provide more analysts and decision-makers with actionable forecasts, all from a powerful ethical and data governance posture."

One Model is what's known as a people analytics platform: a platform designed to collect and apply organizational and talent data to improve business outcomes, at least in theory. There's long been high interest in people analytics; according to a 2018 Deloitte survey, 84% of large organizations rated people analytics as important or very important and 69% had already formed a people analytics team.

And it's showing no signs of slowing. By one estimate, the market for people analytics software will grow from $2.58 billion in 2022 to $7.67 billion by 2031.

One Model's founding team, which includes Butler, hails from Inform, a people analytics service that was acquired by SuccessFactors, now SAP SuccessFactors, in 2010. Following the acquisition, they say, they witnessed a growing disconnect between what customers were asking to do with their people data and the solutions available in the market.

"Customers want greater access and more flexibility to manage and democratize access to people data and analysis," Butler said. "More data is being generated and collected by the tools we use to manage the workforce, and rarely are they integrated with each other. Every new technology and vendor in the HR space generates data, and organizations are not equipped to manage this data complexity."

Butler describes One Model as an "enterprise people data orchestration platform." That's quite a mouthful. But what One Model essentially does is provide a toolkit for extracting, modeling and governing HR data, as well as delivering that data to various applications and services.

One Model can perform basic tasks like identifying areas where a company has a shortage of skills or talent and projecting future workforce needs based on demographic changes and business goals. But beyond this, the platform can calculate the cost of turnover and headcount, attempting to create a plan that measures and reduces this cost over time.

"We recognized an opportunity to transform large organizations by enhancing enterprise people strategies through data insights," Butler said. "One Model is addressing persistent, age-old data integration challenges that large organizations face, while also tackling modern concerns around ensuring consistent, auditable talent decisions with robust privacy standards and data governance."

One Model customers can integrate different data sources and data destinations to create reports, dashboards and visualizations. They also gain access to One AI, One Model's data science suite, which can be used to perform a range of workloads centering around HR data.

For example, with One AI, a user can try to predict the likelihood that an employee resigns within a specified time frame, factoring in aspects like their daily commute and the time since their last promotion. (One wonders how accurate these predictions are, of course, given AI's potential to pick up on biases in practically any dataset.) Or they could identify opportunities to improve their company workforce's efficiency, perhaps by reallocating resources or adjusting work schedules.
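To make that concrete, here is a rough, hypothetical sketch of what such an attrition model might look like under the hood. This is not One AI's actual method; the column names and data are invented, and a real system would use far richer features and validation.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented example data: commute time, months since last promotion,
# and whether the employee resigned within a year.
df = pd.DataFrame({
    "commute_minutes":        [15, 55, 40, 70, 10, 65, 30, 80],
    "months_since_promotion": [6, 30, 18, 42, 3, 36, 12, 48],
    "resigned_within_year":   [0, 1, 0, 1, 0, 1, 0, 1],
})

X = df[["commute_minutes", "months_since_promotion"]]
y = df["resigned_within_year"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Estimated probability that each held-out employee resigns in the time frame.
print(model.predict_proba(X_test)[:, 1])
```

The caveat in the paragraph above applies directly to a sketch like this: any bias baked into the historical labels is reproduced by the fitted model.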

There's competition in the people analytics space (see startups like ChartHop and Knoetic). But One Model claims to have built up a sizeable customer base, counting brands including Colgate-Palmolive, Squarespace, Robinhood and Airtable among its list of clients.

"We believe One AI takes a vastly different approach than what is available on the market today," Butler said. "And we strongly believe in only the most ethical applications of advanced data science, because it's possible to make rational, impactful people decisions fairly and equitably ... Our mission is clear: to ensure every workforce decision taken by an enterprise is the best one possible, informed by all relevant enterprise data, and executed in the most transparent and ethical manner possible."

One Model's newest tranche brings the Austin, Texas-based startup's total raised to $44.8 million.


Why Data Science Teams Should Be Using Pair Programming – The New Stack

Data science is a practice that requires technical expertise in machine learning and code development. However, it also demands creativity (for instance, connecting dense numbers and data to real user needs) and lean thinking (like prioritizing the experiments and questions to explore next). In light of these needs, and to continuously innovate and create meaningful outcomes, it's essential to adopt processes and techniques that facilitate high levels of energy, drive and communication in data science development.

Pair programming can increase communication, creativity and productivity in data science teams. Pair programming is a collaborative way of working in which two people take turns coding and navigating on the same problem, at the same time, on the same computer connected with two mirrored screens, two mice and two keyboards.

At VMware Tanzu Labs, our data scientists practice pair programming with each other and with our client-side counterparts. Pair programming is more widespread in software engineering than in data science. We see this as a missed opportunity. Lets explore the nuanced benefits of pair programming in the context of data science, delving into three aspects of the data science life cycle and how pair programming can help with each one.

When data scientists pick up a story for development, exploratory data analysis (EDA) is often the first step in which we start writing code. Arguably, among all components of the development cycle that require coding, EDA demands the most creativity from data scientists: The aim is to discover patterns in the data and build hypotheses around how we might be able to use this information to deliver value for the story at hand.

If new data sources need to be explored to deliver the story, we get familiar with them by asking questions about the data and validating what information they are able to provide to us. As part of this process, we scan sample records and iteratively design summary statistics and visualizations for reexamination.

Pairing in this context enables us to immediately discuss and spark a continuous stream of second opinions and tweaks on the statistics and visualizations displayed on the screen; we each build on the energy of our partner. Practicing this level of energetic collaboration in data science goes a long way toward building the creative confidence needed to generate a wider range of hypotheses, and it adds more scrutiny to synthesis when distinguishing between coincidence and correlation.

Based on what we learn about the data from EDA, we next try to summarize a pattern we've observed, which is useful in delivering value for the story at hand. In other words, we build or train a model that concisely and sufficiently represents a useful and valuable pattern observed in the data.

Arguably, this part of the development cycle demands the most science from data scientists as we continuously design, analyze and redesign a series of scientific experiments. We iterate on a cycle of training and validating model prototypes and make a selection as to which one to publish or deploy for consumption.

Pairing is essential to facilitating lean and productive experimentation in model training and validation. With so many options of model forms and algorithms available, balancing simplicity and sufficiency is necessary to shorten development cycles, increase feedback loops and mitigate overall risk in the product team.

As a data scientist, I sometimes need to resist the urge to use a sophisticated, stuffy algorithm when a simpler model fits the bill. I have biases based on prior experience that influence the algorithms explored in model training.

Having my paired data scientist as my "data conscience" in model training helps me put on the brakes when I'm running a superfluous number of experiments, constructively challenges the choices made in algorithm selection and course-corrects me when I lose focus from training prototypes strictly in support of the current story.

In addition to aspects of pair programming that influence productivity in specific components of the development cycle such as EDA and model training/validation, there are also perhaps more mundane benefits of pairing for data science that affect productivity and reproducibility more generally.

Take the example of pipelining. Much of the code written for data science is sequential by nature. The metrics we discover and design in EDA are derived from raw data that requires sequential coding to clean and process. These same metrics are then used as key pieces of information (a.k.a. features) when we build experiments for model training. In other words, the code written to design these metrics is a dependency for the code written for model training. Within model training itself, we often try different versions of a previously trained model (which we have previously written code to build) by exploring different variations of input parameter values to improve accuracy. The components and dependencies described above can be represented as steps and segments in a logical, sequential pipeline of code.
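As a minimal sketch of that idea (not taken from the article; the columns and data are invented), the EDA-derived feature preparation and the model training can be expressed as explicit, dependent segments of a scikit-learn pipeline:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["tenure_months", "monthly_spend"]
categorical = ["plan_type"]

# Segment 1: cleaning and feature preparation derived from EDA.
prepare = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Segment 2: model training, which depends on the prepared features.
pipeline = Pipeline([("prepare", prepare),
                     ("model", LogisticRegression(max_iter=1000))])

# Invented toy data standing in for the raw inputs.
df = pd.DataFrame({
    "tenure_months": [3, 24, 12, None, 36, 6],
    "monthly_spend": [20.0, 55.0, None, 30.0, 80.0, 25.0],
    "plan_type": ["basic", "pro", "basic", "pro", "pro", "basic"],
    "churned": [1, 0, 0, 1, 0, 1],
})
pipeline.fit(df.drop(columns="churned"), df["churned"])
print(pipeline.predict(df.drop(columns="churned")))
```

Writing the dependencies down this explicitly is exactly the kind of structure a pairing partner tends to push for, as described next.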

Pairing in the context of pipelining brings benefits in shared accountability driven by a sense of shared ownership of the codebase. While all data scientists know and understand the benefits of segmenting and modularizing code, when coding without a pair, it is easy to slip into a habit of creating overly lengthy code blocks, losing count on similar code being copied-pasted-modified and discounting groups of code dependencies that are only obvious to the person coding. These habits create cobwebs in the codebase and increase risks in reproducibility.

Enter your paired data scientist, who can raise a hand when it becomes challenging to follow the code, highlight groups of code to break up into pipeline segments and suggest blocks of repeated similar code to bundle into reusable functions. Note that this works bidirectionally: when practicing pairing, the data scientist who is typing is fully aware of the shared nature of code ownership and is proactively driven to make efforts to write reproducible code. Pairing is thus an enabler for creating and maintaining a reproducible data science codebase.

If pair programming is new to your data science practice, we hope this post encourages you to explore it with your team. At Tanzu Labs, we have introduced pair programming to many of our client-side data scientists and have observed that the cycles of continuous communication and feedback inherent in pair programming instill a way of working that sparks more creativity in data discovery, facilitates lean experimentation in model training and promotes better reproducibility of the codebase. And let's not forget that we do all of this to deliver outcomes that delight users and drive meaningful business value.

Here are some practical tips to get started with pair programming in data science:


NVIDIA, Global Workstation Manufacturers to Launch Powerful Systems for Generative AI and LLM Development … – NVIDIA Blog

Desktops Feature NVIDIA RTX 6000 Ada GPUs, NVIDIA Omniverse and NVIDIA AI Enterprise Software

SIGGRAPH -- NVIDIA and global manufacturers today announced powerful new NVIDIA RTX workstations designed for development and content creation in the age of generative AI and digitalization.

The systems, including those from BOXX, Dell Technologies, HP and Lenovo, are based on NVIDIA RTX 6000 Ada Generation GPUs and incorporate NVIDIA AI Enterprise and NVIDIA Omniverse Enterprise software.

Separately, NVIDIA also released three new desktop workstation Ada Generation GPUs -- the NVIDIA RTX 5000, RTX 4500 and RTX 4000 -- to deliver the latest AI, graphics and real-time rendering technology to professionals worldwide.

"Few workloads are as challenging as generative AI and digitalization applications, which require a full-stack approach to computing," said Bob Pette, vice president of professional visualization at NVIDIA. "Professionals can now tackle these on a desktop with the latest NVIDIA-powered RTX workstations, enabling them to build vast, digitalized worlds in the new age of generative AI."

The new RTX workstations offer up to four NVIDIA RTX 6000 Ada GPUs, each equipped with 48GB of memory, and a single desktop workstation can provide up to 5,828 TFLOPS of AI performance and 192GB of GPU memory. Depending on user needs, systems can be configured with NVIDIA AI Enterprise or Omniverse Enterprise to power a breadth of demanding generative AI and graphics-intensive workloads.

NVIDIA AI Enterprise 4.0, announced separately today, now includes NVIDIA NeMo, an end-to-end framework for building and customizing foundation models for generative AI, NVIDIA RAPIDS libraries for data science, as well as frameworks, pretrained models and tools for building common enterprise AI use cases, including recommenders, virtual assistants and cybersecurity solutions.

Omniverse Enterprise is a platform for industrial digitalization that enables teams to develop interoperable 3D workflows and OpenUSD applications. As an OpenUSD-native platform, Omniverse enables globally distributed teams to collaborate on full-design-fidelity datasets from hundreds of 3D applications.

"Yurts provides a full-stack generative AI solution aligning with multiple form factors, deployment models and budgets of our customers. We've achieved this by leveraging LLMs for various natural language processing tasks and incorporating the RTX 6000 Ada. From private data centers to workstation-sized solutions that fit under a desk, Yurts remains committed to scaling our platform and offering alongside NVIDIA," said Jason Schnitzer, chief technology officer at Yurts.

Workstation users can also take advantage of the new NVIDIA AI Workbench, available soon in early access, which provides developers with a unified, easy-to-use toolkit for creating, fine-tuning and running generative AI models with just a few clicks. Users of any skill level can quickly create, test and customize pretrained generative AI models on a PC or workstation and then scale them to virtually any data center, public cloud or NVIDIA DGX Cloud.

Next-Generation RTX Technology

The new NVIDIA RTX 5000, RTX 4500 and RTX 4000 desktop GPUs feature the latest NVIDIA Ada Lovelace architecture technologies, including:

"The NVIDIA RTX 5000 Ada GPU demonstrates NVIDIA's impressive generational performance improvements -- it has significantly increased our efficiency in creating stereo panoramas using Enscape," said Dan Stine, director of design technology at architectural firm Lake|Flato. "With the performance boost and large frame buffer of RTX 5000 GPUs, our large, complex models look great in virtual reality, which gives our clients a more comfortable and contextual experience."

Availability

RTX workstations featuring up to four RTX 6000 Ada GPUs, NVIDIA AI Enterprise and NVIDIA Omniverse Enterprise will be available from system builders starting in the fall.

The new NVIDIA RTX 5000 GPU is now available and shipping from HP and through global distribution partners such as Leadtek, PNY and Ryoyo Electro starting today. The NVIDIA RTX 4500 and RTX 4000 GPUs will be available in the fall from BOXX, Dell Technologies, HP and Lenovo and through global distribution partners.


ChatGPT accelerates chemistry discovery for climate response … – University of California, Berkeley

UC Berkeley experts taught ChatGPT how to quickly create datasets on difficult-to-aggregate research about certain materials that can be used to fight climate change, according to a new paper published in the Journal of the American Chemical Society.

These datasets on the synthesis of the highly porous materials known as metal-organic frameworks (MOFs) will inform predictive models. The models will accelerate chemists' ability to create or optimize MOFs, including ones that alleviate water scarcity and capture air pollution. All chemists, not just coders, can build these databases due to the use of AI-fueled chatbots.

"In a world where you have sparse data, now you can build large datasets," said Omar Yaghi, the Berkeley chemistry professor who invented MOFs and an author of the study. "There are hundreds of thousands of MOFs that have been reported, but nobody has been able to mine that information. Now we can mine it, tabulate it and build large datasets."

This breakthrough by experts at the College of Computing, Data Science, and Society's Bakar Institute of Digital Materials for the Planet (BIDMaP) will lead to efficient and cost-effective MOFs more quickly, an urgent need as the planet warms. It can also be applied to other areas of chemistry. It is one example of how AI can augment and democratize scientific research.

"We show that ChatGPT can be a very helpful assistant," said Zhiling Zheng, lead author of the study and a chemistry Ph.D. student at Berkeley. "Our ultimate goal is to make [research] much easier."

Other authors of the study, "ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis," include the Department of Chemistry's Oufan Zhang and the Department of Electrical Engineering and Computer Sciences' Christian Borgs and Jennifer Chayes. All are affiliated with BIDMaP, except Zhang.

Certain authors are also affiliated with the Kavli Energy Nanoscience Institute, the Department of Mathematics, the Department of Statistics, the School of Information and KACST-UC Berkeley Center of Excellence for Nanomaterials for Clean Energy Applications.

The team guided ChatGPT to quickly conduct a literature review. They curated 228 relevant papers. Then they enabled ChatGPT to process the relevant sections in those papers and to extract, clean and organize that data.

To help them teach ChatGPT to generate accurate and relevant information, they modified an approach called prompt engineering into ChemPrompt Engineering. They developed prompts that avoided asking ChatGPT for made up or misleading content; laid out detailed directions that explained to the chatbot the context and format for the response; and provided the large language model a template or instructions for extracting data.
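As a hypothetical illustration of that pattern (the wording, fields and example excerpt below are invented, not the authors' actual prompts), such an extraction prompt might combine explicit rules, a response format and a template slot for the paper text:

```python
# Invented prompt template illustrating the described ChemPrompt-style approach.
EXTRACTION_PROMPT = """You are reading an excerpt from a chemistry paper about
metal-organic framework (MOF) synthesis.

Rules:
- Use only information stated in the excerpt; if a value is absent, write "N/A".
- Do not guess or invent values.

Return one row per MOF in this exact format:
compound name | metal source | organic linker | solvent | temperature | reaction time

Excerpt:
{excerpt}
"""


def build_prompt(excerpt: str) -> str:
    """Fill the template with a paper excerpt before sending it to a chatbot."""
    return EXTRACTION_PROMPT.format(excerpt=excerpt)


print(build_prompt(
    "MOF-5 was prepared from zinc nitrate and H2BDC in DEF at 100 C for 24 h."
))
```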

The chatbot's literature review and the experts' approach were successful. ChatGPT finished in a fraction of an hour what would have taken a student years to complete, said Borgs, BIDMaP's director. It mined the synthetic conditions of MOFs with 95% accuracy, Yaghi said.

"AI has transformed many other sectors of our society," said Omar Yaghi, BIDMaP's co-director and chief scientist. "Why not transform science?"

"One big area of how you do AI for science is probing literature more effectively. This is really a substantial jump in doing natural language processing in chemistry," said Chayes, dean of the College of Computing, Data Science, and Society. "And to use it, you can just be a chemist, not a computer scientist."

This development will speed up MOF-related science work, including those efforts aimed at combating climate change, said Borgs. "With natural disasters becoming more severe and frequent, we need that saved time," he said.

Yaghi noted that using AI in this way is still new. Like any new tool, experts will need time to identify its shortcomings and address them. But it's worth investing the effort, he said.

"If we don't use it, then we can't make it better. If we can't make it better, then we will have missed a whole area that society is already using," Yaghi said. "AI has transformed many other sectors of our society -- commerce, banking, travel. Why not transform science?"


Not Blowing Smoke: Howard University Researchers Highlight Earth … – The Dig

Since 2021, Amy Y. Quarkume, PhD, has investigated the impacts of environmental data bias on eight Black, Brown, and Indigenous communities across the United States.

Quarkume is an Africana Studies professor and the graduate director of Howard University's inaugural Center for Applied Data Science and Analytics program.

Through in-depth interviews with community members, modeling, and mapping, her team of college, high school, and middle school researchers have already identified significant disparities in environmental data representation.

"What happens when your local news station, state Department of Environmental Quality or the federal Environmental Protection Agency can't disclose what is in the colored skyline and funny odor you smell in the morning -- not because they don't want to give you the information, but because they don't know how to give you data that is specific to your community and situation?" Quarkume said.

In 2020, the National Oceanic and Atmospheric Administration (NOAA) announced its strategy to dramatically expand the application of artificial intelligence (AI) in every NOAA mission area. However, as algorithms provided incredibly powerful solutions, they simultaneously disenfranchised marginalized populations.

Quarkume and her team's findings highlight various challenges: inadequate environmental data collection sites, uneven dissemination of environmental information, delays in installing data collection instruments, and a lack of inclusive community voices on environmental concerns. The team argues that implementing AI in this domain will dramatically expand historical and chronic problems.


The multi-year "What's Up with All the Bias" project, funded by the National Center for Atmospheric Research's Innovator Program, connects questions of climate change, race, AI, culture, and environmental justice in hopes of emphasizing the true lived realities of communities of color in data. Black and Hispanic communities are exposed to more air pollution than their white counterparts and are left to deal with the effects of environmental racism as "the new Jim Crow." The project's intersectional approach skillfully magnifies the negative effects of climate change.

Community organizer and lifelong D.C. resident Sebrena Rhodes thinks about air quality often.

The environmental activist has been outspoken about inequalities in air pollution, urban heat, and other environmental justice issues in the Ivy City and Brentwood neighborhoods for years, even prior to a nationwide increase in air quality app usage. In the wake of this summer's ongoing Canadian wildfires, Rhodes is especially vigilant.

"Because of the wildfires in Canada, our poor air quality was further exacerbated," said Rhodes. "Our air quality, per the Purple Air monitor placed at the Ivy City Clubhouse, went up to 403, which was one of Ivy City's worst AQ days ever!"

Purple Air monitors provide community stakeholders with hyperlocal air quality readings that can help them shape their day-to-day experiences.

"We check the air quality in the morning, during the lunch hour, and around 2 p.m. Purple Air gives us results of our air quality in real time, every 10 minutes. The data [updates] throughout the day," Rhodes said.

Studies consistently reveal that populations of color bear an unequal burden when it comes to exposure to air pollution. This inequality is evident in fence-line communities, where African Americans are 75% more likely than their white counterparts to reside near commercial facilities that generate noise, odor, traffic, or emissions.

Further, asthma rates are significantly higher among people of color compared to white communities, with Black Americans being nearly one and a half times more likely to suffer from asthma and three times more likely to die from the condition.

Quarkume, the principal investigator of "What's Up with All the Bias," says matters of air quality, heat, racism, policing, and housing often go hand-in-hand, as in her hometown of the Bronx.

"Whether it's Asthma Alley in the Bronx, or Cancer Alley in Louisiana, some communities have been dealing with these issues for decades," said Quarkume. "They still deserve to know what is in the air. They deserve quality data to make their own decisions and push for change."

Curtis Bay, a working-class Baltimore neighborhood, currently faces disproportionately high amounts of dangerous air pollution and has a long history of industrial accidents and toxic exposures. Activists, residents, and community members alike have fought for these local air quality issues to be addressed.

Accessing quality data has also presented an additional hurdle for frontline communities. In Miami's Little Haiti neighborhood, there are sizable differences between the reported temperature from the National Weather Service and local temperature readings.

These stories have motivated Quarkume and her team to deploy additional air quality monitors, heat sensors, and water quality monitors to communities during the next phase of their project. By supporting local organizations already invested in their communities, she hopes to support community-centered data, data openness, community-centered research, and data equity, principles of the CORE Futures Lab, which she also leads.

"Imagine a world where there is clean air for all. In order to make that happen, we would need to collect enough data on some of our most at-risk communities to begin to model such a reality. The data world has yet to substantially invest in such projects ... Progress is in our ability to translate and empower communities to own and imagine those data points and future for themselves," said Quarkume.

Jessica Moulite and Mikah Jones are PhD research assistants and members of the NCAR Early Career Faculty Innovator Program.


The Importance of Data Cleaning in Data Science – KDnuggets

In data science, the accuracy of predictive models is vitally important to ensure any costly errors are avoided and that each aspect is working to its optimal level. Once the data has been selected and formatted, the data needs to be cleaned, a crucial stage of the model development process.

In this article, we will provide an overview of the importance of data cleaning in data science, including what it is, the benefits, the data cleaning process, and the commonly used tools.

In data science, data cleaning is the process of identifying incorrect data and fixing the errors so the final dataset is ready to be used. Errors could include duplicate fields, incorrect formatting, incomplete fields, irrelevant or inaccurate data, and corrupted data.

In a data science project, the cleaning stage comes before validation in the data pipeline. In the pipeline, each stage ingests input and creates output, improving the data each step of the way. The benefit of the data pipeline is that each step has a specific purpose and is self-contained, meaning the data is thoroughly checked.

Data seldom arrives in a readily usable form; in fact, it can be confidently stated that data is never flawless. When collected from diverse sources and real-world environments, data is bound to contain numerous errors and adopt different formats. Hence, the significance of data cleaning arises -- to render the data error-free, pertinent, and easily assimilated by models.

When dealing with extensive datasets from multiple sources, errors can occur, including duplication or misclassification. These mistakes greatly affect algorithm accuracy. Notably, data cleaning and organization can consume up to 80% of a data scientist's time, highlighting its critical role in the data pipeline.

Below are three examples of how data cleaning can fix errors within datasets.

Data Formatting

Data formatting involves transforming data into a specific format or modifying the structure of a dataset. Ensuring consistency and a well-structured dataset is crucial to avoid errors during data analysis. Therefore, employing various techniques during the cleaning process is necessary to guarantee accurate data formatting. This may encompass converting categorical data to numerical values and consolidating multiple data sources into a unified dataset.
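As a quick, hypothetical illustration of both fixes (the column names and values are invented), pandas can consolidate two sources into one dataset and encode a categorical column numerically:

```python
import pandas as pd

# Two invented data sources with the same schema.
sales_web = pd.DataFrame({"customer": ["a", "b"], "channel": ["web", "web"], "amount": [120, 80]})
sales_store = pd.DataFrame({"customer": ["c", "d"], "channel": ["store", "store"], "amount": [200, 150]})

# Consolidate multiple data sources into a unified dataset.
sales = pd.concat([sales_web, sales_store], ignore_index=True)

# Convert the categorical column into numerical indicator columns.
sales = pd.get_dummies(sales, columns=["channel"])
print(sales)
```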

Empty/Missing Values

Data cleaning techniques play a crucial role in resolving data issues such as missing or empty values. These techniques involve estimating and filling in gaps in the dataset using relevant information.

For instance, consider the location field. If the field is empty, scientists can populate it with the average location data from the dataset or a similar one. Although not flawless, having the most probable location is preferable to having no location information at all. This approach ensures improved data quality and enhances the overall reliability of the dataset.
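A minimal pandas sketch of this kind of imputation (the data is invented) might fill a numeric gap with the column mean and a categorical gap with the most frequent value:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Austin", "Boston", None, "Denver"],
                   "distance_km": [12.0, None, 7.5, None]})

# Numeric gap: fill with the column mean.
df["distance_km"] = df["distance_km"].fillna(df["distance_km"].mean())

# Categorical gap: fill with the most frequent value as a stand-in.
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```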

Identifying Outliers

Within a dataset, certain data points may lack any substantive connection to others (e.g., in terms of value or behavior). Consequently, during data analysis, these outliers possess the ability to significantly distort results, leading to misguided predictions and flawed decision-making. However, by implementing various data cleaning techniques, it is possible to identify and eliminate these outliers, ultimately ensuring the integrity and relevance of the dataset.
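One common rule of thumb, shown below on invented data, flags anything beyond 1.5 times the interquartile range; whether flagged points are removed or kept should still be a case-by-case judgment:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print(s[mask])   # flagged outliers (95 in this toy example)
print(s[~mask])  # series with outliers removed
```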

Data cleaning provides a range of benefits that have a significant impact on the accuracy, relevance, usability, and analysis of data.

The data cleaning stage of the data pipeline is made up of eight common steps:

Large datasets that utilize multiple data sources are highly likely to have errors, including duplicates, particularly when new entries haven't undergone quality checks. Duplicate data is redundant and consumes unnecessary storage space, necessitating data cleansing to enhance efficiency. Common instances of duplicate data comprise repetitive email addresses and phone numbers.
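A short pandas sketch of this step, using invented contact records, keeps the first occurrence of each exact duplicate row:

```python
import pandas as pd

contacts = pd.DataFrame({
    "email": ["ann@example.com", "ann@example.com", "bob@example.com"],
    "phone": ["555-0100", "555-0100", "555-0199"],
})

# Drop exact duplicate rows, keeping the first occurrence.
contacts = contacts.drop_duplicates(keep="first").reset_index(drop=True)
print(contacts)
```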

To optimize a dataset, it is crucial to remove irrelevant data fields. This will result in faster model processing and enable a more focused approach toward achieving specific goals. During the data cleaning stage, any data that does not align with the scope of the project will be eliminated, retaining only the necessary information required to fulfill the task.

Standardizing text in datasets is crucial for ensuring consistency and facilitating easy analysis. Correcting capitalization is especially important, as it prevents the creation of false categories that could result in messy and confusing data.
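For example, a small pandas sketch (invented values) of normalizing case and whitespace so variant spellings collapse into a single category:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "nyc", " Nyc ", "Boston", "BOSTON"]})

# Strip stray whitespace and standardize capitalization.
df["city"] = df["city"].str.strip().str.upper()
print(df["city"].value_counts())
```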

When working with CSV data, analysts often manipulate it in Python with Pandas, the go-to data analysis library. However, there are instances where Pandas falls short in processing data types effectively. To guarantee accurate data conversion, analysts employ cleaning techniques. This ensures that the correct data is easily identifiable when applied to real-life projects.
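A small illustrative sketch (invented values) of explicit type coercion with pandas, where a price column arrived as strings and a date column as text:

```python
import pandas as pd

df = pd.DataFrame({"price": ["19.99", "n/a", "25.50"],
                   "order_date": ["2023-07-01", "2023-07-02", "2023-07-03"]})

# Coerce unparseable prices to NaN instead of raising, then fix the date type.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"])
print(df.dtypes)
```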

An outlier is a data point that lacks relevance to other points, deviating significantly from the overall context of the dataset. While outliers can occasionally offer intriguing insights, they are typically regarded as errors that should be removed.

Ensuring the effectiveness of a model is crucial, and rectifying errors before the data analysis stage is paramount. Such errors often result from manual data entry without adequate checking procedures. Examples include phone numbers with incorrect digits, email addresses without an "@" symbol, or unpunctuated user feedback.
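As a rough illustration (the patterns and sample records are invented and deliberately simplistic), such checks can be expressed as simple pattern matches that route failing rows to review:

```python
import pandas as pd

df = pd.DataFrame({"email": ["ann@example.com", "bob.example.com", "cat@example.org"],
                   "phone": ["555-0100", "555-01", "555-0199"]})

# Very coarse sanity checks: an email needs an "@" and a domain,
# and a phone number must match the expected digit pattern.
email_ok = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)
phone_ok = df["phone"].str.match(r"^\d{3}-\d{4}$")

valid = df[email_ok & phone_ok]
needs_review = df[~(email_ok & phone_ok)]
print(needs_review)
```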

Datasets can be gathered from various sources written in different languages. However, when using such data for machine translation, evaluation tools typically rely on monolingual Natural Language Processing (NLP) models, which can only handle one language at a time. Thankfully, during the data cleaning phase, AI tools can come to the rescue by converting all the data into a unified language. This ensures greater coherence and compatibility throughout the translation process.

One of the last steps in data cleaning involves addressing missing values. This can be achieved by either removing records that have missing values or employing statistical techniques to fill in the gaps. A comprehensive understanding of the dataset is crucial in making these decisions.

The importance of data cleaning in data science cannot be overstated, as it can significantly impact the accuracy and overall success of a data model. Without thorough data cleaning, the data analysis stage is likely to output flawed results and incorrect predictions.

Common errors that need to be rectified during the data cleaning stage are duplicate data, missing values, irrelevant data, outliers, and converting multiple data types or languages into a single form.

Nahla Davies is a software developer and tech writer. Before devoting her work full time to technical writing, she managed (among other intriguing things) to serve as a lead programmer at an Inc. 5,000 experiential branding organization whose clients include Samsung, Time Warner, Netflix, and Sony.


Research Associate (Data Science – HRPC) job with NATIONAL … – Times Higher Education

Main Duties and Responsibilities

The main areas of responsibility of the role will include, but are not limited to, the following:

Job Description

The National University of Singapore (NUS) invites applications for the role of Research Associate with the Heat Resilience & Performance Centre (HRPC), Yong Loo Lin School of Medicine. The HRPC is a first-of-its-kind research centre, established at NUS to spearhead and conduct research and development that better prepares us for the future challenges arising from rising ambient heat. Appointments will be made on a two-year contract basis, with the possibility of extension.

We are looking for a Research Associate who is excited to be involved in research aimed at developing strategies and solutions that leverage technology and data science to allow individuals to continue to live, work and thrive in spite of a warming world. This role will see you working as part of a multi-disciplinary team to develop heat health prediction models to be deployed with wearable systems.

Qualifications

A minimum qualification of a Master's degree in a quantitative discipline (e.g., statistics, mathematics, data science, computer science, computational biology, engineering). Proficiency in statistical software and programming languages, and familiarity with relevant libraries.

In addition, the role will require the following job relevant experience or attributes:

Additional Skills

Application

The role is based in Singapore.

Prospective candidates can contact Ms Lydia Law at lydialaw@nus.edu.sg.

Remuneration will be commensurate with the candidate's qualifications and experience.

Only shortlisted candidates will be notified.

More Information

Location: Kent Ridge Campus
Organization: Yong Loo Lin School of Medicine
Department: Dean's Office (Medicine)
Employee Referral Eligible: No
Job requisition ID: 20100
