Category Archives: Data Science
Where Does AI Happen? – KDnuggets
Partnership Post
By Connor Lee, Incoming NYU Computer Science Student
With the leap in AI progress sending shockwaves through mainstream media since November 2022, many speculate that their jobs will be taken over by their AI counterparts. One profession, however, cannot possibly be replaced: the researchers advancing deep neural networks and other machine learning models, the humans behind the AI. Although research is traditionally done within university walls, AI is by no means a traditional research field. A sizable portion of AI research is done in industrial labs. But which sector should aspiring researchers flock toward? Academia or industry?
"Academia is more inclined to basic, fundamental research, while the industry is inclined to user-oriented research driven by the large data access," says Nitesh Chawla, a Professor of Computer Science and Engineering at the University of Notre Dame. Prof. Chawla points to the pursuit of knowledge as a separating factor between industrial and academic AI research. Within industry, research is tied to a product, advancing toward a better society, while within academia, the pursuit of pure discovery drives research breakthroughs. The seemingly endless academic freedom does not come without its drawbacks: academia has neither the data nor the computing access available to industry, according to Prof. Chawla.
For aspiring young researchers, the choice seems simple: the private sector has everything they could want. Vast, autonomous, commercial organizations striving toward innovation, supported by readily available data, computing power, and funding. This has led to a perception that industry is stealing talent away from academia. Academics, naturally, complain. A study published in 2021 by a team from Aalborg University pointed out that the increasing participation of the private sector in AI research has been accompanied by a growing flow of researchers from academia into industry, and especially into technology companies such as Google, Microsoft, and Facebook.
As expected, industrial researchers disagree. "When I hire for my team, I want top talent, and as such I'm not poaching academic talent, but rather I am trying to help them get industry awards, funding from industry, and have their students as interns," explains Dr. Luna Dong, a Principal Scientist at Meta who is the head scientist working on Meta's smart glasses. She sees a glaring difference between industry and academia, which could be credited to the fundamental way research is conducted. According to Dr. Dong, AI research within industry is conducted by knowing what the end product should look like and reverse engineering a path toward it. In contrast, academics, having a promising idea, continuously construct various paths, not knowing where those paths will lead.
Yet, despite these contrasts, Dr. Dong believes the industry helps academia and vice versa: "lots of industry breakthroughs are inspired by applying the research from academia on real use-cases." Likewise, Computer Science Professor Ankur Teredesai from the University of Washington, Tacoma, describes the relationship between industry and academia as supporting each other; "symbiotic is the word that comes to mind." As he views it, research practices have evolved, with academics shifting their agenda to aid industry products; a good example of that shift is the joint positions that some prominent professors hold within major corporations.
Regardless of their affiliations, the data science community converges a few times a year at conferences. Prof. Chawla describes them as "a wonderful melting pot." Some conferences are traditionally more academic, some purely industrial, but some are a perfect blend of both. Prof. Chawla points to KDD, the annual conference of the Special Interest Group on Knowledge Discovery and Data Mining, as one known for such a connection. KDD maintains two parallel peer-reviewed tracks: the research track and the applied data science (ADS) track. As put by Dr. Dong, who was the ADS Program Co-Chair at KDD-2022, KDD helps by "providing a forum for researchers and practitioners to come together to listen to the talks and discuss the techniques while inspiring each other. KDD is a place where we break the barriers of communication and collaboration, where we demonstrate how data science and machine learning advances with industry consumption."
This is the mindset that has driven KDD from its early days. "One of the things we wanted to do from the very beginning was to create a conference where applications were well represented," recalls Prof. Usama Fayyad, Executive Director of the Institute for Experiential AI at Northeastern University and a former Chief Data Officer of Yahoo, who together with Dr. Gregory Piatetsky-Shapiro co-founded the KDD conference in 1995. Prof. Fayyad believes that if AI conferences focused only on academics, it would be a big miss, given the collective desire to prove research on real problems and the motivation to drive new research based on emerging data sets.
However, opening up KDD to the industry also had its challenges. With the research track rightfully dominated by academia-originated work, the ADS track should have been primarily dedicated to applied studies coming from industrial research labs. In reality, more than half of ADS publications originate within academia or result from strong academic-industrial collaboration. A decade ago, Prof. Fayyad realized that many interesting AI applications were developed by teams that were simply too busy to write papers. He led KDD into its current phase, in which the organizers seek out and curate distinguished invited talks given by top industrial practitioners. The ADS invited talks have quickly become the highlight of the conference.
The KDD Cup competition, held annually in conjunction with the KDD conference, is yet another way to connect the academic and industrial worlds. "KDD Cup is a way to attract both industry and academia participants, where companies bring some of the challenges that they are comfortable sharing, while academics get to work on data they would never have access to," describes Prof. Teredesai, who is also the CEO of the health tech company CueZen. Each year, a novel task is introduced and a new dataset is released. Hundreds of teams sprint toward the most effective solution, competing for prizes and fame. Prof. Fayyad agrees: "It's been a very healthy thing for the field because we see participation from academia, students diving in, or even companies teaming together."
Circling back to the choice between industry and academia: it will soon become irrelevant. With academic courses taught by practitioners, professors leading industrial labs, global cloud computing resources becoming dominant, and more data becoming available, the academic-industrial boundaries are quickly blurring in the AI domain. There is no need to stick to either sector; just choose the project you are most excited about!
Connor Lee is a 2023 graduate from Saratoga High School in the Bay Area. He will be joining the Computer Science program at NYU in the fall. By all means, Connor will be one of the youngest KDD attendees ever!
Synthetic Data Platforms: Unlocking the Power of Generative AI for … – KDnuggets
Creating a machine learning or deep learning model has become remarkably easy. Nowadays, there are different tools and platforms available that not only automate the entire process of creating a model but even help you select the best model for a particular data set.
One of the essential things you need to solve a problem with a model is a dataset containing the attributes that describe the problem. Suppose, for example, we are looking at a dataset describing the diabetes history of patients. There will be specific columns, such as age, gender, and glucose level, that are the significant attributes and play an essential role in predicting whether a person has diabetes or not. To build a diabetes prediction model, we can find multiple publicly available datasets. However, we may face difficulty in solving problems where data is not readily available or is highly imbalanced.
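To make that workflow concrete, here is a minimal sketch of training such a model with scikit-learn. The file name diabetes.csv and the column names (age, glucose, outcome) are hypothetical placeholders for whatever dataset you actually use, not a reference to any specific public dataset.

```python
# Minimal sketch: fitting a diabetes prediction model on a tabular dataset.
# Assumes a hypothetical "diabetes.csv" with numeric feature columns and a
# binary "outcome" label; adjust the file and column names to your own data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("diabetes.csv")
X = df.drop(columns=["outcome"])   # predictor attributes: age, glucose, etc.
y = df["outcome"]                  # 1 = diabetic, 0 = not diabetic

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```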
Synthetic data generated by deep learning algorithms is often used in place of original data when data access is limited by privacy compliance or when the original data needs to be augmented to fit specific purposes. Synthetic data mimics the real data by recreating its statistical properties. Once trained on real data, a synthetic data generator can create any amount of data that closely resembles the patterns, distributions, and dependencies of the real data. This not only helps generate similar data but also makes it possible to introduce certain constraints, such as new distributions. Let's explore some use cases where synthetic data can play an important role.
Generative AI models are crucial in synthetic data production since they are explicitly trained on the original dataset and can replicate its traits and statistical attributes. Generative AI models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) learn the underlying data distribution and produce realistic, representative synthetic instances.
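A full GAN or VAE is beyond the scope of a short example, but the sketch below illustrates the same underlying idea on numeric tables: learn the marginal distributions and correlation structure of the real data (a Gaussian-copula-style approach) and then sample new rows from that learned model. It is an illustrative approximation, not how any particular commercial generator works.

```python
# Illustrative sketch of synthetic tabular data generation: capture the
# statistical properties (marginals + correlations) of real numeric data,
# then sample new rows that follow them. Real GAN/VAE-based generators are
# far more sophisticated; this only demonstrates the principle.
import numpy as np
import pandas as pd
from scipy import stats

def fit_and_sample(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    # Map each numeric column to normal scores through its empirical CDF.
    u = real.rank(method="average") / (len(real) + 1)
    z = stats.norm.ppf(u.to_numpy())
    corr = np.corrcoef(z, rowvar=False)              # learned dependency structure
    # Sample correlated normals, then map back through the empirical quantiles.
    draws = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n_rows)
    probs = stats.norm.cdf(draws)
    return pd.DataFrame(
        {col: np.quantile(real[col], probs[:, i]) for i, col in enumerate(real.columns)}
    )

# Usage with a hypothetical numeric DataFrame `real_df`:
# synthetic_df = fit_and_sample(real_df, n_rows=10_000)
```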
There are numerous open-source and closed-source synthetic data generators out there, some better than others. When evaluating the performance of synthetic data generators, it's important to look at two aspects: accuracy and privacy. Accuracy needs to be high without the synthetic data overfitting the original data, and the extreme values present in the original data need to be handled in a way that doesn't endanger the privacy of data subjects. Some synthetic data generators offer automated privacy and accuracy checks; it's a good idea to start with these first. MOSTLY AI's synthetic data generator offers this service for free; anyone can set up an account with just an email address.
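As a rough, vendor-neutral illustration of those two axes, the sketch below computes a per-column Kolmogorov-Smirnov statistic as an accuracy proxy and a distance-to-closest-real-record check as a privacy proxy. These are generic diagnostics under simplifying assumptions (numeric columns only), not the automated reports of any specific product.

```python
# Generic accuracy and privacy proxies for numeric synthetic data.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def accuracy_report(real: pd.DataFrame, synth: pd.DataFrame) -> pd.Series:
    """Per-column KS statistic: 0 means identical marginals, 1 means disjoint."""
    return pd.Series({c: ks_2samp(real[c], synth[c]).statistic for c in real.columns})

def privacy_report(real: pd.DataFrame, synth: pd.DataFrame) -> dict:
    """Distance from each synthetic row to its closest real record; synthetic
    rows sitting (almost) exactly on real rows hint at memorisation."""
    scaler = StandardScaler().fit(real)
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))
    distances, _ = nn.kneighbors(scaler.transform(synth))
    return {"min_distance": float(distances.min()),
            "median_distance": float(np.median(distances))}

# Usage with hypothetical frames: accuracy_report(real_df, synthetic_df)
```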
Synthetic data is not personal data by definition. As such, it is exempt from GDPR and similar privacy laws, allowing data scientists to freely explore the synthetic versions of datasets. Synthetic data is also one of the best tools to anonymize behavioral data without destroying patterns and correlations. These two qualities make it especially useful in all situations when personal data is used - from simple analytics to training sophisticated machine learning models.
However, privacy is not the only use case. Synthetic data generation can also be used in the following use cases:
In order to generate synthetic data we may use different tools that are available in the market. Let's explore some of these tools and understand how they work.
For a comprehensive list of synthetic data tools and companies, here is a curated list with synthetic data types.
Now that we have discussed the pros and cons of using the tools and libraries described above for synthetic data generation, let's look at how we can use MOSTLY AI, which is one of the best tools available in the market and easy to use.
MOSTLY AI is a synthetic data creation platform that assists enterprises in producing high-quality, privacy-protected synthetic data for a number of use cases such as machine learning, advanced analytics, software testing, and data sharing. It generates synthetic data using a proprietary AI-powered algorithm that learns the statistical aspects of the original data, such as correlations, distributions, and properties. This enables MOSTLY AI to produce synthetic data that is statistically representative of the actual data while simultaneously safeguarding data subjects' privacy.
Its synthetic data is not only private but also simple to produce in minutes. The platform has an easy-to-use interface powered by generative AI that enables organizations to input existing data, choose the appropriate output format, and produce synthetic data in a matter of seconds. This makes it a beneficial tool for organizations that need to preserve the privacy of their data while still using it for a range of objectives, as it quickly creates high-quality, statistically representative synthetic data.
Synthetic data from MOSTLY AI is offered in a number of formats, including CSV, JSON, and XML. It can be utilized with several software programs, including SAS, R, and Python. Additionally, MOSTLY AI provides a number of tools and services, such as a data generator, a data explorer, and a data sharing platform, to assist organizations in using synthetic data.
Let's explore how to use the MOSTLY AI platform. We can start by visiting the link below and creating an account.
MOSTLY AI: The Synthetic Data Generation and Knowledge Hub - MOSTLY AI
Once we have created the account we can see the home page where we can choose from different options related to data generation.
As you can see in the image above, on the home page we can upload the original dataset for which we want to generate synthetic data, or, just to try it out, we can use the sample data. We can upload data as per our requirement.
As you can see in the image above, once we upload the data we can make changes in terms of what columns we need to generate and also set different settings related to data, training and output.
Once we set all these properties as per our requirement, we click the launch job button and the data is generated in real time. On MOSTLY AI, we can generate 100K rows of data every day for free.
This is how you can use MOSTLY AI to generate synthetic data in real time by setting the data properties as required. There can be multiple use cases depending on the problem you are trying to solve. Go ahead and try this with your own datasets and let us know in the response section how useful you find the platform.
Himanshu Sharma is a Post Graduate in Applied Data Science from the Institute of Product Leadership. He is a self-motivated professional with experience in Python programming and data analysis, looking to make his mark in data science and product management, and an active blogger with expertise in technical content writing in data science, awarded Top Writer in AI by Medium.
IIT Guwahati rolls out an Online Bachelor of Science (Honours) Degree in Data Science and Artificial Intel – Economic Times
Indian Institute of Technology (IIT) Guwahati is launching an online Bachelor of Science (Hons) degree programme in Data Science and Artificial Intelligence on Coursera, an online learning platform. Anyone after Class XII or its equivalent, with mathematics as a compulsory subject, can apply. Those eligible and registered for JEE Advanced (in any year) will get direct admission, while those without can complete an online course and gain entry based on their performance, according to a release issued by Coursera. "This programme teaches students the digital skills they need to thrive in the modern workforce. They graduate knowing how to implement the latest AI and data science techniques in any field, setting them up for success in their careers," said Parameswar K. Iyer, officiating director, IIT Guwahati. Students will receive job placement support from IIT Guwahati and access to Coursera's skill-based recruitment platform, Coursera Hiring Solutions, according to the release.
Upcoming online Master’s of Applied Data Science program … – The Daily Tar Heel
Classes for UNC's new Master's of Applied Data Science program, which aims to provide graduate students and working professionals with the ability to advance their knowledge in data science, will begin in January 2024.
To launch this new, primarily online degree, UNC partnered with The Graduate School, the UNC Office of Digital and Lifelong Learning and 2U, an online education company.
"To me, the most important aspect of this is that it allows people who are currently in the workforce to have the options of either adding another degree to their resume or to just get some classes," Stanley Ahalt, dean of the UNC School of Data Science and Society, said.
Ahalt said that there are many people who are looking to improve their resume and can benefit financially from developing their data science skills.
Students in the MADS program will have a choice in how they complete this degree. The School of Data Science and Society will offer both live and asynchronous classes.
Arcot Rajasekar will be teaching an introductory course for advanced data science for the new program. Rajasekar is a current UNC professor for the School of Information and Library Science and one of the chief scientists at the Renaissance Computing Institute.
"This is for people who are already in the industry and who would like to find new types of tools and techniques and methodologies which are useful for them to deal with their data problems," Rajasekar said.
Rajasekar said that students will gain the proper skills for approaching current and future technology through this program.
"Data science is becoming really important, in a sense, as they call it, 'Data is the new currency,'" he said. "And if you want to deal with data and do large data, what is called big data, you need to have the proper tools to do that."
Kristen Young, the director of communications at the School of Data Science and Society, said that all participants in the MADS program will have the opportunity to participate in an immersion experience that includes staying on campus for two or three days, as well as meeting peers and professors.
"This is an experience that we'll provide as an option for online students to have some time on campus and working together in person," Young said.
Ahalt said that there is a high market demand for those with a degree in data science, and that the program would be doing a service for North Carolinians.
According to the U.S. Bureau of Labor Statistics, job growth for data scientists is projected to grow 36 percent between 2021 and 2031. The average employment growth over this time period is five percent.
Students of the program will get to apply their findings to the real world through the MADS capstone projects. For example, Ahalt said the program is considering working with companies in the Triangle to provide students with real-world experiences.
The MADS program not only helps students develop their data science skills, but also equips students with ethical understanding, Rajasekar said. He said that students will also learn how data science can provide avenues for doing good in society.
Those applying to this master's program do not need to have a data science degree but, Ahalt said, the MADS program will require fundamental mathematics, an understanding of programming and a basic working knowledge of some data science modeling.
"We're pretty flexible," Ahalt said. "We're going to require some basic skill set coming into the program, but we're trying to make this very accessible."
Applications for the program have been available online since Wednesday, June 21. The deadline for submissions is Tuesday, Nov. 14.
In-house automation, analytics tools speed audit processing – GCN.com
Home to one of the country's hottest housing markets, Travis County, Texas (particularly the city of Austin) has seen the volume of property tax refunds increase by 25% annually since 2018. To keep up and meet requirements to audit the refunds for accuracy throughout the year, the Risk Evaluation and Consulting Division of the county's Auditor's Office relies on automation and analytics tools built in-house to perform continuous auditing. REC has reduced the time it takes to process audits of property tax refunds by 91%.
"It used to take weeks to analyze the large volumes of property tax refunds, but the model can do it in less than five minutes," said John Montalbo, data scientist for the county. "It can detect anomalies, double check for accuracy and write findings to audit standards with incredible efficiency," he added.
"We've gone from 1,000-plus auditor hours per year to [being] at a pace right now for under 40, and we continue to trim that down," REC Manager David Jungerman said. "We've made a lot of progress [in] being able to dedicate folks to more interesting, less mundane work."
Last month, the National Association of Counties, or NACo, recognized REC's work with an Achievement Award for Financial Management.
Even as Travis County's operating environment and services grew increasingly sophisticated, additional funding for audit compliance was unavailable, according to NACo. Developing innovative, automated auditing techniques allowed auditors to improve their effectiveness and increase their coverage.
The move from a time-consuming, paper-based process has been several years in the making. In 2018, REC began using a dashboard for remote auditing, but the COVID-19 pandemic really showed the office what was possible.
"It pushed forward how much more data is being collected during that whole refund process," said John Gomez, senior data scientist at the county. "It allowed us to use data to verify when the check was scanned into the system or when the refund application was received and scanned in."
It also enabled auditors to see the metadata so they could determine who looked at and verified an application. "There's a timestamp that gets tied to it, recorded and stored," he said.
Since then, the data science team has integrated algorithms into the review process to automate it. Now, human auditors are needed only to review audits that the system calls out as anomalous.
Before the algorithm could be deployed, the data scientists built an extract, transform and load process to collect and organize the data needed for all property tax refunds. Then the countys senior auditor walked them through all the steps she takes and what she looks for in processing the refunds.
"We have our algorithms sitting on a virtual machine that will run itself," Montalbo said. "Every time that it needs to run, it goes and it gets all the information, does all the tests that it needs to do, notes exceptions when it finds them, and then starts compiling work documents."
Those documents are put into an email that goes to auditors who spot-check what failed.
"It's basically a multi-tab Excel spreadsheet that they get," Jungerman said. "We keep one senior [analyst] dedicated to the audit and rotate staff, and basically, they just work the tabs of the spreadsheet if there's any exceptions on there."
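The county's actual data model and rules are not spelled out in the article, so the sketch below is only a hypothetical illustration of the pattern it describes: run automated checks over a refund extract, collect the exceptions each check finds, and write them to separate tabs of an Excel workbook for auditors to spot-check. Every file name, column name, and threshold here is an assumption.

```python
# Hypothetical sketch of continuous-audit exception reporting: apply a set of
# checks to a property tax refund extract and write each check's exceptions
# to its own tab of a multi-tab Excel workbook.
import pandas as pd

refunds = pd.read_csv("refund_extract.csv",
                      parse_dates=["application_date", "check_date"])

checks = {
    "duplicate_refunds": refunds[refunds.duplicated(["parcel_id", "refund_amount"], keep=False)],
    "amount_mismatch": refunds[(refunds["refund_amount"] - refunds["calculated_amount"]).abs() > 0.01],
    "slow_processing": refunds[(refunds["check_date"] - refunds["application_date"]).dt.days > 90],
}

with pd.ExcelWriter("refund_audit_workpapers.xlsx") as writer:
    for name, exceptions in checks.items():
        exceptions.to_excel(writer, sheet_name=name, index=False)  # one tab per check
```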
Currently, REC is working with the data scientists to automate system-generated receipt testing to streamline audits. "We're in the process with 12 county offices right now, and portions of a 13th, of looking at all of the system-generated receipts and tracking them to the elected officials' bank account and then tracing them to the posting in the enterprise accounting software," Jungerman said. The automation would mean being able to turn around findings to offices within a couple of weeks.
It would also mean processing tens of thousands of receipts every week across all county offices. Currently, receipt testing typically samples only about 80 out of 20,000 receipts, he added.
Automation could be applied to any type of audit, Montalbo said, although the exact mechanisms won't translate seamlessly every time.
"We have 50-plus departments [in the county government] and most departments use a different application for their day-to-day business activities, which means different data is being stored for each transaction that is being receipted," Gomez said. "So, we have to mine the data for each department to extract the information we need to verify each receipt is recorded correctly and deposited in a timely manner."
Despite the efficiency of automation, Jungerman said that he doesn't foresee any processes running without some form of human interaction. "The vision is to automate all of our processes that we can and free standard auditors to just look at exceptions and to look at a whole lot of other areas," he said, adding that you need a human being to verify the potential findings.
Stephanie Kanowitz is a freelance writer based in northern Virginia.
Data Science: Debunking the Myth of Agile Compatibility – DataDrivenInvestor
A Critical Perspective on Agile Methodologies in Data Science
Data Science is gaining prominence as a mainstream practice in various industries, leading companies to integrate it into their operations. However, there is a genuine concern that Data Science may be mistakenly categorized as just another software practice, akin to traditional web application development approaches. Over the past years, the Agile hype has spread throughout the technology industry, extending beyond its roots in web development.
Recalling an anecdote, I was once told about Agile being introduced into a legal practice, much to the surprise of the attorneys involved. They found themselves adopting techniques that were completely disconnected from their legal practice, day-to-day work, and actual needs. The resulting negative feedback and disengagement were so overwhelming that they cannot be ignored or overstated. The impact was reported to be mentally distressing, almost as if it were an experiment conducted by Dr. Zimbardo[1] himself, the renowned social psychologist.
While coding does play a role in Data Science, it is not the primary activity of a Data Scientist. Unfortunately, this distinction is not widely recognized or understood by individuals outside the field. As organizations grapple with a misunderstanding of what Data Science truly entails, there is an increasing pressure to enforce alignment. In the dynamics of small groups, teams often expand, and IT views Data Science as a logical area for expansion, leading to a perfect storm of misalignment.
To illustrate this point, I once witnessed, with a sense of unease, an Agilist referring to a Data Scientist as a "developer" and a Notebook containing a model as an "application." Such remarks highlight the profound misunderstanding of Data Science within the IT industry. It seems that certain factions of the industry adopt a one-size-fits-all mentality, treating a data science study in the same manner as they would approach a web application development project. This approach feels archaic and reminiscent of a bygone era.
Before delving into a detailed analysis of why Agile does not align with Data Science, it is important to understand the origins of Agile and the motivations behind its creation.
The origins of Agile can be traced back to the proclamation of their Manifesto[2] by the Agile Alliance. In general terms, a manifesto is defined as a written statement publicly declaring the intentions, motives, or views of its issuer, as described by Merriam-Webster[3]. Traditionally, manifestos have been associated with movements and schools of thought in social, political, and artistic realms, representing or aspiring to bring about significant qualitative progress for humanity.
Historical examples of manifestos include The Declaration of the Rights of Man and of the Citizen (1789) by the National Assembly of France, The Abolitionist Manifesto (1829) by William Lloyd Garrison, and The Communist Manifesto (1848) by Karl Marx and Friedrich Engels. While individual opinions may differ on the content of these manifestos, they address matters of great importance to humanity. Manifestos are also commonly used to define artistic movements or schools, as exemplified by The Bauhaus Manifesto (1919) by Walter Gropius.
In light of these historical references, it is necessary to express some reservations about labeling a software development methodology, created by software practitioners, as a Manifesto. This usage could be seen as somewhat disrespectful to the likes of Walter Gropius, the National Assembly of France, William Lloyd Garrison, Karl Marx, and Friedrich Engels. It is essential to approach such grandiose associations with caution and scrutiny.
The Agile Manifesto was developed by a group of 17 software practitioners who founded the Agile Alliance. These individuals, including notable names like Kent Beck, Martin Fowler, and Ward Cunningham, are primarily recognized for their involvement in the creation of the manifesto. It is important to note that their expertise lies in coding and related activities, which has formed a consulting industry akin to other domains like coaching and training.
While this association is not inherently problematic, it is worth noting that these authors are not widely acclaimed for groundbreaking software advancements, with the exception of Ward Cunningham's involvement in the creation of the first wiki system. This observation highlights that the Agile Alliance lacks direct connections with industry leaders and innovators.
Recognizing their skill and competence is certainly commendable, and it is not fair or valid to diminish their contributions. However, it does raise questions about the significant impact asserted by the Agile Manifesto without substantial groundbreaking contributions to the field. It prompts us to ponder the underlying motivations and intentions behind the creation of the Agile Manifesto.
Considering who benefits from the Agile Manifesto could help shed light on why it was written and why an Alliance was established. The close association between the members and coaching/training/educational activities raises the question of whether the Agile practice is primarily driven by revenue generation.
While this is a common practice and not inherently wrong, illegal, or unethical, one can infer a conflict of interests that may prevent Agile from being solely focused on your best interests. The assessment of this potential conflict and the alignment of Agile with the broader Consulting industry will depend on your prior experiences with such value-added industries. However, it is widely known that these industries often face criticism due to the lack of accountability and difficulties in measuring performance improvements, especially when they move beyond mere rhetoric.
Recognizing that these reservations are subjective, I will now analyze each claim of the Agile Manifesto individually and evaluate its suitability for the field of Data Science.
These four values are advocated in the Agile Manifesto and form the core of Agile methodology. Let's analyze them in detail:
Data Science is generally considered a lightweight scientific activity. This is because many practitioners in the field primarily apply established methodologies to extract business benefits from data, rather than conducting groundbreaking scientific research. Therefore, the term "scientist" in "data scientist" can be seen as more of a vanity term that doesn't fully reflect the pragmatic nature of most practitioners.
However, it is important to note that the Data Science process aims to adhere to scientific methodology in terms of rigor, attention to detail, and employed procedures. It also involves significant mathematical aspects, which are inherently scientific. So, while the intensity of scientific method application may be lower compared to actual scientific research, the underlying aim is still present.
In the context of Data Science, it is difficult to prioritize individuals and interactions over processes and tools. In many cases, the focus is primarily on data, methodologies, and analysis rather than individual interactions. For example, when evaluating the value of a tumor marker, statistical rigor and manufacturing quality are typically more important than the level of individual interaction involved in its development.
Data Science is inherently complex, and in practice, it is often even more challenging to understand and verify compared to regular software development. Jupyter Notebooks have gained popularity because they provide a means of combining inline documentation, including mathematical explanations, with actual code. They resemble traditional scientific research notebooks where authors describe their analysis workflows.
In the context of Data Science, the principle of working software over comprehensive documentation does not align well. An undocumented notebook would be a nightmare scenario, as both the process and outcome of the analysis must be accurately described. In Data Science, comprehensive documentation is just as important as the software itself, if not more so.
In Data Science, the concept of customers is not typically present in the traditional sense. While there may be goals in certain projects, sometimes the work is purely exploratory without specific predefined objectives. Additionally, there are usually no formal contracts in Data Science, as it can be challenging to determine the potential outcomes or directions of a particular study or analysis.
However, it is crucial to clearly specify the specific analysis, outcomes, assumptions, data, and methodology involved in the activity itself. This documentation is typically included as part of the Data Science process, often within the notebook if notebooks are used for analysis. It's worth noting that some Agilists may argue that there are internal customers within a business or organization, as the data generated is intended to be valuable for the overall operation. However, this perspective does not align with the core principle of Data Science.
In summary, while customer collaboration may not be the central guiding principle in Data Science, the clarity and specification of analysis details are essential components of the practice.
In Data Science, traditional plans in the sense of predefined step-by-step procedures are not typically used. Instead, the process often involves formulating hypotheses and testing them, or predicting and classifying future events based on past observations. The plan itself becomes a hypothesis to be validated.
However, it's important to note that there is usually a script or plan outlining what needs to be done and how to do it. The exploratory nature of Data Science means that outcomes may change and redirect the course of analysis. While this principle of responding to change does not explicitly contradict Data Science, it is not entirely applicable to the field. This principle describes a contradiction that occurs in a different context, such as web application design, where lengthy requirement documents are commonly written and sometimes form part of contractual agreements.
In summary, while Data Science doesn't adhere to traditional plans, there is still a general script or plan in place that can be adjusted based on the evolving insights and outcomes of the analysis.
In addition to the values, the Agilists have principles:
Let's examine them one by one:
(1) Prioritize customer satisfaction by delivering valuable software frequently
In Data Science, the primary focus is on delivering actionable knowledge and insights from data, rather than software. Customer satisfaction is achieved through the quality and impact of the extracted knowledge, rather than the frequency of software deliverables. The value lies in the insights gained, not in the software itself.
(2) Welcome changing requirements, even if they occur late in the project
In the realm of Data Science, the requirements are often based on hypotheses to be tested or predictions to be made. While some flexibility may exist in refining the scope of a project, significant changes in requirements can have far-reaching implications. The iterative nature of Agile may not be as applicable to Data Science, where study cycles are often longer and altering the requirements late in the project can significantly disrupt the study methodology.
(3) Deliver working software frequently, with a preference for shorter timescales
Data Science is not focused on delivering software but rather on extracting meaningful insights from data. The notion of frequent software deliveries is not relevant or feasible in the context of Data Science. The emphasis lies more on the accuracy, validity, and impact of the knowledge extracted, rather than the frequency or timeliness of software releases.
(4) Collaborate with the customer and stakeholders throughout the project
While collaboration with customers and stakeholders is important in any project, it is worth noting that the nature of collaboration in Data Science differs significantly from that in software development. In the initial stages of a Data Science project, interactions with customers and stakeholders play a crucial role in understanding their requirements and objectives. However, once the project moves into the research and study phase, the focus shifts towards extensive data analysis, experimentation, and hypothesis validation, which often occur over longer periods with less frequent interaction.
In Data Science, the emphasis lies on delving deep into the data, applying statistical and mathematical techniques, and extracting valuable insights. This process requires time, careful analysis, and scientific rigor, which may not align with the iterative and rapid delivery approach commonly associated with software development. Therefore, while collaboration remains important, the dynamics of collaboration in Data Science projects differ significantly from those in software development, reflecting the unique nature of the field.
(5) Build projects around motivated individuals and give them the support they need
The idea of building projects around motivated individuals and providing necessary support seems like a self-evident concept applicable to any industry. In the context of Data Science, it is unlikely that professionals would deliberately choose unmotivated individuals or neglect to provide the support required to achieve project objectives.
(6) Measure progress through working software and adjust accordingly
In Data Science, progress is measured by the accuracy, reliability, and impact of the insights generated, rather than by working software. The focus is on refining and improving the analytical models and methodologies based on the data. Adjustments are made to enhance the quality and reliability of the insights, rather than solely based on the functionality of software.
(7) Maintain a sustainable pace of work
While maintaining a sustainable pace of work is important in any field, including Data Science, the nature of Data Science projects may involve extended periods of exploration, experimentation, and analysis. The pace of work may fluctuate depending on the complexity of the data, the methodologies employed, and the depth of insights sought. Striving for a sustainable pace must be balanced with the requirements of the specific project and the need for thorough analysis.
(8) Strive for technical excellence and good design
While technical competence is certainly important in Data Science, the goal is not to pursue technical excellence or intricate design for its own sake. Data Science is focused on utilizing appropriate mathematical tools and methodologies to extract meaningful insights from data. The emphasis lies on the accuracy, validity, and interpretability of the results, rather than striving for technical excellence in the traditional sense.
(9) Keep things simple and focus on what is necessary
The principle of keeping things simple and focusing on what is necessary applies universally to various fields and is not exclusive to Data Science. While simplicity and focus are important, the complexity of Data Science often necessitates specialized techniques and methodologies. The focus is more on deriving actionable knowledge from data rather than oversimplifying or neglecting important aspects of the analysis.
(10) Reflect on your work and continuously improve
The principle of reflection and continuous improvement is valuable in any professional endeavor, including Data Science. However, it is not unique to Data Science and is a widely accepted practice across industries. Professionals in any field are expected to reflect on their work, learn from their experiences, and strive for improvement. Therefore, this principle does not offer specific insights or considerations specific to Data Science.
Summary
In the context of Data Science, it becomes evident that Agile methodologies fall short and are largely irrelevant. The principles put forth by Agile proponents may be seen as nothing more than empty platitudes, failing to address the specific challenges and intricacies of the field. The notion of prioritizing frequent software delivery, embracing changing requirements, and collaborating with stakeholders throughout the project are not only obvious but also fail to recognize the distinct nature of Data Science. Agile's focus on technical excellence and good design disregards the fact that Data Science is more about using the right mathematical tools rather than achieving technical perfection. In truth, Agile's attempt to infiltrate the realm of Data Science can only be described as complete and utter nonsense.
The practical implementation of Agile, particularly in conjunction with the Scrum methodology, often falls short of its intended goals when applied to Data Science. The periodic meetings, known as stand-ups, where team members provide updates, lead to poor engagement and disruption in workflow. The presence of a non-technical or inadequately skilled Scrum master or project manager further compounds the issues, as they normally lack industry-specific knowledge and reduce complex workflows into simplistic task lists. This lack of understanding and accountability creates frustration and hinders the team's progress.
Additionally, the concept of user stories and the emphasis on user-centric requirements do not align well with Data Science, where the focus is more on data, hypotheses, and analysis rather than traditional user-driven needs.
Furthermore, when Agile consulting services are brought in, the emphasis often shifts to methodology and best practices, disconnecting them from the actual business needs and resulting in repetitive and irrelevant discussions. This disconnect and lack of understanding have detrimental effects on team morale and project outcomes, leading to project failures, low quality, massive hidden costs and other negative consequences.
There is no one-size-fits-all solution for effective project management in Data Science, but based on my experience and observations, the following approaches seem to yield better results:
By embracing these principles, teams can foster a more focused and collaborative environment for Data Science projects.
The Agile Manifesto poses challenges due to its loose definition and sometimes feels akin to another Conjoined Triangle of Success[5]. Its values and principles are not universally applicable across industries, and in the realm of Data Science, they often clash with the specific needs and workflows of projects in this field.
As a Data Scientist, it is not uncommon to find yourself pulled into the Agile methodology. However, I encourage you to consider alternatives. Agile is unlikely to serve the best interests of your employer or customers, and it may drain your energy, focus, and time, diverting you from the path of professional growth. Engaging in low-value activities that stray from your core skills can hinder your career.
On a personal and professional level, it is worth considering adjusting your compensation to reflect the challenges posed by following the Agile methodology. The Agile workflow often fosters an environment focused on justifying the methodology itself rather than addressing genuine business needs. Among the various negative aspects, the sense of wasted time can be particularly disheartening. As professionals and human beings, our time is limited, and how we allocate it directly impacts our learning curve and overall fulfillment.
Moreover, Agile's impact on creativity cannot be overlooked. The rigid planning, approvals, timeboxing, and administrative burdens disrupt the very essence of creativity crucial to excelling in Data Science. The prevalence of frequent meetings and administrative tasks stifles the creative process necessary for innovation.
Unfortunately, the prevailing trend indicates that Agile will continue to gain traction. As the world becomes more challenging, we can anticipate an increase in Agile practices.
In conclusion, as professionals, we recognize the importance of navigating the challenges of Agile with resilience as our armor and integrity as our compass. We shall always strive for impactful work, ensuring our actions align with our principles and exemplify professionalism.
I am a seasoned professional with over 20 years of experience in both technical and non-technical roles in technology. I provide contract services to small and medium sized Hedge Funds in AI/Quantitative and Financial Market Data areas.
I live with my family in Denmark in the countryside. If you would like to discuss industry trends, share insights, or explore potential collaborations, I am always happy to connect.
The opinions expressed in this article are solely my own and do not reflect the views or opinions of any past, present, or future employer or customer.
[1] https://en.wikipedia.org/wiki/Philip_Zimbardo
[2] https://agilemanifesto.org/
[3] https://www.merriam-webster.com/dictionary/manifesto
[4] https://de.wikipedia.org/wiki/Politoffizier
Data analytics in the cloud: understand the hidden costs – CIO
Luke Roquet recently spoke to a customer who recounted the shock of getting a $700,000 bill for a single data science workload running in the cloud. When Roquet, who is senior vice president of product marketing at Cloudera, related the story to another customer, he learned that that company had received a $400,000 tab for a similar job just the week before.
Such stories should belie the common myth that cloud computing is always about saving money. "In fact, most executives I've talked to say that moving an equivalent workload from on-premises to the cloud results in about a 30% cost increase," said Roquet.
This doesn't mean the cloud is a poor option for data analytics projects. In many scenarios, the scalability and variety of tooling options make the cloud an ideal target environment. But the choice of where to locate data-related workloads should take multiple factors into account, of which only one is cost.
Data analytics workloads can be especially unpredictable because of the large data volumes involved and the extensive time required to train machine learning (ML) models. "These models often have unique characteristics that can cause their costs to explode," Roquet said.
What's more, local applications often need to be refactored or rebuilt for a specific cloud platform, said David Dichmann, senior director of product management at Cloudera. "There's no guarantee that the workload is going to be improved and you can end up being locked into one cloud or another," he said.
Cloud march is on
That doesn't seem to be slowing the ongoing cloudward migration of workloads. Foundry's 2022 Data & Analytics study found that 62% of IT leaders expect the share of analytics workloads they run in the cloud to increase.
Although cloud platforms offer many advantages, "cost- and performance-sensitive workloads are often better run on-prem," Roquet said.
Choosing the right environment is about achieving balance. The cloud excels for applications that are ephemeral, need to be shared with others, or use cloud-native constructs like software containers and infrastructure-as-code, he said. Conversely, applications that are performance- or latency-sensitive are more appropriate for local infrastructure, where data can be co-located and long processing times don't incur additional costs.
The goal should be to optimize workloads to interact with each other regardless of location and to move as needed between local and cloud environments.
The case for portability
Dichmann said three core components are needed to achieve this interoperability and portability:
"Once you have one view of all your data and one way to govern and secure it, then you can move workloads around without worrying about breaking any governance and security requirements," he said. "People know where the data is, how to find it, and we're all assured it will be used correctly per business policy or regulation."
Portability may be at odds with customers' desire to deploy best-of-breed cloud services, but Dichmann said fit-for-purpose is a better goal than best-of-breed. That means it's more important to put flexibility ahead of bells and whistles. This gives the organization maximum flexibility for deciding where to deploy workloads.
A healthy ecosystem is also just as important as robust points solutions because a common platform enables customers to take advantage of other services without extensive integration work.
The best option for achieving workload portability is to use an abstraction layer that runs across all major cloud and on-premises platforms. The Cloudera Data Platform, for example, is "a true hybrid solution that provides the same services both in the cloud and on-prem," Dichmann said. "It uses open standards that give you the ability to have data share a common format everywhere it needs to be, and accessed by a broader ecosystem of data services that makes things even more flexible, more accessible and more portable."
Visit Cloudera to learn more.
IIT Guwahati introduces online BSc (Hons) degree in Data Science and Artificial Intelligence: All the det – Times of India
The Indian Institute of Technology (IIT) Guwahati, ranked as the country's 7th best engineering institute, is introducing an online Bachelor of Science (Hons) Degree Program in Data Science and Artificial Intelligence. This program will be offered on Coursera. By completing this online degree, students will be equipped with the necessary skills to pursue lucrative and rapidly growing careers in the fields of data science and artificial intelligence.
The National Education Policy 2020 recognizes the significance of training professionals in cutting-edge areas such as machine learning, AI, and extensive data analysis. By doing so, it aims to enhance the employability of young people. The policy also highlights the need to increase the Gross Enrollment Ratio in higher education.
According to the World Economic Forum's Future of Jobs Report 2023, tech roles such as AI and machine learning specialists, data analysts, and data scientists are expected to grow by over 30% by 2028.
IIT Guwahati is responding to this demand and following the recommendations of NEP 2020 by offering multiple admission pathways to its completely online degree program.
Anyone who has completed Class XII or its equivalent with mathematics as a compulsory subject can apply. Candidates who have registered for JEE Advanced (in any year) will receive direct admission, while those who have not can complete an online course and be admitted based on their performance.
The degree program offers multiple exit options based on the number of credits earned. Learners can choose to receive a certificate, diploma, degree, or honours degree. Optional campus visits also provide opportunities for students to connect with faculty and peers.
The program for students starts with a foundation in coding and then advances to more specialised subjects such as generative AI, deep learning, computer vision, and data mining. Learning is further enhanced through group projects, real-world case studies, and internships. The program also offers industry micro-credentials that recognize prior learning, which allows students to gain more job-relevant knowledge.
The graduates will have the opportunity to apply for over 400,000 job openings in various fields such as AI engineering, data engineering, ML engineering, and data analysis. IIT Guwahati provides job placement assistance to students and grants access to Coursera's recruitment platform, Coursera Hiring Solutions.
"This program teaches students the digital skills they need to thrive in the modern workforce. They graduate knowing how to implement the latest AI and data science techniques in any field, setting them up for success in their careers," said Prof. Parameswar K. Iyer, Officiating Director, IIT Guwahati.
Unleashing the Power of AI and Data Science Careers – Analytics Insight
Unlocking opportunities with the 10 high-paying careers in AI and data science for financial success
Artificial intelligence (AI) and data science have emerged as dynamic and influential fields in today's rapidly advancing digital landscape, driving innovation and transforming industries worldwide. As organizations strive to gain a competitive edge and make data-driven decisions, the demand for skilled AI and data science professionals has skyrocketed.
To address this growing need, this article presents an insightful exploration of ten high-paying career options within these domains. From machine learning engineers and data scientists to AI research scientists and big data engineers, these professions offer financial rewards and the opportunity to be at the forefront of technological advancements. By delving into this comprehensive article, readers will gain valuable insights into the diverse AI and data science pathways, ultimately empowering them to embark on exciting and lucrative career journeys.
Machine learning engineers are at the forefront of AI development. They design and implement machine learning algorithms that enable computers to learn and improve from experience. These professionals deeply understand statistical modeling, programming languages, and data manipulation.
Data scientists are skilled in extracting meaningful insights from vast amounts of data. They utilize advanced statistical techniques and machine learning algorithms to uncover patterns, trends, and correlations. Data scientists are crucial in guiding business strategies and making data-driven decisions. Data science is a highly sought-after profession with a median salary exceeding $120,000 per year.
AI research scientists focus on developing innovative AI algorithms and models. They delve into cutting-edge research to push the boundaries of AI capabilities. These professionals possess strong mathematical and analytical skills and expertise in machine learning and deep learning techniques.
As the volume of data grows exponentially, big data engineers play a vital role in managing and processing large-scale datasets. They develop robust data pipelines, implement data storage solutions, and optimize data retrieval and analysis. With a median salary of around $110,000 annually, big data engineering offers lucrative opportunities for professionals with strong programming and database skills.
Ethical considerations are paramount as AI becomes increasingly integrated into various aspects of society. AI ethicists examine the societal impacts of AI systems and ensure their responsible and ethical use. They develop guidelines and policies to address ethical challenges related to AI deployment.
Business intelligence analysts leverage data to drive strategic decision-making within organizations. They collect and analyze data from various sources, providing valuable insights to support business growth and optimization. These professionals excel in data visualization, statistical analysis, and data storytelling.
Robotics engineers merge AI and mechanical engineering to create intelligent robotic systems. They design, develop, and program robots that can perform complex tasks autonomously or assist humans in various industries. Robotics engineers work across diverse sectors, such as manufacturing, healthcare, and logistics, pushing the boundaries of AI and automation.
NLP engineers specialize in developing algorithms and systems that enable computers to understand and interact with human language. They design chatbots, voice assistants, and language translation systems. With the increasing demand for AI-powered language processing solutions, NLP engineers are highly sought after by industries such as customer service, healthcare, and communication.
Computer vision engineers harness the power of AI to enable machines to interpret and understand visual information. They develop image and video analysis algorithms, object recognition, and autonomous navigation. Computer vision finds applications in autonomous vehicles, surveillance systems, medical imaging, and augmented reality, creating exciting career opportunities for computer vision engineers.
AI product managers bridge the gap between technical teams and business stakeholders. They possess a strong understanding of AI technologies and market trends, enabling them to guide the development of AI-powered products and services. AI product managers are responsible for defining product strategy, identifying customer needs, and overseeing the product lifecycle.