Category Archives: Data Mining

GIS-based non-grain cultivated land susceptibility prediction using data mining methods | Scientific Reports –

Research flow

The NCL susceptibility prediction study includes four main parts: (1) screening and analysis of the factors influencing NCL; (2) construction of the NCL susceptibility prediction model; (3) NCL susceptibility prediction; and (4) evaluation of the prediction results. The research flow is shown in Fig. 2.

The NCL locations were obtained from Google Earth interpretation, field surveys and data released by the local government, yielding a total of 184 NCL locations. To determine the non-NCL locations, GIS software was used to randomly select 184 locations. To reduce modelling bias, the non-NCL points were generated at a distance of 200 m from the NCL locations. The points were then divided into training and testing samples in a ratio of 7:3, forming the training dataset and the testing dataset (Fig. 3).
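The 7:3 split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the label array is a synthetic stand-in for the 184 NCL and 184 non-NCL points, and splitting each class separately (stratification) is an assumption, since the text does not say how the split was drawn.

```python
import numpy as np

# Hypothetical stand-in for the inventory: 184 NCL points (label 1)
# and 184 non-NCL points (label 0), as described in the text.
labels = np.concatenate([np.ones(184, dtype=int), np.zeros(184, dtype=int)])


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut at the 70% mark.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_fraction * idx.size))
        train_parts.append(idx[:n_train])
        test_parts.append(idx[n_train:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the NCL and non-NCL proportions equal in both datasets, which matters for the validation statistics computed later.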

Currently, there is no consensus on the factors influencing NCL. Therefore, based on historical research materials and on-site field investigations [24-28], 16 Non-grain Cultivated Land Susceptibility conditioning factors (NCLSCFs) were chosen for modelling NCL susceptibility, covering topographical, geological, hydrological, climatological and environmental conditions. In addition, a systematic literature review of NCL modelling was performed to help identify the most suitable NCLSCFs for this study. The NCLSCF maps are shown in Fig. 4.

Typical NCL factors map: (a) Slope; (b) Aspect; (c) Plan curvature; (d) Profile curvature; (e) TWI; (f) SPI; (g) Rainfall; (h) Drainage density; (i) Distance from river; (j) Lithology; (k) Fault density; (l) Distance from fault; (m) Landuse; (n) Soil; (o) Distance from road.

(1) Topographical factors

The occurrence of NCL and its recurrence frequency depend strongly on the topographical factors of an area. Several topographical factors, such as slope, elevation and curvature, are triggering parameters for the development of NCL activities [29]. Here, six topographical factors were chosen: altitude, slope, aspect, plan curvature, profile curvature and topographic wetness index (TWI). All of these factors play a considerable part in NCL development in the study area. They were prepared in ArcGIS from Shuttle Radar Topography Mission (SRTM) digital elevation model (DEM) data with 30 m resolution. In the output maps, altitude ranges from 895 to 3289 m (Fig. 3), slope from 0 to 261.61%, plan curvature from −12.59 to 13.40, profile curvature from −13.05 to 12.68 and TWI from 4.96 to 24.75; the aspect map has nine directions (flat, north, northeast, east, southeast, south, southwest, west, northwest). The following equation was applied to compute TWI:

$$TWI = \ln\frac{\alpha}{\tan\beta + C}$$

where \(\alpha\) specifies flow accumulation, \(\beta\) specifies slope and C is a constant (0.01).
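As a small numerical sketch of the TWI formula, the function below evaluates ln(α / (tan β + C)) on arrays. It assumes the slope is supplied in degrees and that the flow accumulation values are strictly positive; the function name and the sample values are illustrative only and are not taken from the study.

```python
import numpy as np


def twi(flow_accumulation, slope_deg, c=0.01):
    """Topographic wetness index: ln(alpha / (tan(beta) + C)).

    flow_accumulation: alpha, must be > 0 (assumption for this sketch).
    slope_deg: beta in degrees (assumption; converted to radians here).
    c: the constant C from the text, 0.01 by default.
    """
    alpha = np.asarray(flow_accumulation, dtype=float)
    beta = np.radians(np.asarray(slope_deg, dtype=float))
    return np.log(alpha / (np.tan(beta) + c))


# Illustrative cells: flat terrain with high accumulation is "wetter"
# (higher TWI) than a steep cell with low accumulation.
print(twi([500.0, 5.0], [0.0, 30.0]))
```

The constant C keeps the denominator non-zero on perfectly flat cells, where tan β is 0.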

(2) Hydrological factors

Sub-surface hydrology is treated as the activating mechanism for the occurrence of NCL, as water plays a significant part in soil moisture content. Therefore, four hydrological factors were chosen for modelling NCL susceptibility: drainage density, distance from river, stream power index (SPI) and annual rainfall [30]. SRTM DEM data of 30 m spatial resolution was used to map the first three hydrological variables. The drainage density and distance from river maps were prepared with the line density and Euclidean distance tools, respectively, in the GIS platform. The following formula was applied to compute SPI:

$$SPI = A_s \times \tan\beta$$

where \(A_s\) specifies the specific catchment area in square meters and \(\beta\) specifies the slope angle in degrees. The precipitation map of the area was derived from the records of 19 climatological stations around the province over a statistical period of 25 years, using the kriging interpolation method in the GIS platform. The output drainage density ranges from 0 to 1.68 km/km². Meanwhile, distance from river ranges between 0 and 9153.93 m, average annual rainfall varies from 175 to 459.98 mm and SPI ranges from 0 to 8.44.

(3) Geological factors

The characteristics of the rock mass, i.e., the lithological characteristics of an area, significantly affect NCL activities [31]. Geological factors are therefore commonly used as input parameters in NCL susceptibility studies to improve NCL prediction. In the current study, three geological factors (lithology, fault density and distance from fault) were chosen. The lithological map and fault lines were obtained from the geological map of the study area, gathered from the local government at a scale of 1:100,000. The fault density and distance from fault maps were prepared with the line density and Euclidean distance tools, respectively, in the GIS platform. In this area, fault density varies from 0 to 0.54 km/km² and distance from fault ranges from 0 to 28,247.1 m. The lithological map of the area is presented in Fig. 4b.

(4) Environmental factors

Several environmental factors can also be significant triggering factors for NCL occurrence in mountainous or hilly regions [32]. Here, land use land cover (LULC), soil and distance from road were selected as environmental variables for predicting NCL susceptibility. The LULC map was derived from Landsat 8 OLI satellite images by applying the maximum probability algorithm in ENVI. The soil texture map was prepared from the soil map of the study area. The road map of the area was digitized from the topographical map provided by the local government. The output LULC factor was classified into six land use classes, the soil map was classified into eight soil texture groups, and distance from road ranges from 0 to 31,248.1 m.

Because the NCLSCFs are selected manually, and their dimensions and quantification methods are derived through mathematical operations, there may be multicollinearity problems among the NCLSCFs when they are used as input data for modelling [33]. Such problems arise from exact or high correlation between NCLSCFs, which can distort the model or make estimation difficult. To avoid them, this study examines the variance inflation factor and the tolerance index to assess whether multicollinearity exists among the NCLSCFs.

Multicollinearity (MC) analysis was conducted among the chosen NCLSCFs to optimize the NCL susceptibility model and its predictions [34]. The tolerance (TOL) and variance inflation factor (VIF) statistics were used to test MC in SPSS. Studies indicate that there is a multicollinearity issue if the VIF value is > 5 and the TOL value is < 0.10. TOL and VIF were computed with the following formulas:

$$TOL = 1 - R_j^2$$

$$VIF = \frac{1}{TOL}$$

where \(R_j^2\) represents the coefficient of determination obtained by regressing factor j on the other factors.
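The tolerance and VIF check can be sketched in a few lines, using the standard definitions TOL = 1 − R_j² and VIF = 1/TOL. The study used SPSS; this NumPy version is only an illustration, and the three synthetic "factors" below are hypothetical, with the third deliberately built to be nearly collinear with the first.

```python
import numpy as np


def tolerance_and_vif(X):
    """Return (tol, vif) arrays with one entry per column of X.

    Each factor j is regressed on all other factors (plus an intercept)
    by ordinary least squares; R_j^2 is that regression's R-squared.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    tol = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residual = y - others @ coef
        r2 = 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)
        tol[j] = 1.0 - r2
    return tol, 1.0 / tol


# Synthetic example (not study data): factor c is almost a copy of factor a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
tol, vif = tolerance_and_vif(np.column_stack([a, b, c]))
print(np.round(tol, 3), np.round(vif, 1))
```

In this toy data the first and third factors show a large VIF and a small tolerance, while the independent second factor stays near VIF = 1.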

This section details the machine learning models of GBM and XGB, as used in NCL susceptibility studies.

In prediction performance analysis, GBM is one of the most popular machine learning methods; it is widely applied by researchers in different fields and is treated as a supervised classification technique. A variety of classification and regression problems are solved with the GBM method, which was first proposed by Friedman [35]. The model is an ensemble of weak prediction models such as decision trees, and is considered one of the most important prediction models. A GBM model requires three components: a loss function, a weak learner for prediction, and an additive model that adds weak learners so as to optimize the loss function. In addition to these components, three important tuning parameters are required to build a GBM model: n-tree (the maximum number of trees), tree depth (the highest possible interaction among the independent variables) and shrinkage (the learning rate) [36]. An advantage of such a model is that the loss function and the weak learners can be specified flexibly. It is difficult to obtain the optimal estimate directly for an arbitrary loss function \(\Psi(y, f)\) and weak learner \(h(x, \theta)\). To solve this problem, a new function \(h(x, \theta_t)\) is chosen to be most parallel to the negative gradient \(\{g_t(x_i)\}_{i=1}^{N}\) along the observed data:

$$g_t(x) = E_y\left[\frac{\partial \Psi(y, f(x))}{\partial f(x)} \,\Big|\, x\right]_{f(x) = f^{t-1}(x)}$$

This new function is the one most highly correlated with \(-g_t(x)\). The algorithm thus replaces the original optimization with a least-squares minimization, using the following equation:

$$(\rho_t, \theta_t) = \arg\min_{\rho, \theta} \sum_{i=1}^{N} \left[-g_t(x_i) + \rho\, h(x_i, \theta)\right]^2$$
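To make the least-squares step concrete, here is a minimal gradient boosting sketch. It assumes a squared-error loss (so the negative gradient is simply the residual y − f), a single numeric feature, and one-split regression stumps as the weak learner; none of these choices comes from the study, and the sine data is a synthetic placeholder.

```python
import numpy as np


def fit_stump(x, target):
    """Least-squares regression stump: returns (threshold, left, right)."""
    best = None
    for thr in np.unique(x)[:-1]:
        left, right = target[x <= thr], target[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    thr, left_value, right_value = stump
    return np.where(x <= thr, left_value, right_value)


def gradient_boost(x, y, n_trees=50, shrinkage=0.1):
    """Boost stumps on the negative gradient of the loss 1/2 * (y - f)^2."""
    f = np.full_like(y, y.mean(), dtype=float)
    model = []
    for _ in range(n_trees):
        neg_gradient = y - f                      # -g_t(x_i) for squared loss
        stump = fit_stump(x, neg_gradient)        # theta_t: least-squares fit
        h = predict_stump(stump, x)
        rho = (neg_gradient @ h) / (h @ h)        # rho_t: least-squares step
        f = f + shrinkage * rho * h
        model.append((rho, stump))
    return f, model


# Synthetic data, for illustration only.
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
mse_start = np.mean((y - y.mean()) ** 2)
f, model = gradient_boost(x, y)
mse_end = np.mean((y - f) ** 2)
print(mse_start, mse_end)
```

Because each stump is itself a least-squares fit to the negative gradient, the step size ρ comes out as 1 here; with other losses the two minimizations differ and ρ has to be found by a separate line search.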


Chen & Guestrin later introduced the XGB algorithm, an advanced machine learning method that is more efficient than many others [37]. XGB is based on classification trees and the gradient boosting framework, which it implements through parallel tree boosting. The algorithm is chiefly applied to boost the performance of classification trees. A classification tree is made up of a set of rules that classify each input as a function of the predictor variables in a tree structure. Each structure is built as an individual tree whose leaves are assigned scores, which determine the predicted class, i.e., categorical or ordinal. The objective used to train the ensemble in XGB combines a loss function with a regularization term that penalizes tree complexity [38]. This regularization can significantly improve prediction performance by alleviating over-fitting. The boosting method, which combines weak learners, is used in XGB to optimize the prediction. XGB models are configured through three groups of parameters: general, booster and task parameters. The weighted averages of several tree models are then combined to form the output. The following optimization function was applied to form the XGBoost model:

$$OF(\theta) = \sum_{i=1}^{n} l\left(y_i, \overline{y}_i\right) + \sum_{k=1}^{K} \omega(f_k)$$

where \(\sum_{i=1}^{n} l\left(y_i, \overline{y}_i\right)\) is the loss function over the training dataset, \(\sum_{k=1}^{K} \omega(f_k)\) is the regularization term that counters over-fitting, K is the number of individual trees, \(f_k\) is the k-th tree in the ensemble, and \(\overline{y}_i\) and \(y_i\) are the actual and predicted output variables, respectively.
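The structure of this objective, a data-fit term plus a per-tree complexity penalty, can be illustrated with a toy calculation. The text does not specify the loss l or the penalty ω, so the sketch assumes a squared-error loss and the penalty γT + ½λ‖w‖² (T leaves with weights w) from Chen & Guestrin's formulation; the function name and all numbers are hypothetical.

```python
import numpy as np


def regularized_objective(y_true, y_pred, leaf_weights_per_tree, gamma=1.0, lam=1.0):
    """Loss plus complexity penalty, summed over all trees.

    leaf_weights_per_tree: one array of leaf scores per tree.
    Assumed forms: l = (y - y_hat)^2 and
    omega(f) = gamma * T + 0.5 * lam * sum(w^2).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    loss = np.sum((y_true - y_pred) ** 2)
    penalty = sum(
        gamma * len(w) + 0.5 * lam * np.sum(np.square(w))
        for w in leaf_weights_per_tree
    )
    return loss + penalty


# One tree with two leaves scoring +0.5 and -0.5.
value = regularized_objective([1.0, 0.0], [0.5, 0.5], [np.array([0.5, -0.5])])
print(value)  # 0.5 (loss) + 2.25 (penalty) = 2.75
```

A tree with more leaves or larger leaf scores raises the penalty, so it is only worth adding when it lowers the loss by more than it costs.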

Kennedy, an American social psychologist, developed the PSO algorithm based on the food-seeking behaviour of flocks of birds [39]. It is a meta-heuristic simulation of a social model, related to studies of fish schooling, bird flocking and swarming theory. PSO can be applied to the non-linear problems encountered in research. The algorithm models how a bird flock or fish school finds the best achievable direction in which to collect food. Here, birds are treated as particles that search for an optimal solution to the problem. Each bird is an individual, and the swarm is treated as a population, as in other evolutionary algorithms. The particles try to locate the best possible solution in an n-dimensional space, where n is the number of parameters of the problem [40]. PSO rests on two fundamental quantities, position and velocity, which govern the movement of each particle.

Hence, \(x_i^t = (x_{i1}^t, x_{i2}^t, \ldots, x_{in}^t)\) and \(v_i^t = (v_{i1}^t, v_{i2}^t, \ldots, v_{in}^t)\) are the position and velocity of the ith particle in the tth iteration. The following formulas give the velocity and position of the ith particle in the (t+1)th iteration:

$$v_i^{t+1} = \omega v_i^t + c_1 r_1 \left(p_i^t - x_i^t\right) + c_2 r_2 \left(g^t - x_i^t\right)$$

$$x_i^{t+1} = x_i^t + v_i^{t+1}$$

where \(x_i^t\) is the previous position of the ith particle; \(p_i^t\) is the best position found so far by that particle; \(g^t\) is the best position found by the swarm; \(r_1\) and \(r_2\) are random numbers between 0 and 1; \(\omega\) is the inertia weight; \(c_1\) is the cognitive coefficient and \(c_2\) is the social coefficient. Several methods have been proposed for assigning these weights. Among them, standard PSO 2011 is the most popular and has been widely used by previous researchers. Here, standard PSO 2011 was used to set the particle weights with the following formula:

$$\omega = \frac{1}{2\ln 2} \quad \text{and} \quad c_1 = c_2 = 0.5 + \ln 2$$
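The particle update can be sketched as below. This is the classical velocity and position update with the inertia weight and coefficients quoted above, not a full implementation of standard PSO 2011 (which also changes how the random perturbation is sampled). The sphere objective, swarm size, iteration count and search bounds are placeholder assumptions.

```python
import numpy as np

OMEGA = 1.0 / (2.0 * np.log(2.0))   # inertia weight, about 0.721
C1 = C2 = 0.5 + np.log(2.0)         # cognitive and social coefficients, about 1.193


def pso_minimize(objective, dim, n_particles=20, n_iter=100, bound=5.0, seed=0):
    """Minimize `objective` over R^dim with a basic particle swarm."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-bound, bound, size=(n_particles, dim))  # positions
    v = np.zeros_like(x)                                     # velocities
    p = x.copy()                                             # personal bests
    p_val = np.array([objective(row) for row in x])
    g = p[np.argmin(p_val)].copy()                           # swarm best
    for _ in range(n_iter):
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        v = OMEGA * v + C1 * r1 * (p - x) + C2 * r2 * (g - x)
        x = x + v
        val = np.array([objective(row) for row in x])
        improved = val < p_val
        p[improved] = x[improved]
        p_val[improved] = val[improved]
        g = p[np.argmin(p_val)].copy()
    return g, float(p_val.min())


# Placeholder objective: the sphere function, minimized at the origin.
best_pos, best_val = pso_minimize(lambda z: float(np.sum(z ** 2)), dim=2)
print(best_pos, best_val)
```

When PSO is used to tune a model such as GBM or XGB, each particle position would encode one candidate set of hyperparameters and the objective would be a validation error; that wiring is left out of this sketch.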


Evaluation is an important step for quantifying the accuracy of each model. In other words, the quality of the output model is established through a validation assessment [41]. Studies indicate that several statistical techniques can be applied to evaluate the accuracy of the algorithms; among them, the most frequently used is the receiver operating characteristic area under the curve (ROC-AUC). Here, sensitivity (SST), specificity (SPF), positive predictive value (PPV), negative predictive value (NPV) and ROC-AUC were all applied to validate and assess the accuracy of the models. These statistics were computed from four indices: true positive (TP), true negative (TN), false positive (FP) and false negative (FN) [42]. Correctly and incorrectly identified NCL susceptibility zones are represented by TP and FP, and correctly and incorrectly identified non-NCL susceptibility zones are represented by TN and FN, respectively. The ROC is widely used as a standard process to evaluate the accuracy of the methods. It is based on event and non-event phenomena. For all of these statistics, a higher value represents good performance by the model and a lower value represents poor performance. The statistics applied in this study were computed with the following formulas:

$$\text{SST} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$


$$\text{SPF} = \frac{\text{TN}}{\text{FP} + \text{TN}}$$


$$\text{PPV} = \frac{\text{TP}}{\text{FP} + \text{TP}}$$


$$\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}$$


$$AUC = \frac{\Sigma \text{TP} + \Sigma \text{TN}}{\text{P} + \text{N}}$$
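These statistics are straightforward to compute from the four confusion-matrix counts. The sketch below uses the standard definitions, including NPV = TN / (TN + FN), and the AUC expression as given in the text, with P + N taken as the total number of samples. The counts are made-up numbers for illustration, not results from the study.

```python
def validation_statistics(tp, tn, fp, fn):
    """Return SST, SPF, PPV, NPV and the text's AUC expression as a dict."""
    return {
        "SST": tp / (tp + fn),                    # sensitivity
        "SPF": tn / (fp + tn),                    # specificity
        "PPV": tp / (fp + tp),                    # positive predictive value
        "NPV": tn / (tn + fn),                    # negative predictive value
        "AUC": (tp + tn) / (tp + tn + fp + fn),   # (sum TP + sum TN) / (P + N)
    }


# Hypothetical counts for a 110-point testing dataset.
stats = validation_statistics(tp=50, tn=45, fp=10, fn=5)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

All five values lie between 0 and 1, and higher is better, which matches the interpretation given above.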



Data mining the archives | Opinion – Chemistry World

History, including the history of science, has a narrative tradition. Even if the historian's research has involved a dive into archival material such as demographic statistics or political budgets to find quantitative support for a thesis, the stories it tells are best expressed in words, not graphs. Typically, any mathematics it requires would hardly tax an able school student.

But there are some aspects of history that only a sophisticated analysis of quantitative data can reveal. That was made clear in a 2019 study by researchers in Leipzig, Germany [1], who used the Reaxys database of chemical compounds to analyse the growth in the number of substances documented in scientific journals between 1800 and 2015. They found that this number has grown exponentially, with an annual rate of 4.4% on average.

And by inspecting the products made, the researchers identified three regimes, which they call proto-organic (before 1861), organic (1861 to 1980) and organometallic (from 1981). Each of these periods is characterised by a change: a progressive decrease in the variability or volatility of the annual figures.

There's more that can be gleaned from those data, but the key points are twofold. First, while the conclusions might seem retrospectively consistent with what one might expect, only precise quantification, not anecdotal inspection of the literature, could reveal them. It is almost as if all the advances in both theory (the emergence of structural theory and of the quantum description of the chemical bond, say) and in techniques don't matter so much in the end to what chemists make, or at least to their productivity in making. (Perhaps unsurprisingly, the two world wars mattered more to that, albeit transiently.)


Second, chemistry might be uniquely favoured among the sciences for this sort of quantitative study. It is hard to imagine any comparable index to gauge the progress of physics or biology. The expansion of known chemical space is arguably a crude measure of what it is that chemists do and know, but it surely counts for something. And as Guillermo Restrepo, one of the 2019 study's authors and an organiser of a recent meeting at the Max Planck Institute for Mathematics in the Sciences in Leipzig on quantitative approaches to the history of chemistry, says, the existence of such a measure speaks to the unusual ontological stability of chemistry: since John Dalton's atomic theory at the start of the 19th century, it has been consistently predicated on the idea that chemical compounds are combinations of atomic elemental constituents.

Still, there are other ways to mine historical evidence for quantitative insights into the history of science, often now aided by AI techniques. Matteo Valleriani of the Max Planck Institute for the History of Science in Berlin, Germany, and his colleagues have used such methods to compare the texts of printed Renaissance books that used parts of the treatise on astronomy by the 13th century scholar Johannes de Sacrobosco. The study elucidated how relationships between publishers, and the sheer mechanics of the printing process (where old plates might be reused for convenience), influenced the spread and the nature of scientific knowledge in this period.

And by using computer-assisted linguistic analysis of texts in the Philosophical Transactions of the Royal Society in the 18th and 19th centuries, Stefania Degaetano-Ortlieb of Saarland University in Germany and colleagues have identified the impact of Antoine Lavoisier's new chemical terminology from around the 1790s. This amounts to more than seeing new words appear in the lexicon: the statistics of word frequencies and placings disclose the evolving norms and expectations of the scientific community. At the other end of the historical trajectory, an analysis of the recent chemical literature by Marisol Bermúdez-Montaña of Tecnológico de Monterrey in Mexico reveals the dramatic hegemony of China in the study of rare-earth chemistry since around 2003.

All this work depends on accessibility of archival data, and it was a common refrain at the meeting that this can't be taken for granted. As historian of science Jeffrey Johnson of Villanova University in Pennsylvania, US, pointed out at the meeting, there is a private chemical space explored by companies who keep their results (including negative findings) proprietary. And researchers studying the history of Russian and Soviet chemistry have, for obvious geopolitical reasons, had to shift their efforts elsewhere, and for who knows how long?

But even seemingly minor changes to archives might matter to historians: Robin Hendry of Durham University in the UK mentioned how the university library's understandable decision to throw out paper copies of old journals that are available online obliterates tell-tale clues for historians of which pages were well-thumbed. The recent cyberattacks on the British Library remind us of the vulnerability of digitised records. We can't take it for granted that the digital age will have the longevity or the information content of the paper age.


Top 14 Data Mining Tools You Need to Know in 2024 and Why – Simplilearn

Driven by the proliferation of internet-connected sensors and devices, the world today is producing data at a dramatic pace, like never before. While one part of the globe is sleeping, the other part is beginning its day with Skype meetings, web searches, online shopping, and social media interactions. This literally means that data generation, on a global scale, is a never-ceasing process.

A report published by cloud software company DOMO on the amount of data that the virtual world generates per minute will shock any person. According to DOMO's study, each minute, the Internet population posts 511,200 tweets, watches 4,500,000 YouTube videos, creates 277,777 Instagram stories, sends 4,800,000 gifs, takes 9,772 Uber rides, makes 231,840 Skype calls, and transfers more than 162,037 payments via mobile payment app, Venmo.

With such massive volumes of digital data being captured every minute, most forward-looking organizations are keen to leverage advanced methodologies to extract critical insights from data, which facilitates better-informed decisions that boost profits. This is where data mining tools and technologies come into play.

Data mining involves a range of methods and approaches to analyze large sets of data to extract business insights. Data mining starts soon after the collection of data in data warehouses, and it covers everything from the cleansing of data to creating a visualization of the discoveries gained from the data.

Also known as "Knowledge Discovery," data mining typically refers to in-depth analysis of vast datasets that exist in varied emerging domains, such as Artificial Intelligence, Big Data, and Machine Learning. The process searches for trends, patterns, associations, and anomalies in data that enable enterprises to streamline operations, augment customer experiences, predict the future, and create more value.

The key stages involved in data mining include:

Data scientists employ a variety of data mining tools and techniques for different types of data mining tasks, such as cleaning, organizing, structuring, analyzing, and visualizing data. Here's a list of both paid and open-source data mining tools you should know about in 2024.

One of the best open-source data mining tools on the market, Apache Mahout, developed by the Apache Software Foundation, primarily focuses on collaborative filtering, clustering, and classification of data. Written in Java, an object-oriented, class-based programming language, Apache Mahout incorporates useful Java libraries that help data professionals perform diverse mathematical operations, including statistics and linear algebra.

The top features of Apache Mahout are:

Dundas BI is one of the most comprehensive data mining tools used to generate quick insights and facilitate rapid integrations. The high-caliber data mining software leverages relational data mining methods, and it places more emphasis on developing clearly-defined data structures that simplify the processing, analysis, and reporting of data.

Key features of Dundas BI include:

Teradata, also known as the Teradata Database, is a top-rated data mining tool that features an enterprise-grade data warehouse for seamless data management and data mining. The market-leading data mining software, which can differentiate between "cold" and "hot" data, is predominately used to get insights into business-critical data related to customer preferences, product positioning, and sales.

The main attributes of Teradata are:

The SAS Data Mining Tool is a software application developed by the Statistical Analysis System (SAS) Institute for high-level data mining, analysis, and data management. Ideal for text mining and optimization, the widely-adopted tool can mine data, manage data, and do statistical analysis to provide users with accurate insights that facilitate timely and informed decision-making.

Some of the core features of the SAS Data Mining Tool include:

The SPSS Modeler software suite was originally owned by SPSS Inc. but was later acquired by the International Business Machines Corporation (IBM). The SPSS software, which is now an IBM product, allows users to use data mining algorithms to develop predictive models without any programming. The popular data mining tool is available in two flavors - IBM SPSS Modeler Professional and IBM SPSS Modeler Premium, incorporating additional features for entity analytics and text analytics.

The primary features of IBM SPSS Modeler are:

One of the most well-known open-source data mining tools written in Java, DataMelt integrates a state-of-the-art visualization and computational platform that makes data mining easy. The all-in-one DataMelt tool, integrating robust mathematical and scientific libraries, is mainly used for statistical analysis and data visualization in domains dealing with massive data volumes, such as financial markets.

The most prominent DataMelt features include:

A GUI-based, open-source data mining tool, Rattle leverages the R programming language's powerful statistical computing abilities to deliver valuable, actionable insights. With Rattle's built-in code tab, users can create duplicate code for GUI activities, review it, and extend the log code without any restrictions.

Key features of the Rattle data mining tool include:

One of the most-trusted data mining tools on the market, Oracle's data mining platform, powered by the Oracle database, provides data analysts with top-notch algorithms for specialized analytics, data classification, prediction, and regression, enabling them to uncover insightful data patterns that help make better market predictions, detect fraud, and identify cross-selling opportunities.

The main strengths of Oracle's data mining tool are:

Fit for both small and large enterprises, Sisense allows data analysts to combine data from multiple sources to develop a repository. The first-rate data mining tool incorporates widgets as well as drag and drop features, which streamline the process of refining and analyzing data. Users can select different widgets to quickly generate reports in a variety of formats, including line charts, bar graphs, and pie charts.

Highlights of the Sisense data mining tool are:

RapidMiner stands out as a robust and flexible data science platform, offering a unified space for data preparation, machine learning, deep learning, text mining, and predictive analytics. Catering to both technical experts and novices, it features a user-friendly visual interface that simplifies the creation of analytical processes, eliminating the need for in-depth programming skills.

Key features of RapidMiner include:

KNIME (Konstanz Information Miner) is an open-source data analytics, reporting, and integration platform allowing users to create data flows visually, selectively execute some or all analysis steps, and inspect the results through interactive views and models. KNIME is particularly noted for its ability to incorporate various components for machine learning and data mining through its modular data pipelining concept.

Key features include:

Orange is a comprehensive toolkit for data visualization, machine learning, and data mining, available as open-source software. It showcases a user-friendly visual programming interface that facilitates quick, exploratory, and qualitative data analysis along with dynamic data visualization. Tailored to be user-friendly for beginners while robust enough for experts, Orange democratizes data analysis, making it more accessible to everyone.

Key features of Orange include:

H2O is a scalable, open-source platform for machine learning and predictive analytics designed to operate in memory and across distributed systems. It enables the construction of machine learning models on vast datasets, along with straightforward deployment of those models within an enterprise setting. While H2O's foundational codebase is Java, it offers accessibility through APIs in Python, R, and Scala, catering to various developers and data scientists.

Key features include:

Zoho Analytics offers a user-friendly BI and data analytics platform that empowers you to craft visually stunning data visualizations and comprehensive dashboards quickly. Tailored for businesses big and small, it simplifies the process of data analysis, allowing users to effortlessly generate reports and dashboards.

Key features include:

The demand for data professionals who know how to mine data is on the rise. On the one hand, there is an abundance of job opportunities and, on the other, a severe talent shortage. To make the most of this situation, gain the right skills, and get certified by an industry-recognized institution like Simplilearn.

Simplilearn, the leading online bootcamp and certification course provider, has partnered with Caltech and IBM to bring you the Post Graduate Program In Data Science, designed to transform you into a data scientist in just twelve months.

Ranked number-one by the Economic Times, Simplilearn's Data Science Program covers in great detail the most in-demand skills related to data mining and data analytics, such as machine learning algorithms, data visualization, NLP concepts, Tableau, R, and Python, via interactive learning models, hands-on training, and industry projects.


Ethiopia to start mining Bitcoin through new data mining partnership – CryptoSlate

The Ethiopian government is set to begin mining Bitcoin through a new partnership with Data Center Service, a subsidiary of West Data Group, according to Ethiopia-based Hashlabs Mining CEO Kal Kassa.

The partnership was announced by the country's sovereign wealth fund, Ethiopian Investment Holdings (EIH), on Feb. 15.

Under the collaboration, the sovereign wealth fund will invest $250 million in establishing cutting-edge infrastructure for data mining and artificial intelligence (AI) training operations in Ethiopia.

Kassa said the deal includes setting up Bitcoin mining operations using Canaan Avalon miners and is part of the country's broader strategy to leverage its technological and energy resources to attract international investment and foster economic growth.

However, the government has yet to confirm the news officially. EIH did not respond to a request for comment as of press time.

The news comes amid a spike in miner activity due to the impending halving, which is less than 65 days away and set to reduce mining rewards by 50%. Many miners have already begun expansion efforts to position themselves appropriately.

The venture is not without its challenges and controversies, particularly concerning the energy-intensive nature of Bitcoin mining.

There's an ongoing debate about the impact of such operations on local electricity supply, especially in a country where energy access remains a pressing issue for a significant portion of the population.

Despite these concerns, the Ethiopian government's move towards regulating cryptographic products, including mining, reflects a cautious yet optimistic approach to embracing the potential economic benefits of Bitcoin mining.

This regulatory framework aims to ensure that the sector's growth does not come at the expense of the country's energy security or environmental commitments.

The new rules have paved the way for mining companies to set up shop in the country. Recent media reports revealed a significant increase in Chinese miners moving to the country as part of the BRICS movement.

There has been a notable influx of Chinese miners in Ethiopia over the past few months, drawn by the country's strategic initiatives and favorable conditions.

The trend is part of a larger movement that has seen Chinese Bitcoin mining operations relocate in response to regulatory pressures at home and the search for cost-effective, regulatory-friendly environments abroad.

Ethiopia's low electricity costs, primarily due to the Grand Ethiopian Renaissance Dam, represent a primary lure for Chinese miners. This factor, coupled with the Ethiopian government's openness to technological investments and its efforts to foster a conducive environment for high-performance computing and data mining, has made the country an attractive destination for these operations.

The dam's role in providing affordable, renewable energy aligns with the miners' needs for sustainable and economically viable power sources for their energy-intensive operations.

The arrival of Chinese miners is underpinned by broader geopolitical and economic considerations. China's increasing involvement in Ethiopia, characterized by significant investments across various sectors, has established a solid foundation for such ventures.

The relationship is further reinforced by Ethiopia's strategic importance to China as a partner in Africa, offering Chinese companies a hospitable environment for expanding their operations, including Bitcoin mining.


Ethiopia to start mining Bitcoin through new data mining partnership - CryptoSlate

Ethiopia Embarks on a $250M Data and AI Venture with Hong Kong Firm – TradingView

Key points:

The Ethiopian government, through its investment arm, Ethiopian Investment Holdings, has signed a Memorandum of Understanding with Data Center Service, a subsidiary branch of the Hong Kong-based West Data Group. This partnership, valued at $250 million, aims to pioneer sophisticated data mining and artificial intelligence (AI) training facilities within Ethiopia.

State-owned Ethiopian Investment Holdings has signed a Memorandum of Understanding with Data Center Service, a subsidiary of Hong Kong's West Data Group. They will cooperate on a $250-million project to establish cutting-edge infrastructure for bitcoin mining and AI training.

Kal Kassa, the CEO of Hashlabs Mining, revealed in an X post that through this joint venture, the Ethiopian government will delve into bitcoin mining operations. Hashlabs Mining highlights the country's openness to mining activities since 2022, despite its stance against cryptocurrency trading.

The initiative seems to gain further complexity with the Ethiopian government's experimental sandbox for cryptographic products licensing, per a Bloomberg report dated February 7.

Additionally, Ethiopia, benefiting from low electricity rates thanks to the partially operational Grand Ethiopian Renaissance Dam, faces a dilemma. The nation boasts the world's second-lowest electricity prices yet struggles to provide consistent electricity access to half its population. This disparity fuels the debate on the prioritization of resources in the country.

As per another report, the presence of 21 crypto miners in Ethiopia, predominantly Chinese, underscores the global interest in Ethiopia's potential as a mining hub. This interest persists despite the crypto trading and mining ban in their home country, China.

Ethiopia's government has also engaged with the crypto mining community, supported by entities like Project Mano and BitcoinBirr, coupled with its collaboration with Cardano blockchain's IOHK to revamp its education system.

West Data Group, known for its blockchain-fueled fintech solutions and data centers globally, brings to the table its expertise in Bitcoin mining, digital currency investment, and trading. Established in 2017 with its first data center in Kentucky, the company has expanded its footprint to Texas, Kazakhstan, Angola, and Kenya, signaling a robust commitment to digital currency endeavors.

The rest is here:

Ethiopia Embarks on a $250M Data and AI Venture with Hong Kong Firm - TradingView

AI Data Mining Cloak and Dagger. Nightshade AI Poisoning and Anti-Theft | by Aleia Knight | Feb, 2024 – Medium

Probably the biggest use of AI, commercially, has been for art. Models like DALL-E or Midjourney create anything from fantasy landscapes of modern people lounging with dragons to making a 1:1 recreation of the Mona Lisa. The biggest pushback for these models came from artists who, while making their creations public, did not consent to have their creations' data mined for AI training models. Oftentimes, I see people having an AI model take art specifically from a certain artist and then having it create a commission, rather than paying the artist themself to make it.

impersonating real people online with bot accounts, text generation, and image generation.

The deepfake situation alone has escalated to the point that it has reached the desks of White House representatives. A big push for this was the recent Taylor Swift situation, in which a user was using AI to scrape images of her from around the internet and create nude images of her that she never took, without her consent. Imagine, if this can happen at a realistic scale with a celebrity, what impact it could have on a social and political level, especially in terms of image, trust, and information exchange.

Even more so at the beginning of 2024, when a video was released of a fake robocall from President Joe Biden urging the voters of New Hampshire not to vote.

See more here:

AI Data Mining Cloak and Dagger. Nightshade AI Poisoning and Anti-Theft | by Aleia Knight | Feb, 2024 - Medium

Association between biochemical and hematologic factors with COVID-19 using data mining methods – BMC Infectious … – BMC Infectious Diseases

A total of 13,170 participants were recruited (n=5780 people infected with SARS-COV-2 (case) and n=7390 individuals without SARS-COV-2 (control)). Based on Table 1, participants with SARS-COV-2 were significantly older than the control group (59.29 ± 8.54 versus 56.97 ± 9.03 years, respectively). In addition, BMI, diastolic blood pressure (DBP), systolic blood pressure (SBP), blood urea nitrogen (BUN), sex, smoking status, serum zinc, copper, creatinine (Cr), cholesterol, triglyceride, high sensitivity C-Reactive Protein (hs-CRP), fasting blood glucose (FBG), serum phosphorus, low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), serum gamma glutamyl transferase (Gamma-GT), creatine phosphokinase (CPK), serum calcium, serum total bilirubin, serum direct bilirubin, aspartate aminotransferase (AST), alanine transaminase (ALT), alkaline phosphatase (ALP), serum uric acid and magnesium showed significant differences between groups. Several hematological factors, namely white blood cells (WBC), red blood cells (RBC), hemoglobin, hematocrit, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (RDW), platelet distribution width (PDW), and mean platelet volume (MPV), were higher compared to the control group (P-value<0.05).

We used the LR, DT, and BF models to diagnose COVID-19 in tested participants from their biochemical and hematologic features. In this regard, the data were randomly divided into training and test data (80%-20%). The models were built on the training dataset (80%) and validated on the test data (20%). Results of the LR algorithm illustrated that biochemical factors (Model I), such as age, smoking status, sex, DBP, SBP, BUN, BMI, hs-CRP, FBG, HDL-C, AST, ALT, CPK, total bilirubin, iron, magnesium, and Gamma-GT were correlated with COVID-19 status (P-value<0.05). In Model I, BMI, BUN, and age were identified by the LR algorithm as the most crucial variables, with high ORs. With a one-unit increase in BMI, the odds of being Cov+ were multiplied by 1.092. With a one-year increase in age, the odds of being Cov+ were multiplied by 1.048, and with a one-unit increase in BUN, the odds of being Cov+ were multiplied by 1.041 (see Table 2). In Model II, BMI, age, hemoglobin, hematocrit, sex, MPV, smoking status, and MCHC were significant (P-value<0.05). Hemoglobin had an OR of 4.292, so the odds of being Cov+ were multiplied by 4.292. MPV had an OR of 1.550, so the odds of being Cov+ were multiplied by 1.550. Table 3 shows the other variables and effect values. In Model III, CPK, BMI, MPV, FBG, sex, BUN, Cr, iron, magnesium, total bilirubin, hemoglobin, hematocrit, MCHC, smoking status, age, WBC, HDL-C, and ALT were correlated with COVID-19 status (P-value<0.05). Total bilirubin and MPV had ORs of 1.647 and 1.447, so the odds of being Cov+ were multiplied by 1.647 and 1.447, respectively (see Table 4). Based on Table 5, the accuracies of the three LR models (Model I, II, and III) were 75.13%, 68.28%, and 69.63%, respectively. The other performance indices are given in Table 5 (a), (d), and (g).
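To make the odds-ratio figures easier to read, here is a minimal Python sketch of how a per-unit OR scales over several units and how it relates to a logistic-regression coefficient. It uses only the three Model I ORs quoted above; the helper names, the 5-unit BMI difference, and the 0.30 baseline probability are illustrative assumptions, not values from the study. As with any LR odds ratio, the interpretation assumes the other covariates are held fixed.

```python
import math

# Per-unit odds ratios reported for Model I in the text.
ODDS_RATIOS = {"BMI": 1.092, "age": 1.048, "BUN": 1.041}


def odds_multiplier(odds_ratio: float, delta: float) -> float:
    """Factor by which the odds change when the predictor rises by `delta` units."""
    return odds_ratio ** delta


def coefficient(odds_ratio: float) -> float:
    """Logistic-regression coefficient (log-odds per unit) implied by an odds ratio."""
    return math.log(odds_ratio)


def shifted_probability(baseline_prob: float, multiplier: float) -> float:
    """Apply an odds multiplier to a baseline probability and return the new probability."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * multiplier
    return new_odds / (1.0 + new_odds)


if __name__ == "__main__":
    # Hypothetical scenario: BMI higher by 5 units, assumed baseline probability 0.30.
    m = odds_multiplier(ODDS_RATIOS["BMI"], 5)
    print(f"odds multiplier for +5 BMI units: {m:.3f}")
    print(f"probability after shift: {shifted_probability(0.30, m):.3f}")
```

The multiplier applies to the odds, not to the probability itself, which is why the sketch converts to odds and back rather than multiplying the probability directly.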

In the training phase of DT, the important variables were selected and the final tree was obtained after pruning. Models I, II, and III were run with 17, 8, and 18 variables as input, respectively. In Model I, CPK, age, BUN, BMI, ALP, sex, total bilirubin, hs-CRP, FBG, and Gamma-GT; in Model II, age, MPV, sex, BMI, hemoglobin, and MCHC; and in Model III, CPK, Cr, BUN, BMI, FBG, age, MPV, MCHC, sex, and total bilirubin remained in the models. Based on Table 5, the trees built on the biochemical variables, the hematologic variables, and both sets of variables (Model I, Model II, and Model III, respectively) had 73.24%, 70.53%, and 68.80% accuracy on the training data, respectively. The other performance indices are given in Table 5 (b), (e), and (h).

The rules from the DTs for Models I, II, and III are shown in Table 6. Rule 1 in Model I illustrated that in a subgroup with CPK>=114.09 & BUN>=30.00 & BMI>=26.77 & Age>=54.00 & Gamma-GT>=16.91, the probability of being Cov+ was 84.69%. In another subgroup, CPK<114.09 & CPK<88.06 & Sex(female) & ALT<9.00 led to a 6.57% chance of being Cov+. The rules from Model II illustrated that there was an 86.46% chance that participants with features such as Age>=54.00 & BMI>=26.77 & MPV>=9.60 & Sex(male) & Hemoglobin<15.8 were infected with COVID-19. Another rule suggested that the probability of Cov+ in individuals with Age<54.00 & MPV<9.10 was 12.26%. The rules from Model III illustrated that there was an 88.15% chance that participants with features such as CPK>=114.09 & BUN>=30.00 & BMI>=26.77 & Age>=54.00 & MPV>=9.60 & MCHC<35.6 were infected with COVID-19. Another rule suggested that the probability of Cov+ in individuals with CPK<114.09 & Cr<1.40 & Cr<1.00 & FBG<118.34 & Sex(female) was 9.90%. Other rules are stated in Table 6.
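A decision-tree rule of this kind is just a conjunction of threshold tests. The Python sketch below encodes Rule 1 of Model I as a predicate, using the thresholds and the 84.69% figure quoted above. The `Participant` class, its field names, and the example values are hypothetical and only serve to illustrate the idea; the study's own software and data layout are not described in this excerpt.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical record holding only the fields Rule 1 of Model I tests."""
    cpk: float
    bun: float
    bmi: float
    age: float
    gamma_gt: float


# Share of Cov+ participants reported for the subgroup matched by Rule 1.
RULE1_COV_POSITIVE_RATE = 0.8469


def matches_model1_rule1(p: Participant) -> bool:
    """True if the participant falls in the subgroup described by Rule 1 of Model I."""
    return (
        p.cpk >= 114.09
        and p.bun >= 30.00
        and p.bmi >= 26.77
        and p.age >= 54.00
        and p.gamma_gt >= 16.91
    )


# Made-up example values, chosen only to exercise both branches.
inside = Participant(cpk=120.0, bun=31.0, bmi=28.0, age=60.0, gamma_gt=20.0)
outside = Participant(cpk=100.0, bun=31.0, bmi=28.0, age=60.0, gamma_gt=20.0)

if __name__ == "__main__":
    for person in (inside, outside):
        if matches_model1_rule1(person):
            print(f"Rule 1 applies: reported Cov+ rate {RULE1_COV_POSITIVE_RATE:.2%}")
        else:
            print("Rule 1 does not apply; another leaf of the tree would be used")
```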

Hence, CPK and BUN for Model I; age, BMI, and MPV for Model II; and CPK and BUN for Model III were identified as the most crucial variables. The final DTs are shown in Figs. 2, 3, and 4.

Graphical representation of the classification tree introduced for SARS-COV-2 diagnosis for Model I

Graphical representation of the classification tree introduced for SARS-COV-2 diagnosis for Model II

Graphical representation of the classification tree introduced for SARS-COV-2 diagnosis for Model III

In the final step, we applied BF to analyze the data with respect to COVID-19 status. The BF algorithm included 17, 8, and 18 variables for Models I, II, and III, respectively. Moreover, we set the following specifications: Number of Trees in the Forest: 29 for Model I, 13 for Model II, and 53 for Model III; Number of Terms Sampled per Split: 4 for Model I, 2 for Model II, and 4 for Model III; and, for all three models, Training Rows: 10,536, Test Rows: 2634, Minimum Splits per Tree: 10, and Minimum Size Split: 13. The confusion matrices and evaluation indices for comparing Models I, II, and III are stated in Table 5 (c), (f), and (i). Additionally, the crucial variables related to COVID-19 based on the BF algorithm were: CPK, BUN, FBG, BMI, total bilirubin, and age in Model I; BMI, sex, MPV, and age in Model II; and CPK, Cr, FBG, BMI, BUN, total bilirubin, sex, MPV, and age in Model III. As one can check, the features obtained from the BF algorithm matched the factors obtained from the LR and DT algorithms.
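The training and test row counts follow directly from the 80%-20% split of the 13,170 participants. The short Python sketch below reproduces that arithmetic with a random index split. The function name and the fixed seed are assumptions for illustration; the excerpt does not say how the authors performed the randomization.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and cut them into training and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed is an assumption, for repeatability
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


# 13,170 participants in total, as reported in the text.
train_idx, test_idx = train_test_split_indices(13170)

if __name__ == "__main__":
    print(len(train_idx), len(test_idx))
```

Splitting by index keeps the two sets disjoint and lets the same split be reused across all three models.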

Read more here:

Association between biochemical and hematologic factors with COVID-19 using data mining methods - BMC Infectious ... - BMC Infectious Diseases

Google To Block Location Data Mining In Maps 12/18/2023 – MediaPost Communications

Privacy concerns and the potential for geofence warrants have prompted Google to work on storing Maps user location-history data on the device rather than in the cloud. This is a change that will make it more difficult for anyone, including law enforcement, to access the data.

Google has faced pressure for years to change the way it stores user location history. The update to Maps will roll out during the next year on iOS and Android. The company announced the changes in a blog post this week.

The feature holding the data is called Timeline, which tracks all the places visited during a specific period of time. It originally launched during the summer of 2015.

The idea seemed interesting at the time, especially for Google. It allowed people to revisit the places they had visited in a tab on Google Maps.

The feature must be turned on manually, and is off by default. Users can delete all or part of the information at any time or disable the setting entirely.

Marlo McGriff, director of product at Google Maps, and the author of the post, wrote that users will receive a notification when the update applies to their account.

The change comes several months after a Bloomberg Businessweek investigation found police increasingly used warrants to obtain search and location data. This practice has been going on for many years. It just took a search warrant and lots of waiting for Google, Meta and other platforms with location information to release the data to police.

Google also plans to change its auto-delete setting, which was previously set to 18 months by default. The update resets auto-delete to three months by default.

Keeping the location data when upgrading to a new phone will require the user to save the data locally and then back it up to the cloud. Google will automatically encrypt it.

Deleting activity such as searches, directions, visits, and shares will become easier with a few taps. The delete feature will roll out on Android and iOS in the coming weeks.

Privacy advocates are also concerned about something called a reverse keyword search warrant, where police can ask a technology company to provide data on the people who have searched for a given term. Jennifer Lynch, the general counsel at the nonprofit Electronic Frontier Foundation, told Time magazine: "Search queries can be extremely sensitive, even if you're just searching for an address."

Original post:

Google To Block Location Data Mining In Maps 12/18/2023 - MediaPost Communications

WEKA is Outdated: Here are the Best Data Mining Tools for 2024 – Analytics Insight

In the dynamic landscape of data mining, staying ahead of the curve is crucial for extracting meaningful insights efficiently. While WEKA has been a stalwart in the field, 2024 heralds the arrival of more advanced and versatile data mining tools. This article explores the evolving data mining terrain and presents a curated list of tools that outshine WEKA in the current technological landscape.

RapidMiner stands tall among data scientists for its user-friendly interface and powerful capabilities. With an extensive library of pre-built templates, it simplifies complex data mining tasks, making it a go-to choice for both beginners and experts.

KNIME, an open-source platform, has gained popularity for its flexibility and adaptability. With a modular workflow design, it enables seamless integration of various data mining components, offering a collaborative environment for data scientists and analysts.

Orange is celebrated for its visual programming interface, making it an ideal choice for those who prefer a graphical approach to data mining. With an array of visualizations, it allows users to comprehend complex patterns and relationships effortlessly.

SAS Enterprise Miner empowers organizations with robust data mining capabilities. Known for its advanced analytics and machine learning algorithms, it is a comprehensive tool for businesses seeking in-depth insights from their data.

While TensorFlow is renowned for its prowess in machine learning, its data mining capabilities have made significant strides. Widely adopted by developers and data scientists, TensorFlow offers a scalable and efficient platform for mining valuable patterns from vast datasets.

In the era of big data, Apache Spark MLlib stands out as a data mining tool capable of handling massive datasets with ease. Leveraging the power of Spark, it enables distributed data mining, making it a robust choice for organizations dealing with large-scale data.

Bringing data mining to the cloud, Microsoft Azure Machine Learning provides a scalable and efficient platform for extracting insights. With seamless integration with other Azure services, it simplifies the end-to-end data mining process.

For Python enthusiasts, Scikit-Learn remains a top choice. Its simplicity and integration with popular Python libraries make it an accessible yet powerful tool for data mining tasks.

As we bid farewell to the era where WEKA reigned supreme, the data mining landscape in 2024 is brimming with innovative and powerful tools. Whether you prioritize user-friendliness, open-source flexibility, or cloud-powered scalability, the tools mentioned above offer a diverse range of options to cater to your data mining needs. Embrace the future of data mining with these cutting-edge tools and unlock the full potential of your datasets.

Read the original here:

WEKA is Outdated: Here are the Best Data Mining Tools for 2024 - Analytics Insight

The Role of Data in Process Mining –

Imagine running a business is like tending to a garden. Everyone talks about being eco-friendly, but Janina Bauer from Celonis says it is not just talk. In fact, it is a big part of every decision.

Janina Bauer, Global Head of Sustainability at Celonis, is like a super-smart gardener. Celonis uses something called process mining to help businesses run smoothly. It is like shining a light on how things move around in your garden and simultaneously finding better ways to do things.

Now, the problem is, some businesses see being eco-friendly as too expensive. Plus, their information is all over the place, like seeds scattered in different pots. Celonis helps gather all this info and makes it easier for businesses to be green.

Why should businesses care? Well, Janina says successful ones are both green and make money. In today's world, where people want eco-friendly choices, being green is not just a department, but it is part of everything.

Janina, who loves green ideas, makes sure Celonis practices what it preaches. The company is working hard to become eco-friendly and sets targets like a gardener aiming for perfect blooms.

In simple terms, being eco-friendly is not a headache. It is a chance to grow. As Janina says, being green and making money go together like flowers and sunshine. So, for businesses to be ready for the future, it is time to make being green a big part of the journey, not just a goal on paper.

And when businesses embrace being green, it is not just good for the planet. It is like giving their garden a boost, making it healthier and more beautiful. So, let us all be gardeners of the business world, making it bloom with green success.

Read more here:

The Role of Data in Process Mining -