New algorithm searches historic documents to discover noteworthy people
BUFFALO, N.Y. - Old newspapers provide a window into our past, and a new algorithm co-developed by a University at Buffalo School of Management researcher is helping turn those historic documents into useful, searchable data.
Published in Decision Support Systems, the algorithm can find people's names and rank them in order of importance from the results produced by optical character recognition (OCR), the computerized method of converting scanned documents into text, which often produces messy output.
"It's a known fact that when OCR software is run, very often the text gets garbled," says Haimonti Dutta, PhD, assistant professor of management science and systems in the UB School of Management. "With old newspapers, books and magazines, problems can arise from poor ink quality, crumpled or torn paper, or even unusual page layouts the software isn't expecting."
To develop the algorithm, the researchers partnered with the New York Public Library (NYPL) and analyzed more than 14,000 articles from New York City newspaper The Sun published during November and December of 1894. The NYPL has scanned more than 200,000 newspaper pages as part of Chronicling America, an initiative of the National Endowment for the Humanities and the Library of Congress that is working to develop an online, searchable database of historical newspapers from 1777 to 1963.
Their algorithm ranks people's names by importance based on a number of attributes, including the context of the name, title before the name, article length and how frequently the name was mentioned in an article.
The algorithm learns these attributes only from the text; it does not rely on external sources of information such as Wikipedia or other knowledge bases. But since the OCR text is garbled, it can't determine how effective these attributes are for ranking people's names. So the researchers used statistical measures to model the many data attributes, which helped provide the desired ranking of names.
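As a rough illustration only, the attributes described above (a title before the name, article length and how often the name is mentioned) could be combined into a score along the following lines. This is a minimal sketch, not the researchers' method: the title list, the weights, the name pattern and the sample sentences are all invented for the example, and the published algorithm uses statistical modeling that this article does not detail.

```python
import re
from collections import Counter

# Hypothetical title list and weights, chosen only for illustration.
TITLES = ("Mr.", "Mrs.", "Dr.", "Gov.", "Mayor", "Judge", "Senator")
WEIGHTS = {"mentions": 1.0, "has_title": 2.0, "length": 0.01}

# Assumed name pattern: two capitalized words, optionally preceded by a title.
NAME_RE = re.compile(
    r"(?:(?P<title>" + "|".join(re.escape(t) for t in TITLES) + r")\s+)?"
    r"(?P<name>[A-Z][a-z]+\s+[A-Z][a-z]+)"
)


def rank_names(articles):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scores = Counter()
    for text in articles:
        n_words = len(text.split())  # article length in words
        mentions = Counter()         # how often each name appears in this article
        titled = set()               # names seen with a title in this article
        for match in NAME_RE.finditer(text):
            name = match.group("name")
            mentions[name] += 1
            if match.group("title"):
                titled.add(name)
        for name, count in mentions.items():
            scores[name] += (
                WEIGHTS["mentions"] * count
                + WEIGHTS["has_title"] * (name in titled)
                + WEIGHTS["length"] * n_words
            )
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented sample text, not taken from The Sun.
sample_articles = [
    "Mayor Alice Brown spoke today. Alice Brown thanked the crowd.",
    "A visitor named John Doe left early.",
]
for name, score in rank_names(sample_articles):
    print(name, round(score, 2))
```

Everything here is read from the article text itself, in line with the description that no external knowledge base is consulted. A pattern this simple would miss names that the OCR step has garbled, which is the difficulty the statistical modeling is meant to address.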
The researchers used two sets of the historic articles to test their algorithm. One set was the raw text produced by the OCR software; the other had been cleaned up manually by New York City schoolchildren, who are using the articles to write biographies of local, notable people of the time.
When compared to the cleaned-up versions of the stories, the ranking algorithm is able to sort people's names with a high degree of precision even from the noisy OCR text.
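One common way to quantify this kind of comparison is precision at k: the share of the top k names ranked from the noisy text that also appear in the top k ranked from the cleaned-up text. The sketch below assumes that measure for illustration; the article does not say which precision measure the study used, and the names are invented.

```python
def precision_at_k(predicted, reference, k):
    """Fraction of the top-k predicted names that are in the top-k reference names."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(reference[:k])
    hits = sum(1 for name in predicted[:k] if name in relevant)
    return hits / k


# Invented rankings: one from raw OCR text, one from manually cleaned text.
ranking_from_ocr = ["Alice Brown", "John Doe", "Mary Major"]
ranking_from_clean = ["Alice Brown", "Mary Major", "John Doe"]

print(precision_at_k(ranking_from_ocr, ranking_from_clean, 2))
```

A value close to 1 would mean the ranking built from noisy OCR text surfaces nearly the same top names as the ranking built from the cleaned-up text.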
Dutta says their process has wide-reaching implications for discovering important people throughout history.
"We recently used this technique on African American literature from the Civil War to learn more about the important people during the era of slavery," says Dutta. "Going forward, we'll be expanding the technique to examine relationships between people and build out the social networks of the past."
Dutta collaborated on the study with Aayushee Gupta, PhD, research scholar at the International Institute of Information Technology Bangalore Department of Computer Science.
This press release was produced by the University at Buffalo - News Center. The views expressed here are the author's own.