Unstructured data analysis by SAS Institute

The statistician Christopher Broxe at SAS Institute worked for ten years with structured data before he took an interest in unstructured data. He finds his raw data in discussion forums, on sites where consumers rate products and services, in blogs, and in online newspapers.

Using a tool from SAS Institute called TextAnalytics, he evaluates unstructured text statements – about, for example, hotel experiences – and turns them into statistics.
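
SAS's product is of course far more sophisticated, but the basic idea – reducing free-text statements to countable categories – can be illustrated with a few lines of Python (the word lists and the reviews below are invented for illustration):

    from collections import Counter

    # Toy word lists - real text analytics uses trained language
    # models, not simple lookups. The reviews are invented examples.
    POSITIVE = {"clean", "friendly", "great", "quiet"}
    NEGATIVE = {"dirty", "rude", "noisy", "broken"}

    reviews = [
        "Great location, friendly staff, but the room was noisy.",
        "Dirty bathroom and a broken TV. Staff were rude.",
    ]

    counts = Counter()
    for review in reviews:
        for word in review.lower().replace(",", " ").replace(".", " ").split():
            if word in POSITIVE:
                counts["positive"] += 1
            elif word in NEGATIVE:
                counts["negative"] += 1

    print(counts)  # Counter({'negative': 4, 'positive': 2})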

The end result can be displayed as a visualization, in this case a treemap, which is described in the following award-winning way by the Swedish computer industry daily “Computer Sweden”: “The result of the sifting process is displayed as a ‘heatmap’, a color-coded rectangle which looks almost like an aerial photo of the crop fields in the Skåne region, but with colors of your own choice.”

The author of the article (Anders Lotsson) also notes that the software from SAS Institute is not meant for consumers, as the price clearly indicates; it also requires software-development skills.

http://computersweden.idg.se/2.2683/1.312915/mer-an-tusen-ord

(April 25, 2010)

An illustration from the article. The breadcrumb legend reads: “Purpose of travel > Age group > Co-traveller > City > Hotel name”

What it takes to be a good analyst

Steve Miller, co-founder and president of OpenBI (www.OpenBI.com), wrote a blog post for Information-management.com on April 19, 2010, titled “BI, Analytics and Statistical Science”. He writes: “I think the list provides a foundation of what it takes to succeed in BI” (Business Intelligence). Steve holds degrees in Quantitative Methods and Statistics from Johns Hopkins, the University of Wisconsin, and the University of Illinois, and he acknowledges that his list is statistics/research-centric. In my view, that is not a disadvantage.

I think the list is also highly relevant for somebody working in an analyst function within, for example, competitive intelligence or governmental open-source intelligence analysis:

General Skills

  • Strong interpersonal and communication skills,
  • Customer-facing personality,
  • Ability to work productively as an individual or in collaboration with others,
  • Ability to write/communicate clearly, accurately, and effectively,
  • Ability to think analytically,
  • Data centricity – obsession with evidence-based problem resolution,
  • An understanding of the scientific method – theory, hypotheses, testing and learning,
  • Ability to use the scientific method to conceptualize business problems,
  • Orientation to business and one or more business processes – either vertical or horizontal,
  • Commitment to life-long learning.

Technical Skills

  • Intermediate programming and computation skills,
  • Facility with logical and physical relational databases (SQL),
  • Understanding of the economic approach – “the allocation of scarce means to satisfy competing ends” – to problem solving,
  • Facility with standard statistical/BI packages to perform analytic calculations,
  • Ability to interpret the results obtained from these packages,
  • Facility with a variety of graphical/visualization techniques for exploring and presenting analytic data,
  • Understanding of the principles of management, accounting, finance and marketing,
  • Understanding of the meaning of business optimization,
  • Ability to recognize the nature of, and to model, the random variation underlying given business data,
  • Understanding the nature of statistical inference – its scope, limitations and proper role in the process of business analytical investigation,
  • Ability to express a generally-posed business problem in a statistical context; ability to translate business concepts for measurement,
  • Understanding how to obtain a suitable sample from a population and how to make inferences from that sample,
  • Understanding of experimental and quasi-experimental designs for BI,
  • Ability to provide advice on the design of business analytic investigations,
  • Understanding of a variety of commonly-used analytic techniques and the models underlying them,
  • Conversance with the mathematical underpinnings of often-used analytics techniques to facilitate simple modifications in appropriate situations,
  • Understanding of alternatives to traditional statistical modeling from computer and mathematical sciences,
  • Comfort with Internet research,
  • Obsession to stay current with the latest analytic methods/techniques.

Read the full post here:
BI, Analytics and Statistical Science

Innumeracy – your employees can’t do math

In his book Innumeracy: Mathematical Illiteracy and Its Consequences from 1990, John Allen Paulos writes about the common inability among people – even in important positions – to do simple math. While society looks upon illiteracy as a big problem, and an inability to spell correctly is shameful for the individual, nobody seems to be troubled by innumeracy. For example, nobody says “corporation with a C or Korporation with a K, I don’t care how you spell it in the report as long as you have it done in time”. In contrast, quotes similar to the following are not unheard of: “A billion or a trillion, I don’t care how many of them you have detected, just file the report in time”. Just ask yourself – are you fully aware of the difference between a “billion” and a “trillion”? (In American usage, a billion is 10^9 while a trillion is 10^12 – a thousandfold difference; in many European languages a “billion” denotes 10^12, which makes the confusion even worse.) If you are not, make sure you become so.

James Taylor writes about exactly this on SmartDataCollective.com (a Teradata community site) on April 4, 2010, in a post called Don’t rely on your staff’s ability to do math:

I often tell folks that one of the benefits of decision management is that it enables analytic decision making – that is decisions based on accurate analysis of data about what works and what does not – even by people who don’t have any analytic skill.[…] And this is important because most people don’t have these skills! Presenting them with data and expecting them to accurately use it is just not reasonable. […] Please, embed the analytics, don’t rely on your staff’s ability to do math.

http://smartdatacollective.com/Home/25961

Directors should have access to crucial competitive intelligence

says Leonard M. Fuld, founder of Fuld & Company, in a brief and very much to-the-point article from October 2006 that is worth reading. A PDF is available for download here: http://www.academyci.com/ResourceCenter/Intellectual_Capital_Directorship_10-06.pdf

Fuld makes five very good points, which he explains further with concrete examples in his article:

1. Knowledge comes in many forms. Directors require the right kind of data, not over-worked, overproduced data.

2. People have blind spots. Board members, just like their executive counterparts, need to constantly challenge their company’s assumptions.

3. War games can outsmart the competition. Directors need to use intelligence strategically, forcing themselves to examine the options a company realistically has arrayed before it.

4. On the Internet, things may not be what they seem. Board members cannot walk into meetings after conducting amateur Internet searches and expect to explain a situation or a product.

5. Intelligence must be planned for. Boards need their own monitoring services. They require simple mechanisms whereby they can ask questions.

Predicting the Future With Social Media

Sitaram Asur and Bernardo A. Huberman at the Social Computing Lab at HP Labs in Palo Alto, California, have demonstrated how social media content can be used to predict real-world outcomes. They used content from Twitter.com to forecast box-office revenues for movies. With a simple model built from the rate at which tweets are created about particular topics, they outperformed market-based predictors. They extracted 2.89 million tweets referring to 24 different movies released over a period of three months. According to the researchers’ prediction, the movie ”The Crazies” was going to generate 16.8 million dollars in ticket sales during its first weekend. The true number turned out to be very close – 16.06 million dollars. The drama ”Dear John” generated 30.46 million dollars worth of tickets sold, compared to a prediction of 30.71 million dollars.
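
The paper’s actual model involves more variables, but the core idea – a simple linear fit of revenue against tweet rate – can be sketched in a few lines of Python (all the numbers below are invented for illustration):

    import numpy as np

    # Invented training data: average tweets per hour about each movie in
    # the week before release, and its opening-weekend revenue (million USD).
    tweet_rate = np.array([120.0, 340.0, 560.0, 980.0, 1500.0])
    revenue = np.array([4.2, 11.0, 18.5, 31.0, 48.0])

    # Ordinary least squares fit: revenue ~ a * tweet_rate + b
    a, b = np.polyfit(tweet_rate, revenue, 1)

    upcoming_rate = 800.0  # tweets per hour observed for an upcoming release
    print(f"Predicted opening weekend: ${a * upcoming_rate + b:.1f} million")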

Reported by the BBC: http://news.bbc.co.uk/2/hi/8612292.stm

Reported by SiliconValleyWatcher: http://www.siliconvalleywatcher.com/mt/archives/2010/04/twitter_study_i.php

The research report: http://www.hpl.hp.com/research/scl/papers/socialmedia/socialmedia.pdf

Previous related iOSINT posts:

https://iosint.wordpress.com/2010/03/29/ted-com-sean-gourley-on-the-mathematics-of-war/

https://iosint.wordpress.com/2010/03/17/social-media-intelligence-output/

Take a look at what they want to hide

Web site owners can block search engine spiders and indexing bots from including parts of the content under their domain in the search engine index. This is done by placing a text file named robots.txt in the root directory of the web site. The file contains instructions such as “Disallow:” followed by a subdirectory or a web page, telling the bots that this part of the site should not be included in the search engine index. As a result, that page, or the pages in that subdirectory, will not appear among the results of any search.
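
For illustration, a minimal (made-up) robots.txt might look like this:

    User-agent: *
    Disallow: /drafts/
    Disallow: /internal-report.html

    # Lines starting with # are comments
    User-agent: Googlebot
    Disallow: /not-for-google/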

Of course, none of these pages are truly protected or hidden – they are just not included in the search engines’ lists of “known web pages”. So, anyone knowing the exact web address will be able to browse to the page in question and view it.

In most cases, the web site owner is not really trying to prevent access to anything on his or her site. More likely, the purpose is to omit certain content from the search results, in order to give the more relevant content better visibility. Then, once a visitor is on the web site, everything published on the web site is available through the navigation menus and internal links.

However: there are cases where the site owner has published something to the web server which is not made part of the public web site, and he or she is trying to hide this content by blocking search engine spiders from indexing it. The problem is that robots.txt is a file that is always openly available to anyone – otherwise web spiders would not be able to read it. So, whatever anyone is trying to hide from search engines is listed in plain text, right there in robots.txt.

So if you are interested in finding out what some web site owner is hiding from search engines, and then pondering why that might be, just look for the robots.txt file and read it. The file can also contain interesting comments, providing clues to why certain content has been disallowed. If a robots.txt file is in use, it will be found in the root of the site, for example: http://www.google.com/robots.txt
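
If you prefer to automate the check, a short Python script can fetch a site’s robots.txt and print the disallowed paths along with any comments (google.com is used here only because its robots.txt is a well-known example):

    import urllib.request

    # Fetch a robots.txt and print the lines that reveal blocked content.
    url = "http://www.google.com/robots.txt"
    with urllib.request.urlopen(url) as response:
        robots = response.read().decode("utf-8", errors="replace")

    for line in robots.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("disallow:") or stripped.startswith("#"):
            print(line)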

If you want to google for robots.txt files in general, use this query in Google:

ext:txt inurl:robots

If you want to google for a robots.txt file on a particular domain, use this query in Google:

ext:txt inurl:robots site:yourselecteddomain.com

Here, for example, is the robots.txt file for Microsoft.com:
http://www.google.com/search?q=ext:txt+inurl:robots+site:www.microsoft.com

Apparently, Microsoft doesn’t want people searching to find the help pages for Macintosh owners who use Microsoft products…

Read more about robots.txt on Wikipedia: http://en.wikipedia.org/wiki/Robots.txt


Face-recognition added in Picasa 3.6 – great OSINT processing tool

In release 3.6 of Picasa, Google added the Name Tags functionality. This means they put face recognition logic into Picasa, and added a special tag for specifying person identity by name, in addition to the previously available metadata types Labels and Caption.

So what does this mean? Well, it means that anyone with a personal computer can build a searchable library of portrait photos. Picasa will automatically locate faces in the pictures, and build a library of cropped pictures showing faces only, one by one.

As you identify a face by adding a name to it, Picasa will automatically apply the same name tag to all other pictures where a face has been detected and where the software finds high enough resemblance. This means that as new photos are added later on, Picasa will automatically name tag them, provided that there are previously tagged face images that allow for a comparison and identification to be made.
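
Google has not published how Picasa’s matching works, but the general principle – comparing numeric face “signatures” and propagating a name tag when the similarity passes a threshold – can be sketched like this (the vectors and the threshold are entirely made up):

    import numpy as np

    # Made-up face "signatures" (in a real system: embeddings produced by
    # a face-recognition model); the threshold is equally hypothetical.
    tagged = {
        "Alice": np.array([0.9, 0.1, 0.3]),
        "Bob": np.array([0.2, 0.8, 0.5]),
    }
    unknown = np.array([0.85, 0.15, 0.35])

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    THRESHOLD = 0.95  # above: auto-tag; below: queue for manual confirmation

    scores = {name: cosine(unknown, vec) for name, vec in tagged.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= THRESHOLD:
        print(f"Auto-tag as {best} (similarity {scores[best]:.2f})")
    else:
        print(f"Suggest {best} for manual confirmation ({scores[best]:.2f})")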

When Picasa finds a possible but not certain match, the image is tentatively tagged with the name, and you can later confirm the suggestion with the green button or reject it with the red one.

When it comes to finding faces in pictures, the face-recognition algorithm in Picasa is more likely to produce false positives than to miss anything. Below are a couple of examples of such “false positives”, i.e. parts of images that Picasa suspected might be human faces, but are not. As you can see, the software is not likely to miss a face where there is one.

So what are the potential use cases? Well, let’s see if we can invent a few.

Use case 1
You are assigned the mission of collecting biographic information, including portrait photos, on industry specialists and key decision makers at a trade show. While you don’t really know much about who is who when the trade show starts, you can begin by taking massive amounts of photos of people and crowds, wherever you see them. Let Picasa index the photos and list the faces detected in the pictures. As a first step, tag the faces with some code or number indicating which face pictures belong to the same identity. As the trade show goes on, you will increasingly be able to connect names and faces. As you do, replace the dummy name-tag codes with real names. This way, you will not have wasted any opportunity to take pictures of people just because you didn’t know who they were from the start.

Use case 2
You have a large collection – 100,000 images – of digital photos of people that are not tagged or indexed in any way. Without being able to search for a person’s name and, as a result, see pictures of that person, the photo collection is of limited value. Manually evaluating, classifying, and tagging each photo – for each of the persons in each photo – is simply not feasible. But if Picasa does the job of framing the faces in the pictures and recognizing which faces are the same, the situation changes. You will definitely avoid the double and triple work caused by photo duplicates, and you may well find that many faces are automatically identified and name-tagged by Picasa once you have identified a few different pictures of the same person.

Unstructured text processing – software doing the job

An InformationWeek article on Tuesday, March 23, picked up a press release from Clarabridge announcing that Wendy’s will start using Clarabridge for automated processing of unstructured data:

The Clarabridge text analytics solution will be used to analyze nearly half a million text-based customer comments per year collected from Wendy’s Web-based feedback form, call center notes, e-mail messages, receipt-based surveys, and social media sources.

[…]

Over the last decade, text analytics has evolved from a rarified technology used almost exclusively by government intelligence agencies and high-end financial firms to something far more accessible.

[…]

Clarabridge and Attensity are among the leading best-of-breed text analytics vendors.

Read the full article here:
Wendy’s Taps Text Analytics To Mine Customer Feedback

Software that can successfully process unstructured text is an exciting area. The potential value of any investment in such systems increases day by day thanks to the growing use of social media (read more: Social Media: Marketing Input, Intelligence Output). 80% or more is a commonly cited figure for how large the portion of unstructured data is in relation to the total amount of data available globally. In plain English: most of it is free text, not tables.
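
To make “free text, not tables” concrete, here is a sketch of turning one unstructured comment into a structured record (the comment and the keyword rules are invented; real products like Clarabridge’s rely on far more sophisticated language models):

    import re

    # One invented customer comment; the keyword rules below are naive
    # stand-ins for a real text-analytics engine.
    comment = "Waited 20 minutes at the Springfield drive-thru and my order was wrong."

    wait = re.search(r"(\d+) minutes", comment)
    place = re.search(r"at the (\w+)", comment)

    record = {
        "source": "feedback form",
        "location": place.group(1) if place else None,
        "wait_time_min": int(wait.group(1)) if wait else None,
        "complaint": "order accuracy" if "wrong" in comment else "other",
    }
    print(record)
    # {'source': 'feedback form', 'location': 'Springfield',
    #  'wait_time_min': 20, 'complaint': 'order accuracy'}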