Predicting the Future With Social Media

Sitaram Asur and Bernardo A. Huberman at the Social Computing Lab at HP Labs in Palo Alto, California, have demonstrated how social media content can be used to predict real-world outcomes. They used content from Twitter.com to forecast box-office revenues for movies. With a simple model built from the rate at which tweets are created about particular topics, they outperformed market-based predictors. They extracted 2.89 million tweets referring to 24 different movies released over a period of three months. According to the researchers’ prediction, the movie “The Crazies” was going to generate 16.8 million dollars in ticket sales during its first weekend. The actual figure turned out to be very close: 16.06 million dollars. The drama “Dear John” sold tickets worth 30.46 million dollars, compared to a prediction of 30.71 million dollars.
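The underlying approach is simple enough to sketch: treat the tweet rate before release as a predictor and fit a linear model against opening-weekend revenue. Below is a minimal illustration of that kind of model; the numbers, variable names and the predicted movie are invented for illustration and are not data from the HP Labs study.

    # Minimal sketch of a tweet-rate regression model. All figures below are
    # invented for illustration; they are not from the Asur & Huberman paper.
    import numpy as np

    # Average tweets per hour about each movie in the week before release
    tweet_rate = np.array([120.0, 310.0, 95.0, 540.0, 210.0])
    # Opening-weekend box office for the same movies, in millions of dollars
    revenue = np.array([8.2, 21.5, 6.9, 38.0, 15.1])

    # Fit revenue ~= a * tweet_rate + b
    a, b = np.polyfit(tweet_rate, revenue, deg=1)

    # Predict the opening weekend for a new movie from its tweet rate
    new_rate = 250.0
    print(f"Predicted opening weekend: {a * new_rate + b:.1f} million dollars")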

Reported by the BBC: http://news.bbc.co.uk/2/hi/8612292.stm

Reported by SiliconValleyWatcher: http://www.siliconvalleywatcher.com/mt/archives/2010/04/twitter_study_i.php

The research report: http://www.hpl.hp.com/research/scl/papers/socialmedia/socialmedia.pdf

Previous related iOSINT posts:

https://iosint.wordpress.com/2010/03/29/ted-com-sean-gourley-on-the-mathematics-of-war/

https://iosint.wordpress.com/2010/03/17/social-media-intelligence-output/

Take a look at what they want to hide

Web site owners can block search engine web spiders and indexing bots from including parts of the content under their domain in the search engine index. That is done by placing a text file named robots.txt in the root directory of the web site. The file contains instructions such as “Disallow:” followed by a subdirectory or a web page, telling the bots that this part of the site should not be included in the search engine index. As a result, that page or the pages in that subdirectory will not appear among the results of any search.
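As an illustration, a robots.txt file might look like the hypothetical example below; the paths are invented, but the directive syntax is the standard one.

    User-agent: *
    Disallow: /internal-reports/
    Disallow: /drafts/preview.html

The first line says that the rules apply to all bots; the two Disallow lines ask them to stay out of everything under /internal-reports/ and to skip the single page /drafts/preview.html.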

Of course, none of these pages are truly protected or hidden – they are just not included in the search engines’ lists of “known web pages”. So, anyone knowing the exact web address will be able to browse to the page in question and view it.

In most cases, the web site owner is not really trying to prevent access to anything on his or her site. More likely, the purpose is to omit certain content from the search results in order to give the more relevant content better visibility. Once a visitor is on the web site, everything published there is still available through the navigation menus and internal links.

However, there are cases where the site owner has published something to the web server that is not made part of the public web site, and he or she is trying to hide this content by blocking search engine spiders from indexing it. The problem is that robots.txt itself is always openly available to anyone – otherwise web spiders would not be able to read it. So whatever anyone is trying to hide from search engines is listed in plain text, right there in robots.txt.

So if you are interested in finding out what a web site owner is hiding from search engines, and want to ponder why that might be, just look for the robots.txt file and read it. The file can also contain interesting comments, providing clues to why certain content has been disallowed. If a robots.txt file is in use, it will be found in the root of the site, for example: http://www.google.com/robots.txt
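Reading the file can also be scripted. Below is a minimal sketch using only the Python standard library; it fetches a robots.txt file (the Google one mentioned above is used as the example) and prints the Disallow lines and comments, which are usually the interesting parts.

    import urllib.request

    # Fetch the robots.txt file from the site root
    url = "http://www.google.com/robots.txt"
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8", errors="replace")

    # Keep only the lines that disallow something, plus any comments
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Disallow:") or stripped.startswith("#"):
            print(stripped)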

If you want to google for robots.txt files in general, use this query in Google:

ext:txt inurl:robots

If you want to google for a robots.txt file on a particular domain, use this query in Google:

ext:txt inurl:robots site:yourselecteddomain.com

Here, for example, is the robots.txt file for Microsoft.com:
http://www.google.com/search?q=ext:txt+inurl:robots+site:www.microsoft.com

Apparently, Microsoft doesn’t want searchers to find the help pages for Macintosh owners using Microsoft products…

Read more about robots.txt on Wikipedia: http://en.wikipedia.org/wiki/Robots.txt


Employment ads give it all away

Employment advertisements are a very useful source of information when doing competitive intelligence work. While any organization or company tends to keep its self-description polished and free of internal detail in its marketing and PR communication, a lot more is both said explicitly and written between the lines in its job ads.

This is particularly useful when researching non-listed companies. In my experience, judging from how many of their ads are written, companies don’t seem to think of competitors as being among the readers of their employment ads. It can also be a difficult balancing act to reveal enough to attract candidates while not giving away details that provide competitors with too much insight.

A single employment advertisement can provide a lot, and a series of ads over time even more. Organizational structure and chains of command can be mapped out, even for companies and organizations that are otherwise very discreet with that type of information. New technology development projects can be spotted as the company hires new specialists. The size of the team or department in question is often stated in the clear. Salary levels, and in turn the approximate total cost of staff, can be deduced by collecting ads over time. The sheer frequency of job ad publications, and how it varies over time, is an indicator of the state of the business: is the company growing and expanding, or not?

For software development companies, the programming skills and programming language knowledge demanded will tell a lot about what the company is up to. In some countries, it is required by law to list the name and phone number of a workers’ union representative. That name can in turn be a key to additional information from LinkedIn or Facebook.

So, my advice in summary:

  • monitor your competitors’ career pages on the web,
  • monitor their ads on Monster.com and similar services,
  • collect the ads they publish,
  • keep good track of when they were published and when applications were due,
  • process and organize the bits of information found in the ads (a minimal record structure is sketched below),
  • combine the information with facts from other sources.
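To make the collected ads easy to compare over time, even a very simple record structure goes a long way. The sketch below shows one hypothetical way to organize them in Python; the fields and the example data are purely illustrative, not a prescribed schema.

    # Hypothetical record structure for collected employment ads
    from dataclasses import dataclass, field
    from datetime import date
    from typing import List

    @dataclass
    class JobAd:
        company: str
        title: str
        published: date                # when the ad appeared
        application_deadline: date     # when applications were due
        source: str                    # career page, Monster.com, etc.
        skills: List[str] = field(default_factory=list)  # e.g. programming languages mentioned
        team_size_hint: str = ""       # any stated size of the team or department
        notes: str = ""                # named contacts, salary hints, project clues

    # Example entry with invented data
    ad = JobAd(
        company="Example Corp",
        title="Embedded Software Engineer",
        published=date(2010, 4, 1),
        application_deadline=date(2010, 4, 30),
        source="company career page",
        skills=["C", "Python"],
        team_size_hint="team of 8",
    )
    print(ad.company, ad.published, ad.skills)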

Social media: Marketing Input, Intelligence Output

Even the slowest followers in the print media mainstream have by now picked up and echoed the imperative to make use of social media for reaching out to customers: get a corporate Twitter ID and tweet about everything new in your offerings; get a corporate Facebook group and start one-on-one dialogues with the buyers of your products. All of that is a new way of doing marketing.

However, very few are talking about what comes out at the other end of these social media based, outbound marketing & PR efforts. While companies have learned to do a lot of Marketing Input, they can also take the next step and pick up the Intelligence Output. By monitoring and listening to what is going on in various social media channels, companies can collect information about their own brand reputation, competitors’ brand reputation, customer satisfaction levels, competitors’ activity, competitors’ customer satisfaction levels, competitors’ product problems, and so on.

In the report “Top 10 trends in Business Intelligence for 2010” from HP (Hewlett-Packard), Social Computing (the use of online social media) is named as one of the top 10 trends for 2010 and described as an increasingly important source of decision support data.

“An important influence in the continuing BI evolution is the impact of social computing on decision-making processes, methods of collaboration and interaction, and enhanced customer experience. BI can expand the insight it provides organizations if it encompasses the information from interactions that occur in social computing environments. The dynamic conversation channels available through blogs, online communities, Twitter, Facebook, LinkedIn, and a host of social computing venues engage customers, prospects, partners, influencers, and employees—touching virtually every key constituent in an organization’s value chain. Very importantly, these channels are reshaping how customers evaluate and choose products, how brands are perceived, how business processes evolve, and how people work together.

Today most organizations are only beginning to analyze the learnings from online conversation. Technologies such as Social Mining and Social Intelligence use sophisticated data mining and text analytics to understand the implicit meaning of this unstructured data, which is completely reliant on the context in which it occurs. These include social behaviors, attitudes, relationships, and knowledge, all of which carry subjective qualities not easily categorized.  We will see the expanded use of these disciplines to harvest both implicit and explicit information. They may predict future behavior that can impact plans, for example, when strong online chatter suggests product interest that drives a decision to increase production. Or they may help organizations respond to explicit feedback, for example, when user experiences reported in communities lead to a product adjustment. This wealth of intelligence can and should align with, and augment, the intelligence delivered through the organization’s traditional BI initiatives.  For now, the integration of BI with social computing will be managed through the attention of a vigilant few within an organization. Emerging technologies, such as MapReduce, are evolving to help bridge the gap between this new frontier and traditional BI. Look for BI to expand its footprint beyond its traditional realm as it embraces the additional insight available through social computing.”

There is at least one commercial service provider specializing in tracking social media: Whitevector‘s Chat Reports is a web-based service that gives consumer brand teams a comprehensive picture of what is being said in online social media discussions such as forums, blogs, and networks like Facebook and Twitter.

Tamara Barber at Forrester recently posted an article on her blog, Three Key Considerations On Social Media For Market Research, where she lists three of the challenges that have to be met by systems for mining social media. She quotes people from Conversition, Attensity and Alterian. The headlines are:

  • Processes and methods need to be developed to make social media data another source for market research.
  • To “connect the dots” on text mining data, you need to extract noun-verb relationships, sentiment, suggestions and intent.
  • “[In social media research] 80% of your time is spent on identifying the right content, getting it into the right shape, and getting the gems out of it. Social media research is not magic.”

OSINT is what hackers use

Any hacker attempt to break into a system starts with a research phase aimed at identifying soft spots and possible methods of attack. This is sometimes referred to as the Network & Business Reconnaissance phase, as for example in the article The Five Phase Approach of Malicious Hackers. The blog ShortInfoSec.net agrees, writing that “the methodology used in OSINT is the information gathering phase of every penetration phase”.

The hacker will try to find out as much as possible about the target using information that he or she can find without committing any crime and without exploiting any software vulnerabilities. Let’s say the target is a company. The hacker will then try to find the names and positions of people working at the company, collect documents and files on the internet originating from the company, collect information from newspaper articles about the company, and collect all obtainable information on the company’s internet domain names, the IP addresses associated with those domain names, and the servers behind them. He or she would also collect employment ads from the company in order to find out which software systems are in use, and to learn about internal routines, terminology and details of the organizational structure.
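Part of that collection can be scripted. As a minimal illustration, the Python standard library alone can resolve a domain name to the IP addresses behind it; the domain used here is just a placeholder.

    import socket

    # Resolve a domain name to its canonical hostname, aliases and IPv4 addresses
    domain = "example.com"
    hostname, aliases, addresses = socket.gethostbyname_ex(domain)
    print("Canonical hostname:", hostname)
    print("Aliases:", aliases)
    print("IPv4 addresses:", addresses)

From there, looking up the WHOIS record for the domain and doing reverse lookups on the addresses are natural next steps.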

So-called dumpster diving – going through trash bags coming out of the company’s facilities – can provide loads of useful information. Knowing the names, positions and work locations of employees at the company, the hacker can continue collecting biographic information on those people using, for example, LinkedIn, Facebook, Orkut and Pipl (the importance of thorough reconnaissance, research and preparation before making a social engineering penetration test is attested to by ShortInfoSec.net). For an illustrative description of how social media websites such as Facebook can be used as the primary vehicle for a hacker who needs to find a way through the front door, read Social Media and Identity Theft Risks PT II by Robert Siciliano.

I suppose spelling it out isn’t really necessary, but still: the hacker is using information from open sources to create a target profile intelligence report about the company – using Open Source Intelligence. The ultimate use of this intelligence is to pinpoint a part of the company’s IT infrastructure that has a known, exploitable vulnerability and/or to devise a social engineering* attack whereby an employee is tricked into revealing critical information such as a password. At the RSA Security Conference 2010, security researcher Pedro Varangot from Core Security Technologies even demonstrated how the trust that users have in social networks can be leveraged to execute targeted social engineering attacks.


* Social engineering is not OSINT, but spot-on HUMINT. Read more about social engineering on Wikipedia.

Grey Zone websites

There are quite a few websites which publish information that the producer did not intend to make public. The most famous one is of course WikiLeaks. What these websites do is transform classified, covert information into open source information…

http://wikileaks.org/

http://cryptome.org/

Saving YouTube videos to disk

When collecting internet material for intelligence production, it is always necessary to create a local copy of what you find, since you cannot rely on the material to remain accessible or unchanged in its online location. For text and images this is a simple thing to do. However, for video media it can be less obvious.

For saving YouTube videos to your hard drive, I recommend a freeware tool named VDownloader (available for Microsoft Windows). When VDownloader is running, all you need to do is copy the URL presented by YouTube after clicking the Share link below the video window. Then switch to the VDownloader window to optionally edit the file name.

Videos published on YouTube can be embedded in other websites. The easiest way to save the video to disk when you find it embedded somewhere outside of YouTube is to first right-click the video frame and select “Watch on YouTube”. Then click the Share link and copy the video sharing URL.

VDownloader also captures video from a number of other services besides YouTube: Vimeo, Metacafe, Google Video, DailyMotion, Yahoo!

If you work on a PC where you don’t have administrative rights and hence cannot install software, you can try this alternative: PortableFreeware.com makes a previous version available that can be run as a portable, no-installation application: VDownloader on PortableFreeware.com.

OSINT Source Typology

Print Media

  • General daily newspapers
  • Industry newspapers
  • Industry journals
  • Professional journals
  • Company internal journals
  • Research papers
  • Dictionaries and reference literature
  • Non-fiction books in general
  • Annual reports from companies and other organizations
  • Government public documents
  • Employment advertisements

Broadcast Media

  • Television news
  • Documentary films
  • Other non-entertainment shows
  • Radio news
  • Radio documentaries
  • Other non-entertainment radio shows

Digital Media

  • Online newspapers
  • Online journals
  • Blogs and microblogs (e.g. Blogger, WordPress, Twitter)
  • Social communities (e.g. Facebook, LinkedIn, MySpace, Orkut)
  • Discussion forums
  • Newsgroups (Usenet Newsgroups)
  • IRC (Internet Relay Chat)
  • Wikis (e.g. Wikipedia, Wikisummaries, Wikileaks)
  • Sharing services for video (YouTube is dominating, but there are several others)
  • Sharing services for photos (e.g. Panoramio, Flickr, and a number of others)
  • NGO websites
  • Company websites
  • Governmental websites
  • Government public documents
  • Email newsletters
  • Employment advertisements
  • Published patents and patent applications
  • Domain registrar WHOIS records
  • Audio podcasts
  • Video podcasts
  • Maps
  • Satellite imagery
  • Aerial photos
  • Metadata extracted from published content (in practice from image files, office suite files and PDFs)

Databases and directories

  • Public statistics databases
  • Newspaper article databases (for material from before the Internet)
  • Business databases (for facts on companies)
  • Library & bookshop databases, and books.google.com
  • Documentary film production databases (e.g. IMDb.com)
  • White pages phone books
  • Yellow pages phone books
  • People search engines (e.g. Pipl.com)
  • All kinds of specialty search engines and directories, the content of which is often referred to as the “Deep Web”
  • Domain registrar WHOIS records