Predicting the Future With Social Media

Sitaram Asur and Bernardo A. Huberman at the Social Computing Lab at HP Labs in Palo Alto, California, have demonstrated how social media content can be used to predict real-world outcomes. They used content from Twitter.com to forecast box-office revenues for movies. With a simple model built from the rate at which tweets are created about particular topics, they outperformed market-based predictors. They extracted 2.89 million tweets referring to 24 different movies released over a period of three months. According to the researchers’ prediction, the movie “The Crazies” was going to generate 16.8 million dollars in ticket sales during its first weekend. The actual figure turned out to be very close: 16.06 million dollars. The drama “Dear John” sold tickets worth 30.46 million dollars, compared to a prediction of 30.71 million dollars.
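The underlying approach is simple enough to sketch: treat the tweet rate before release as a predictor and fit a linear model against opening-weekend revenue. Below is a minimal illustration of that kind of model; the numbers, variable names and the predicted movie are invented for illustration and are not data from the HP Labs study.

    # Minimal sketch of a tweet-rate regression model. All figures below are
    # invented for illustration; they are not from the Asur & Huberman paper.
    import numpy as np

    # Average tweets per hour about each movie in the week before release
    tweet_rate = np.array([120.0, 310.0, 95.0, 540.0, 210.0])
    # Opening-weekend box office for the same movies, in millions of dollars
    revenue = np.array([8.2, 21.5, 6.9, 38.0, 15.1])

    # Fit revenue ~= a * tweet_rate + b
    a, b = np.polyfit(tweet_rate, revenue, deg=1)

    # Predict the opening weekend for a new movie from its tweet rate
    new_rate = 250.0
    print(f"Predicted opening weekend: {a * new_rate + b:.1f} million dollars")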

Reported by the BBC: http://news.bbc.co.uk/2/hi/8612292.stm

Reported by SiliconValleyWatcher: http://www.siliconvalleywatcher.com/mt/archives/2010/04/twitter_study_i.php

The research report: http://www.hpl.hp.com/research/scl/papers/socialmedia/socialmedia.pdf

Previous related iOSINT posts:

https://iosint.wordpress.com/2010/03/29/ted-com-sean-gourley-on-the-mathematics-of-war/

https://iosint.wordpress.com/2010/03/17/social-media-intelligence-output/

Take a look at what they want to hide

Web site owners can block search engine web spiders and indexing bots from including parts of the content under their domain in the search engine index. That is done by placing a text file named robots.txt in the root directory of the web site. The file contains instructions such as “Disallow:” followed by a subdirectory or a web page, telling the bots that this part of the site should not be included in the search engine index. As a result, that page or the pages in that subdirectory will not appear among the results of any search.
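As an illustration, a robots.txt file might look like the hypothetical example below; the paths are invented, but the directive syntax is the standard one.

    User-agent: *
    Disallow: /internal-reports/
    Disallow: /drafts/preview.html

The first line says that the rules apply to all bots; the two Disallow lines ask them to stay out of everything under /internal-reports/ and to skip the single page /drafts/preview.html.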

Of course, none of these pages are truly protected or hidden – they are just not included in the search engines’ lists of “known web pages”. So, anyone knowing the exact web address will be able to browse to the page in question and view it.

In most cases, the web site owner is not really trying to prevent access to anything on his or her site. More likely, the purpose is to omit certain content from the search results in order to give the more relevant content better visibility. Once a visitor is on the web site, everything published there is still available through the navigation menus and internal links.

However, there are cases where the site owner has published something to the web server that is not made part of the public web site, and he or she is trying to hide this content by blocking search engine spiders from indexing it. The problem is that robots.txt itself is always openly available to anyone – otherwise web spiders would not be able to read it. So whatever anyone is trying to hide from search engines is listed in plain text, right there in robots.txt.

So if you are interested in finding out what a web site owner is hiding from search engines, and want to ponder why that might be, just look for the robots.txt file and read it. The file can also contain interesting comments, providing clues to why certain content has been disallowed. If a robots.txt file is in use, it will be found in the root of the site, for example: http://www.google.com/robots.txt
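Reading the file can also be scripted. Below is a minimal sketch using only the Python standard library; it fetches a robots.txt file (the Google one mentioned above is used as the example) and prints the Disallow lines and comments, which are usually the interesting parts.

    import urllib.request

    # Fetch the robots.txt file from the site root
    url = "http://www.google.com/robots.txt"
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8", errors="replace")

    # Keep only the lines that disallow something, plus any comments
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Disallow:") or stripped.startswith("#"):
            print(stripped)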

If you want to google for robots.txt files in general, use this query in Google:

ext:txt inurl:robots

If you want to google for a robots.txt file on a particular domain, use this query in Google:

ext:txt inurl:robots site:yourselecteddomain.com

Here, for example, is the robots.txt file for Microsoft.com:
http://www.google.com/search?q=ext:txt+inurl:robots+site:www.microsoft.com

Apparently, Microsoft doesn’t want searchers to find the help pages for Macintosh owners using Microsoft products…

Read more about robots.txt on Wikipedia: http://en.wikipedia.org/wiki/Robots.txt


Employment ads give it all away

Employment advertisements are a very useful source of information when doing competitive intelligence work. While any organization or company tends to keep its self-description polished and free of internal detail in its marketing and PR communication, a lot more is both said explicitly and written between the lines in its job ads.

This is particularly useful when researching non-listed companies. In my experience, judging from how many of their ads are written, companies don’t seem to think of competitors as being among the readers of their employment ads. It can also be a difficult balancing act to reveal enough to attract candidates while not giving away details that provide competitors with too much insight.

A single employment advertisement can provide a lot, and a series of ads over time even more. Organizational structure and chains of command can be mapped out, even for companies and organizations that are otherwise very discreet with that type of information. New technology development projects can be spotted as the company hires new specialists. The size of the team or department in question is often stated in the clear. Salary levels, and in turn the approximate total cost of staff, can be deduced by collecting ads over time. The sheer frequency of job ad publications, and how it varies over time, is an indicator of the state of the business: is the company growing and expanding, or not?

For software development companies, the programming skills and programming language knowledge demanded will tell a lot about what the company is up to. In some countries, it is required by law to list the name and phone number of a workers’ union representative. That name can in turn be a key to additional information from LinkedIn or Facebook.

So, my advice in summary:

  • monitor your competitors’ career pages on the web,
  • monitor their ads on Monster.com and similar services,
  • collect the ads they publish,
  • keep good track of when they were published and when applications were due,
  • process and organize the bits of information found in the ads (a minimal record structure is sketched below),
  • combine the information with facts from other sources.
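To make the collected ads easy to compare over time, even a very simple record structure goes a long way. The sketch below shows one hypothetical way to organize them in Python; the fields and the example data are purely illustrative, not a prescribed schema.

    # Hypothetical record structure for collected employment ads
    from dataclasses import dataclass, field
    from datetime import date
    from typing import List

    @dataclass
    class JobAd:
        company: str
        title: str
        published: date                # when the ad appeared
        application_deadline: date     # when applications were due
        source: str                    # career page, Monster.com, etc.
        skills: List[str] = field(default_factory=list)  # e.g. programming languages mentioned
        team_size_hint: str = ""       # any stated size of the team or department
        notes: str = ""                # named contacts, salary hints, project clues

    # Example entry with invented data
    ad = JobAd(
        company="Example Corp",
        title="Embedded Software Engineer",
        published=date(2010, 4, 1),
        application_deadline=date(2010, 4, 30),
        source="company career page",
        skills=["C", "Python"],
        team_size_hint="team of 8",
    )
    print(ad.company, ad.published, ad.skills)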

Social media: Marketing Input, Intelligence Output

Even the slowest followers in the print media mainstream have by now picked up and echoed the imperative to make use of social media for reaching out to customers: get a corporate Twitter ID and tweet about everything new in your offerings; get a corporate Facebook group and start one-on-one dialogues with the buyers of your products. All of that is a new way of doing marketing.

However, very few are talking about what comes out at the other end of these social media based, outbound marketing & PR efforts. While companies have learned to do a lot of Marketing Input, they can also take the next step and pick up the Intelligence Output. By monitoring and listening to what is going on in various social media channels, companies can collect information about their own brand reputation, competitors’ brand reputation, customer satisfaction levels, competitors’ activity, competitors’ customer satisfaction levels, competitors’ product problems, and so on.

In the report “Top 10 trends in Business Intelligence for 2010” from HP (Hewlett-Packard), Social Computing (the use of online social media) is named as one of the top 10 trends for 2010 and described as an increasingly important source of decision support data.

“An important influence in the continuing BI evolution is the impact of social computing on decision-making processes, methods of collaboration and interaction, and enhanced customer experience. BI can expand the insight it provides organizations if it encompasses the information from interactions that occur in social computing environments. The dynamic conversation channels available through blogs, online communities, Twitter, Facebook, LinkedIn, and a host of social computing venues engage customers, prospects, partners, influencers, and employees—touching virtually every key constituent in an organization’s value chain. Very importantly, these channels are reshaping how customers evaluate and choose products, how brands are perceived, how business processes evolve, and how people work together.

Today most organizations are only beginning to analyze the learnings from online conversation. Technologies such as Social Mining and Social Intelligence use sophisticated data mining and text analytics to understand the implicit meaning of this unstructured data, which is completely reliant on the context in which it occurs. These include social behaviors, attitudes, relationships, and knowledge, all of which carry subjective qualities not easily categorized.  We will see the expanded use of these disciplines to harvest both implicit and explicit information. They may predict future behavior that can impact plans, for example, when strong online chatter suggests product interest that drives a decision to increase production. Or they may help organizations respond to explicit feedback, for example, when user experiences reported in communities lead to a product adjustment. This wealth of intelligence can and should align with, and augment, the intelligence delivered through the organization’s traditional BI initiatives.  For now, the integration of BI with social computing will be managed through the attention of a vigilant few within an organization. Emerging technologies, such as MapReduce, are evolving to help bridge the gap between this new frontier and traditional BI. Look for BI to expand its footprint beyond its traditional realm as it embraces the additional insight available through social computing.”

There is at least one commercial service provider specializing in tracking social media: Whitevector‘s Chat Reports is a web-based service that gives consumer brand teams a comprehensive picture of what is being said in online social media discussions such as forums, blogs, and networks like Facebook and Twitter.

Tamara Barber at Forrester recently posted an article on her blog, Three Key Considerations On Social Media For Market Research, where she lists three of the challenges that have to be met by systems for mining social media. She quotes people from Conversition, Attensity and Alterian. The headlines are:

  • Processes and methods need to be developed to make social media data another source for market research.
  • To “connect the dots” on text mining data, you need to extract noun-verb relationships, sentiment, suggestions and intent.
  • “[In social media research] 80% of your time is spent on identifying the right content, getting it into the right shape, and getting the gems out of it. Social media research is not magic.”

OSINT is what hackers use

Any hacker attempt to break into a system starts with a research phase aimed at identifying soft spots and possible methods of attack. This is sometimes referred to as the Network & Business Reconnaissance phase, as for example in the article The Five Phase Approach of Malicious Hackers. The blog ShortInfoSec.net agrees, writing that “the methodology used in OSINT is the information gathering phase of every penetration phase”.

The hacker will try to find out as much as possible about the target using information that he or she can find without committing any crime and without exploiting any software vulnerabilities. Let’s say the target is a company. The hacker will then try to find the names and positions of people working at the company, collect documents and files on the internet originating from the company, collect information from newspaper articles about the company, and collect all obtainable information on the company’s internet domain names, the IP addresses associated with those domain names, and the servers behind them. He or she would also collect employment ads from the company in order to find out which software systems are in use, and to learn about internal routines, terminology and details of the organizational structure.
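Part of that collection can be scripted. As a minimal illustration, the Python standard library alone can resolve a domain name to the IP addresses behind it; the domain used here is just a placeholder.

    import socket

    # Resolve a domain name to its canonical hostname, aliases and IPv4 addresses
    domain = "example.com"
    hostname, aliases, addresses = socket.gethostbyname_ex(domain)
    print("Canonical hostname:", hostname)
    print("Aliases:", aliases)
    print("IPv4 addresses:", addresses)

From there, looking up the WHOIS record for the domain and doing reverse lookups on the addresses are natural next steps.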

So-called dumpster diving – going through trash bags coming out of the company’s facilities – can provide loads of useful information. Knowing the names, positions and work locations of employees at the company, the hacker can continue collecting biographic information on those people using, for example, LinkedIn, Facebook, Orkut and Pipl (the importance of thorough reconnaissance, research and preparation before making a social engineering penetration test is attested to by ShortInfoSec.net). For an illustrative description of how social media websites such as Facebook can be used as the primary vehicle for a hacker who needs to find a way through the front door, read Social Media and Identity Theft Risks PT II by Robert Siciliano.

I suppose spelling it out isn’t really necessary, but still: the hacker is using information from open sources to create a target profile intelligence report about the company – using Open Source Intelligence. The ultimate use of this intelligence is to pinpoint a part of the company’s IT infrastructure that has a known, exploitable vulnerability and/or to devise a social engineering* attack whereby an employee is tricked into revealing critical information such as a password. At the RSA Security Conference 2010, security researcher Pedro Varangot from Core Security Technologies even demonstrated how the trust that users have in social networks can be leveraged to execute targeted social engineering attacks.


* Social engineering is not OSINT, but spot-on HUMINT. Read more about social engineering on Wikipedia.

Grey Zone websites

There are quite a few websites which publish information that the producer did not intend to make public. The most famous one is of course WikiLeaks. What these websites do is transform classified, covert information into open source information…

http://wikileaks.org/

http://cryptome.org/

Saving YouTube videos to disk

When collecting internet material for intelligence production, it is always necessary to create a local copy of what you find, since you cannot rely on the material to remain accessible or unchanged in its online location. For text and images this is a simple thing to do. However, for video media it can be less obvious.

For saving YouTube videos to your hard drive, I recommend a freeware tool named VDownloader (available for Microsoft Windows). When VDownloader is running, all you need to do is copy the URL presented by YouTube after clicking the Share link below the video window. Then switch to the VDownloader window to optionally edit the file name.

Videos published on YouTube can be embedded in other websites. The easiest way to save the video to disk when you find it embedded somewhere outside of YouTube is to first right-click the video frame and select “Watch on YouTube”. Then click the Share link and copy the video sharing URL.

VDownloader also captures video from a number of other services besides YouTube: Vimeo, Metacafe, Google Video, DailyMotion, Yahoo!

If you work on a PC where you don’t have administrative rights and hence cannot install software, you can try this alternative: PortableFreeware.com makes a previous version available that can be run as a portable, no-installation application: VDownloader on PortableFreeware.com.

OSINT Source Typology

Print Media

  • General daily newspapers
  • Industry newspapers
  • Industry journals
  • Professional journals
  • Company internal journals
  • Research papers
  • Dictionaries and reference literature
  • Non-fiction books in general
  • Annual reports from companies and other organizations
  • Government public documents
  • Employment advertisements

Broadcast Media

  • Television news
  • Documentary films
  • Other non-entertainment shows
  • Radio news
  • Radio documentaries
  • Other non-entertainment radio shows

Digital Media

  • Online newspapers
  • Online journals
  • Blogs and microblogs (e.g. Blogger, WordPress, Twitter)
  • Social communities (e.g. Facebook, LinkedIn, MySpace, Orkut)
  • Discussion forums
  • Newsgroups (Usenet Newsgroups)
  • IRC (Internet Relay Chat)
  • Wikis (e.g. Wikipedia, Wikisummaries, Wikileaks)
  • Sharing services for video (YouTube is dominating, but there are several others)
  • Sharing services for photos (e.g. Panoramio, Flickr, and a number of others)
  • NGO websites
  • Company websites
  • Governmental websites
  • Government public documents
  • Email newsletters
  • Employment advertisements
  • Published patents and patent applications
  • Domain registrar WHOIS records
  • Audio podcasts
  • Video podcasts
  • Maps
  • Satellite imagery
  • Aerial photos
  • Metadata extracted from published content (in practice from image files, office suite files and PDFs)

Databases and directories

  • Public statistics databases
  • Newspaper article databases (for material from before the Internet)
  • Business databases (for facts on companies)
  • Library & bookshop databases, and books.google.com
  • Documentary film production databases (e.g. IMDb.com)
  • White pages phone books
  • Yellow pages phone books
  • People search engines (e.g. Pipl.com)
  • All kinds of specialty search engines and directories, the content of which is often referred to as the “Deep Web”
  • Domain registrar WHOIS records