Troubleshooting iFilters and Crawled Properties

As I have taught, before a document property can be used in search results, it must be promoted to a Managed Property. Many properties are already managed properties out of the box. On several recent SharePoint 2010 Search projects I have used third-party iFilters to solve specific document indexing challenges and needed to work extensively with Crawled Properties. For example:

  • I want to extract XMP data from images like Latitude and Longitude
  • I need to promote a particular field for use as Document Keywords
  • I need to troubleshoot a specific iFilter

Time and time again I have evangelized the use of supported third-party iFilters, like those from the iFilter Shop with their great selection of products, to solve indexing challenges (rather than free crap). Likewise, I LOVE the Foxit Software PDF iFilter. In tests the multithreaded Foxit iFilter 2.0 CRUSHED the Adobe iFilter by indexing files 39 times faster (processing a scorching 29 files per second). Read the full review. But what happens when you decide to use these tools and the results are not what you expect, or you want to take the plunge and extract more metadata, like the XMP file information from an image? The reality is that the SharePoint Crawled Property interface is not very “user friendly”. I hope to change that with the help of our friends at the iFilter Shop and some recent experience.

Crawled Properties

I will be the first to admit that my eyes cross when I get into the Crawled Properties page. You get there by navigating from Central Administration to your Search Service Application and choosing Metadata Properties from the left navigation. Then from the toolbar choose Crawled Properties. It is cryptic and obscure, with properties like Basic:10 (Text), Category 8:100 (Integer), and, occasionally, something human readable like SharePoint:ows_City(Text) and People:AboutMe(Text). Figuring out the SharePoint properties is not too hard. If the property begins with “ows_” it usually came from a field in a list or library. So if I create a list and add my own field called “DogBreed” and then crawl the list, I’ll see ows_DogBreed show up as a crawled property. What about the other properties? Well, when the crawler finds a document, web page, BCS data or whatever, it passes the item to the registered iFilter for that item. The iFilter “reads” the item and extracts the metadata properties and text. So don’t blame SharePoint; it is just cataloging what the iFilter provides. If the iFilter returns a property and calls it Basic:10 of type Text, SharePoint makes note of it exactly that way. It ain’t pretty, but it works.

Getting to the Properties

So, what about all the other fields with significantly less descriptive names? Names like Office:3 and Office:5? What the heck is that? Well, using a very cool tool from the iFilter Shop, I will attempt to explain. But first a little background on where properties come from… For example, I have a Word document that has metadata. How do I know this? Because I can right-click the document, choose Properties and view the metadata on the Details tab.

Properties Dialog

Properties Display

Of course, since I have access to the document I can also view the properties in Backstage.

Office Backstage

PDF documents have a similar properties dialog and you can access them from the file system or through the File | Properties dialog. Here is a document I created by scanning a recent news article about my dog Willa.

PDF Properties

And the corresponding file properties in Windows 7. Notice that, unlike the Word document, there are fewer document properties available in the Property dialog from the operating system.

OS File Properties

So how do we get Title, Subject, Author and Keywords if the OS cannot? Simple: iFilters.

Property Extraction

The problem, I determined, was this: how do I make the path from Crawled Property (with a cryptic name) to Managed Property (with a name of my choosing) a predictable experience? How can I “see the file the way the iFilter sees it?” That is where the folks at the iFilter Shop came to the rescue with their free tool, iFilterView. First, download iFilterView from iFilter Shop. This tool is purpose-built to show you what the iFilter sees when it opens your document. Once it is downloaded, run it, choose File | Open and pick a file, in our case the first Word document. Notice that the output shows some of our properties. Notice the Property IDs like 2, 4 and 5. They look familiar. They look like the Title, Author and Tags from the first example above.

IFilterView

Let’s do it again, this time on the PDF. Yep, Title, Subject, Author and Keywords.

IFilterView

I see that we’re getting somewhere. If only there were some identifier, some global identifier, that tied our property as extracted by the iFilter to our Crawled Property as registered by SharePoint. If only…

Making the Connection

If you really dig into crawled properties (by simply clicking on one) you will see that there is more information hiding under the covers. In my case I chose Office:4 (Text).

Managed Properties

This page shows the wildly verbose Property Name and the Property Set ID. Notice that the Property Set ID matches exactly the ID in the iFilterView dump of property 4 above. So here is my ah-ha moment. If I don’t know what a property is, I can add it to a document, open the document in iFilterView, observe where my property ends up and then look in SharePoint for the GUID. How cool is that?? Now I know that Office:4 is mapped to the Managed Property Author.

Property Details
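
For quick reference, the numbers in Office:2, Office:4 and Office:5 line up with the property IDs of the standard Windows summary information property set (Title is 2, Subject 3, Author 4, Keywords 5). The sketch below is a small lookup helper built on that assumption. The function name Resolve-SummaryPropertyName is made up for illustration, and it only knows about that one property set, so check that the GUID you see in iFilterView really is the one in the first line before trusting the answer.

```powershell
# FMTID_SummaryInformation: the standard Windows/OLE summary information property set.
$SummaryInformationPropSet = [guid]'f29f85e0-4ff9-1068-ab91-08002b27b3d9'

# Property IDs defined for that set (only the common document fields are listed).
$SummaryInformationNames = @{
    2 = 'Title'
    3 = 'Subject'
    4 = 'Author'
    5 = 'Keywords'
    6 = 'Comments'
}

# Hypothetical helper: returns a friendly name for a (property set, property ID) pair,
# or $null when the pair is not one of the entries above.
function Resolve-SummaryPropertyName {
    param(
        [Parameter(Mandatory = $true)][guid]$PropSet,
        [Parameter(Mandatory = $true)][int]$PropertyId
    )
    if ($PropSet -ne $SummaryInformationPropSet) { return $null }
    return $SummaryInformationNames[$PropertyId]
}

# Office:4 with the summary information GUID resolves to Author.
Resolve-SummaryPropertyName -PropSet 'F29F85E0-4FF9-1068-AB91-08002B27B3D9' -PropertyId 4
```

Properties from third-party iFilters will use their own property sets, so for those the iFilterView dump remains the source of truth.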

Did I Mention GPS Data?

You can extend this to new iFilters like the XMP iFilter from iFilter Shop. If I have a photo with XMP metadata, I can open it with iFilterView and look at the GPS data. This takes the guesswork out of configuration. Now I know what to look for in the Crawled Properties interface: I’ll find two properties, GPSLatitude and GPSLongitude, waiting for me to map to managed properties.

GPS Data
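
Once the crawled properties have shown up, the mapping itself can be scripted. What follows is a rough sketch rather than a tested recipe. The helper name Add-CrawledPropertyMapping, the managed property name GpsLatitude and the choice of the Text type are all my assumptions; check the variant type your iFilter actually reports before settling on a managed property type. It needs the SharePoint 2010 Management Shell (or the snap-in loaded) and a crawl that has already registered the crawled property.

```powershell
# Hypothetical helper: creates a managed property (if missing) and maps one crawled property to it.
function Add-CrawledPropertyMapping {
    param(
        [Parameter(Mandatory = $true)][string]$SearchApplicationName,
        [Parameter(Mandatory = $true)][string]$CrawledPropertyName,
        [Parameter(Mandatory = $true)][string]$ManagedPropertyName,
        [int]$ManagedPropertyType = 1   # assumption: 1 = Text; pick the type that fits your data
    )

    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity $SearchApplicationName

    # The same name can exist under more than one property set, so insist on a single match.
    $crawled = @(Get-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $ssa -Name $CrawledPropertyName)
    if ($crawled.Count -ne 1) {
        throw "Expected exactly one crawled property named '$CrawledPropertyName', found $($crawled.Count)."
    }

    # Reuse the managed property if it already exists, otherwise create it.
    $managed = Get-SPEnterpriseSearchMetadataManagedProperty -SearchApplication $ssa -Identity $ManagedPropertyName -ErrorAction SilentlyContinue
    if (-not $managed) {
        $managed = New-SPEnterpriseSearchMetadataManagedProperty -SearchApplication $ssa -Name $ManagedPropertyName -Type $ManagedPropertyType
    }

    New-SPEnterpriseSearchMetadataMapping -SearchApplication $ssa -ManagedProperty $managed -CrawledProperty $crawled[0]
}

# Example call (names are placeholders):
# Add-CrawledPropertyMapping -SearchApplicationName 'Enterprise Search Service Application' `
#     -CrawledPropertyName 'GPSLatitude' -ManagedPropertyName 'GpsLatitude'
```

Expect to run a full crawl afterwards before the new managed property returns anything in search results.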

Needle in a Haystack

PowerShell Book Cover

OK, so when I said “All you have to do is find the GUID” how many of you cringed? OK, probably only those of you who have actually spent more than 15 minutes looking at crawled properties. Trust me, it is tedious. There has to be a shortcut! Well, I have been trying to absorb my friend Gary Lapointe’s amazing book Automating Microsoft SharePoint 2010 Administration with Windows PowerShell 2.0. I decided to write a script to dump all the Crawled Properties so that I could just search for the GUID in Notepad++. Blame me for any errors, not Gary, he’s innocent, really, really innocent, I mean, have you met the guy?

Anyway, the following code will enumerate all your categories and Crawled Properties and create an XML file (crawledprops.xml in your home folder) that you can search for your GUIDs. You just have to change the $ssaName variable to match the name of your Search Service Application.

# Load the Microsoft.SharePoint.PowerShell snap-in if it is not already loaded
if (-not (Get-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue)) {
    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
    if (-not $?) {
        Write-Host $error[0].Exception.Message -ForegroundColor Red -BackgroundColor Cyan
        return
    }
}

# Change this to match your SharePoint Search Service Application name
$ssaName = "Enterprise Search Service Application"

# Get the search application by name
$searchapp = Get-SPEnterpriseSearchServiceApplication -Identity $ssaName

# Get the crawled property categories
$categories = Get-SPEnterpriseSearchMetadataCategory -SearchApplication $searchapp

# Create an empty XML document; the [xml] cast matters, a plain string has no CreateElement method
[xml]$xml = "<categories/>"

# Loop over the categories and write out every crawled property
foreach ($category in $categories) {
    # Create the category node
    $categoryElement = $xml.CreateElement("category")
    $categoryElement.SetAttribute("name", $category.Name)
    $categoryElement.SetAttribute("count", $category.CrawledPropertyCount.ToString())
    [void]$xml.DocumentElement.AppendChild($categoryElement)
    Write-Host "Writing category $($category.Name)"

    # Add a properties node to hold the crawled properties
    $propertiesElement = $xml.CreateElement("properties")
    [void]$categoryElement.AppendChild($propertiesElement)

    # Get the crawled properties for the category
    foreach ($property in $category.GetAllCrawledProperties()) {
        # Create a property node and record the name, property set GUID and variant type
        $propertyElement = $xml.CreateElement("property")
        [void]$propertiesElement.AppendChild($propertyElement)
        $propertyElement.SetAttribute("name", $property.Name)
        $propertyElement.SetAttribute("propset", $property.Propset.ToString())
        $propertyElement.SetAttribute("varianttype", $property.VariantType.ToString())
    }
}

# Save the dump to your home folder
$xml.Save("$home\crawledprops.xml")
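
Rather than eyeballing the file in a text editor, you can also let PowerShell do the searching. Here is a minimal sketch, assuming the XML layout the script above is meant to produce (category elements containing a properties element, whose property children carry name, propset and varianttype attributes). The function name Find-CrawledPropertyByPropSet is my own invention.

```powershell
# Hypothetical helper: lists every crawled property in the dump whose property set matches a GUID.
function Find-CrawledPropertyByPropSet {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [Parameter(Mandatory = $true)][guid]$PropSet
    )

    $doc = New-Object System.Xml.XmlDocument
    $doc.Load((Resolve-Path $Path).ProviderPath)

    foreach ($property in $doc.SelectNodes('/categories/category/properties/property')) {
        # Parse the attribute as a GUID so braces and letter case do not matter.
        if (([guid]$property.GetAttribute('propset')) -eq $PropSet) {
            New-Object PSObject -Property @{
                Category    = $property.ParentNode.ParentNode.GetAttribute('name')
                Name        = $property.GetAttribute('name')
                VariantType = $property.GetAttribute('varianttype')
            }
        }
    }
}

# Example: paste the GUID shown by iFilterView (this one is the standard summary information set).
# Find-CrawledPropertyByPropSet -Path "$home\crawledprops.xml" -PropSet 'f29f85e0-4ff9-1068-ab91-08002b27b3d9'
```

GetAttribute is used instead of dotted property access because the elements carry an attribute called name, which would otherwise shadow the node's own Name property in PowerShell's XML adapter.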

Are We Done?

No, we’re just getting started, but I need to save that for another post.

|| iFilters || PowerShell || Search || SharePoint 2010
