Friday, June 23, 2006

BLOG on 6th Annual Family History Technology Workshop

(Sorry for the long post. However, my notes from this conference have come up in conversation recently, so I wanted to post them online. They were previously circulated via email.)

6th Annual Family History Technology Workshop

Conference notes by David Lifferth of Provo Labs.

I was able to attend the 6th (or 7th depending on who you ask) Family History Technology Workshop at BYU yesterday. The one-day conference was packed full of great information that was relevant not only to family history, but to almost every other information technology or Internet-based industry. I am very glad that I went, and there are dozens of ideas that we can apply directly to what we are doing in our business models. I have the conference notebook if anyone would like to see the papers that were presented there. I also digitally recorded several of the key presentations if anyone would like to listen in.

The day started off with the Keynote Address by Peter Norvig of Google. He is an amazing individual with an incredible background. He is the Director of Machine Learning at Google and has been at Google since 2001. Prior to that he was a Fellow of the American Association for Artificial Intelligence, a senior computer scientist at NASA, and head of the Computational Sciences Division at Ames Research Center. He has written a number of books, and for someone as technical as he is, he came across as very personable and easygoing.

Since his presentation was the main reason that I went to the conference, I took a lot of notes. I will summarize them and attempt to extrapolate what he said into our business models.

Norvig displayed a map of the world that showed the density of Google queries. He quoted "the best way to predict the future is to invent it," a line that has been attributed to several sources. "Search engines should be as useful as HAL 9000 in 2001: A Space Odyssey but not kill anyone."

He presented the "6 W's" of improving searching:
Who-Queries need to be customized to who you are. This requires personalization.
What-What type of information do you really want? Don't return results that you don't want.
Where-Localized results, including search results from your own computer and local servers.
When-Results should be timely (current results) and fast.
How-Results should go beyond 2.5 query terms and the top 10 results.
Wallet-Make it easy to buy info and goods.

Topic of Who (search query personalization)

He gave the example of searching for the term Jaguar. (This is also the search example that we use.) He doesn't want anything about cars, animals, or sports teams, only about the Mac OS. Search results should also include results from files on his local machine and file servers. He described the value of Flickr-type tagging for images and other media objects. (We should allow tagging of our image collection.) Blogs and forums are valuable if they are relevant to you. He showed an example of a lady who participates in camera and photography blogs and has averaged 200 entries a day for several years. He likes Wikipedia because it is up to date.

The Encyclopaedia Britannica also has errors and typos, and it takes several years to get its content to market. Blogs need search engine optimization. He likes type-ahead features that show current similar queries that others have used.

General knowledge must be understood in order to share ideas.
Communication is only effective if we have a common understanding. His example was "Water flows downhill": unless you understand that principle, other related ideas can't be effectively communicated. He said that artificial intelligence can play a role in helping general knowledge become useful in conveying ideas. Google is indexing the closed captions of video to allow video to be searched.

Where
He talked about applications that use Google Maps combined with other datasets. Google Maps and Craigslist is a powerful combination. (We are pursuing this with our data as well.)
He talked about how years ago the "Brown Corpus" was a tool used by many for research. It had 1,000,000 words, or 10 to the 6th power. The Internet is 10 to the 13th power and allows research to continue on a much larger scale.

Sentences that use the format "X such as Y" are great for classification of elements. For example, "Phone companies such as Verizon and AT&T" lets us know that AT&T and Verizon are categorized as phone companies. There are millions of sentences like this on web pages.
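
To make the "X such as Y" idea concrete, here is a minimal sketch in Python. It is only an illustration of the pattern, not how Google does it. The function name extract_such_as and the regular expression are my own assumptions, and a real system would need proper sentence parsing to find where the category phrase starts.

```python
import re

# Hypothetical pattern: a category phrase, the words "such as", then a member list.
PATTERN = re.compile(
    r"(?P<category>\w[\w ]*?) such as (?P<members>[^.;]+)", re.IGNORECASE
)

# Member lists are separated by commas and/or the words "and" / "or".
SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_such_as(sentence):
    """Return (category, member) pairs found in one sentence."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        category = match.group("category").strip()
        for member in SEPARATOR.split(match.group("members")):
            member = member.strip()
            if member:
                pairs.append((category, member))
    return pairs


pairs = extract_such_as("Phone companies such as Verizon and AT&T")
print(pairs)
```

Run over millions of web sentences, pairs like these could be counted so that frequent ones (a company name showing up again and again under "phone companies") are trusted more than one-off matches.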

Google's success has been because they organize the world's information and make it more accessible and useful. We have heard the expression that a picture is worth 1,000 words. From a technology standpoint, a photo or video is significantly more expensive in terms of storage than 1,000 words, but infinitely more useful in certain disciplines. Images are currently best searched by the surrounding text (this is how we are indexing the images from Wikipedia in our own image search). However, in the near future computers should be able to interpret images for their content. Facial recognition is coming, especially with the homeland security and defense dollars being thrown at it. Currently technology is very accurate at identifying males or females in images. Google is good at removing duplicate images of the same item even if taken from different cameras at different angles under different lighting conditions. This will save space while maintaining unique images.

Google is working on technology to tell, when someone types "carpenter" in a search, whether they mean the surname or the occupation, along with similar problems. They are working on a feature called "Google Base" that would allow anyone to upload a table or database, and they will store it and serve up the data appropriately. This could facilitate genealogy and all information sharing.
(This is an area where we could also compete; more on that later.) Google has a feature called "Site Seer" that allows publishers to upload articles for publishing and searching.

Words on a website page and links between pages are Google's primary mechanism for relevancy ranking. Official tags are no longer used, as they are prone to abuse by what he called 'spammers.' They spend a lot of time trying to weed out the spammers and others that try to bubble their pages to the top artificially. He described a contest where a group of tech savvy people tried to get the top page on a completely new and meaningless phrase.

The winner won by having everyone who read his blog link from their web pages to his page about the meaningless phrase. Everyone else went to a great deal of effort to create thousands of bogus web pages and other tricks, but they lost.

Peter described the "demo effect," where some applications win people over because they are so easy to demonstrate. However, 5 minutes later the ease of use becomes cumbersome in regular use. (This means that we need to provide very easy-to-use applications that can be bypassed for more serious use after users become more comfortable with the application.) Tree navigational structures are powerful, but are difficult to manage as they become more complex and include more nodes.

One of Google's most powerful features is multiple language searches where alternative words are searched on and the results are machine translated from their source language into the viewer's native tongue.

We need better and more intelligent queries. The average Google search is 2.5 words. However, when speaking into a microphone the search terms increase by a factor of 10. He gave the comparison of going to a library and trying to get the right book by saying only 2.5 words to the librarian. (This ties directly to the next generation search technology discussion that we are having. If we know what question the user is asking, such as "I am looking for a book with the following terms," "I am looking for a product with the following features," or "I am trying to find pages about the following person," then the search results will be more relevant than just looking for matching terms.) He described how Asian searches are different than searches in Roman-based languages. Because character entry is very difficult in Asian languages, users will type in a single word (character) and then navigate through hundreds of results rather than enter a 2nd word (character) to narrow the search. There must be a better way, and voice recognition will probably be that better way. Bill Gates has claimed for 10 years that desktop voice recognition is 2 years away, and we still don't have it.

People want unbiased reviews when shopping, but it is too difficult to separate the sales hype from real reviews. Voter ranking can help, but it is also easily manipulated by unscrupulous salespeople and marketers. They are working on this feature.

Six months ago, Google increased the number of words in a search from 10 to 256. This was due to user demand. However, the average number of words in a query has not increased.

Peter described the "~" tilde search operator to help users find what they want. It provides synonyms, stemming, and variations of that word. Someone requested that the search terms generated by the tilde be viewable by the user so they can remove and add to them. Peter said they would investigate whether there was a way to do that without tipping off their competitors.

Google is trying to assess the truthfulness of statements, and may add user rankings of sites and statements in the future.

Peter described how to make online communities successful. 1) Have easy to understand rules. 2) Have a very low barrier to entry. 3) Allow easy or low effort participation. 4) Provide easy linking between ideas and communities.

Web pages now are more complex with embedded Java. This makes parsing and indexing more difficult for search engines.

Other presentations

Deryle Lonsdale, a professor at BYU, described work that he was doing to vocalize or Romanize Arabic script names. This was obviously a Homeland Defense funded project. Several times he referred to the project sponsor without saying who that was. This is a huge challenge but important to the safety of the free world.

Bruce Brown, a grad student, presented a paper on the correlation between a Chinese name and its geographic origin. There is a tight correlation even though some thought there would be none.

Matthew Smith, a grad student, presented a paper on Genealogical Implicit Affinity Networks. This described linking or networking based on similarities between extended family members. He used the Star Wars family of Luke Skywalker as an example. I told some of the folks at MyFamily.com that they could provide this as a perk for people who uploaded GEDCOM files, to encourage participation.

Don Curtis, from MyFamily.com, discussed challenges of binarization in preparation for OCR processing. Don has several patents relating to this process. He has 3 patents at MyFamily.com and had 7 when he worked for Microsoft. Binarization was a recurring theme throughout the conference, as it is a problem that everyone who is converting print to digital is facing. The LDS Church has some challenges with their microfilm at the Granite vaults that were discussed later.
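
For anyone unfamiliar with the term, binarization means turning a grayscale scan into pure black and white before OCR. Below is a minimal sketch using Otsu's method, a textbook way to pick a global threshold. It is not Don's patented approach and the pixel values are made up for the example. Real scans usually need local (adaptive) thresholds because of stains and uneven lighting.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark ink from light paper."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels this dark yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (unnormalized); larger means a cleaner split.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map each pixel to 0 (ink) if at or below the threshold, else 1 (paper)."""
    return [0 if value <= threshold else 1 for value in pixels]


# Made-up 8-pixel "scan": a few dark ink pixels among light paper pixels.
scan = [30, 35, 40, 200, 210, 220, 205, 38]
threshold = otsu_threshold(scan)
bits = binarize(scan, threshold)
print(threshold, bits)
```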

Douglas Kennard, a BYU grad student, described the challenges in creating searchable indexes for handwritten documents. At first I just blew this presentation off because I didn't think that it was practical. In my mind, re-keying is the only way to deal with handwritten source material. But he described the various challenges in definable metrics to show when handwritten text can be computer interpreted. It does require lots of similar content (in this case 1,000 pages of George Washington's personal scribe). There are still major challenges that keep this from being a viable alternative to re-keying the data.

Shane Hathaway, of the LDS Church's Touchstone Team, described the Bit Mountain project. This is a project to store a massive amount of digital information in perpetuity and never lose anything due to hard drive crashes and storage media failure. They described the need for an 18 petabyte digital repository. A petabyte is 1,000,000 gigabytes. This is a massive project, but Doug McKay calculated that they could do it with $4 million using today's technology. However, the size of their data set is growing every day. Their theory is that excess computer hard drive space all around the world could be used to store redundant pieces of data. They called this MAID (Massive Array of Idle Disks). This would also use forward error correction, which is basically self-mirroring in reduced storage space.
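
To show what "mirroring in reduced storage space" can mean, here is a minimal sketch of the simplest form of forward error correction: a single XOR parity block. This is my own illustration, not the scheme described for Bit Mountain. The block contents and function names are made up, and a real system would use a stronger code that can survive more than one lost block.

```python
def xor_parity(blocks):
    """Compute one parity block as the byte-wise XOR of equal-length blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Three data blocks stored on three different (hypothetical) disks.
data = [b"GEDC", b"OM f", b"iles"]
parity = xor_parity(data)  # stored on a fourth disk

# Pretend the disk holding the second block has failed.
rebuilt = recover_missing([data[0], data[2]], parity)
print(rebuilt)
```

Here one extra block protects three data blocks, so the overhead is about 33% instead of the 100% that a full mirror would cost.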

Part of that idea is that there would be a massive index that could be stored on hard drives that could be powered off for months at a time to keep the media and ball bearings from wearing out. This is a massive project that they didn't have a real solution for, but they described the specifications that would allow someone to develop the solution and then sell it back to the Church when it is ready for use.

Heath Nielson, of the LDS Church, described the challenges of digitizing the church's vast microfilm collection. There are problems with the way that the images were originally scanned that can be overcome with improved digitizing. However, they are scanning at 200dpi and I don't think that will be high enough resolution for all of their needs. They also have to configure the digitizing for each image and not for a roll or 'region' of the roll at a time. I don't think that they have this problem fully resolved yet, which means that it will probably have to be re-done at a great expense in the future. This is problematic because each use of the microfilm reduces its life expectancy.

Burdette Pixton described using neural networks to do automatic genealogy name linking.

Jim Wray, whom I worked with at MyFamily.com, described how they merged 2 different indexes for the 1920 census. The original index was head of household only; the newer index has every person. The original index was not done very well, and they are basically using that one to check the new index. The problem was doing the merge without doing the 3.9 quadrillion lookups, which would have taken 12 years.
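
To put those two numbers in perspective, here is a quick back-of-the-envelope calculation. It only uses the figures quoted in the talk; the assumption that the 12 years means continuous, around-the-clock processing is mine.

```python
LOOKUPS = 3.9e15  # 3.9 quadrillion lookups, as quoted in the talk
YEARS = 12        # estimated run time, as quoted in the talk

# Assumption: the job runs nonstop, 24 hours a day, for the whole period.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

implied_rate = LOOKUPS / (YEARS * SECONDS_PER_YEAR)  # lookups per second
print(f"Implied rate: {implied_rate:,.0f} lookups per second")
```

That works out to roughly 10 million lookups per second for 12 years straight, which is why comparing every record against every other record was not an option.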

One of the best presentations of the day was one of the lowest quality presentations. It was about how to use Google, Yahoo, and MSN to do genealogy lookups on de-centralized data. The debate continues as to whether centralized data, organization, and retrieval (like MyFamily.com and the LDS Church) or de-centralization (like everyone else) is better. Dallan Quass, from the Foundation for On-Line Genealogy, described methods for focusing the standard search engines on finding genealogy info. While it is not an exact science, and won't replace MyFamily.com or FamilySearch.org, it has promise for the future, especially if genealogy standards can emerge for the web. However, one of the points made by many at the conference is that many of the legacy standards, even cultural standards, stand in the way of moving some genealogy data to the web.

Grant Skousen of the LDS Church presented their prototype for "Family Finder." The goal is to make genealogy as easy to use as the Internet itself and overcome the fear factor which keeps some from genealogy research. It prompts users on their next steps and poses questions that they could ask to further their research. I think that it appeared easy, but some of the MyFamily products (OneWorldTree) are easier. I think that the "demo effect" described by Norvig from Google may apply here. If a product is too easy to use, it becomes cumbersome as the users increase their abilities. I did like the interface, and the icons were cheerful. I also liked the timeline feature that allows users to organize data in chronological order.

Kevin Bottoms, a BYU grad student, designed a single front end that would interface with the various back ends (MyFamily, FamilySearch, web, etc.). I don't know how practical it would be, and I don't know if MyFamily would allow their info to be used this way.

Q&A at end of conference
Deryle Lonsdale said that Utah has specialties (genealogy, linguistics, and computational technologies) that set us apart from the rest of the world.
There are 6500 active languages in the world.

Dan Olsen: Family history should be about lives and memories and not about data elements.
Panelists provided 3 things that are needed to improve genealogy:

Dan Olsen:
*Digital libraries: families live in homes and data must be accessible from home
*Sources should volunteer and not have to be hunted down.
*We need better geographic relations in our genealogy data

Deryle Lonsdale:
*We need a neutral, friendly location to share and collaborate on our data
*We need the ability to evaluate and rate information. We must define who is the authority for data repositories.
*Better assimilation of data to make sense out of it

Curt Wincher:
*Capture, store, and share any digital anything without fear of losing it

David ___ (of LDS Church):
*Long term preservation of valuable content.
*Critical mass of data collections (1/3 of all live births today are not documented)
*Improved user experience
*Establish authorities and standards
*Improve search results by removing 'noise'

Lunch presentation by Curt Wincher: We need to do a much better job of sharing important information with members of our family and affinity groups.

"It is in there, but we can't find it." We need better access to information that we know exists. We want to use all of our toys: personal documents, email, digital photos, analog photos, maps, GPS, URLs, etc. Many more of today's genealogists are digital and web only, compared with previous generations of analog genealogists. Google has raised everyone's expectations for ease of use in finding information.

Friday, June 09, 2006

Pentagon sets its sights on social networking websites

Why does the NSA need to go tap phones when people are posting all sorts of private data on public forums?

http://www.newscientist.com/article/mg19025556.200?DCMP=NLC-nletter&nsref=mg19025556.200

"New Scientist has discovered that Pentagon's National Security Agency, which specialises in eavesdropping and code-breaking, is funding research into the mass harvesting of the information that people post about themselves on social networks."

Al Jazeera links to WorldHistory.com

A very interesting thing happened yesterday (June 8, 2006) on WorldHistory.com. This was the day that the international news story broke about the bombing and killing of Abu Musab Al-Zarqawi in Iraq.

Abu Musab Al-Zarqawi was the leader of Al Qaeda in Iraq. He was second only to Osama bin Laden in being wanted by Allied forces. Both of them had a $25 million bounty on their heads. Al-Zarqawi was the day-to-day operations manager for Al Qaeda and fought on the front lines of the insurgency in Iraq, while Osama is the world-wide, philosophical leader who hides out in caves along the Afghanistan and Pakistan border.

A U.S. F-16 dropped two 500-pound bombs on the house that Al-Zarqawi was in. This action launched news headlines around the world. Al Jazeera (www.aljazeera.net or english.aljazeera.net), an Arabic equivalent of CNN, posted several links on their web site to relevant articles on WorldHistory.com. Al Jazeera regularly links to WorldHistory.com for background information on news stories. WorldHistory.com has in-depth articles on Osama bin Laden (www.worldhistory.com/article.php?q=22468), Iraq (media.worldhistory.com/ciawfb/file/iz.htm), Afghanistan (media.worldhistory.com/ciawfb/file/af.htm), and even Zarqa, Jordan, the city where Abu Musab Al-Zarqawi is from (www.worldhistory.com/article.php?q=1175187).

Yesterday, WorldHistory.com experienced a 622% spike in page views. Links from the Al Jazeera web site accounted for 80% of the page views for the day! The Google AdSense revenue for WorldHistory.com increased over 10 fold from the previous day!

WorldHistory.com is such a great resource, with all of the world history data that we can accumulate in a fast and easy to use web site. We at Provo Labs are regularly adding high quality content as it becomes available. There are over 20 million records that represent approximately 1 million articles covering everything imaginable. The content on WorldHistory.com comes from Wikipedia, the Cyclopedia of Classified Dates, the CIA World Factbook, world classics, several map collections, and much more. There are approximately 250,000 high-quality, full size images that are searchable by descriptive tags.

The WorldHistory.com web site is an amazing resource that is now getting the international recognition that it deserves.