LinkLog: WikiData


Wikidata is a free knowledge base that can be read and edited by humans and machines alike. It is for data what Wikimedia Commons is for media files: it centralizes access to and management of structured data, such as interwiki references and statistical information. Wikidata contains data in all languages for which there are Wikimedia projects.

The initial development of Wikidata is funded by donations by:


Reached here from an article on Future of Search


Incremental Knowledge Discovery – Finding Some Popular Terms

We start our Knowledge Discovery Journey on Cloud Computing with a few simple steps. The first step is to find some of the Cloud Computing terms. It is very easy for humans to just look at a document and quickly identify relevant terms. However, that assumes you already have some knowledge of the topic. Since we want to automate this process as much as possible, we will use some simple tools.

  1. The first tool is Google Search. We simply search on the term “Cloud Computing” and the first entry happens to point to this Wikipedia page. For this stage of the experiment, we will take that page to be a reasonable representation of the current information about cloud computing.
  2. How do we know that this information is current? A look at the history of edits shows that it is being updated almost daily (this is one of the benefits of sources like Wikipedia).
  3. We will parse this page and find the top 20 most frequent bigrams (pairs of words). We do this with a simple Python program that uses the Natural Language Toolkit (NLTK) library (there are other methods of doing this as well).
  4. We pick a few of the more interesting terms. In the list below, the first column represents the term and the second column the number of times the term occurs in the document.
cloud computing 156
cloud services 14
public cloud 14
private cloud 13
hybrid cloud 10
cloud providers 8
cloud applications 7
cloud based 7
cloud cloud 7
cloud infrastructure 7
cloud service 7
category cloud 6
heterogeneous cloud 6
use cloud 6
cloud clients 5
cloud environment 5
cloud storage 5
cloud symbol 5
cloud user 5
data cloud 5
software cloud 5

Now we have a very crude version of the vocabulary on Cloud computing. This provides us a good starting point for further searches. Before we do that, we will eliminate some of the terms (like cloud cloud).
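The counting and clean-up steps can be sketched in a few lines. This is not the program used for the list above (that one relied on NLTK); it is a minimal standard-library stand-in. The function names, the sample text and the tiny stopword set are all made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_bigrams(text, n=20, stopwords=frozenset()):
    """Return the n most frequent adjacent word pairs as (term, count) tuples."""
    # Stopwords are dropped before pairing, so a pair can span a removed word.
    # That is crude, but it is good enough for a first vocabulary.
    tokens = [t for t in tokenize(text) if t not in stopwords]
    counts = Counter(zip(tokens, tokens[1:]))
    return [(" ".join(pair), c) for pair, c in counts.most_common(n)]


def drop_repeated_word_terms(terms):
    """Remove terms such as "cloud cloud" where both words are the same."""
    return [(term, c) for term, c in terms if len(set(term.split())) > 1]


# Hypothetical sample text standing in for the parsed Wikipedia page.
sample = (
    "Cloud computing relies on cloud services. "
    "A public cloud and a private cloud can form a hybrid cloud. "
    "Cloud computing providers sell cloud services."
)
stop = frozenset({"a", "and", "can", "on", "the"})

raw_terms = top_bigrams(sample, n=20, stopwords=stop)
terms = drop_repeated_word_terms(raw_terms)
for term, count in terms:
    print(term, count)
```

On a real page you would pass the extracted article text instead of `sample` and use a fuller stopword list.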

We can improve this process in several ways.

  1. We can look at more than one page or document. A good candidate is NIST’s Cloud Computing Definition document, which is listed as one of the references on the Wikipedia page. There may be others. If we use multiple documents, we may use tf-idf (term frequency-inverse document frequency) or some other metric.
  2. We can repeat the term frequency program to include trigrams (triple words like “cloud computing platforms”) and add them to the list.
  3. There are other (better) ways to get the terms and we will reserve that option for the future. A web search for “cloud terminology”, “cloud ontology” reveals some interesting sources like this one – a dictionary of cloud terms.
  4. Our quest, however, is to come up with simple methods of generating these terms ourselves. There are two reasons for doing this. One is that we may need to research topics that are not as popular as cloud computing for which the terminology may not exist. The second reason is that if we know how to automate and refine these terms, we can keep them updated as frequently as we want.
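As a rough illustration of the first two improvements, here is a minimal sketch that extracts n-grams (bigrams or trigrams) and scores them with a plain tf-idf across several documents. The three one-line "documents" are invented placeholders, and the scoring uses the simplest textbook form (relative term frequency times the log of total documents over document frequency); other weightings are just as valid.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Return the n-word terms in text (n=2 for bigrams, n=3 for trigrams)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents, n=2):
    """Return one {term: score} dict per document."""
    term_counts = [Counter(ngrams(doc, n)) for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for counts in term_counts for term in counts)
    total_docs = len(documents)
    scores = []
    for counts in term_counts:
        total_terms = sum(counts.values()) or 1
        scores.append({
            term: (count / total_terms) * math.log(total_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical stand-ins for the Wikipedia page, the NIST definition, etc.
docs = [
    "cloud computing uses shared cloud infrastructure",
    "cloud computing offers measured service",
    "grid computing predates cloud computing",
]
scores = tf_idf(docs, n=2)
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")

print(ngrams("cloud computing platforms", 3))
```

A term that appears in every document (here "cloud computing") scores zero, which is the point of the idf part: it pushes down terms that do not distinguish one source from another.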


If you want the (really crude) Python program I used to derive these terms, you can find it here.

We used another experimental tag cloud generator we built to visualize these tags.


Next step:

We will use the terms to find some relevant sources of information.

On Experiments

From Acquiring Knowledge

The greatest experiment is nearly always a solo. The individual, seeking to learn, tries something new but only tries it on himself. If he fails, he has hurt only himself. If he succeeds he has made a discovery many people can use. Experiment only with your own time, your own money, your own labor. That’s the honest, sincere type of experiment. It’s rich.

LinkLog: Incremental Discovery of Knowledge

From the dissertation Incremental Knowledge Discovery in Online Social Media by Xuning Tang:

In light of the prosperity of online social media, Web users are shifting from data consumers to data producers. To catch the pulse of this rapidly changing world, it is critical to transform online social media data to information and to knowledge.

This dissertation centers on the issue of modeling the dynamics of user communities, trending stories, topics and user interests in online social media. However, knowledge discovery and management in online social media is challenging because: 1) social media data arrive in the form of continuous streams; 2) the volume of social media data is potentially infinite; and 3) more importantly, social media data is very complex which consists of network, text, tag, click and other information.

via Google Scholar Alert on “Topic Modeling”

Twitter And Knowledge Discovery

You can discover four types of Knowledge (from Twitter). The classification is inspired by Bloom’s Taxonomy.

  1. Factual Knowledge
  2. Conceptual and Meta Knowledge

Let us briefly look at each type with a couple of examples.

Factual Knowledge – Some Observations

  • You get more news from Twitter than from newspapers
  • You get more real-time news than from any other single medium
  • You get reactions to news faster, as well as reactions to reactions (second and third waves)
  • Hot news pops up as trending topics, making it more discoverable
  • Patterns of information propagation increase your meta knowledge (who, what, why etc.)

Conceptual and Meta Knowledge

In addition to bits of information, you also get higher levels of knowledge, if you just take the time to analyze it.

  • You can glean inter-relationships and Structure by analyzing followers, lists and retweets and other referral formats. For example, by looking at the people and lists that thought leaders (like Tim O’Reilly) follow you can get some sense of the information relationships. By looking at the lists Tim is in, you can also understand a lot more of his following.
  • You can identify influencers and experts in various topics and industries. Klout tries to do this a bit. You can look at the reach and network effect of certain people on Twitter. You need to augment this analysis by looking beyond Twitter, but Twitter gives you some great starting points.
  • Everyone on Twitter is reachable. For example, if you want me to notice something, you can add @dorait to your tweet to draw my attention. If the information you share makes sense to my audience or appeals to me emotionally, I may retweet it. Guy Kawasaki once mentioned that he looks at tweets in which he is tagged.
  • By analyzing the retweet patterns of experts, you can understand their areas of interest and spheres of influence. You can do it with a few open source tools.
  • You can understand how information propagates (what, why, how, when) by analyzing tweets. Organizations like InfoChimps and DataSift can provide you with a large body of tweets you can use for research. You can make intelligent guesses based on the velocity of propagation (the speed at which topics trend).
  • A lot of papers have been written about Modeling Topics from Tweets. You can use several techniques to create your own watch signals in a specific space (market, industry segment, geographic region etc.)
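To make the retweet-pattern idea a little more concrete, here is a small sketch. It assumes you already have tweets as simple dictionaries with `user`, `retweet_of` and `hashtags` fields. That record layout is made up for illustration and is not the Twitter API format, and the handles in the sample are invented.

```python
from collections import Counter


def retweet_profile(tweets, user):
    """Summarize whom `user` retweets and which hashtags those retweets carry."""
    sources = Counter()  # original authors this user retweets
    topics = Counter()   # hashtags on the retweeted items
    for tweet in tweets:
        if tweet["user"] == user and tweet.get("retweet_of"):
            sources[tweet["retweet_of"]] += 1
            topics.update(tag.lower() for tag in tweet.get("hashtags", []))
    return sources, topics


# Hypothetical sample records.
tweets = [
    {"user": "expert_a", "retweet_of": "cloud_news", "hashtags": ["Cloud", "PaaS"]},
    {"user": "expert_a", "retweet_of": "cloud_news", "hashtags": ["cloud"]},
    {"user": "expert_a", "retweet_of": "py_daily", "hashtags": ["python"]},
    {"user": "expert_a", "retweet_of": None, "hashtags": ["coffee"]},
    {"user": "someone_else", "retweet_of": "cloud_news", "hashtags": ["cloud"]},
]

sources, topics = retweet_profile(tweets, "expert_a")
print(sources.most_common(2))
print(topics.most_common(2))
```

The `sources` counts hint at a sphere of influence (whose content gets amplified), and the `topics` counts hint at areas of interest. Both are only starting points and need checking against data beyond Twitter.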

If you are interested in this area contact me at dorait (at) Will be happy to answer any questions, elaborate some of these ideas and have a chat.

Cloud Computing – Incremental Knowledge Discovery

We are building and testing tools for research. Our (experimental) tools help us gather information and help us in incremental knowledge discovery.

Let us start with a simple experimental topic that is also a rapidly emerging trend – “Cloud Computing”. Our goal is to gain as much knowledge about Cloud Computing as possible. We will use two different approaches.

6W Framework

  1. What – A set of what questions. What is cloud computing? What is the difference between public, private, hybrid clouds? What is the difference between the Web and the Cloud?
  2. Why – A set of why questions. Why should your business care? Why now?
  3. Who – A set of who questions. Who is driving it? Who is adopting it?
  4. When – A set of when questions. When is it appropriate for your business? When is it not?
  5. Where – A set of where questions. Where is it being adopted? Where is it successfully being used?
  6. How – A set of how questions. How do you get started? How do you evaluate it? How do you find the ROI for your business?

To gain some useful knowledge, we need to look at several aspects. These include (in no particular order):

  1. Technologies
  2. Vendors
  3. Market Segments
  4. Products
  5. People (Experts, Influencers)
  6. Trends
  7. Topics (Vocabulary and Ontology)
  8. Research
  9. Patents
  10. Investors and funding
  11. Applications
  12. Adoption
  13. Drivers – that support the trend
  14. Barriers to adoption
  15. Intersections (with other emerging trends)
  16. Events
  17. Publications
  18. Communities
  19. Knowledge Bases
  20. Opportunities

We will piece together this knowledge about cloud computing step by step (over several posts).

What Makes a Community Vibrant?

There are lots of reasons why some communities flourish and others languish. Here are a few things that make a community vibrant.

  1. A sense of belonging
  2. A sense of sharing and helping other members of the community
  3. A few champions that seed the community and help it grow
  4. A set of core members that provide some sense of continuity
  5. A common purpose or cause may help as well

There may be a lot of other factors. You are welcome to add the ones I missed.

LinkLog: PaaS in IT Shops

Apprenda’s case study outlines JP Morgan Chase’s use of private PaaS, an example of PaaS in the IT shops of large enterprises. Some of the measures of success in the case study were:

  • 2000 applications hosted
  • 700% improvement in developer productivity
  • 70% increase in infrastructure utilization
  • 50-day improvement in average application time to market

via Reflections on the Future of Platform as a Service (PaaS)

BookLog: The Art of Explanation

From The Art of Explanation

The Art of Explanation is built on my years of experience in creating explanations for organizations and educators. My company, Common Craft, is known around the world for making complex ideas easy to understand in the form of short videos. Through projects with companies such as Google, LEGO, Intel, and Ford Motor Company and the creation of our own library of video explanations, we have been students of explanation for many years. We have experimented and studied the explanation process and seen what is possible. Our videos have been viewed more than 50 million times online, and no other brand is better known for explanations.

This book, however, is not a series of case studies and exercises or an academic exploration of “the science of explanation.” More than anything, it is a manifesto based on our experiences as professional explainers. We believe deeply in the power of explanation and see this book as an invitation to recognize that power by looking at explanation from a new perspective. When you do, you will see that it represents an unexplored part of your communications, a skill you can understand, practice, and improve.

The various ideas, approaches, and models I provide in these pages are secondary to a simple, higher-level goal: to make explanation a priority. This means thinking about how you explain ideas and how you can put explanations to work to accomplish your goals. It requires that you use explanation as a strategy in problem solving. You must also introduce others to the idea that explanations can create positive change.

I am a big fan of the Common Craft In Plain English videos. I use them in my talks and recommend them to others. It is nice to see a book that explains the art of explanation. I am looking forward to reading it and will be back with some notes and lessons in a future post.

A Few Python Links: Dictionary Idioms, PyMongo, A Book and A Python Analysis Environment

A few selected links from TopicMinder alerts for “Python”.

Python Expression Idiom: Merging Dictionaries

It’s common in Python programming to need to merge 2 or more dictionaries together.

The first idiom is using the dict constructor. This idiom has its limitations; however, it will always work fine as long as the keys are all strings.
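A quick sketch of that idiom, plus what the string-key limitation looks like on Python 3. The dictionaries here are made-up examples, and the copy-then-update fallback is just one common alternative, not necessarily the one the linked post recommends.

```python
defaults = {"host": "localhost", "port": 8080}
overrides = {"port": 9090, "debug": True}

# dict constructor idiom: copy `defaults`, then apply `overrides` as keyword
# arguments. When a key appears in both, the later value wins.
merged = dict(defaults, **overrides)
print(merged)  # {'host': 'localhost', 'port': 9090, 'debug': True}

# The limitation: keyword arguments must be strings, so non-string keys fail.
by_id = {1: "alice"}
more_ids = {2: "bob"}
try:
    dict(by_id, **more_ids)
    constructor_ok = True
except TypeError:
    constructor_ok = False

# Copy-then-update works for any hashable keys.
merged_ids = dict(by_id)
merged_ids.update(more_ids)
print(constructor_ok, merged_ids)
```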

Python Expression Idiom: Dict Slicing

Python dicts do not natively support slicing. One of the issues is slicing in Python seems limited to defined ranges, rather than an ad-hoc collection of values.
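One simple way to get the effect of "slicing" a dict by an ad-hoc set of keys is a dict comprehension. The helper name `dict_slice` is hypothetical (it is not a built-in), and the linked post may well use a different technique.

```python
def dict_slice(mapping, keys):
    """Return a new dict with only the requested keys that exist in mapping."""
    return {key: mapping[key] for key in keys if key in mapping}


# Made-up record for illustration.
record = {"name": "PyMongo", "language": "Python", "license": "Apache", "stars": 100}
subset = dict_slice(record, ["name", "license", "missing"])
print(subset)  # {'name': 'PyMongo', 'license': 'Apache'}
```

Keys that are not present are silently skipped here. If you would rather fail loudly, drop the `if key in mapping` filter so a missing key raises `KeyError`.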

Dev of the Week: A. Jesse Jiryu Davis – An interview with the maintainer of PyMongo

I’m a Python developer at 10gen, the company that makes MongoDB. I help maintain the standard Python driver for MongoDB (PyMongo), and I’m the author of a non-blocking driver called Motor. Both are open source. Coders at 10gen wear lots of bonnets: I do customer support, blogging, consulting, and speaking, and I spend a lot of time making open source contributions and working with people who contribute to our projects.

Hacking Secret Ciphers with Python Released

My third book, Hacking Secret Ciphers with Python, is finished. It is free to download under a Creative Commons license, and available for purchase as a physical book on Amazon for $25 (which qualifies it for free shipping). This book is aimed at people who have no experience programming or with cryptography. The book goes through writing Python programs that not only implement several ciphers but also can hack these ciphers.

Enthought Canopy – A Python Analysis Environment

Enthought Canopy is a comprehensive Python analysis environment with easy installation & updates of the proven Enthought Python distribution – all part of a robust platform you can explore, develop and visualize on.