Today I want to talk about how to calculate TF-IDF with Hadoop Streaming. First of all, for those who don’t know what TF-IDF is, here is a short explanation. It is a statistical metric that reflects how important a word is to a document in a collection. The higher the TF-IDF value for a particular word and a particular document, the more frequently that word appears in the document and the more rarely it appears in other documents. You can find more information in the Wikipedia article. It’s…
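
To make the definition concrete, here is a minimal single-machine sketch in plain Python, without Hadoop. It assumes one common variant of the formula (term frequency as count divided by document length, inverse document frequency as the natural log of the number of documents over the number of documents containing the word); the function name `tf_idf` and the toy corpus are made up for illustration and are not taken from the rest of this post.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf} dict per document.

    docs is a list of token lists. Assumed variant (other weightings exist):
    tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "hadoop streaming runs mappers and reducers".split(),
    "hadoop stores files in hdfs".split(),
    "streaming streaming streaming".split(),
]
scores = tf_idf(docs)
```

In this toy corpus, "hdfs" scores higher than "hadoop" within the second document: both appear once there, but "hdfs" occurs in no other document while "hadoop" occurs in two of the three. A word that appears in every document gets a score of zero under this variant.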