This repository was archived by the owner on Feb 18, 2026. It is now read-only.

rav009/PhraseExtract

PhraseExtract

  • The master branch is the prototype. For more details, see the azure_hdinsight branch.
  • Use the following command to find frequently occurring sentences (generic options such as -files and -D should be placed before the command options):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.8.0.jar \
	-files /home/rav009/PycharmProjects/untitled/PhraseExtract/sentence_mapper.py,/home/rav009/PycharmProjects/untitled/PhraseExtract/sentence_reducer.py \
	-D mapred.map.tasks=7 \
	-D mapred.reduce.tasks=3 \
	-input /input/text.txt \
	-output /sentences/above100/ \
	-mapper "python sentence_mapper.py" \
	-reducer "python sentence_reducer.py -t 100"

python sentence_reducer.py -t 100 outputs every sentence that appears more than 100 times.
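
As a rough illustration of the thresholding step, here is a minimal sketch of what a streaming reducer of this kind could look like. It is not the repository's sentence_reducer.py: the function name reduce_counts, the tab-separated "sentence, count" input format, and the strict "more than" comparison are all assumptions made for the example.

```python
from itertools import groupby


def reduce_counts(lines, threshold):
    """Yield (sentence, total) for sentences counted more than `threshold` times.

    Assumes each input line is "sentence<TAB>count" and that lines arrive
    sorted by sentence, which is how Hadoop Streaming feeds a reducer.
    """
    # Skip blank or malformed lines that carry no tab-separated count.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for sentence, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        if total > threshold:  # assumption: strictly more than the threshold
            yield sentence, total


def format_output(results):
    """Render results in the tab-separated form a streaming job would print."""
    return ["%s\t%d" % (sentence, total) for sentence, total in results]
```

In a real streaming script, the lines would come from sys.stdin and the threshold from the -t option; both are left out here so the logic can be read and tested on its own.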



  • Use the following command to find frequently occurring phrases that contain 2 or 3 words:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.8.0.jar \
	-files /home/rav009/PycharmProjects/untitled/PhraseExtract/phrase_mapper.py,/home/rav009/PycharmProjects/untitled/PhraseExtract/phrase_reducer.py,hdfs://127.0.0.1:9000/sentences/above100/part-00000 \
	-D mapred.map.tasks=4 \
	-D mapred.reduce.tasks=4 \
	-D mapred.text.key.partitioner.options=-k1 \
	-input hdfs://namenode/input.txt \
	-output /phrase/above2000 \
	-mapper "python phrase_mapper.py -l 3" \
	-reducer "python phrase_reducer.py -t 2000 -c" \
	-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

python phrase_mapper.py -l 3 generates phrases that contain at most 3 words.
python phrase_reducer.py -t 2000 -c sets the phrase frequency threshold to 2000 and also outputs the ID number of each passage (assuming the ID and the content are separated by '|').
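
The two options above can be sketched as follows. This is only an illustration under stated assumptions, not the repository's phrase_mapper.py or phrase_reducer.py: the helper names (phrases, map_passage, reduce_phrases), the whitespace tokenization, the minimum phrase length of 2 words, and the strict threshold comparison are all hypothetical choices.

```python
from itertools import groupby


def phrases(text, max_len, min_len=2):
    """Yield every run of min_len..max_len consecutive words in `text`."""
    words = text.split()  # assumption: words are separated by whitespace
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def map_passage(line, max_len):
    """Split "ID|content" and yield (phrase, passage_id) pairs."""
    passage_id, _, content = line.rstrip("\n").partition("|")
    for phrase in phrases(content, max_len):
        yield phrase, passage_id


def reduce_phrases(pairs, threshold, with_ids=False):
    """Keep phrases seen more than `threshold` times.

    `pairs` must be sorted by phrase. With with_ids=True (the role assumed
    for -c), the sorted distinct passage IDs are included in the result.
    """
    for phrase, group in groupby(pairs, key=lambda kv: kv[0]):
        ids = [passage_id for _, passage_id in group]
        if len(ids) > threshold:  # assumption: strictly more than the threshold
            if with_ids:
                yield phrase, len(ids), sorted(set(ids))
            else:
                yield phrase, len(ids)
```

For example, sorting the pairs produced by map_passage for a handful of "ID|content" lines and passing them to reduce_phrases mimics, on a single machine, the shuffle and reduce stages that Hadoop performs between the mapper and the reducer.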

  • The zip file contains the Kettle ETL project and the SSAS project.

About

A Python-based MapReduce framework that extracts phrases from large volumes of text according to phrase frequency.