postgresql - Summarize repeated data in a Postgres table -
i have postgres 9.1 table called ngram_sightings. each row record of seeing ngram in document. ngram can appear multiple times in given document.
create table ngram_sightings ( ngram varchar, doc_id integer );
i want summarize table in table called ngram_counts.
create table ngram_counts ( ngram varchar primary index, -- number of unique doc_ids given ngram doc_count integer, -- count of given ngram in ngram_sightings corpus_count integer );
what best way this?
ngram_sightings ~1 billion rows. should create index on ngram_sightings.ngram first?
give shot!
insert ngram_counts (ngram, doc_count, corpus_count) select ngram , count(distinct doc_id) doc_count , count(*) corpus_count ngram_counts group ngram;
-- edit --
here longer version using temporary tables. first, count how many documents each ngram associated with. i'm using 'tf' "term frequency" , 'df' "doc frequency", since heading in direction of tf-idf vectorization , may use standard language, next few steps.
create temporary table ngram_df select ngram , count(distinct doc_id) df ngram_counts group ngram;
now can create table total count of each ngram.
create temporary table ngram_tf select ngram , count(*) tf ngram_counts group ngram;
then join 2 on ngram.
create table ngram_tfidf select tf.ngram , tf.tf , df.df ngram_tf inner join ngram_df on ngram_tf.ngram = ngram_df.ngram;
at point, expect looking ngram quite bit, makes sense index last table on ngram. keep me posted!
Comments
Post a Comment