postgresql - Summarize repeated data in a Postgres table -


i have postgres 9.1 table called ngram_sightings. each row record of seeing ngram in document. ngram can appear multiple times in given document.

 create table ngram_sightings (     ngram varchar,     doc_id integer ); 

i want summarize table in table called ngram_counts.

create table ngram_counts (     ngram varchar primary index,      -- number of unique doc_ids given ngram     doc_count integer,      -- count of given ngram in ngram_sightings     corpus_count integer ); 

what best way this?

ngram_sightings ~1 billion rows. should create index on ngram_sightings.ngram first?

give shot!

insert ngram_counts (ngram, doc_count, corpus_count) select   ngram , count(distinct doc_id) doc_count , count(*) corpus_count  ngram_counts group ngram; 

-- edit --

here longer version using temporary tables. first, count how many documents each ngram associated with. i'm using 'tf' "term frequency" , 'df' "doc frequency", since heading in direction of tf-idf vectorization , may use standard language, next few steps.

 create temporary table ngram_df  select    ngram  , count(distinct doc_id) df  ngram_counts  group ngram; 

now can create table total count of each ngram.

create temporary table ngram_tf  select    ngram  , count(*) tf  ngram_counts  group ngram; 

then join 2 on ngram.

create table ngram_tfidf select   tf.ngram , tf.tf , df.df ngram_tf inner join ngram_df on ngram_tf.ngram = ngram_df.ngram; 

at point, expect looking ngram quite bit, makes sense index last table on ngram. keep me posted!


Comments

Popular posts from this blog

toolbar - How to add link to user registration inside toobar in admin joomla 3 custom component -

linux - disk space limitation when creating war file -