> > Hmm, right. Part of my motivation for doing it this way is that it
> > lets you filter out junk (page templates, blogrolls etc) that would
> > otherwise screw up searches. So right now it only really searches
>
> Yes, I understand that. The problem is that PyDS isn't specifically
> blogging software - that's just one part. So first I would have to add
> bags of mirrorPost calls to all the places where posts or other content
> are produced. I am still unsure how to do that exactly - for example,
> what about structured text elements? I think it would be better to push
> in the reST source and not the HTML, as the HTML would trigger matches
> for "strong", for example, wherever <strong> is used. Hmm. I can't put
> my finger on exactly why I feel uncomfortable, though.
>
> Maybe combining Swish with mirror searches is a solution. At least it
> would make weblog posts and other microcontent easier to target in
> searches and would allow keeping the rest as it is (as full pages). But
> it _feels_ icky; I just don't know what a better way would be ;-)
I've been thinking about this for a good nine months or so, and this
is the most useful-feeling search function I've come up with :-)
I guess you can look at search in two ways:
1. The search engine is responsible for looking at the HTML and
figuring out what to search
2. The blogging tool is responsible for telling the search engine
about anything worth searching
Google/ht://Dig do it the first way, and Lucene/new-PyCS-code do it
the second way. I'm convinced that the second way gives better
results, but it _does_ require more effort on the part of the blogging
tool.
If you felt like hacking the Swish indexer, you could get it to index
the mirror as well, so then you could use Swish to search everything.
That would make it quicker, and bring everything back into one place.
But it would be even _more_ work ... :-)
Cheers,
Phil
Hi!
> 1400 posts?!
Yep, blogging _does_ work for me ;-)
> Hmm, right. Part of my motivation for doing it this way is that it
> lets you filter out junk (page templates, blogrolls etc) that would
> otherwise screw up searches. So right now it only really searches
Yes, I understand that. The problem is that PyDS isn't specifically
blogging software - that's just one part. So first I would have to add
bags of mirrorPost calls to all the places where posts or other content
are produced. I am still unsure how to do that exactly - for example,
what about structured text elements? I think it would be better to push
in the reST source and not the HTML, as the HTML would trigger matches
for "strong", for example, wherever <strong> is used. Hmm. I can't put
my finger on exactly why I feel uncomfortable, though.
Maybe combining Swish with mirror searches is a solution. At least it
would make weblog posts and other microcontent easier to target in
searches and would allow keeping the rest as it is (as full pages). But
it _feels_ icky; I just don't know what a better way would be ;-)
bye, Georg
> > Sound OK?
>
> Depends. :-)
>
> I still have the problem that PyDS creates much more than just posts.
> What about that stuff? I would like to search it, too. So seeding
> would require pushing in much more than just my 1400 posts ...
1400 posts?!
It would be interesting to see some benchmark results for your blog --
search for something common, and go right to the last page of the
results, then run ApacheBench on that URL. I've benchmarked the raw
database access (the search code can run through 450 posts [from
Second p0st] about 45 times a second) but haven't tried it since
plugging it in as /system/search.py.
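For a rough idea of what that raw figure means, here is a hedged sketch (not the actual PyCS code - the post structs and the naive substring scan are stand-ins) that times a linear search over a synthetic table of 450 posts:

```python
import time

def make_posts(n):
    """Generate synthetic posts standing in for a mirrored-post table."""
    return [{'postid': str(i),
             'title': 'Post %d' % i,
             'text': 'body text for post %d python search' % i}
            for i in range(n)]

def search(posts, term):
    """Naive linear scan, roughly the shape of a small mirror search."""
    term = term.lower()
    return [p for p in posts
            if term in p['title'].lower() or term in p['text'].lower()]

posts = make_posts(450)
start = time.perf_counter()
runs = 100
for _ in range(runs):
    hits = search(posts, 'python')
elapsed = time.perf_counter() - start
print('%d posts, %.1f scans/second' % (len(posts), runs / elapsed))
```

ApacheBench against the live /system/search.py URL would of course also measure XML-RPC/CGI overhead on top of the scan itself.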
> The problem is that your solution only searches what is in the database.
> And we need to think about what to do with content that is upstreamed but
> not mirrored to the search database. For example, what about files that
> people just put into their upstreaming spool? These are upstreamed (both
> PyDS and Radio have this feature) but not generated, and so they aren't
> mirrored.
>
> Maybe upstreamed files should be automatically mirrored, too? Hmm. That
> would produce duplicates of weblog content ...
>
> So, no, I am not fully satisfied ;-)
Hmm, right. Part of my motivation for doing it this way is that it
lets you filter out junk (page templates, blogrolls etc) that would
otherwise screw up searches. So right now it only really searches
blog posts and stories, if you send them specifically (I only have
bzero sending posts). I don't really want to search everything
(e.g. the 500K HTML file that contains the first run of the Blogging
Ecosystem), just stuff I've written.
If you _do_ want to search everything, it might make sense to run the
query both through the new search code and through Swish, and combine
the results. When you index with Swish, if the user has sent in some
posts to mirror, get Swish to ignore all URLs that have been sent via
mirrorPosts() but index everything else. Then combine the results
later on...
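A sketch of that combination step, assuming both engines return lists of hit dicts with a 'url' key (hypothetical shapes - the real result structures may well differ):

```python
def combine_results(mirror_hits, swish_hits, mirrored_urls):
    """Merge results from the post-mirror search and from Swish.

    Swish indexes the whole site, so drop any Swish hit whose URL was
    already sent in via mirrorPosts() - the mirror search covers those
    with cleaner (template-free) text.  Mirror hits are listed first.
    """
    mirrored = set(mirrored_urls)
    seen = set()
    combined = []
    for hit in mirror_hits:
        if hit['url'] not in seen:
            seen.add(hit['url'])
            combined.append(hit)
    for hit in swish_hits:
        if hit['url'] not in mirrored and hit['url'] not in seen:
            seen.add(hit['url'])
            combined.append(hit)
    return combined
```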
Does that sound better?
Cheers,
Phil :)
Hi!
> Sorry, I didn't explain clearly. mirrorPosts _does_ just add/update
> posts. You send it a few posts, and it adds them into the database if
Ok, that's clearer ;-)
> A deleteMirroredPosts call is required, yes. Perhaps
> xmlStorageSystem.deleteMirroredPosts(email, password, posts), with
> 'posts' being a list of postids to delete.
Yep, one is definitely needed.
> If it's possible for a post to have more than one category (i.e. more
> than one URL), this would force you to only pass one in. Do you think
> this could be a problem?
No, I already send in only one if I need to create a GUID - I just use
the homepage URL or else the first category.
> Sound OK?
Depends. :-)
I still have the problem that PyDS creates much more than just posts.
What about that stuff? I would like to search it, too. So seeding
would require pushing in much more than just my 1400 posts ...
The problem is that your solution only searches what is in the database.
And we need to think about what to do with content that is upstreamed but
not mirrored to the search database. For example, what about files that
people just put into their upstreaming spool? These are upstreamed (both
PyDS and Radio have this feature) but not generated, and so they aren't
mirrored.
Maybe upstreamed files should be automatically mirrored, too? Hmm. That
would produce duplicates of weblog content ...
So, no, I am not fully satisfied ;-)
bye, Georg
> Hmm. I think that is not sufficient. Reason: what if people want to remove
> posts? What if they update posts? Should people always send in all their
> posts? (That wouldn't be a real option for me - too many posts.)
>
> So how would you decide what to do with the posts in the database? I think
> mirrorPosts should just add/update posts, and deleteMirroredPosts should
> take a list of post ids and delete them.
Sorry, I didn't explain clearly. mirrorPosts _does_ just add/update
posts. You send it a few posts, and it adds them into the database if
it doesn't have them, or updates them if it does have them. If you're
seeding the database, you send it 10 posts at a time until you're
done. You never send all the posts at once :-)
A deleteMirroredPosts call is required, yes. Perhaps
xmlStorageSystem.deleteMirroredPosts(email, password, posts), with
'posts' being a list of postids to delete.
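As a sketch of the add/update/delete semantics being proposed (an in-memory dict stands in for the real database table; the class and method names are illustrative, not the actual PyCS code):

```python
class MirrorStore:
    """In-memory stand-in for the mirroredPosts database table."""

    def __init__(self):
        self.posts = {}  # postid -> post struct

    def mirror_posts(self, posts):
        """xmlStorageSystem.mirrorPosts semantics: add new posts, or
        overwrite existing ones with the same postid (update)."""
        for post in posts:
            self.posts[post['postid']] = post

    def delete_mirrored_posts(self, postids):
        """Proposed deleteMirroredPosts semantics: remove the given
        postids; unknown ids are silently ignored."""
        for postid in postids:
            self.posts.pop(postid, None)
```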
> PyDS does allow both post updates and post deletes. And then there is the
> problem of categories - posts can be on the homepage or not (as with Radio)
> and can be in categories or not. So people might change posts with regard
> to their homepage/categories and thereby move them from the homepage to a
> category or back. This would imply changed GUIDs (and therefore changes to
> the posts, too).
If a post only lives in one place, you can just call mirrorPosts again
and the GUID and URL will be updated.
If it's possible for a post to have more than one category (i.e. more
than one URL), this would force you to only pass one in. Do you think
this could be a problem?
Maybe in future we could give the mirroredPosts.posts table a new
field, category[catname:S], and allow searching by category...
> So I think we need explicit delete and update support. And maybe we need
> some way to sync - think of people doing strange things with their data
> who don't know which posts actually changed (like some changes to shortcuts
> in PyDS - these would change posts; even though that isn't upstreamed
> automatically at the moment, it might be some day when conditional
> rerendering is in place). Some sync call that returns a list of hashes to
> store in the client database might help, so people can say "look, this is
> the list of post hashes I have in my database; tell me which ones you need
> me to upload (again) to get in sync with me".
That's a good idea. "xmlStorageSystem.getMirroredPostSummary", which
returns a list of dicts, something like:
[{'postid': 'foo',
  'hash': md5("|".join([guid, url, title, text]))}]
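A sketch of how a client could use such a summary, using Python's hashlib (rather than the old md5 module) and assuming the hash covers guid, url, title and text as proposed:

```python
from hashlib import md5

def post_hash(post):
    """Hash of the fields that matter for sync, as proposed above."""
    joined = "|".join([post['guid'], post['url'],
                       post['title'], post['text']])
    return md5(joined.encode('utf-8')).hexdigest()

def posts_to_reupload(local_posts, server_summary):
    """Compare local posts against the server's summary.

    Returns the postids the client must send again: those missing on
    the server, or present with a different hash.
    """
    server = {entry['postid']: entry['hash'] for entry in server_summary}
    return [p['postid'] for p in local_posts
            if server.get(p['postid']) != post_hash(p)]
```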
> Oh, and we should return a new flag in the server capabilities to automatically
> enable this search mirror feature.
flCanMirrorPosts <-- always true
and then if the user has sent in some posts, flCanHostSearch should be
true and flSearchUrl should point to /system/search.py?u=0001234 ...
(I might have got those last two key names wrong; substitute whatever
RCS uses.)
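Putting the proposed flags together as a sketch (the key names are the guesses from above and may not match what RCS actually uses):

```python
def server_capabilities(user_has_mirrored_posts, usernum):
    """Sketch of the extra flags in the capabilities struct returned
    to the client: mirroring is always on, search only once the user
    has actually sent some posts in."""
    caps = {'flCanMirrorPosts': True}
    if user_has_mirrored_posts:
        caps['flCanHostSearch'] = True
        caps['flSearchUrl'] = '/system/search.py?u=' + usernum
    return caps
```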
Sound OK?
Cheers,
Phil :-)
Hi!
On Tue, Oct 14, 2003 at 11:10:10PM +1300, Phillip Pearson wrote:
> - added a new xmlStorageSystem.mirrorPosts(email,password,posts) method that
> takes a list of posts (structs: {postid,date,url,guid,title,text}) and dumps
> them in a new database table. It deletes them (searches by postid) if they
> already exist.
Hmm. I think that is not sufficient. Reason: what if people want to remove
posts? What if they update posts? Should people always send in all their
posts? (That wouldn't be a real option for me - too many posts.)
So how would you decide what to do with the posts in the database? I think
mirrorPosts should just add/update posts, and deleteMirroredPosts should
take a list of post ids and delete them.
PyDS does allow both post updates and post deletes. And then there is the
problem of categories - posts can be on the homepage or not (as with Radio)
and can be in categories or not. So people might change posts with regard
to their homepage/categories and thereby move them from the homepage to a
category or back. This would imply changed GUIDs (and therefore changes to
the posts, too).
So I think we need explicit delete and update support. And maybe we need
some way to sync - think of people doing strange things with their data
who don't know which posts actually changed (like some changes to shortcuts
in PyDS - these would change posts; even though that isn't upstreamed
automatically at the moment, it might be some day when conditional
rerendering is in place). Some sync call that returns a list of hashes to
store in the client database might help, so people can say "look, this is
the list of post hashes I have in my database; tell me which ones you need
me to upload (again) to get in sync with me".
Oh, and we should return a new flag in the server capabilities to automatically
enable this search mirror feature.
bye, Georg
Hi!
On Tue, Oct 14, 2003 at 11:10:10PM +1300, Phillip Pearson wrote:
> I just pushed a whole bunch of stuff into CVS. Some tidying / refactoring
I added German translations in search.py and fixed some broken translations.
bye, Georg