> > Hmm, right. Part of my motivation for doing it this way is that it
> > lets you filter out junk (page templates, blogrolls etc) that would
> > otherwise screw up searches. So right now it only really searches
>
> Yes, I understand that. The problem is that PyDS isn't specifically
> blogging software - that's just one part. So first I would have to add
> bags of mirrorPost calls to all the places where posts or other content
> are produced. I am still unsure how to do that exactly - for example,
> what about structured text elements? I think it would be better to push
> in the reST source and not the HTML, as the HTML would trigger matches
> for "strong", for example, wherever <strong> is used. Hmm. I can't put
> my finger on exactly why I feel uncomfortable, though.
>
> Maybe combining Swish with mirror searches is a solution. At least it
> would make weblog posts and other microcontent easier to target in
> searches and would allow keeping the rest as it is (as full pages). But
> it _feels_ icky; I just don't know what a better way would be ;-)
I've been thinking about this for a good nine months or so, and this
is the most useful-feeling search function I've come up with :-)
I guess you can look at search in two ways:
1. The search engine is responsible for looking at the HTML and
figuring out what to search
2. The blogging tool is responsible for telling the search engine
about anything worth searching
Google/ht://Dig do it the first way, and Lucene/new-PyCS-code do it
the second way. I'm convinced that the second way gives better
results, but it _does_ require more effort on the part of the blogging
tool.
If you felt like hacking the Swish indexer, you could get it to index
the mirror as well, so then you could use Swish to search everything.
That would make it quicker, and bring everything back into one place.
But it would be even _more_ work ... :-)
Cheers,
Phil
Hi!
> 1400 posts?!
Yep, blogging _does_ work for me ;-)
> Hmm, right. Part of my motivation for doing it this way is that it
> lets you filter out junk (page templates, blogrolls etc) that would
> otherwise screw up searches. So right now it only really searches
Yes, I understand that. The problem is that PyDS isn't specifically
blogging software - that's just one part. So first I would have to add
bags of mirrorPost calls to all the places where posts or other content
are produced. I am still unsure how to do that exactly - for example,
what about structured text elements? I think it would be better to push
in the reST source and not the HTML, as the HTML would trigger matches
for "strong", for example, wherever <strong> is used. Hmm. I can't put
my finger on exactly why I feel uncomfortable, though.
Maybe combining Swish with mirror searches is a solution. At least it
would make weblog posts and other microcontent easier to target in
searches and would allow keeping the rest as it is (as full pages). But
it _feels_ icky; I just don't know what a better way would be ;-)
bye, Georg
> > Sound OK?
>
> Depends. :-)
>
> I still have the problem that PyDS creates much more than just posts.
> What about that stuff? I would like to search it, too. So seeding
> would require pushing in much more than just my 1400 posts ...
1400 posts?!
It would be interesting to see some benchmark results for your blog --
search for something common, and go right to the last page of the
results, then run ApacheBench on that URL. I've benchmarked the raw
database access (the search code can run through 450 posts [from
Second p0st] about 45 times a second) but haven't tried it since
plugging it in as /system/search.py.
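For a rough idea of what that raw figure means, here is a hedged sketch (not the actual PyCS code - the post structs and the naive substring scan are stand-ins) that times a linear search over a synthetic table of 450 posts:

```python
import time

def make_posts(n):
    """Generate synthetic posts standing in for a mirrored-post table."""
    return [{'postid': str(i),
             'title': 'Post %d' % i,
             'text': 'body text for post %d python search' % i}
            for i in range(n)]

def search(posts, term):
    """Naive linear scan, roughly the shape of a small mirror search."""
    term = term.lower()
    return [p for p in posts
            if term in p['title'].lower() or term in p['text'].lower()]

posts = make_posts(450)
start = time.perf_counter()
runs = 100
for _ in range(runs):
    hits = search(posts, 'python')
elapsed = time.perf_counter() - start
print('%d posts, %.1f scans/second' % (len(posts), runs / elapsed))
```

ApacheBench against the live /system/search.py URL would of course also measure XML-RPC/CGI overhead on top of the scan itself.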
> The problem is that your solution only searches what is in the database.
> And we need to think about what to do with content that is upstreamed but
> not mirrored to the search database. For example, what about files that
> people just put into their upstreaming spool? These are upstreamed (both
> PyDS and Radio have this feature) but not generated, and so they aren't
> mirrored.
>
> Maybe upstreamed files should be automatically mirrored, too? Hmm. That
> would produce duplicates of weblog content ...
>
> So, no, I am not fully satisfied ;-)
Hmm, right. Part of my motivation for doing it this way is that it
lets you filter out junk (page templates, blogrolls etc) that would
otherwise screw up searches. So right now it only really searches
blog posts and stories, if you send them specifically (I only have
bzero sending posts). I don't really want to search everything
(e.g. the 500K HTML file that contains the first run of the Blogging
Ecosystem), just stuff I've written.
If you _do_ want to search everything, it might make sense to run the
query both through the new search code and through Swish, and combine
the results. When you index with Swish, if the user has sent in some
posts to mirror, get Swish to ignore all URLs that have been sent via
mirrorPosts() but index everything else. Then combine the results
later on...
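A sketch of that combination step, assuming both engines return lists of hit dicts with a 'url' key (hypothetical shapes - the real result structures may well differ):

```python
def combine_results(mirror_hits, swish_hits, mirrored_urls):
    """Merge results from the post-mirror search and from Swish.

    Swish indexes the whole site, so drop any Swish hit whose URL was
    already sent in via mirrorPosts() - the mirror search covers those
    with cleaner (template-free) text.  Mirror hits are listed first.
    """
    mirrored = set(mirrored_urls)
    seen = set()
    combined = []
    for hit in mirror_hits:
        if hit['url'] not in seen:
            seen.add(hit['url'])
            combined.append(hit)
    for hit in swish_hits:
        if hit['url'] not in mirrored and hit['url'] not in seen:
            seen.add(hit['url'])
            combined.append(hit)
    return combined
```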
Does that sound better?
Cheers,
Phil :)
Hi!
> Sorry, I didn't explain clearly. mirrorPosts _does_ just add/update
> posts. You send it a few posts, and it adds them into the database if
Ok, that's clearer ;-)
> A deleteMirroredPosts call is required, yes. Perhaps
> xmlStorageSystem.deleteMirroredPosts(email, password, posts), with
> 'posts' being a list of postids to delete.
Yep, one is definitely needed.
> If it's possible for a post to have more than one category (i.e. more
> than one URL), this would force you to only pass one in. Do you think
> this could be a problem?
No, I already send in only one if I need to create a GUID - I just use
the homepage URL or else the first category.
> Sound OK?
Depends. :-)
I still have the problem that PyDS creates much more than just posts.
What about that stuff? I would like to search it, too. So seeding
would require pushing in much more than just my 1400 posts ...
The problem is that your solution only searches what is in the database.
And we need to think about what to do with content that is upstreamed but
not mirrored to the search database. For example, what about files that
people just put into their upstreaming spool? These are upstreamed (both
PyDS and Radio have this feature) but not generated, and so they aren't
mirrored.
Maybe upstreamed files should be automatically mirrored, too? Hmm. That
would produce duplicates of weblog content ...
So, no, I am not fully satisfied ;-)
bye, Georg
> Hmm. I think that is not sufficient. Reason: what if people want to remove
> posts? What if they update posts? Should people always send in all their
> posts? (That wouldn't be a real option for me - too many posts.)
>
> So how would you decide what to do with the posts in the database? I think
> mirrorPosts should just add/update posts, and deleteMirroredPosts should
> take a list of post ids and delete them.
Sorry, I didn't explain clearly. mirrorPosts _does_ just add/update
posts. You send it a few posts, and it adds them into the database if
it doesn't have them, or updates them if it does have them. If you're
seeding the database, you send it 10 posts at a time until you're
done. You never send all the posts at once :-)
A deleteMirroredPosts call is required, yes. Perhaps
xmlStorageSystem.deleteMirroredPosts(email, password, posts), with
'posts' being a list of postids to delete.
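As a sketch of the add/update/delete semantics being proposed (an in-memory dict stands in for the real database table; the class and method names are illustrative, not the actual PyCS code):

```python
class MirrorStore:
    """In-memory stand-in for the mirroredPosts database table."""

    def __init__(self):
        self.posts = {}  # postid -> post struct

    def mirror_posts(self, posts):
        """xmlStorageSystem.mirrorPosts semantics: add new posts, or
        overwrite existing ones with the same postid (update)."""
        for post in posts:
            self.posts[post['postid']] = post

    def delete_mirrored_posts(self, postids):
        """Proposed deleteMirroredPosts semantics: remove the given
        postids; unknown ids are silently ignored."""
        for postid in postids:
            self.posts.pop(postid, None)
```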
> PyDS does allow both post updates and post deletes. And then there is the
> problem of categories - posts can be on the homepage or not (as with Radio)
> and can be in categories or not. So people might change posts with regard
> to their homepage/categories and thereby move them from the homepage to a
> category or back. This would imply changed GUIDs (and therefore changes to
> the posts, too).
If a post only lives in one place, you can just call mirrorPosts again
and the GUID and URL will be updated.
If it's possible for a post to have more than one category (i.e. more
than one URL), this would force you to only pass one in. Do you think
this could be a problem?
Maybe in future we could give the mirroredPosts.posts table a new
field, category[catname:S], and allow searching by category...
> So I think we need explicit delete and update support. And maybe we need
> some way to sync - think of people doing strange things with their data
> who don't know which posts actually changed (like some changes to shortcuts
> in PyDS - these would change posts; even though that isn't upstreamed
> automatically at the moment, it might be some day when conditional
> rerendering is in place). Some sync call that returns a list of hashes to
> store in the client database might help, so people can say "look, this is
> the list of post hashes I have in my database; tell me which ones you need
> me to upload (again) to get in sync with me".
That's a good idea. "xmlStorageSystem.getMirroredPostSummary", which
returns a list of dicts, something like:
[{'postid': 'foo',
  'hash': md5("|".join([guid, url, title, text]))}]
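A sketch of how a client could use such a summary, using Python's hashlib (rather than the old md5 module) and assuming the hash covers guid, url, title and text as proposed:

```python
from hashlib import md5

def post_hash(post):
    """Hash of the fields that matter for sync, as proposed above."""
    joined = "|".join([post['guid'], post['url'],
                       post['title'], post['text']])
    return md5(joined.encode('utf-8')).hexdigest()

def posts_to_reupload(local_posts, server_summary):
    """Compare local posts against the server's summary.

    Returns the postids the client must send again: those missing on
    the server, or present with a different hash.
    """
    server = {entry['postid']: entry['hash'] for entry in server_summary}
    return [p['postid'] for p in local_posts
            if server.get(p['postid']) != post_hash(p)]
```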
> Oh, and we should return a new flag in the server capabilities to automatically
> enable this search mirror feature.
flCanMirrorPosts <-- always true
and then if the user has sent in some posts, flCanHostSearch should be
true and flSearchUrl should point to /system/search.py?u=0001234 ...
(I might have got those last two key names wrong; substitute whatever
RCS uses.)
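Putting the proposed flags together as a sketch (the key names are the guesses from above and may not match what RCS actually uses):

```python
def server_capabilities(user_has_mirrored_posts, usernum):
    """Sketch of the extra flags in the capabilities struct returned
    to the client: mirroring is always on, search only once the user
    has actually sent some posts in."""
    caps = {'flCanMirrorPosts': True}
    if user_has_mirrored_posts:
        caps['flCanHostSearch'] = True
        caps['flSearchUrl'] = '/system/search.py?u=' + usernum
    return caps
```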
Sound OK?
Cheers,
Phil :-)
Hi!
On Tue, Oct 14, 2003 at 11:10:10PM +1300, Phillip Pearson wrote:
> - added a new xmlStorageSystem.mirrorPosts(email,password,posts) method that
> takes a list of posts (structs: {postid,date,url,guid,title,text}) and dumps
> them in a new database table. It deletes them (searches by postid) if they
> already exist.
Hmm. I think that is not sufficient. Reason: what if people want to remove
posts? What if they update posts? Should people always send in all their
posts? (That wouldn't be a real option for me - too many posts.)
So how would you decide what to do with the posts in the database? I think
mirrorPosts should just add/update posts, and deleteMirroredPosts should
take a list of post ids and delete them.
PyDS does allow both post updates and post deletes. And then there is the
problem of categories - posts can be on the homepage or not (as with Radio)
and can be in categories or not. So people might change posts with regard
to their homepage/categories and thereby move them from the homepage to a
category or back. This would imply changed GUIDs (and therefore changes to
the posts, too).
So I think we need explicit delete and update support. And maybe we need
some way to sync - think of people doing strange things with their data
who don't know which posts actually changed (like some changes to shortcuts
in PyDS - these would change posts; even though that isn't upstreamed
automatically at the moment, it might be some day when conditional
rerendering is in place). Some sync call that returns a list of hashes to
store in the client database might help, so people can say "look, this is
the list of post hashes I have in my database; tell me which ones you need
me to upload (again) to get in sync with me".
Oh, and we should return a new flag in the server capabilities to automatically
enable this search mirror feature.
bye, Georg
Hi!
On Tue, Oct 14, 2003 at 11:10:10PM +1300, Phillip Pearson wrote:
> I just pushed a whole bunch of stuff into CVS. Some tidying / refactoring
I added German translations in search.py and fixed some broken translations.
bye, Georg