Tesugen

Burger keywords

For weblogs, I think it’s a bad idea to create categories up front. I fear that it will restrict the posts to those categories. I’d rather let the categories emerge as I detect frequent topics in my posts. But organizing the posts into categories afterwards is a drag. I’d prefer that my blog tool could do this automatically: “Hey, he’s blogged 10 times about software architecture. Let’s create a category!”

Several weeks ago (see here) I wrote a script to list the most common words in my posts, in an attempt to identify which categories to create. Back then, I identified the somewhat broad categories “programming”, “Internet” and “life”. The categories would have to be narrower to be useful.

About ten years ago I worked with a guy who had studied heuristic clustering (Google search) at university. As I understood it, it was about grouping texts by topic. But perhaps there’s something more high-tech today. Anyway, a satisfactory solution could be achieved by defining each category as the collection of all texts that contain one or more of its keywords, so that the blog tool can automatically list posts by category.
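A minimal sketch of that idea in Python, assuming each category is simply a set of keywords and a post belongs to every category for which it contains at least one of them. The category names, the keywords and the `categorize` function are made up for illustration; this is not what my blog tool does today.

```python
import re

# Hypothetical category definitions: category name -> set of keywords.
CATEGORIES = {
    "software architecture": {"architecture", "metaphor"},
    "weblogs": {"weblog", "blog", "category"},
}

def words(text):
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"[a-z']+", text.lower()))

def categorize(text, categories=CATEGORIES):
    """Return the sorted names of all categories with at least one keyword in the text."""
    found = words(text)
    return sorted(name for name, keywords in categories.items() if keywords & found)
```

Matching is on exact words only, so “blogged” would not count as “blog”; some kind of stemming would be needed for that.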

The keywords can probably be detected automatically by analyzing all words in all posts and computing the average frequency of each. The interesting keywords are probably found somewhere in the middle of the range, since the most common words are ones such as “a”, “the” and “if”. What remains is to determine the frequency interval that indicates an important keyword.
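One way this could be sketched in Python, assuming “frequency” is measured as the share of posts in which a word appears at least once. The function name `candidate_keywords` and the default bounds `low` and `high` are guesses of mine; finding good values for that interval is exactly the open question.

```python
import re
from collections import Counter

def candidate_keywords(posts, low=0.2, high=0.6):
    """Return words whose share of posts containing them lies in [low, high].

    The default bounds are arbitrary placeholders, not tuned values.
    """
    doc_count = Counter()
    for text in posts:
        # Count each word once per post, however often it occurs there.
        doc_count.update(set(re.findall(r"[a-z']+", text.lower())))
    total = len(posts)
    return sorted(w for w, n in doc_count.items() if low <= n / total <= high)
```

With a handful of posts, a call such as `candidate_keywords(posts, low=0.3, high=0.7)` drops both the words that appear everywhere and the ones that appear only once.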

I suspect that single keywords aren’t interesting by themselves; what matters is how they are grouped. The keywords “software”, “architecture” and perhaps “metaphor” would mark a post about software architecture. Keywords might be optional, or groups might be defined as “software AND (architecture OR metaphor)”, or something like that.
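A rough sketch of such keyword groups in Python, assuming a group is written as a nested tuple whose first element is `"and"` or `"or"`. This representation, and the names `matches`, `post_matches` and `SOFTWARE_ARCHITECTURE`, are inventions for the example; any other notation would do.

```python
import re

def matches(rule, found):
    """Evaluate a nested keyword rule against the set of words found in a post."""
    if isinstance(rule, str):
        # A bare keyword matches if the post contains it.
        return rule in found
    op, *operands = rule
    results = (matches(r, found) for r in operands)
    if op == "and":
        return all(results)
    if op == "or":
        return any(results)
    raise ValueError(f"unknown operator: {op}")

# "software AND (architecture OR metaphor)"
SOFTWARE_ARCHITECTURE = ("and", "software", ("or", "architecture", "metaphor"))

def post_matches(rule, text):
    """Split a post into lowercase words and test it against a rule."""
    found = set(re.findall(r"[a-z']+", text.lower()))
    return matches(rule, found)
```

A post mentioning only “software”, or only “architecture”, would not match the group; it needs “software” together with at least one of the other two.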

Combine this with my idea for generating Components beside Pages, and allowing Components to be included in Pages. Then I could create a component to list the most common keyword groups and one for the most recent. This would probably be very useful for readers of Tesugen.com.

The above was posted to my personal weblog on August 6, 2002. My name is Peter Lindberg and I am a thirtysomething software developer and dad living in Stockholm, Sweden. Here, you’ll find posts in English and Swedish about whatever happens to interest me at the moment.
