use search_hub::tagging::{default_tags, TaggingEngine};
struct Sample {
    name: &'static str,
    text: &'static str,
}
const SAMPLES: &[Sample] = &[
    Sample {
        name: "Rust backend API",
        text: r#"
Building a Modern Web API with Rust
In this tutorial we will build a RESTful API using the Actix-web framework
in Rust. The API will expose CRUD endpoints for managing a book collection
stored in a PostgreSQL database. We will use SQLx for async database access
with connection pooling, and serde for JSON serialization and deserialization
of our request and response types.
To get started, create a new cargo project and add the required dependencies
to your Cargo.toml: actix-web, serde, serde_json, sqlx, tokio, and uuid.
Enable the runtime-tokio feature for sqlx so we can use async database
operations throughout the application.
First define our data model with a Book struct containing id, title, author,
isbn, and published_year fields. Derive Serialize and Deserialize from serde
so actix-web can automatically convert between JSON and our Rust types.
Then create database migration files that define the books table schema with
appropriate indexes on the isbn and author columns.
Next implement the HTTP handlers. The list handler queries all books and
returns them as a JSON array. The create handler validates the incoming JSON
body, inserts a new row, and returns the created book with a 201 status code.
The get, update, and delete handlers follow the same pattern using the book
id extracted from the URL path parameters.
For error handling we define an ApiError enum that maps to appropriate HTTP
status codes. Use actix-web's ResponseError trait to automatically convert
our error types into JSON error responses. This keeps the handler code clean
and focused on business logic rather than HTTP plumbing.
Add middleware for logging, CORS support, and request validation. Configure
the server to bind on 0.0.0.0:8080 with 4 worker threads. Finally write
integration tests using actix_web::test to verify each endpoint works
correctly with both valid and invalid inputs.
Deploy the application using Docker with a multi-stage build for minimal
image size. Use docker-compose to run the API server alongside a PostgreSQL
container, with environment variables for configuration. Add health check
endpoints and structured logging for production monitoring.
"#,
    },
    Sample {
        name: "Python data science",
        text: r#"
Exploratory Data Analysis with Python and Pandas
Data analysis begins with loading your dataset into a pandas DataFrame.
Use the read_csv function to import CSV files and inspect the first few
rows with the head method. Check data types with dtypes and get summary
statistics using the describe method on numerical columns.
Data cleaning is a critical step before any modeling. Handle missing values
by either dropping rows with dropna or filling them with fillna using the
mean or median of the column. Remove duplicate rows with drop_duplicates
and convert data types as needed using the astype method.
For data visualization, matplotlib and seaborn are the standard libraries
in the Python ecosystem. Create scatter plots with plt.scatter to explore
relationships between numeric variables, histograms with plt.hist to
understand distributions, and box plots with seaborn.boxplot to detect
outliers in your data. Customize your plots with titles, axis labels,
and color palettes for publication-quality figures.
Feature engineering transforms raw data into inputs suitable for machine
learning models. Create new columns from existing ones, encode categorical
variables using one-hot encoding with pandas.get_dummies, and scale numeric
features using scikit-learn's StandardScaler. Split your data into training
and test sets with train_test_split to evaluate model performance.
Build a regression model using scikit-learn's LinearRegression or a
classification model using RandomForestClassifier. Fit the model on the
training data with the fit method, make predictions with predict, and
evaluate accuracy using metrics like mean_squared_error for regression
or accuracy_score for classification tasks.
Use Jupyter notebooks for interactive development with inline plotting
and markdown annotations. Document your analysis steps clearly so others
can reproduce your results. Save your cleaned datasets with to_csv for
future use and export your models with joblib or pickle for deployment.
"#,
    },
    Sample {
        name: "Frontend web design",
        text: r#"
Responsive Web Design with Modern CSS
Building a responsive website starts with a solid CSS foundation using
Flexbox and CSS Grid for layout. Define a container with display: flex
to create horizontal or vertical layouts that adapt to screen size. Use
justify-content and align-items to position elements within the flex
container, and the flex-wrap property to allow items to flow onto
multiple lines on smaller screens.
CSS Grid provides two-dimensional layout control with grid-template-columns
and grid-template-rows. Define named grid areas with grid-template-areas
and place items using the grid-area property. This makes it easy to create
complex page layouts that reflow naturally from desktop to tablet to mobile.
Typography is the foundation of good design. Set a harmonious type scale
using clamp for fluid typography that scales between minimum and maximum
values. Use custom properties (CSS variables) to maintain consistency
across your design system. Define --color-primary, --font-heading, and
--spacing-unit variables that can be changed globally.
Accessibility is not optional. Use semantic HTML elements like header,
nav, main, section, and footer. Add aria labels to interactive elements
and ensure color contrast ratios meet WCAG AA standards. Test your site
with keyboard navigation and screen readers to verify all functionality
is accessible to users with disabilities.
Animations enhance user experience when used thoughtfully. Use CSS
transitions for hover effects on buttons and links, and keyframe
animations for loading states and page transitions. The prefers-reduced-
motion media query respects users who prefer less animation.
Mobile-first design means starting with the smallest screen and adding
complexity with min-width media queries. This approach ensures your site
works well on all devices and loads efficiently on mobile connections.
Test regularly using browser dev tools in responsive design mode.
"#,
    },
    Sample {
        name: "Linux devops",
        text: r#"
Linux Server Administration and Automation
Managing Linux servers efficiently requires mastery of the command line
and automation tools. Start with the basics of process management using
ps, top, and htop to monitor running processes. Use kill and killall
to terminate unresponsive processes and systemctl to manage systemd
services. Check resource usage with free for memory, df for disk space,
and netstat or ss for network connections.
Shell scripting is essential for automation. Write bash scripts using
variables, loops, conditionals, and functions. Use find with exec to
batch-process files, grep with regex for pattern matching in logs, and
awk or sed for text processing. Schedule recurring tasks with cron
and systemd timers for more complex scheduling needs.
Containerization with Docker simplifies application deployment. Write
Dockerfiles that specify the base image, install dependencies, copy
application code, and define the startup command. Use docker-compose
to orchestrate multi-container applications with linked services,
networks, and persistent volumes. Tag and push images to a registry
for deployment across environments.
Kubernetes orchestrates containers at scale. Define deployments with
replica counts, services for networking, and configmaps for environment
configuration. Use kubectl to inspect pods, view logs, and scale
applications horizontally. Implement health checks with liveness and
readiness probes to ensure your applications are running correctly.
Configuration management with Ansible keeps your infrastructure
consistent. Write playbooks in YAML that define the desired state of
your servers. Use roles to organize tasks, handlers, and variables
into reusable components. Run ad-hoc commands with ansible to quickly
check server status across your entire infrastructure.
Monitor your infrastructure with Prometheus for metrics collection and
Grafana for dashboards. Set up alerts for critical conditions like high
CPU usage, disk space running low, or services going offline. Centralize
logs using the ELK stack or Loki for troubleshooting and analysis.
"#,
    },
    Sample {
        name: "AI machine learning",
        text: r#"
Training Deep Learning Models with PyTorch
Deep learning has transformed how we approach complex pattern recognition
tasks. PyTorch provides a flexible framework for building and training
neural networks using tensor computations with automatic differentiation.
Define a model by subclassing nn.Module and implementing the forward
method that specifies how input data flows through the network layers.
Data preparation is crucial for model performance. Use the DataLoader
class to efficiently batch and shuffle your dataset during training.
Apply data augmentation techniques like random cropping, flipping, and
color jitter to reduce overfitting and improve generalization. Normalize
input tensors to have zero mean and unit variance for stable training.
The training loop iterates over epochs, processing batches of data
through the model, computing the loss with a criterion like cross-entropy
for classification or mean squared error for regression, and calling
backward to compute gradients. Use an optimizer like Adam or SGD with
learning rate scheduling to minimize the loss function over time.
Convolutional neural networks excel at image recognition tasks. Stack
Conv2d layers with increasing channel depth, interleaved with ReLU
activations and max-pooling layers to reduce spatial dimensions.
Add batch normalization to stabilize training and dropout layers to
prevent overfitting. End with fully connected layers for classification.
Transformer architectures dominate natural language processing. The
self-attention mechanism allows the model to weigh the importance of
different positions in the input sequence. Multi-head attention runs
multiple attention operations in parallel, capturing different types
of relationships between tokens. Positional encodings provide sequence
order information to the model.
Transfer learning leverages pretrained models for new tasks. Load a
model pretrained on ImageNet, freeze the early layers, and replace the
final classification head with new layers for your specific dataset.
Fine-tune the model with a lower learning rate to adapt the pretrained
features to your domain while preserving the general visual knowledge.
"#,
    },
    Sample {
        name: "Mobile development",
        text: r#"
Building Cross-Platform Mobile Apps with Flutter
Flutter enables building native-quality mobile applications for both
iOS and Android from a single Dart codebase. The framework uses a
widget-based architecture where everything from a simple text label
to complex layouts is a widget. Compose widgets together using
child and children properties to build your user interface hierarchy.
State management is a key concern in mobile app development. Use
setState for simple local state, or adopt Provider, Riverpod, or
Bloc for more complex application state that needs to be shared
across multiple screens. Keep your business logic separate from
your UI code by using ViewModels or Controllers that manage state
and expose it to widgets via streams or change notifiers.
Navigation and routing handle moving between screens in your app.
Use the Navigator widget with named routes for simple apps, or
implement a router with GoRouter for more complex navigation
patterns including deep linking and nested navigation. Pass data
between screens using constructor arguments or route parameters.
Platform-specific features require accessing native APIs through
platform channels. Implement features like camera access, location
services, biometric authentication, and push notifications by
writing platform-specific code in Kotlin or Swift and invoking it
from Dart through MethodChannel calls. Use community packages from
pub.dev for common native features.
Performance optimization is critical for a smooth user experience.
Profile your app using the Flutter DevTools to identify widget
rebuilds and jank. Use const constructors where possible to reduce
rebuilds, implement lazy loading for lists with ListView.builder,
and cache images using cached_network_image. Reduce app size by
removing unused resources and using code shrinking.
Testing mobile apps requires multiple approaches. Write unit tests
for your business logic and data models. Use widget tests to verify
individual widget behavior and integration tests for full user flows.
Run tests on both iOS and Android simulators to catch platform-
specific issues before releasing to app stores.
"#,
    },
    Sample {
        name: "Gaming graphics",
        text: r#"
Real-Time 3D Rendering with Vulkan and GLSL
Modern game engines leverage GPU compute capabilities to render
complex 3D scenes at interactive frame rates. The Vulkan API
provides low-level access to graphics hardware with explicit
control over memory management and command buffers. Set up a
Vulkan instance, select a physical device, create a logical
device, and configure graphics and present queues for rendering.
The rendering pipeline transforms 3D geometry into 2D images.
Vertex shaders process individual vertices, applying model-view-
projection matrix transformations to place objects in clip space.
Fragment shaders determine the color of each pixel using lighting
calculations, texture sampling, and material properties defined
in GLSL shading language source code.
A game loop runs at 60 frames per second, processing input events,
updating game state, and rendering each frame. Use delta time to
ensure consistent movement speeds regardless of frame rate. Implement
fixed time step for physics simulations to maintain stability.
Separate the update and render phases for better parallelism.
Physics simulation brings game worlds to life. Use a physics engine
like PhysX or Bullet for rigid body dynamics including collision
detection, joint constraints, and force-based movement. Implement
broad phase and narrow phase collision detection to efficiently
find colliding pairs among thousands of objects in the scene.
Spatial data structures accelerate rendering by culling objects
outside the camera view frustum. Use bounding volume hierarchies,
octrees, or binary space partitioning trees to organize scene
geometry. Implement occlusion culling to skip rendering objects
hidden behind other geometry, saving GPU processing time.
Post-processing effects enhance visual quality after the main
render pass. Apply bloom for glowing highlights, ambient occlusion
for realistic shadowing in corners, and tone mapping to convert
HDR values to displayable colors. Use compute shaders for GPU-
based particle systems and screen-space reflections.
"#,
    },
    Sample {
        name: "Audio production",
        text: r#"
Digital Audio Production and Music Streaming
Digital audio workstations have revolutionized music production by
providing powerful tools for recording, editing, and mixing audio.
Record multiple tracks simultaneously through audio interfaces with
low-latency monitoring. Edit waveform regions with cut, copy, paste,
and crossfade operations to arrange your recordings into a coherent
composition.
Audio effects processing shapes the character of your sound. Use
equalizers to boost or cut specific frequency ranges, compressors
to control dynamic range by reducing loud peaks, and reverbs to
simulate acoustic spaces from small rooms to large concert halls.
Delay and chorus effects add depth and width to your mixes.
Podcast production follows a different workflow focused on spoken
word clarity. Record with quality microphones in acoustically treated
spaces to minimize background noise and room reflections. Use noise
gates to silence pauses between speech, de-essers to reduce sibilance,
and compressors to smooth out volume variations across the episode.
Music streaming platforms deliver audio content to millions of
listeners worldwide. Encode audio files using codecs like AAC or
Opus that balance sound quality with bandwidth efficiency. Generate
album artwork, metadata tags, and playlist descriptions to help
listeners discover your content through search and recommendations.
Radio broadcasting combines live and pre-recorded content with
scheduling automation. Use playout software to manage playlists,
cues, and commercial breaks. Broadcast audio over internet radio
using Icecast or Shoutcast servers with streaming protocols like
HLS for adaptive bitrate delivery to listeners on various devices.
Live sound reinforcement requires understanding of acoustics and
signal flow. Set up a mixing console with auxiliary sends for
stage monitors and effects returns. Use graphic equalizers to tune
the room response and feedback suppressors to prevent howling.
Balance the front-of-house mix so every instrument and voice is
clear and present in the audience area.
"#,
    },
    Sample {
        name: "Social community",
        text: r#"
Building Online Communities and Social Platforms
Social media platforms connect people around shared interests and
experiences. Designing a social platform requires careful consideration
of user profiles, content feeds, and interaction mechanisms. Users
create profiles with biographical information, profile pictures, and
privacy settings that control who can see their content and activity.
Content feeds are the central feature of any social platform. Implement
algorithms that surface relevant posts based on recency, engagement
metrics, and user preferences. Support multiple content types including
text posts, image sharing, video uploads, and link previews with rich
metadata fetched from shared URLs.
Real-time messaging enables direct communication between users. Build
chat systems using WebSocket connections for instant message delivery
with typing indicators, read receipts, and push notifications. Organize
conversations into private direct messages and group chats with support
for media attachments, emoji reactions, and message threading.
Community management tools help moderators maintain healthy discussions.
Provide reporting mechanisms for inappropriate content, automated spam
detection using machine learning classifiers, and moderation queues
where flagged content is reviewed before being shown to the broader
community. Implement warning systems and temporary or permanent bans.
Social features like likes, shares, comments, and follows create
engagement loops that keep users returning to the platform. Notify
users when someone interacts with their content through in-app
notifications and email digests. Show trending topics and popular
content in discovery sections to help users find new communities.
Content moderation at scale requires both automated and human review.
Train natural language models to detect hate speech, harassment, and
misinformation. Establish clear community guidelines that define
acceptable behavior and content standards. Provide appeals processes
so users can challenge moderation decisions they disagree with.
"#,
    },
];
#[test]
fn explore_tagging_thresholds() {
    let tags = default_tags();
    let mut engine = TaggingEngine::new(&tags, 0.40).expect("failed to init tagging engine");
    let thresholds = [
        0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90,
    ];
    for sample in SAMPLES {
        println!();
        println!("=== {} ===", sample.name);
        println!("Text length: {} chars", sample.text.chars().count());
        println!();
        for &threshold in &thresholds {
            let matched = engine
                .tags_for_with_threshold(sample.text, 5, threshold)
                .expect("tagging failed");
            if matched.is_empty() {
                println!("  {:.2}: (none)", threshold);
            } else {
                let tags_repr: Vec<String> = matched
                    .iter()
                    .map(|(tag, score)| format!("{} ({:.3})", tag, score))
                    .collect();
                println!("  {:.2}: {}", threshold, tags_repr.join(", "));
            }
        }
    }
}
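// A minimal sketch of the filtering that the sweep above exercises, assuming
// `tags_for_with_threshold(text, max_tags, threshold)` keeps at most `max_tags`
// of the highest-scoring tags whose score is at least `threshold`. The helper
// name `filter_scored_tags` and the inclusive comparison are assumptions made
// for illustration; the real method body is not shown in this file.
#[allow(dead_code)]
fn filter_scored_tags(
    scored: &[(String, f32)],
    max_tags: usize,
    threshold: f32,
) -> Vec<(String, f32)> {
    // Sort a copy by descending score so the result does not depend on input order.
    let mut sorted: Vec<(String, f32)> = scored.to_vec();
    sorted.sort_by(|a, b| b.1.total_cmp(&a.1));
    sorted
        .into_iter()
        // Drop everything below the cut-off, then cap the number of tags.
        .filter(|(_, score)| *score >= threshold)
        .take(max_tags)
        .collect()
}
// Under these assumptions, raising the threshold can only shrink the returned
// list, which is the pattern the printed sweep makes visible per sample.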
Filename: tests/tagging_threshold.rs. Size: 21kb.
use fastembed::{EmbeddingModel, TextEmbedding, TextInitOptions};
use fastembed::similarity::cosine_similarity;
use serde::Deserialize;
/// A named tag with example texts used for semantic similarity scoring.
///
/// # Example
///
/// ```rust
/// use search_hub::tagging::TagDef;
///
/// let tag = TagDef {
///     name: "rust".into(),
///     examples: vec!["Rust ownership".into(), "cargo build system".into()],
/// };
/// assert_eq!(tag.name, "rust");
/// ```
#[derive(Debug, Clone, Deserialize)]
pub struct TagDef {
    /// The tag label (e.g. "rust", "web").
    pub name: String,
    /// Example phrases that exemplify this tag for embedding comparison.
    pub examples: Vec<String>,
}
/// Return the hardcoded default set of 25 tags with 3 example texts each.
///
/// # Example
///
/// ```rust
/// use search_hub::tagging::default_tags;
///
/// let tags = default_tags();
/// assert_eq!(tags.len(), 25);
/// assert_eq!(tags[0].name, "rust");
/// ```
pub fn default_tags() -> Vec<TagDef> {
    vec![
        TagDef { name: "rust".into(), examples: vec![
            "Rust ownership and borrow checker enforcing memory safety at compile time".into(),
            "pattern matching with enums and the Result type for error handling".into(),
            "cargo build system, crates.io ecosystem, and procedural macros".into(),
        ]},
        TagDef { name: "python".into(), examples: vec![
            "Python indentation-based syntax, list comprehensions, and generator expressions".into(),
            "dynamic typing, duck typing, and Python's data model protocols".into(),
            "pip packaging, virtual environments, and Python import system".into(),
        ]},
        TagDef { name: "web".into(), examples: vec![
            "HTML semantic markup, accessibility attributes, and document structure".into(),
            "CSS layout with flexbox and grid, responsive design with media queries".into(),
            "DOM manipulation, event bubbling, and Web API interfaces in the browser".into(),
        ]},
        TagDef { name: "audio".into(), examples: vec![
            "music streaming, albums, playlists, and artist discovery".into(),
            "podcast episodes, RSS feeds, and audio content distribution".into(),
            "radio stations, live broadcasts, and audio programming".into(),
        ]},
        TagDef { name: "backend".into(), examples: vec![
            "HTTP server routing, request handling, and response middleware chains".into(),
            "connection pooling, ORM patterns, and server-side template rendering".into(),
            "backend service architecture, message queues, and inter-service communication".into(),
        ]},
        TagDef { name: "devops".into(), examples: vec![
            "container images, Dockerfiles, and Kubernetes pod orchestration".into(),
            "infrastructure provisioning with Terraform and configuration management".into(),
            "CI/CD build pipelines, artifact management, and deployment strategies".into(),
        ]},
        TagDef { name: "data".into(), examples: vec![
            "data frame operations, statistical analysis, and numerical computing".into(),
            "data visualization with plotting libraries and charting techniques".into(),
            "ETL workflows, data cleaning, and batch processing pipelines".into(),
        ]},
        TagDef { name: "ai".into(), examples: vec![
            "transformer attention mechanisms, tokenization, and embedding layers".into(),
            "gradient descent, backpropagation, and neural network loss functions".into(),
            "model quantization, fine-tuning strategies, and inference optimization".into(),
        ]},
        TagDef { name: "linux".into(), examples: vec![
            "file permission bits, process management, and signal handling".into(),
            "piping stdout, redirecting file descriptors, and shell expansion rules".into(),
            "package managers, init systems, and systemd unit files".into(),
        ]},
        TagDef { name: "security".into(), examples: vec![
            "authentication tokens, OAuth flows, and JWT session handling".into(),
            "input sanitization, parameterized queries, and XSS/CSRF prevention".into(),
            "certificate authorities, TLS handshakes, and mTLS configurations".into(),
        ]},
        TagDef { name: "design".into(), examples: vec![
            "design tokens, component libraries, and design system consistency".into(),
            "typographic scale, whitespace rhythm, and visual hierarchy principles".into(),
            "color contrast, WCAG accessibility ratios, and responsive breakpoints".into(),
        ]},
        TagDef { name: "mobile".into(), examples: vec![
            "touch gesture handling, viewport sizing, and responsive mobile layouts".into(),
            "app lifecycle, push notifications, and background task management".into(),
            "native platform APIs, mobile sensors, and cross-platform mobile frameworks".into(),
        ]},
        TagDef { name: "gaming".into(), examples: vec![
            "game loop architecture, frame-rate independence, and delta time".into(),
            "physics simulation, collision detection, and spatial partitioning".into(),
            "shader programs, GPU rendering pipeline, and 3D transformations".into(),
        ]},
        TagDef { name: "tutorial".into(), examples: vec![
            "beginner-friendly walkthroughs with code examples and expected output".into(),
            "learning objectives, prerequisite knowledge, and progressive skill building".into(),
            "interactive code playgrounds, exercises, and quiz-based reinforcement".into(),
        ]},
        TagDef { name: "news".into(), examples: vec![
            "version bumps, deprecation timelines, and migration announcements".into(),
            "community announcements, conference talks, and ecosystem updates".into(),
            "release notes, changelogs, and feature release highlights".into(),
        ]},
        TagDef { name: "video".into(), examples: vec![
            "video streaming platforms, channels, and content creation".into(),
            "video editing, encoding formats, and transcoding workflows".into(),
            "live streaming, video on demand, and media playback".into(),
        ]},
        TagDef { name: "tools".into(), examples: vec![
            "text editor configuration, IDE plugins, and developer workflow tooling".into(),
            "version control workflows, git branching strategies, and merge patterns".into(),
            "debugger breakpoints, profiling tools, and performance tracing utilities".into(),
        ]},
        TagDef { name: "database".into(), examples: vec![
            "SQL table schemas, foreign key relationships, and constraint design".into(),
            "index structures, query plan analysis, and query performance tuning".into(),
            "ACID transactions, isolation levels, and connection pool configuration".into(),
        ]},
        TagDef { name: "cli".into(), examples: vec![
            "command argument parsing, subcommand patterns, and flag conventions".into(),
            "terminal output formatting, colored logging, and progress indicators".into(),
            "stdin/stdout pipes, exit codes, and shell completion scripts".into(),
        ]},
        TagDef { name: "social".into(), examples: vec![
            "social media platforms, feeds, and community discussions".into(),
            "user profiles, followers, and content sharing features".into(),
            "messaging systems, real-time chat, and social networking APIs".into(),
        ]},
        TagDef { name: "testing".into(), examples: vec![
            "unit test assertions, test fixtures, and parametrized test cases".into(),
            "mocking external dependencies, test doubles, and fake implementations".into(),
            "integration tests, end-to-end testing, and continuous testing in CI".into(),
        ]},
        TagDef { name: "javascript".into(), examples: vec![
            "JavaScript closures, prototypal inheritance, and the event loop".into(),
            "async/await patterns, Promise chaining, and callback conventions".into(),
            "ES modules, npm packages, and JavaScript bundler tooling".into(),
        ]},
        TagDef { name: "api".into(), examples: vec![
            "RESTful resource design, URL patterns, and HTTP method semantics".into(),
            "request validation, error response formatting, and status code conventions".into(),
            "API versioning, rate limiting, and OpenAPI specification documents".into(),
        ]},
        TagDef { name: "documentation".into(), examples: vec![
            "API reference docs, docstrings, and inline code annotations".into(),
            "architecture decision records and design documentation practices".into(),
            "README writing, project wikis, and onboarding guides for contributors".into(),
        ]},
        TagDef { name: "productivity".into(), examples: vec![
            "habit tracking, time management, and personal workflow optimization".into(),
            "note-taking systems, knowledge base management, and personal wikis".into(),
            "task organization, prioritization frameworks, and automation of repetitive work".into(),
        ]},
    ]
}
/// Engine that embeds content and scores it against tag prototypes using cosine similarity.
///
/// # Example
///
/// ```ignore
/// let tags = search_hub::tagging::default_tags();
/// let mut engine = search_hub::tagging::TaggingEngine::new(&tags, 0.40)
///     .expect("failed to init tagging engine");
/// let matched = engine.tags_for("the rust programming language borrow checker", 3)
///     .expect("tagging failed");
/// assert!(matched.contains(&"rust".to_string()));
/// ```
pub struct TaggingEngine {
    model: TextEmbedding,
    tag_examples: Vec<(String, Vec<Vec<f32>>)>,
    threshold: f32,
}
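// A minimal, model-free sketch of the scoring rule the engine's doc comment
// describes: a tag's score is the highest cosine similarity between the content
// embedding and any of that tag's example embeddings. `cosine` and
// `best_prototype_score` are hypothetical helpers written for illustration;
// `cosine` is a hand-rolled stand-in, not the crate function imported above.
#[allow(dead_code)]
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Treat a zero vector as having no similarity instead of dividing by zero.
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

#[allow(dead_code)]
fn best_prototype_score(query: &[f32], prototypes: &[Vec<f32>]) -> Option<f32> {
    // `None` for a tag with no examples, rather than a sentinel score.
    prototypes
        .iter()
        .map(|proto| cosine(query, proto))
        .reduce(f32::max)
}
// Taking the maximum (rather than the mean) means one well-matching example is
// enough for a tag to score highly, even if its other examples are unrelated.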
impl TaggingEngine {
    /// Create a new tagging engine from the given tag definitions.
    ///
    /// Downloads the ONNX embedding model on first run (cached afterwards).
    ///
    /// # Parameters
    ///
    /// * `tags`      - Slice of `TagDef` entries (from config or `default_tags()`).
    /// * `threshold` - Minimum cosine-similarity score (0.0 to 1.0) for a tag
    ///                  to be assigned, e.g. 0.40. Stored on the engine as the
    ///                  default for `tags_for()`; it can be overridden per call
    ///                  with `tags_for_with_threshold()`.
    ///
    /// # Returns
    ///
    /// A `TaggingEngine` ready to score content.
    ///
    /// # Errors
    ///
    /// Returns an error if the embedding model cannot be loaded or the
    /// tag examples fail to embed.
    ///
    /// # Example
    ///
    /// ```ignore
    /// let tags = search_hub::tagging::default_tags();
    /// let mut engine = search_hub::tagging::TaggingEngine::new(&tags, 0.60)
    ///     .expect("model init");
    /// ```
    pub fn new(tags: &[TagDef], threshold: f32) -> anyhow::Result<Self> {
        let mut model = TextEmbedding::try_new(
            TextInitOptions::new(EmbeddingModel::BGESmallENV15)
                .with_show_download_progress(true),
        )?;
        // Flatten every tag's examples into one batch, remembering which tag
        // each example belongs to so the embeddings can be regrouped afterwards.
        let mut all_examples: Vec<String> = Vec::new();
        let mut tag_indices: Vec<usize> = Vec::new();
        for (ti, tag) in tags.iter().enumerate() {
            for example in &tag.examples {
                tag_indices.push(ti);
                all_examples.push(format!("passage: {}", example));
            }
        }
        let embeddings = model.embed(all_examples, None)?;
        let mut tag_examples: Vec<(String, Vec<Vec<f32>>)> = tags
            .iter()
            .map(|t| (t.name.clone(), Vec::new()))
            .collect();
        // Move each embedding into its tag's bucket; no clone is needed.
        for (ti, emb) in tag_indices.into_iter().zip(embeddings) {
            tag_examples[ti].1.push(emb);
        }
        Ok(Self { model, tag_examples, threshold })
    }
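    /// Return the longest prefix of `content` holding at most `max_chars`
    /// characters, always cut on a UTF-8 character boundary.
    fn truncate(content: &str, max_chars: usize) -> &str {
        // Byte offset just past the last kept character; 0 when nothing is
        // kept (empty input or `max_chars == 0`).
        let end = content.char_indices()
            .take(max_chars)
            .last()
            .map_or(0, |(i, c)| i + c.len_utf8());
        &content[..end]
    }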
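    /// Embed `content` (truncated to 2000 characters) and return every tag
    /// paired with its best cosine similarity over that tag's example
    /// embeddings, sorted by score descending.
    fn score_content(&mut self, content: &str) -> anyhow::Result<Vec<(String, f32)>> {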
        let truncated = Self::truncate(content, 2000);
        let emb = self.model.embed(
            vec![format!("passage: {}", truncated)],
            None,
        )?;
        if emb.is_empty() {
            return Ok(Vec::new());
        }
        let query_emb = &emb[0];
        let mut scores: Vec<(usize, f32)> = self.tag_examples
            .iter()
            .enumerate()
            .map(|(i, (_, examples))| {
                let max_sim = examples
                    .iter()
                    .map(|proto| cosine_similarity(query_emb, proto))
                    .fold(f32::NEG_INFINITY, f32::max);
                (i, max_sim)
            })
            .collect();
        scores.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
        Ok(scores
            .into_iter()
            .map(|(i, score)| (self.tag_examples[i].0.clone(), score))
            .collect())
    }
    /// Score `content` against all tag prototypes and return tags scoring at
    /// or above the configured threshold.
    ///
    /// # Parameters
    ///
    /// * `content`  - The text to tag (e.g. page body converted to Markdown).
    /// * `max_tags` - Maximum number of tags to return.
    ///
    /// # Returns
    ///
    /// A `Vec<String>` of tag names matching the content, sorted by score
    /// descending.
    ///
    /// # Errors
    ///
    /// Returns an error if the embedding model fails to process the content.
    ///
    /// # Example
    ///
    /// ```ignore
    /// let tags = search_hub::tagging::default_tags();
    /// let mut engine = search_hub::tagging::TaggingEngine::new(&tags, 0.40)
    ///     .expect("model init");
    /// let matched = engine.tags_for("the rust programming language", 3)
    ///     .expect("tagging failed");
    /// println!("{:?}", matched);
    /// ```
    pub fn tags_for(&mut self, content: &str, max_tags: usize) -> anyhow::Result<Vec<String>> {
        Ok(self
            .tags_for_with_threshold(content, max_tags, self.threshold)?
            .into_iter()
            .map(|(tag, _)| tag)
            .collect())
    }
    /// Score `content` and return tag-score pairs above a custom threshold.
    ///
    /// # Parameters
    ///
    /// * `content`   - The text to tag.
    /// * `max_tags`  - Maximum number of tags to return.
    /// * `threshold` - Minimum cosine-similarity score (0.0 to 1.0).
    ///
    /// # Returns
    ///
    /// A `Vec<(String, f32)>` of (tag_name, score) matching the content,
    /// sorted by score descending.
    ///
    /// # Errors
    ///
    /// Returns an error if the embedding model fails to process the content.
    ///
    /// # Example
    ///
    /// ```ignore
    /// let tags = search_hub::tagging::default_tags();
    /// let mut engine = search_hub::tagging::TaggingEngine::new(&tags, 0.40)
    ///     .expect("model init");
    /// let matched = engine.tags_for_with_threshold("rust programming", 5, 0.30)
    ///     .expect("tagging failed");
    /// for (tag, score) in &matched {
    ///     println!("{}: {:.3}", tag, score);
    /// }
    /// ```
    pub fn tags_for_with_threshold(
        &mut self,
        content: &str,
        max_tags: usize,
        threshold: f32,
    ) -> anyhow::Result<Vec<(String, f32)>> {
        let scored = self.score_content(content)?;
        Ok(scored
            .into_iter()
            .filter(|(_, score)| *score >= threshold)
            .take(max_tags)
            .collect())
    }
}
Filename: src/tagging.rs. Size: 16kb.
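`score_content` calls a `cosine_similarity` helper whose definition is not shown in this part of the paste. The following is a minimal sketch of what such a helper might look like, together with the max-over-examples fold that `score_content` applies per tag. The zero-norm guard and the `max_similarity` name are assumptions for illustration, not the file's actual code.

```rust
/// Cosine similarity of two equal-length vectors.
/// Assumption: returns 0.0 when either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Best similarity between `query` and any of a tag's example embeddings,
/// using the same fold as `score_content`. (Hypothetical helper name.)
fn max_similarity(query: &[f32], prototypes: &[Vec<f32>]) -> f32 {
    prototypes
        .iter()
        .map(|p| cosine_similarity(query, p))
        .fold(f32::NEG_INFINITY, f32::max)
}

fn main() {
    let query: Vec<f32> = vec![1.0, 0.0];
    let prototypes: Vec<Vec<f32>> = vec![vec![0.0, 1.0], vec![1.0, 1.0]];
    // The second prototype is the closer one: 1 / sqrt(2).
    println!("{:.3}", max_similarity(&query, &prototypes)); // prints 0.707
}
```

With this fold, a tag that has no examples scores `f32::NEG_INFINITY`, so it can never pass a threshold filter.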
cargo test -- --no-capture tagging_thresholds
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.15s
     Running unittests src/lib.rs (target/debug/deps/search_hub-3772568fc9c44310)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 17 filtered out; finished in 0.00s
     Running unittests src/main.rs (target/debug/deps/search_hub-ef865f4abff29f07)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
     Running tests/convert_strips_html.rs (target/debug/deps/convert_strips_html-2232dad9b2803fc6)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 3 filtered out; finished in 0.00s
     Running tests/tagging_thresholds.rs (target/debug/deps/tagging_thresholds-c2474eba62cfa7a8)
running 1 test
=== Rust backend API ===
Text length: 2177 chars
  0.30: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.35: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.40: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.45: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.50: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.55: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.60: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.65: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.70: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.75: javascript (0.799), rust (0.796), tools (0.785), data (0.784), security (0.781)
  0.80: (none)
  0.85: (none)
  0.90: (none)
=== Python data science ===
Text length: 2001 chars
  0.30: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.35: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.40: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.45: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.50: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.55: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.60: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.65: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.70: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.75: tutorial (0.795), python (0.784), data (0.770), design (0.765), database (0.757)
  0.80: (none)
  0.85: (none)
  0.90: (none)
=== Frontend web design ===
Text length: 1948 chars
  0.30: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.35: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.40: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.45: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.50: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.55: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.60: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.65: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.70: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.75: web (0.905), tools (0.807), design (0.803), mobile (0.795), documentation (0.795)
  0.80: web (0.905), tools (0.807), design (0.803)
  0.85: web (0.905)
  0.90: web (0.905)
=== Linux devops ===
Text length: 2105 chars
  0.30: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.35: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.40: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.45: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.50: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.55: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.60: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.65: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.70: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.75: linux (0.850), devops (0.831), rust (0.806), data (0.800), python (0.798)
  0.80: linux (0.850), devops (0.831), rust (0.806), data (0.800)
  0.85: (none)
  0.90: (none)
=== AI machine learning ===
Text length: 2180 chars
  0.30: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.35: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.40: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.45: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.50: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.55: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.60: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.65: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.70: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.75: ai (0.839), design (0.801), video (0.796), tutorial (0.786), python (0.776)
  0.80: ai (0.839), design (0.801)
  0.85: (none)
  0.90: (none)
=== Mobile development ===
Text length: 2153 chars
  0.30: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.35: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.40: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.45: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.50: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.55: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.60: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.65: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.70: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.75: tools (0.806), tutorial (0.803), documentation (0.799), productivity (0.796), devops (0.792)
  0.80: tools (0.806), tutorial (0.803)
  0.85: (none)
  0.90: (none)
=== Gaming graphics ===
Text length: 2059 chars
  0.30: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.35: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.40: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.45: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.50: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.55: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.60: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.65: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.70: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.75: gaming (0.852), rust (0.804), devops (0.801), video (0.799), design (0.798)
  0.80: gaming (0.852), rust (0.804), devops (0.801)
  0.85: gaming (0.852)
  0.90: (none)
=== Audio production ===
Text length: 2082 chars
  0.30: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.35: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.40: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.45: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.50: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.55: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.60: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.65: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.70: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.75: audio (0.868), video (0.845), news (0.796), documentation (0.783), devops (0.779)
  0.80: audio (0.868), video (0.845)
  0.85: audio (0.868)
  0.90: (none)
=== Social community ===
Text length: 2078 chars
  0.30: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.35: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.40: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.45: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.50: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.55: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.60: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.65: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.70: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.75: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.80: social (0.911), documentation (0.840), productivity (0.839), news (0.839), video (0.835)
  0.85: social (0.911)
  0.90: social (0.911)
test explore_tagging_thresholds ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.47s
   Doc-tests search_hub
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 26 filtered out; finished in 0.00s
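Every top-five score printed in this run falls between roughly 0.757 and 0.911, which is why the thresholds from 0.30 through 0.75 all yield identical lists. The sketch below replays the filter-then-take step of `tags_for_with_threshold` on the scores printed for the "Frontend web design" sample. `select_tags` is a hypothetical standalone helper written for this illustration; it does not call the embedding model.

```rust
/// Keep tags scoring at or above `threshold`, capped at `max_tags`.
/// Assumes `scored` is already sorted by score descending.
fn select_tags(scored: &[(&str, f32)], max_tags: usize, threshold: f32) -> Vec<String> {
    scored
        .iter()
        .filter(|(_, score)| *score >= threshold)
        .take(max_tags)
        .map(|(tag, _)| tag.to_string())
        .collect()
}

fn main() {
    // Scores copied from the "Frontend web design" lines of the test output.
    let scored: [(&str, f32); 5] = [
        ("web", 0.905),
        ("tools", 0.807),
        ("design", 0.803),
        ("mobile", 0.795),
        ("documentation", 0.795),
    ];
    for threshold in [0.40, 0.80, 0.85] {
        println!("{:.2}: {:?}", threshold, select_tags(&scored, 5, threshold));
    }
    // 0.40 keeps all five tags, 0.80 keeps web/tools/design, 0.85 keeps only web.
}
```

Read against this output, a threshold only starts to discriminate somewhere above 0.75 for this model and tag set, and even there the gap between the correct tag and its neighbors is small for several samples.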