
Creating an Ultra-Light AI Coding Assistant with Mistral Devstral for Low Storage Environments

This tutorial walks through building a low-footprint AI coding assistant with Mistral Devstral, optimized for environments with limited disk space and memory. It covers installation, model loading with 4-bit quantization, a set of coding demos, and an interactive coding mode.

Efficient Installation and Cache Management

The setup targets environments with limited disk space and memory, such as Google Colab. It begins by installing only the essential packages (kagglehub, mistral-common, bitsandbytes, transformers, accelerate, and torch) with pip's download cache disabled to reduce disk usage. A cache cleanup function is then defined to remove unnecessary files from directories such as /root/.cache and /tmp/kagglehub, keeping free space at a maximum before and after running the model.
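
In a Colab cell, the install step can look like the following sketch (the package list mirrors the description above, and pip's --no-cache-dir flag keeps the download cache off disk; version pins are omitted):

!pip install -q --no-cache-dir kagglehub mistral-common bitsandbytes transformers accelerate torch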

import os
import shutil
import gc

def cleanup_cache():
    """Clean up unnecessary files to save disk space"""
    cache_dirs = ['/root/.cache', '/tmp/kagglehub']
    for cache_dir in cache_dirs:
        if os.path.exists(cache_dir):
            shutil.rmtree(cache_dir, ignore_errors=True)
    gc.collect()

cleanup_cache()
print("Disk space optimized!")

Loading the Ultra-Compressed Devstral Model

The core component is the LightweightDevstral class, which downloads the devstral-small-2505 weights via kagglehub and skips the download if they are already present. The model is loaded with aggressive 4-bit (NF4) quantization through BitsAndBytesConfig to minimize the memory and disk footprint while maintaining performance, and the Mistral tokenizer is loaded locally from the model's tekken.json file. After loading, the cache is cleaned again to keep disk usage at roughly 2 GB.

import torch
import kagglehub
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

class LightweightDevstral:
    def __init__(self):
        print("Downloading model (streaming mode)...")

        self.model_path = kagglehub.model_download(
            'mistral-ai/devstral-small-2505/Transformers/devstral-small-2505/1',
            force_download=False
        )

        quantization_config = BitsAndBytesConfig(
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_storage=torch.uint8,
            load_in_4bit=True
        )

        print("Loading ultra-compressed model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            torch_dtype=torch.float16,
            device_map="auto",
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        self.tokenizer = MistralTokenizer.from_file(f'{self.model_path}/tekken.json')

        cleanup_cache()
        print("Lightweight assistant ready! (~2GB disk usage)")
  
    def generate(self, prompt, max_tokens=400):
        """Memory-efficient generation"""
        tokenized = self.tokenizer.encode_chat_completion(
            ChatCompletionRequest(messages=[UserMessage(content=prompt)])
        )

        input_ids = torch.tensor([tokenized.tokens])
        if torch.cuda.is_available():
            input_ids = input_ids.to(self.model.device)

        with torch.inference_mode():
            output = self.model.generate(
                input_ids=input_ids,
                max_new_tokens=max_tokens,
                temperature=0.6,
                top_p=0.85,
                do_sample=True,
                # mistral-common exposes the EOS id on the inner tokenizer,
                # not as an eos_token_id attribute
                pad_token_id=self.tokenizer.instruct_tokenizer.tokenizer.eos_id,
                use_cache=True
            )[0]

        del input_ids
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        # Decode only the newly generated tokens, passed as a plain list of ids
        return self.tokenizer.decode(output[len(tokenized.tokens):].tolist())

print("Initializing lightweight AI assistant...")
assistant = LightweightDevstral()
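
As a quick sanity check before the demos, the generate method can be called directly; this is an optional smoke test, not part of the original walkthrough:

# Optional smoke test: one short generation to confirm the model responds
print(assistant.generate("Write a one-line Python function that reverses a string.", max_tokens=60))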

Demonstration of Coding Capabilities

A set of demos showcases the assistant’s coding abilities by sending prompts like writing a prime number checker, debugging a Python function, and creating a simple text analysis class. Each demo runs the prompt through the assistant, prints the result, and then clears memory to maintain responsiveness.

def run_demo(title, prompt, emoji=""):
    """Run a single demo with cleanup"""
    print(f"\n{emoji} {title}")
    print("-" * 50)

    result = assistant.generate(prompt, max_tokens=350)
    print(result)

    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


run_demo(
    "Quick Prime Finder",
    "Write a fast prime checker function `is_prime(n)` with explanation and test cases."
)


run_demo(
    "Debug This Code",
    """Fix this buggy function and explain the issues:
```python
def avg_positive(numbers):
    total = sum([n for n in numbers if n > 0])
    return total / len([n for n in numbers if n > 0])
```"""
)


run_demo(
    "Text Tool Creator",
    "Create a simple `TextAnalyzer` class with word count, char count, and palindrome check methods."
)

Interactive Quick Coding Mode

Users can engage with the assistant interactively through a Quick Coding Mode, which caps each session at five prompts to conserve memory. After each prompt, the assistant generates a concise solution (output is truncated to 500 characters), and memory is aggressively cleaned.

def quick_coding():
    """Lightweight interactive session"""
    print("\nQUICK CODING MODE")
    print("=" * 40)
    print("Enter short coding prompts (type 'exit' to quit)")

    session_count = 0
    max_sessions = 5

    while session_count < max_sessions:
        prompt = input(f"\n[{session_count+1}/{max_sessions}] Your prompt: ")

        if prompt.lower() in ['exit', 'quit', '']:
            break

        try:
            result = assistant.generate(prompt, max_tokens=300)
            print("Solution:")
            print(result[:500])  # Truncate long outputs to keep the console readable

            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

        except Exception as e:
            print(f"Error: {str(e)[:100]}...")

        session_count += 1

    print("\nSession complete! Memory cleaned.")

Monitoring Disk Usage and Space-Saving Tips

The tutorial concludes with a disk usage check that runs the `df -h` command through Python's subprocess module, followed by practical tips for maintaining a low footprint: limiting token generation, cleaning caches automatically, and deleting the assistant instance when finished.

def check_disk_usage():
    """Monitor disk usage"""
    import subprocess
    try:
        result = subprocess.run(['df', '-h', '/'], capture_output=True, text=True)
        lines = result.stdout.split('\n')
        if len(lines) > 1:
            usage_line = lines[1].split()
            used = usage_line[2]
            available = usage_line[3]
            print(f"Disk: {used} used, {available} available")
    except Exception:
        print("Disk usage check unavailable")


print("\nTutorial Complete!")
cleanup_cache()
check_disk_usage()

print("\nSpace-Saving Tips:")
print("• Model uses ~2GB vs original ~7GB+")
print("• Automatic cache cleanup after each use")
print("• Limited token generation to save memory")
print("• Use 'del assistant' when done to free ~2GB")
print("• Restart runtime if memory issues persist")

This approach makes Mistral Devstral's language model capabilities usable in storage- and memory-constrained environments without giving up speed or usability. The assistant supports real-time code generation, debugging, and prototyping with minimal resource consumption.
