@quic-calvnguy
Contributor

Description
Enables file mapping of the weights as well as the overall context bin. This feature is currently enabled only on ARM64 Windows devices.

Motivation and Context
Currently, when reading the context bin, ORT allocates a large buffer on the heap. When several ORT sessions use the same model, each session allocates its own buffer for the context bin, which is very wasteful for large models. Instead, Windows file mapping can be used to map the context bin. Each time a context needs to be created from the context bin, a pointer into the mapped file is retrieved and used in place of a pre-allocated buffer, which makes QNN EP more memory-efficient. With multiple ORT sessions, the context bin is loaded only once for all sessions, which reduces memory use and improves overall initialization performance. This is particularly useful for LLMs going forward.

 - Create file mapping callback interface class
   - Android expected to have support in the future
 - Implement Windows callbacks in WindowsFileMapper
 - New option disable_file_mapped_weights
   - Feature is enabled by default with retry logic