What is the maximum possible max_new_tokens?
#15
opened by PF94
I tried the example inference code and it worked great. But when I increase max_new_tokens to 2048, most of the time I get either RuntimeError: CUDA error: device-side assert triggered or RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED. In config.json, max_position_embeddings and max_length are both 4096. Does that mean the model can output up to 4096 tokens?
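For reference, here is a minimal sketch of the kind of call I mean, assuming a standard transformers-style causal LM (the model name and prompt are placeholders, not the actual checkpoint). My understanding is that the context window covers the prompt plus the generated tokens, so the safe budget for new tokens would be max_position_embeddings minus the prompt length:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-model-here"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Hello, how are you?"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# If max_position_embeddings is 4096 (per config.json), the prompt
# tokens count against it, so the remaining generation budget is:
max_positions = model.config.max_position_embeddings
budget = max_positions - inputs["input_ids"].shape[1]

# Cap max_new_tokens at the remaining budget instead of a fixed 2048.
outputs = model.generate(**inputs, max_new_tokens=min(2048, budget))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Is this the right way to think about it, or should max_new_tokens be able to go up to 2048 regardless of prompt length?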